Reverse Engineering

Reverse engineering is the process where an engineered artefact is deconstructed into its constituent parts, such that it reveals details about its design and architecture.

In software, this means taking a program without source code or documentation, establishing what it does, and recovering details about its design and implementation.

There are many different representations of code for programs. We have a hierarchy:

  • Zero Level -- Machine Code (the Binary Representation)
  • Low Level -- Assembly Instructions (can be Reconstructed)
  • Intermediate Level -- 'Low Level' Compiled Programming Languages (e.g., C++)
  • High Level -- Interpreted Languages (e.g., Python)

Low Level Software Taxonomy

Development tools now isolate the software developers from the processor architectures and the assembly languages that are used. In most instances, working at the assembly level is too cumbersome for the programmer, so we allow automated tools to compile a more understandable representation into assembly, then ultimately the machine code.

Low level software (e.g., system software) is the layer that isolates software developers and application programs from the physical hardware. For example, a C program calls printf(), which in turn invokes the write() system call.
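
The same layering can be observed from Python, used here only as a rough analogue of the C case. In this sketch, print() goes through a buffered file object (comparable to C's stdio), while os.write() is a thin wrapper over the write() system call. The helper names and the use of a pipe to read the bytes back are assumptions made for illustration.

```python
import os


def write_via_layers(message: str) -> bytes:
    """Send a message through a buffered file object, then read the raw bytes."""
    read_fd, write_fd = os.pipe()
    # High-level layer: a buffered text stream wrapped around the descriptor.
    with os.fdopen(write_fd, "w", encoding="utf-8", newline="\n") as stream:
        print(message, file=stream)  # buffered; flushed when the stream closes
    data = os.read(read_fd, 1024)
    os.close(read_fd)
    return data


def write_direct(message: str) -> bytes:
    """Skip the buffered layer and hand the bytes straight to write()."""
    read_fd, write_fd = os.pipe()
    os.write(write_fd, message.encode("utf-8") + b"\n")
    os.close(write_fd)
    data = os.read(read_fd, 1024)
    os.close(read_fd)
    return data
```

Both helpers deliver the same bytes to the kernel; the difference is only in how many layers sit between the caller and the system call.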

In order to reverse engineer a program, a solid understanding of the low-level software and programming is required.

Machine Code

Machine code is the binary code: a sequence of bits that encodes the instructions the CPU runs. The CPU fetches the program's instructions from memory, usually as a separate stream from the program's data, although which data is accessed depends on the instructions.

Each CPU has a set of operation codes (opcodes). The CPU decodes each opcode to determine which operation to perform on data held in RAM or in the registers.
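
To make the idea of opcodes concrete, here is a minimal sketch of a fetch-decode-execute loop. The three-instruction machine, its opcode values, and its two registers are entirely hypothetical and do not correspond to any real CPU.

```python
# Hypothetical opcodes, invented for this illustration only.
LOAD, ADD, HALT = 0x01, 0x02, 0xFF


def run(program: bytes) -> list[int]:
    """Execute a program for the toy machine and return its two registers."""
    registers = [0, 0]
    pc = 0  # program counter: index of the next instruction byte
    while True:
        opcode = program[pc]  # fetch
        if opcode == LOAD:  # decode: LOAD reg, imm8
            reg, value = program[pc + 1], program[pc + 2]
            registers[reg] = value  # execute
            pc += 3
        elif opcode == ADD:  # decode: ADD dst, src
            dst, src = program[pc + 1], program[pc + 2]
            registers[dst] = (registers[dst] + registers[src]) & 0xFF
            pc += 3
        elif opcode == HALT:
            return registers
        else:
            raise ValueError(f"unknown opcode 0x{opcode:02x} at offset {pc}")
```

For example, the byte sequence `01 00 02 01 01 03 02 00 01 FF` loads 2 and 3 into the two registers, adds them, and halts.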

Assembly

This is the next step up from machine code and is a textual representation of the bits in machine code. We give the opcodes special names to make them easier to understand and so that the human writing the code doesn't have to use a cumbersome look-up table for extracting the operation. We use, e.g., MOV for moving, and XCHG for exchanging.

These are called mnemonics. Machine code and assembly are often assumed to be different things, but they are just two representations of exactly the same instructions. They will both run at the same speed.

Object Code

This is a sequence of opcodes and other numbers used in connection with opcodes to perform operations. CPUs read object code from memory, decode it, then act based on the instructions in the memory. Assemblers translate the textual assembly code into binary. A disassembler reverses this, turning the binary code back into textual assembly.

Compilers

Compilers take source files and generate corresponding machine code files. The source describes the program in a high-level language, which is then compiled typically to machine code for a specific architecture.

We can also tell the compiler to output assembly, which is a more human-readable version of the machine code. The compiler can also cross-compile for a different architecture, which has a different set of opcodes and registers from standard x86.

If we want to target many architectures at once, we can compile to an intermediate form, e.g., LLVM intermediate representation or JVM bytecode, which can then be executed by a virtual machine or compiled further for each target.
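
CPython follows the same pattern: it compiles source code to bytecode for its own virtual machine. The sketch below uses the standard `dis` module to list the bytecode instruction names for a small function. This is Python bytecode, not LLVM IR or JVM bytecode, and the exact instruction names vary between Python versions.

```python
import dis


def add(a, b):
    return a + b


def opnames(func) -> list[str]:
    """Return the names of the bytecode instructions compiled for func."""
    return [instruction.opname for instruction in dis.get_instructions(func)]


if __name__ == "__main__":
    # The raw bytecode is just bytes; dis gives it a readable form.
    print(add.__code__.co_code)
    print(opnames(add))
```

The raw `co_code` bytes and the readable listing stand in the same relationship as machine code and assembly.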

We can also "lift" machine code into an intermediate representation, then tell a compiler to compile it for a different architecture. Remill is one such project.

JVM

Java's early success owed much to it being a write once, deploy anywhere system. Any architecture that has the Java Runtime Environment installed can execute a compiled Java program, because Java compiles into bytecode, which the JVM then translates for the specific architecture.

Provided there exists a JVM for the specified architecture, a Java JAR can be run on any platform, regardless of when it was compiled.

Operating System

This is the program that manages the computer, both the hardware and the software applications that run on it. It translates requests from users and programs into operations on the underlying hardware, and is also responsible for disk access and for allocating and sharing resources.

The operating system also provides lots of security features to ensure that programs do not hog system resources or cause damage where they shouldn't.

Overall Picture

graph LR
subgraph "Program Analysis Tools"
  A(Debugger);
  B(Decompiler);
  C(Disassembler);
end
D[Executable File] --> C;
D --> B;
E(Loader) --> F[System Process] --> A;
B --> G["`Source Code (Interpreted Execution)`"] --> H(Interpreter) --> F;
B --> I["`Source Code (Compiled Execution)`"] --> J(Compiler) --> K[Assembly Code] --> L(Assembler) --> M["`Object (Machine Code)`"] --> N(Linker) --> D;
O[Static Library] --> N;
P[Object Code] --> N;
B --> K;
D --> E;
Q[Dynamic Library] --> E;

Reverse Engineering Tooling

System Monitoring Tools

There are a variety of tools that allow us to observe the program that is being reverse engineered. Almost all parts of the program go through the operating system in one way or another, so we can instrument the runtime to tell us more about what it is doing.

System monitoring tools can check all sorts of things, such as the network activity, file accesses, registry access, etc. We can also track OS objects such as mutexes, events, etc.
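
As a small in-process analogue of this kind of monitoring, Python 3.8+ provides audit hooks through `sys.addaudithook`, which are notified when the runtime performs certain actions such as opening a file. The sketch below records every "open" event. The helper names are hypothetical, and a real system monitor observes the process from outside at the OS level rather than from inside the interpreter.

```python
import sys

observed = []  # (event name, path) pairs seen by the hook


def monitor(event, args):
    """Record file-open events; all other audit events are ignored."""
    if event == "open":
        observed.append((event, args[0]))


# Note: audit hooks cannot be removed once installed.
sys.addaudithook(monitor)


def run_monitored(path: str) -> list:
    """Write to a file and return the open events observed while doing so."""
    before = len(observed)
    with open(path, "w") as handle:
        handle.write("x")
    return observed[before:]
```

The program under observation does not need to cooperate: the hook sees the file access because the access goes through the runtime.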

Disassemblers

These are programs that take the executable binary as input and output textual, human-readable code from the machine code. Recovering the assembly is relatively easy, but recovering higher-level language constructs is considerably more tedious.

Example disassemblers include IDA Pro, Hopper Disassembler, and the NSA's very own Ghidra. Disassemblers can also be configured to output a control flow graph between different subroutines, which can be helpful for analysis after the fact.
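
Before any instruction can be decoded, a disassembler has to work out which architecture the binary targets. As a minimal sketch, assuming the input is an ELF file (Windows PE files need different parsing), the header fields below identify the word size, byte order, and machine type. The function name and the short machine table are illustrative, not taken from any particular tool.

```python
import struct

# A few e_machine values from the ELF specification.
MACHINES = {0x03: "x86", 0x28: "ARM", 0x3E: "x86-64", 0xB7: "AArch64"}


def identify_elf(header: bytes) -> dict:
    """Read the architecture details from the first 20 bytes of an ELF file."""
    if len(header) < 20 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    bits = {1: 32, 2: 64}.get(header[4])       # EI_CLASS
    order = {1: "<", 2: ">"}.get(header[5])    # EI_DATA
    if bits is None or order is None:
        raise ValueError("malformed ELF identification bytes")
    # e_type and e_machine are two 16-bit fields starting at offset 16.
    _e_type, e_machine = struct.unpack_from(order + "HH", header, 16)
    return {
        "bits": bits,
        "endianness": "little" if order == "<" else "big",
        "machine": MACHINES.get(e_machine, hex(e_machine)),
    }
```

In practice this would be called with the first bytes of a file, e.g. `identify_elf(open(path, "rb").read(20))`, where `path` is whatever binary is under analysis.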

Debuggers

These are programs that allow us to see what the program is doing whilst it runs. The most basic features that we include in a debugger are the breakpoints, and the ability to trace through code executions. The debugger is often much more powerful than adding a print statement to the code.

Tracing is the process of stepping through the program one line at a time. Debuggers typically allow us to set the breakpoints at specific lines in the source code, and this is done by instrumenting the binary that we test with some special instructions.
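
On x86, the special instruction used for a software breakpoint is INT3, encoded as the single byte 0xCC. The sketch below shows the bookkeeping involved: save the original byte, overwrite it with 0xCC, and restore it when the breakpoint is cleared. A `bytearray` stands in for the process's code memory, and the class name is hypothetical; a real debugger writes into the target process through the operating system's debugging interface.

```python
INT3 = 0xCC  # x86 software breakpoint instruction


class Breakpoints:
    """Track software breakpoints patched into a buffer of machine code."""

    def __init__(self, code: bytearray):
        self.code = code
        self.saved = {}  # offset -> original byte

    def set(self, offset: int) -> None:
        """Replace the byte at offset with INT3, remembering the original."""
        if offset not in self.saved:
            self.saved[offset] = self.code[offset]
            self.code[offset] = INT3

    def clear(self, offset: int) -> None:
        """Restore the original byte so the instruction can run normally."""
        self.code[offset] = self.saved.pop(offset)
```

When the CPU reaches the patched byte it traps to the debugger, which restores the original byte before letting execution continue.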

Reversers use debuggers in a disassembly mode, where the code is disassembled on-the-fly. We can then step through the "virtual" disassembled code in a similar way to the normal debugger flow, except that we are stepping through the disassembler's reconstruction of the assembly rather than the original source.

OllyDbg

OllyDbg shows the disassembled code as it is executed and stores it for later analysis. The registers and their current values are displayed, and the program allows us to modify the values stored there.

The debuggers also often give the current memory being used by the program (including the heap and the stack).
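
The idea of stepping through a program while inspecting its state can be sketched with Python's `sys.settrace`, which calls a hook before each line executes. Here the function's local variables play the role that registers and memory play in a native debugger. The helper and the sample function are hypothetical, and this is only an analogue of what a tool such as OllyDbg does at the machine level.

```python
import sys


def trace_locals(func, *args):
    """Run func, recording its local variables just before each line executes."""
    steps = []
    target = func.__code__

    def tracer(frame, event, arg):
        if frame.f_code is not target:
            return None  # do not trace other functions
        if event == "line":
            line = frame.f_lineno - target.co_firstlineno
            steps.append((line, dict(frame.f_locals)))
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)  # always restore the earlier trace hook
    return result, steps


def sample(x):
    y = x * 2
    z = y + 1
    return z
```

Each recorded step is a snapshot taken before the line runs, so a variable first appears in the snapshot for the line after its assignment.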

Decompilers

These are essentially disassemblers that attempt to reproduce the original high-level language code. We try to reverse the compilation process to obtain the original code or something similar. We normally can't fully recover the original source, as the compiler performs many optimizations, and details such as function names are usually missing unless the binary has been compiled with symbols enabled.

9Rays Spices .NET Decompiler

This is a decompiler that was mentioned in the lectures. It allows us to get .NET code back from the compiled binaries. Most decompilers also allow us to rename methods and variables in the decompiled code, so that we can assign semantic meaning to the code we get back.

Reversal Process

There are two main phases to a reversing process. The first is system-level reversing, where we observe the program at a large scale. The second is code-level reversing, which is usually more in-depth than the system level.

System-Level Reversing

This helps us to determine the general structure of the program and the areas of interest within it. We run tools against the program and use the instrumentation provided by the system to gather information, inspecting and tracking the program's I/O.

This is doable, because the program is controlled by the operating system and thus uses OS-level APIs when it makes calls to the outside world through the operating system.

Code-Level Reversing

This is where we have a more detailed look into a selected chunk of code, extracting the design concepts and algorithms that are used by that part of the code. To do this, we need to know reversal techniques available to us to generate pseudocode (assembly or pseudo-C), then understand how the CPU and the operating system actually work to produce the final result.

We observe the program at a very low level to see all of its details. Many of these details are generated automatically by the compiler, and the programmer's actual code is usually much simpler.

Applications of Reverse Engineering

The applications of reverse engineering fall into two main categories. The first is security-related reversing, e.g., finding vulnerabilities in software that malicious software could exploit, reversing cryptographic algorithms (e.g., the proprietary TETRA algorithms), bypassing DRM, and auditing proprietary binaries that we did not compile ourselves from the original source code.

The second is software development-related: increasing interoperability with other programs, developing competing software, and evaluating the quality and robustness of software (commonly known as black-box testing).

Malicious Software

As the prevalence of internet-connected devices increases, bad actors want to exploit vulnerabilities in programs so that they can spread their malware and attack their target systems. Malware analysts reverse engineer the malware files that they receive, tracing every step the malware takes and assessing the damage it might cause, the steps needed to remove it, and the patches needed to prevent it from exploiting further systems.

Cryptographic Algorithms

For restricted cryptographic algorithms, where there is no key negotiation between devices, the algorithm itself is often the secret. This runs against Kerckhoffs's principle, under which every part of the algorithm, along with the ciphertext, should be able to be exposed to an attacker without affecting the security of the message itself, provided that the key hasn't been leaked.

Once the reverser sees the algorithm, it is only a matter of time until any messages using that encryption scheme are no longer secure. This, again, was an issue with TETRA: although different users made use of different keys, the algorithms that were used were not truly secure.

For key-based algorithms, the only ways to decrypt are to obtain the key, brute-force it, or find a flaw that can be used to extract the key or the original message.
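
The brute-force option can be sketched with a deliberately weak toy cipher: XOR with a single-byte key, which leaves only 256 keys to try. The attacker is assumed to know the algorithm (as Kerckhoffs's principle allows) and to have a "crib", a fragment of text expected to appear in the plaintext. The function names and the sample message are hypothetical.

```python
def xor_bytes(data: bytes, key: int) -> bytes:
    """Encrypt or decrypt by XORing every byte with a one-byte key."""
    return bytes(byte ^ key for byte in data)


def brute_force(ciphertext: bytes, crib: bytes) -> list:
    """Try every possible key and keep those whose output contains the crib."""
    hits = []
    for key in range(256):
        candidate = xor_bytes(ciphertext, key)
        if crib in candidate:
            hits.append((key, candidate))
    return hits
```

Real ciphers resist this attack only because their key space is far too large to enumerate, not because the algorithm is hidden.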

Digital Rights Management

DRM is the process of securing digital content on a computer that the content owner does not control. The data for a film or other media is very easy to move around and to duplicate. Piracy is therefore easy from a consumer standpoint: they simply download the file, then copy and redistribute it.

Historically, copy-protection software was employed to stop end users from copying the software or other media. Crackers employ reverse engineering techniques to learn how the technology works and allow modification to the program to disable the copy protection.

Auditing Program Binaries

Open source software is often regarded as more dependable and secure, because engineers are free to inspect and audit it. They can file bug reports or contact maintainers to point out issues that could cause vulnerabilities in the software.

Proprietary software doesn't have source code available, but engineers still want to see what the program is doing. Although reversal doesn't yield accessible and readable software for any decently sized project, some analysis can still allow us to see where security risks in the software may occur.

Interoperability

Software engineers may wish to make software that interacts with proprietary or partially documented software. When working with a proprietary software library or API, there is quite often not enough documentation.

Where the source code is open, we can look at the I/O that the program expects in order to work properly. With a binary that doesn't have available source code, this process becomes much more difficult, unless we are able to recover a version of the code that is running within the third-party binary.

Competing Software

It wouldn't make sense to reverse engineer a competitor's entire software stack. It is a lot easier to design and develop our own competing system. We can then look at the complicated aspects of the competitor's system, reverse them, and reimplement them in our own product, but we have to bear in mind that there are legal implications, as the code would technically be the competitor's proprietary intellectual property.

Evaluating Quality and Robustness

In addition to security audits on a binary, we can also sample the binary to see roughly the quality of the underlying codebase that was used to compile the code. Trusting a vendor might be an issue with something that is safety critical and we'd want to audit the software before we start using it.

There is a lot of legal debate about reverse engineering, much of it concerned with the impact that reverse engineering has on society as a whole. It is almost always worth getting legal counsel before starting a particular project, to establish whether it is legal.

DMCA

The Digital Millennium Copyright Act (DMCA) is a US legal framework that prohibits circumventing copyright protection systems. If we reverse a copyright protection system, then we are removing the copyright restriction. We are not allowed to circumvent DRM, even for personal use. We also cannot make available any product or technology that circumvents a DRM technology.

Exemptions

We are allowed to reverse engineer or circumvent protection systems to allow interoperability between computer programs, to enable or simplify development of new and improved technologies.

We can also circumvent for encryption research, security testing, educational institutions and libraries, government investigations, and protection of privacy.

Precautions to Take

When reverse engineering an unknown binary, we should take several steps to ensure that the binary doesn't cause damage to our host machine. The main points are:

  • Use a virtualized environment
  • Isolate the running environment
  • Capture traffic through APIs
  • Use dedicated hardware if needed
  • Make copies of the initial system state
  • Don't connect to the internet
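
The advice to make copies of the initial system state can be sketched as a before-and-after comparison of file hashes. This minimal example, with hypothetical function names, only covers the files under one directory; a virtual machine snapshot is the more complete way to capture and restore state.

```python
import hashlib
import os


def snapshot(root: str) -> dict:
    """Map each file under root (by relative path) to its SHA-256 digest."""
    state = {}
    for folder, _subfolders, files in os.walk(root):
        for name in files:
            path = os.path.join(folder, name)
            with open(path, "rb") as handle:
                digest = hashlib.sha256(handle.read()).hexdigest()
            state[os.path.relpath(path, root)] = digest
    return state


def diff(before: dict, after: dict) -> dict:
    """Report which files were added, removed, or modified between snapshots."""
    shared = before.keys() & after.keys()
    return {
        "added": sorted(after.keys() - before.keys()),
        "removed": sorted(before.keys() - after.keys()),
        "modified": sorted(path for path in shared if before[path] != after[path]),
    }
```

Taking one snapshot before running the binary and one afterwards shows which files it touched inside the monitored directory.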