What is software reverse engineering?

Introduction and a few observations

Apr 23, 2023

Let’s take a look at a few descriptions of a function, each written in a different language. Don’t worry if you don’t understand some of them; they are only illustrative examples.

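int add (int a, int b) {
    int result = 0;
    /* Make b non-negative so the loop can count down; the sign moves to a. */
    if (b < 0) { a = -a; b = -b; }
    /* Add a to the result b times. */
    while (b > 0) {
        result += a;
        b--;
    }
    return result;
}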

int add (int a, int b) {
    return a*b;
}

add(int, int): # @add(int, int)
        push rbp
        mov rbp, rsp
        mov dword ptr [rbp - 4], edi
        mov dword ptr [rbp - 8], esi
        mov esi, dword ptr [rbp - 4]
        imul esi, dword ptr [rbp - 8]
        mov eax, esi
        pop rbp
        ret

.method public hidebysig static int32 add(int32 a,
                                        int32 b) cil managed
  {
    //
    .maxstack 2
    .locals init (int32 V_0)
    IL_0000: nop
    IL_0001: ldarg.0
    IL_0002: ldarg.1
    IL_0003: mul
    IL_0004: stloc.0
    IL_0005: br.s IL_0007
    IL_0007: ldloc.0
    IL_0008: ret
  } // end of method Program::add

The function “add” takes two integers as arguments and returns their product.

add(int, int):
        sub sp, sp, #16
        str w0, [sp, 12]
        str w1, [sp, 8]
        ldr w1, [sp, 12]
        ldr w0, [sp, 8]
        mul w0, w1, w0
        add sp, sp, 16
        ret

All these snippets describe a function that accepts two arguments and returns a number. Even though they are implemented differently, and one of them (the plain-English description) has no implementation at all, they are all equivalent: given the same two arguments, they always produce the same result. Some of the solutions may be more efficient than others, but the end result is the same.
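One way to gain confidence in such an equivalence claim is to run two variants side by side on the same inputs. The sketch below is only an illustration: the names mul_loop, mul_direct and count_mismatches are made up for this example, and the input range is kept small so that signed overflow never comes into play. Agreement on a test range is evidence, not proof.

```c
#include <stdio.h>

/* Repeated-addition variant (hypothetical name, written for this sketch). */
int mul_loop(int a, int b) {
    int result = 0;
    if (b < 0) { a = -a; b = -b; }   /* move the sign from b to a */
    while (b > 0) {
        result += a;
        b--;
    }
    return result;
}

/* Single-instruction variant (hypothetical name). */
int mul_direct(int a, int b) {
    return a * b;
}

/* Counts the pairs (a, b) with lo <= a, b <= hi on which the variants disagree. */
int count_mismatches(int lo, int hi) {
    int mismatches = 0;
    for (int a = lo; a <= hi; a++) {
        for (int b = lo; b <= hi; b++) {
            if (mul_loop(a, b) != mul_direct(a, b)) {
                printf("mismatch at a=%d b=%d\n", a, b);
                mismatches++;
            }
        }
    }
    return mismatches;
}
```

Calling count_mismatches(-50, 50) should report no disagreements. A check like this would also expose a variant that never terminates or is off by one, which is easy to miss when reading the code alone.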

The main purpose of reverse engineering is to document that functional equivalence and describe the code in the most concise and appropriate manner. If this multiplication function is part of an encryption algorithm such as RSA, it doesn’t make much sense to describe the function itself. Instead, you would simply say that the code performs RSA encryption.

This is, of course, not a precise definition of reverse engineering, but it captures the main components of software reverse engineering:

Equivalence - what you describe has to be the actual code that is being executed. While this sounds obvious, it highlights the need to confirm any assumptions we may hold before writing a reverse engineering report. For example, the name of the function above - add - suggests something completely different from the code that is actually executed.
Brevity - editing and figuring out which parts of the code are important and which aren’t. Do you really need to understand the encryption routine or do you just need to decrypt a string? Maybe there’s a way to do that without understanding how decryption works. Do you need to describe the obfuscation, or is it the code behind the obfuscation that really matters? If you only need to decide whether a piece of software steals data from a computer, you may not need to understand how it creates a network connection.
Usefulness - having a goal. Are you trying to understand a specific part of the code? Are you trying to create a way to detect malware? Are you trying to find a vulnerability? Whatever the purpose is, keep it in mind. Otherwise you are in danger of wandering around the code and getting lost. It’s very rare that your goal is to understand every aspect of the analysed software.
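The brevity point about decrypting a string without understanding the decryption can be sketched as follows. Everything here is hypothetical: opaque_decode stands in for a routine lifted from some binary, and SAMPLE_BLOB is an invented encoded string. The idea is simply to run the routine as a black box on the data of interest and read off the result.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a routine recovered from a binary.
   For the purpose of this sketch we do not analyse how it works. */
void opaque_decode(unsigned char *buf, size_t len) {
    unsigned char k = 0x5A;
    for (size_t i = 0; i < len; i++) {
        buf[i] ^= k;
        k = (unsigned char)(k * 33 + 7);
    }
}

/* Invented encoded string, as it might appear in a data section. */
const unsigned char SAMPLE_BLOB[] = {0x32, 0xC4, 0xA4, 0xA3, 0xD9};

/* Recovers the plaintext by running the routine on a copy of the blob.
   out must have room for len + 1 bytes. */
void recover_string(const unsigned char *blob, size_t len, char *out) {
    memcpy(out, blob, len);
    opaque_decode((unsigned char *)out, len);
    out[len] = '\0';
}
```

Passing SAMPLE_BLOB to recover_string yields the string "hello". If the decoded string is all the report needs, the internals of opaque_decode can stay undocumented.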

Without equivalence you could write anything you want in the report. Without brevity you could simply print out the assembly code and call it a day. Without usefulness you could describe every line of the RSA encryption algorithm in detail, describe the state of the computer after every instruction and document the memory layout changes, while still not mentioning what data is being encrypted. In most cases the data passed to the encryption function is more important than the details of the encryption implementation.

Conversely, if you take any of the components to the extreme and forget about the other two, you will also end up with a report that is simply wrong. If your report only says “this is software”, you might have taken brevity a bit too far. If you want to extract a URL and use the Linux command strings to do it, you may not have equivalence, because you don’t even know whether that URL is ever used. Finally, taking equivalence to the extreme can mean describing every single instruction and CPU state.
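To see why a strings-style search cannot establish equivalence, it helps to look at roughly what such a search does. The sketch below is a simplified imitation, not the actual implementation of the strings tool: find_run_with_prefix is a made-up helper that scans a byte buffer for runs of printable ASCII and returns the offset of the first run that starts with a given prefix.

```c
#include <stddef.h>
#include <string.h>

/* Printable ASCII, space through '~' (a simplification for this sketch). */
static int is_printable(unsigned char c) {
    return c >= 0x20 && c <= 0x7E;
}

/* Returns the offset of the first printable run of at least min_len bytes
   that begins with prefix, or -1 if there is none. */
long find_run_with_prefix(const unsigned char *buf, size_t len,
                          size_t min_len, const char *prefix) {
    size_t prefix_len = strlen(prefix);
    size_t start = 0;
    while (start < len) {
        if (!is_printable(buf[start])) {
            start++;
            continue;
        }
        /* Extend the run until the first non-printable byte. */
        size_t end = start;
        while (end < len && is_printable(buf[end])) {
            end++;
        }
        size_t run_len = end - start;
        if (run_len >= min_len && run_len >= prefix_len &&
            memcmp(buf + start, prefix, prefix_len) == 0) {
            return (long)start;
        }
        start = end;
    }
    return -1;
}
```

The function only ever answers the question "are these bytes present, and where?". Nothing in it looks at the code that would read those bytes, so a hit on a URL says the URL is stored in the file, not that the program ever connects to it.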

When you are reverse engineering a piece of software for hours or days, it’s very easy to get lost. I got lost many, many times and had to throw out parts of my hard work because they weren’t useful to anyone. That’s why it’s important to remember the principles above when you are analysing the code.

P.S. If you’re wondering why the function is named add even though it multiplies the numbers: it’s a lesson that you should never trust function names! :)

Notes on reverse engineering
