Overview of data for automated reverse engineering

Intro

In this post I will go over the different kinds of data I found while doing my state of the art. I will only talk about the most relevant ones, because I forgot my notes at a friend's house (yes, our parties are fun, but nerds only).

Because I was a total noob when I started this project, I decided to study only ELF files, so I won't talk about things I've only heard about, like packers, but never really studied.

Also, because I wanted at least a bit of a challenge, I assumed that the program is stripped and ships without any debug symbols.

Assembly

Let’s start with the raw file. It is structured as an ELF, but I won’t detail its structure; we will only talk about the code.

At first, the file is just a bunch of raw bytes like this (this dump is the very beginning of the file, so what you see here is actually the ELF header):

0000000 457f 464c 0102 0001 0000 0000 0000 0000
0000010 0002 003e 0001 0000 1050 0040 0000 0000
0000020 0040 0000 0000 0000 3918 0000 0000 0000
0000030 0000 0000 0040 0038 000c 0040 0020 001f
0000040 0006 0000 0004 0000 0040 0000 0000 0000
0000050 0040 0040 0000 0000 0040 0040 0000 0000
0000060 02a0 0000 0000 0000 02a0 0000 0000 0000
0000070 0008 0000 0000 0000 0003 0000 0004 0000
0000080 2000 0000 0000 0000 2000 0040 0000 0000
0000090 2000 0040 0000 0000 001c 0000 0000 0000
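If you want to check what those bytes are, here is a minimal Python sketch. My assumption is that the dump comes from the default `hexdump` output, which prints 16-bit little-endian words, so each word has to be byte-swapped first. It decodes the first four lines as an ELF64 header; the field names follow the ELF specification, while `DUMP` and `words_to_bytes` are names made up for this example.

```python
import struct

# First four lines of the dump above (64 bytes = size of an ELF64 header).
DUMP = """\
0000000 457f 464c 0102 0001 0000 0000 0000 0000
0000010 0002 003e 0001 0000 1050 0040 0000 0000
0000020 0040 0000 0000 0000 3918 0000 0000 0000
0000030 0000 0000 0040 0038 000c 0040 0020 001f
"""


def words_to_bytes(dump):
    """Undo the 16-bit word grouping: '457f' is really the bytes 7f 45."""
    out = bytearray()
    for line in dump.splitlines():
        _offset, *words = line.split()
        for word in words:
            out += int(word, 16).to_bytes(2, "little")
    return bytes(out)


header = words_to_bytes(DUMP)

# ELF64 header layout, little-endian, no padding between fields.
(ident, e_type, e_machine, e_version, e_entry, e_phoff, e_shoff,
 e_flags, e_ehsize, e_phentsize, e_phnum, e_shentsize, e_shnum,
 e_shstrndx) = struct.unpack("<16sHHIQQQIHHHHHH", header)

print(ident[:4])       # magic bytes: 0x7f followed by "ELF"
print(hex(e_machine))  # 0x3e means x86-64
print(hex(e_entry))    # address of the first executed instruction
print(e_shnum)         # number of section headers
```

So this particular dump is metadata (magic number, architecture, entry point at 0x401050), and the actual instructions only start further down in the file.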

The code is stored the same way: it’s more or less just “ints” which encode operators (opcodes), operands and constants. Some models work directly on that to perform disassembly, and they perform kinda well, but it’s hard to trust a partly random process, so I prefer to use IDA to get the ASM:

; Segment type: Pure code
; Segment permissions: Read/Execute
_fini segment dword public 'CODE' use64
assume cs:_fini
;org 4011ACh
assume es:nothing, ss:nothing, ds:_data, fs:nothing, gs:nothing

public _term_proc
_term_proc proc near
endbr64                 ; _fini
sub     rsp, 8
add     rsp, 8
retn
_term_proc endp

_fini ends

With this a human can start working, and so can a machine: you can try putting it into ChatGPT and it will describe what the code does, and given enough context it may be very precise about its purpose.

Now that we have assembly code, we will try to characterise how it is used.

Graphs

I won’t go into detail about what they are; you can check my notes if needed.

Graph	Type of values	Usage	Do I like it?
CG (call graph)	Links between subroutines	Structure of the code	Not very useful
DFG (data flow graph)	Data movement in a subroutine	Subroutine role ID	Underestimated
CFG (control flow graph)	Subroutine logic	Comp param ID + role ID	Intuitive, easy to use
ACFG (attributed CFG)	More detailed CFG	Auto RE focus	Promising

I think that an ACFG with annotations generated from the CFG may be very useful for characterising obfuscation.
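To give an idea of what a CFG and an ACFG look like as data, here is a small Python sketch. It is only a toy under my own assumptions: instructions are given as hypothetical `(address, mnemonic, jump_target)` tuples, every mnemonic starting with `j` is treated as a jump, and the per-block attributes (`n_instr`, `n_out`) are just examples of what an ACFG could carry.

```python
def build_cfg(instrs):
    """instrs: list of (address, mnemonic, jump_target_or_None), in address order."""
    addrs = [addr for addr, _, _ in instrs]

    # A "leader" starts a basic block: the first instruction, every jump
    # target, and every instruction that follows a jump or a ret.
    leaders = {addrs[0]}
    for i, (_, mnem, target) in enumerate(instrs):
        if mnem.startswith("j") or mnem == "ret":
            if target is not None:
                leaders.add(target)
            if i + 1 < len(instrs):
                leaders.add(addrs[i + 1])

    # Group the instructions into basic blocks, keyed by their start address.
    blocks, current = {}, None
    for ins in instrs:
        if ins[0] in leaders:
            current = ins[0]
            blocks[current] = []
        blocks[current].append(ins)

    # Edges depend on the last instruction of each block.
    edges = set()
    starts = sorted(blocks)
    for idx, start in enumerate(starts):
        _, mnem, target = blocks[start][-1]
        fallthrough = starts[idx + 1] if idx + 1 < len(starts) else None
        if mnem == "ret":
            continue                      # no successor
        if mnem.startswith("j"):
            edges.add((start, target))    # taken branch
        if mnem != "jmp" and fallthrough is not None:
            edges.add((start, fallthrough))
    return blocks, edges


# Toy if/else: addresses are just 0..5 here.
toy = [
    (0, "cmp", None), (1, "je", 4),
    (2, "add", None), (3, "jmp", 5),
    (4, "sub", None),
    (5, "ret", None),
]
blocks, edges = build_cfg(toy)

# ACFG: the same graph, with a few attributes attached to every block.
acfg = {
    start: {"n_instr": len(body),
            "n_out": sum(1 for src, _ in edges if src == start)}
    for start, body in blocks.items()
}
print(sorted(edges))
print(acfg)
```

The blocks and edges are the CFG; the `acfg` dictionary is the part a model would actually consume as node features.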

Normalisation

To minimise the impact of the ISA on your model, it’s very useful to simplify the instructions so that they only describe the logic of the code.

As an example, this ASM:

section .data
    value1 dd 10
    value2 dd 20

section .text
    global _start

_start:
    mov eax, [value1]   
    add eax, [value2]   
    shl eax, 2          
    push eax            
    pop ebx             
    xor ecx, ecx

will be normalised to this:

section .data
    MEM1 dd 10
    MEM2 dd 20

section .text
    global START

START:
    MOV REG, [MEM1]     
    ARI REG, [MEM2]     
    BIT REG, 2          
    PUSH REG            
    POP REG             
    BIT REG, REG

This is because, for our study, we don’t really care which register or address a value goes to; it’s the kind of memory movement that is important.
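The normalisation above can be sketched in a few lines of Python. This is a minimal version under my own assumptions: the register list and the instruction classes (`REGISTERS`, `CLASSES`) are hypothetical and only cover the example, and only simple `[name]` memory operands are renamed.

```python
import re

# Hypothetical, incomplete tables: just enough for the example above.
REGISTERS = {"eax", "ebx", "ecx", "edx", "esi", "edi", "esp", "ebp"}
CLASSES = {
    "mov": "MOV", "push": "PUSH", "pop": "POP",
    "add": "ARI", "sub": "ARI", "imul": "ARI",
    "shl": "BIT", "shr": "BIT", "xor": "BIT", "and": "BIT", "or": "BIT",
}


def normalise(lines):
    """Replace registers by REG, named memory by MEMn and mnemonics by a class."""
    mem_names = {}  # original symbol -> MEM1, MEM2, ... in order of appearance

    def norm_operand(operand):
        operand = operand.strip()
        if operand.lower() in REGISTERS:
            return "REG"
        match = re.fullmatch(r"\[(\w+)\]", operand)
        if match:
            name = mem_names.setdefault(match.group(1), f"MEM{len(mem_names) + 1}")
            return f"[{name}]"
        return operand  # constants (and anything unknown) are kept as they are

    out = []
    for line in lines:
        mnemonic, _, rest = line.strip().partition(" ")
        operands = [norm_operand(o) for o in rest.split(",")] if rest.strip() else []
        op_class = CLASSES.get(mnemonic.lower(), mnemonic.upper())
        out.append(f"{op_class} {', '.join(operands)}".strip())
    return out


code = [
    "mov eax, [value1]",
    "add eax, [value2]",
    "shl eax, 2",
    "push eax",
    "pop ebx",
    "xor ecx, ecx",
]
for line in normalise(code):
    print(line)
```

On the six instructions of the example it gives back the same normalised lines as the listing above.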

Some people also add operator simplification. I don’t like it because it’s often too simplified, but in some use cases it’s very useful.

Embedding

There are different ways to embed the code. There is x2v, which represents instructions as vectors, but at a higher level some people use BERT-like models and GCNNs to embed basic blocks and/or graphs.

The easiest one to understand is function naming. Furthermore, I think it’s very useful information to give to a model so it can understand what your function is used for: if the name represents the function well in context, it’s high-value information.
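To make the "code as a vector" idea concrete, here is a deliberately naive Python sketch. It is not x2v or a BERT-like model: it is a much simpler count-based stand-in (a bag of normalised instructions per basic block) that I use only to show what an embedding gives you, namely vectors you can compare. The blocks and all the names (`embed`, `cosine`, `block_a`...) are made up for the example.

```python
import math
from collections import Counter


def embed(block, vocabulary):
    """Bag-of-instructions vector: one count per known normalised instruction."""
    counts = Counter(block)
    return [counts[token] for token in vocabulary]


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if one of them is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Three made-up basic blocks, already normalised.
block_a = ["MOV REG, [MEM]", "ARI REG, [MEM]", "BIT REG, IMM"]
block_b = ["MOV REG, [MEM]", "ARI REG, [MEM]", "ARI REG, [MEM]"]
block_c = ["PUSH REG", "POP REG"]

vocabulary = sorted(set(block_a + block_b + block_c))
vec_a, vec_b, vec_c = (embed(b, vocabulary) for b in (block_a, block_b, block_c))

print(cosine(vec_a, vec_b))  # blocks sharing instructions: similarity above 0
print(cosine(vec_a, vec_c))  # blocks sharing nothing: similarity of 0
```

Learned embeddings play the same role, except that the vectors come from a trained model instead of raw counts, so they can also capture context.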

Emulation

We run into some limitations if we only use static analysis, and emulation is a good way to get high-level information.

In my state of the art I found that emulation allows you to get:

return value
return type
variable type
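As an illustration of the "return value" point, here is a toy emulator in Python for the small ASM example from the normalisation section. It is only a sketch under my own assumptions: it understands just the six mnemonics of that example, registers are 32-bit, and `emulate` is a made-up name. A real project would rather rely on an existing emulator than on something like this.

```python
MASK = 0xFFFFFFFF  # keep results on 32 bits, like eax/ebx/ecx


def emulate(program, memory):
    """Run a list of 'mnemonic op1, op2' lines and return the final registers."""
    regs, stack = {}, []

    def read(operand):
        if operand.startswith("["):      # memory operand such as [value1]
            return memory[operand[1:-1]]
        if operand.isdigit():            # immediate constant
            return int(operand)
        return regs.get(operand, 0)      # register (0 if never written)

    for line in program:
        mnemonic, _, rest = line.partition(" ")
        ops = [o.strip() for o in rest.split(",")]
        if mnemonic == "mov":
            regs[ops[0]] = read(ops[1])
        elif mnemonic == "add":
            regs[ops[0]] = (read(ops[0]) + read(ops[1])) & MASK
        elif mnemonic == "shl":
            regs[ops[0]] = (read(ops[0]) << read(ops[1])) & MASK
        elif mnemonic == "xor":
            regs[ops[0]] = read(ops[0]) ^ read(ops[1])
        elif mnemonic == "push":
            stack.append(read(ops[0]))
        elif mnemonic == "pop":
            regs[ops[0]] = stack.pop()
        else:
            raise ValueError(f"unsupported instruction: {line}")
    return regs


program = [
    "mov eax, [value1]",
    "add eax, [value2]",
    "shl eax, 2",
    "push eax",
    "pop ebx",
    "xor ecx, ecx",
]
registers = emulate(program, {"value1": 10, "value2": 20})
print(registers)  # eax is where x86 code conventionally leaves its return value
```

Here the run ends with eax = (10 + 20) << 2 = 120, a concrete value that static analysis alone would have to derive symbolically.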

[I will continue this once I have my notes back, I can’t remember everything]
