Overview of data for automated reverse engineering
Intro
In this post I will present the different kinds of data I came across in my state of the art. I will only cover the most relevant ones, because I forgot my notes at a friend's house (yes, our parties are fun, but nerds only).
In this project, because I was a total noob, I decided to study only ELF files, so I won't talk about things I have only heard about but never really studied, like packers.
Also, because I wanted at least a bit of a challenge, I assumed that the program contains no debug symbols and is stripped.
Assembly
Let's start with the raw file. It is an ELF structure, but I won't detail its layout; we will only talk about the code.
At first the file is just a bunch of raw bytes. Here is the start of it, in hexdump style (this part is the ELF header, the instructions come later in the file):
0000000 457f 464c 0102 0001 0000 0000 0000 0000
0000010 0002 003e 0001 0000 1050 0040 0000 0000
0000020 0040 0000 0000 0000 3918 0000 0000 0000
0000030 0000 0000 0040 0038 000c 0040 0020 001f
0000040 0006 0000 0004 0000 0040 0000 0000 0000
0000050 0040 0040 0000 0000 0040 0040 0000 0000
0000060 02a0 0000 0000 0000 02a0 0000 0000 0000
0000070 0008 0000 0000 0000 0003 0000 0004 0000
0000080 2000 0000 0000 0000 2000 0040 0000 0000
0000090 2000 0040 0000 0000 001c 0000 0000 0000
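Even without a disassembler, these bytes already carry structure. Here is a minimal sketch that decodes the first two rows of the dump above with Python's `struct` module. The function names are hypothetical, and the code assumes the dump uses hexdump's default format (16-bit little-endian words) and that the file is a little-endian ELF64.

```python
import struct

# First two rows of the dump (32 bytes). The words are written the way
# hexdump prints them: 16-bit little-endian, so "457f" is the bytes 7f 45.
DUMP_WORDS = (
    "457f 464c 0102 0001 0000 0000 0000 0000 "
    "0002 003e 0001 0000 1050 0040 0000 0000"
).split()


def words_to_bytes(words):
    """Turn hexdump-style 16-bit words back into the original byte order."""
    return b"".join(struct.pack("<H", int(w, 16)) for w in words)


def parse_elf_header_start(data):
    """Decode the identification bytes and the first fields of the header."""
    if data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    # Assumption: little-endian ELF64, which is what the dump shows.
    e_type, e_machine, e_version, e_entry = struct.unpack_from("<HHIQ", data, 16)
    return {
        "class": {1: "ELF32", 2: "ELF64"}[data[4]],
        "endianness": {1: "little", 2: "big"}[data[5]],
        "type": e_type,        # 2 means executable
        "machine": e_machine,  # 0x3e means x86-64
        "entry": e_entry,      # address of the first executed instruction
    }


header = parse_elf_header_start(words_to_bytes(DUMP_WORDS))
print(header["class"], header["endianness"], hex(header["entry"]))
```

On this dump it reports a 64-bit little-endian x86-64 executable whose entry point is 0x401050.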
These raw bytes are more or less just integers, which in the code section correspond to operators, operands and constants. Some models work directly on them to perform disassembly and do fairly well, but it's hard to trust a partly random process, so I prefer IDA to get the ASM:
; Segment type: Pure code
; Segment permissions: Read/Execute
_fini segment dword public 'CODE' use64
assume cs:_fini
;org 4011ACh
assume es:nothing, ss:nothing, ds:_data, fs:nothing, gs:nothing
public _term_proc
_term_proc proc near
endbr64 ; _fini
sub rsp, 8
add rsp, 8
retn
_term_proc endp
_fini ends
With this a human can start working, and so can a machine: you can paste it into ChatGPT and it will describe what the code does, and given enough context it may be very precise about its purpose.
Now that we have assembly code, we will look at the ways it can be represented and used.
Graphs
I won't go into detail on what they are; you can check my notes if needed.
Graph | Type of values | Usage | Do I like it? |
---|---|---|---|
CG (call graph) | Links between subroutines | Structure of the code | Not very useful |
DFG (data-flow graph) | Data movement within a subroutine | Subroutine role ID | Underestimated |
CFG (control-flow graph) | Logic of a subroutine | Comp param ID + role ID | Intuitive, easy to use |
ACFG (attributed CFG) | CFG with more detail | Auto RE focus | Promising |
I think that ACFGs with annotations generated from the CFG may be very useful for characterising obfuscation.
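To make the CFG and ACFG rows more concrete, here is a minimal sketch that builds a CFG from a toy instruction list and then attaches a few attributes to each block. The instruction format `(address, mnemonic, jump_target)` and the two attributes are assumptions made for the illustration, not the output of IDA or of any specific tool.

```python
def build_cfg(instrs):
    """instrs: list of (address, mnemonic, jump_target_or_None), in address order."""
    addrs = [addr for addr, _, _ in instrs]
    # A block starts at the first instruction, at every jump target,
    # and right after every jump or return.
    leaders = {addrs[0]}
    for i, (_, mnem, target) in enumerate(instrs):
        if mnem.startswith("j") or mnem == "ret":
            if target is not None:
                leaders.add(target)
            if i + 1 < len(instrs):
                leaders.add(addrs[i + 1])

    blocks, current = {}, None
    for ins in instrs:
        if ins[0] in leaders:
            current = ins[0]
            blocks[current] = []
        blocks[current].append(ins)

    starts = sorted(blocks)
    edges = {s: set() for s in starts}
    for i, s in enumerate(starts):
        _, mnem, target = blocks[s][-1]
        if mnem == "ret":
            continue                      # no successor inside the function
        if mnem.startswith("j"):
            edges[s].add(target)          # taken branch
            if mnem == "jmp":
                continue                  # unconditional: no fall-through
        if i + 1 < len(starts):
            edges[s].add(starts[i + 1])   # fall-through to the next block
    return blocks, edges


def annotate(blocks, edges):
    """ACFG-style attributes per block (illustrative choice of attributes)."""
    return {s: {"n_instr": len(blocks[s]), "out_degree": len(edges[s])} for s in blocks}


toy = [
    (0, "cmp", None), (1, "je", 4),
    (2, "mov", None), (3, "jmp", 5),
    (4, "mov", None),
    (5, "ret", None),
]
blocks, edges = build_cfg(toy)
acfg = annotate(blocks, edges)
print(edges)
print(acfg)
```

The toy function is a simple if/else: the first block branches to two blocks that both join at the final `ret`.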
Normalisation
To minimise the impact of the ISA on your model, it's very useful to simplify the instructions so that they only describe the logic of the code.
As an example, this ASM:
section .data
value1 dd 10
value2 dd 20
section .text
global _start
_start:
mov eax, [value1]
add eax, [value2]
shl eax, 2
push eax
pop ebx
xor ecx, ecx
is normalised to:
section .data
MEM1 dd 10
MEM2 dd 20
section .text
global START
START:
MOV REG, [MEM1]
ARI REG, [MEM2]
BIT REG, 2
PUSH REG
POP REG
BIT REG, REG
This is because, for our study, we don't really care which register or address the value goes to; it's the kind of data movement that matters.
Some people also add operator simplification. I don't like it because it is often too simplified, but in some use cases it's very useful.
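Here is a minimal sketch of such a normaliser, written to reproduce the instruction lines of the example above. The register list and the grouping of mnemonics into `ARI` and `BIT` are assumptions taken from that example; a real normaliser would need the full register set and operand syntax of the ISA.

```python
import re

# Assumption: only the 32-bit general-purpose x86 registers are handled.
REGISTERS = {"eax", "ebx", "ecx", "edx", "esi", "edi", "esp", "ebp"}

# Assumption: operator classes mirror the example (ARI = arithmetic, BIT = bitwise).
MNEMONIC_CLASS = {
    "add": "ARI", "sub": "ARI",
    "shl": "BIT", "shr": "BIT", "xor": "BIT", "and": "BIT", "or": "BIT",
}


def normalise(lines):
    mem_names = {}  # original label -> MEM1, MEM2, ... in order of appearance

    def norm_operand(op):
        op = op.strip()
        if op.lower() in REGISTERS:
            return "REG"
        match = re.fullmatch(r"\[(\w+)\]", op)
        if match:
            name = match.group(1)
            if name.lower() in REGISTERS:
                return "[REG]"
            if name not in mem_names:
                mem_names[name] = f"MEM{len(mem_names) + 1}"
            return f"[{mem_names[name]}]"
        return op  # immediates are kept as they are

    out = []
    for line in lines:
        mnem, _, rest = line.strip().partition(" ")
        operands = [norm_operand(o) for o in rest.split(",")] if rest.strip() else []
        mnem = MNEMONIC_CLASS.get(mnem.lower(), mnem.upper())
        out.append(f"{mnem} {', '.join(operands)}".strip())
    return out


code = [
    "mov eax, [value1]",
    "add eax, [value2]",
    "shl eax, 2",
    "push eax",
    "pop ebx",
    "xor ecx, ecx",
]
for line in normalise(code):
    print(line)
```

Removing the `MNEMONIC_CLASS` lookup gives the variant without operator simplification, where only registers and addresses are abstracted.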
Embedding
There are different ways to embed the code. There is x2v, which represents instructions as vectors, and at a higher level some people use BERT-like models and GCNNs to embed basic blocks and/or graphs.
The easiest to understand is function naming. Furthermore, I think it's very useful information to give to a model so it can understand what your function is for: if the name represents the function well in its context, it's high-value information.
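To give a feel for what "instructions as vectors" means, here is a deliberately simple sketch. It is not x2v: instead of learning vectors with a neural model, it just counts which tokens appear next to each other, which is the simplest context-based representation. The function name and the window size are assumptions for the illustration.

```python
from collections import Counter


def cooccurrence_vectors(tokens, window=1):
    """Represent each token by how often every other token appears near it."""
    vocab = sorted(set(tokens))
    counts = {tok: Counter() for tok in vocab}
    for i, tok in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[tok][tokens[j]] += 1
    # One dimension per vocabulary entry, in sorted order.
    return vocab, {tok: [counts[tok][ctx] for ctx in vocab] for tok in vocab}


# Mnemonics of the normalised example from the previous section.
tokens = ["MOV", "ARI", "BIT", "PUSH", "POP", "BIT"]
vocab, vectors = cooccurrence_vectors(tokens)
print(vocab)
for tok in vocab:
    print(tok, vectors[tok])
```

Learned embeddings follow the same intuition (tokens used in similar contexts get similar vectors) but produce dense vectors trained on a large corpus of code.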
Emulation
We run into limitations if we only use static analysis, and emulation is a good way to get high-level information.
In my state of the art I found that emulation allows you to get:
- return value
- return type
- variable type
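As a rough illustration of what emulation adds over static analysis, here is a toy interpreter for the six instructions of the normalisation example. It is only a sketch: real work would rely on an actual CPU emulator, and the supported instructions, the 32-bit masking and the dictionary used as memory are assumptions made for this example.

```python
MASK32 = 0xFFFFFFFF  # assumption: registers are 32 bits wide


def emulate(program, memory):
    """Run a tiny subset of x86 and return the final register values."""
    regs, stack = {}, []

    def read(op):
        if op.startswith("["):
            return memory[op.strip("[]")]   # memory operand, looked up by label
        try:
            return int(op, 0)               # immediate value
        except ValueError:
            return regs.get(op, 0)          # register (0 if never written)

    for line in program:
        mnem, _, rest = line.partition(" ")
        ops = [o.strip() for o in rest.split(",")]
        if mnem == "mov":
            regs[ops[0]] = read(ops[1])
        elif mnem == "add":
            regs[ops[0]] = (read(ops[0]) + read(ops[1])) & MASK32
        elif mnem == "shl":
            regs[ops[0]] = (read(ops[0]) << read(ops[1])) & MASK32
        elif mnem == "xor":
            regs[ops[0]] = read(ops[0]) ^ read(ops[1])
        elif mnem == "push":
            stack.append(read(ops[0]))
        elif mnem == "pop":
            regs[ops[0]] = stack.pop()
        else:
            raise NotImplementedError(mnem)
    return regs


program = [
    "mov eax, [value1]",
    "add eax, [value2]",
    "shl eax, 2",
    "push eax",
    "pop ebx",
    "xor ecx, ecx",
]
regs = emulate(program, {"value1": 10, "value2": 20})
print(regs)
```

Running the code gives concrete values: `eax` and `ebx` end up at 120, that is (10 + 20) shifted left by 2, and `ecx` at 0. This is the kind of information (an actual return value, and from it a plausible type) that is hard to get from the listing alone.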
[I will continue this once I have my notes, I can't remember everything]