
Source: https://arxiv.org/pdf/2412.07538v1

Date: 10 December 2024

Organisation(s): Università degli Studi di Napoli Federico II, Italy

Author(s): D. Cotroneo, F. C. Grasso, R. Natella, V. Orbinato

Note: When reading this, keep in mind that I personally think NN-based decompilation is a bad idea, so I may be biased.

[Paper Short] Neural Decomp and Vuln Prediction

TLDR

It may be possible to perform CWE detection with LLM-derived technology: binary (yes/no) detection reaches about 95% accuracy and exact CWE id identification reaches about 80%. Also, using neural decompilation improved the results compared to raw disassembly, even if the decompiled output itself is not the best material for a reverser.

Goal

They introduce a system for CWE detection and classification based on the state of the art in classification: Transformers.

It uses an LLM for decompilation, which is a little tricky because the stochastic nature of LLMs leads to a lack of trust from users. But in this system the output is not meant for a human, it feeds another model, and I think that's a good solution: we can determine characteristics of the code that will then be fact-checked by a human with standard decompilation methods, which is more trustworthy.
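Here is a minimal sketch of how I picture the pipeline (assembly → neural decompiler → CWE classifier). The checkpoint names and the helper function are my own assumptions for illustration, not the paper's code.

```python
# Minimal sketch of the decompile-then-classify pipeline as I understand it.
# Model checkpoints and helper names are assumptions, not the paper's code.
from transformers import pipeline

# Hypothetical checkpoints: a seq2seq neural decompiler and a CWE classifier.
decompiler = pipeline("text2text-generation", model="my-org/neural-decompiler")  # assumption
classifier = pipeline("text-classification", model="my-org/cwe-classifier")      # assumption

def analyse_function(asm_text: str) -> dict:
    # 1) Neural decompilation: assembly -> pseudo C. The output is noisy,
    #    but it is only consumed by the next model, never shown as ground truth.
    pseudo_c = decompiler(asm_text, max_length=338)[0]["generated_text"]
    # 2) Classification: pseudo C -> predicted CWE label (or "not vulnerable").
    pred = classifier(pseudo_c)[0]
    return {"pseudo_c": pseudo_c, "label": pred["label"], "score": pred["score"]}
```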

How

They use various datasets for training:

| Dataset    | # Samples | Compilable | Standardized Taxonomy    |
|------------|-----------|------------|--------------------------|
| ReVeal     | 18,169    | No         | N/A                      |
| Devign     | 26,037    | No         | N/A                      |
| Juliet     | 128,198   | Yes        | Yes                      |
| BigVul     | 264,919   | No         | Only for 8,783 samples   |
| DiverseVul | 330,492   | No         | Only for 16,109 samples  |

The only one I know is DiverseVul, but I'm very curious about Juliet.

That's the data for training, but we need a process to use it in the real world; check below.

Function Filter

So they keep only the .text section of the binary. They also remove OS directives, main, library functions, etc. The goal is to keep only the functions specific to the program.

These functions are what will be decompiled and classified by the system.
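As a rough sketch of what I imagine the filter doing, here is how you could keep only the program-specific functions of the .text section with pyelftools; the blacklist of runtime symbols is my own guess, not the paper's exact list.

```python
# Rough sketch of a function filter: keep only functions defined in .text
# and drop well-known runtime / libc symbols. The blacklist is my own guess.
from elftools.elf.elffile import ELFFile

RUNTIME_SYMBOLS = {"main", "_start", "_init", "_fini", "frame_dummy",
                   "register_tm_clones", "deregister_tm_clones",
                   "__libc_csu_init", "__libc_csu_fini"}

def program_functions(path: str):
    with open(path, "rb") as f:
        elf = ELFFile(f)
        text = elf.get_section_by_name(".text")
        lo, hi = text["sh_addr"], text["sh_addr"] + text["sh_size"]
        symtab = elf.get_section_by_name(".symtab")
        for sym in symtab.iter_symbols():
            if sym["st_info"]["type"] != "STT_FUNC":
                continue
            if not (lo <= sym["st_value"] < hi):
                continue          # defined outside .text (PLT stubs, etc.)
            if sym.name in RUNTIME_SYMBOLS or sym.name.startswith("__"):
                continue          # compiler / libc boilerplate
            yield sym.name, sym["st_value"], sym["st_size"]
```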

Why it’s good

With this approach the system analyses very small functions, which is essential for decompilation: the bigger the input, the longer it takes.

The lack of contextual data is not really a problem: the system is trained on raw functions without context, so it doesn't matter here.

How reliable

Decompilation

| Approach    | Model    | # Functions | Max length | BLEU-4 | ED  | METEOR | ROUGE-L |
|-------------|----------|-------------|------------|--------|-----|--------|---------|
| Ghidra      | IR       | 1,000       | 338        | 22%    | 23% | 29%    | 22%     |
| Katz et al. | RNN      | 700,000     | 88         | -      | 30% | -      | -       |
| Hoss et al. | fairseq  | 2,000,000   | 271        | -      | 54% | -      | -       |
| Proposed    | CodeBERT | 75,000      | 338        | 31%    | 39% | 43%    | 56%     |
| Proposed    | CodeT5+  | 75,000      | 338        | 48%    | 49% | 55%    | 70%     |
| Proposed    | fairseq  | 75,000      | 338        | 58%    | 59% | 64%    | 77%     |

WARNING: a better score doesn't mean better material for a reverser!! It just means less "distance" between the result and the source.

So we can see that, with standard NLP evaluation metrics, their work produces a "better" decompilation than Ghidra and competing NLP methods.

That mostly means the decompiled code is better suited to be consumed by another model. So that's good for their use case!
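To make the warning concrete, here is a small sketch of what these metrics actually compute: token overlap (BLEU-4 via NLTK) and a normalized edit similarity (the "ED" column) between the decompiled text and the original source. The example functions are made up.

```python
# Quick sketch of what these metrics actually measure: textual similarity
# between the decompiled function and the original source, nothing more.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def edit_similarity(a: str, b: str) -> float:
    # Normalized Levenshtein similarity, computed on characters.
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1,
                         prev[j - 1] + (a[i - 1] != b[j - 1]))
        prev = cur
    return 1 - prev[n] / max(m, n, 1)

source = "int add(int a, int b) { return a + b; }"
decomp = "int func_1(int a, int b) { return a + b; }"
bleu4 = sentence_bleu([source.split()], decomp.split(),
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU-4 ~ {bleu4:.2f}, ED similarity ~ {edit_similarity(source, decomp):.2f}")
```

A renamed but otherwise identical function scores high here, even though the rename is exactly the kind of detail a reverser cares about.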

I'm curious about the same statistics with decompiled output from Binja and IDA.

Classification

| Approach      | Model    | Yes/No Acc (%) | Yes/No F1 (%) | Exact CWE id Acc (%) | Exact CWE id F1 (%) |
|---------------|----------|----------------|---------------|----------------------|---------------------|
| Flawfinder    | Static   | 52             | 49            | 13                   | 11                  |
| Schaad et al. | SRNN     | 88             | 88            | 77                   | 79                  |
| Proposed      | SRNN     | 88             | 88            | 72                   | 70                  |
| Proposed      | LSTM     | 94             | 94            | 75                   | 72                  |
| Proposed      | GRU      | 91             | 90            | 78                   | 77                  |
| Proposed      | CodeT5+  | 94             | 94            | 81                   | 81                  |
| Proposed      | CodeGPT  | 95             | 95            | 82                   | 82                  |
| Proposed      | CodeBERT | 94             | 94            | 83                   | 82                  |

Long story short, it works well.

Binary detection (yes or no) performs very well at flagging potentially vulnerable functions. The exact CWE matching is promising (not enough for fully automated analysis, but a good indicator for human evaluation).
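For reference, a minimal sketch of how a CodeBERT-style CWE classifier could be wired up with Hugging Face; the label set is an illustrative subset and the classification head here is freshly initialized, so this only shows the setup, not the paper's trained model.

```python
# Sketch of a CodeBERT-style CWE classifier setup with Hugging Face;
# the label set and training details are my assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

CWE_LABELS = ["not_vulnerable", "CWE-121", "CWE-190", "CWE-416"]  # illustrative subset

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/codebert-base", num_labels=len(CWE_LABELS))  # head is untrained here

def predict_cwe(pseudo_c: str) -> str:
    # Truncation matters: the function filter keeps the inputs short anyway.
    inputs = tokenizer(pseudo_c, truncation=True, max_length=338, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return CWE_LABELS[int(logits.argmax(dim=-1))]
```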

Conclusion

I think their work shows that even if we can't trust neural decompilation output, it may be useful for higher-level classification problems.

It's like an overkill, fat embedding, so it may be useful for my semantic embedding quest. It needs to be tested on other classification problems, like purpose or author identification.

Also, having a decompilation model is a great starting point for building a decompilation enhancer.

But, there is a but: I may have missed it, but I didn't find details about how the data was split between train and test. So we can't be sure whether the model is only good on its own dataset or whether it actually generalises this knowledge.

/machine learning/ /binary/ /reverse/