LLM4Decompile

Reverse Engineering: Decompiling Binary Code with Large Language Models

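About

LLM4Decompile is the first open-source large language model dedicated to decompilation. The current version decompiles Linux x86_64 binaries, compiled with GCC at optimization levels O0 through O3, into human-readable C source code. Our team is working to extend the tool to a broader range of architectures and configurations.
LLM4Decompile-End decompiles binaries directly. LLM4Decompile-Ref refines the pseudo-code produced by Ghidra.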

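Evaluation

Framework

During compilation, the preprocessor processes the source code (SRC) to strip comments and expand macros and includes. The preprocessed code is passed to the compiler, which converts it into assembly code (ASM). The assembler then turns the ASM into binary code (0s and 1s), and the linker completes the process by resolving function calls to produce an executable.

Decompilation goes the other way: it converts binary code back into source code. Large language models (LLMs) are trained on text and cannot process binary data directly, so the binary must first be disassembled into assembly language (ASM) with Objdump. Binary code and its disassembled ASM are equivalent and can be converted into each other, so we use the two terms interchangeably. During training, the loss computed between the decompiled code and the source code guides the model. To assess the quality of the decompiled code (SRC'), its functionality is checked with test assertions (re-executability).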
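
The stages described above can be sketched as a sequence of shell commands. This is only an illustration, not the project's tooling: the helper name `pipeline_commands` and the file naming scheme are assumptions, while the GCC options (`-E`, `-S`, `-c`, `-o`) and `objdump -d` are the standard ones.

```python
def pipeline_commands(stem: str, opt: str = "O0") -> list[str]:
    """Return one command per stage: SRC -> ASM -> binary -> disassembled ASM.

    `stem` is a hypothetical file path without extension, e.g. "samples/sample".
    """
    return [
        f"gcc -E {stem}.c -o {stem}.i",         # preprocessor: strip comments, expand macros/includes
        f"gcc -S -{opt} {stem}.i -o {stem}.s",  # compiler: preprocessed C -> assembly (ASM)
        f"gcc -c {stem}.s -o {stem}.o",         # assembler: ASM -> object code
        f"gcc {stem}.o -o {stem} -lm",          # linker: resolve calls, build the executable
        f"objdump -d {stem} > {stem}.dis.s",    # disassembler: binary -> ASM text for the model
    ]

# Print the commands for one source file at optimization level O2
for cmd in pipeline_commands("samples/sample", "O2"):
    print(cmd)
```

The last command is the only one that belongs to the decompilation side: it produces the ASM text that the model takes as input.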

Metrics

Re-executability measures whether the decompiled code runs correctly and passes all predefined test cases.
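
As a rough sketch of how such a metric could be computed: compile each decompiled function together with its test assertions, run the result, and report the fraction of samples that pass. The helper names, the temporary-file layout, and the 10-second timeout are assumptions made for illustration; the project's own evaluation script may work differently.

```python
import os
import subprocess
import tempfile

def passes_tests(c_func: str, c_test: str, timeout: int = 10) -> bool:
    """Compile decompiled code plus its test harness with GCC and run it.

    Returns True only if compilation succeeds and the program exits with 0.
    """
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "combined.c")
        exe = os.path.join(tmp, "combined")
        with open(src, "w", encoding="utf-8") as f:
            f.write(c_func + "\n" + c_test)
        try:
            build = subprocess.run(["gcc", src, "-o", exe, "-lm"],
                                   capture_output=True, timeout=timeout)
            if build.returncode != 0:
                return False  # the decompiled code does not compile
            run = subprocess.run([exe], capture_output=True, timeout=timeout)
            return run.returncode == 0  # a failed assert aborts with a non-zero status
        except (OSError, subprocess.TimeoutExpired):
            return False  # gcc missing, launch failure, or a program that never finishes

def re_executability(outcomes: list[bool]) -> float:
    """Fraction of samples whose decompiled code passed every assertion."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0
```

With this definition, three passing samples out of four give `re_executability([True, False, True, True]) == 0.75`.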

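Benchmarks

HumanEval-Decompile: a collection of 164 C functions that depend only on the standard C library.
ExeBench: a collection of 2,621 functions drawn from real-world projects, each of which uses user-defined functions, structures, and macros.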

Results

Models

LLM4Decompile includes models ranging from 1.3 billion to 33 billion parameters, all released on Hugging Face.

Model	Checkpoint	Size	Re-executability	Note
llm4decompile-1.3b	🤗 HF Link	1.3B	10.6%	-
llm4decompile-6.7b	🤗 HF Link	6.7B	21.4%	-
llm4decompile-33b	🤗 HF Link	33B	21.5%	-
llm4decompile-6.7b-nsp	🤗 HF Link	6.7B	20.9%	Note 1
llm4decompile-6.7b-uo	🤗 HF Link	6.7B	21.9%	Note 2
llm4decompile-1.3b-v1.5	🤗 HF Link	1.3B	27.3%	Note 3
llm4decompile-6.7b-v1.5	🤗 HF Link	6.7B	45.4%	Note 3
llm4decompile-1.3b-v2	🤗 HF Link	1.3B	46.0%	Note 4
llm4decompile-6.7b-v2	🤗 HF Link	6.7B	52.7%	Note 4
llm4decompile-22b-v2	🤗 HF Link	22B	63.6%	Note 4

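Note 1: The NSP model is trained on assembly code; its average re-executability is around 0.17.

Note 2: The unified optimization (UO) model is trained without prior knowledge of the optimization level (O0~O3); its average re-executability is around 0.21. The preprocessing for the UO model differs slightly (no prior knowledge of the On level), so please check the model page.

Note 3: The V1.5 series is trained on a larger dataset (15 billion tokens) with a maximum token length of 4,096 and performs markedly better than the previous models (over 100% improvement).

Note 4: The V2 series builds on Ghidra and is trained on 2 billion tokens to refine the pseudo-code that Ghidra produces. See the ghidra folder for details.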
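
To make the "over 100% improvement" in Note 3 concrete, the relative gain can be worked out from the re-executability figures in the table. The helper name `relative_gain` is illustrative only.

```python
def relative_gain(old: float, new: float) -> float:
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100

# Re-executability (%) taken from the table: original series vs. V1.5
pairs = {"1.3b": (10.6, 27.3), "6.7b": (21.4, 45.4)}
for size, (old, new) in pairs.items():
    print(f"{size}: {relative_gain(old, new):.0f}% relative improvement")
```

Both sizes come out above 100%: roughly 158% for the 1.3B model and 112% for the 6.7B model.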

Quick Start

Setup: Use the script below to install the required environment.

git clone https://github.com/albertan017/LLM4Decompile.git
cd LLM4Decompile
conda create -n 'llm4decompile' python=3.9 -y
conda activate llm4decompile
pip install -r requirements.txt

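Here is an example of how to use our model (revised for V1.5; for earlier models, see the corresponding model page on HF). Note: replace func0 with the name of the function you want to decompile.

Preprocessing: Compile the C code into a binary, then disassemble the binary into assembly instructions.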

import subprocess

OPT = ["O0", "O1", "O2", "O3"]
fileName = 'samples/sample'  # path to the C file, without the .c extension
func_name = 'func0'  # IMPORTANT: replace func0 with the function name
label = f'<{func_name}>:'
for opt_state in OPT:
    output_file = fileName + '_' + opt_state
    input_file = fileName + '.c'
    compile_command = f'gcc -o {output_file}.o {input_file} -{opt_state} -lm'  # compile the code with GCC on Linux
    subprocess.run(compile_command, shell=True, check=True)
    compile_command = f'objdump -d {output_file}.o > {output_file}.s'  # disassemble the binary into assembly instructions
    subprocess.run(compile_command, shell=True, check=True)

    with open(output_file + '.s') as f:  # disassembly file
        asm = f.read()
    if label not in asm:
        raise ValueError(f"compile fails: {label} not found in {output_file}.s")
    # Keep only this function: its listing ends at the first blank line
    asm = label + asm.split(label)[-1].split('\n\n')[0]
    asm_clean = ""
    for tmp in asm.split("\n"):
        if len(tmp.split("\t")) < 3 and '00' in tmp:
            continue  # skip byte-only continuation lines
        idx = min(len(tmp.split("\t")) - 1, 2)
        tmp_asm = "\t".join(tmp.split("\t")[idx:])  # remove the address and binary code
        tmp_asm = tmp_asm.split("#")[0].strip()  # remove the comments
        asm_clean += tmp_asm + "\n"
    input_asm = asm_clean.strip()
    before = "# This is the assembly code:\n"  # prompt
    after = "\n# What is the source code?\n"  # prompt
    input_asm_prompt = before + input_asm + after
    with open(fileName + '_' + opt_state + '.asm', 'w', encoding='utf-8') as f:
        f.write(input_asm_prompt)

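The assembly instructions should be in the following format:

<FUNCTION_NAME>:\nOPERATIONS\nOPERATIONS\n

Typical assembly instructions may look like this:

<func0>:
endbr64
lea    (%rdi,%rsi,1),%eax
retq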
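
The same cleanup can also be wrapped in a reusable function and tried on an in-memory string. The sketch below mirrors the preprocessing logic shown earlier; the function name `extract_function_asm` and the sample objdump text are made up for illustration.

```python
def extract_function_asm(objdump_text: str, func_name: str) -> str:
    """Return `<func_name>:` followed by its instructions, one per line,
    with addresses, raw bytes, and trailing `#` comments removed."""
    label = f"<{func_name}>:"
    if label not in objdump_text:
        raise ValueError(f"{label} not found in disassembly")
    # A function's listing ends at the first blank line after its label.
    body = label + objdump_text.split(label)[-1].split("\n\n")[0]
    cleaned = []
    for line in body.split("\n"):
        fields = line.split("\t")
        if len(fields) < 3 and "00" in line:
            continue  # byte-only continuation line of a long instruction
        idx = min(len(fields) - 1, 2)
        instr = "\t".join(fields[idx:]).split("#")[0].strip()
        cleaned.append(instr)
    return "\n".join(cleaned).strip()

# Hypothetical objdump output: address, raw bytes, instruction (tab-separated)
sample = (
    "0000000000001129 <func0>:\n"
    "    1129:\tf3 0f 1e fa          \tendbr64\n"
    "    112d:\t8d 04 37             \tlea    (%rdi,%rsi,1),%eax\n"
    "    1130:\tc3                   \tret\n"
    "\n"
    "0000000000001131 <main>:\n"
)
print(extract_function_asm(sample, "func0"))
```

Like the original loop, this drops any line with fewer than three tab-separated fields that contains "00", so a function whose name itself contains "00" would lose its label line.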


**Decompilation:** Use LLM4Decompile to translate the assembly instructions into C:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_path = 'LLM4Binary/llm4decompile-6.7b-v1.5'  # V1.5 model
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16).cuda()

# fileName and OPT are defined in the preprocessing step
with open(fileName + '_' + OPT[0] + '.asm', 'r') as f:  # optimization level O0
    asm_func = f.read()
inputs = tokenizer(asm_func, return_tensors="pt").to(model.device)
with torch.no_grad():
    # The maximum length is 4096 tokens; prompt plus new tokens must stay within it
    outputs = model.generate(**inputs, max_new_tokens=2048)
# Keep only the newly generated tokens and drop the final token
prompt_len = inputs["input_ids"].shape[1]
c_func_decompile = tokenizer.decode(outputs[0][prompt_len:-1])

with open(fileName + '.c', 'r') as f:  # original file
    func = f.read()

# Note: only one function is decompiled, whereas the original file may contain several
print(f'original function:\n{func}')
print(f'decompiled function:\n{c_func_decompile}')
```

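HumanEval-Decompile

The data is stored in llm4decompile/decompile-eval/decompile-eval-executable-gcc-obj.json as a JSON list. There are 164*4 (O0, O1, O2, O3) samples, each with five keys:

task_id: the ID of the problem.
type: the optimization level, one of [O0, O1, O2, O3].
c_func: the C solution to the HumanEval problem.
c_test: the C test assertions.
input_asm_prompt: the assembly instructions with prompts, which can be derived as in our preprocessing example.

Please check the evaluation scripts.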
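
A minimal sketch of reading this file and grouping the samples by optimization level, assuming only the JSON-list layout and the five keys described above. The function name `load_by_opt_level` is illustrative and not part of the repository.

```python
import json
from collections import defaultdict

def load_by_opt_level(path: str) -> dict[str, list[dict]]:
    """Load the benchmark JSON list and bucket samples by their `type` (O0..O3)."""
    with open(path, encoding="utf-8") as f:
        samples = json.load(f)
    buckets = defaultdict(list)
    for sample in samples:
        buckets[sample["type"]].append(sample)
    return dict(buckets)

# Example usage (path as given in this section):
# groups = load_by_opt_level(
#     "llm4decompile/decompile-eval/decompile-eval-executable-gcc-obj.json")
# for opt, items in sorted(groups.items()):
#     print(opt, len(items))  # number of samples at each optimization level
```

Each bucket can then be fed to the model through `input_asm_prompt` and scored against `c_test`.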

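Ongoing

Larger training dataset with a cleaning pipeline. (done: 2024.05.13)
Support for popular languages/platforms and settings.
Support for executable binaries. (done: 2024.05.13)
Integration with decompilation tools (e.g., Ghidra, Rizin).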

License

This code repository is released under the MIT and DeepSeek licenses.

Citation

@misc{tan2024llm4decompile,
      title={LLM4Decompile: Decompiling Binary Code with Large Language Models}, 
      author={Hanzhuo Tan and Qi Luo and Jing Li and Yuqun Zhang},
      year={2024},
      eprint={2403.05286},
      archivePrefix={arXiv},
      primaryClass={cs.PL}
}