from Hacker News

LLM4Decompile: Decompiling Binary Code with LLM

by Davidbrcz on 3/17/24, 10:15 AM with 129 comments

by klik99 on 3/17/24, 1:33 PM
This is a fascinating idea, but (honest question, not a judgement) would the output be reliable? It would be hard to identify hallucinations since recompiling could produce different machine code. Particularly if there is some novel construct that could be a key part of the code. Are there ways of also reporting the LLM's confidence in sections like this when running generatively? It’s an amazing idea but I worry it would stumble invisibly on the parts that are most critical. I suppose it would just need human confirmation on the output.
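One rough way to surface confidence, sketched here under the assumption that the serving stack exposes a log-probability for each generated token (the function name, threshold, and sample values are hypothetical), is to flag tokens the model was unsure about so a human can review them first. A low token probability is only a hint that a span deserves review, not a proof that it is wrong.

```python
import math


def flag_low_confidence(tokens, logprobs, threshold=0.5):
    """Return (index, token, probability) for every token whose
    probability falls below `threshold`.

    `tokens` and `logprobs` are parallel lists; how they are obtained
    depends on the model API and is assumed here, not specified.
    """
    if len(tokens) != len(logprobs):
        raise ValueError("tokens and logprobs must have the same length")
    flagged = []
    for index, (token, logprob) in enumerate(zip(tokens, logprobs)):
        probability = math.exp(logprob)  # log-probability back to probability
        if probability < threshold:
            flagged.append((index, token, round(probability, 3)))
    return flagged


# Hypothetical decompiler output where the constant is the uncertain part.
sample_tokens = ["int", " x", " =", " 42", ";"]
sample_logprobs = [math.log(p) for p in (0.99, 0.90, 0.95, 0.20, 0.98)]
suspicious = flag_low_confidence(sample_tokens, sample_logprobs)
```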
by nebula8804 on 3/17/24, 11:23 AM
Will be interesting to see if there is some way to train a decompilation module based on who we know developed the application, using their previous code as training data. For example: Super Mario 64 and Zelda 64 were fully decompiled and a handful of other N64 games are in the process. I wonder if we could map which developers worked on these two games (maybe even guess who did what module) and then use that to more easily decompile any other game that had those developers working on it.
If this gets really good, maybe we can dream of having a fully de-obfuscated and open source life. All the layers of binary blobs in a PC can finally be decoded. All the drivers can be open. Why not do the OS as well! We don't have to settle for Linux, we can bring back Windows XP and back port modern security and app compatibility into the OS and Microsoft can keep their Windows 11 junk...at least one can dream! :D
by madisonmay on 3/17/24, 1:21 PM
This is an excellent use case for LLM fine-tuning, purely because of the ease of generating a massive dataset of input / output pairs from public C code
by YeGoblynQueenne on 3/17/24, 9:42 PM
If I read the "re-executability" results in the Results figure right then that's a great idea but it doesn't really work:
https://raw.githubusercontent.com/albertan017/LLM4Decompile/...
To clarify:
>> Re-executability provides this critical measure of semantic correctness. By re-compiling the decompiled output and running the test cases, we assess if the decompilation preserved the program logic and behavior. Together, re-compilability and re-executability indicate syntax recovery and semantic preservation - both essential for usable and robust decompilation.
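The two metrics in the quote can be sketched roughly as follows. This is not the authors' evaluation harness: it assumes each decompiled C program carries its own `main()` with assert-style tests, so exit code 0 means all test cases passed, and it assumes `gcc` is on the PATH. Function names are hypothetical.

```python
import subprocess
import tempfile
from pathlib import Path


def check_sample(c_source: str, compiler: str = "gcc", timeout: int = 10) -> tuple[bool, bool]:
    """Return (recompiles, reexecutes) for one decompiled C program."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "sample.c"
        exe = Path(tmp) / "sample"
        src.write_text(c_source)
        build = subprocess.run([compiler, str(src), "-o", str(exe)], capture_output=True)
        if build.returncode != 0:
            return False, False  # does not even compile
        try:
            run = subprocess.run([str(exe)], capture_output=True, timeout=timeout)
        except subprocess.TimeoutExpired:
            return True, False  # compiles, but hangs on the tests
        return True, run.returncode == 0


def summarize(outcomes):
    """Aggregate (recompiles, reexecutes) pairs into the two rates."""
    total = len(outcomes)
    if total == 0:
        return {"re_compilability": 0.0, "re_executability": 0.0}
    return {
        "re_compilability": sum(1 for compiled, _ in outcomes if compiled) / total,
        "re_executability": sum(1 for _, passed in outcomes if passed) / total,
    }
```

A model can score high on the first number and low on the second, which is exactly the gap between syntactically valid and semantically faithful output.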
by a2code on 3/17/24, 12:53 PM
The problem is interesting in at least two aspects. First, an ideal decompiler would eliminate proprietary source code. Second, the abundant publicly available C code allows you to simply make a dataset of paired ASM and source code. There is also a lot of variety with optimization level, compiler choice, and platform.
What is unclear to me is: why did the authors fine-tune the DeepSeek-Coder model? Can you train an LLM from zero with a similar dataset? How big does the LLM need to be? Can it run locally?
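To make the "variety" point concrete, here is a minimal sketch of the build matrix that would turn one C file into several ASM/source pairs. It only constructs the command lines (standard `-O` levels, `-c`, and `objdump -d`); the output naming scheme and the function name are assumptions for illustration, and nothing here reflects the paper's actual pipeline.

```python
from itertools import product


def build_matrix(source_file, compilers=("gcc", "clang"),
                 opt_levels=("-O0", "-O1", "-O2", "-O3")):
    """List the compile and disassemble commands for every
    compiler / optimization-level combination of one C file."""
    stem = source_file.rsplit(".", 1)[0]
    jobs = []
    for compiler, opt in product(compilers, opt_levels):
        obj = f"{stem}.{compiler}{opt}.o"  # hypothetical naming scheme
        jobs.append({
            "compile": [compiler, opt, "-c", source_file, "-o", obj],
            "disassemble": ["objdump", "-d", obj],
        })
    return jobs


jobs = build_matrix("foo.c")
```

Each job's command lists can be passed to `subprocess.run` as-is; one source file yields eight training pairs with the defaults above.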
by kukas on 3/17/24, 11:53 AM
Hey, I am working on my own LLM-based decompiler for Python bytecode (https://github.com/kukas/deepcompyle). I feel there are not many people working on this research direction but I think it could be quite interesting, especially now that longer attention contexts are becoming feasible. If anyone knows a team that is working on this, I would be quite interested in cooperation.
by dwrodri on 3/18/24, 9:01 PM
I have been planning to work on something like this. I think that eventually, someone will crack the "binary in -> good source code out of LLM" pipeline but we are probably a few years away from that still. I say a few years because I don't think there's a huge pile of money sitting at the end of this problem, but maybe I'm wrong.
A really good "stop-gap" approach would be to build a decompilation pipeline using Ghidra in headless mode and then combine the strict syntax correctness of a decompiler with the "intuition/system 1 skills" of an LLM. My inspiration for this setup comes from two recent advancements, both shared here on HN:
1. AlphaGeometry: The Decompiler and the LLM should complement each other, covering each other's weaknesses. https://deepmind.google/discover/blog/alphageometry-an-olymp...
2. AICI: We need a better way of "hacking" on top of these models, and being able to use something like AICI as the "glue" to coordinate the generation of C source. I don't really want the weights of my LLM to be used to generate syntactically correct C source, I want the LLM to think in terms of variable names, "snippet patterns" and architectural choices while other tools (Ghidra, LLVM) worry about the rest. https://github.com/microsoft/aici
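A minimal sketch of that division of labour, assuming the decompiler has already produced syntactically valid C and the LLM only returns a rename map (the function name and the sample identifiers are hypothetical, written in the style of Ghidra's default names):

```python
import re

C_IDENTIFIER = re.compile(r"[A-Za-z_]\w*")


def apply_renames(decompiled_c: str, renames: dict[str, str]) -> str:
    """Apply LLM-suggested identifier renames to decompiler output.

    The structure of the code is never touched; only whole identifiers
    are swapped, and suggestions that are not valid C identifiers are
    rejected instead of trusted.
    """
    for new_name in renames.values():
        if not C_IDENTIFIER.fullmatch(new_name):
            raise ValueError(f"not a valid C identifier: {new_name!r}")
    if not renames:
        return decompiled_c
    # Longest names first so the alternation prefers the longest match.
    names = sorted(renames, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")
    return pattern.sub(lambda match: renames[match.group(1)], decompiled_c)


raw = "int FUN_00401000(int param_1) { int iVar1 = param_1 * 2; return iVar1; }"
suggested = {"FUN_00401000": "double_value", "param_1": "value", "iVar1": "doubled"}
readable = apply_renames(raw, suggested)
```

A real pipeline would also need to check for name collisions and avoid rewriting string literals; this sketch skips both.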
Obviously this is all hand-wavey armchair commentary from a former grad student who just thinks this stuff is cool. Huge props to these researchers for diving into this. I know the authors already mentioned incorporating Ghidra into their future work, so I know they're on the right track.
by potatoman22 on 3/17/24, 11:03 AM
It's interesting the 6b model outperforms the 33b model. I wonder if it means the 33b model needs more training data? It was pretrained on ~1 million C programs, compared to DeepSeek-Coder, which was trained on 2 trillion tokens, which is a few orders of magnitude more data.
I'm also curious about how this compares to non-LLM solutions.
by AndrewKemendo on 3/17/24, 4:00 PM
If successful wouldn’t you be replicating the compiler's machine code 1:1?
In which case that means fully complete code can live in the “latent space” but is distributed as probabilities
Or perhaps more likely would it be replicating the logic only, which can then be translated into the target language
I would guess that any binary that requires a non-deterministic input (key, hash etc…) to compile would break this
Fascinating
by kken on 3/17/24, 11:13 AM
Pretty wild how well GPT4 is still doing in comparison. It's significantly better than their model at creating compilable code, but is less accurate at recreating functional code. Still quite impressive.
by maCDzP on 3/17/24, 11:06 AM
Can this be used for deobfuscation of code? I really hadn’t thought about LLM being a tool during reverse engineering.
by sinuhe69 on 3/17/24, 6:23 PM
For me the huge difference between re-compilability and re-executability scores is very interesting. GPT4 achieved 8x% on re-compilability (syntactically correct) but an abysmal 1x% on re-executability (semantically correct), demonstrating once again its overgrown mimicry capacity.
by mahaloz on 3/17/24, 8:08 PM
It’s always cool to see different approaches in this area, but I worry its benchmarks are meaningless without a comparison of non-AI based approaches (like IDA Pro). It would be interesting to see how this model holds up on metrics from previous papers in security.
by saagarjha on 3/18/24, 2:11 AM
The approach here is interesting in that it answers a question a lot of people have been asking: “what happens if we pipe a binary into a trained LLM and ask it to decompile it?” The answer is that it doesn’t really work at all right now! This is a surprising result because the design of the paper kind of doesn’t allow for any other conclusion to be drawn. Notably, if the LLM did a really good job in the evaluation they designed it would still be unclear whether it was actually useful, because the test “does it compile and pass a few test cases” is not actually a very good way to test a decompiler.
A couple people here have suggested that the generated decompilation should match the source code exactly, which is a challenging thing to achieve and still hotly debated on whether it is a good metric or not. But the results here show that we’re starting to barely get past the “does it produce code” stage and move towards “does it produce code that looks vaguely correct” status but we’re definitely not there yet. Future steps of “is this a useful tool to drive decompilation” and “does this do better than state of the art” and “is this perfect at decompiling things” are still a long ways away. So it’s good to look at as a negative result as this area continues to attract new interest.
by jagrsw on 3/17/24, 12:35 PM
Decompilation is somewhat a default choice for ML in the world of comp-sec.
Searching for vulns and producing patches in source code is a bit problematic, as the databases of vulnerable source code examples and their corresponding patches are neither well-structured nor comprehensive, and sometimes very, very specific to the analyzed code (for higher abstraction type of problems). So, it's not easy to train something usable beyond standard mem safety problems and use of unsafe APIs.
The area of fuzzing is somewhat messy, with sporadic efforts undertaken here and there, but it also requires a lot of preparatory work, and the results might not be groundbreaking unless we reach a point where we can feed an ML model the entire source code of a project, allowing it to analyze and identify all bugs, producing fixes and providing offending inputs. i.e. not yet.
While decompilation is a fairly standard problem, it is possible to produce input-output pairs somewhat at will based on existing source code, using various compiler switches, CPU architectures, ABIs, obfuscations, syscall calling conventions. And train models on those input-output pairs (i.e. in reversed order).
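The "reversed order" part can be illustrated with a small sketch: the compiler maps source to assembly, while the training record puts the assembly in the prompt and the source in the completion. The prompt template and field names below are assumptions, not a format any particular project uses.

```python
import json


def make_training_record(source: str, assembly: str, opt_level: str) -> str:
    """Serialize one (assembly -> source) example as a JSON line."""
    record = {
        # Model input: what the compiler produced.
        "prompt": f"# Optimization level: {opt_level}\n# Assembly:\n{assembly}\n# Source:\n",
        # Model target: what the compiler consumed.
        "completion": source,
    }
    return json.dumps(record)


line = make_training_record("int f(void) { return 1; }", "mov eax, 1\nret", "O2")
```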
by m3kw9 on 3/17/24, 4:05 PM
Basically predicting code token by token except now you don’t even have a large enough context size and worse, you are using RAG
by mdaniel on 3/17/24, 6:02 PM
relevant: https://news.ycombinator.com/item?id=34250872 (G-3PO: A protocol droid for Ghidra, or GPT-3 for reverse-engineering <https://github.com/tenable/ghidra_tools/blob/main/g3po/g3po....>; Jan, 2023; 44 comments)
ed: seems they have this, too, which may value your submission: https://github.com/tenable/awesome-llm-cybersecurity-tools#a...
by dolmen on 3/18/24, 10:33 AM
It seems to me that the objdump step (to transform binary to human-readable assembly) is an unnecessary waste of runtime resources.
It should be possible to tokenize directly from the binary.
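A minimal sketch of what byte-level tokenization could look like, assuming a vocabulary of the 256 byte values plus two special tokens (the token IDs and function names are hypothetical). The sample bytes are my own encoding of an x86-64 epilogue: `add rsp, 0x48; pop r15; pop rdi; ret`.

```python
BOS, EOS = 256, 257  # special token IDs placed after the 256 byte values


def tokenize_bytes(blob: bytes) -> list[int]:
    """Map raw machine code to token IDs, one token per byte."""
    return [BOS, *blob, EOS]


def detokenize(ids: list[int]) -> bytes:
    """Invert tokenize_bytes by dropping the special tokens."""
    return bytes(i for i in ids if i < 256)


epilogue = bytes.fromhex("4883c448" "415f" "5f" "c3")
ids = tokenize_bytes(epilogue)
```

The trade-off is that the model then has to learn instruction boundaries and operand encodings itself, which the disassembler otherwise provides for free.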
by speedylight on 3/17/24, 6:52 PM
I have thought about doing something similar for heavily obfuscated JavaScript. Very useful for security research I imagine!
by Nuzzerino on 3/17/24, 11:36 PM
How does it actually compare to non-LLM decompilers (IDA, Binja, etc.)? I only see comparisons with other LLMs.
by ReptileMan on 3/17/24, 3:49 PM
Let's hope it kills Denuvo ...
by quantum_state on 3/17/24, 7:51 PM
It seems the next logical step would be LLMAssistedHacking to turn things upside down…

by xvilka on 3/18/24, 5:34 AM

I think using higher-level input, e.g. an intermediate language like RzIL[1], could produce better results and would scale better for making such decompilation multiplatform. As RzIL's text form resembles SMT, it should make it easier for an LLM to "understand" the meaning. Moreover, information from the binary such as symbols, signatures, and debug information (DWARF, PDB, etc.) could enrich the result further. You can download Rizin[2] and try it yourself by calling `aaa` then `plf` for any chosen function on architectures supported by RzIL. See the example excerpt for a function with this disassembly:

  │       │   0x140007e51      movsd qword [rdi + 0x50], xmm2
  │       │   0x140007e56      mov   qword [rdi + 0x48], 0
  │       │   0x140007e5e      call  sym.rz_test.exe_ht_pp_free          ; sym.rz_test.exe_ht_pp_free
  │       │   0x140007e63      movaps xmm7, xmmword [var_38h]
  │       │   0x140007e68      movaps xmm6, xmmword [var_28h]
  │       │   0x140007e6d      mov   rbp, qword [var_10h]
  │       └─> 0x140007e72      add   rsp, 0x48
  │           0x140007e76      pop   r15
  │           0x140007e78      pop   rdi
  └           0x140007e79      ret

  0x140007e6d (set rbp (loadw 0 64 (+ (var rsp) (bv 64 0x68))))
  0x140007e72 (seq (set op1 (var rsp)) (set op2 (bv 64 0x48)) (set sum (+ (var op1) (var op2))) (set rsp (var sum)) (set _result (var sum)) (set _popcnt (bv 8 0x0)) (set _val (cast 8 false (var _result))) (repeat (! (is_zero (var _val))) (seq (set _popcnt (+ (var _popcnt) (ite (lsb (var _val)) (bv 8 0x1) (bv 8 0x0)))) (set _val (>> (var _val) (bv 8 0x1) false)))) (set pf (is_zero (mod (var _popcnt) (bv 8 0x2)))) (set zf (is_zero (var _result))) (set sf (msb (var _result))) (set _result (var sum)) (set _x (var op1)) (set _y (var op2)) (set cf (|| (|| (&& (msb (var _x)) (msb (var _y))) (&& (! (msb (var _result))) (msb (var _y)))) (&& (msb (var _x)) (! (msb (var _result)))))) (set of (|| (&& (&& (! (msb (var _result))) (msb (var _x))) (msb (var _y))) (&& (&& (msb (var _result)) (! (msb (var _x)))) (! (msb (var _y)))))) (set af (|| (|| (&& (msb (cast 4 false (var _x))) (msb (cast 4 false (var _y)))) (&& (! (msb (cast 4 false (var _result)))) (msb (cast 4 false (var _y))))) (&& (msb (cast 4 false (var _x))) (! (msb (cast 4 false (var _result))))))))
  0x140007e76 (seq (set r15 (cast 64 false (loadw 0 64 (+ (var rsp) (bv 64 0x0))))) (set rsp (+ (var rsp) (bv 64 0x8))))
  0x140007e78 (seq (set rdi (loadw 0 64 (+ (var rsp) (bv 64 0x0)))) (set rsp (+ (var rsp) (bv 64 0x8))))
  0x140007e79 (seq (set tgt (loadw 0 64 (+ (var rsp) (bv 64 0x0)))) (set rsp (+ (var rsp) (bv 64 0x8))) (jmp (var tgt)))
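To read the long line for `add rsp, 0x48` more easily, here is a plain Python transcription of the flag logic it spells out. This is my own reading of the expression above, not code from Rizin; the function names are made up and the sample value of `rsp` is arbitrary.

```python
MASK64 = (1 << 64) - 1


def msb(value: int, bits: int) -> bool:
    """Most significant bit of a `bits`-wide value."""
    return bool((value >> (bits - 1)) & 1)


def carry(x: int, y: int, result: int, bits: int) -> bool:
    """Carry-out test in the same three-term form as the RzIL listing."""
    mx, my, mr = msb(x, bits), msb(y, bits), msb(result, bits)
    return (mx and my) or (not mr and my) or (mx and not mr)


def add_flags(x: int, y: int) -> dict:
    """Result and flags of a 64-bit add, mirroring the lifted expression."""
    result = (x + y) & MASK64
    mx, my, mr = msb(x, 64), msb(y, 64), msb(result, 64)
    return {
        "result": result,
        "pf": bin(result & 0xFF).count("1") % 2 == 0,  # even parity of low byte
        "zf": result == 0,
        "sf": mr,
        "cf": carry(x, y, result, 64),
        "of": (not mr and mx and my) or (mr and not mx and not my),
        "af": carry(x & 0xF, y & 0xF, result & 0xF, 4),  # carry out of bit 3
    }


flags = add_flags(0x1000, 0x48)  # arbitrary rsp value plus the 0x48 immediate
```

The popcount loop in the IL collapses to one `bin(...).count("1")` call, which shows how much of the listing is flag bookkeeping rather than the stack adjustment itself.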

[1] https://github.com/rizinorg/rizin/blob/dev/doc/rzil.md

[2] https://rizin.re

by xorvoid on 3/17/24, 5:19 PM
As someone who is actively developing a decompiler to reverse engineer old DOS 8086 video games, I'd have a hard time trusting an LLM to do this correctly. My standard is accurate semantics lifting from Machine Code to C. Reversing assembly to C is very delicate. There are many patterns that tend to usually map to obvious C constructs... except when they don't. And that assumes the original source was C. Once you bump into routines that were hand-coded assembly and break every established rule in the calling conventions, all bets are off. I'm somewhat convinced that decompilation cannot be made fully-automatic. Instead a good decompiler is just a lever-arm on the manual work a reverser would otherwise be doing. Corollary: I'm also somewhat convinced that only the decompiler's developers can really use it most effectively because they know where the "bodies are buried" and where different heuristics and assumptions were made. Decompilers are compilers with all the usual engineering challenges, plus a hard inference problem tacked on top.
All that said, I'm not a pessimist on this idea. I think it has pretty great promise as a technique for general reversing security analysis where the reversing is done mostly for "discovery" and "understanding" rather than for perfect semantic lifting to a high-level language. In that world, you can afford to develop "hypotheses" and then drill down to validate if you think you've discovered something big.
Compiling and testing the resulting decompilation is a great idea. I do that as well. The limitation here is TEST SUITE. Some random binary doesn't typically come with a high-coverage test suite, so you have to develop your own acceptance criterion as you go along. In other words: write tests for a function whose computation you don't understand (ha). I suppose a form of static-analysis / symbolic-computation might be handy here (I haven't explored that). Here you're also beset with challenges of specifying which machine state changes are important and which are superfluous (e.g. is it okay if the x86 FLAGS register isn't modified in the decompiled version, probably yes, but sometimes no).
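When no test suite exists, one common fallback is differential testing: feed the same random inputs to the original routine and to the decompiled version and look for a mismatch. The sketch below assumes both are callable from Python (in practice the original would run in an emulator); the 16-bit add routines are hypothetical stand-ins, with the "decompiled" one deliberately missing the wraparound.

```python
import random


def differential_test(reference, candidate, gen_input, trials=1000, seed=0):
    """Return the first input on which the two functions disagree,
    or None if no mismatch shows up within `trials` random inputs."""
    rng = random.Random(seed)  # seeded so failures are reproducible
    for _ in range(trials):
        args = gen_input(rng)
        if reference(*args) != candidate(*args):
            return args
    return None


def add16_original(a, b):
    """Stand-in for the original 8086 routine: 16-bit add with wraparound."""
    return (a + b) & 0xFFFF


def add16_decompiled(a, b):
    """Stand-in for a faulty lift that forgot the 16-bit truncation."""
    return a + b


def gen_words(rng):
    return rng.randrange(0x10000), rng.randrange(0x10000)


mismatch = differential_test(add16_original, add16_decompiled, gen_words)
```

Passing such a test only means no counterexample was found; it says nothing about state the comparison ignores, such as the FLAGS register mentioned above.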
In my case I don't have access to the original compiler and even if I did, I'm not sure I could convince it to reproduce the same code. Maybe this is more feasible for more modern binaries where you can assume GCC, Clang, MSVC, or ICC.
At any rate: crazy hard, crazy fun problem. I'm sure LLMs have a role somewhere, but I'm not sure exactly where: the future will tell. My guess is some kind of "copilot" / "assistant" type role rather than directly making the decisions.
(If this is your kind of thing... I'll be writing more about it on my blog soonish...)