patternModerate
Why is static recompilation not possible?
Viewed 0 times
whyrecompilationpossiblenotstatic
Problem
I'm researching static recompilation but there doesn't seem to be too much information about the subject. I've heard that dynamic recompilation (emulation) can be up to 6 times slower than native assembly, but I'm curious why we aren't able to translate to a different architecture ahead of time. Even though some instructions wouldn't be 1:1, can't we just shift the rest of the code, and all of the jump instructions with it?
Furthermore, if the problem is a lack of source code, would this mean that Super Mario 64 (which has already been completely reverse engineered, to exactly identical binaries), can be fairly easily recompiled into a different architecture?
Furthermore, if the problem is a lack of source code, would this mean that Super Mario 64 (which has already been completely reverse engineered, to exactly identical binaries), can be fairly easily recompiled into a different architecture?
Solution
Static recompilation from a binary is hard, because it is challenging to reconstruct the structure of the program. It is hard to statically figure out the location of all instructions that will be executed, the starting point of all functions, and the set of all jump targets. This information is needed for natural methods of recompilation: we need to know where all the instructions are, so we can recompile them; we need to know where function prologues and epilogues are, so we can translate them to other function calling conventions; we need to know the set of all jump targets, because all of those locations need to be recompiled. It's not impossible, but it can be extremely challenging to do with 100% fidelity.
Given a binary executable, it is hard to reliably find all of the executable code statically, due to the presence of indirect jumps. In particular, on x86, it is possible to jump into the "middle" of an instruction, which will cause the stream of bytes to be interpreted differently than you might expect. Since we can't predict all possible jump targets of indirect jumps (this is as hard as the halting problem), it is hard to know all locations that might be executed as code, and at what offset. This makes reliable static disassembly hard. And of course, if you can't even disassemble, it's challenging to re-assemble or re-compile.
See, e.g., work on static disassembly to learn about the subject. Here are some sample papers:
See also https://hexterisk.github.io/blog/posts/2020/04/02/disassembly-and-binary-analysis-fundamentals/.
This problem is particularly intractable for obfuscated binaries, but even for normal non-obfuscated binaries, existing methods have difficulty fully recovering 100% of the instructions, function starts, and jump targets with perfect accuracy. This is a problem, because if there is even one mistake, then the entire program might crash.
An additional challenge is that if the program does any kind of runtime code generation or runtime JIT or runtime recompilation, then this only makes the static recompilation problem even harder.
In contrast, runtime (dynamic) methods avoid this problem, because they can observe which instructions and code paths actually get executed and recompile only the ones that are executed, at the time they are executed.
Given a binary executable, it is hard to reliably find all of the executable code statically, due to the presence of indirect jumps. In particular, on x86, it is possible to jump into the "middle" of an instruction, which will cause the stream of bytes to be interpreted differently than you might expect. Since we can't predict all possible jump targets of indirect jumps (this is as hard as the halting problem), it is hard to know all locations that might be executed as code, and at what offset. This makes reliable static disassembly hard. And of course, if you can't even disassemble, it's challenging to re-assemble or re-compile.
See, e.g., work on static disassembly to learn about the subject. Here are some sample papers:
- From Hack to Elaborate Technique—A Survey on Binary Rewriting. Matthias Wenzl, Georg Merzdovnik, Johanna Ullrich, and Edgar Weippl. ACM Computing Surveys (CSUR), 52(3), 1-37
- An In-Depth Analysis of Disassembly on Full-Scale x86/x64 Binaries . Dennis Andriesse, Xi Chen, Victor van der Veen, Asia Slowinska, Herbert Bos. Usenix Security 2016.
- SoK: All You Ever Wanted to Know About x86/x64 Binary Disassembly But Were Afraid to Ask. Chengbin Pang, Ruotong Yu, Yaohui Chen, Eric Koskinen, Georgios Portokalidis, Bing Mao, Jun Xu. 2021 IEEE Symposium on Security and Privacy (SP) (pp. 833-851).
See also https://hexterisk.github.io/blog/posts/2020/04/02/disassembly-and-binary-analysis-fundamentals/.
This problem is particularly intractable for obfuscated binaries, but even for normal non-obfuscated binaries, existing methods have difficulty fully recovering 100% of the instructions, function starts, and jump targets with perfect accuracy. This is a problem, because if there is even one mistake, then the entire program might crash.
An additional challenge is that if the program does any kind of runtime code generation or runtime JIT or runtime recompilation, then this only makes the static recompilation problem even harder.
In contrast, runtime (dynamic) methods avoid this problem, because they can observe which instructions and code paths actually get executed and recompile only the ones that are executed, at the time they are executed.
Context
StackExchange Computer Science Q#155511, answer score: 19
Revisions (0)
No revisions yet.