Disassembling EVM Bytecode (the Basics)
Recently, I’ve been looking at how to disassemble (and eventually decompile) bytecode for the Ethereum Virtual Machine (EVM). This is an interesting computer science problem which I’ve been thinking about lately. My goal is just to explain the problem and outline a simple solution.
Background
The EVM is a stack-based virtual machine which supports volatile and non-volatile storage. The volatile storage is called memory and exists only for the life of a transaction. In contrast, the non-volatile storage is just called storage and persists between transactions.
Whilst instructions in the EVM are variable length, the vast majority
are only one byte in length. The family of push instructions
(e.g. PUSH1
, PUSH2
, etc) are the only instructions which have
operands. These simply push a 256bit word constructed from their
operand on the stack. For example, the PUSH1
instruction accepts a
one byte operand which is converted into a 256bit word by treating the
operands as the low byte.
There are a lot of details about the EVM which I’m going to ignore
here. But, of relevance, is the fact that all branches (conditional
or unconditional) are indirect as the branch destination is
loaded from the stack. For example, if we wanted to perform an
unconditional jump to location 0x1a
then we can write this:
PUSH1 0x1a
JUMP
In this case, the jump destination can be determined by looking at the previous instruction. But, in general, it could have been loaded onto the stack at any time in the past. For example, this is a simple variation on the above:
PUSH1 0x1a
PUSH1 0x80
SLOAD
SWAP1
JUMP
This first loads 0x1a
onto the stack, then 0x80
. The SLOAD
instruction loads a word from storage at the address given on top of
the stack (0x80
). To execute this instruction, the machine pops the
address and pushes the new word from storage. Say 0xff
was
currently stored at 0x80
, then after the SLOAD
the stack is
[0x1a,0xff]
(right-most element is top). The SWAP1
operation
swaps the top two stack items, so after executing this operation the
stack is [0xff,0x1a]
. Finally, as before, we perform an
unconditional branch to location 0x1a
.
The Problem
Performing an initial disassembly of the code is actually pretty straightforward. We can do a linear scan of the bytecode translating bytes (and operands where appropriate) into instructions. This gives us a first draft which maybe sufficient. But it lacks information which, in some cases, maybe important:
(Data vs Code). As well as executable instructions, EVM bytecode can contain data. This is often used for the contract creation process (where the code of the contract being created is stored as data), but can be used for other reasons.
(Control-Flow). Understanding the possible flow of control through the program can be important for some applications.
(Stack Height). Some applications need to know the stack height at certain points within the program.
A linear scan of bytecodes does not reveal this information. For example, it does not tell us which bytes represent actual code to execute, versus which represent data. Likewise, it does not tell us where a given branch might jump to. Furthermore, whilst a linear scan can be extended to capture some of this, we really need something more sophisticated to get reliable information.
Analysis Overview
A simple and effective solution for disassembling EVM bytecode is to use a data-flow analysis. Whilst such analyses can be quite complex, even a relatively simple dataflow analysis (such as described here) is enough for many programs.
The basic idea is to simulate execution of the bytecode using a simplified model of the machine. In fact, we’re just going to model the machine stack here and ignore memory/storage. More sophisticated analyses may want to model memory in particular (as this can reveal useful information). Our model of the machine stack is called an abstract stack (i.e. since it abstracts away information). Our abstract stack will contain concrete values (when possible) and unknown values (otherwise). Consider our program from above annotated with the state of the abstract stack before each instruction:
Bytecode | Abstract Stack |
---|---|
PUSH1 0x1a | [] |
PUSH1 0x80 | [0x1a] |
SLOAD | [0x1a,0x80] |
SWAP1 | [0x1a,????] |
JUMP | [????,0x1a] |
Here 0x1a
and 0x80
are concrete values extracted from the PUSH1
operands, whilst ????
represents an unknown value. Since we are
simulating bytecode execution (i.e. rather than actually executing
it), we don’t know what value SLOAD
will push on the stack. To
capture this safely, we abstract it with an unknown value that
represents any possible value.
The above example illustrates the main idea. Whilst our abstract
stack contains unknown values, it also contains known values. By
simulating how instructions manipulate the stack we can determine the
concrete destinations of branching instructions. For example, in the
above, the analysis tells us that the JUMP
will branch to 0x1a
.
Note, however, that this won’t work in all cases. For example, if the
branch address is loaded from storage, then ????
will be on top of
the stack — meaning we don’t know where the branch goes. The
essential goal is to increase analysis precision such that we can (in
many cases) determine correct branch destinations.
Control-Flow Joins
A key question arises as to what happens when different abstract stacks reach the same position in the code. Such a position is typically referred to as a join point. The following illustrates:
Bytecode | Abstract Stack |
---|---|
PUSH1 0x8e | [] |
PUSH1 0x1f | [0x8e] |
PUSH1 0x80 | [0x8e,0x1f] |
SLOAD | [0x8e,0x1f,0x80] |
PUSH1 0x1a | [0x8e,0x1f,????] |
JUMPI | [0x8e,0x1f,????,0x1a] |
POP | [0x8e,0x1f] |
PUSH1 0x2f | [0x8e] |
PUSH1 0x1a | [0x8e,0x2f] |
JUMP | [0x8e,0x2f,0x1a] |
This code branches to position 0x1a
from two locations and, in each
case, a different abstract stack is passed along ([0x8e,0x1f]
versus
[0x8e,0x2f]
). It looks something like this:
The question then is what abstract stack should we
use for location 0x1a
? There are different ways we can solve this
problem, but the simplest is just to merge the two stacks together
in such a way that retains as much information as possible.
When merging two abstract stacks which have the same concrete value at
the same position, then that concrete value is retained. Otherwise,
we simply use ????
at that position (since we have no way to
represent more than one concrete value in our model). For example,
merging[0x8e,0x1f]
with [0x8e,0x2f]
yields the stack [0x8e,????]
which safely approximates the two possible stacks at position 0x1a
.
Generalising the Abstraction
A further limitation with our abstract stack model is that it assumes a concrete stack height. Consider this variation on our program from before:
Bytecode | Abstract Stack |
---|---|
PUSH1 0x8e | [] |
PUSH1 0x1f | [0x8e] |
PUSH1 0x80 | [0x8e,0x1f] |
SLOAD | [0x8e,0x1f,0x80] |
PUSH1 0x1a | [0x8e,0x1f,????] |
JUMPI | [0x8e,0x1f,????,0x1a] |
SWAP1 | [0x8e,0x1f] |
POP | [0x1f,0x8e] |
PUSH1 0x1a | [0x1f] |
JUMP | [0x1f,0x1a] |
Again, this program branches to 0x1a
from two locations each of
which passes along a different stack ([0x8e,0x1f]
versus [0x1f]
).
This looks something like this:
The question this time is how do we merge stacks with different
heights? With our current abstract stack model we actually can’t!
So, we need to generalise it further to support stacks of different
heights. A simple way to do this is with an integer
range l..r
(where l <= r
and the range
is inclusive). For example, the range 0..1
represents the set
{0,1}
, whilst 0..0
represents the constant 0
(i.e. singleton set
{0}
).
We can now adjust our stack model to include its length as an
integer range. For example, the stack [0x1a,0x80]
in our previous
model now becomes [0x1a,0x80]:2..2
(where 2..2
signals a known
stack height of 2
). This allows us to represent the merge of
[0x8e,0x1f]
with [0x1f]
as [0x1f]:1..2
. Essentially, this
describes any stack whose top value is 0x1f
and height is either 1
or 2
. Anyway, there are quite a few details needed to make this
work fully, but hopefully it gives the general idea.
Conclusion
This article has given a basic overview of how dataflow analysis can be used for reasoning about jump destinations. However, to make a serious analysis which can handle bytecode found in the wild needs much more work. A good starting point for understanding this are the following papers:
- Elipmoc: advanced decompilation of Ethereum smart contracts. In Proc OOPSLA, 2022. PDF
- Gigahorse: Thorough, Declarative Decompilation of Smart Contracts. In Proc ICSE, 2019. PDF
Enjoy!
Follow the discussion on Twitter or Reddit