Disassembling EVM Bytecode (the Basics)

Wednesday, January 4th, 2023

Recently, I’ve been looking at how to disassemble (and eventually decompile) bytecode for the Ethereum Virtual Machine (EVM). This is an interesting computer science problem which I’ve been thinking about lately. My goal is just to explain the problem and outline a simple solution.

Background

The EVM is a stack-based virtual machine which supports volatile and non-volatile storage. The volatile storage is called memory and exists only for the life of a transaction. In contrast, the non-volatile storage is just called storage and persists between transactions.

Whilst instructions in the EVM are variable length, the vast majority are only one byte in length. The family of push instructions (e.g. PUSH1, PUSH2, etc) are the only instructions which have operands. These simply push a 256bit word constructed from their operand on the stack. For example, the PUSH1 instruction accepts a one byte operand which is converted into a 256bit word by treating the operands as the low byte.

There are a lot of details about the EVM which I’m going to ignore here. But, of relevance, is the fact that all branches (conditional or unconditional) are indirect as the branch destination is loaded from the stack. For example, if we wanted to perform an unconditional jump to location 0x1a then we can write this:

    PUSH1 0x1a
    JUMP

In this case, the jump destination can be determined by looking at the previous instruction. But, in general, it could have been loaded onto the stack at any time in the past. For example, this is a simple variation on the above:

    PUSH1 0x1a
    PUSH1 0x80
    SLOAD
    SWAP1
    JUMP

This first loads 0x1a onto the stack, then 0x80. The SLOAD instruction loads a word from storage at the address given on top of the stack (0x80). To execute this instruction, the machine pops the address and pushes the new word from storage. Say 0xff was currently stored at 0x80, then after the SLOAD the stack is [0x1a,0xff] (right-most element is top). The SWAP1 operation swaps the top two stack items, so after executing this operation the stack is [0xff,0x1a]. Finally, as before, we perform an unconditional branch to location 0x1a.

The Problem

Performing an initial disassembly of the code is actually pretty straightforward. We can do a linear scan of the bytecode translating bytes (and operands where appropriate) into instructions. This gives us a first draft which maybe sufficient. But it lacks information which, in some cases, maybe important:

(Data vs Code). As well as executable instructions, EVM bytecode can contain data. This is often used for the contract creation process (where the code of the contract being created is stored as data), but can be used for other reasons.
(Control-Flow). Understanding the possible flow of control through the program can be important for some applications.
(Stack Height). Some applications need to know the stack height at certain points within the program.

A linear scan of bytecodes does not reveal this information. For example, it does not tell us which bytes represent actual code to execute, versus which represent data. Likewise, it does not tell us where a given branch might jump to. Furthermore, whilst a linear scan can be extended to capture some of this, we really need something more sophisticated to get reliable information.

Analysis Overview

A simple and effective solution for disassembling EVM bytecode is to use a data-flow analysis. Whilst such analyses can be quite complex, even a relatively simple dataflow analysis (such as described here) is enough for many programs.

The basic idea is to simulate execution of the bytecode using a simplified model of the machine. In fact, we’re just going to model the machine stack here and ignore memory/storage. More sophisticated analyses may want to model memory in particular (as this can reveal useful information). Our model of the machine stack is called an abstract stack (i.e. since it abstracts away information). Our abstract stack will contain concrete values (when possible) and unknown values (otherwise). Consider our program from above annotated with the state of the abstract stack before each instruction:

Bytecode	Abstract Stack
`PUSH1 0x1a`	`[]`
`PUSH1 0x80`	`[0x1a]`
`SLOAD`	`[0x1a,0x80]`
`SWAP1`	`[0x1a,????]`
`JUMP`	`[????,0x1a]`

Here 0x1a and 0x80 are concrete values extracted from the PUSH1 operands, whilst ???? represents an unknown value. Since we are simulating bytecode execution (i.e. rather than actually executing it), we don’t know what value SLOAD will push on the stack. To capture this safely, we abstract it with an unknown value that represents any possible value.

The above example illustrates the main idea. Whilst our abstract stack contains unknown values, it also contains known values. By simulating how instructions manipulate the stack we can determine the concrete destinations of branching instructions. For example, in the above, the analysis tells us that the JUMP will branch to 0x1a. Note, however, that this won’t work in all cases. For example, if the branch address is loaded from storage, then ???? will be on top of the stack — meaning we don’t know where the branch goes. The essential goal is to increase analysis precision such that we can (in many cases) determine correct branch destinations.

Control-Flow Joins

A key question arises as to what happens when different abstract stacks reach the same position in the code. Such a position is typically referred to as a join point. The following illustrates:

Bytecode	Abstract Stack
`PUSH1 0x8e`	`[]`
`PUSH1 0x1f`	`[0x8e]`
`PUSH1 0x80`	`[0x8e,0x1f]`
`SLOAD`	`[0x8e,0x1f,0x80]`
`PUSH1 0x1a`	`[0x8e,0x1f,????]`
`JUMPI`	`[0x8e,0x1f,????,0x1a]`
`POP`	`[0x8e,0x1f]`
`PUSH1 0x2f`	`[0x8e]`
`PUSH1 0x1a`	`[0x8e,0x2f]`
`JUMP`	`[0x8e,0x2f,0x1a]`

This code branches to position 0x1a from two locations and, in each case, a different abstract stack is passed along ([0x8e,0x1f] versus [0x8e,0x2f]). It looks something like this:

Illustrating types being pulled up the AST of an expresion.

The question then is what abstract stack should we use for location 0x1a? There are different ways we can solve this problem, but the simplest is just to merge the two stacks together in such a way that retains as much information as possible.

When merging two abstract stacks which have the same concrete value at the same position, then that concrete value is retained. Otherwise, we simply use ???? at that position (since we have no way to represent more than one concrete value in our model). For example, merging[0x8e,0x1f] with [0x8e,0x2f] yields the stack [0x8e,????] which safely approximates the two possible stacks at position 0x1a.

Generalising the Abstraction

A further limitation with our abstract stack model is that it assumes a concrete stack height. Consider this variation on our program from before:

Bytecode	Abstract Stack
`PUSH1 0x8e`	`[]`
`PUSH1 0x1f`	`[0x8e]`
`PUSH1 0x80`	`[0x8e,0x1f]`
`SLOAD`	`[0x8e,0x1f,0x80]`
`PUSH1 0x1a`	`[0x8e,0x1f,????]`
`JUMPI`	`[0x8e,0x1f,????,0x1a]`
`SWAP1`	`[0x8e,0x1f]`
`POP`	`[0x1f,0x8e]`
`PUSH1 0x1a`	`[0x1f]`
`JUMP`	`[0x1f,0x1a]`

Again, this program branches to 0x1a from two locations each of which passes along a different stack ([0x8e,0x1f] versus [0x1f]). This looks something like this:

The question this time is how do we merge stacks with different heights? With our current abstract stack model we actually can’t! So, we need to generalise it further to support stacks of different heights. A simple way to do this is with an integer range l..r (where l <= r and the range is inclusive). For example, the range 0..1 represents the set {0,1}, whilst 0..0 represents the constant 0 (i.e. singleton set {0}).

We can now adjust our stack model to include its length as an integer range. For example, the stack [0x1a,0x80] in our previous model now becomes [0x1a,0x80]:2..2 (where 2..2 signals a known stack height of 2). This allows us to represent the merge of [0x8e,0x1f] with [0x1f] as [0x1f]:1..2. Essentially, this describes any stack whose top value is 0x1f and height is either 1 or 2. Anyway, there are quite a few details needed to make this work fully, but hopefully it gives the general idea.

Conclusion

This article has given a basic overview of how dataflow analysis can be used for reasoning about jump destinations. However, to make a serious analysis which can handle bytecode found in the wild needs much more work. A good starting point for understanding this are the following papers:

Elipmoc: advanced decompilation of Ethereum smart contracts. In Proc OOPSLA, 2022. PDF
Gigahorse: Thorough, Declarative Decompilation of Smart Contracts. In Proc ICSE, 2019. PDF

Enjoy!

Follow the discussion on Twitter or Reddit