How is Dalvik Interpreter so fast

Many modern virtual machines include either a fast interpreter or a fast baseline compiler. A baseline compiler needs extra memory and is likely slower for code that is only executed once. An interpreter avoids this memory overhead  and is very flexible. For example, it can quickly switch between execution modes (profiling, tracing, single-stepping, etc.). So what, then, is the state of the art of building fast interpreters?

Modern C/C++ compilers are very good at optimising "normal" code, but they are not very good at optimising large switch tables or direct-threaded interpreters. The latter also needs the "Labels as Values" GNU extension. In a direct-threaded interpreter, each instructions takes the following form:

op_ADD_INT: int op1 = READ_OP1; int op2 = READ_OP2; int result = op1 + op2; // The actual implementation. WRITE_RESULT(result); unsigned int opcode = pc->opcode; ++pc; goto *dispatch[opcode]; // Jump to code for next instruction.

The variable  (the dispatch table) is an array of labels, one for each opcode. The last three lines transfer control to the implementation of the next bytecode instruction. If we want to change the implementation of an instruction dynamically, we can just update the pointer in the dispatch table, or change the dispatch table pointer to point to a different table altogether.

Note that everything except for the line "" is interpreter overhead and its cost should be minimised. The program counter () and the dispatch table are needed by every instruction, so they should be kept in registers throughout. Similarly, we want a variable that points to the stack and possibly a variable that points to a list of literals — two more registers. We also need further registers for holding the operands. These registers better be the same for each instruction, since otherwise some instructions need to move things around, leading to memory accesses. On x86 we only have 7 general-purpose registers which makes it very hard for the compiler to optimally use them for all instructions.

Interpreters Written in Assembly

For this reason, most interpreters are hand-optimised assembly routines with a specially designed calling convention that keeps as much state as possible in registers. A particularly nice and fast example is the LuaJIT 2 interpreter.

Both the standard Lua interpreter and the LuaJIT 2 interpreter use a register-based bytecode format. Compared to the more well-known stack-based bytecodes, a register-based bytecode has larger instructions, but requires fewer instructions overall. For example, the expression "" in a stack-based bytecode would translate to something like:

PUSHVAR 0 -- push variable "x" onto operand stack PUSHVAR 1 -- push variable "y" PUSHVAR 2 -- push variable "z" MUL -- top of operand stack is (y * z) ADD -- top of operand stack is x + (y * z) STOREVAR 3 -- write result into variable "s"
With a few optimisations, this can be encoded as only 6 bytes. In a register-based bytecode this would translate to something like this:

MUL 4, 1, 2 -- tmp = y * z ADD 3, 0, 4 -- s = x + tmp
Each instruction takes a variable indices, the (virtual) registers, and reads from and writes directly to the variable (stored on the stack). In LuaJIT 2, each instruction requires 4 bytes, thus the overall bytecode size is a bit larger. However, executing these instructions incurs the interpreter overhead only twice instead of 6 times in the stack-based bytecode. It may also avoid memory traffic by avoiding the separate operand stack.

The LuaJIT 2 interpreter also uses a very simple bytecode format that avoids further bit shuffling (increasing the cost of decoding). Instructions can only have one of two forms:

OP A, B, C OP A, D
Here OP, A, B, and C are 8-bit fields. OP is the opcode and A, B, C are usually register IDs or sometimes literals. D is a 16bit field and overlaps with B and C. It usually holds a literal value (e.g., a jump offset).

The LuaJIT 2 interpreter now combines this with a calling convention where part of the next instruction is decoded before actually transferring control to it. This tries to take advantage of superscalar execution on modern CPUs. For example, the an integer addition instruction implemented using this technique would look as follows:

bc_ADD_INT: -- edx = BASE = start of stack frame. E.g., virtual register 5 -- is at memory address BASE + 5 * BYTES_PER_WORD -- esi = PC, always points to the next instruction -- ebx = DISPATCH = the dispatch table -- ecx = A = pre-decoded value of field A -- eax = D = pre-decoded value of field D -- ebp = OP, opcode of current instruction (usually ignored) -- Semantics: BASE[A] = BASE[B] + BASE[C] -- 1. Decode D into B (ebp) and C (eax, same as D) movzx ebp, ah -- zero-extend 8-bit register ah into ebp = B movzx eax, al -- zero-extend 8-bit register al into eax = C -- 2. Perform the actual addition mov ebp, [edx + 4*ebp] -- read BASE[B] add ebp, [edx + 4*eax] -- ebp = BASE[B] + BASE[C] mov [edx + 4*ecx], ebp   -- BASE[A] = ebp -- 3. Dispatch next instruction mov eax, [esi]    -- Load next instruction into eax movzx ecx, ah     -- Predecode A into ecx movzx ebp, al     -- zero-extend OP into ebp add esi, 4        -- increment program counter shr eax, 16       -- predecode D jmp [ebx + ebp * 4]  -- jump to next instruction via dispatch table
The reason for predecoding some of the arguments is that the final "jmp" instruction may be quite expensive because indirect branches cause difficulties for branch predictors. The previous instructions help keep the pipeline somewhat busy if the branch did indeed get mispredicted.

These 11 instructions are typically executed in about 5 clock cycles on an Intel Core 2 processor. Based on a very simple benchmark of only simple (addition, branch, compare) instructions, interpreted code is roughly 7x slower than machine code. For more complicated instructions the interpreter overhead becomes less severe and the slowdown should be smaller.

Note, though, that in this example the ADD operation was type-specialised to integers. In many dynamic languages the addition operator is overloaded to work over several types. A generic ADD bytecode instruction then must include a type check (e.g., int vs. float) and then dispatch to the relevant implementation. This can introduce severe execution overheads due to the high cost of branch mispredictions on modern (desktop) CPUs. However, this is partly a language design issue (or at least language/implementation co-design issue) and independent of how we choose to implement our interpreter.

In addition to the performance advantages over C-based interpreters, there is size advantage. It's a good idea to ensure that the frequently-executed parts of the interpreter fit within the L1 instruction cache (typically around 32-64 KiB). Writing the interpreter in assembly helps with that. For example, the above code requires exactly 32 bytes. If we chose to use register ebp for BASE, it would have been a few bytes larger due to encoding restrictions on x86.

Portable Fast Interpreters?

The downside of writing an interpreter in assembly is that it is completely unportable. Furthermore, if we need to change the semantics of a bytecode instruction we have to update it for each architecture separately. For example, LuaJIT 2 has interpreters for 6 architectures of around 4000K 4000 lines per architecture (ARMv6, MIPS, PPC, PPCSPE/Cell, x86, x86-64).

The WebKit project therefore built its own custom "portable" assembly language as a Ruby DSL. For an example of what it looks like, see LowLevelInterpreter.asm. Occasionally, you probably still want to special-case some code for a specific architecture, but it seems to go into the right direction. It also seems to make code more readable than Dalvik's (Android's VM) template substitution method. I'd rather have all the code in the same file (possibly with a few #ifdef-like places) rather than spread over 200+ files. It also should be quite simple to translate this assembly into C to get a default interpreter for architectures that are not yet supported.

So far, every project seems to have built its own tool chain. I guess it's too much of a niche problem with too many project-specific requirements to give rise to a reusable standard tool set.