Assembler for ToyVM

Submitted by: @import:stackexchange-codereview·Mar 10, 2026·

Problem

After rolling my own virtual machine, I decided to implement an assembler for it. Ironically, it's all Java, since I needed to do a lot of text manipulation. Please tell me anything that comes to mind.

Here is the main component:

ToyVMAssembler.java:

```
package net.coderodde.toy.assembler;

import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

/**
 * This class is responsible for assembling a ToyVM source file.
 *
 * @author Rodion "rodde" Efremov
 * @version 1.6 (Mar 10, 2016).
 */
public class ToyVMAssembler {

    public static final byte REG1 = 0x00;
    public static final byte REG2 = 0x01;
    public static final byte REG3 = 0x02;
    public static final byte REG4 = 0x03;

    public static final byte ADD = 0x01;
    public static final byte NEG = 0x02;
    public static final byte MUL = 0x03;
    public static final byte DIV = 0x04;
    public static final byte MOD = 0x05;

    public static final byte CMP = 0x10;
    public static final byte JA = 0x11;
    public static final byte JE = 0x12;
    public static final byte JB = 0x13;
    public static final byte JMP = 0x14;

    public static final byte CALL = 0x20;
    public static final byte RET = 0x21;

    public static final byte LOAD = 0x30;
    public static final byte STORE = 0x31;
    public static final byte CONST = 0x32;

    public static final byte HALT = 0x40;
    public static final byte INT = 0x41;
    public static final byte NOP = 0x42;

    public static final byte PUSH = 0x50;
    public static final byte PUSH_ALL = 0x51;
    public static final byte POP = 0x52;
    public static final byte POP_ALL = 0x53;
    public static final byte LSP = 0x54;

    /**
     * Specifies the token starting a one-line comment.
     */
    private static final String COMMENT_START_TOKEN = "//";

    private final List sourceCodeLineList;
private final List machineCode
```

Solution

I see some things that may help you improve your code.

Make your code more data driven

For any given instruction, the relevant pieces are spread over quite a bit of code. A better approach might be to have something like an Instruction class that contains the instruction string, the encoded hex value, the encoded length, and the number and types of its arguments.
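
As a rough illustration of that idea, the sketch below keeps one table row per instruction. The opcode values are taken from the constants in the question; everything else (the `OperandKind` enum, the operand layouts, the assumed 4-byte width for addresses and immediates, and the lowercase mnemonic spellings) is a guess made for the example, not the actual ToyVM encoding.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical operand kinds; sizes in bytes are assumptions for this sketch. */
enum OperandKind {
    REGISTER(1), ADDRESS(4), IMMEDIATE(4);

    final int encodedSize;

    OperandKind(int encodedSize) {
        this.encodedSize = encodedSize;
    }
}

/** One row per instruction: mnemonic, opcode, and the operands it expects. */
enum Instruction {
    // Opcodes match the question's constants; operand lists are assumed.
    ADD("add", 0x01, OperandKind.REGISTER, OperandKind.REGISTER),
    NEG("neg", 0x02, OperandKind.REGISTER),
    JMP("jmp", 0x14, OperandKind.ADDRESS),
    CONST("const", 0x32, OperandKind.REGISTER, OperandKind.IMMEDIATE),
    HALT("halt", 0x40);

    private static final Map<String, Instruction> BY_MNEMONIC = new HashMap<>();

    static {
        for (Instruction instruction : values()) {
            BY_MNEMONIC.put(instruction.mnemonic, instruction);
        }
    }

    final String mnemonic;
    final byte opcode;
    final OperandKind[] operands;

    Instruction(String mnemonic, int opcode, OperandKind... operands) {
        this.mnemonic = mnemonic;
        this.opcode = (byte) opcode;
        this.operands = operands;
    }

    /** Total encoded length: one opcode byte plus the size of each operand. */
    int length() {
        int total = 1;
        for (OperandKind kind : operands) {
            total += kind.encodedSize;
        }
        return total;
    }

    /** Returns the instruction for a mnemonic, or null if it is unknown. */
    static Instruction lookup(String mnemonic) {
        return BY_MNEMONIC.get(mnemonic);
    }
}
```

With a table like this, adding an instruction means adding one row, and the assembler's main loop can be written once against `operands` instead of once per mnemonic.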

Make error reporting more consistent

One of the problems with repeating very similar code multiple times is that small errors can creep in and be overlooked in the volume of code. For example, the word "instruction" is misspelled in the error message for the jmp instruction, but not in those for jb or the other instructions, even though their error strings are nearly identical.
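
One way to keep the wording in a single place is a small message-building helper, sketched below. The class name `AssemblyErrors`, the method name, and the message text are all made up for this example; the original code may report errors differently.

```java
/** Hypothetical helper that builds every operand-count error from one template. */
final class AssemblyErrors {

    private AssemblyErrors() {
        // Static helpers only.
    }

    /** Formats the same sentence for every mnemonic, so spelling cannot drift. */
    static String wrongOperandCount(int lineNumber,
                                    String mnemonic,
                                    int expected,
                                    int actual) {
        return "Line " + lineNumber + ": the '" + mnemonic
                + "' instruction expects " + expected
                + " operand(s) but got " + actual + ".";
    }
}
```

Because jmp, jb, and the rest all go through the same method, a typo can only exist once and is fixed once.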

Separate tokenizing from parsing

Classical assembler or compiler construction separates tokenizing from parsing. The tokenizer (also called a "lexer" by some) turns the source text into a series of tokens, each identified by type and value, and hands them to the parser. This allows more complex constructs, such as an "expression", to be parsed without cluttering the parser with the code that determines whether something is a number, an instruction mnemonic, or a reference to a register. Doing so will make it easier to modify and maintain both parts.
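
A minimal sketch of such a tokenizer for a single source line might look like the following. The `//` comment marker, the `name:` label form, and the register names reg1 to reg4 come from the question; the token types, class names, and the accepted number formats (decimal or `0x` hex) are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical token categories for this sketch. */
enum TokenType { LABEL, REGISTER, NUMBER, IDENTIFIER }

/** A token is just a type plus the text it was built from. */
final class Token {
    final TokenType type;
    final String text;

    Token(TokenType type, String text) {
        this.type = type;
        this.text = text;
    }

    @Override
    public String toString() {
        return type + "(" + text + ")";
    }
}

final class Tokenizer {

    /** Strips a trailing comment, splits on whitespace, and classifies each word. */
    static List<Token> tokenize(String line) {
        int commentStart = line.indexOf("//");
        if (commentStart >= 0) {
            line = line.substring(0, commentStart);
        }
        List<Token> tokens = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                tokens.add(classify(word));
            }
        }
        return tokens;
    }

    private static Token classify(String word) {
        if (word.endsWith(":")) {
            // Drop the colon; the parser only needs the label name.
            return new Token(TokenType.LABEL, word.substring(0, word.length() - 1));
        }
        if (word.matches("reg[1-4]")) {
            return new Token(TokenType.REGISTER, word);
        }
        if (word.matches("-?\\d+|0[xX][0-9a-fA-F]+")) {
            return new Token(TokenType.NUMBER, word);
        }
        // Mnemonics and label references; the parser decides which is which.
        return new Token(TokenType.IDENTIFIER, word);
    }
}
```

The parser then works on a list such as LABEL, IDENTIFIER, IDENTIFIER and never has to inspect raw characters itself.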

Use existing string handling

The emitRegister method uses a switch to enumerate each register and emit the corresponding value. One could also do something like this instead:

// Arrays.binarySearch requires regNames to be sorted.
String[] regNames = { "reg1", "reg2", "reg3", "reg4" };
int regnum = Arrays.binarySearch(regNames, registerToken);
if (regnum >= 0) {
    machineCode.add((byte) regnum);
} else {
    throw ...
}

Now it's trivial to add "reg5", for example, just by listing its name, as long as the array stays sorted.
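
If keeping the array sorted is a concern (a name like "reg10" would sort before "reg2"), a variant based on `List.indexOf` avoids the ordering requirement. This is only a sketch: the `Registers` class, the `encode` method, and the choice of `IllegalArgumentException` are assumptions, not part of the original assembler.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical register table; a name's position in the list is its encoding. */
final class Registers {

    private static final List<String> NAMES =
            Arrays.asList("reg1", "reg2", "reg3", "reg4");

    private Registers() {
        // Static helpers only.
    }

    /** Returns the byte that encodes the register, or throws if it is unknown. */
    static byte encode(String registerToken) {
        int index = NAMES.indexOf(registerToken);
        if (index < 0) {
            // The exception type is an assumption made for this sketch.
            throw new IllegalArgumentException("Unknown register: " + registerToken);
        }
        return (byte) index;
    }
}
```

A call such as `machineCode.add(Registers.encode(registerToken))` would then replace the switch.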

Create functions for common operations

There are several places in which an int is turned into little-endian format. The code would be clearer and more compact if that operation were implemented as a function instead.
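
Such a function could be as small as the sketch below. It assumes the machine code is collected in a `List<Byte>` and that an int occupies four bytes in the output; the class and method names are invented for the example.

```java
import java.util.List;

/** Hypothetical helper for writing multi-byte values into the machine code. */
final class LittleEndian {

    private LittleEndian() {
        // Static helpers only.
    }

    /** Appends the four bytes of value, least significant byte first. */
    static void emitInt(List<Byte> machineCode, int value) {
        machineCode.add((byte) (value & 0xff));
        machineCode.add((byte) ((value >>> 8) & 0xff));
        machineCode.add((byte) ((value >>> 16) & 0xff));
        machineCode.add((byte) ((value >>> 24) & 0xff));
    }
}
```

For example, emitting 0x12345678 appends the bytes 0x78, 0x56, 0x34, 0x12 in that order.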

Be careful of what you accept

Right now, the assembler will happily accept lines like these:

jmp:    jmp jmp
reg3:   ja reg3

It's possible that this is intentional, but I'm not convinced it's good design. In any case, this is almost certainly not intended:

0004:   jb 0004

It creates a label 0004 which cannot be used, and then assembles a jb 0004 instruction.
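
A stricter assembler could validate label names before recording them. The sketch below rejects names that start with a digit and names that collide with a reserved word. The naming rule and the reserved list are assumptions; the list is deliberately partial and only covers the register names from the question and the mnemonics used in the examples above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical label check; the rules here are one possible policy. */
final class LabelValidator {

    // Partial, assumed list: a real assembler would include every mnemonic.
    private static final Set<String> RESERVED = new HashSet<>(Arrays.asList(
            "reg1", "reg2", "reg3", "reg4", "jmp", "ja", "jb"));

    private LabelValidator() {
        // Static helpers only.
    }

    /** Accepts identifiers that start with a letter or underscore and are not reserved. */
    static boolean isValidLabel(String name) {
        return name.matches("[A-Za-z_][A-Za-z0-9_]*") && !RESERVED.contains(name);
    }
}
```

Under these rules `loop` is accepted, while `jmp`, `reg3`, and `0004` are all rejected.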

Consider using real compiler tools

You might want to look into using flex and bison (or JFlex and BYACC/J if you want to continue using Java for this). They take a little bit of time to learn, but are well worth it. Your entire program, for instance, could be done quite simply in C using flex and bison.

Context

StackExchange Code Review Q#122840, answer score: 4
