Assembler for ToyVM
Problem
After rolling my own virtual machine, I decided to implement an assembler for it. Ironically, it's all Java, since I needed to do a lot of text manipulations. Please, tell me anything that comes to mind.
Here is the main component:
ToyVMAssembler.java:
```
package net.coderodde.toy.assembler;

import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

/**
 * This class is responsible for assembling a ToyVM source file.
 *
 * @author Rodion "rodde" Efremov
 * @version 1.6 (Mar 10, 2016).
 */
public class ToyVMAssembler {

    public static final byte REG1 = 0x00;
    public static final byte REG2 = 0x01;
    public static final byte REG3 = 0x02;
    public static final byte REG4 = 0x03;

    public static final byte ADD = 0x01;
    public static final byte NEG = 0x02;
    public static final byte MUL = 0x03;
    public static final byte DIV = 0x04;
    public static final byte MOD = 0x05;

    public static final byte CMP = 0x10;
    public static final byte JA = 0x11;
    public static final byte JE = 0x12;
    public static final byte JB = 0x13;
    public static final byte JMP = 0x14;

    public static final byte CALL = 0x20;
    public static final byte RET = 0x21;

    public static final byte LOAD = 0x30;
    public static final byte STORE = 0x31;
    public static final byte CONST = 0x32;

    public static final byte HALT = 0x40;
    public static final byte INT = 0x41;
    public static final byte NOP = 0x42;

    public static final byte PUSH = 0x50;
    public static final byte PUSH_ALL = 0x51;
    public static final byte POP = 0x52;
    public static final byte POP_ALL = 0x53;
    public static final byte LSP = 0x54;

    /**
     * Specifies the token starting a one-line comment.
     */
    private static final String COMMENT_START_TOKEN = "//";

    private final List sourceCodeLineList;
    private final List machineCode
```
Solution
I see some things that may help you improve your code.
Make your code more data driven
For any given instruction, the relevant pieces are spread over quite a bit of code. A better approach might be to have something like an `Instruction` class that would contain the instruction string, the encoded hex value, the length, and the number and type of arguments.
Make error reporting more consistent
One of the problems with repeating very similar code multiple times is that small errors can creep in and be overlooked in the volume of code. For example, the word "instruction" is misspelled in the error message for the `jmp` instruction, but not for `jb` or the other instructions, even though their error strings are nearly identical.
Separate tokenizing from parsing
Classical assembler or compiler construction separates tokenizing from parsing. The tokenizer (also called a "lexer") produces a series of tokens, each identified by type and value, and hands them to the parser. This allows more complex constructs, such as an "expression", to be parsed without cluttering the parser with the code that determines whether something is a number, an instruction mnemonic, or a reference to a register. Doing so will make both parts easier to modify and maintain.
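As a rough illustration of that split, here is a minimal tokenizer sketch. The `Token` class, the `TokenType` names, and the classification rules (a trailing colon marks a label definition, registers are named `reg1` to `reg4`) are assumptions made for this example and are not taken from your assembler; the mnemonic set is only a sample. Only the `//` comment marker comes from your code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical token categories; the names are illustrative only. */
enum TokenType { LABEL_DEF, MNEMONIC, REGISTER, NUMBER, IDENTIFIER }

/** A token is a type plus the text it was built from. */
final class Token {
    final TokenType type;
    final String text;

    Token(TokenType type, String text) {
        this.type = type;
        this.text = text;
    }

    @Override
    public String toString() {
        return type + "(" + text + ")";
    }
}

final class Tokenizer {
    // Sample subset of mnemonics; a full table would list every instruction.
    private static final Set<String> MNEMONICS = new HashSet<>(
            Arrays.asList("add", "cmp", "ja", "je", "jb", "jmp", "halt"));

    /** Splits one source line into tokens, dropping a trailing // comment. */
    static List<Token> tokenize(String line) {
        int commentStart = line.indexOf("//");
        if (commentStart >= 0) {
            line = line.substring(0, commentStart);
        }
        List<Token> tokens = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                tokens.add(classify(word));
            }
        }
        return tokens;
    }

    /** Decides the token type of a single whitespace-delimited word. */
    private static Token classify(String word) {
        if (word.endsWith(":")) {
            return new Token(TokenType.LABEL_DEF,
                             word.substring(0, word.length() - 1));
        }
        if (word.matches("reg[1-4]")) {
            return new Token(TokenType.REGISTER, word);
        }
        if (MNEMONICS.contains(word)) {
            return new Token(TokenType.MNEMONIC, word);
        }
        if (word.matches("-?\\d+")) {
            return new Token(TokenType.NUMBER, word);
        }
        return new Token(TokenType.IDENTIFIER, word);
    }
}
```

With typed tokens, the parser no longer inspects raw strings. For a line such as `jmp: jmp jmp`, the operand arrives as a `MNEMONIC` token, so the parser can reject it without any string comparisons of its own.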
Use existing string handling
`emitRegister` uses a switch to enumerate each register and emit the corresponding value. One could also do something like this instead:
```
// Needs java.util.Arrays. binarySearch assumes regNames is sorted, which it is here.
String[] regNames = { "reg1", "reg2", "reg3", "reg4" };
int regnum = Arrays.binarySearch(regNames, registerToken);
if (regnum >= 0) {
    machineCode.add(regnum);
} else {
    throw ...
}
```
Now it's trivial to add "reg5", for example, just by listing its name.
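The same name-table idea carries over to the "data driven" suggestion for instructions. Below is a minimal sketch, assuming a hypothetical `Instruction` enum. The opcode values are copied from the constants in your class, but the operand kinds listed for each instruction and the 4-byte address size are assumptions for illustration and would need to be checked against the actual ToyVM encoding. Only a few instructions are listed.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical operand categories, used only for this sketch. */
enum Operand { REGISTER, ADDRESS }

/** One table row per instruction: mnemonic, opcode, expected operands. */
enum Instruction {
    // Opcodes match the ToyVMAssembler constants; operand lists are assumed.
    ADD("add", 0x01, Operand.REGISTER, Operand.REGISTER),
    JA("ja", 0x11, Operand.ADDRESS),
    JE("je", 0x12, Operand.ADDRESS),
    JB("jb", 0x13, Operand.ADDRESS),
    JMP("jmp", 0x14, Operand.ADDRESS),
    RET("ret", 0x21),
    HALT("halt", 0x40),
    NOP("nop", 0x42);

    private static final Map<String, Instruction> BY_MNEMONIC = new HashMap<>();

    static {
        for (Instruction instruction : values()) {
            BY_MNEMONIC.put(instruction.mnemonic, instruction);
        }
    }

    final String mnemonic;
    final byte opcode;
    private final Operand[] operands;

    Instruction(String mnemonic, int opcode, Operand... operands) {
        this.mnemonic = mnemonic;
        this.opcode = (byte) opcode;
        this.operands = operands;
    }

    /** Returns the instruction for a mnemonic, or null if it is unknown. */
    static Instruction lookup(String mnemonic) {
        return BY_MNEMONIC.get(mnemonic);
    }

    int operandCount() {
        return operands.length;
    }

    /** Encoded size: 1 opcode byte, 1 byte per register, 4 per address (assumed). */
    int length() {
        int length = 1;
        for (Operand operand : operands) {
            length += (operand == Operand.ADDRESS) ? 4 : 1;
        }
        return length;
    }

    /** Single message template, so every instruction reports errors identically. */
    String operandCountError(int found) {
        return "The '" + mnemonic + "' instruction expects " + operands.length
                + " operand(s), but " + found + " were given.";
    }
}
```

Because every error string comes from one template, a typo can no longer appear in the message for `jmp` while being absent from the one for `jb`. Adding an instruction becomes a one-line change to the table.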
Create functions for common operations
There are several places in which an `int` is turned into little-endian format. The code would be clearer and more compact if that operation were implemented as a function instead.
Be careful of what you accept
Right now, the assembler will happily accept lines like these:
```
jmp: jmp jmp
reg3: ja reg3
```
It's possible that this is intentional, but I'm not convinced it's good design. In any case, this is almost certainly not intended:
```
0004: jb 0004
```
It creates a label `0004` which cannot be used, and then assembles a `jb 0004` instruction.
Consider using real compiler tools
You might want to look into using `flex` and `bison` (or `JFlex` and `BYACC` if you want to continue using Java for this). They take a little bit of time to learn, but are well worth it. Your entire program, for instance, could be done quite simply in C using `flex` and `bison`.
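Two of the smaller points above, the shared little-endian helper and stricter checks on label names, could look roughly like this. The class and method names are hypothetical, the reserved-word list is only a sample, and the label rule (a letter or underscore followed by letters, digits, or underscores) is an assumption rather than ToyVM's actual grammar.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

final class AssemblerHelpers {
    // Sample reserved words: a few mnemonics plus the register names.
    private static final Set<String> RESERVED = new HashSet<>(Arrays.asList(
            "jmp", "ja", "je", "jb", "reg1", "reg2", "reg3", "reg4"));

    // Assumed label grammar: a label must not start with a digit.
    private static final Pattern LABEL =
            Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

    private AssemblerHelpers() {
    }

    /** Appends value to out as four bytes, least significant byte first. */
    static void emitLittleEndian(List<Byte> out, int value) {
        for (int shift = 0; shift < 32; shift += 8) {
            out.add((byte) (value >>> shift));
        }
    }

    /** Returns true if name matches the assumed grammar and is not reserved. */
    static boolean isValidLabel(String name) {
        return LABEL.matcher(name).matches() && !RESERVED.contains(name);
    }
}
```

Under these assumed rules, `jmp`, `reg3`, and `0004` are all rejected as label names, while something like `loop` is accepted. Every place that currently shifts and masks an `int` by hand would shrink to a single `emitLittleEndian` call.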
Context
StackExchange Code Review Q#122840, answer score: 4