4-stage pipelined RV32I CPU in Verilog
pipelined verilog stage rv32i cpu
Problem
This is a simple 4-stage pipeline that partially implements the RV32I ISA. All instructions are supported, except jalr, those relating to memory (l*, lu*, s*, fence and fence.i), or system calls (sbreak and scall). The pipeline stages are more or less the classic RISC ones without the memory access stage (i.e. fetch instruction, decode and fetch operands, calculate result, write result).
My ultimate goal is to have a simple CPU with somewhat decent performance for synthesis to an FPGA (I'd like to reach 150-200MHz eventually). This is the first major hardware design project that I have attempted, so I'm fairly sure that I have made a bunch of beginner mistakes.
`define ALU_ADD 0
`define ALU_SUB 1
`define ALU_AND 2
`define ALU_OR 3
`define ALU_XOR 4
`define ALU_SLL 5
`define ALU_SRL 6
`define ALU_SRA 7
`define ALU_SEQ 8
`define ALU_SNE 9
`define ALU_SLT 10
`define ALU_SGE 11
`define ALU_SLTU 12
`define ALU_SGEU 13

`define OPCODE_OP 7'b0110011
`define OPCODE_OP_IMM 7'b0010011
`define OPCODE_LUI 7'b0110111
`define OPCODE_AUIPC 7'b0010111
`define OPCODE_JAL 7'b1101111
`define OPCODE_JALR 7'b1100111
`define OPCODE_BRANCH 7'b1100011
`define OPCODE_SYSTEM 7'b1110011

`define FUNCT3_ADD_SUB 3'b000
`define FUNCT3_SLL 3'b001
`define FUNCT3_SLT 3'b010
`define FUNCT3_SLTU 3'b011
`define FUNCT3_XOR 3'b100
`define FUNCT3_SRL_SRA 3'b101
`define FUNCT3_OR 3'b110
`define FUNCT3_AND 3'b111

`define FUNCT3_BEQ 3'b000
`define FUNCT3_BNE 3'b001
`define FUNCT3_BLT 3'b100
`define FUNCT3_BGE 3'b101
`define FUNCT3_BLTU 3'b110
`define FUNCT3_BGEU 3'b111

`define SYSTEM_RDCYCLE 20'b11000000000000000010
`define SYSTEM_RDCYCLEH 20'b11001000000000000010
`define SYSTEM_RDTIME 20'b11000000000100000010
`define SYSTEM_RDTIMEH 20'b11001000000100000010
`define SYSTEM_RDINSTRET 20'b11000000001000000010
`define SYSTEM_RDINSTRETH 20'b11001000001000000010

module alu (input [3:0] operation,
Solution
- Have I made any rookie mistakes or committed major sins against Verilog style?
Overall it is coded very well and is easy to read, with no Verilog sins (it would be unlikely to synthesize if there were). Clean Verilog-2001 syntax using ANSI-style headers and `@*`.
The only potential error I could spot (without building a testbench) is that `f_pc`, `regs`, the `e_*` registers, and most `d_*` registers are not assigned in the reset condition. On an FPGA these will typically initialize to 0, but they will not be reset if reset is asserted any time later. Typically flops with resets and flops without resets are assigned in separate always blocks.
To make life a little easier with accidentally missing resets, Emacs has a plug-in called Verilog-mode which can generate reset assignments with `/*AUTORESET*/`, as well as other expansion features. Vim can use it too via a wrapper script; something similar may exist for other editors.
I would suggest making sure all numeric literals have an explicit width and radix (e.g. the ALU_* values should start with `4'd`, cycle
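As a minimal sketch of the reset advice above: the module, port names, and the synchronous active-high reset here are all assumptions for illustration, not taken from the reviewed design. It shows resettable control registers and non-reset datapath registers kept in separate always blocks, with explicitly sized literals.

```verilog
// Hypothetical pipeline-stage registers; names and widths are illustrative only.
module stage_regs (
    input             clk,
    input             reset,         // assumed synchronous, active high
    input      [31:0] next_pc,
    input      [31:0] next_operand,
    output reg [31:0] pc,
    output reg        valid,
    output reg [31:0] operand
);
    // Control state: every register assigned in this block
    // also appears in the reset branch.
    always @(posedge clk) begin
        if (reset) begin
            pc    <= 32'd0;
            valid <= 1'b0;
        end else begin
            pc    <= next_pc;
            valid <= 1'b1;
        end
    end

    // Datapath state: no reset; `valid` tells downstream logic
    // whether the contents are meaningful.
    always @(posedge clk) begin
        operand <= next_operand;
    end
endmodule
```

Keeping the two kinds of flops apart makes a missing reset assignment easy to spot: anything in the first block without a matching line in the reset branch is a bug.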
Look at the timing report to get an idea of where the bottleneck(s) are.
If the bottleneck is related to decoding the muxes, then consider one-hot parallel decoding. This will require more gates but can improve timing.
If the bottleneck is related to some heavy computation, then consider moving some of the logic to an earlier stage, so the data is ready even if it ends up being ignored. This will also take up more gates. It is also likely to make the code more complicated than intended, but if it is needed then it is needed.
There is a point of diminishing returns, and more tweaks can become detrimental. Adding too much logic makes routing more challenging, which can also impact timing/performance. And if the design gets too big, it won't fit on the FPGA. The synthesis report should give some clues about this.
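One way to picture the one-hot suggestion is the sketch below. The module and signal names are hypothetical, and it assumes the select vector has exactly one bit set (for example, decoded and registered in an earlier stage). Each candidate result is gated by its own select bit and the gated values are OR-ed together, instead of decoding a binary-encoded select in the same cycle.

```verilog
// Hypothetical one-hot result mux; not taken from the reviewed design.
module onehot_mux (
    input  [3:0]  sel,         // assumed one-hot: exactly one bit is set
    input  [31:0] add_result,
    input  [31:0] and_result,
    input  [31:0] or_result,
    input  [31:0] xor_result,
    output [31:0] result
);
    // AND-OR structure: replicate each select bit across the word,
    // mask the matching input, then OR the masked values.
    assign result = ({32{sel[0]}} & add_result)
                  | ({32{sel[1]}} & and_result)
                  | ({32{sel[2]}} & or_result)
                  | ({32{sel[3]}} & xor_result);
endmodule
```

Whether this actually helps depends on the target device and tool, so it is worth comparing timing reports before and after.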
- Will explicitly using the undefined value (x) in places where the values do not matter actually help the synthesis tool to generate less logic? (An example would be the default case in the ALU.)
It sometimes can, but in my experience it can cause more challenges than benefits. As the X propagates in simulation, it will evaluate as false in any conditional statement. There is no X in hardware; it will be seen as 1 or 0, so it could take a different branch when evaluated in a condition. There are X-propagation simulation tools/add-ons/plug-ins that can help, but they cost money.
If the testbench is robust, then randomization could be used as an X-prop alternative (e.g. d = `ifdef SYNTHESIS 32'dx `else $random(...) `endif ;).
Assigning it to a known value normally doesn't have a negative impact and makes debugging a bit easier.
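The randomization idea could look roughly like this in an ALU default case. This is a sketch only: the module is hypothetical, the two operations are placeholders, and it assumes the SYNTHESIS macro is defined by the synthesis flow (some tools predefine it, others need it passed in as a define).

```verilog
// Hypothetical ALU fragment illustrating a don't-care default.
module alu_default_example (
    input      [3:0]  operation,
    input      [31:0] a,
    input      [31:0] b,
    output reg [31:0] result
);
    always @* begin
        case (operation)
            4'd0: result = a + b;
            4'd1: result = a - b;
            default: begin
`ifdef SYNTHESIS
                result = 32'dx;    // don't-care: the tool may pick any value
`else
                result = $random;  // simulation: random junk instead of X
`endif
            end
        endcase
    end
endmodule
```

In simulation, a design that wrongly depends on the default value then tends to fail visibly with random data rather than silently taking the false branch on an X.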
Other comments:
Consider a two-always-block coding style: keep the synchronous assignments simple and move the algorithmic logic for FETCH, DECODE, EXECUTE, and WRITE into a combinational block. This would separate the present-state and next-state values. A bit of this is personal choice and the opinion of the person you were taught by (as well as the teacher of the teacher). This paper by Cliff Cummings (as well as other papers) was a major influence on my coding style and that of many of my colleagues.
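A minimal sketch of the two-always-block style, using a made-up fetch stage (the port names, the `next_pc` signal, and the synchronous reset are assumptions, not the reviewed code): the combinational block computes the next state with blocking assignments, and the clocked block only registers it.

```verilog
// Hypothetical fetch stage in two-always-block style.
module fetch_stage (
    input             clk,
    input             reset,          // assumed synchronous, active high
    input             branch_taken,
    input      [31:0] branch_target,
    output reg [31:0] pc
);
    reg [31:0] next_pc;

    // Combinational block: all decision logic, blocking assignments.
    always @* begin
        next_pc = pc + 32'd4;         // default: sequential fetch
        if (branch_taken)
            next_pc = branch_target;
    end

    // Sequential block: only the register update, nonblocking assignments.
    always @(posedge clk) begin
        if (reset)
            pc <= 32'd0;
        else
            pc <= next_pc;
    end
endmodule
```

Having `next_pc` as a named signal also makes it visible in waveforms, which helps when debugging stage interactions.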
Consider enabling SystemVerilog if your FPGA toolchain supports it. Use a package and enums to replace the macros (macros have a chance of name collisions in bigger projects, especially when using code from other people). Be more explicit about intent by using `always_ff` and `always_comb`. Part of the decoder could be simplified using structs and unions.
Context
StackExchange Code Review Q#111852, answer score: 3