RISC-V learning notes [execution]

EXU unit of hummingbird E200

The Hummingbird E200 series CPU has a two-stage pipeline; decoding, execution, commit and write-back all take place in the second stage

These functions are performed by the execution unit (EXU). The EXU does the following

  • Decodes and dispatches the instruction sent from the IFU to the EXU through the IR register (described below)
  • Reads the register file using the operand register indexes obtained by decoding
  • Tracks data dependencies between instructions
  • Dispatches instructions to the different arithmetic units for execution
  • Commits instructions
  • Writes the result of the instruction back to the register file

Decoding

In the classic five-stage pipeline, fetch, decode and execute are three separate stages. Through decoding, the CPU obtains the operand register indexes, the instruction type, the operation to perform and so on. High-performance processors today generally place an out-of-order issue queue in front of each arithmetic unit to resolve dependencies between instructions. When an instruction is issued from the queue, it reads the general-purpose register file and its operands are sent to the arithmetic unit for calculation

Decoder in Hummingbird E203

The decoder module is in the file e203_exu_decode.v under the core directory

It is written entirely in combinational logic

To some extent, it can be thought of as one very large case statement

module e203_exu_decode(
    input  [`E203_INSTR_SIZE-1:0] i_instr,//32-bit instruction from IFU
    input  [`E203_PC_SIZE-1:0] i_pc,//PC value corresponding to current instruction of IFU
    ......
    input  i_misalgn,//Instruction-fetch misaligned exception flag
    input  i_buserr,//Instruction-fetch memory access (bus) error flag
  	
    ......//Omit a pile of decoded information
    output dec_ilegl,//Illegal instruction flag
	
	......
);
    
    //The omitted body roughly does the following
    //1. Decode 32-bit and 16-bit instructions
    //2. Define info buses and signals for each downstream unit (ALU, etc.)
    //3. Use an n-input parallel multiplexer to select, by instruction group, which information drives the single dec_info bus
    //4. Decode the operands or immediate of the instruction and output them to the downstream units
    //5. Generate the immediate and register indexes depending on whether the instruction is 16-bit or 32-bit
    //6. Detect the various kinds of illegal instruction
endmodule
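
To make the "very large case statement" picture concrete, here is a minimal sketch of a purely combinational decoder. It is not taken from e203_exu_decode.v: the module name toy_decode and all of its ports are hypothetical, it only looks at the 7-bit opcode of a few RV32I 32-bit instructions, and it ignores 16-bit instructions entirely.

```verilog
// Hypothetical toy decoder: purely combinational, RV32I 32-bit encodings only.
module toy_decode(
  input  [31:0] i_instr,
  output [4:0]  dec_rs1idx,
  output [4:0]  dec_rs2idx,
  output [4:0]  dec_rdidx,
  output reg    dec_grp_alu,   // register-register / register-immediate ALU operation
  output reg    dec_grp_agu,   // load / store
  output reg    dec_grp_bjp,   // branch / jump
  output reg    dec_ilegl      // opcode not recognised by this toy decoder
);
  // Register indexes sit at fixed bit positions in the 32-bit formats
  assign dec_rs1idx = i_instr[19:15];
  assign dec_rs2idx = i_instr[24:20];
  assign dec_rdidx  = i_instr[11:7];

  always @(*) begin
    // Defaults keep the block combinational (no inferred latches)
    dec_grp_alu = 1'b0;
    dec_grp_agu = 1'b0;
    dec_grp_bjp = 1'b0;
    dec_ilegl   = 1'b0;
    case (i_instr[6:0])
      7'b0110011, 7'b0010011:             dec_grp_alu = 1'b1; // OP, OP-IMM
      7'b0000011, 7'b0100011:             dec_grp_agu = 1'b1; // LOAD, STORE
      7'b1100011, 7'b1101111, 7'b1100111: dec_grp_bjp = 1'b1; // BRANCH, JAL, JALR
      default:                            dec_ilegl   = 1'b1; // everything else
    endcase
  end
endmodule
```

The group flags play the role of the grouping described in point 3 above: each downstream unit only has to look at its own flag.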

General-purpose register file

This module implements the integer general-purpose register file of the RISC-V architecture in Hummingbird E200

Since E200 is a single-issue microarchitecture that writes back one instruction at a time, in order, the module needs at most two read ports and one write port

The module's code is in the file e203_exu_regfile.v

The general-purpose register configuration can be changed in config.v

Port logic

  • Write port

    The result register index on the write port is compared with each register's own number to generate a per-register write enable; the enabled register stores the write data

  • Read port

    Each read port is a pure parallel multiplexer whose select signal is the register index of the operand being read. That index is held in a dedicated register, which is loaded only when an instruction actually needs to read the operand; this reduces dynamic switching power on the read port

    In short, a gatekeeper stands in front of the read port and lets a register index through (loads it into the dedicated register) only when a read is really needed

The code snippet is as follows

module e203_exu_regfile(
  input  [`E203_RFIDX_WIDTH-1:0] read_src1_idx,
  input  [`E203_RFIDX_WIDTH-1:0] read_src2_idx,
  output [`E203_XLEN-1:0] read_src1_dat,
  output [`E203_XLEN-1:0] read_src2_dat,

  input  wbck_dest_wen,
  input  [`E203_RFIDX_WIDTH-1:0] wbck_dest_idx,
  input  [`E203_XLEN-1:0] wbck_dest_dat,

  output [`E203_XLEN-1:0] x1_r,

  input  test_mode,
  input  clk,
  input  rst_n
);

    wire [`E203_XLEN-1:0] rf_r [`E203_RFREG_NUM-1:0];//The register file is declared as a two-dimensional array; its depth is configurable
    wire [`E203_RFREG_NUM-1:0] rf_wen;
    
`ifdef E203_REGFILE_LATCH_BASED //{
    //In the latch-based configuration the write data is first registered in a DFF for one clock cycle
    //Otherwise the transparent latches would let the write port feed straight through to the read ports
    wire [`E203_XLEN-1:0] wbck_dest_dat_r;
  	sirv_gnrl_dffl #(`E203_XLEN) wbck_dat_dffl (wbck_dest_wen, wbck_dest_dat, wbck_dest_dat_r, clk);
  	wire [`E203_RFREG_NUM-1:0] clk_rf_ltch;
`endif//}
    
	genvar i;//Use the parameterized generate syntax to generate the logic of the register group
generate //{
  	for (i=0; i<`E203_RFREG_NUM; i=i+1) begin:regfile//{
  		if(i==0) begin: rf0
		//x0 here is the constant 0. There is no need to generate write logic
			assign rf_wen[i] = 1'b0;
            assign rf_r[i] = `E203_XLEN'b0;
`ifdef E203_REGFILE_LATCH_BASED //{
            assign clk_rf_ltch[i] = 1'b0;
`endif//}
        end
        else begin: rfno0
            //Write enable is generated by comparing the index number and register number of the write register -- a typical & operation
            assign rf_wen[i] = wbck_dest_wen & (wbck_dest_idx == i) ;
`ifdef E203_REGFILE_LATCH_BASED //{
            //In the latch-based configuration, each general-purpose register gets its own explicit clock gate to save power
            //Instantiate the clock-gating cell
            e203_clkgate u_e203_clkgate(
              .clk_in  (clk  ),
              .test_mode(test_mode),
              .clock_en(rf_wen[i]),
              .clk_out (clk_rf_ltch[i])
            );
            //Instantiate a latch to implement the general-purpose register
            sirv_gnrl_ltch #(`E203_XLEN) rf_ltch (clk_rf_ltch[i], wbck_dest_dat_r, rf_r[i]);
`else//}{
            //If latches are not used, ordinary DFFs are instantiated
            //Clock gating is inserted automatically here to save power
            sirv_gnrl_dffl #(`E203_XLEN) rf_dffl (rf_wen[i], wbck_dest_dat, rf_r[i], clk);
`endif//}
        end
  
      end//}
endgenerate//}
  
    //Each read port is a pure parallel multiplexer. The selection signal of the multiplexer is the register index of the read operand
  	assign read_src1_dat = rf_r[read_src1_idx];
  	assign read_src2_dat = rf_r[read_src2_idx];
endmodule
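
The regfile snippet takes read_src1_idx as a plain input, so the "gatekeeper" on the read port is not visible in it. Below is a minimal sketch of how such a gated index register could look. The module name toy_rsidx_gate and every port except read_src1_idx are hypothetical and are not taken from the E203 source.

```verilog
// Hypothetical gatekeeper for one read port: the index register is loaded
// only when the incoming instruction really reads rs1.
module toy_rsidx_gate(
  input        clk,
  input        rst_n,
  input        instr_vld,       // a new instruction arrives this cycle
  input        instr_uses_rs1,  // that instruction really reads rs1
  input  [4:0] instr_rs1idx,    // rs1 field of the new instruction
  output [4:0] read_src1_idx    // index driven onto the register file read port
);
  reg [4:0] rs1idx_r;

  // Without the enable the register keeps its old value, so the read
  // multiplexer behind it and the read data bus do not toggle
  wire rs1idx_ena = instr_vld & instr_uses_rs1;

  always @(posedge clk or negedge rst_n) begin
    if (!rst_n)          rs1idx_r <= 5'd0;
    else if (rs1idx_ena) rs1idx_r <= instr_rs1idx;
  end

  assign read_src1_idx = rs1idx_r;
endmodule
```

A second copy with the rs2 signals would serve the other read port.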

CSR registers

The RISC-V architecture defines Control and Status Registers (CSRs), which are used to configure the core or to record its running state. These registers live inside the core and have their own independent address space, which has nothing to do with memory addressing. They can be regarded as the "peripheral control registers of the core"

CSRs are accessed with dedicated CSR read/write instructions

The relevant source code is in the file e203_exu_csr.v; each CSR is implemented strictly according to the RISC-V architecture definition

The code snippet is as follows:

module e203_exu_csr(
	input csr_ena,//CSR enable signal from ALU
  	input csr_wr_en,//CSR write operation flag bit
 	input csr_rd_en,//CSR read operation flag bit
    input [12-1:0] csr_idx,//CSR register address index
	......
    
    output [`E203_XLEN-1:0] read_csr_dat,//Read data
    input  [`E203_XLEN-1:0] wbck_csr_dat,//Write data
	......
);
    
    ......
    //Take MTVEC register as an example
    wire sel_mtvec = (csr_idx == 12'h305);//Decode the CSR index to determine whether mtvec is selected
	wire rd_mtvec = csr_rd_en & sel_mtvec;
`ifdef E203_SUPPORT_MTVEC //{
	wire wr_mtvec = sel_mtvec & csr_wr_en;
    wire mtvec_ena = (wr_mtvec & wbck_csr_wen);//mtvec enable signal
	wire [`E203_XLEN-1:0] mtvec_r;
	wire [`E203_XLEN-1:0] mtvec_nxt = wbck_csr_dat;
    //Instantiate a DFF as the register
	sirv_gnrl_dfflr #(`E203_XLEN) mtvec_dfflr (mtvec_ena, mtvec_nxt, mtvec_r, clk, rst_n);
	wire [`E203_XLEN-1:0] csr_mtvec = mtvec_r;
`else//}{
  	//The base address of vector table is a configurable parameter and does not support software writing
	wire [`E203_XLEN-1:0] csr_mtvec = `E203_MTVEC_TRAP_BASE;
`endif//}
	//Reading a CSR address that does not exist returns 0; writing to a CSR address that does not exist is ignored
    //No exception is raised in either case
    assign csr_mtvec_r = csr_mtvec;
    
    ......
    
    //Generate the read data for a CSR read. The logic is essentially a parallel multiplexer built in and-or form
    assign read_csr_dat = `E203_XLEN'b0 
               //| ({`E203_XLEN{rd_ustatus  }} & csr_ustatus  )
               | ({`E203_XLEN{rd_mstatus  }} & csr_mstatus  )
               | ({`E203_XLEN{rd_mie      }} & csr_mie      )
               | ({`E203_XLEN{rd_mtvec    }} & csr_mtvec    )
               | ({`E203_XLEN{rd_mepc     }} & csr_mepc     )
               | ({`E203_XLEN{rd_mscratch }} & csr_mscratch )
               | ({`E203_XLEN{rd_mcause   }} & csr_mcause   )
               | ({`E203_XLEN{rd_mbadaddr }} & csr_mbadaddr )
               | ({`E203_XLEN{rd_mip      }} & csr_mip      )
               | ({`E203_XLEN{rd_misa     }} & csr_misa      )
               | ({`E203_XLEN{rd_mvendorid}} & csr_mvendorid)
               | ({`E203_XLEN{rd_marchid  }} & csr_marchid  )
               | ({`E203_XLEN{rd_mimpid   }} & csr_mimpid   )
               | ({`E203_XLEN{rd_mhartid  }} & csr_mhartid  )
               | ({`E203_XLEN{rd_mcycle   }} & csr_mcycle   )
               | ({`E203_XLEN{rd_mcycleh  }} & csr_mcycleh  )
               | ({`E203_XLEN{rd_minstret }} & csr_minstret )
               | ({`E203_XLEN{rd_minstreth}} & csr_minstreth)
               | ({`E203_XLEN{rd_counterstop}} & csr_counterstop)// Self-defined
               | ({`E203_XLEN{rd_mcgstop}} & csr_mcgstop)// Self-defined
               | ({`E203_XLEN{rd_itcmnohold}} & csr_itcmnohold)// Self-defined
               | ({`E203_XLEN{rd_mdvnob2b}} & csr_mdvnob2b)// Self-defined
               | ({`E203_XLEN{rd_dcsr     }} & csr_dcsr    )
               | ({`E203_XLEN{rd_dpc      }} & csr_dpc     )
               | ({`E203_XLEN{rd_dscratch }} & csr_dscratch)
               ;
    
endmodule

Execution

In the five-stage pipeline, execution comes after decoding. Instructions are sent to different arithmetic units according to the kind of operation they perform. Common arithmetic units are:

  • Arithmetic logic unit (ALU): general logic operations, addition and subtraction, shifts, etc
  • Integer multiplication unit: multiplication of signed or unsigned integers
  • Integer division unit: division of signed or unsigned integers
  • Floating-point unit (FPU): more complex, and usually split into several independent arithmetic units

Processor cores with special instructions add special arithmetic units accordingly (for example, hardware acceleration circuits such as a DSP can be attached next to the processor)

Instruction issue order

Issue (or dispatch) is not a common concept in the classic five-stage pipeline, but it is widely used in RISC CPUs of all kinds, and RISC-V processors use it as well

Issue: the process in which decoded instructions are distributed to the different arithmetic units for execution

Issue and dispatch are often used interchangeably. The Hummingbird E200 pipeline uses the term dispatch

Depending on how many instructions can be issued per cycle, processors are divided into single-issue and multi-issue.

In some high-end superscalar cores with many pipeline stages, dispatch and issue mean different things: dispatch is the process of sending decoded instructions to the waiting queues of the different arithmetic units; issue is the process of sending instructions from a waiting queue to its arithmetic unit for execution.

According to the order of issue, execution and write-back, designs commonly fall into the following categories:

  • In-order issue, in-order execution, in-order write-back

    Performance is relatively low, but the hardware is the simplest and the area the smallest

    It is often used in the simplest pipelined processor cores

  • In-order issue, out-of-order execution, in-order write-back

    In the execution stage, different arithmetic units execute different instructions at the same time, so a slow operation does not hold up instructions bound for other units, which improves performance. The final write-back is still in order, so an arithmetic unit often has to stall its own pipeline while it waits for earlier instructions to write back first

    Performance is relatively good and the area is slightly larger

  • In-order issue, out-of-order execution, out-of-order write-back

    On top of out-of-order execution, the arithmetic units also write back out of order. There are several ways to do this

    • Re-order buffer method

      A Re-Order Buffer (ROB) temporarily stores the results produced by the arithmetic units, and the ROB finally writes them back to the register file in order

      The drawbacks are large area and high dynamic power consumption

      But performance is very good, and the scheme is typical and mature

    • Physical register file method

      A unified physical register file is used, and the mapping from logical registers to physical registers is managed dynamically. Once execution finishes, results are written to the physical register file out of order, and the mapping between the physical and logical registers is then updated

      The control logic is complex, but power consumption is better

    • Direct out-of-order write-back method

      Results with no data dependency are written straight back to the register file, while results with a data dependency are written back in order

      Only some programs benefit, and extra circuitry is needed

    • Other methods

  • In-order dispatch, out-of-order issue, out-of-order execution, out-of-order write-back

    This architecture is often used in high-performance superscalar processors.

    It can basically be seen as a combination of all the high-performance techniques above

Branch resolution

Branch prediction happens in the fetch stage. For a conditional branch, evaluating the condition requires an operation on the operands, so whether the branch really jumps can only be determined in the execution stage, where the actual outcome is compared with the earlier prediction. If the prediction was wrong, the pipeline usually has to be flushed, which costs performance

To reduce this loss, branch resolution is generally performed as early in the pipeline as possible
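
The comparison between prediction and real outcome is small enough to sketch. The module below is only an illustration under assumed names (toy_bjp_resolve and all of its ports are hypothetical, not the E203 interface): a flush is requested when the resolved direction differs from the predicted one, and fetch restarts from whichever address the branch really leads to.

```verilog
// Hypothetical branch resolution: compare the fetch-stage prediction
// with the outcome computed in the execution stage.
module toy_bjp_resolve(
  input         is_branch,    // the instruction is a conditional branch
  input         prdt_taken,   // what the fetch stage predicted
  input         rslv_taken,   // what the operand comparison actually says
  input  [31:0] pc_target,    // branch target address
  input  [31:0] pc_fallthru,  // address of the next sequential instruction
  output        flush_req,    // misprediction: younger instructions must be discarded
  output [31:0] flush_pc      // where fetch has to restart
);
  // Prediction and resolution disagree -> pipeline flush
  assign flush_req = is_branch & (prdt_taken ^ rslv_taken);

  // Restart from the path the branch really takes
  assign flush_pc  = rslv_taken ? pc_target : pc_fallthru;
endmodule
```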

Instruction issue and dispatch in the Hummingbird E200 series

In the Hummingbird E200 series, issue and dispatch refer to the same thing: the whole process in which decoded instructions are sent to the different arithmetic units for execution

This is done by the Dispatch module together with the ALU module

The Dispatch module forwards the instructions to be dispatched to the ALU module

The ALU module sits between the preceding stage and the commit module and provides the interface to the commit module

The Hummingbird E200 series has the following characteristics:

  • Every instruction must be dispatched to the ALU and is committed through the interface between the ALU and the commit module; a long instruction is additionally sent on, through the ALU, to the corresponding long-instruction arithmetic unit
  • The actual dispatch happens inside the ALU. Because the decoder has already grouped the instruction information by arithmetic unit, the ALU can send each instruction to the right unit according to those signals
  • The Dispatch module handles pipeline conflicts, including resource conflicts and data dependency conflicts, and in some special cases stalls the pipeline at the dispatch point

Pipeline conflicts, long instructions and OITF handling

Pipeline conflicts include resource conflicts and data conflicts, both of which stall the pipeline. Hummingbird E203 handles the two kinds with different methods

Data conflict

A data conflict, as the name suggests, is a conflict caused by a data dependency

Hummingbird E203 handles data conflicts neatly: all instructions are divided into two categories and data dependencies into three, and they are handled by tracking long instructions and stalling the pipeline

The details are given in the section on long instructions and OITF handling below

Resource conflict

Data conflicts were introduced above. This part introduces resource conflicts

Resource conflicts usually arise when instructions are sent to the execution units. If an instruction occupies a hardware module for many clock cycles and another instruction is then sent to the same module, there is a resource conflict: the later instruction has to wait until the earlier one finishes and the module is released.

The interfaces in Hummingbird E203 use a strict valid-ready handshake. When a module has a resource conflict it drives ready=0; even if the other side drives valid=1, the handshake cannot complete. The upstream module therefore cannot hand over its instruction and waits until ready=1
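
As an illustration of how a valid-ready handshake turns a resource conflict into a stall, here is a minimal sketch of a unit that stays busy for a few cycles after accepting a request. The module toy_busy_unit, its LATENCY parameter and its ports are assumptions made for this example and do not correspond to any E203 module.

```verilog
// Hypothetical multi-cycle unit with a valid-ready input interface.
// It accepts one request, stays busy for LATENCY cycles, then accepts the next.
module toy_busy_unit #(parameter LATENCY = 3) (
  input  clk,
  input  rst_n,
  input  i_valid,   // upstream has an instruction to hand over
  output i_ready    // this unit can take it in the current cycle
);
  reg [7:0] cnt;    // remaining busy cycles; 0 means idle

  // Resource conflict: while busy, ready=0 and the upstream must hold its request
  assign i_ready = (cnt == 8'd0);

  // A transfer happens only in a cycle where both sides agree
  wire handshake = i_valid & i_ready;

  always @(posedge clk or negedge rst_n) begin
    if (!rst_n)           cnt <= 8'd0;
    else if (handshake)   cnt <= LATENCY;
    else if (cnt != 8'd0) cnt <= cnt - 8'd1;
  end
endmodule
```

The upstream side needs no knowledge of why the unit is busy; it simply keeps valid=1 and its data stable until it sees ready=1.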

Long instruction and OITF processing

Hummingbird E203 divides all instructions into two categories:

  1. Single-cycle instructions

    Commit and write-back in Hummingbird E203 both happen in the second pipeline stage. A single-cycle instruction is committed and written back in that stage

  2. Multi-cycle instructions

    These instructions usually need several clock cycles to finish executing and write back, so they are also called "post-commit long-pipeline instructions", or long instructions for short

    Long instructions are executed in a special way

Because a long instruction only writes back many clock cycles after it is dispatched, data dependencies have to be checked first. Hummingbird E203 uses a module called the OITF (Outstanding Instructions Track FIFO, a queue that tracks long instructions) to detect RAW and WAW dependencies involving long instructions

WAR dependencies do not need to be checked because E203 is a microarchitecture that dispatches and writes back in order. The source operands have already been read from the register file at dispatch time, so a later write-back to the Regfile can never happen before an earlier read of the Regfile source operands.

Back to the point: the OITF is essentially an ordinary FIFO. Its source code is in rtl/e203/core/e203_exu_oitf.v

At the dispatch point, each time a long instruction is dispatched, an entry is allocated in the OITF, holding the source operand register indexes and the result register index of that long instruction

At the write-back point, each time a long instruction writes back (in order), its entry is removed from the OITF, that is, it leaves the FIFO

To sum up, the OITF holds the information of long instructions that have been dispatched but not yet written back

To keep things short, the related code is not reproduced here. Interested readers can browse the source code themselves
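
Without reproducing the real source, the bookkeeping can still be illustrated with a small sketch. Everything below is an assumption made for illustration: the module toy_oitf, its two-entry depth and its port names are hypothetical. As a simplification it stores only the result register index of each outstanding long instruction, since that is what both the RAW and the WAW check compare against, and it does not special-case x0.

```verilog
// Hypothetical 2-entry tracker for long instructions that have been
// dispatched but have not written back yet.
module toy_oitf(
  input        clk,
  input        rst_n,
  input        dis_ena,     // a long instruction is dispatched this cycle
  input  [4:0] dis_rdidx,   // its result register index
  input        ret_ena,     // the oldest long instruction writes back this cycle
  input  [4:0] chk_rs1idx,  // indexes of the instruction waiting to be dispatched
  input  [4:0] chk_rs2idx,
  input  [4:0] chk_rdidx,
  output       full,
  output       empty,
  output       dep          // RAW or WAW against an outstanding long instruction
);
  reg [1:0] vld;            // entry valid bits
  reg [4:0] rdidx [0:1];    // stored result register indexes
  reg       wptr, rptr;     // 1-bit pointers are enough for two entries

  assign full  =  vld[0] &  vld[1];
  assign empty = ~vld[0] & ~vld[1];

  wire alloc   = dis_ena & ~full;   // push at the dispatch point
  wire dealloc = ret_ena & ~empty;  // pop at the write-back point, in order

  always @(posedge clk or negedge rst_n) begin
    if (!rst_n) begin
      vld  <= 2'b00;
      wptr <= 1'b0;
      rptr <= 1'b0;
    end else begin
      if (alloc) begin
        vld[wptr]   <= 1'b1;
        rdidx[wptr] <= dis_rdidx;
        wptr        <= ~wptr;
      end
      if (dealloc) begin
        vld[rptr] <= 1'b0;
        rptr      <= ~rptr;
      end
    end
  end

  // RAW: a source index matches an outstanding result index
  // WAW: the result index matches an outstanding result index
  wire hit0 = vld[0] & ((rdidx[0] == chk_rs1idx) | (rdidx[0] == chk_rs2idx) | (rdidx[0] == chk_rdidx));
  wire hit1 = vld[1] & ((rdidx[1] == chk_rs1idx) | (rdidx[1] == chk_rs2idx) | (rdidx[1] == chk_rdidx));
  assign dep = hit0 | hit1;
endmodule
```

In this sketch the dispatch logic would stall while dep is high, or while full is high and the next instruction is itself a long instruction.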

Solutions to resource conflicts

Hummingbird E203 resolves conflicts by stalling the pipeline: it does not bypass (forward) the results of long instructions directly to subsequent instructions waiting to be dispatched in order to resolve data conflicts, and it does not add more hardware modules to resolve resource conflicts. This is because the E203 design philosophy puts small area first, giving up peak performance in exchange for a better performance-to-area ratio. If you are designing a high-performance CPU, you obviously cannot simply copy this approach

ALU module

The ALU of Hummingbird E203 sits under the EXU and consists of five sub-modules. They share one actual arithmetic datapath, so the area cost of the main datapath is paid only once

  • Regular ALU: mainly responsible for logic operations, addition and subtraction, and shifts
  • Memory access address generation: mainly responsible for address generation for Load, Store and "A" extension instructions, and for splitting "A" extension instructions into micro-operations and executing them
  • Branch and jump resolution: mainly responsible for resolving and executing Branch and Jump instructions
  • CSR read/write control: mainly responsible for executing CSR read/write instructions
  • Multi-cycle multiplier and divider: mainly responsible for executing multiplication and division instructions

Regular ALU

Located at rtl/e203/core/e203_exu_alu_rglr.v

The module consists entirely of combinational logic (so on an FPGA it takes only a few LUTs). It has no datapath of its own. Its main logic issues a request to the shared arithmetic datapath according to the type of regular ALU instruction and takes the result back from the shared arithmetic datapath

Memory access address generation

This module is abbreviated as AGU (Address Generation Unit) and is located at rtl/e203/core/e203_exu_alu_lsuagu.v

It will be described in detail in the memory architecture section

Branch and jump resolution

Located at rtl/e203/core/e203_exu_alu_bjp.v

The BJP (Branch and Jump resolve) module provides the main information on which branch and jump instructions are committed. See the commit part for details

CSR read/write control

This module is mainly responsible for executing CSR read and write instructions and is located at rtl/e203/core/e203_exu_alu_csrctrl.v

This module also consists entirely of combinational logic. It generates the control signals for reading and writing the CSR register module according to the type of CSR instruction

The code snippet is as follows:

`include "e203_defines.v"

module e203_exu_alu_csrctrl(
  //Handshake interface
  input  csr_i_valid, // valid signal
  output csr_i_ready, // ready signal

  input  [`E203_XLEN-1:0] csr_i_rs1,
  input  [`E203_DECINFO_CSR_WIDTH-1:0] csr_i_info,
  input  csr_i_rdwen,   

  output csr_ena, // CSR read / write enable signal
  output csr_wr_en, // CSR write operation indication signal
  output csr_rd_en, // CSR read operation indication signal
  output [12-1:0] csr_idx, // Address index of CSR register

  input  csr_access_ilgl,
  input  [`E203_XLEN-1:0] read_csr_dat, // The read operation reads data from the CSR register module
  output [`E203_XLEN-1:0] wbck_csr_dat, // The write operation writes data to the CSR register module

  `ifdef E203_HAS_CSR_NICE//{
  input          nice_xs_off,
  output         csr_sel_nice,
  output         nice_csr_valid,
  input          nice_csr_ready,
  output  [31:0] nice_csr_addr,
  output         nice_csr_wr,
  output  [31:0] nice_csr_wdata,
  input   [31:0] nice_csr_rdata,
  `endif//}

  //CSR writeback / delivery interface
  output csr_o_valid, // valid signal
  input  csr_o_ready, // ready signal
  // Special write-back interface (for unaligned ldst and AMO instructions)
  output [`E203_XLEN-1:0] csr_o_wbck_wdat,
  output csr_o_wbck_err,   

  input  clk,
  input  rst_n
  );

  `ifdef E203_HAS_CSR_NICE//{
      // If accessed the NICE CSR range then we need to check if the NICE CSR is ready
  assign csr_sel_nice        = (csr_idx[11:8] == 4'hE);
  wire sel_nice            = csr_sel_nice & (~nice_xs_off);
  wire addi_condi         = sel_nice ? nice_csr_ready : 1'b1; 

  assign csr_o_valid      = csr_i_valid
                            & addi_condi; // Need to make sure the nice_csr-ready is ready to make sure
                                          //  it can be sent to NICE and O interface same cycle
  assign nice_csr_valid    = sel_nice & csr_i_valid & 
                            csr_o_ready;// Need to make sure the o-ready is ready to make sure
                                        //  it can be sent to NICE and O interface same cycle

  assign csr_i_ready      = sel_nice ? (nice_csr_ready & csr_o_ready) : csr_o_ready; 

  assign csr_o_wbck_err   = csr_access_ilgl;
  assign csr_o_wbck_wdat  = sel_nice ? nice_csr_rdata : read_csr_dat;

  assign nice_csr_addr = csr_idx;
  assign nice_csr_wr   = csr_wr_en;
  assign nice_csr_wdata = wbck_csr_dat;
  `else//}{
  wire   sel_nice      = 1'b0;
  assign csr_o_valid      = csr_i_valid;
  assign csr_i_ready      = csr_o_ready;
  assign csr_o_wbck_err   = csr_access_ilgl;
  assign csr_o_wbck_wdat  = read_csr_dat;
  `endif//}

  //Extract relevant information from Info Bus
  wire        csrrw  = csr_i_info[`E203_DECINFO_CSR_CSRRW ];
  wire        csrrs  = csr_i_info[`E203_DECINFO_CSR_CSRRS ];
  wire        csrrc  = csr_i_info[`E203_DECINFO_CSR_CSRRC ];
  wire        rs1imm = csr_i_info[`E203_DECINFO_CSR_RS1IMM];
  wire        rs1is0 = csr_i_info[`E203_DECINFO_CSR_RS1IS0];
  wire [4:0]  zimm   = csr_i_info[`E203_DECINFO_CSR_ZIMMM ];
  wire [11:0] csridx = csr_i_info[`E203_DECINFO_CSR_CSRIDX];
  //Generate operand 1. If immediate is used, select immediate; otherwise, select source register 1
  wire [`E203_XLEN-1:0] csr_op1 = rs1imm ? {27'b0,zimm} : csr_i_rs1;
  //Generate the read operation indication signal from the instruction information
  assign csr_rd_en = csr_i_valid & 
    (
      (csrrw ? csr_i_rdwen : 1'b0) // the CSRRW only read when the destination reg need to be writen
      | csrrs | csrrc // The set and clear operation always need to read CSR
     );
  //Generate the write operation indication signal from the instruction information
  assign csr_wr_en = csr_i_valid & (
                csrrw // CSRRW always write the original RS1 value into the CSR
               | ((csrrs | csrrc) & (~rs1is0)) // for CSRRS/RC, if the RS is x0, then should not really write
            );                                                                           
  //Generates an address index that accesses the CSR register
  assign csr_idx = csridx;
  //Generates a CSR read / write enable signal sent to the CSR register module
  assign csr_ena = csr_o_valid & csr_o_ready & (~sel_nice);
  //Generate write operation to write data to CSR register module
  assign wbck_csr_dat = 
              ({`E203_XLEN{csrrw}} & csr_op1)
            | ({`E203_XLEN{csrrs}} & (  csr_op1  | read_csr_dat))
            | ({`E203_XLEN{csrrc}} & ((~csr_op1) & read_csr_dat));
endmodule

Multi-cycle multiplier and divider

The Hummingbird E200 series has two multiplication and division solutions

Hummingbird E203 is equipped with a low-performance, small-area multi-cycle multiplier and divider, while the higher-performance models use a high-performance single-cycle multiplier and a separate divider

Multi-cycle multipliers and dividers are generally built on the following theory:

  • Signed integer multiplication: Booth encoding is used to generate the partial products, followed by iteration. In each cycle an adder accumulates one partial product, and the final product is obtained after several cycles of iteration; this gives a multi-cycle multiplier
  • Signed integer division: the usual add-subtract alternating (non-restoring) method is used, followed by iteration. In each cycle an adder produces a partial remainder, and the final quotient and remainder are obtained after several cycles of iteration; this gives a multi-cycle divider

For the theory behind the two modules, see a digital circuits textbook or related books

Because both modules are built around an adder and use a set of registers to hold the partial product or partial remainder, Hummingbird E203 reuses resources: the multi-cycle multiplier and divider are merged into one sub-unit of the ALU. They share the adder in the datapath and complete a multiplication or division after several cycles

The multi-cycle multiplier and divider module (MDV) is located at rtl/e203/core/e203_exu_alu_muldiv.v

Hummingbird E203 also optimizes multiplication and division as follows:

  • For multiplication, radix-4 Booth encoding is used to reduce the number of cycles, and the operands of unsigned multiplication are extended by one bit and then uniformly treated as signed numbers, so 17 iteration cycles are needed
  • For division, the ordinary add-subtract alternating method is used. Likewise, the operands of unsigned division are extended by one bit and then treated as signed numbers, which takes 33 iteration cycles. In addition, because the result of the add-subtract alternating method can be off by one bit, one extra clock cycle is needed to decide whether the quotient and remainder must be corrected, and up to two more cycles to correct them, so that the exact result is obtained
  • The MDV module only does control and has no adder of its own. It shares the adder of the arithmetic datapath with the other ALU sub-units
  • The MDV module has no registers of its own either; it shares them with the AGU

To sum up, the MDV is really just a state machine. Multiplication takes 17 iteration cycles and division up to 36 cycles (33 + 1 + 2). It is a typical case of trading speed for area
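
Where the 17 comes from can be seen in a small behavioural model. This is a simulation-only sketch written for this note, not the E203 implementation: the module toy_booth_r4 and the function booth_mul are hypothetical names. The 33-bit multiplier is sign-extended by one more bit and scanned two bits at a time, which gives 17 overlapping 3-bit groups, one per loop iteration (one per clock cycle in a hardware version).

```verilog
// Behavioural radix-4 Booth multiplication of two 33-bit signed operands.
module toy_booth_r4;
  function signed [65:0] booth_mul;
    input signed [32:0] a;   // multiplicand
    input signed [32:0] b;   // multiplier
    reg          [34:0] bx;  // {sign extension, b, appended 0}
    reg   signed [65:0] ax;
    reg   signed [65:0] acc;
    reg           [2:0] grp;
    integer i;
    begin
      bx  = {b[32], b, 1'b0};
      ax  = a;               // sign-extended to the full product width
      acc = 66'sd0;
      for (i = 0; i < 17; i = i + 1) begin
        // Overlapping 3-bit group selects a digit in {-2, -1, 0, +1, +2}
        grp = {bx[2*i+2], bx[2*i+1], bx[2*i]};
        case (grp)
          3'b001, 3'b010: acc = acc + (ax <<< (2*i));    // +1
          3'b011:         acc = acc + (ax <<< (2*i+1));  // +2
          3'b100:         acc = acc - (ax <<< (2*i+1));  // -2
          3'b101, 3'b110: acc = acc - (ax <<< (2*i));    // -1
          default:        ;                              // 000 and 111 add nothing
        endcase
      end
      booth_mul = acc;
    end
  endfunction

  initial begin
    // Unsigned 32-bit operands are zero-extended to 33 bits, then multiplied as signed
    if (booth_mul({1'b0, 32'hFFFF_FFFF}, {1'b0, 32'd3}) !== 66'h2_FFFF_FFFD)
      $display("unsigned case mismatch");
    // Signed operands are passed sign-extended
    if (booth_mul(-33'sd5, 33'sd7) !== -66'sd35)
      $display("signed case mismatch");
    $display("booth_mul self-check finished");
  end
endmodule
```

The extra bit is what lets signed and unsigned multiplication share one signed iteration, at the price of 17 groups instead of 16.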

Arithmetic data path

The module that actually does the computing in the ALU is the arithmetic datapath, located at rtl/e203/core/e203_exu_alu_dpath.v

It passively accepts requests from the other ALU sub-units for specific operations, performs them, and returns the results to those sub-units

You could say that the other ALU sub-units are just control logic (state machines) that select different operations for different instructions, while the arithmetic datapath, which the data actually flows through, is the computational core of the ALU. The whole ALU resembles "stars around the moon": the arithmetic datapath, with the largest area, sits in the middle, and the surrounding control logic steers the data through it, either in a single pass (regular ALU) or in repeated passes that leave intermediate results in registers (multi-cycle multiplier and divider). This greatly reduces the area of the ALU
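
The sharing idea can be sketched with just two requesters and one adder. The module toy_alu_dpath and its ports are hypothetical and far simpler than e203_exu_alu_dpath.v; the sketch assumes that at most one sub-unit raises its request in any given cycle.

```verilog
// Hypothetical shared datapath: one adder, two requesting sub-units.
module toy_alu_dpath(
  input         alu_req,   // regular ALU asks for an addition
  input  [31:0] alu_op1,
  input  [31:0] alu_op2,
  input         agu_req,   // AGU asks for base + offset
  input  [31:0] agu_op1,
  input  [31:0] agu_op2,
  output [31:0] alu_res,
  output [31:0] agu_res
);
  // And-or multiplexer: the active request steers its operands to the adder
  wire [31:0] adder_op1 = ({32{alu_req}} & alu_op1) | ({32{agu_req}} & agu_op1);
  wire [31:0] adder_op2 = ({32{alu_req}} & alu_op2) | ({32{agu_req}} & agu_op2);

  // The only adder in this toy ALU
  wire [31:0] adder_res = adder_op1 + adder_op2;

  // Every requester sees the same result and uses it only when it made the request
  assign alu_res = adder_res;
  assign agu_res = adder_res;
endmodule
```

Adding another sub-unit costs one more term in each and-or multiplexer instead of another adder.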

High performance multiplication and division

Besides the small-area multi-cycle multiplier and divider, other models of Hummingbird E200 are equipped with a high-performance single-cycle multiplier and a separate divider

The high-performance multiplier is placed in the second pipeline stage. The divider is still a multi-cycle divider, but it no longer shares the arithmetic datapath with the ALU; instead it is a separate divider unit, treated as a long instruction, and is also placed in the second pipeline stage

Floating-point unit

The Hummingbird E200 series supports the "F" and "D" extensions of RISC-V and can handle single-precision and double-precision floating-point instructions

Floating-point instructions are handled by the FPU. If an FPU is configured, it is an independent arithmetic unit treated as a long instruction, and it has its own floating-point general-purpose register file. The F and D extensions both require 32 floating-point general-purpose registers. With only F, the registers are 32 bits wide; with D as well, they are 64 bits wide

The FPU of hummingbird E200 series supports the following functions

  • Independent clock gating
  • Independent power domain
  • Single precision floating-point multiplexed data path

Unfortunately, the open-source Hummingbird E203 is not equipped with an FPU

Summary

This part briefly introduced decoding and execution in the execution unit of Hummingbird E203

Write-back and commit involve other units as well; for reasons of length they will be covered in the posts on write-back, commit and memory

The EXU is the core of Hummingbird E203 and contains a lot of code, so it takes repeated reading to understand

Keywords: cpu risc-v

Added by mikes1471 on Mon, 31 Jan 2022 10:37:35 +0200