The EXU of the Hummingbird E200
The Hummingbird E200 series CPU uses a two-stage pipeline. Decoding, execution, commit, and writeback all take place in the second stage
These functions are carried out by the execution unit (EXU). The EXU does the following:
- Decodes and dispatches the instructions that the IFU sends to the EXU through the IR register (described below)
- Reads the register file using the operand register indexes obtained from decoding
- Maintains the data dependencies between instructions
- Dispatches instructions to the different execution units
- Commits instructions
- Writes instruction results back to the register file
Decoding
In the classic five-stage pipeline, fetch, decode, and execute are three separate stages. Through decoding, the CPU obtains the operand register indexes, the instruction type, the operation information of the instruction, and so on. Modern high-performance processors usually place an out-of-order issue queue in front of each execution unit to resolve the dependencies between instructions. When an instruction is issued from the queue, it reads the general-purpose register file and is sent to the execution unit for computation
Decoder in the Hummingbird E203
The decoder module is in the file e203_exu_decode.v under the core directory
It is written entirely in combinational logic
To some extent it can be thought of as one very large case statement
module e203_exu_decode(
  input  [`E203_INSTR_SIZE-1:0] i_instr, // 32-bit instruction from the IFU
  input  [`E203_PC_SIZE-1:0]    i_pc,    // PC of the current instruction from the IFU
  ......
  input  i_misalgn, // instruction fetch misaligned exception flag
  input  i_buserr,  // instruction fetch memory access error flag
  ...... // many decoded-information outputs omitted
  output dec_ilegl, // illegal instruction flag
  ......
  );
  // The omitted body roughly does the following:
  // 1. Decodes 32-bit and 16-bit instructions
  // 2. Defines buses and registers according to the downstream units (ALU, etc.)
  // 3. Uses an n-input parallel multiplexer to merge the information of the
  //    different instruction groups onto the single dec_info bus
  // 4. Decodes the operands or immediate of the instruction and outputs them to the downstream units
  // 5. Generates the immediate and register indexes depending on whether the instruction is 16-bit or 32-bit
  // 6. Decodes the different kinds of illegal instructions
endmodule
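To make the "very large case statement" idea concrete, here is a minimal behavioural sketch in Python of the first decoding step: telling a 16-bit encoding from a 32-bit one and slicing the fixed-position fields out of a 32-bit instruction. It is not the E203 RTL. The function name `decode_fields` and the returned dictionary layout are assumptions made for illustration, and only the standard RISC-V base field positions are used.

```python
def decode_fields(instr: int) -> dict:
    """Split a RISC-V instruction word into its fixed-position fields."""
    # In the base encoding, a 32-bit instruction has its two lowest bits set to 11;
    # anything else is treated here as a 16-bit (compressed) instruction.
    is_32bit = (instr & 0b11) == 0b11
    if not is_32bit:
        return {"is_32bit": False, "raw": instr & 0xFFFF}
    return {
        "is_32bit": True,
        "opcode": instr & 0x7F,           # bits [6:0]
        "rd":     (instr >> 7) & 0x1F,    # bits [11:7]
        "funct3": (instr >> 12) & 0x7,    # bits [14:12]
        "rs1":    (instr >> 15) & 0x1F,   # bits [19:15]
        "rs2":    (instr >> 20) & 0x1F,   # bits [24:20]
        "funct7": (instr >> 25) & 0x7F,   # bits [31:25]
    }


# 0x002081B3 encodes "add x3, x1, x2"
fields = decode_fields(0x002081B3)
print(fields["opcode"], fields["rd"], fields["rs1"], fields["rs2"])
```

A real decoder goes on to match `opcode`, `funct3`, and `funct7` against every supported instruction, which is where the large case-like structure comes from.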
General-purpose register file
The Hummingbird E200 defines this module to implement the integer general-purpose registers of the RISC-V architecture
Because the E200 is a single-issue microarchitecture that writes back one instruction at a time, in order, the module needs at most two read ports and one write port
The module's code is in the file e203_exu_regfile.v
The size of the general-purpose register file can be changed in config.v
Port logic
- Write port: the incoming result register index is compared against each register's own number to generate a per-register write enable, and the enabled register stores the write data
- Read port: each read port is a pure parallel multiplexer whose select signal is the register index of the operand being read. The index signal is held in a dedicated register that is loaded only when an instruction actually needs to read an operand, which reduces the dynamic switching power of the read port
In short, a gatekeeper stands at the read port and lets a register index through (by storing it in the dedicated register) only when a read is really needed
The code snippet is as follows
module e203_exu_regfile(
  input  [`E203_RFIDX_WIDTH-1:0] read_src1_idx,
  input  [`E203_RFIDX_WIDTH-1:0] read_src2_idx,
  output [`E203_XLEN-1:0] read_src1_dat,
  output [`E203_XLEN-1:0] read_src2_dat,

  input  wbck_dest_wen,
  input  [`E203_RFIDX_WIDTH-1:0] wbck_dest_idx,
  input  [`E203_XLEN-1:0] wbck_dest_dat,

  output [`E203_XLEN-1:0] x1_r,

  input  test_mode,
  input  clk,
  input  rst_n
  );

  // The register file is declared as a two-dimensional array; its size is configurable
  wire [`E203_XLEN-1:0] rf_r [`E203_RFREG_NUM-1:0];
  wire [`E203_RFREG_NUM-1:0] rf_wen;

  `ifdef E203_REGFILE_LATCH_BASED //{
  // With latch-based registers, the write data is first registered by a DFF for one
  // clock cycle, so that latch transparency cannot open a path from the write port
  // to the read port
  wire [`E203_XLEN-1:0] wbck_dest_dat_r;
  sirv_gnrl_dffl #(`E203_XLEN) wbck_dat_dffl (wbck_dest_wen, wbck_dest_dat, wbck_dest_dat_r, clk);
  wire [`E203_RFREG_NUM-1:0] clk_rf_ltch;
  `endif//}

  // Use a parameterized generate loop to build the register file logic
  genvar i;
  generate //{
    for (i=0; i<`E203_RFREG_NUM; i=i+1) begin:regfile//{
      if(i==0) begin: rf0
        // x0 is the constant 0, so no write logic is needed
        assign rf_wen[i] = 1'b0;
        assign rf_r[i] = `E203_XLEN'b0;
        `ifdef E203_REGFILE_LATCH_BASED //{
        assign clk_rf_ltch[i] = 1'b0;
        `endif//}
      end
      else begin: rfno0
        // Write enable: the write-back index must match this register's number (a plain AND)
        assign rf_wen[i] = wbck_dest_wen & (wbck_dest_idx == i) ;
        `ifdef E203_REGFILE_LATCH_BASED //{
        // In the latch configuration, each register gets an explicit clock gate to save power
        e203_clkgate u_e203_clkgate(
          .clk_in   (clk  ),
          .test_mode(test_mode),
          .clock_en (rf_wen[i]),
          .clk_out  (clk_rf_ltch[i])
        );
        // Instantiate a latch as the general-purpose register
        sirv_gnrl_ltch #(`E203_XLEN) rf_ltch (clk_rf_ltch[i], wbck_dest_dat_r, rf_r[i]);
        `else//}{
        // Without latches, an ordinary DFF is instantiated;
        // clock gating is inserted automatically here to save power
        sirv_gnrl_dffl #(`E203_XLEN) rf_dffl (rf_wen[i], wbck_dest_dat, rf_r[i], clk);
        `endif//}
      end
    end//}
  endgenerate//}

  // Each read port is a pure parallel multiplexer selected by the operand's register index
  assign read_src1_dat = rf_r[read_src1_idx];
  assign read_src2_dat = rf_r[read_src2_idx];

endmodule
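The behaviour of the register file above (x0 hard-wired to zero, one write port with a per-register enable, plain multiplexer reads) can be sketched in a few lines of Python. This is only a behavioural model for checking one's understanding, not a translation of the RTL. The class name `RegFile` and its method names are hypothetical, and the latch/DFF and clock-gating details are deliberately left out.

```python
class RegFile:
    """Behavioural model: x0 is constant 0, one write port, multiplexer reads."""

    def __init__(self, num_regs=32, xlen=32):
        self.mask = (1 << xlen) - 1
        self.regs = [0] * num_regs

    def write(self, wen, idx, data):
        # Mirrors the generate loop: register 0 has no write logic, and for the
        # others the enable is "global write enable AND index match".
        for i in range(1, len(self.regs)):
            rf_wen = wen and (idx == i)
            if rf_wen:
                self.regs[i] = data & self.mask

    def read(self, idx):
        # A read port is just a multiplexer selected by the register index.
        return self.regs[idx]


rf = RegFile()
rf.write(True, 0, 0x1234)   # a write to x0 is silently dropped
rf.write(True, 5, 0xABCD)
print(rf.read(0), hex(rf.read(5)))
```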
CSR register
The RISC-V architecture defines Control and Status Registers (CSRs), which configure or record the running state. They sit inside the core and use their own independent address space, which has nothing to do with memory addressing. They can be thought of as "peripheral control registers of the core"
CSRs are accessed with dedicated CSR read/write instructions
The relevant source code is in the file e203_exu_csr.v. The function of each CSR is implemented strictly according to the RISC-V architecture definition
The code snippet is as follows:
module e203_exu_csr(
  input  csr_ena,   // CSR enable signal from the ALU
  input  csr_wr_en, // CSR write operation flag
  input  csr_rd_en, // CSR read operation flag
  input  [12-1:0] csr_idx, // CSR address index
  ......
  output [`E203_XLEN-1:0] read_csr_dat, // read data
  input  [`E203_XLEN-1:0] wbck_csr_dat, // write data
  ......
  );
  ......
  // Take the mtvec register as an example
  // Decode the CSR index to determine whether mtvec is selected
  wire sel_mtvec = (csr_idx == 12'h305);
  wire rd_mtvec = csr_rd_en & sel_mtvec;
  `ifdef E203_SUPPORT_MTVEC //{
  wire wr_mtvec = sel_mtvec & csr_wr_en;
  wire mtvec_ena = (wr_mtvec & wbck_csr_wen); // mtvec write enable
  wire [`E203_XLEN-1:0] mtvec_r;
  wire [`E203_XLEN-1:0] mtvec_nxt = wbck_csr_dat;
  // Instantiate the DFF that holds the register
  sirv_gnrl_dfflr #(`E203_XLEN) mtvec_dfflr (mtvec_ena, mtvec_nxt, mtvec_r, clk, rst_n);
  wire [`E203_XLEN-1:0] csr_mtvec = mtvec_r;
  `else//}{
  // The vector table base address is a configurable parameter; software writes are not supported
  wire [`E203_XLEN-1:0] csr_mtvec = `E203_MTVEC_TRAP_BASE;
  `endif//}
  // Reading a CSR address that does not exist returns 0; writing one is ignored.
  // This meets the RISC-V requirements without raising an exception
  assign csr_mtvec_r = csr_mtvec;
  ......
  // Generate the read data for CSR read operations.
  // In essence this logic is a parallel multiplexer built in AND-OR form
  assign read_csr_dat = `E203_XLEN'b0
         //| ({`E203_XLEN{rd_ustatus  }} & csr_ustatus  )
         | ({`E203_XLEN{rd_mstatus  }} & csr_mstatus  )
         | ({`E203_XLEN{rd_mie      }} & csr_mie      )
         | ({`E203_XLEN{rd_mtvec    }} & csr_mtvec    )
         | ({`E203_XLEN{rd_mepc     }} & csr_mepc     )
         | ({`E203_XLEN{rd_mscratch }} & csr_mscratch )
         | ({`E203_XLEN{rd_mcause   }} & csr_mcause   )
         | ({`E203_XLEN{rd_mbadaddr }} & csr_mbadaddr )
         | ({`E203_XLEN{rd_mip      }} & csr_mip      )
         | ({`E203_XLEN{rd_misa     }} & csr_misa     )
         | ({`E203_XLEN{rd_mvendorid}} & csr_mvendorid)
         | ({`E203_XLEN{rd_marchid  }} & csr_marchid  )
         | ({`E203_XLEN{rd_mimpid   }} & csr_mimpid   )
         | ({`E203_XLEN{rd_mhartid  }} & csr_mhartid  )
         | ({`E203_XLEN{rd_mcycle   }} & csr_mcycle   )
         | ({`E203_XLEN{rd_mcycleh  }} & csr_mcycleh  )
         | ({`E203_XLEN{rd_minstret }} & csr_minstret )
         | ({`E203_XLEN{rd_minstreth}} & csr_minstreth)
         | ({`E203_XLEN{rd_counterstop}} & csr_counterstop) // Self-defined
         | ({`E203_XLEN{rd_mcgstop}} & csr_mcgstop)         // Self-defined
         | ({`E203_XLEN{rd_itcmnohold}} & csr_itcmnohold)   // Self-defined
         | ({`E203_XLEN{rd_mdvnob2b}} & csr_mdvnob2b)       // Self-defined
         | ({`E203_XLEN{rd_dcsr     }} & csr_dcsr    )
         | ({`E203_XLEN{rd_dpc      }} & csr_dpc     )
         | ({`E203_XLEN{rd_dscratch }} & csr_dscratch)
         ;
endmodule
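The AND-OR read multiplexer at the end of the module can be modelled in Python as follows. This is a rough sketch: the helper names `and_or_mux` and `read_csr` are made up, and the CSR contents in the example are invented values, with only the mtvec address 0x305 taken from the code above. Each select bit is replicated to the full register width, ANDed with its value, and all terms are ORed together, so at most one term contributes and an unknown address naturally reads as 0.

```python
XLEN = 32
MASK = (1 << XLEN) - 1


def and_or_mux(selects_and_values):
    """OR together (replicated select AND value) for every input."""
    result = 0
    for sel, value in selects_and_values:
        replicated = MASK if sel else 0    # equivalent of {XLEN{sel}}
        result |= replicated & value
    return result


def read_csr(csr_idx, csr_rd_en, csr_file):
    """csr_file maps a 12-bit CSR address to its current value."""
    return and_or_mux(
        (csr_rd_en and csr_idx == addr, value)
        for addr, value in csr_file.items()
    )


csrs = {0x305: 0x00000080, 0x341: 0x00001234}   # example contents only
print(hex(read_csr(0x305, True, csrs)))          # selects the mtvec entry
print(hex(read_csr(0x7FF, True, csrs)))          # address not present: reads 0
```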
Execution
In the five-stage pipeline, execution follows decoding. Instructions are sent to different execution units according to their operation type. Common execution units include:
- Arithmetic logic unit (ALU): handles general logic operations, addition and subtraction, shifts, and so on
- Integer multiplication unit: handles multiplication of signed or unsigned integers
- Integer division unit: handles division of signed or unsigned integers
- Floating-point unit (FPU): complex, and usually split into several independent execution units
Processor cores with special instructions add dedicated execution units accordingly (for example, hardware accelerators such as a DSP can be attached alongside the processor)
Instruction issue order
Issue (or dispatch) is not a common concept in the classic five-stage pipeline, but it is widely used in RISC CPUs of all kinds, including RISC-V processors
Issue: the process in which instructions, once decoded, are distributed to the different execution units for execution
Issue and dispatch are often used interchangeably. The Hummingbird E200 pipeline uses the term dispatch
Depending on how many instructions can be issued per cycle, processors are classified as single-issue or multi-issue.
In some high-end superscalar cores with many pipeline stages, dispatch and issue mean different things: dispatch is the process of sending decoded instructions to the waiting queues of the different execution units, and issue is the process of sending an instruction from the waiting queue of an execution unit into that unit for execution.
According to the order of issue, execution, and writeback, designs usually fall into the following categories:
- In-order issue, in-order execution, in-order writeback
  Performance is relatively low, but the hardware is the simplest and the area is the smallest
  Often used in the simplest pipelined processor cores
- In-order issue, out-of-order execution, in-order writeback
  In the execute stage, different execution units work on different instructions at the same time, which hides the differences in their execution times and improves performance. The final writeback is still in order, so an execution unit often has to stall its own pipeline while it waits for other instructions to write back first
  Performance is fairly good and the area is slightly larger
- In-order issue, out-of-order execution, out-of-order writeback
  On top of out-of-order execution, the execution units also write back out of order. There are several ways to do this:
  - Reorder buffer method
    A reorder buffer (ROB) temporarily holds the execution results, and the ROB finally writes them back to the register file in order
    The area and the dynamic power consumption are both large
    But performance is very good, and the scheme is typical and mature
  - Physical register file method
    A unified physical register file is used, and the mapping of the logical registers onto it is managed dynamically. After execution, results are written to the physical register file out of order, and the mapping between physical and logical registers is updated
    The control logic is complex, but power consumption is better
  - Direct out-of-order writeback method
    Results that have no data dependency are written straight back to the register file, while results that do have a data dependency are written back in order
    This only helps some programs and needs additional circuitry
  - Other methods
- In-order dispatch, out-of-order issue, out-of-order execution, out-of-order writeback
  This architecture is common in high-performance superscalar processors.
  It can basically be seen as a combination of all the high-performance techniques above
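The reorder buffer method from the list above can be illustrated with a small Python model. It is a simplified sketch under stated assumptions: the class `ReorderBuffer` and its methods are hypothetical names, the register file is just a dictionary, and exceptions and flushes are ignored. Results may be marked complete in any order, but an entry only reaches the register file once every older entry has retired.

```python
from collections import deque


class ReorderBuffer:
    def __init__(self):
        self.entries = deque()            # oldest instruction on the left

    def allocate(self, dest_reg):
        """Reserve an entry in program order when the instruction is issued."""
        entry = {"dest": dest_reg, "value": None, "done": False}
        self.entries.append(entry)
        return entry

    def complete(self, entry, value):
        """An execution unit finished; this can happen in any order."""
        entry["value"] = value
        entry["done"] = True

    def retire(self, regfile):
        """Write back finished entries, strictly from the oldest onward."""
        retired = []
        while self.entries and self.entries[0]["done"]:
            entry = self.entries.popleft()
            regfile[entry["dest"]] = entry["value"]
            retired.append(entry["dest"])
        return retired


rob, regs = ReorderBuffer(), {}
older = rob.allocate(dest_reg=1)
younger = rob.allocate(dest_reg=2)
rob.complete(younger, 20)     # the younger instruction finishes first
print(rob.retire(regs))       # nothing retires: the older one is still pending
rob.complete(older, 10)
print(rob.retire(regs))       # both retire, oldest first
```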
Branch resolution
Branch prediction is performed in the fetch stage. For a conditional branch, evaluating the condition requires an operation on the operands, so whether the branch is really taken has to be computed in the execute stage and compared with the earlier prediction. If the prediction was wrong, the pipeline usually has to be flushed, which costs performance
To reduce this cost, branch resolution is generally done as early in the pipeline as possible
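As a rough illustration of what resolving a branch means, the Python sketch below evaluates the branch condition from the operand values and compares the outcome with the prediction made at fetch. The function name `resolve_branch` and its return format are assumptions for this example, operands are taken as signed Python integers, and only four of the conditional branch types are covered.

```python
def resolve_branch(op, rs1, rs2, predicted_taken):
    """Return (actual_taken, need_flush) for a conditional branch."""
    conditions = {
        "beq": rs1 == rs2,
        "bne": rs1 != rs2,
        "blt": rs1 < rs2,     # signed comparison
        "bge": rs1 >= rs2,    # signed comparison
    }
    actual_taken = conditions[op]
    # A flush is needed only when the fetch-stage prediction turns out wrong.
    need_flush = actual_taken != predicted_taken
    return actual_taken, need_flush


print(resolve_branch("beq", 5, 5, predicted_taken=True))    # prediction was right
print(resolve_branch("blt", -1, 0, predicted_taken=False))  # mispredicted: flush
```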
Instruction issue and dispatch in the Hummingbird E200 series
In the Hummingbird E200 series CPU, issue and dispatch refer to the same thing: the whole process in which decoded instructions are sent to the different execution units for execution
This is done by the dispatch module together with the ALU module
The dispatch module forwards dispatched instructions to the ALU module
The ALU module handles the interface between the preceding stage and the commit module
The Hummingbird E200 series has the following characteristics:
- All instructions must be dispatched to the ALU and are committed through the interface between the ALU and the commit module. A long instruction is additionally forwarded through the ALU to the corresponding long-instruction execution unit
- The actual dispatch happens inside the ALU. Because the decoder has already grouped the instructions by execution unit and produced the corresponding indication signals, each instruction can be sent to the right execution unit according to those signals
- The dispatch module handles pipeline conflicts, including resource conflicts and data dependency conflicts, and in some special cases stalls the pipeline at the dispatch point
Pipeline conflicts, long instructions, and OITF handling
Pipeline conflicts (hazards) include resource conflicts and data conflicts, both of which stall the pipeline. The Hummingbird E203 handles each kind with its own method
Data conflict
A data conflict, as the name suggests, is a conflict caused by a data dependency
The Hummingbird E203 handles data conflicts in a neat way: all instructions are divided into two classes and data dependencies into three classes, which are handled through long instruction splicing and pipeline flushing
The details are given in the section on long instructions and OITF handling below
Resource conflict
Data conflicts were introduced above. This part introduces resource conflicts
Resource conflicts usually arise when instructions are sent to the execution units. If an instruction occupies a hardware module for many clock cycles and another instruction is then sent to the same module, there is a resource conflict: the later instruction must wait until the earlier one finishes and releases the module.
The Hummingbird E203 uses a strict valid-ready handshake on its interfaces. A module that has a resource conflict drives ready=0, so the handshake cannot complete even if the other side drives valid=1. The upstream module therefore cannot hand over its instruction and waits until ready=1
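The valid-ready rule can be reduced to one line: a transfer happens only in a cycle where both signals are 1. The Python sketch below, with the hypothetical helper `transfer_cycles` and made-up signal traces, shows a sender that keeps valid=1 while a busy receiver holds ready=0 for three cycles.

```python
def transfer_cycles(valid, ready):
    """Return the cycles in which a transfer completes (valid and ready both 1)."""
    return [cycle for cycle, (v, r) in enumerate(zip(valid, ready)) if v and r]


# The sender holds valid=1 while the receiver is busy (ready=0) for three cycles.
valid = [1, 1, 1, 1, 0]
ready = [0, 0, 0, 1, 1]
print(transfer_cycles(valid, ready))   # prints [3]
```

In cycles 0 to 2 the sender is stalled, which is exactly the pipeline stall described above. In cycle 4 the receiver is ready but nothing is offered, so no transfer happens there either.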
Long instructions and OITF handling
The Hummingbird E203 divides all instructions into two classes:
- Single-cycle instructions
  Commit and writeback in the Hummingbird E203 both sit in the second pipeline stage. A single-cycle instruction completes both commit and writeback in this stage
- Multi-cycle instructions
  These instructions usually need several clock cycles to finish executing and write back, so they are also called "post-commit long-pipeline instructions", or long instructions for short
Long instructions are executed in a special way
Because a long instruction completes many clock cycles after it is dispatched, data dependencies must be checked first. The Hummingbird E203 uses a module called the OITF (Outstanding Instructions Track FIFO, a queue that tracks long instructions) to detect the RAW and WAW dependencies that involve long instructions
WAR dependencies are not checked because the E203 is a microarchitecture that dispatches and writes back in order. The source operands have already been read from the register file at dispatch, so a write to the register file can never happen before those source operands are read.
The OITF is essentially an ordinary FIFO (as its name already says). Its source code is in rtl/e203/core/e203_exu_oitf.v
At the dispatch point, every time a long instruction is dispatched, an entry is allocated in the OITF, and the entry stores the source operand register indexes and the result register index of that long instruction
At the writeback point, every time a long instruction writes back, in order, its entry is removed from the OITF, that is, it is popped from the FIFO
In short, the OITF holds the information of the long instructions that have been dispatched but have not yet written back
To keep things short, the related code is not included here. Interested readers can browse the source code themselves
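For readers who prefer something executable to the description above, here is a behavioural Python sketch of the idea. It is not derived from e203_exu_oitf.v: the class `OITF`, its method names, and the default depth of 2 are all assumptions for illustration. An entry is pushed when a long instruction is dispatched and popped when it writes back, and an instruction about to be dispatched is checked for RAW and WAW dependencies against the result registers still outstanding.

```python
from collections import deque


class OITF:
    """Behavioural model of a long-instruction tracking FIFO (not the E203 RTL)."""

    def __init__(self, depth=2):           # the depth is an assumed value
        self.depth = depth
        self.fifo = deque()                 # oldest outstanding long instruction first

    def is_full(self):
        return len(self.fifo) >= self.depth

    def dispatch_long(self, srcs, rd):
        """Allocate an entry holding the source and result register indexes."""
        if self.is_full():
            raise RuntimeError("OITF full: dispatch must stall")
        self.fifo.append({"srcs": tuple(srcs), "rd": rd})

    def write_back(self):
        """The oldest long instruction writes back and leaves the FIFO."""
        return self.fifo.popleft()

    def has_dependency(self, srcs, rd):
        """Check an instruction about to be dispatched against pending results."""
        pending = {e["rd"] for e in self.fifo if e["rd"] is not None}
        raw = any(s in pending for s in srcs)     # reads a result not yet written
        waw = rd is not None and rd in pending    # writes a register still pending
        return raw or waw


oitf = OITF()
oitf.dispatch_long(srcs=(1, 2), rd=3)          # e.g. a divide writing x3
print(oitf.has_dependency(srcs=(3, 4), rd=5))  # RAW on x3
print(oitf.has_dependency(srcs=(1, 2), rd=7))  # only a WAR-style overlap: not checked
```

When `has_dependency` returns true, the dispatch point would simply stall until the relevant entry has been popped.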
Solutions to resource conflicts
The Hummingbird E203 resolves conflicts by stalling the pipeline. It does not bypass the results of long instructions directly to later instructions waiting to be dispatched in order to resolve data conflicts, and it does not add more hardware modules to resolve resource conflicts. The reason is that the E203 is designed for small area: it gives up peak performance in exchange for a better performance-to-area ratio. A high-performance CPU obviously cannot simply follow this approach
ALU module
The ALU of the Hummingbird E203 sits inside the EXU and consists of five main sub-modules. They share a single physical datapath, so the area cost of the main datapath is paid only once
- Regular ALU: handles logic operations, addition and subtraction, shifts, and similar instructions
- Memory access address generation: generates the addresses for Load, Store, and "A" extension instructions, and splits "A" extension instructions into micro-operations and executes them
- Branch and jump resolution: resolves the results of branch and jump instructions and executes them
- CSR read/write control: executes CSR read/write instructions
- Multi-cycle multiplier and divider: executes multiplication and division instructions
Regular ALU
Located at rtl/e203/core/e203_exu_alu_rglr.v
The module consists entirely of combinational logic (so on an FPGA it takes only a few LUTs). It has no datapath of its own: its logic issues operation requests to the shared datapath according to the type of regular ALU instruction, and takes the result back from the shared datapath
Memory access address generation
This module is abbreviated AGU (Address Generation Unit) and is located at rtl/e203/core/e203_exu_alu_lsuagu.v
It is described in detail in the memory architecture section
Branch and jump resolution
Located at rtl/e203/core/e203_exu_alu_bjp.v
The BJP (Branch and Jump resolve) module provides the main basis for committing branch and jump instructions. See the commit part for details
CSR read / write control
This module executes CSR read and write instructions and is located at rtl/e203/core/e203_exu_alu_csrctrl.v
It is also built entirely from combinational logic and generates the control signals for reading and writing the CSR register module according to the type of CSR instruction
The code snippet is as follows:
`include "e203_defines.v"

module e203_exu_alu_csrctrl(
  // Handshake interface
  input  csr_i_valid, // valid signal
  output csr_i_ready, // ready signal

  input  [`E203_XLEN-1:0] csr_i_rs1,
  input  [`E203_DECINFO_CSR_WIDTH-1:0] csr_i_info,
  input  csr_i_rdwen,

  output csr_ena,   // CSR read/write enable signal
  output csr_wr_en, // CSR write operation indication
  output csr_rd_en, // CSR read operation indication
  output [12-1:0] csr_idx, // address index of the CSR

  input  csr_access_ilgl,
  input  [`E203_XLEN-1:0] read_csr_dat, // data read from the CSR register module
  output [`E203_XLEN-1:0] wbck_csr_dat, // data written to the CSR register module

  `ifdef E203_HAS_CSR_NICE//{
  input         nice_xs_off,
  output        csr_sel_nice,
  output        nice_csr_valid,
  input         nice_csr_ready,
  output [31:0] nice_csr_addr,
  output        nice_csr_wr,
  output [31:0] nice_csr_wdata,
  input  [31:0] nice_csr_rdata,
  `endif//}

  // CSR writeback / commit interface
  output csr_o_valid, // valid signal
  input  csr_o_ready, // ready signal
  // Special writeback interface for unaligned ldst and AMO instructions
  output [`E203_XLEN-1:0] csr_o_wbck_wdat,
  output csr_o_wbck_err,

  input  clk,
  input  rst_n
  );

  `ifdef E203_HAS_CSR_NICE//{
  // If accessed the NICE CSR range then we need to check if the NICE CSR is ready
  assign csr_sel_nice = (csr_idx[11:8] == 4'hE);
  wire sel_nice = csr_sel_nice & (~nice_xs_off);
  wire addi_condi = sel_nice ? nice_csr_ready : 1'b1;

  assign csr_o_valid = csr_i_valid & addi_condi; // Need to make sure the nice_csr-ready is ready to make sure
                                                 // it can be sent to NICE and O interface same cycle
  assign nice_csr_valid = sel_nice & csr_i_valid & csr_o_ready; // Need to make sure the o-ready is ready to make sure
                                                                // it can be sent to NICE and O interface same cycle

  assign csr_i_ready = sel_nice ? (nice_csr_ready & csr_o_ready) : csr_o_ready;

  assign csr_o_wbck_err = csr_access_ilgl;
  assign csr_o_wbck_wdat = sel_nice ? nice_csr_rdata : read_csr_dat;

  assign nice_csr_addr = csr_idx;
  assign nice_csr_wr = csr_wr_en;
  assign nice_csr_wdata = wbck_csr_dat;
  `else//}{
  wire sel_nice = 1'b0;
  assign csr_o_valid = csr_i_valid;
  assign csr_i_ready = csr_o_ready;
  assign csr_o_wbck_err = csr_access_ilgl;
  assign csr_o_wbck_wdat = read_csr_dat;
  `endif//}

  // Extract the relevant information from the info bus
  wire csrrw = csr_i_info[`E203_DECINFO_CSR_CSRRW ];
  wire csrrs = csr_i_info[`E203_DECINFO_CSR_CSRRS ];
  wire csrrc = csr_i_info[`E203_DECINFO_CSR_CSRRC ];
  wire rs1imm = csr_i_info[`E203_DECINFO_CSR_RS1IMM];
  wire rs1is0 = csr_i_info[`E203_DECINFO_CSR_RS1IS0];
  wire [4:0] zimm = csr_i_info[`E203_DECINFO_CSR_ZIMMM ];
  wire [11:0] csridx = csr_i_info[`E203_DECINFO_CSR_CSRIDX];

  // Generate operand 1: the immediate if the instruction uses one, otherwise source register 1
  wire [`E203_XLEN-1:0] csr_op1 = rs1imm ? {27'b0,zimm} : csr_i_rs1;

  // Generate the read indication from the instruction information
  assign csr_rd_en = csr_i_valid &
    (
      (csrrw ? csr_i_rdwen : 1'b0) // the CSRRW only read when the destination reg need to be writen
      | csrrs | csrrc // The set and clear operation always need to read CSR
    );
  // Generate the write indication from the instruction information
  assign csr_wr_en = csr_i_valid & (
      csrrw // CSRRW always write the original RS1 value into the CSR
      | ((csrrs | csrrc) & (~rs1is0)) // for CSRRS/RC, if the RS is x0, then should not really write
    );

  // Generate the address index used to access the CSR
  assign csr_idx = csridx;

  // Generate the CSR read/write enable sent to the CSR register module
  assign csr_ena = csr_o_valid & csr_o_ready & (~sel_nice);

  // Generate the data written to the CSR register module
  assign wbck_csr_dat =
        ({`E203_XLEN{csrrw}} & csr_op1)
      | ({`E203_XLEN{csrrs}} & (  csr_op1  | read_csr_dat))
      | ({`E203_XLEN{csrrc}} & ((~csr_op1) & read_csr_dat));

endmodule
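The last two pieces of logic in the module, the write enable and the write data, are easy to check with a small Python model. This is only a sketch mirroring the `csr_wr_en` and `wbck_csr_dat` expressions above for a 32-bit XLEN. The function names are hypothetical, and the NICE interface and handshake signals are left out.

```python
XLEN = 32
MASK = (1 << XLEN) - 1


def csr_write_enable(op, rs1_is_x0):
    """CSRRW always writes; CSRRS/CSRRC write only when rs1 is not x0."""
    return op == "csrrw" or (op in ("csrrs", "csrrc") and not rs1_is_x0)


def csr_write_data(op, op1, old_value):
    """Value written to the CSR for each instruction type."""
    if op == "csrrw":
        return op1 & MASK                    # replace with operand 1
    if op == "csrrs":
        return (op1 | old_value) & MASK      # set the bits given in operand 1
    if op == "csrrc":
        return (~op1 & old_value) & MASK     # clear the bits given in operand 1
    raise ValueError("not a CSR instruction: " + op)


print(bin(csr_write_data("csrrs", 0b0101, 0b0011)))   # prints 0b111
print(bin(csr_write_data("csrrc", 0b0101, 0b0011)))   # prints 0b10
```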
Multi-cycle multiplier and divider
The Hummingbird E200 series uses two multiplication and division solutions
The Hummingbird E203 has a low-performance, small-area multi-cycle multiplier and divider, while the other, higher-performance models use a high-performance single-cycle multiplier and an independent divider
Multi-cycle multipliers and dividers are commonly built on the following theory:
- Signed integer multiplication: Booth encoding generates the partial products, and an iterative method accumulates one partial product per cycle with an adder. The final product is available after several cycles of iteration, which gives a multi-cycle multiplier
- Signed integer division: the add/subtract alternating method (non-restoring division) is applied iteratively, with an adder producing a partial remainder in each cycle. The final quotient and remainder are available after several cycles of iteration, which gives a multi-cycle divider
The theory behind both modules can be found in digital electronics textbooks or related books
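As a reminder of how the add/subtract alternating method works, here is a minimal Python sketch for unsigned operands. It is a textbook-style simplification, not the E203 implementation, which works on sign-extended operands. The function name `nonrestoring_divide` and the default width of 8 bits are choices made for this example. In every step the partial remainder is shifted left, the divisor is subtracted if the previous partial remainder was non-negative and added otherwise, and one quotient bit is produced. A single correction at the end fixes a negative remainder.

```python
def nonrestoring_divide(dividend, divisor, width=8):
    """Unsigned non-restoring division; returns (quotient, remainder)."""
    if divisor == 0:
        raise ZeroDivisionError("divisor must be non-zero")
    mask = (1 << width) - 1
    rem = 0                      # partial remainder, may go negative
    quo = dividend & mask        # dividend bits are shifted out from the top
    for _ in range(width):
        was_negative = rem < 0
        # Shift the (remainder, quotient) pair left by one bit.
        rem = (rem << 1) | ((quo >> (width - 1)) & 1)
        quo = (quo << 1) & mask
        # Alternate: add after a negative partial remainder, subtract otherwise.
        rem = rem + divisor if was_negative else rem - divisor
        if rem >= 0:
            quo |= 1             # this quotient bit is 1
    if rem < 0:
        rem += divisor           # final remainder correction
    return quo, rem


print(nonrestoring_divide(100, 7))   # prints (14, 2)
```

One add or subtract per step is all that is needed, which is why the method maps well onto a single shared adder.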
Both modules are built around an adder and use a set of registers to hold the partial product or partial remainder, so the Hummingbird E203 reuses resources: the multi-cycle multiplier and divider are combined into one sub-unit of the ALU. They share the adder in the datapath and finish a multiplication or division after several cycles
The multi-cycle multiplier and divider module (MDV) is located at rtl/e203/core/e203_exu_alu_muldiv.v
The Hummingbird E203 also optimizes multiplication and division as follows:
- Multiplication uses radix-4 Booth encoding to reduce the number of cycles. Unsigned operands are extended by one sign bit so that every multiplication is handled as signed. The resulting 33-bit operand is consumed two bits per cycle, so 17 iteration cycles are needed
- Division uses the plain add/subtract alternating method. Likewise, unsigned operands are extended by one bit and handled as signed numbers, which takes 33 iteration cycles. Because the result of this method can be off by one bit, one more clock cycle is needed to decide whether the quotient and remainder must be corrected, plus two more cycles for the quotient and remainder correction itself, so that the exact result is obtained
- The MDV module only does the operation control and has no adder of its own. It uses the adder of the datapath shared with the other ALU sub-units
- The MDV module has no registers of its own either. It shares them with the AGU
In short, the MDV is really just a state machine. Multiplication takes 17 iteration cycles and division takes up to 36 cycles (33 + 1 + 2), a typical case of trading speed for area
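The 17-iteration figure for multiplication can be reproduced with a short Python model of radix-4 Booth encoding. This is a sketch under assumptions rather than a model of e203_exu_alu_muldiv.v: the function name `booth_radix4_multiply` and its parameters are invented for the example, and Python's unbounded integers stand in for the sign-extended 33-bit operands. Each iteration looks at two new multiplier bits plus the previous bit, turns them into a digit between -2 and 2, and accumulates that multiple of the multiplicand.

```python
def booth_radix4_multiply(a, b, a_signed=True, b_signed=True, width=32):
    """Multiply two width-bit values; returns (product, iterations used)."""

    def extend(x, signed):
        # Interpret the raw bits as signed or unsigned; either way the value
        # fits in a (width + 1)-bit signed number.
        x &= (1 << width) - 1
        if signed and (x >> (width - 1)) & 1:
            x -= 1 << width
        return x

    multiplicand = extend(a, a_signed)
    multiplier = extend(b, b_signed)
    iterations = (width + 2) // 2          # 33 bits, two per step: 17 steps
    product = 0
    prev_bit = 0                           # implicit bit to the right of bit 0
    for i in range(iterations):
        bit0 = (multiplier >> (2 * i)) & 1       # arithmetic shift keeps the sign
        bit1 = (multiplier >> (2 * i + 1)) & 1
        digit = bit0 + prev_bit - 2 * bit1       # Booth digit in {-2,-1,0,1,2}
        product += (digit * multiplicand) << (2 * i)
        prev_bit = bit1
    return product, iterations


print(booth_radix4_multiply(7, 3))                   # prints (21, 17)
print(booth_radix4_multiply(0xFFFFFFFF, 2))          # -1 * 2 as signed operands
print(booth_radix4_multiply(0xFFFFFFFF, 2, a_signed=False))
```

In hardware each iteration needs only one addition or subtraction of a shifted multiplicand, so a single adder and a partial-product register are enough.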
Arithmetic datapath
The module that actually performs the ALU's computation is the datapath, located at rtl/e203/core/e203_exu_alu_dpath.v
It passively accepts operation requests from the other ALU sub-units, performs the operation, and returns the result to the requesting sub-unit
The other ALU sub-units can be seen as just a set of state machines (control logic) that select different logic for different instructions, while the datapath that the data actually flows through is the computational core of the ALU. The whole ALU is built like "stars surrounding the moon": the datapath, with the largest area, sits in the middle, and the surrounding state machines steer the data through it either once (regular ALU) or repeatedly, storing intermediate results in registers (multi-cycle multiplier and divider). This greatly reduces the area of the ALU
High performance multiplication and division
Besides the small-area multi-cycle multiplier and divider, the other Hummingbird E200 models have a high-performance single-cycle multiplier and an independent divider
The high-performance multiplier sits in the second pipeline stage. The divider is still a multi-cycle divider, but it no longer shares the datapath with the ALU. It is a separate divider unit handled as a long instruction and also sits in the second pipeline stage
Floating-point unit
The Hummingbird E200 series supports the "F" and "D" extensions of RISC-V and can handle single-precision and double-precision floating-point instructions
Floating-point instructions are handled by the FPU. If an FPU is configured, it is an independent execution unit handled as a long instruction, and it has its own floating-point register file. Both the F and D extensions require 32 general-purpose floating-point registers. If only F is implemented, each floating-point register is 32 bits wide. If D is implemented as well, each is 64 bits wide
The FPU of the Hummingbird E200 series supports the following features
- Independent clock gating
- Independent power domain
- Single precision floating-point multiplexed data path
Unfortunately, the open-source Hummingbird E203 does not include an FPU
Summary
This part briefly covered the two steps of decoding and execution in the execution unit of the Hummingbird E203
The EXU also contains the writeback and commit units. Because of length, they will be covered in the posts on writeback, commit, and memory
The EXU is the core of the Hummingbird E203 and contains a lot of code, so it takes repeated reading to understand