ECE463/563 – Microprocessor Architecture Project #2
Due date: April 10, 2019
The goal of this project is to design and implement a C/C++ cycle-accurate simulator of a dynamically scheduled processor implementing the Tomasulo algorithm with Reorder Buffer.
Students taking the course at the undergraduate level (ECE463) do not need to support store instructions (SW and SWS) in their simulator, while students taking the course at the graduate level (ECE563) will need to support store instructions as well.
The project2_code.tar.gz archive contains C and C++ code templates for the simulator, as well as test cases to validate its operation. ECE563 students are required to use C++ code templates.
In order to extract the code in a Linux environment, you can invoke the following command:
tar –xzvf project2_code.tar.gz
This will create a project2_code folder containing two subfolders: c and c++, the former containing the C templates and the latter containing the C++ templates. The content of these two subfolders is very similar.
The c and c++ folders have the following content:
· sim_ooo.h/sim_ooo.cc (or sim_ooo.c): code templates for the dynamically scheduled processor’s simulator. These are the only files that you need to modify to implement the simulator.
· Test cases: test cases to validate the operation of the dynamically scheduled processor’s simulator. This folder contains five test cases. For each of them, you will find two files: Test caseN.cc and Test caseN.out. The former contains the test case implementation, and the latter the expected output of the test case. You should not modify any of the test case files. We recommend writing your own test cases to verify the functionality that is not covered by the test cases provided.
· Makefile: Makefile to be used to compile the code. The use of this Makefile will cause an object file (.o) to be created for each C or C++ file that is part of the project. If the compilation succeeds, the binaries corresponding to the test cases will be generated in the bin folder. You don’t need to modify the Makefile unless you are working on the floating-point pipeline simulator or you are using a library that is not included by default.
· asm: assembly files used by the test cases.
· bin: once you compile the code, the test cases binaries will be saved into this folder.
The sim_ooo.h header file is commented and contains details on the functions/methods that you need to implement. Be sure to read the comments in this header file carefully before you start coding.
Assumptions & Requirements
1. Data types and registers: The simulator operates on 32-bit integer numbers and 32-bit floating-point numbers stored in data memory. The simulator has 32 integer registers (R0-R31) and 32 single-precision floating- point registers (F0-F31).
2. Memories: The instruction and data memories are separated. The instruction memory returns the instruction fetched within the clock cycle, while the data memory has a configurable latency.
3. Execution units and memory: The number of execution units and their latency must be configurable. The function init_exec_unit can be invoked at the beginning to configure the execution units used. Execution units can be of five kinds: INTEGER unit, floating-point ADDER, MULTIPLIER, DIVIDER and MEMORY (see exe_unit_t data type). The integer units are used for integer additions (and branches), subtractions and logic operations; the floating-point adders are for floating-point additions and subtractions; the multipliers are for integer and floating-point multiplications; and the dividers for integer and floating- point divisions. While the simulator can have multiple integer units, adders, multipliers and dividers, it will always have a single data memory unit (used by load and store instructions). While the processor can have multiple execution units of the same type with different latencies, in our test cases all the units of the same type will have the same latency. For simplicity, assume that all execution units are unpipelined. You can assume that the latency of the execution stage corresponds to that of the execution unit it uses.
4. Reservation stations and load buffers: Besides load buffers, the processor should have three kinds of reservation stations (see res_station_t data type). Integer reservation stations (INTEGER_RS) feed the integer units, add reservation stations (ADD_RS) feed the floating-point adders, and multiplier reservation stations (MULT_RS) are shared by multipliers and dividers. Load buffers (LOAD_B) are used by memory instructions.
5.Initialization: the size of the data memory, the size of the reorder buffer, the number of reservation stations and load buffers, and the issue width can be configured when instantiating the simulator.
6. Reservation stations, ROB, instruction window: the code template already contains data structures to model reservation stations, ROB and the instruction window. You can extend these data structures if you need, but you should not modify existing field.
7. Instructions supported: the processor simulator needs to support the following instructions (which are listed in the sim_ooo.h header file).
· LW – Load word: Loads a 32-bit integer into a register from the specified address.
· SW – Store word: Stores a 32-bit integer into data memory at a specified address.
· ADD/SUB/XOR/AND/MULT/DIV – Add/Sub/Xor/And/Mult/Div: Computes the addition/subtraction/exclusive XOR/AND/multiplication/division of the values of two integer registers and stores the result into an integer register.
· ADDI/SUBI – Add/Sub Immediate: Computes the addition/subtraction of the value of an integer register and a sign-extended immediate value and stores the result into an integer register.
· BEQZ, BNEZ, BLTZ, BGTZ, BLEZ, BGEZ – Branch if the value of the register is =, ≠, <, >, ≤, ≥ zero. JUMP – Unconditional branch.
· LWS – Load word: Loads a 32-bit floating-point into a register from the specified address.
· SWS – Store word: Stores a 32-bit floating-point value into data memory at a specified address.
· ADDS/SUBS/MULTS/DIVS – Add/Sub/Mult/Div: Computes the addition/subtraction/multiplication/division of the values of two floating-point registers and stores the result into a floating- point register.
· EOP – End of program: Special instruction indicating the end of the program. As mentioned above, ECE463 do not need to support the SW and SWS instructions in their simulator.
8. Load/store instructions and dependencies: Load and store instructions use the memory unit in different stages. Load instructions load data from memory in the execution stage and send their result to the reservation stations and ROB in the write result stage. Store instructions write data to memory at the commit stage and use the execution stage only for the computation of the effective memory address. Load and store instructions from/to different memory addresses do not conflict. However, in a dynamically scheduled processor load and store instructions that operate on the same memory address can cause RAW, WAW and WAR memory hazards. Note, however, that since the ROB postpones writing memory to the commit stage (and this stage processes instructions in program order), in this case WAW and WAR dependencies are not an issue. So, your implementation should handle only RAW load/store dependencies occurring when a load operation follows a pending store operation to the same memory address. Your implementation should assume that, in this situation, the value that will be written to memory by the pending store should be bypassed to the load. To meet these requirements, your implementation should process memory instructions as follows:
· When a load or store instruction is issued, the address field of the corresponding reservation station is filled with the value of the immediate field of the instruction. The address field is updated with the effective memory address when the instruction is moved to the execution stage.
· When multiple instructions are eligible to move from the issue to the execution stage, they are considered in program order. In addition, the effective memory address is computed in program order. Load instructions are not allowed to execute if preceded by potentially conflicting store instructions.
· Store instructions are allowed in the execution stage only if the values of both their input registers are available. Note that, while a less stringent timing is possible, this facilitates the handling of data bypassing between store and dependent load instructions.
· Store instructions update the destination field of the ROB with the effective memory address when they enter the execution stage, and they update the value field of the ROB with the result that they will later write to memory in the write-result stage.
· In the presence of a pending store to the same address, data bypassing from the store to the load instruction is done when the load enters the execution stage. In this stage, the load will save the value bypassed in the value2 field of its reservation station. In the write-result stage, the load will write this value in the reorder buffer.
· Bypassing happens through the ROB (for the timing, see timing constraint on data dependencies below).
· Store instructions spend only one clock cycle in execution stage, and a number of clock cycles equal to the memory latency in the commit stage. In the presence of store-to-load data bypassing, load instructions spend only one clock cycle in the execution stage (since in this case they don’t use the memory unit).
Note that data bypassing does not modify the timing of store instructions, but it only speeds up the processing of dependent load instructions.
9. Other timing constraints:
· If an instruction I uses a ROB entry, such ROB entry becomes available and can be used in the clock cycle after I commits.
· In the presence of a structural hazard on an execution unit, the execution unit is available to the second instruction in the clock cycle after the first instruction has finished using it. If an instruction I uses an execution unit in the execution stage, such execution unit is freed when I writes the result and is available to a different instruction starting from the next clock cycle.
· If an instruction I uses a reservation station or a load buffer, such reservation station or load buffer is freed when I writes the result, and is available to a different instruction starting from the next clock cycle.
· If an instruction J depends on instruction I, the data written by instruction I will be available to J in the clock cycle when I writes the result (say t), and (in the absence of data hazard), instruction J will be able to execute in the following clock cycle (say t+1).
· In the write result stage branch instructions write their target address in the result field of the corresponding ROB entry. If the branch is mispredicted, the target instruction of the branch is fetched in the clock cycle after the branch instruction commits. While a less strict timing is possible, this allows precise exception handling.
· CPI computation: The EOP instruction should be excluded from the computation of the CPI.
Format of the output
The code template includes functions to print out the content of the data memory, the status of the registers (and, if their value is pending, the ROB entry that will update the register), the content of the reservation stations and load buffers, the content of the reorder-buffer, the status of the pending instructions (that is, the ones still in the ROB), and the instruction execution history (which shows the cycle-by-cycle execution of all instructions fetched during operation). These functions do not need to be modified, and they are invoked from the test case files.
As mentioned above, the compilation process generates a separate binary for each test case in the testcases folder. To execute testcaseX, you can go in the c/c++ folder and invoke:
To check if your output is correct, you can compare it with file testcaseX.out in the testcases folder. On Linux, you can use the diff utility to do so.
For example, you can invoke
./bin/testcaseX > my_output
diff my_output testcases/testcaseX.out
The first command will run the test case and save its output into my_output. The second command will compare your output with the reference output line-by-line.
Comment should be a brief explanation of the test case evaluation. Examples: “the test case and reference output fully match”, “everything but the cycle-by-cycle execution matches”, “the final content of memory and register and the execution log fully match”, etc.
1. Report: Your report should be no longer than 4 pages (11 pt. Times New Roman font), including figures. It should include the following information:
· Brief description of the salient aspects of your implementation;
· Brief explanation of what works and what not in your implementation;
· Table above
· If you modified the Makefile, indicate it in the report.
Save your report in pdf format, with file name project2_report.pdf. Include the report in the project2_code folder.
2. Test cases: You should not modify any of the test cases. All the functionality should be included in the sim_ooo.h and sim_ooo.c/cc, files. You can add header and C/C++ files, but you don’t need to.
3. Code: Independently of the development environment and operating system you used to develop your code, your code should compile and run on the grendel.ece.ncsu.edu Linux machine, and it should compile using the provided Makefile (you can modify the Makefile if you are using libraries which are not included or if you have implemented the floating-point pipeline simulator). Have a look at Piazza for information on how to access the grendel machine.
4. You should invoke “make clean” before submitting your code. That is, your submission should not contain any object or binary files.
5. Remove the folder containing the templates that you did not use. In other words, if you used the C templates, delete the c++ folder and its content; if you used the C++ templates, delete the c folder and its content. Your project2_code folder should contain the project2_report.pdf and one of the c or c++ folders.
6. Go to the parent folder of the project2_code folder. Compress the whole project2_code directory into a tar.gz file.
tar -zcvf project2_code.tar.gz project2_code
7. Submit your project through Moodle (no need to print the report and take it to class).