Project 5: Cache Lab

CSCI 2021, Fall 2018

Assigned: November 28, 2018

Due: December 7, 2018 at 11:59 pm

1 Logistics

This is an individual project. You must run this lab on the 64-bit CSE labs machines.

You will receive little credit if we are not able to run your code.

2 Overview

The purpose of this lab is to help you understand the impact that cache memories can have on the performance of your C programs.

In this lab, you will optimize a small matrix transpose function, with the goal of minimizing the number of cache misses.

3 Downloading the assignment

Download cachelab-handout.tar from Canvas

Start by copying cachelab-handout.tar to a Linux directory in which you plan to do your work. Then give the command:

linux> tar -xvf cachelab-handout.tar```

This will create a directory called cachelab-handout that contains a number of files. You will only be modifying trans.c.

To compile the files, type:
linux> make clean linux> make

WARNING: Do not let the WindowsWinZip program open up your .tar file (many Web browsers are set to do this automatically). Instead, save the file to your Linux directory and use the Linux tar program to extract the files.

1 DescriptionIn this lab, you will write a matrix transpose function that is optimized for cache performance.  1.1 Optimizing Matrix Transpose You will write a transpose function in trans.c that causes as few cache misses as possible.Let A denote a matrix, and Aijdenote the component on the ith row and jth column. The transpose of A ,denoted AT, is a matrix such that Aij= AT .

To help you get started, we have given you an example transpose function in trans.c that computes the

transpose of N × M

matrix A and stores the results in M × N

matrix B :

char trans_desc[] = "Simple row-wise scan transpose"; void trans(int M, int N, int A[N][M], int B[M][N])

The example transpose function is correct, but it is inefficient because the access pattern results in

relatively many cache misses.

Your job is to write a similar function, called transpose_submit, that minimizes the number of cache misses across different sized matrices:

char transpose_submit_desc[] = "Transpose submission";
void transpose_submit(int M, int N, int A[N][M], int B[M][N]); 

Do not change the description string (“Transpose submission”) for your transpose_submit function. The autograder searches for this string to determine which transpose function to evaluate for credit.

Programming Rules

● Include your name and x500 in the header comment for trans.c.

● Your code in trans.c must compile without warnings to receive credit.

● You are allowed to define at most 12 local variables of type int per transpose function.

● You are not allowed to side-step the previous rule by using any variables of other number types (ex: type long) or by using any bit tricks to store more than one value to a single variable.

● Your transpose function may not use recursion.

● If you choose to use helper functions, you may not have more than 12 local variables on the stack at a time between your helper functions and your top level transpose function. For example, if your transpose declares 8 variables, and then you call a function which uses 4 variables, which calls another function which uses 2, you will have 14 variables on the stack, and you will be in violation of the rule.

● You may assume that the arrays A and B do not overlap memory at all. That is, the memory used by A is totally distinct from the memory used by B.

● Your transpose function may not modify array A. You may, however, do whatever you want with the contents of array B.

● You are NOT allowed to define any arrays in your code or to use any variant of malloc.

1 EvaluationThis section describes how your work will be evaluated.  The full score for this lab is 100 points: Performance: 85 PointsStyle: 15 Points  1.1 Performance (85 points) We will evaluate the correctness and performance of your transpose_submit function on three different-sized output matrices:● 32 × 32 (M = 32, N = 32)

● 64 × 64 (M = 64, N = 64)

● 61 × 67 (M = 61, N = 67)

For each matrix size, the performance of your transpose_submit function is evaluated by using valgrind to extract the address trace for your function, and then using the reference simulator to replay this trace on a cache with parameters (s = 5, E = 1, b = 5).

Your performance score for each matrix size scales linearly with the number of misses, m, up to some threshold:

32 × 32 : 25 points if m < 300, 0 points if m > 600

● 64 × 64 : 25 points if m < 1,300, 0 points if m > 2,000

● 61 × 67 : 35 points if m < 2,000, 0 points if m > 3,000

Your code must be correct to receive any performance points for a particular size. Your code only needs to be correct for these three cases, and you can optimize it specifically for these three cases. In particular, it is perfectly OK for your function to explicitly check for the input sizes and implement separate code optimized for each case.

1.1 Evaluation for Style (15 points)

There are 15 points for coding style. These will be assigned manually by the course staff. The course staff will inspect your code for illegal arrays and excessive local variables.

2 Working on the Lab

We have provided you with an autograding program, called test-trans.c, that tests the correctness and performance of each of the transpose functions that you have registered with the autograder.

You can register up to 100 versions of the transpose function in your trans.c file. Each transpose version has the following form:

/* Header comment */
char trans_simple_desc[] = "A simple transpose";
void trans_simple(int M, int N, int A[N][M], int B[M][N])
/* your transpose code here */


Register a particular transpose function with the autograder by making a call of the form:

registerTransFunction(trans_simple, trans_simple_desc);

in the registerFunctions routine in trans.c. At runtime, the autograder will evaluate each registered transpose function and print the results. Of course, one of the registered functions must be the transpose_submit function that you are submitting for credit:

See the default trans.c function for an example of how this works.

The autograder takes the matrix size as input. It uses valgrind to generate a trace of each registered transpose function. It then evaluates each trace by running the reference simulator on a cache with parameters (s = 5, E = 1, b = 5).

For example, to test your registered transpose functions on a 32 × 32 matrix, rebuild test-trans, and then run it with the appropriate values for M and N:

linux> make
linux> ./test-trans -M 32 -N 32
Step 1: Evaluating registered transpose funcs for correctness: func 0 (Transpose submission): correctness: 1
func 1 (Simple row-wise scan transpose): correctness: 1 func 2 (column-wise scan transpose): correctness: 1 func 3 (using a zig-zag access pattern): correctness: 1

Step 2: Generating memory traces for registered transpose funcs.

Step 3: Evaluating performance of registered transpose funcs (s=5, E=1, b=5)
func 0 (Transpose submission): hits:1766, misses:287, evictions:255 func 1 (Simple row-wise scan transpose): hits:870, misses:1183, evictions:1151
func 2 (column-wise scan transpose): hits:870, misses:1183, evictions:1151
func 3 (using a zig-zag access pattern): hits:1076, misses:977, evictions:945

Summary for official submission (func 0): correctness=1 misses=287

In this example, we have registered four different transpose functions in trans.c. The test-trans program tests each of the registered functions, displays the results for each, and extracts the results for the official submission.

Here are some hints and suggestions.

● The test-trans program saves the trace for function i in file These trace files are invaluable debugging tools that can help you understand exactly where the hits and misses for each transpose function are coming from. To debug a particular function, simply run its trace through the reference simulator with the verbose option:

linux> ./csim-ref -v -s 5 -E 1 -b 5 -t trace.f0 S 68312c,1 miss
L 683140,8 miss
L 683124,4 hit
L 683120,4 hit
L 603124,4 miss eviction S 6431a0,4 miss

○ Each line denotes one or two memory accesses. The format of each line is

[space]operation address,size

○ The operation field denotes the type of memory access: “I” denotes an instruction load, “L” a data load, “S” a data store, and “M” a data modify (i.e., a data load followed by a data store). The address field specifies a 64-bit hexadecimal memory address. The size field specifies the number of bytes accessed by the operation.

○ It may be easier to view this output in a file. To redirect this to a file, use the command linux> ./csim-ref -v -s 5 -E 1 -b 5 -t trace.f0 > file.txt where file.txt is the name of the file to create.

● Since your transpose function is being evaluated on a direct-mapped cache, conflict misses are apotential problem. Think about the potential for conflict misses in your code, especially along the diagonal. Try to think of access patterns that will decrease the number of these conflict misses.● Blocking is a useful technique for reducing cache misses. See

for more information.

6.3 Putting it all Together

We have provided you with a driver program, called ./, that performs a complete evaluation of your transpose code. This is the same program your instructor uses to evaluate your handins. The driver uses test-trans to evaluate your submitted transpose function on the three matrix sizes. Then it prints a summary of your results and the points you have earned.

To run the driver, type:

linux> ./

1 Handing in Your Work

Each time you type make in the cachelab-handout directory, the Makefile creates a tarball, called

userid-handin.tar, that contains your current trans.c files.

Upload your userid-handin.tar file on Canvas.

IMPORTANT: Do not create the handin tarball on a Windows or Mac machine, and do not handin files in any other archive format, such as .zip, .gzip, or .tgz files.