Final project: Cache simulator

Due: Friday, January 21 at 16:00

For this final project, you will write a configurable cache simulator using the infrastructure provided. Your cache simulator will read an address trace (a chronological list of memory addresses referenced), simulate the cache, generate cache hit and miss data, and calculate the execution time for the executing program. The address traces have been generated by a simulator executing real programs. Your cache simulator will be graded for accuracy, but is not the end product of this project, rather it is a tool you will use to complete the project. In this project, you will experiment with various cache configurations and make conclusions about the optimal cache organization for this set of programs.

Preliminaries

You can work on this project with a partner if you choose. If you decide to work with a partner, you and your partner should check out a single repository. The first partner will create a team name, and the second partner should choose that team name. Please be careful choosing a team, as this cannot be undone. Please name your team something that makes it clear who you are.

If you choose to work with a partner, you and your partner must complete the entire project together. Dividing the project up into pieces and having each partner complete a part of it on their own will be considered a violation of the honor code. Both you and your partner are expected to fully understand all of the code you submit.

Click on the assignment link.

Once you have accepted the assignment, you can clone the repository on your computer by following the instruction and begin working.

Be sure to ask any questions on Piazza.

Address trace

An address trace is simply a list of addresses produced by a program running on a processor. These are the addresses resulting from load and store instructions in the code as it is executed. Some address traces would include both instruction fetch addresses and data (load and store) addresses, but you will be simulating only a data cache, so the provided traces in the traces directory only have data addresses.

These traces were generated by a simulator of a RISC processor running three programs, art, mcf, and swim from the SPEC benchmarks. The files are art.trace.gz, mcf.trace.gz, and swim.trace.gz. The number of loads/stores in the traces vary by benchmark. They are all compressed with gzip. You do not need to decompress the traces because Cache.java knows how to read both compressed and uncompressed files. (In addition, there is a test.trace file containing the contents of the example trace shown below.)

Use the following command to run your simulator on the given trace file.

java Cache [cache args] tracefile

Because your workload is three programs, you will run three simulations for each architecture you simulate, and then combine the results in some meaningful way. The simulator arguments should be taken in as command line arguments. For example, to simulate a 32 kB, 4-way set-associative cache with 32-byte blocks and a 30-cycle miss penalty on the traces/mcf.trace.gz trace file, you’d run the following.

java Cache -s 32 -a 4 -b 32 -mp 30 traces/mcf.trace.gz

Your code should support any reasonable values for cache size, associativity, etc.

Format of the address trace

All lines of the address trace are of the format

# LS ADDRESS IC

where LS is a 0 for a load and 1 for a store, ADDRESS is an 8-character hexadecimal number, and IC is the number of instructions executed between the previous memory access and this one (including the load or store instruction itself). There is a single space between each field. The instruction count information will be used to calculate execution time (or at least cycle count). A sample address trace starts out like this:

# 0 7fffed80 1
# 0 10010000 10
# 0 10010060 3
# 1 10010030 4
# 0 10010004 6
# 0 10010064 3
# 1 10010034 4

You should assume no accesses address multiple cache blocks (e.g., assume all accesses are for 32 bits or less and appropriately aligned).

Simulator (40 points)

Your simulator will model an n-way set associative, write-back, write-allocate cache. The cache replacement policy is always least-recently used for associative caches.

When run on a trace file, it should produce

total execution time in cycles;
number of instructions;
number of memory access instructions (i.e., number of loads and stores);
overall cache miss rate;
cache miss rate for load instructions;
cycles per instruction (CPI);
average memory access time in cycles (cycles per memory access, assuming 0 cycles for a cache hit and a miss penalty for a cache miss; see below);
the number of dirty cache blocks evicted;
the number of load misses;
the number of store misses;
the number of load hits; and
the number of store hits.

See the examples below.

For execution time, assume the following.

Instructions other than loads and stores take one cycle;
A load or store takes one cycle plus a miss penalty if applicable;
A load or store that misses in the cache has a configurable miss penalty with a default of 30 cycles;
A load or store that misses in the cache and causes the cache to evict a dirty cache block has an additional 2 cycle penalty.

(We’re assuming that on a cache miss that causes a dirty block to be evicted, the write back to memory happens mostly in the background and the additional 2-cycle penalty is to write the dirty block to a write buffer.)

To recap: With the default base miss penalty of 30 cycles, a load or store instruction takes 1 cycle for a cache hit, 30 cycles for a cache miss that does not evict a dirty block, and 32 cycles for a cache miss that evicts a dirty block.

In the trace shown above, the first 31 instructions take 151 cycles, assuming four cache misses and 3 cache hits for the 5 loads and 2 stores, and a 30-cycle base miss penalty.

Each trace contains the memory accesses of just over 5 million instructions. Your simulations should process all of them. Your output must follow the format provided in the skeleton code exactly. See the examples below.

Questions (40 points)

For the second part of this project, you will use your cache simulator to test out different cache configurations and answer questions about your findings. You must try every configuration below on the art, swim, and mcf cache traces provided to you, and discuss all of them in your answers to the questions. Note that your question answers are worth just as much as the code for this project—you are expected to give detailed answers that clearly backup your claims with your simulator results.

The default cache configuration will be 16-byte block size, direct-mapped, 16 kB cache size, write-back, and write-allocate. You will re-evaluate some of these parameters one at a time, in the given order. In each case, choose a best value for each parameter, then use that for all subsequent analyses.

Look at 16 kB, 32 kB, and 128 kB cache sizes. Larger caches take longer to access, so assume that a processor with a 32 kB cache requires a 5% longer cycle time, and the 128 kB 15% longer. Choose the best size/cycle time combination and proceed to the next step.

Look at cache associativity of direct-mapped, 2-way set-associative, and 8-way set-associative. Assume that 2-way associative adds 5% to the cycle time, and 8-way adds 10% to the cycle time. Choose the best associativity and cycle time, and proceed to the next step.

Look at cache block sizes of 16, 32, and 64 bytes. Assume that it takes two extra cycles to load 32 bytes into the cache, and 6 extra cycles to load 64 bytes. (i.e., raise the miss penalty accordingly). Choose the best size and miss penalty and proceed to answering the following questions.

(10 points) What is the optimal cache size, associativity, and block size, given the parameters above?
(10 points) Is the cache miss rate a good indicator of performace? In which cases did the option with the lowest miss rate not have the lowest execution time? Why?
(10 points) Were results uniform across the three programs? In which cases did different programs give different conclusions? Speculate as to why that may have been true.
(10 points) What was the speedup of your final design over the default for each trace? Use the definition of speedup covered in class.

Put the answers to these questions in the README file.

Hints

Think about how to intelligently debug and test your program. Running immediately on the entire input gives you little insight on whether it is working (unless it is way off). To do this create separate memory tests (you can see the text format above) to ensure cache size, cache associativity, blocksize, and miss penalty are functioning correctly. You do not need to turn them in, but they will help tremendously.

Speed matters. These simulations should take less than a couple minutes (actually, much less) on an unloaded machine. If it is taking much more than that, do yourself a favor and think about what you are doing inefficiently.

Simulations are not the same as hardware. If your tag only takes 16 bits, feel free to use an integer for that value. Other time-saving optimizations along these lines might be useful.

Submission

Submit the project by committing your code and answers to the questions and pushing it to your GitHub repository.

Examples

Here are three example simulation runs and the command line to produce them. Your output should be identical.

$ java Cache -s 32 -a 4 -b 32 -mp 30 traces/test.trace 
Cache parameters:
    Cache Size (kB)             32
    Cache Associativity          4
    Cache Block Size (bytes)    32
    Miss penalty (cycles)       30

Simulation results:
    execution time                    151 cycles
    instructions                       31
    memory accesses                     7
    overall miss rate                0.57
    load miss rate                   0.60
    CPI                              4.87
    average memory access time      17.14 cycles
    dirty evictions                     0
    load_misses                         3
    store_misses                        1
    load_hits                           2
    store_hits                          1

$ java Cache -s 32 -a 4 -b 32 -mp 30 traces/art.trace.gz
Cache parameters:
    Cache Size (kB)             32
    Cache Associativity          4
    Cache Block Size (bytes)    32
    Miss penalty (cycles)       30

Simulation results:
    execution time               20127356 cycles
    instructions                  5136716
    memory accesses               1957764
    overall miss rate                0.25
    load miss rate                   0.27
    CPI                              3.92
    average memory access time       7.66 cycles
    dirty evictions                 60015
    load_misses                    475672
    store_misses                    20015
    load_hits                     1256211
    store_hits                     205866

$ java Cache -a 8 -s 64 -b 32 -mp 42 traces/mcf.trace.gz
Cache parameters:
    Cache Size (kB)             64
    Cache Associativity          8
    Cache Block Size (bytes)    32
    Miss penalty (cycles)       42

Simulation results:
    execution time              143963250 cycles
    instructions                 19999998
    memory accesses               6943857
    overall miss rate                0.42
    load miss rate                   0.36
    CPI                              7.20
    average memory access time      17.85 cycles
    dirty evictions                995694
    load_misses                   2036666
    store_misses                   867426
    load_hits                     3552806
    store_hits                     486959