Cache Analysis and Valgrind
Consider the following program that generates a random matrix:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
int** genRandomMatrix(int n, int max){
int i, j;
int** mat = malloc(n * sizeof(int*));
for (i=0; i < n; i++){
mat[i] = malloc(n * sizeof(int));
for (j=0; j < n; j++)
mat[i][j] = 1 + rand() % max;
}
return mat;
}
void free_all(int** mat, int n){
int i;
for (i=0; i < n; i++)
free(mat[i]);
free(mat);
}
int main(int argc, char** argv) {
if (argc != 2){
fprintf(stderr, "usage: %s <n>\n", argv[0]);
fprintf(stderr, "where <n> is the dimension of the matrix\n");
return 1;
}
int i;
int n = strtol(argv[1], NULL, 10);
srand(time(NULL));
int ** matrix = genRandomMatrix(n, 100);
free_all(matrix, n);
return 0;
}
Suppose I gave you the problem of computing the average value of the elements in the matrix. How would you do it? Here are two equivalent solutions:
float averageMat_v1(int** mat, int n){
int i, j, total=0;
for (i = 0; i < n; i++){
for (j = 0; j < n; j++){
total+= mat[i][j];
}
}
return (float)total/(n*n);
}
float averageMat_v2(int** mat, int n){
int i, j, total=0;
for (i = 0; i < n; i++){
for (j = 0; j < n; j++){
total+= mat[j][i];
}
}
return (float)total/(n*n);
}
Which one is better?
A first cut: theoretical analysis + benchmarking
If we do a theoretic analysis based on our understanding
of matrices, locality, and the memory hierarchy, we may say that the first
has better spacial locality (on matrix mat
), due to the fact that matrices are
stored in row-major order in memory. The second solution has poor spatial
locality, since each element in the matrix is visited in column-major order.
As we have learned, data is loaded into the cache in blocks. Traversing the
matrix in column-major order will likely lead to more cache misses, resulting
in poorer performance.
Let's modify our main function to include calls to the gettimeofday()
function
so we can accurately measure the difference in performance of the two versions:
int main(int argc, char** argv) {
if (argc != 2){
fprintf(stderr, "usage: %s <n>\n", argv[0]);
fprintf(stderr, "where <n> is the dimension of the matrix\n");
return 1;
}
int i;
int n = strtol(argv[1], NULL, 10);
srand(time(NULL));
struct timeval tstart, tend;
int ** matrix = genRandomMatrix(n, 100);
gettimeofday(&tstart, NULL);
float res = averageMat_v1(matrix, n);
gettimeofday(&tend, NULL);
double time = tend.tv_sec - tstart.tv_sec + (tend.tv_usec - tstart.tv_usec)/1.e6;
printf("v1 average is: %.2f; time is %g\n", res, time);
gettimeofday(&tstart, NULL);
res = averageMat_v2(matrix, n);
gettimeofday(&tend, NULL);
time = tend.tv_sec - tstart.tv_sec + (tend.tv_usec - tstart.tv_usec)/1.e6;
printf("v2 average is: %.2f; time is %g\n", res, time);
free_all(matrix, n);
return 0;
}
Compiling the code and running it yields the following result:
> gcc -m32 -o cachex cachex.c
> ./cachex 5000
v1 average is: 50.49; time is 0.053641
v2 average is: 50.49; time is 0.247644
Holy cow, what a big difference! In essence, our solution using row-major order is 4.61 times faster than the second one!
Cache-Analysis in the real world: Cachegrind
In the above problem, we theoretically analyzed two solutions and then benchmarked to verify that version 1 of the code is faster. However, is there a way to actually see if our analysis about cache misses and hits is true? We can once again use our old friend, the Valgrind suite! Earlier in this book, we discussed how Valgrind can be used to help find memory leaks in a program. In this section, we will discuss cachegrind, Valgrind's cache simulator. This program is useful if you wish to study how a program or particular function impacts the cache.
Cachegrind is useful for simulating how your program interacts with the computer’s cache hierarchy. In several cases, Cachegrind can auto-detect the cache organization of a machine. In the cases that it cannot, it still simulates the first level (L1) cache and the last level (LL) cache. It assumes the first level cache has two independent components: the instruction cache, and the data cache. The reason for this is that the last level cache has the most important implications for runtime. L1 caches also have the lowest level of associativity, so it is important that we ensure that our programs are interacting well with this cache. These assumptions match the structure of most modern machines.
Cachegrind collects and outputs the following information:
- Instruction cache reads (
Ir
) - L1 instruction cache read misses (
I1mr
) and LL cache instruction read misses (ILmr
) - Data cache reads (
Dr
) - D1 cache read misses (
D1mr
) and LL cache data misses (DLmr
). - Data cache writes (
Dw
) - D1 cache write misses (
D1mw
) and LL cache data write misses (DLmw
).
It is important to note that D1 total access is computed by D1 = D1mr + D1mw, and that LL total access is given by ILmr + DLmr + DLmw.
Let’s see how well version 1 of our code operates when we simulate a cache with Cachegrind. We can run Valgrind on our compiled code with the following command:
> valgrind --tool=cachegrind ./cachex 1000
In this invocation, Valgrind's cachegrind tool is acting as a wrapper around
the cachex
exutable. We choose a smaller matrix size for cachegrind to aid
in the speed of execution. This will give you information about the number
of cache hits and misses in the overall program:
==28657== Cachegrind, a cache and branch-prediction profiler
==28657== Copyright (C) 2002-2015, and GNU GPL'd, by Nicholas Nethercote et al.
==28657== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==28657== Command: ./cachex 1000
==28657==
--28657-- warning: L3 cache found, using its data for the LL simulation.
average is: 50.49; time is 0.080304
average is: 50.49; time is 0.09733
==28657==
==28657== I refs: 122,626,329
==28657== I1 misses: 1,070
==28657== LLi misses: 1,053
==28657== I1 miss rate: 0.00%
==28657== LLi miss rate: 0.00%
==28657==
==28657== D refs: 75,292,076 (56,205,598 rd + 19,086,478 wr)
==28657== D1 misses: 1,192,118 ( 1,129,099 rd + 63,019 wr)
==28657== LLd misses: 64,399 ( 1,543 rd + 62,856 wr)
==28657== D1 miss rate: 1.6% ( 2.0% + 0.3% )
==28657== LLd miss rate: 0.1% ( 0.0% + 0.3% )
==28657==
==28657== LL refs: 1,193,188 ( 1,130,169 rd + 63,019 wr)
==28657== LL misses: 65,452 ( 2,596 rd + 62,856 wr)
==28657== LL miss rate: 0.0% ( 0.0% + 0.3% )
However, we are interested specifically in the hits and misses for the
two versions of our functions. To view this, we will need to use the
cachegrind tool cg_annotate
. Running cachegrind should have produced a file
in your current working directory that looks something like cachegrind.out.n
,
where n
is some process id number. To run cg_annotate
type in the following
command (replacing cachegrind.out.28657
with your own output file):
> cg_annotate cachegrind.out.28657
--------------------------------------------------------------------------------
I1 cache: 32768 B, 64 B, 8-way associative
D1 cache: 32768 B, 64 B, 8-way associative
LL cache: 8388608 B, 64 B, 16-way associative
Command: ./cachex 1000
Data file: cachegrind.out.28657
Events recorded: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Events shown: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Event sort order: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Thresholds: 0.1 100 100 100 100 100 100 100 100
Include dirs:
User annotated:
Auto-annotation: off
--------------------------------------------------------------------------------
Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
--------------------------------------------------------------------------------
122,626,329 1,070 1,053 56,205,598 1,129,099 1,543 19,086,478 63,019 62,856 PROGRAM TOTALS
--------------------------------------------------------------------------------
Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw file:function
--------------------------------------------------------------------------------
14,009,017 3 3 9,005,008 62,688 0 1,004 0 0 ???:averageMat_v1
14,009,017 0 0 9,005,008 1,062,996 0 1,004 0 0 ???:averageMat_v2
We have edited the output from this command slightly to bring focus to the
two versions of our functions. From this output, we can clearly see that version
2 of our average code has 1,062,996
data misses compared to 62,688
misses
in version 1. Here is solid proof that our analysis is correct!