# CS 31: Introduction to Computer Systems 16 Storage and Locality 03-25-2025



# Reading Quiz

# Last class: The Memory Hierarchy



# Cache Design

- Lot's of characteristics to consider:
  - Where should data be stored in the cache?
  - What size data chunks should we store? (block size)





# System Cache Management

### The **Mechanism** of cache misses:

- How to detect a miss?
- How to fetch and load data from lower level of the memory hierarchy?
- Do we need to replace some current item's data in the cache with new data?

### The **Policy** of cache miss:

— Which data to kick out of the cache to make room for the missed data?

# System Cache Management

### **Separation of Policy and Mechanism**

policy implementation

mechanism implementation

- Tune the mechanism for the very specific hardware
- Policy interacts with mechanism interface
- + Faster mechanisms
- + Easier to change policies

# Cache Design

- Lot's of characteristics to consider:
  - Where should data be stored in the cache?
  - What size data chunks should we store? (block size)

### Goals:

- Maximize hit rate
- Maximize (temporal & spatial) locality benefits
- Reduce cost/complexity of design

Suppose the CPU asks for data, it's not in cache. We need to move in into cache from memory. Where in the cache should it be allowed to go?

A. In exactly one place.

B. In a few places.

C. In most places, but not all.

D. Anywhere in the cache.



### A. Direct-Mapped: In exactly one place

- Every location in memory is directly mapped to one place in the cache.
- Easy to find data.

### B. Set-Associative: In a few places.

- A memory location can be mapped to (2, 4, 8) locations in the cache.
- Middle ground.

### C. In most places, but not all.

### D. "Fully associative": Anywhere in the cache.

- No restrictions on where memory can be placed in the cache.
- Fewer conflict misses, more searching.

### What should we use to determine whether or not data is in the cache?

A. The memory address of the data.

B. The value of the data.

C. The size of the data.

D. Some other aspect of the data.

### What should we use to determine whether or not data is in the cache?

- A. The memory address of the data.
  - Memory address is how we identify the data.
- B. The value of the data.
  - If we knew this, we wouldn't be looking for it!
- C. The size of the data.
- D. Some other aspect of the data.

### Recall: Memory Reads

CPU places address A on the memory bus.

Load operation: movl (A), %eax



### Recall: Memory Reads

Memory retrieves value and sends it across bus.

CPU reads value from the bus, and copies it into register %eax, a copy also goes into the on-chip cache memory.



# Memory Address Tells Us...

Is the block containing the byte(s) you want already in the cache?

- If not, where should we put that block?
  - Do we need to kick out ("evict") another block?

Which byte(s) within the block do you want?

# Memory Addresses for use with Cache

Like everything else: series of bits (x86\_64 has 64 bit addresses)

- Keep in mind:
  - N bits gives us 2<sup>N</sup> unique values.

- 64-bit address:

Divide into regions, each with distinct meaning.

### **Address Division**

### First section: Tag

- Of all the addresses that map to this location, which one is here?
- Number of bits for this section is any bits left over after index and offset.

### Second section: Index

- Which location(s) in the cache should we check for the data with this address?
- Number of bits for this section depends on the number of cache locations.

### Third section: Offset

- If we find a block of bytes in the cache (on a hit) which byte offset within the block do we actually want?
- Number of bits for this section depends on the block size must be able to uniquely identify every byte in the block.

### **Address Division**

#### 

| Tag (X bits) | Index (Y bits) | Byte offset (Z bits) |
|--------------|----------------|----------------------|
|              |                |                      |

- First section: Tag
  - Of all the addresses that map to this location, which one is here?
  - Uniquely identify the subset of memory contained within a cache line.
  - Number of bits for this section is any bits left over after index and offset.
- Second section: Index
  - Which location(s) in the cache should we check for the data with this address?
  - Number of bits for this section depends on the number of cache locations.
- Third section: Offset
  - If we find a block of bytes in the cache (on a hit) which bytes within the block do we actually want?
  - Number of bits for this section depends on the block size must be able to uniquely identify every byte in the block.

### A. <u>In exactly one place</u>. ("Direct-mapped")

Every location in memory is directly mapped to one place in the cache. Easy to find data.

- B. In a few places. ("Set associative")
  - A memory location can be mapped to (2, 4, 8)
     locations in the cache. Middle ground.

- A. Anywhere in the cache. ("Fully associative")
  - No restrictions on where memory can be placed in the cache. Fewer conflict misses, more searching.

# **Direct Mapped Cache**

### **Main Memory**



i and j map to the same cache line and may constantly evict each other!

16 locations

One place data can be.

- Example: let's assume some parameters:
  - 1024 cache locations (every block mapped to one)
  - Block size of 8 bytes

Metadata

1024 cache locations (every block mapped to one) Block size of 8 bytes

| Line | V | D | Tag | Data (8 Bytes) |
|------|---|---|-----|----------------|
| 0    |   |   |     |                |
| 1    |   |   |     |                |
| 2    |   |   |     |                |
| 3    |   |   |     |                |
| 4    |   |   |     |                |
| •••  |   |   |     |                |
| 1020 |   |   |     |                |
| 1021 |   |   |     |                |
| 1022 |   |   |     |                |
| 1023 |   |   |     |                |

### Cache meta-data

Metadata

### Valid bit: is the entry valid?

garbage

If set: data is correct, use it if we 'hit' in cache
If <u>not</u> set: ignore 'hits', the data is

### Dirty bit: has the data been written?

Used by write-back caches
If set, need to update memory
before eviction

| Line | V | D | Tag | Data (8 Bytes) |
|------|---|---|-----|----------------|
| 0    |   |   |     |                |
| 1    |   |   |     |                |
| 2    |   |   |     |                |
| 3    |   |   |     |                |
| 4    |   |   |     |                |
| •••  |   |   | ••• |                |
| 1020 |   |   |     |                |
| 1021 |   |   |     |                |
| 1022 |   |   |     |                |
| 1023 |   |   |     |                |

# Address division: Direct-Mapped

- Identify byte in block
  - How many bits do we need to represent each byte uniquely?
- Identify which row (line)
  - How many bits do we need to represent each line uniquely?



- A. Block 8 bits Row 1024 bits
- C. Block 10 bits Row 10 bits
- B. Block 3 bits Row 10 bits
- D. Block 32 bits Row 32 bits

# Address division: Direct-Mapped

- Identify byte in block
  - How many bits? 3

- Identify which row (line)
  - How many bits? <u>10</u>

- <u>Tag:</u>
  - 64 13: 51 bits





| Tag (51 bits) | Index (10 bits) | Byte offset (3 bits) |
|---------------|-----------------|----------------------|
|               |                 |                      |

Index:

Which line (row) should we check?

Where could data be?





### Address division:

| Tag (51 bits) | Index (10 bits) | Byte offset (3 bits) |
|---------------|-----------------|----------------------|
| 4217          | 4               |                      |
|               |                 |                      |
|               |                 |                      |

In parallel, check:

Tag:

Does the cache hold the data we're looking for, or some other block?

Valid bit:

If entry is not valid, don't trust garbage in that line (row).



If tag doesn't match, or line is invalid, it's a miss!







Suppose our addresses are 16 bits long.

- Our cache has 16 entries, block size of 16 bytes
  - 4 bits in address for the index
  - 4 bits in address for byte offset
  - Remaining bits (8): tag

- Let's say we access memory at address:
  - -0110101100110100

- Step 1:
  - Partition address into tag, index, offset

| Line | V | D | Tag | Data<br>(16 Bytes) |
|------|---|---|-----|--------------------|
| 0    |   |   |     |                    |
| 1    |   |   |     |                    |
| 2    |   |   |     |                    |
| 3    |   |   |     |                    |
| 4    |   |   |     |                    |
| 5    |   |   |     |                    |
|      |   |   |     |                    |
| 15   |   |   |     |                    |

- Let's say we access memory at address:
  - **01101011 0011 0100**

- Step 1:
  - Partition address into tag, index, offset

| Line | V | D | Tag | Data<br>(16 Bytes) |
|------|---|---|-----|--------------------|
| 0    |   |   |     |                    |
| 1    |   |   |     |                    |
| 2    |   |   |     |                    |
| 3    |   |   |     |                    |
| 4    |   |   |     |                    |
| 5    |   |   |     |                    |
| •••  |   |   |     |                    |
| 15   |   |   |     |                    |

- Let's say we access memory at address:
  - 01101011 <u>0011</u> 0100

- Step 2:
  - Use index to find line (row)
  - -0011 -> 3

| Line | V | D | Tag | Data<br>(16 Bytes) |
|------|---|---|-----|--------------------|
| 0    |   |   |     |                    |
| 1    |   |   |     |                    |
| 2    |   |   |     |                    |
| 3    |   |   |     |                    |
| 4    |   |   |     |                    |
| 5    |   |   |     |                    |
| •••  |   |   |     |                    |
| 15   |   |   |     |                    |

• Let's say we access memory at address:





- Use index to find line (row)
- -0011 -> 3



..

5

• Let's say we access memory at address:

- 01101011 <u>0011</u> 0100

Note:

 ANY address with 0011 (3) as the middle four index bits will map to this cache line.

- e.g. 11111111 0011 0000



## Direct-Mapped Example

Line Data (16 Bytes) Tag Let's say we access memory at address: 0 **- 01101011 0011 0100** 2 01101011 • Step 3: Check the tag 4 — Is it 01101011 (hit)? 5 – Something else (miss)? – (Must also ensure valid)

15

#### **Eviction**

- If we don't find what we're looking for (miss), we need to bring in the data from memory.
- Make room by kicking something out.
  - If line to be evicted is dirty, write it to memory first.
- Another important systems distinction:
  - Mechanism: An ability or feature of the system.
     What you <u>can</u> do.
  - Policy: Governs the decisions making for using the mechanism. What you <u>should</u> do.

#### **Eviction**

- For direct-mapped cache:
  - Mechanism: overwrite bits in cache line, updating
    - Valid bit
    - Tag
    - Data
  - Policy: not many options for direct-mapped
    - Overwrite at the only location it could be!

### **Eviction: Direct-Mapped**

Address division:

| Tag (51 bits) | Index (10 bits) | Byte offset (3 bits) |
|---------------|-----------------|----------------------|
| 3941          | 1020            |                      |

Data (8 Bytes) Line Tag 0 1020 0 1323 57883 1021 1022 1023

Find line:

Tag doesn't match, bring in from memory.

If dirty, write back first!

## **Eviction: Direct-Mapped**

#### Address division:

| Tag (51 bits) | Index (10 bits) | Byte offset (3 bits) |
|---------------|-----------------|----------------------|
| 3941          | 1020            |                      |

1. Send address to read main memory.

Main Memory

| Line | V | D | Tag  | Data (8 Bytes) |
|------|---|---|------|----------------|
| 0    |   |   |      |                |
| 1    |   |   |      |                |
| 2    |   |   |      |                |
| 3    |   |   |      |                |
| 4    |   |   |      |                |
| •••  |   |   |      |                |
| 1020 | 1 | 0 | 1323 | 57883          |
| 1021 |   |   |      |                |
| 1022 |   |   |      |                |
| 1023 |   |   |      |                |

### **Eviction: Direct-Mapped**

Address division:

| Tag (51 bits) | Index (10 bits) | Byte offset (3 bits) |
|---------------|-----------------|----------------------|
| 3941          | 1020            |                      |

1. Send address to read main memory.

Main Memory



2. Copy data from memory. Update tag.

## **Direct-Mapped**

Address division:



Byte offset tells us which subset of block to retrieve.

Can one read of a variable straddle multiple cache blocks?



### **Direct-Mapped**

Address division:



Byte offset tells us which subset of block to retrieve.

Can one read of a variable straddle multiple cache blocks?

No, recall mem. alignment!



## Direct-Mapped Example

Suppose our addresses are 16 bits long.

- Our cache has 16 entries, block size of 16 bytes
  - 4 bits in address for the index
  - 4 bits in address for byte offset
  - Remaining bits (8): tag

Suppose we had 8-bit addresses, a cache with 8 lines, and a block size of 4 bytes.

- How many bits would we use for:
  - Tag?
  - Index?
  - Offset?

# Suppose a system has 8-bit addresses, a DM cache with 8 lines, and 4-byte block size

- How many bits would be used for:
  - Byte-offset: 2 bits
    - 4 byte block size, can address each byte in 2 bits (00, 01, 10, 11)
  - Index: 3 bits
    - With 8 lines, need 3 bits to encode each cache line number
  - Tag: 3 bits
    - Bits left over in the address after byte-offset and index (8 2 3)
- Which bits would be used for:

ex. 01010011

- Tag? high order
- Index? middle (right after byte offset bits)
- Byte offset? low order

Memory address

Read 01000100 (Value: 5)

Read 11100010 (Value: 17)

Write 01110000 (Value: 7)

Read 10101010 (Value: 12)

Write 01101100 (Value: 2)

| Line | V | D | Tag | Data<br>(4 Bytes) |
|------|---|---|-----|-------------------|
| 0    | 1 | 0 | 111 | 17                |
| 1    | 1 | 0 | 011 | 9                 |
| 2    | 0 | 0 | 101 | 15                |
| 3    | 1 | 1 | 001 | 8                 |
| 4    | 1 | 0 | 011 | 4                 |
| 5    | 0 | 0 | 111 | 6                 |
| 6    | 0 | 0 | 101 | 32                |
| 7    | 1 | 0 | 110 | 3                 |



| V | D | Tag     | Data (4 Bytes) |
|---|---|---------|----------------|
| 1 | 0 | 111     | 17             |
| 1 | 0 | 011 010 | 9 5            |
| 0 | 0 | 101     | 15             |
| 1 | 1 | 001     | 8              |
| 1 | 0 | 011     | 4              |
| 0 | 0 | 111     | 6              |
| 0 | 0 | 101     | 32             |
| 1 | 0 | 110     | 3              |

At line 1, V=1 but tags don't match, so we have a MISS.

Dirty bit is 0, so we can safely overwrite it.

Write: V = 1; D = 0 (we're reading, not writing); tag, data (value)



| V | D | Tag     | Data (4 Bytes) |
|---|---|---------|----------------|
| 1 | 0 | 111     | 17             |
| 1 | 0 | 011 010 | 9 5            |
| 0 | 0 | 101     | 15             |
| 1 | 1 | 001     | 8              |
| 1 | 0 | 011     | 4              |
| 0 | 0 | 111     | 6              |
| 0 | 0 | 101     | 32             |
| 1 | 0 | 110     | 3              |

At line 1, V=1 and tags match, so we have a HIT. What we wanted in cache is there, so we saved time!



| V                 | D | Tag                | Data (4 Bytes) |
|-------------------|---|--------------------|----------------|
| 1                 | 0 | 111                | 17             |
| 1                 | 0 | 011 010            | 9 5            |
| 0                 | 0 | 101                | 15             |
| 1                 | 1 | 001                | 8              |
| 1                 | 0 | 011                | 4              |
| 0                 | 0 | 111                | 6              |
| 0                 | 0 | 101                | 32             |
| <del>0</del><br>1 | 0 | <del>110</del> 101 | 3 2            |

At line 7, V=0 and tags don't match, so we have a MISS.

Dirty bit is 0, so we can safely overwrite it.

Write: V = 1; tag, data (value)



| V                              | D                 | Tag            | Data (4 Bytes)   |
|--------------------------------|-------------------|----------------|------------------|
| 1                              | 0                 | 111            | 17               |
| 1                              | 0                 | 011 010        | 9 5              |
| 0                              | 0                 | 101            | 15               |
| 1                              | 1                 | 001            | 8                |
| 1                              | 0                 | 011            | 4                |
| 0                              | 0                 | 111            | 6                |
| 0                              | 0                 | 101            | 32               |
| <del>0</del><br><del>1</del> 1 | <del>0</del><br>1 | 110 101<br>011 | 3 <b>2</b><br>10 |

At line 7, V=1 but tags don't match, so we have a MISS. D=0, so we can safely overwrite it. Write: V = 1; D = 1 (we're updating cache but it is now out of sync with main memory); tag, data (value)



| V              | D                 | Tag            | Data (4 Bytes)       |
|----------------|-------------------|----------------|----------------------|
| 1              | 0                 | 111            | 17                   |
| 1              | 0                 | 011 010        | 9 5                  |
| 0              | 0                 | 101            | 15                   |
| 1              | 1                 | 001            | 8                    |
| 1              | 0                 | 011            | 4 7                  |
| 0              | 0                 | 111            | 6                    |
| 0              | 0                 | 101            | 32                   |
| θ<br><b>±1</b> | <del>0</del><br>1 | 110 101<br>011 | 3 <del>2</del><br>10 |

At line 4, V=1 and tags match, so we have a HIT. D=0. Write value 7. Set data (value); D = 1 (we're updating cache and it is now out of synch with main memory)



cache. Set V=1, D=0; tag; data (value).

| V                 | D                 | Tag                | Data (4 Bytes)   |
|-------------------|-------------------|--------------------|------------------|
| 1                 | 0                 | 111                | 17               |
| 1                 | 0                 | 011 010            | 9 5              |
| <del>0</del><br>1 | 0                 | <del>101</del> 101 | <del>15</del> 12 |
| 1                 | 1                 | 001                | 8                |
| 1                 | 0                 | 011                | 4 7              |
| 0                 | 0                 | 111                | 6                |
| 0                 | 0                 | 101                | 32               |
| 0<br>11           | <del>0</del><br>1 | 110 101<br>011     | 3 2<br>10        |



| V                 | D                 | Tag                       | Data (4 Bytes)          |
|-------------------|-------------------|---------------------------|-------------------------|
| 1                 | 0                 | 111                       | 17                      |
| 1                 | 0                 | 011 010                   | 9 5                     |
| <del>0</del><br>1 | 0                 | <del>101</del> <b>101</b> | <del>15</del> <b>12</b> |
| 1<br>1            | 1<br>1            | <del>001</del> <b>011</b> | 용 <b>2</b>              |
| 1                 | 0                 | 011                       | 4 7                     |
| 0                 | 0                 | 111                       | 6                       |
| 0                 | 0                 | 101                       | 32                      |
| θ<br><b>11</b>    | <del>0</del><br>1 | 110 101<br>011            | 3 <b>2</b><br>10        |

### Associativity

 Problem: suppose we're only using a small amount of data (e.g., 8 bytes, 4-byte block size)

- Bad luck: (both) blocks map to same cache line
  - Constantly evicting one another
  - Rest of cache is going unused!
- Associativity: allow a set blocks to be stored at the same index. Goal: reduce conflict misses.

### Comparison

#### **Direct-mapped**

- Tag tells you if you found the correct data.
- Offset specifies which byte within block.
- Middle bits (index) tell you which <u>1</u> line to check.
- (+) Low complexity, fast.
- (-) Conflict misses.

#### N-way set associative

- Tag tells you if you found the correct data.
- Offset specifies which byte within block.
- Middle bits (set) tell you which <u>N</u> lines to check.
- (+) Fewer conflict misses.
- (-) More complex, slower, consumes more power.

### **Direct Mapped Cache**

#### **Main Memory**



i and j map to the same cache line and may constantly evict each other!

16 locations

#### Set Associative Cache

#### **Main Memory**



i and j map to the same cache line but different locations in the cache.

16 locations

## Comparison: 1024 Lines

(For the same cache size, in bytes.)

#### **Direct-mapped**

• 1024 indices (10 bits)

#### 2-way set associative

- 512 sets (9 bits)
  - Tag slightly (1 bit) larger.

| Set# | V | D | Tag | Data (8 Bytes) | V | D | Tag | Data (8 Bytes) |
|------|---|---|-----|----------------|---|---|-----|----------------|
| 0    |   |   |     |                |   |   |     |                |
| 1    |   |   |     |                |   |   |     |                |
| 2    |   |   |     |                |   |   |     |                |
| 3    |   |   |     |                |   |   |     |                |
| 4    |   |   |     |                |   |   |     |                |
| •••  |   |   |     |                |   |   | ••• |                |
| 508  |   |   |     |                |   |   |     |                |
| 509  |   |   |     |                |   |   |     |                |
| 510  |   |   |     |                |   |   |     |                |
| 511  |   |   |     |                |   |   |     |                |

## 2-Way Set Associative

| Tag (52 bits) | Set (9 bits) | Byte offset (3 bits) |
|---------------|--------------|----------------------|
| 3941          | 4            |                      |

Same capacity as previous example: 1024 rows with 1 entry vs. 512 rows with 2 entries

|   |      |   |   |      | 32210113       |   |   |      |                |
|---|------|---|---|------|----------------|---|---|------|----------------|
|   | Set# | V | D | Tag  | Data (8 Bytes) | V | D | Tag  | Data (8 Bytes) |
|   | 0    |   |   |      |                |   |   |      |                |
|   | 1    |   |   |      |                |   |   |      |                |
|   | 2    |   |   |      |                |   |   |      |                |
|   | 3    |   |   |      |                |   |   |      |                |
| Ļ | 4    | 1 | 1 | 4063 |                | 1 | 0 | 3941 |                |
|   |      |   |   | •••  |                |   |   | •••  |                |
|   | 508  |   |   |      |                |   |   |      |                |
|   | 509  |   |   |      |                |   |   |      |                |
|   | 510  |   |   |      |                |   |   |      |                |
|   | 511  |   |   |      |                |   |   |      |                |
|   |      |   |   |      |                |   |   |      |                |

## 2-Way Set Associative

| Tag (52 bits) | Set (9 bits) | Byte offset (3 bits) |
|---------------|--------------|----------------------|
| 3941          | 4            |                      |



Check all locations in the set, in parallel.

## 2-Way Set Associative

| Tag (52 bits) | Set (9 bits) | Byte offset (3 bits) |
|---------------|--------------|----------------------|
| 3941          | 4            |                      |



#### **Eviction**

- Mechanism is the same...
  - Overwrite bits in cache line: update tag, valid, data
- Policy: choose which line in the set to evict
  - Pick a random line in set
  - Choose an invalid line first
  - Choose the least recently used block
    - Has exhibited the least locality, kick it out!

Common combo in practice.

### Least Recently Used (LRU)

• Intuition: if it hasn't been used in a while, we have no reason to believe it will be used soon.

 Need extra state to keep track of LRU info: which line is least recently used -> left or right?

| Set# | LRU | V | D | Tag  | Data (8 Bytes) | V | D | Tag  | Data (8 Bytes) |
|------|-----|---|---|------|----------------|---|---|------|----------------|
| 0    | 0   |   |   |      |                |   |   |      |                |
| 1    | 1   |   |   |      |                |   |   |      |                |
| 2    | 1   |   |   |      |                |   |   |      |                |
| 3    | 0   |   |   |      |                |   |   |      |                |
| 4    | 1   | 1 | 1 | 4063 |                | 1 | 0 | 3941 |                |
|      |     |   |   | •••  |                |   |   | •••  |                |

### Least Recently Used (LRU)

• Intuition: if it hasn't been used in a while, we have no reason to believe it will be used soon.

Need extra state to keep track of LRU info.

• For perfect LRU info:

- 2-way: 1 bit

– 4-way: 8 bits

– N-way: N \* log<sub>2</sub> N bits

Another reason why associativity often maxes out at 8 or 16.

These are metadata bits, not "useful" program data storage.

(Approximations make it not quite as bad.)

# Suppose a system has 8-bit addresses, a two-way set associative cache with 8 lines, and 4-byte block size

- How many bits would we use for:
  - Tag?
  - Index?
  - Offset?

# Suppose a system has 8-bit addresses, a two-way set associative cache with 8 lines, and 4-byte block size

- How many bits would we use for:
  - Tag? **3**
  - Set? 3
  - Offset? 2

Read 01000100 (Value: 5)

Read 11100010 (Value: 17)

Write 01100100 (Value: 7)

Read 01000110 (Value: 5)

Write 01100000 (Value: 2)

Tag:

Set:

Offset:

LRU = 0 means the left line in the set was least recently used.

| Set# | LRU | V | D | Tag | Data (4 Bytes) | V | D | Tag | Data (4 Bytes) |
|------|-----|---|---|-----|----------------|---|---|-----|----------------|
| 0    | 0   | 0 | 0 | 111 | 4              | 1 | 0 | 001 | 17             |
| 1    | 1   | 1 | 1 | 111 | 9              | 1 | 0 | 010 | 5              |
| 2    |     |   |   | ••• |                |   |   | ••• |                |
| 3    |     |   |   |     |                |   |   |     |                |
| 4    |     |   |   |     |                |   |   |     |                |
| 5    |     |   |   |     |                |   |   |     |                |
| 6    |     |   |   |     |                |   |   |     |                |
| 7    |     |   |   |     |                |   |   |     |                |

HIT Read 01000100 (Value: 5)

Read 11100010 (Value: 17)

Write 01100100 (Value: 7)

Read 01000110 (Value: 5)

Write 01100000 (Value: 2)

Tag: 3

Set: 3

Offset: 2

LRU = 0 means the left line in the set was least recently used.

| Set # | LRU            | V | D | Tag | Data (4 Bytes) | V | D | Tag | Data (4 Bytes) |
|-------|----------------|---|---|-----|----------------|---|---|-----|----------------|
| 0     | 0              | 0 | 0 | 111 | 4              | 1 | 0 | 001 | 17             |
| 1     | <del>1</del> 0 | 1 | 1 | 111 | 9              | 1 | 0 | 010 | 5              |
| 2     |                |   |   | ••• |                |   |   | ••• |                |
| 3     |                |   |   |     |                |   |   |     |                |
| 4     |                |   |   |     |                |   |   |     |                |
| 5     |                |   |   |     |                |   |   |     |                |
| 6     |                |   |   |     |                |   |   |     |                |
| 7     |                |   |   |     |                |   |   |     |                |

HIT Read 01000100 (Value: 5)

MISS Read 11100010 (Value: 17)

Write 01100100 (Value: 7)

Read 01000110 (Value: 5)

Write 01100000 (Value: 2)

Tag: 3

Index: 3
Offset: 2

LRU = 0 means the left line in the set was least recently used.

| Set # | LRU            | V          | D | Tag | Data (4 Bytes)  | V | D | Tag | Data (4 Bytes) |
|-------|----------------|------------|---|-----|-----------------|---|---|-----|----------------|
| 0     | <del>0</del> 1 | <b>9</b> 1 | 0 | 111 | <del>4</del> 17 | 1 | 0 | 001 | 17             |
| 1     | <del>1</del> 0 | 1          | 1 | 111 | 9               | 1 | 0 | 010 | 5              |
| 2     |                |            |   | ••• |                 |   |   | ••• |                |
| 3     |                |            |   |     |                 |   |   |     |                |
| 4     |                |            |   |     |                 |   |   |     |                |
| 5     |                |            |   |     |                 |   |   |     |                |
| 6     |                |            |   |     |                 |   |   |     |                |
| 7     |                |            |   |     |                 |   |   |     |                |

HIT Read 01000100 (Value: 5)

MISS Read 11100010 (Value: 17)

MISS — Write 01100100 (Value: 7)

Read 01000110 (Value: 5)

Write 01100000 (Value: 2)

Tag: 3

Index: 3
Offset: 2

LRU = 0 means the left line in the set was least recently used.

| Set # | LRU                   | V              | D   | Tag                       | Data (4 Bytes)  | V | D | Tag | Data (4 Bytes) |
|-------|-----------------------|----------------|-----|---------------------------|-----------------|---|---|-----|----------------|
| 0     | <del>0</del> <b>1</b> | θ <sup>1</sup> | 0   | 111                       | <del>4</del> 17 | 1 | 0 | 001 | 17             |
| 1     | 1 0 1                 | <b>1</b> 1     | 1 1 | <del>111</del> <b>011</b> | 9 7             | 1 | 0 | 010 | 5              |
| 2     |                       |                |     | •••                       |                 |   |   |     |                |
| 3     |                       |                |     |                           |                 |   |   |     |                |
| 4     |                       |                |     |                           |                 |   |   |     |                |
| 5     |                       |                |     |                           |                 |   |   |     |                |
| 6     |                       |                |     |                           |                 |   |   |     |                |
| 7     |                       |                |     |                           |                 |   |   |     |                |

Offset: 2

HIT Read 01000100 (Value: 5)

MISS Read 11100010 (Value: 17) Tag: 3 Index: 3

MISS Write 01100100 (Value: 7)

HIT Read 01000110 (Value: 5)

Write 01100000 (Value: 2)

LRU = 0 means the left line in the set was least recently used.

| Set# L | LRU             | V              | D   | Tag                       | Data (4 Bytes)         | V | D | Tag | Data (4 Bytes) |
|--------|-----------------|----------------|-----|---------------------------|------------------------|---|---|-----|----------------|
|        | <del>0</del> 1  | θ <sup>1</sup> |     | 111                       | <del>4</del> <b>17</b> | 1 | 0 | 001 | 17             |
| 1 0    | 10 <sup>1</sup> | 1 <sup>1</sup> | 1 1 | <del>111</del> <b>011</b> | 9 7                    | 1 | 0 | 010 | 5              |
| 2      |                 |                |     | •••                       |                        |   |   |     |                |
| 3      |                 |                |     |                           |                        |   |   |     |                |
| 4      |                 |                |     |                           |                        |   |   |     |                |
| 5      |                 |                |     |                           |                        |   |   |     |                |
| 6      |                 |                |     |                           |                        |   |   |     |                |
| 7      |                 |                |     |                           |                        |   |   |     |                |

Tag: 3

Index: 3

Offset: 2

HIT Read 01000100 (Value: 5)

MISS Read 11100010 (Value: 17)

MISS Write 01100100 (Value: 7)

HIT Read 01000110 (Value: 5)

Write 01100000 (Value: 2)

LRU = 0 means the left line in the set was least recently used.

| Set# LRU  | V              | D          | Tag                       | Data (4 Bytes)         | V              | D                | Tag                       | Data (4 Bytes)  |
|-----------|----------------|------------|---------------------------|------------------------|----------------|------------------|---------------------------|-----------------|
| 0 0 01    | θ <sup>1</sup> | 0          | 111                       | <del>4</del> <b>17</b> | 4 <sup>1</sup> | 0 <mark>1</mark> | <del>001</del> <b>011</b> | <del>17</del> 2 |
| 1 0 1 0 1 | 1 <sup>1</sup> | <b>1</b> 1 | <del>111</del> <b>011</b> | 9 7                    | 1              | 0                | 010                       | 5               |
| 2         |                |            | •••                       |                        |                |                  | •••                       |                 |
| 3         |                |            |                           |                        |                |                  |                           |                 |
| 4         |                |            |                           |                        |                |                  |                           |                 |
| 5         |                |            |                           |                        |                |                  |                           |                 |
| 6         |                |            |                           |                        |                |                  |                           |                 |
| 7         |                |            |                           |                        |                |                  |                           |                 |

### Cache Conscious Programming

Knowing about caching and designing code around it can significantly effect performance

(ex) 2D array accesses

```
for(i=0; i < N; i++) {
  for(j=0; j < M; j++) {
    sum += arr[i][j];
}
</pre>
for(j=0; j < M; j++) {
  for(i=0; i < N; i++) {
    sum += arr[i][j];
}
</pre>
```

Algorithmically, both O(N \* M).

Is one faster than the other?

### Cache Conscious Programming

Knowing about caching and designing code around it can significantly effect performance

(ex) 2D array accesses

```
for(i=0; i < N; i++) {
  for(j=0; j < M; j++) {
    sum += arr[i][j];
}

A. is faster.

for(j=0; j < M; j++) {
  for(i=0; i < N; i++) {
    sum += arr[i][j];
}

B. is faster.</pre>
```

Algorithmically, both O(N \* M).

Is one faster than the other?

C. Both would exhibit roughly equal performance.

### Cache Conscious Programming

The first nested loop is more efficient if the cache block size is larger than a single array bucket (for arrays of basic C types, it will be).

```
for(i=0; i < N; i++) {
  for(j=0; j < M; j++) {
    sum += arr[i][j];
  }
}</pre>
for(j=0; j < M; j++) {
  for(i=0; i < N; i++) {
    sum += arr[i][j];
  }
}
```





(ex) 1 miss every 4 buckets vs. 1 miss every bucket

### FYI: Example Cache Organization

Multicore processor: multiple CPU cores/chip



#### **Example Sizes**

#### & Access Times

(KB: 2<sup>10</sup> MB: 2<sup>20</sup> GB: 2<sup>30</sup>)

L1: Size: 32 KB
Access: 4 cycles
write through
low associativity

L2: Size: 256 KB

Access: 10 cycles

L3: Size: 8 MB

Access: 40 cycles

write back

higher associativity

Main Memory (off chip):

Size: 16 GB

Access: 100 cycles

## Program Efficiency and Memory

- Be aware of how your program accesses data
  - Sequentially, in strides of size X, randomly, ...
  - How data is laid out in memory
- Will allow you to structure your code to run much more efficiently based on how it accesses its data

- Don't go nuts...
  - Optimize the most important parts, ignore the rest
  - "Premature optimization is the root of all evil." -Knuth

#### Amdahl's Law

<u>Idea</u>: an optimization can improve total runtime at most by the fraction it contributes to total runtime

If program takes 100 secs to run, and you optimize a portion of the code that accounts for 2% of the runtime, the best your optimization can do is improve the runtime by 2 secs!

Amdahl's Law tells us to focus our optimization efforts on the code that matters:

Speed-up what is accounting for the largest portion of runtime to get the largest benefit. And, don't waste time on the small stuff.