Cache Memory Organization

You know that cache stores copies of data from various main memory addresses. Because the cache cannot hold copies of the data from all the addresses in main memory simultaneously, there has to be a way to know which addresses are currently copied into the cache so that, if we need data from those addresses, it can be read from the cache rather than from the main memory.

This function is performed by Tag RAM, which is additional memory in the cache that holds an index of the addresses that are copied into the cache. Each line of cache memory has a corresponding address tag that stores the main memory address of the data currently copied into that particular cache line.

If data from a particular main memory address is needed, the cache controller can quickly search the address tags to see whether the requested address is currently being stored in the cache (a hit) or not (a miss). If the data is there, it can be read from the faster cache; if it isn't, it has to be read from the much slower main memory.

Various ways of organizing or mapping the tags affect how cache works. A cache can be mapped as fully associative, direct-mapped, or set associative. In a fully associative mapped cache, when a request is made for data from a specific main memory address, the address is compared against all the address tag entries in the cache tag RAM.

If the requested main memory address is found in the tag RAM (a hit), the data in the corresponding cache line is returned. If the requested address is not found in the address tag entries, a miss occurs and the data must be retrieved from the main memory address instead of the cache.
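
To make the tag comparison concrete, here is a minimal Python sketch of a fully associative lookup. The class and field names (FullyAssociativeCache, tags, lines), the line count, and the trivial replacement choice are illustrative assumptions, not a description of any particular controller:

```python
# Minimal sketch of a fully associative lookup: every tag is compared.
LINE_SIZE = 16  # bytes per cache line, as in the examples later in this section

class FullyAssociativeCache:
    def __init__(self, num_lines):
        self.tags = [None] * num_lines   # tag RAM: one block address per line
        self.lines = [None] * num_lines  # cached data for each line

    def read(self, address, main_memory):
        block = address // LINE_SIZE               # block of memory holding this address
        for i, tag in enumerate(self.tags):        # check every address tag
            if tag == block:
                return self.lines[i]               # hit: data comes from the fast cache
        victim = 0                                 # miss: replacement policy omitted
        self.tags[victim] = block
        self.lines[victim] = main_memory[block * LINE_SIZE:(block + 1) * LINE_SIZE]
        return self.lines[victim]

memory = bytes(range(256)) * 16                    # 4KB of stand-in main memory
cache = FullyAssociativeCache(num_lines=8)
cache.read(0x40, memory)                           # miss: line is filled from memory
cache.read(0x42, memory)                           # hit: same 16-byte line
```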

In a direct-mapped cache, specific main memory addresses are preassigned to specific line locations in the cache where they will be stored. Therefore, the tag RAM can use fewer bits because, when you know which main memory address you want, only one address tag needs to be checked, and each tag needs to store only the possible addresses a given line can contain (see the sketch that follows).
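
Here is a similarly hedged sketch of how a direct-mapped design splits an address into an index (which line) and a tag (what must be stored and compared); the line count and helper name are chosen for illustration:

```python
# Minimal sketch of direct mapping: the index picks exactly one line,
# so only one stored tag ever has to be compared.
LINE_SIZE = 16
NUM_LINES = 128   # illustrative line count

def split_address(address):
    block = address // LINE_SIZE
    index = block % NUM_LINES    # the one cache line preassigned to this address
    tag = block // NUM_LINES     # only these remaining bits go into the tag RAM
    return index, tag

# Addresses 0x0000 and 0x8000 map to the same line (index 0), so a
# direct-mapped cache can hold only one of them at a time.
print(split_address(0x0000))   # (0, 0)
print(split_address(0x8000))   # (0, 16)
```

Note how two addresses that share an index can never be cached at the same time, which is one source of the extra misses mentioned later.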

The direct-mapped design also results in faster operation because only one tag address needs to be checked for a given memory address. A set associative cache is a modified direct-mapped cache. A direct-mapped cache has only one set of memory associations, meaning a given memory address can be mapped into (or associated with) only a specific given cache line location.

A two-way set associative cache has two sets, so that a given memory location can be stored in one of two cache line locations. A four-way set associative cache can store a given memory address in four different cache line locations (or sets). By increasing the set associativity, the chance of finding a given address in the cache increases; however, lookups take a little longer because more tag addresses must be checked when searching for a specific location in the cache.

In essence, each set in an n-way set associative cache is a sub-cache that has associations with each main memory address. As the number of sub-caches or sets increases, eventually the cache becomes fully associative—a situation in which any memory address can be stored in any cache line location.
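
The following minimal sketch shows an n-way lookup; each "way" here corresponds to one of the sets (blocks) described above, and all names and sizes are illustrative assumptions:

```python
# Minimal sketch of an n-way set associative lookup: the index selects a
# row of the tag RAM, and only the n tags in that row are compared.
LINE_SIZE = 16

class SetAssociativeCache:
    def __init__(self, num_rows, ways):
        self.num_rows = num_rows
        self.ways = ways
        self.tag_ram = [[None] * ways for _ in range(num_rows)]  # one row per index

    def lookup(self, address):
        block = address // LINE_SIZE
        index = block % self.num_rows      # which row this address must live in
        tag = block // self.num_rows       # bits kept in the tag RAM
        return tag in self.tag_ram[index]  # n comparisons, not one and not all

    def fill(self, address, way):
        block = address // LINE_SIZE
        self.tag_ram[block % self.num_rows][way] = block // self.num_rows

cache = SetAssociativeCache(num_rows=128, ways=4)
cache.fill(0x1230, way=0)
print(cache.lookup(0x1234))   # True: same 16-byte line, found in one of four ways
```

With ways=1 the sketch degenerates to a direct-mapped cache; with a single row it behaves like a fully associative one.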

An n-way set associative cache is therefore a compromise between a fully associative cache and a direct-mapped cache. In general, a direct-mapped cache is the fastest at locating and retrieving data from the cache because it has to look at only one specific tag address for a given memory address.

However, it also results in more misses overall than the other designs. A fully associative cache offers the highest hit ratio but is the slowest at locating and retrieving the data because it has many more address tags to check through. An n-way set associative cache is a compromise between optimizing cache speed and hit ratio.

But the more associativity there is, the more hardware (tag bits, comparator circuits, and so on) is required, making the cache more expensive. Obviously, cache design is a series of tradeoffs, and what works best in one instance might not work best in another.

Multitasking environments such as Windows are good examples of environments in which the processor needs to operate on different areas of memory simultaneously and in which an n-way cache can improve performance. The organization of the cache memory in the 486 and MMX Pentium family is called a four-way set associative cache, which means that the cache memory is split into four blocks.

Each block also is organized as 128 or 256 lines of 16 bytes each. The following table shows the associativity of various processor L1 and L2 caches.

Processor            L1 Cache Associativity   L2 Cache Associativity
486                  Four-way                 Not in CPU
Pentium (non-MMX)    Two-way                  Not in CPU
Pentium MMX          Four-way                 Not in CPU
Pentium Pro/II/III   Four-way                 Four-way (off-die)
Pentium III/4        Four-way                 Eight-way (on-die)
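
As a quick arithmetic check of the four-way organization described above (four blocks of 128 or 256 lines at 16 bytes per line), the total capacity works out to 8KB or 16KB:

```python
# Total capacity of a four-way cache built from four blocks of
# 128 or 256 lines, 16 bytes per line.
for lines_per_block in (128, 256):
    total_bytes = 4 * lines_per_block * 16
    print(lines_per_block, "lines per block ->", total_bytes // 1024, "KB")
# 128 lines per block -> 8 KB
# 256 lines per block -> 16 KB
```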

The contents of the cache must always be in sync with the contents of main memory to ensure that the processor is working with current data. For this reason, the internal cache in the 486 family is a write-through cache.

Write-through means that when the processor writes information out to the cache, that information is automatically written through to main memory as well. By comparison, the Pentium and later chips have an internal write-back cache, which means that both reads and writes are cached, further improving performance.
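
The difference between the two policies can be sketched in a few lines of Python; the class names and the dictionary-based storage are illustrative simplifications (real controllers track dirty state per cache line):

```python
# Sketch of the two write policies; 'memory' is any mutable mapping
# standing in for main RAM.

class WriteThroughCache:
    def __init__(self, memory):
        self.memory = memory
        self.lines = {}

    def write(self, address, value):
        self.lines[address] = value
        self.memory[address] = value   # every write goes straight through to RAM

class WriteBackCache:
    def __init__(self, memory):
        self.memory = memory
        self.lines = {}
        self.dirty = set()

    def write(self, address, value):
        self.lines[address] = value
        self.dirty.add(address)        # RAM is not touched yet

    def evict(self, address):
        if address in self.dirty:      # data reaches RAM only when the line leaves
            self.memory[address] = self.lines[address]
            self.dirty.discard(address)
        self.lines.pop(address, None)
```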

Even though the internal 486 cache is write-through, the system can employ an external write-back cache for increased performance. In addition, the 486 can buffer up to 4 bytes before actually storing the data in RAM, improving efficiency in case the memory bus is busy.

Another feature of improved cache designs is that they are nonblocking. This is a technique for reducing or hiding memory delays by exploiting the overlap of processor operations with data accesses. A nonblocking cache enables program execution to proceed concurrently with cache misses as long as certain dependency constraints are observed.

In other words, the cache can handle a cache miss much better and enable the processor to continue doing something that does not depend on the missing data, as sketched below.
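
A rough sketch of that hit-under-miss idea follows; the names are illustrative, and real hardware uses miss status holding registers rather than a simple list:

```python
# Sketch of hit-under-miss: a miss is recorded and execution can continue
# with accesses that do not depend on the missing data.

class NonblockingCache:
    def __init__(self):
        self.lines = {}
        self.pending_misses = []            # fetches still outstanding in main memory

    def access(self, address):
        if address in self.lines:
            return self.lines[address]      # hits are serviced even while misses are pending
        self.pending_misses.append(address) # record the miss; do not stall
        return None                         # the data will arrive later via fill()

    def fill(self, address, data):
        if address in self.pending_misses:  # main memory has finally responded
            self.pending_misses.remove(address)
        self.lines[address] = data
```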

The cache controller built into the processor also is responsible for watching the memory bus when alternative processors, known as bus masters, are in control of the system. This process of watching the bus is referred to as bus snooping. If a bus master device writes to an area of memory that is currently also stored in the processor cache, the cache contents and memory no longer agree.

The cache controller then marks this data as invalid and reloads the cache during the next memory access, preserving the integrity of the system.
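
Conceptually, the snoop-and-invalidate behavior looks something like this sketch (names are illustrative; real coherency protocols track more states per line):

```python
# Sketch of snoop invalidation: a write by another bus master marks the
# matching cache line invalid so stale data is never returned.

class SnoopingCache:
    def __init__(self):
        self.lines = {}                    # address -> (valid, data)

    def snoop_write(self, address):
        if address in self.lines:          # another bus master changed this location
            _, data = self.lines[address]
            self.lines[address] = (False, data)   # invalidate; reload on next access

    def read(self, address, main_memory):
        entry = self.lines.get(address)
        if entry and entry[0]:
            return entry[1]                # valid hit
        data = main_memory[address]        # miss or invalidated: reload from RAM
        self.lines[address] = (True, data)
        return data
```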

All PC processor designs that support cache memory include a feature known as a translation lookaside buffer (TLB) to improve recovery from cache misses. The TLB is a table inside the processor that stores information about the location of recently accessed memory addresses. The TLB speeds up the translation of virtual addresses to physical memory addresses.

To improve TLB performance, several recent processors have increased the number of entries in the TLB, as AMD did when it moved from the Athlon Thunderbird core to the Palomino core. Pentium 4 processors that support HT Technology have a separate instruction TLB (iTLB) for each virtual processor thread.
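
A minimal sketch of the idea (a small table of recent translations consulted before the slower page table walk) follows; the 4KB page size, entry count, and eviction rule are illustrative assumptions:

```python
# Sketch of a TLB: a small table of recent virtual-to-physical page
# translations consulted before the full page table.
PAGE_SIZE = 4096   # assumed page size

class TLB:
    def __init__(self, entries=64):         # entry count chosen for illustration
        self.entries = entries
        self.map = {}                        # virtual page -> physical page

    def translate(self, virtual_address, page_table):
        vpage, offset = divmod(virtual_address, PAGE_SIZE)
        if vpage in self.map:                # TLB hit: no page table walk needed
            return self.map[vpage] * PAGE_SIZE + offset
        ppage = page_table[vpage]            # TLB miss: slower page table lookup
        if len(self.map) >= self.entries:
            self.map.pop(next(iter(self.map)))   # crude eviction for the sketch
        self.map[vpage] = ppage
        return ppage * PAGE_SIZE + offset
```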

As clock speeds increase, cycle time decreases. Newer systems no longer use cache on the motherboard because the faster DDR SDRAM or RDRAM used in modern Pentium 4/Celeron or Athlon systems can keep up with the motherboard speed.

Modern processors all integrate the L2 cache into the processor die just like the L1 cache. This enables the L2 to run at full-core speed because it is now a part of the core. Cache speed is always more important than size. The rule is that a smaller but faster cache is always better than a slower but bigger cache.