Cache Memory - SRAM

Another distinctly different type of memory exists that is significantly faster than most types of DRAM. SRAM stands for static RAM, which is so named because it does not need the periodic refreshes that DRAM requires. Because of how SRAMs are designed, not only are refreshes unnecessary, but SRAM is much faster than DRAM and much more capable of keeping pace with modern processors.

SRAM is available with access times of 2ns or less, so it can keep pace with processors running at 500MHz or faster. This speed comes from the SRAM design, which calls for a cluster of six transistors for each bit of storage.

The use of transistors but no capacitors means that refreshes are not necessary because there are no capacitors to lose their charges over time. As long as there is power, SRAM remembers what is stored. With these attributes, why don't we use SRAM for all system memory?

The answer is simple: compared to DRAM, SRAM is much faster but also much lower in density and much more expensive.

Comparing DRAM and SRAM

Type    Speed    Density    Cost
DRAM    Slow     High       Low
SRAM    Fast     Low        High

The lower density means that SRAM chips are physically larger per bit and store fewer bits overall. The high number of transistors and the clustered design also make SRAM chips much more expensive to produce than DRAM chips.

For example, a DRAM module might contain 64MB of RAM or more, whereas SRAM modules of the same approximate physical size would have room for only 2MB or so of data and would cost the same as the 64MB DRAM module. Basically, SRAM is up to 30 times larger physically and up to 30 times more expensive than DRAM.

The high cost and physical constraints have prevented SRAM from being used as the main memory for PC systems. Even though SRAM is too expensive for PC use as main memory, PC designers have found a way to use SRAM to dramatically improve PC performance.

Rather than spend the money to make all system RAM out of SRAM, which could run fast enough to match the CPU, it is much more cost-effective to design in a small amount of high-speed SRAM, called cache memory. The cache runs at speeds close to or even equal to the processor and is the memory the processor usually reads from and writes to directly.

During read operations, the high-speed cache memory is filled in advance with data from the lower-speed main memory, or DRAM. Until recently, DRAM was limited to about 60ns (16MHz) in speed. To convert an access time in nanoseconds to MHz, use the following formula:

1 / nanoseconds x 1000 = MHz

Likewise, to convert from MHz to nanoseconds, use the following inverse formula:

1 / MHz x 1000 = nanoseconds
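
As a quick check of these formulas, here is a minimal Python sketch; the helper names ns_to_mhz and mhz_to_ns are made up for illustration:

    def ns_to_mhz(ns):
        # 1 / nanoseconds x 1000 = MHz
        return 1 / ns * 1000

    def mhz_to_ns(mhz):
        # 1 / MHz x 1000 = nanoseconds
        return 1 / mhz * 1000

    print(ns_to_mhz(60))    # 60ns DRAM works out to about 16.7MHz
    print(ns_to_mhz(2))     # 2ns SRAM works out to 500MHz
    print(mhz_to_ns(3000))  # a 3GHz processor cycles every ~0.33ns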

When PC systems were running at 16MHz and less, the DRAM could fully keep pace with the motherboard and system processor, and there was no need for cache. However, as soon as processors crossed the 16MHz barrier, DRAM could no longer keep pace, and that is exactly when SRAM began to enter PC system designs.

This occurred back in 1986 and 1987 with the debut of systems using the 386 processor running at 16MHz and 20MHz. These were among the first PC systems to employ what's called cache memory, a high-speed buffer made up of SRAM that directly feeds the processor.

Because the cache can run at the speed of the processor, the system is designed so that the cache controller anticipates the processor's memory needs and preloads the high-speed cache memory with that data. Then, as the processor calls for a memory address, the data can be retrieved from the high-speed cache rather than the much lower-speed main memory.

Cache effectiveness is expressed as a hit ratio. This is the ratio of cache hits to total memory accesses. A hit occurs when the data the processor needs has been preloaded into the cache from the main memory, meaning the processor can read it from the cache.

A cache miss occurs when the cache controller did not anticipate the need for a specific address and the desired data was not preloaded into the cache. In that case, the processor must retrieve the data from the slower main memory instead of the faster cache.
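
To see what the hit ratio means in practice, here is a short sketch of the standard average-access-time calculation; the 90% hit ratio and the cycle times are assumed figures for illustration, not measurements:

    def avg_access_ns(hit_ratio, cache_ns, memory_ns):
        # Hits are served at cache speed; misses fall through to main memory.
        return hit_ratio * cache_ns + (1 - hit_ratio) * memory_ns

    # Assumed: 0.33ns on-die cache, 2.5ns main memory, 90% hit ratio.
    print(avg_access_ns(0.90, 0.33, 2.5))  # roughly 0.55ns on average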

Anytime the processor reads data from main memory, it must wait longer because the main memory cycles at a much slower rate than the processor. If a processor with integral on-die cache is running at 3000MHz (3GHz), both the processor and the integral cache would be cycling at 0.33ns, while the main memory would most likely be cycling 7.5 times more slowly, at 2.5ns (a 5ns clock effectively doubled by DDR, or double data rate).

Therefore, the memory would be running at only a 400MHz equivalent rate. So, every time the 3GHz processor reads from main memory, it would effectively slow down 7.5-fold to only 400MHz! The slowdown is accomplished by having the processor execute what are called wait states, which are cycles in which nothing is done; the processor essentially cools its heels while waiting for the slower main memory to return the desired data.
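
The arithmetic behind these figures follows directly from the conversion formula given earlier; this sketch simply recomputes them:

    # Recompute the figures from the text using the ns/MHz conversion.
    cpu_mhz = 3000                     # 3GHz processor and on-die cache
    cache_ns = 1 / cpu_mhz * 1000      # ~0.33ns per processor cycle
    memory_ns = 2.5                    # main memory transfer time
    memory_mhz = 1 / memory_ns * 1000  # 400MHz equivalent rate

    print(round(cache_ns, 2))    # 0.33
    print(memory_mhz)            # 400.0
    print(memory_ns / cache_ns)  # 7.5 (times slower than the processor)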

Obviously, you don't want your processors slowing down, so cache function and design become more important as system speeds increase. To minimize the processor being forced to read data from the slow main memory, two stages of cache usually exist in a modern system, called Level 1 (L1) and Level 2 (L2).

The L1 cache is also called integral or internal cache because it is directly built into the processor and is actually a part of the processor die (raw chip). Because of this, L1 cache always runs at the full speed of the processor core and is the fastest cache in any system.

All 486 and higher processors incorporate integral L1 cache, making them significantly faster than their predecessors. L2 cache is also called external cache because it is external to the processor chip. Originally, this meant it was installed on the motherboard, as was the case with all 386, 486, and Pentium systems.

In those systems, the L2 cache runs at motherboard speed because it is installed on the motherboard. You typically can find the L2 cache directly next to the processor socket in Pentium and earlier systems.

In the interest of improved performance, later processor designs from Intel and AMD have included the L2 cache as a part of the processor. In all the processors since late 1999 (and some earlier models), the L2 cache was directly incorporated as a part of the processor die just like the L1 cache.

In chips with on-die L2, the cache runs at the full core speed of the processor and is much more efficient. By contrast, most processors from 1999 and earlier had L2 cache in separate chips that were external to the main processor core. The L2 cache in many of these older chips ran at only half or one-third the processor core speed.

Cache speed is very important, so systems having L2 cache on the motherboard were the slowest. Including L2 inside the processor made it faster, and including it directly on the processor die (rather than as chips external to the die) is the fastest yet.

Any chip that has on-die full core speed L2 cache has a distinct performance advantage over any chip that doesn't. Processors with built-in L2 cache, whether it's on-die or not, still run the cache more quickly than any found on the motherboard.

Thus, most motherboards designed for processors with built-in cache don't have any cache on the board; all the cache is contained in the processor module instead. The Itanium processor family from Intel (used in large network servers) has three levels of cache within the processor module for even greater performance.

More cache and more levels of cache help mitigate the speed differential between the fast processor core and the relatively slow motherboard and main memory. The key to understanding both cache and main memory is to see where they fit in the overall system architecture.

Cache designs originally were asynchronous, meaning they ran at a clock speed that was not identical or in sync with the processor bus. Starting with the 430FX chipset released in early 1995, a new type of synchronous cache design was supported.

It required that the cache chips run in sync, at the same clock timing as the processor bus, further improving speed and performance. Also added at that time was a feature called pipeline burst mode, which reduces overall cache latency (wait states) by allowing single-cycle accesses for multiple transfers after the first one.

Because both synchronous and pipeline burst capability came at the same time in new modules, specifying one usually implies the other. Synchronous pipeline burst cache allowed for about a 20% improvement in overall system performance, which was a significant jump.
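
To make the pipeline burst benefit concrete, here is a sketch using typical textbook cycle counts for a four-transfer burst read; the 3-2-2-2 and 3-1-1-1 timings are illustrative assumptions, not measurements of any specific chipset:

    # Cycles per transfer in a four-transfer burst read (assumed figures).
    async_cache = [3, 2, 2, 2]  # asynchronous cache: wait states on each transfer
    burst_cache = [3, 1, 1, 1]  # pipeline burst: single-cycle after the first

    print(sum(async_cache))  # 9 cycles per burst
    print(sum(burst_cache))  # 6 cycles per burst, a one-third reduction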

The cache controller for a modern system is contained in either the North Bridge of the chipset, as with Pentium and lesser systems, or within the processor, as with the Pentium Pro/II and newer systems. The capabilities of the cache controller dictate the cache's performance and capabilities.

One important thing to note is that most external cache controllers have a limitation on the amount of memory that can be cached. Often, this limit can be quite low, as with the 430TX chipset-based Pentium systems. Most original Pentium class chipsets such as the 430FX/VX/TX can cache data only within the first 64MB of system RAM.

If you add more memory than that, you will see a noticeable slowdown in system performance because all data outside the first 64MB is never cached and is always accessed with all the wait states required by the slower DRAM. Depending on what software you use and where data is stored in memory, this can be significant.

For example, 32-bit operating systems such as Windows load from the top down, so if you had 96MB of RAM, the operating system and applications would load directly into the upper 32MB (past 64MB), which is not cached. This results in a dramatic slowdown in overall system use.

The solution is to remove the additional memory and bring the system total back down to the cacheable limit of 64MB. In short, it is unwise to install more main RAM than your system (CPU or chipset) can cache.

Chipsets made for the Pentium Pro/II and later processors did not control the L2 cache because it was moved into the processor instead. So, with the Pentium Pro/II and beyond, the processor sets the cacheability limits. The Pentium Pro and some of the earlier Pentium IIs can address up to 64GB but only cache up to 512MB.

The later Pentium IIs and all Pentium III and Pentium 4 processors can cache up to 4GB. Most desktop chipsets for those processors allow only up to 1GB, 2GB, or 4GB of RAM anyway, making cacheability limits moot. All the server-oriented Xeon processors can cache up to 64GB. This is beyond the maximum RAM support of any of the chipsets.

In any case, it is important not to install more memory than the cache controller can support. To find the cacheability limit for your system, consult the chipset documentation if you have a Pentium class or older system (or any system with cache on the motherboard), or the processor documentation if you have a Pentium II class or newer system (or any system with all the cache integrated into the CPU).