Cache memory is random access memory (RAM) that a computer microprocessor can access
more quickly than it can access regular RAM. As the microprocessor processes data, it looks first
in the cache memory; if it finds the data there (from a previous read), it does not
have to do the more time-consuming read from the larger main memory.
Cache memory is sometimes described in levels of closeness and accessibility to the
microprocessor. An L1 cache is on the same chip as the microprocessor. (For example,
the PowerPC 601 processor has a 32-kilobyte level-1 cache built into its chip.) L2 is usually a
separate static RAM (SRAM) chip. The main RAM is usually a dynamic RAM (DRAM) chip.
In addition to cache memory, one can think of RAM itself as a cache of memory for hard
disk storage, since all of RAM's contents come from the hard disk initially when you turn your
computer on and load the operating system (you are loading it into RAM) and later as you start
new applications and access new data. RAM can also contain a special area called a disk
cache that contains the data most recently read in from the hard disk.
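The disk-cache idea can be sketched in a few lines of Python. This is an illustrative model only (the class name, the capacity, and the `read_from_disk` callback are assumptions, not a real operating-system interface); it shows why re-reads of a recently used block skip the slow disk path.

```python
from collections import OrderedDict

class DiskCache:
    """A minimal LRU disk cache: recently read blocks are kept in RAM
    so repeated reads skip the (much slower) disk access."""

    def __init__(self, capacity, read_from_disk):
        self.capacity = capacity              # max blocks held in RAM
        self.read_from_disk = read_from_disk  # fallback for misses (slow path)
        self.blocks = OrderedDict()           # block_id -> data, in LRU order
        self.hits = self.misses = 0

    def read(self, block_id):
        if block_id in self.blocks:
            self.hits += 1
            self.blocks.move_to_end(block_id)   # mark as most recently used
            return self.blocks[block_id]
        self.misses += 1
        data = self.read_from_disk(block_id)    # slow path: go to disk
        self.blocks[block_id] = data
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)     # evict least recently used
        return data

# Re-reading the same block hits the RAM copy instead of the "disk".
cache = DiskCache(capacity=2, read_from_disk=lambda b: f"data-{b}")
for b in [1, 2, 1, 1, 3, 2]:
    cache.read(b)
print(cache.hits, cache.misses)  # 2 hits (the re-reads of block 1), 4 misses
```

The eviction policy (least recently used) is one common choice for disk caches; real systems also handle writes and block sizes, which this sketch omits.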
Cache memory is a type of RAM that a computer checks first when looking for the information it
needs. The computer can access it more quickly than regular RAM and need look no further,
provided the information is in the cache.
A CPU cache is a cache used by the central processing unit (CPU) of a computer to reduce the
average time to access memory. The cache is a smaller, faster memory which stores copies of the
data from frequently used main memory locations. Most CPUs have different independent caches,
including instruction and data caches, where the data cache is usually organized as a hierarchy of
cache levels (L1, L2, etc.).
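The look-in-cache-first behavior described above can be sketched as a toy direct-mapped cache model. The line size and line count below are arbitrary illustrative values; a real cache also stores the data itself and handles writes, which this sketch omits.

```python
LINE_SIZE = 64      # bytes per cache line (a typical value)
NUM_LINES = 4       # deliberately tiny cache, for illustration

class DirectMappedCache:
    def __init__(self):
        self.tags = [None] * NUM_LINES   # which memory block each line holds
        self.hits = self.misses = 0

    def access(self, address):
        block = address // LINE_SIZE     # memory block containing the address
        index = block % NUM_LINES        # the one line this block may occupy
        if self.tags[index] == block:
            self.hits += 1               # fast path: found in cache
        else:
            self.misses += 1             # slow path: fetch from the next level
            self.tags[index] = block     # and keep a copy for next time

cache = DirectMappedCache()
for addr in [0, 8, 16, 64, 0, 8]:   # 0, 8, 16 fall in one line; 64 in another
    cache.access(addr)
print(cache.hits, cache.misses)     # 4 hits, 2 misses
```

In a real hierarchy the "slow path" would itself try L2, then L3, then main memory, each level larger and slower than the one before.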
Cache (pronounced cash) memory is extremely fast memory that is built into a computer's central
processing unit (CPU), or located next to it on a separate chip. The CPU uses cache memory to store
instructions that are repeatedly required to run programs, improving overall system speed. The
advantage of cache memory is that the CPU does not have to use the motherboard's system bus for
data transfer. Whenever data must be passed through the system bus, the data transfer speed slows
to the motherboard’s capability. The CPU can process data much faster by avoiding the bottleneck
created by the system bus.
As it happens, once most programs are open and running, they use very few resources. When these
resources are kept in cache, programs can operate more quickly and efficiently. All else being equal,
cache is so effective in system performance that a computer running a fast CPU with little cache can
have lower benchmarks than a system running a somewhat slower CPU with more cache. Cache built
into the CPU itself is referred to as Level 1 (L1) cache. Cache that resides on a separate chip next to
the CPU is called Level 2 (L2) cache. Some CPUs have both L1 and L2 cache built in and designate
the separate cache chip as Level 3 (L3) cache.
Cache that is built into the CPU is faster than separate cache, running at the speed of
the microprocessor itself. However, separate cache is still roughly twice as fast as Random Access
Memory (RAM). Cache is more expensive than RAM, but it is well worth getting a CPU and
motherboard with built-in cache in order to maximize system performance.
Disk caching applies the same principle to the hard disk that memory caching applies to the CPU.
Frequently accessed hard disk data is stored in a separate segment of RAM in order to avoid having
to retrieve it from the hard disk over and over. In this case, RAM is faster than the platter technology
used in conventional hard disks. This situation is changing, however, as hybrid hard disks become
ubiquitous. These disks have built-in flash memory caches. Eventually, hard drives may be 100% flash
drives; flash memory is still slower than RAM, but it is so much faster than spinning platters that it
greatly reduces the benefit of RAM disk caching.
(cash cōhēr´ens) (n.) A protocol for managing the caches of a multiprocessor system so that no data is lost or
overwritten before the data is transferred from a cache to the target memory. When two or more computer processors
work together on a single program, known as multiprocessing, each processor may have its own memory cache that
is separate from the larger RAM that the individual processors will access. A memory cache, sometimes called
a cache store or RAM cache, is a portion of memory made of high-speed static RAM (SRAM) instead of the slower
and cheaper dynamic RAM (DRAM) used for main memory. Memory caching is effective because
most programs access the same data or instructions over and over. By keeping as much of this information as
possible in SRAM, the computer avoids accessing the slower DRAM.
When multiple processors with separate caches share a common memory, it is necessary to keep the caches in a
state of coherence by ensuring that any shared operand that is changed in any cache is changed throughout the
entire system. This is done in either of two ways: through a directory-based or a snooping system. In a directory-
based system, the data being shared is placed in a common directory that maintains the coherence between caches.
The directory acts as a filter through which the processor must ask permission to load an entry from the primary
memory to its cache. When an entry is changed the directory either updates or invalidates the other caches with that
entry. In a snooping system, all caches on the bus monitor (or snoop) the bus to determine if they have a copy of the
block of data that is requested on the bus. Every cache has a copy of the sharing status of every block of physical
memory it has.
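A minimal sketch of bus snooping with write-invalidate might look like the following. The classes and the bus representation are simplified assumptions (a Python list stands in for the shared bus), but the key behavior matches the text: a write observed on the bus invalidates other caches' copies.

```python
class SnoopingCache:
    """Write-invalidate snooping: every cache watches the shared bus and
    drops its own copy when another cache writes the same location."""

    def __init__(self, name, bus):
        self.name, self.data = name, {}
        self.bus = bus
        bus.append(self)                      # join the shared bus

    def read(self, addr, memory):
        if addr not in self.data:             # miss: load from main memory
            self.data[addr] = memory[addr]
        return self.data[addr]

    def write(self, addr, value, memory):
        memory[addr] = value
        self.data[addr] = value
        for cache in self.bus:                # the write is seen on the bus
            if cache is not self:
                cache.data.pop(addr, None)    # snoop hit -> invalidate copy

bus, memory = [], {0x10: 1}
a, b = SnoopingCache("A", bus), SnoopingCache("B", bus)
a.read(0x10, memory)                # A caches the block
b.write(0x10, 2, memory)            # B's write is broadcast; A snoops it
was_invalidated = 0x10 not in a.data
print(was_invalidated)              # True: A dropped its stale copy
print(a.read(0x10, memory))         # 2: A re-reads and sees the new value
```

A write-update (snarfing) variant would instead overwrite A's copy with the new value rather than dropping it.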
Cache misses and memory traffic due to shared data blocks limit the performance of parallel computing in
multiprocessor computers or systems. Cache coherence aims to solve the problems associated with sharing data.
From Wikipedia, the free encyclopedia
[Figure: Multiple Caches of Shared Resource]
In computing, cache coherence is the consistency of shared resource data that ends up stored in multiple local caches.
When clients in a system maintain caches of a common memory resource, problems may arise with
inconsistent data. This is particularly true of CPUs in a multiprocessing system. Referring to the
illustration on the right, if the top client has a copy of a memory block from a previous read and the
bottom client changes that memory block, the top client could be left with an invalid cache of
memory without any notification of the change. Cache coherence is intended to manage such
conflicts and maintain consistency between cache and memory.
In a shared memory multiprocessor system with a separate cache memory for each processor, it is
possible to have many copies of any one instruction operand: one copy in the main memory and one
in each cache memory. When one copy of an operand is changed, the other copies of the operand
must be changed also. Cache coherence is the discipline that ensures that changes in the values of
shared operands are propagated throughout the system in a timely fashion.
There are three distinct levels of cache coherence:
1. every write operation appears to occur instantaneously
2. all processors see exactly the same sequence of changes of values for each separate operand
3. different processors may see an operation and assume different sequences of values; this is
considered to be noncoherent behavior.
In both level 2 behavior and level 3 behavior, a program can observe stale data. Recently, computer
designers have come to realize that the programming discipline required to deal with level 2
behavior is sufficient to deal also with level 3 behavior.
Therefore, at some point only level 1 and level 3 behavior will be seen in machines.
Coherence defines the behavior of reads and writes to the same memory location. The coherence of
caches is obtained if the following conditions are met:
1. In a read made by a processor P to a location X that follows a write by the same processor P
to X, with no writes of X by another processor occurring between the write and the read
instructions made by P, X must always return the value written by P. This condition is related
to program order preservation, and this must be achieved even in monoprocessed architectures.
2. A read made by a processor P1 to location X that happens after a write by another processor
P2 to X must return the written value made by P2 if no other writes to X made by any
processor occur between the two accesses and the read and write are sufficiently
separated. This condition defines the concept of coherent view of memory. If processors can
read the same old value after the write made by P2, we can say that the memory is incoherent.
3. Writes to the same location must be sequenced. In other words, if location X received two
different values A and B, in this order, from any two processors, the processors can never
read location X as B and then read it as A. The location X must be seen with values A and B
in that order.
These conditions are defined supposing that the read and write operations are made
instantaneously. However, this doesn't happen in computer hardware given memory latency and
other aspects of the architecture. A write by processor P1 may not be seen by a read from processor
P2 if the read is made within a very small time after the write has been made. The memory
consistency model defines when a written value must be seen by a following read instruction made
by the other processors.
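Condition 3 above (write serialization) can be checked mechanically: given the order in which writes to a location occurred, no processor may observe those values out of order. The following is a small, hypothetical checker, not part of any real coherence protocol.

```python
def respects_write_order(reads, write_order):
    """Check coherence condition 3 (write serialization): a processor may
    miss some values, but must never observe the writes to one location
    in a different order than they occurred."""
    position = {value: i for i, value in enumerate(write_order)}
    last_seen = -1
    for value in reads:
        if position[value] < last_seen:
            return False          # observed an older write after a newer one
        last_seen = position[value]
    return True

write_order = ["A", "B"]   # location X received value A, then value B
print(respects_write_order(["A", "A", "B"], write_order))  # True: coherent
print(respects_write_order(["B", "A"], write_order))       # False: B then A
```

The second trace is exactly the forbidden behavior from condition 3: reading location X as B and then as A.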
Rarely, and especially in algorithms, coherence can instead refer to the locality of reference.
In a directory-based system, the data being shared is placed in a common directory that
maintains the coherence between caches. The directory acts as a filter through which the
processor must ask permission to load an entry from the primary memory to its cache. When
an entry is changed the directory either updates or invalidates the other caches with that entry.
Snooping is a process where the individual caches monitor address lines for accesses to memory
locations that they have cached. It is called a write invalidate protocol when a write
operation is observed to a location that a cache has a copy of, and the cache controller
invalidates its own copy of the snooped memory location.
Snarfing is a mechanism where a cache controller watches both address and data in an attempt to
update its own copy of a memory location when a second master modifies a location in main
memory. When a write operation is observed to a location that a cache has a copy of, the
cache controller updates its own copy of the snarfed memory location with the new data.
Distributed shared memory systems mimic these mechanisms in an attempt to maintain
consistency between blocks of memory in loosely coupled systems.
The two most common mechanisms of ensuring coherency are snooping and directory-
based, each having its own benefits and drawbacks. Snooping protocols tend to be
faster, if enough bandwidth is available, since all transactions are a request/response
seen by all processors. The drawback is that snooping isn't scalable. Every request
must be broadcast to all nodes in a system, meaning that as the system gets larger, the
size of the (logical or physical) bus and the bandwidth it provides must grow. Directories,
on the other hand, tend to have longer latencies (with a 3 hop request/forward/respond)
but use much less bandwidth since messages are point to point and not broadcast. For
this reason, many of the larger systems (>64 processors) use this type of cache coherence.
For the snooping mechanism, a snoop filter reduces the snooping traffic by maintaining
a plurality of entries, each representing a cache line that may be owned by one or more
nodes. When replacement of one of the entries is required, the snoop filter selects for
replacement the entry representing the cache line or lines owned by the fewest nodes,
as determined from a presence vector in each of the entries. A temporal or other type of
algorithm is used to refine the selection if more than one cache line is owned by the
fewest number of nodes.
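The snoop-filter replacement policy described above (evict the entry owned by the fewest nodes, as counted from a presence vector) can be sketched as follows. The entry table and addresses are made up for illustration, and the temporal tie-breaking algorithm mentioned in the text is omitted.

```python
def select_victim(entries):
    """Snoop-filter replacement sketch: pick the entry whose cache line is
    owned by the fewest nodes, counted from its presence (bit) vector."""
    # entries: {line_address: presence_vector}; bit i set => node i owns a copy
    def owner_count(vector):
        return bin(vector).count("1")
    return min(entries, key=lambda line: owner_count(entries[line]))

entries = {
    0x100: 0b1011,   # owned by nodes 0, 1 and 3
    0x200: 0b0100,   # owned by node 2 only
    0x300: 0b0011,   # owned by nodes 0 and 1
}
print(hex(select_victim(entries)))   # 0x200: the line with the fewest owners
```

Evicting the least-shared line means the filter forgets as little sharing information as possible, which is the rationale given in the text.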
A coherency protocol is a protocol which maintains the consistency between all the
caches in a system of distributed shared memory. The protocol maintains memory
coherence according to a specific consistency model. Older multiprocessors support
the sequential consistency model, while modern shared memory systems typically
support the release consistency or weak consistency models.
Transitions between states in any specific implementation of these protocols may vary.
For example, an implementation may choose different update and invalidation
transitions such as update-on-read, update-on-write, invalidate-on-read, or invalidate-
on-write. The choice of transition may affect the amount of inter-cache traffic, which in
turn may affect the amount of cache bandwidth available for actual work. This should be
taken into consideration in the design of distributed software that could cause strong
contention between the caches of multiple processors.
Various models and protocols have been devised for maintaining coherence, such
as MSI, MESI (aka Illinois), MOSI, MOESI, MERSI, MESIF, write-once,
and the Synapse, Berkeley, Firefly and Dragon protocols.
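As one concrete example, the MSI protocol can be written as a small state table. This is one common formulation; as the text notes, transitions vary between implementations, so treat this as a sketch rather than a definitive specification.

```python
# MSI cache-line states: 'M' modified, 'S' shared, 'I' invalid.
def msi_next(state, event):
    """One common formulation of MSI transitions for a single cache line."""
    table = {
        ('I', 'local_read'):   'S',  # read miss: fetch the line, now shared
        ('I', 'local_write'):  'M',  # write miss: fetch exclusively, modified
        ('S', 'local_write'):  'M',  # upgrade: invalidate the other sharers
        ('S', 'remote_write'): 'I',  # another cache writes: drop our copy
        ('M', 'remote_read'):  'S',  # supply the data, downgrade to shared
        ('M', 'remote_write'): 'I',  # another cache takes ownership
    }
    return table.get((state, event), state)  # other events leave state alone

state = 'I'
for event in ['local_read', 'local_write', 'remote_read', 'remote_write']:
    state = msi_next(state, event)
print(state)   # I -> S -> M -> S -> I
```

MESI adds an Exclusive state so that a line read by only one cache can be written without a bus transaction; the other listed protocols refine the idea further.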
Cache Coherence for Multiprocessing
In computing, cache coherence (also cache coherency) refers to the consistency of data stored in local caches of
a shared resource. Cache coherence is a special case of memory coherence. When clients in a system maintain
caches of a common memory resource, problems may arise with inconsistent data.
This is particularly true of CPUs in a multiprocessing system. Referring to the "Multiple Caches of Shared Resource"
figure, if the top client has a copy of a memory block from a previous read and the bottom client changes that
memory block, the top client could be left with an invalid cache of memory without any notification of the
change. Cache coherence is intended to manage such conflicts and maintain consistency between cache and memory.
Definition - What does Cache Coherence mean?
Cache coherence is the regularity or consistency of data stored in cache memory. Maintaining cache
and memory consistency is imperative for multiprocessors or distributed shared memory (DSM)
systems. Cache management is structured to ensure that data is not overwritten or lost. Different
techniques may be used to maintain cache coherency, including directory based coherence, bus
snooping and snarfing. To maintain consistency, a DSM system imitates these techniques and uses
a coherency protocol, which is essential to system operations. Cache coherence is also known as
cache coherency or cache consistency.
Techopedia explains Cache Coherence
The majority of coherency protocols that support multiprocessors use a sequential consistency
standard. DSM systems use a weak or release consistency standard. The following methods are
used for cache coherence management and consistency in read/write (R/W) and instantaneous
operations: written data locations are sequenced; write operations occur instantaneously; program
order preservation is maintained with R/W data; and a coherent memory view is maintained, where
consistent values are provided through shared memory. Several types of cache coherency may be
utilized by different structures, as follows:
1. Directory based coherence: references a filter in which memory data is accessible to all
processors. When memory area data changes, the cache is updated or invalidated.
2. Bus snooping: monitors and manages all cache memory and notifies the processor when there
is a write operation. Used in smaller systems with fewer processors.
3. Snarfing: self-monitors and updates its address and data versions. Requires large amounts of
bandwidth and resources compared to directory based coherence and bus snooping.
Issues in Cache Memory
Memory Hierarchy Issues
We first illustrate the issues involved in optimizing memory system performance on
multiprocessors, and define the terms that are used in this paper. True sharing cache
misses occur whenever two processors access the same data word. True sharing
requires the processors involved to explicitly synchronize with each other to ensure
program correctness. A computation is said to have temporal locality if it re-uses
much of the data it has been accessing; programs with high temporal locality tend to
have less true sharing. The amount of true sharing in the program is a critical factor
for performance on multiprocessors; high levels of true sharing and synchronization
can easily overwhelm the advantage of parallelism.
It is important to take synchronization and sharing into consideration when deciding
on how to parallelize a loop nest and how to assign the iterations to processors.
Consider the code shown in Figure 1(a). While all the iterations in the first two-deep
loop nest can run in parallel, only the inner loop of the second loop nest is
parallelizable. To minimize synchronization and sharing, we should also parallelize
only the inner loop in the first loop nest. By assigning the ith iteration in each of the
inner loops to the same processor, each processor always accesses the same rows of
the arrays throughout the entire computation. Figure 1(b) shows the data accessed by
each processor in the case where each processor is assigned a block of rows. In this
way, no interprocessor communication or synchronization is necessary.
Figure 1: A simple example: (a) sample code, (b) original data mapping and (c)
optimized data mapping. The light grey arrows show the memory layout order.
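Since Figure 1's code is not reproduced here, the row-blocked iteration assignment it describes can be approximated with a hypothetical sketch: each processor is given a contiguous block of rows and touches only those rows in both loop nests, so no data word is shared between processors.

```python
# Illustrative sizes, not taken from the paper's example.
N_ROWS, N_PROCS = 8, 2

def rows_for(proc):
    """Assign each processor a contiguous block of rows."""
    block = N_ROWS // N_PROCS
    return range(proc * block, (proc + 1) * block)

touched = {p: set() for p in range(N_PROCS)}
for p in range(N_PROCS):
    for i in rows_for(p):       # first loop nest: processor p's rows
        touched[p].add(i)
    for i in rows_for(p):       # second loop nest: the *same* rows again
        touched[p].add(i)

# No row is accessed by two processors -> no true sharing, no synchronization.
print(touched[0] & touched[1])   # set()
```

The point is that both loop nests use the same row-to-processor mapping; if the second nest were distributed differently, rows would be shared and synchronization would be required between the nests.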
Due to characteristics found in typical data caches, it is not sufficient to just minimize
sharing between processors. First, data are transferred in fixed-size units known
as cache lines, which are typically 4 to 128 bytes long. A computation is said to
have spatial locality if it uses multiple words in a cache line before the line is
displaced from the cache. While spatial locality is a consideration for both uni- and
multiprocessors, false sharing is unique to multiprocessors. False sharing results when
different processors use different data that happen to be co-located on the same cache
line. Even if a processor re-uses a data item, the item may no longer be in the cache
due to an intervening access by another processor to another word in the same cache line.
Assuming the FORTRAN convention that arrays are allocated in column-major order,
there is a significant amount of false sharing in our example, as shown in Figure 1(b).
If the number of rows accessed by each processor is smaller than the number of words
in a cache line, every cache line is shared by at least two processors. Each time one of
these lines is accessed, unwanted data are brought into the cache. Also, when one
processor writes part of the cache line, that line is invalidated in the other processor's
cache. This particular combination of computation mapping and data layout will result
in poor cache performance.
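The false-sharing arithmetic can be checked directly. Assuming a 16-word cache line and FORTRAN column-major layout (both from the surrounding text; the 16-row array is an illustrative size), two processors splitting the rows of one column land on the same cache line:

```python
LINE_WORDS = 16          # words per cache line (64-byte line, 4-byte REALs)
TOTAL_ROWS = 16          # illustrative array: 8 rows per processor, 2 procs

def line_of(row, col):
    """Cache line holding element (row, col) under column-major layout."""
    flat_index = col * TOTAL_ROWS + row   # FORTRAN column-major ordering
    return flat_index // LINE_WORDS

# Processor 0 owns rows 0..7, processor 1 owns rows 8..15 of column 0.
lines_p0 = {line_of(r, 0) for r in range(0, 8)}
lines_p1 = {line_of(r, 0) for r in range(8, 16)}
print(lines_p0 & lines_p1)   # {0}: both processors touch cache line 0
```

Because each processor's 8 rows are fewer than the 16 words in a line, every column's cache lines are shared by both processors, exactly the situation the text describes: a write by one processor invalidates the line in the other's cache even though they use disjoint data.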
Another problematic characteristic of data caches is that they typically have a small
set-associativity; that is, each memory location can only be cached in a small number
of cache locations. Conflict misses occur whenever different memory locations
contend for the same cache location. Since each processor only operates on a subset of
the data, the addresses accessed by each processor may be distributed throughout the
shared address space.
Consider what happens to the example in Figure 1(b) if the arrays are of
size 1024 x 1024 and the target machine has a direct-mapped cache of size 64KB.
Assuming that REALs are 4B long, the elements in every 16th column will map to the
same cache location and cause conflict misses. This problem exists even if the caches
are set-associative, given that existing caches usually only have a small degree of associativity.
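The conflict-miss claim (elements in every 16th column map to the same cache location) can be verified with the numbers given: a 64KB direct-mapped cache and 4-byte REALs, assuming 1024-element columns, which is the column length at which 16 columns span exactly 64KB.

```python
CACHE_BYTES = 64 * 1024   # 64KB direct-mapped cache
REAL_BYTES = 4            # a REAL is 4 bytes
COLUMN_ROWS = 1024        # assumed column length: 1024 * 4B = 4KB per column

def cache_slot(row, col):
    """Byte offset within the direct-mapped cache for element (row, col)."""
    byte_addr = (col * COLUMN_ROWS + row) * REAL_BYTES  # column-major address
    return byte_addr % CACHE_BYTES                       # direct-mapped index

# The same row in columns 0, 16 and 32 contends for one cache location:
print(cache_slot(0, 0), cache_slot(0, 16), cache_slot(0, 32))  # all 0
print(cache_slot(0, 1))   # 4096: a different location, no conflict
```

Since 16 columns x 4KB = 64KB wraps the cache exactly, any access pattern that strides across every 16th column keeps evicting its own data, producing the conflict misses described above.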
As shown above, the cache performance of multiprocessor code depends on how the
computation is distributed as well as how the data are laid out. Instead of simply
obeying the data layout convention used by the input language (e.g. column-major in
FORTRAN and row-major in C), we can improve the cache performance by
customizing the data layout for the specific program. We observe that multiprocessor
cache performance problems can be minimized by making the data accessed by each
processor contiguous in the shared address space, an example of which is shown in
Figure 1(c). Such a layout enhances spatial locality, minimizes false sharing and also
minimizes conflict misses.
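A sketch of the custom layout of Figure 1(c): remap the array so that each processor's block of rows occupies one contiguous region of the shared address space, in the spirit of the two-dimensional-to-four-dimensional transformation cited below. The sizes and the `blocked_offset` function are illustrative assumptions.

```python
N_ROWS, N_COLS, N_PROCS = 4, 4, 2
BLOCK = N_ROWS // N_PROCS     # rows per processor

def blocked_offset(row, col):
    """Processor-blocked layout: all of processor p's data comes first,
    stored column-major within its block of rows."""
    p = row // BLOCK                  # processor owning this row
    local_row = row % BLOCK           # row index within p's block
    return p * (BLOCK * N_COLS) + col * BLOCK + local_row

# Every offset in processor 0's region precedes processor 1's region:
p0 = [blocked_offset(r, c) for r in range(0, BLOCK) for c in range(N_COLS)]
p1 = [blocked_offset(r, c) for r in range(BLOCK, N_ROWS) for c in range(N_COLS)]
print(max(p0) < min(p1))   # True: each processor's data is contiguous
```

With contiguous per-processor regions, cache lines (and cache index ranges) are no longer straddled by two processors, which is how this layout attacks false sharing and conflict misses at once.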
The importance of optimizing memory subsystem performance for multiprocessors
has also been confirmed by several studies of hand optimizations on real applications.
Singh et al. explored performance issues on scalable shared address space
architectures; they improved cache behavior by transforming two-dimensional arrays
into four-dimensional arrays so that each processor's local data are contiguous in
memory. Torrellas et al. and Eggers et al. [11, 12] also showed that improving
spatial locality and reducing false sharing resulted in significant speedups for a set of
programs on shared-memory machines. In summary, not only must we minimize
sharing to achieve efficient parallelization, it is also important to optimize for the
multi-word cache line and the small set associativity. The cache behavior depends on
both the computation mapping and the data layout. Thus, besides choosing a good
parallelization scheme and a good computation mapping, we may also wish to change
the data structures in the program.