Cache memory

Transcript

  • 1. Cache memory. Cache memory is random access memory (RAM) that a computer microprocessor can access more quickly than it can access regular RAM. As the microprocessor processes data, it looks first in the cache; if it finds the data there (from a previous read), it avoids the more time-consuming read from the larger main memory. Cache memory is often described in levels of closeness and accessibility to the microprocessor. An L1 cache is on the same chip as the microprocessor (for example, the PowerPC 601 processor has a 32-kilobyte level-1 cache built into its chip). L2 is usually a separate static RAM (SRAM) chip, while main memory is usually dynamic RAM (DRAM). Beyond cache memory, RAM itself can be thought of as a cache for hard disk storage, since the contents of RAM come from the hard disk when you turn the computer on and load the operating system, and later as you start new applications and access new data. RAM can also contain a special area called a disk cache that holds the data most recently read from the hard disk. In other words, cache is the memory the computer checks first; provided the information is in the cache, it need go no farther. More formally, a CPU cache is a cache used by the central processing unit (CPU) of a computer to reduce the average time to access memory. The cache is a smaller, faster memory that stores copies of data from frequently used main-memory locations. Most CPUs have several independent caches, including instruction and data caches, and the data cache is usually organized as a hierarchy of cache levels (L1, L2, etc.).
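The effect this slide describes, where a cache hit avoids a slower trip to main memory, can be observed from software. The following is a minimal sketch of my own (not from the slide): it times the same number of array reads with a cache-friendly stride of 1 and a cache-hostile stride of 16, which assumes 4-byte ints and roughly 64-byte cache lines; exact numbers are machine-dependent.

/* Sketch: compare cache-friendly vs cache-hostile traversal of a large array.
 * Assumes 4-byte int and ~64-byte cache lines; compile with: cc -O2 stride.c */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 24)                    /* 16M ints (~64 MB): larger than any cache level */

static double touch(int *a, size_t stride)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    long sum = 0;
    for (size_t s = 0; s < stride; s++)        /* every element is read exactly once */
        for (size_t i = s; i < N; i += stride)
            sum += a[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    if (sum == 42) puts("");                   /* keep the compiler from removing the loop */
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
    int *a = malloc((size_t)N * sizeof *a);
    if (!a) return 1;
    for (size_t i = 0; i < N; i++) a[i] = (int)i;

    /* Same number of accesses, very different cache behaviour: stride 1 uses
     * every word of each cache line it loads, stride 16 (64 bytes) wastes
     * most of each line and forces many more line fetches from memory. */
    printf("stride  1: %.3f s\n", touch(a, 1));
    printf("stride 16: %.3f s\n", touch(a, 16));
    free(a);
    return 0;
}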
  • 2. Cache (pronounced "cash") memory is extremely fast memory that is built into a computer's central processing unit (CPU) or located next to it on a separate chip. The CPU uses cache memory to store instructions that are repeatedly required to run programs, improving overall system speed. The advantage of cache memory is that the CPU does not have to use the motherboard's system bus for data transfer. Whenever data must be passed over the system bus, the transfer speed slows to the motherboard's capability; by avoiding that bottleneck, the CPU can process data much faster. As it happens, once most programs are open and running, they use very few resources, and when those resources are kept in cache, programs operate more quickly and efficiently. All else being equal, cache is so effective for system performance that a computer with a fast CPU but little cache can post lower benchmarks than a system with a somewhat slower CPU and more cache. Cache built into the CPU itself is referred to as Level 1 (L1) cache. Cache that resides on a separate chip next to the CPU is called Level 2 (L2) cache. Some CPUs have both L1 and L2 cache built in and designate the separate cache chip as Level 3 (L3) cache. Cache that is built into the CPU is faster than separate cache because it runs at the speed of the microprocessor itself; even separate cache, however, is still roughly twice as fast as main Random Access Memory (RAM). Cache is more expensive than RAM, but it is well worth getting a CPU and motherboard with built-in cache in order to maximize system performance. Disk caching applies the same principle to the hard disk that memory caching applies to the CPU: frequently accessed hard disk data is stored in a separate segment of RAM so it does not have to be retrieved from the disk over and over, since RAM is far faster than the platter technology used in conventional hard disks. This situation is changing as hybrid hard disks, which have built-in flash memory caches, become common; as drives move entirely to flash, which is much faster than spinning platters, the need for RAM-based disk caching diminishes.
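For a concrete look at the L1/L2/L3 hierarchy this slide describes, the cache geometry can often be queried at run time. The sketch below is an assumption-laden example: the _SC_LEVEL* names are glibc extensions to sysconf() on Linux (not portable POSIX), and they may return 0 or -1 where the information is unavailable.

/* Hedged sketch: query cache sizes and line size on Linux/glibc via sysconf().
 * These _SC_* constants are glibc extensions; values of 0 or -1 mean "unknown". */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    printf("L1 data cache: %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_SIZE));
    printf("L2 cache:      %ld bytes\n", sysconf(_SC_LEVEL2_CACHE_SIZE));
    printf("L3 cache:      %ld bytes\n", sysconf(_SC_LEVEL3_CACHE_SIZE));
    printf("Line size:     %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_LINESIZE));
    return 0;
}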
  • 3. Cache coherence (cash cō-hēr′-əns), n. A protocol for managing the caches of a multiprocessor system so that no data is lost or overwritten before the data is transferred from a cache to the target memory. When two or more computer processors work together on a single program, known as multiprocessing, each processor may have its own memory cache that is separate from the larger RAM that the individual processors access. A memory cache, sometimes called a cache store or RAM cache, is a portion of memory made of high-speed static RAM (SRAM) instead of the slower and cheaper dynamic RAM (DRAM) used for main memory. Memory caching is effective because most programs access the same data or instructions over and over; by keeping as much of this information as possible in SRAM, the computer avoids accessing the slower DRAM. When multiple processors with separate caches share a common memory, it is necessary to keep the caches in a state of coherence by ensuring that any shared operand that is changed in any cache is changed throughout the entire system. This is done in one of two ways: with a directory-based system or a snooping system. In a directory-based system, the data being shared is placed in a common directory that maintains the coherence between caches; the directory acts as a filter through which a processor must ask permission to load an entry from primary memory into its cache, and when an entry is changed the directory either updates or invalidates the other caches holding that entry. In a snooping system, all caches on the bus monitor (or snoop) the bus to determine whether they have a copy of a block of data that is requested on the bus; every cache keeps the sharing status of every block of physical memory it holds. Cache misses and memory traffic due to shared data blocks limit the performance of parallel computing on multiprocessor systems, and cache coherence aims to solve the problems associated with sharing data.
  • 4. Cache coherence. [Figure: Multiple Caches of Shared Resource] In computing, cache coherence is the consistency of shared resource data that ends up stored in multiple local caches. When clients in a system maintain caches of a common memory resource, problems may arise with inconsistent data. This is particularly true of CPUs in a multiprocessing system. Referring to the "Multiple Caches of Shared Resource" figure, if the top client has a copy of a memory block from a previous read and the bottom client changes that memory block, the top client could be left with an invalid cache of memory without any notification of the change. Cache coherence is intended to manage such conflicts and maintain consistency between cache and memory.
Overview. In a shared-memory multiprocessor system with a separate cache memory for each processor, it is possible to have many copies of any one instruction operand: one copy in main memory and one in each cache memory. When one copy of an operand is changed, the other copies of the operand must be changed as well. Cache coherence is the discipline that ensures that changes in the values of shared operands are propagated throughout the system in a timely fashion.[1]:30 There are three distinct levels of cache coherence:[2]
1. every write operation appears to occur instantaneously;
2. all processors see exactly the same sequence of changes of values for each separate operand;
3. different processors may see an operation and assume different sequences of values; this is considered noncoherent behavior.
  • 5. In both level 2 and level 3 behavior, a program can observe stale data. Computer designers have come to realize that the programming discipline required to deal with level 2 behavior is sufficient to deal with level 3 behavior as well, so at some point only level 1 and level 3 behavior will be seen in machines.
Definition. Coherence defines the behavior of reads and writes to the same memory location. The coherence of caches is obtained if the following conditions are met:[3]
1. A read made by a processor P to a location X that follows a write by the same processor P to X, with no writes to X by another processor occurring between the write and the read made by P, must always return the value written by P. This condition is related to program order preservation, and it must hold even in uniprocessor architectures.
2. A read made by a processor P1 to location X that happens after a write by another processor P2 to X must return the value written by P2 if no other writes to X by any processor occur between the two accesses and the read and write are sufficiently separated. This condition defines the concept of a coherent view of memory: if processors can still read the old value after the write made by P2, the memory is incoherent.
3. Writes to the same location must be serialized. In other words, if location X received two different values A and B, in that order, from any two processors, no processor can read X as B and then later read it as A; location X must be seen with values A and B in that order.[2]
These conditions are defined assuming that read and write operations take effect instantaneously. This does not happen in real hardware, because of memory latency and other aspects of the architecture: a write by processor P1 may not be seen by a read from processor P2 if the read is made within a very small time after the write. The memory consistency model defines when a written value must be seen by a following read instruction made by the other processors. Rarely, and especially in algorithms, coherence can instead refer to the locality of reference.
Coherency mechanisms. Directory-based:
  • 6. In a directory-based system, the data being shared is placed in a common directory that maintains the coherence between caches. The directory acts as a filter through which the processor must ask permission to load an entry from primary memory into its cache. When an entry is changed, the directory either updates or invalidates the other caches holding that entry.
Snooping: a process in which the individual caches monitor address lines for accesses to memory locations that they have cached.[4] It is called a write-invalidate protocol when a write operation is observed to a location of which a cache has a copy and the cache controller invalidates its own copy of the snooped memory location.[5]
Snarfing: a mechanism in which a cache controller watches both address and data in an attempt to update its own copy of a memory location when a second master modifies that location in main memory. When a write operation is observed to a location of which a cache has a copy, the cache controller updates its own copy of the snarfed memory location with the new data.
Distributed shared memory systems mimic these mechanisms in an attempt to maintain consistency between blocks of memory in loosely coupled systems. The two most common mechanisms for ensuring coherency are snooping and directory-based, each with its own benefits and drawbacks. Snooping protocols tend to be faster, if enough bandwidth is available, since all transactions are requests and responses seen by all processors. The drawback is that snooping is not scalable: every request must be broadcast to all nodes in the system, so as the system gets larger, the size of the (logical or physical) bus and the bandwidth it provides must grow. Directories, on the other hand, tend to have longer latencies (a three-hop request/forward/respond) but use much less bandwidth, since messages are point-to-point rather than broadcast. For this reason, many of the larger systems (more than 64 processors) use this type of cache coherence.
For the snooping mechanism, a snoop filter reduces snooping traffic by maintaining a number of entries, each representing a cache line that may be owned by one or more nodes. When one of the entries must be replaced, the snoop filter selects for replacement the entry representing the cache line or lines owned by the fewest nodes, as determined from a presence vector in each entry. A temporal or other type of algorithm is used to refine the selection if more than one cache line is owned by the fewest number of nodes.[6]
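The write-invalidate snooping behaviour described above can be illustrated with a toy software model. The sketch below is my own illustration, not real hardware or any library API: each "processor" has a one-entry cache with a valid bit, and a write is broadcast on a pretend bus so the other copies are invalidated rather than left stale.

/* Toy model of write-invalidate snooping: a write by one processor
 * invalidates the other processors' cached copies of the same word. */
#include <stdio.h>
#include <stdbool.h>

#define NPROC 4

struct cache_line { int value; bool valid; };

static struct cache_line cache[NPROC];   /* one private cache line per processor */
static int memory_word = 0;              /* the shared main-memory location      */

static int proc_read(int p)
{
    if (!cache[p].valid) {               /* miss: fetch the current value from memory */
        cache[p].value = memory_word;
        cache[p].valid = true;
    }
    return cache[p].value;
}

static void proc_write(int p, int v)
{
    for (int q = 0; q < NPROC; q++)      /* "bus" broadcast: others snoop and invalidate */
        if (q != p) cache[q].valid = false;
    cache[p].value = v;
    cache[p].valid = true;
    memory_word = v;                     /* write-through, for simplicity */
}

int main(void)
{
    printf("P0 reads %d\n", proc_read(0));   /* P0 caches the old value (0) */
    proc_write(1, 42);                       /* P1's write invalidates P0's copy */
    printf("P0 reads %d\n", proc_read(0));   /* P0 misses again and sees 42, not stale data */
    return 0;
}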
  • 7. Coherency protocol. A coherency protocol is a protocol that maintains consistency between all the caches in a system of distributed shared memory. The protocol maintains memory coherence according to a specific consistency model. Older multiprocessors support the sequential consistency model, while modern shared-memory systems typically support release consistency or weak consistency models. Transitions between states in any specific implementation of these protocols may vary. For example, an implementation may choose different update and invalidation transitions, such as update-on-read, update-on-write, invalidate-on-read, or invalidate-on-write. The choice of transition may affect the amount of inter-cache traffic, which in turn may affect the amount of cache bandwidth available for actual work; this should be taken into consideration in the design of distributed software that could cause strong contention between the caches of multiple processors. Various models and protocols have been devised for maintaining coherence, such as MSI, MESI (aka Illinois), MOSI, MOESI, MERSI, MESIF, write-once, Synapse, Berkeley, Firefly, and Dragon.[1]:30–34
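Of the protocols listed above, MSI is the simplest, and its core can be reduced to a per-line state machine. The sketch below is a simplified illustration of my own: it shows only how the Modified, Shared, and Invalid states react to local accesses and snooped bus events, and omits the bus transactions, write-backs, and corner cases a real implementation needs.

/* Minimal MSI state-transition sketch (write-invalidate, snooping bus). */
#include <stdio.h>

typedef enum { INVALID, SHARED, MODIFIED } msi_state;

typedef enum {
    PR_RD,      /* this processor reads the line                        */
    PR_WR,      /* this processor writes the line                       */
    BUS_RD,     /* another processor's read is snooped on the bus       */
    BUS_RDX     /* another processor's read-exclusive/write is snooped  */
} msi_event;

static msi_state msi_next(msi_state s, msi_event e)
{
    switch (s) {
    case INVALID:
        if (e == PR_RD) return SHARED;     /* miss: line fetched in shared state      */
        if (e == PR_WR) return MODIFIED;   /* read-exclusive on the bus, then write   */
        return INVALID;
    case SHARED:
        if (e == PR_WR)   return MODIFIED; /* upgrade: other copies get invalidated   */
        if (e == BUS_RDX) return INVALID;  /* someone else wants to write the line    */
        return SHARED;                     /* PR_RD and BUS_RD keep it shared         */
    case MODIFIED:
        if (e == BUS_RD)  return SHARED;   /* supply the dirty data, drop to shared   */
        if (e == BUS_RDX) return INVALID;  /* supply the dirty data, then invalidate  */
        return MODIFIED;                   /* local reads and writes hit              */
    }
    return INVALID;                        /* unreachable; keeps compilers quiet      */
}

int main(void)
{
    msi_state s = INVALID;
    s = msi_next(s, PR_RD);    /* INVALID  -> SHARED   */
    s = msi_next(s, PR_WR);    /* SHARED   -> MODIFIED */
    s = msi_next(s, BUS_RD);   /* MODIFIED -> SHARED   */
    printf("final state: %d (1 = SHARED)\n", s);
    return 0;
}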
  • 8. Cache coherence for multiprocessing. In computing, cache coherence (also cache coherency) refers to the consistency of data stored in the local caches of a shared resource. Cache coherence is a special case of memory coherence. When clients in a system maintain caches of a common memory resource, problems may arise with inconsistent data. This is particularly true of CPUs in a multiprocessing system. Referring to the "Multiple Caches of Shared Resource" figure, if the top client has a copy of a memory block from a previous read and the bottom client changes that memory block, the top client could be left with an invalid cache of memory without any notification of the change. Cache coherence is intended to manage such conflicts and maintain consistency between cache and memory.
Definition: what does cache coherence mean? Cache coherence is the regularity or consistency of data stored in cache memory. Maintaining cache and memory consistency is imperative for multiprocessors or distributed shared memory (DSM) systems. Cache management is structured to ensure that data is not overwritten or lost. Different techniques may be used to maintain cache coherency, including directory-based coherence, bus snooping, and snarfing. To maintain consistency, a DSM system imitates these techniques and uses a coherency protocol, which is essential to system operation. Cache coherence is also known as cache coherency or cache consistency.
The majority of coherency protocols that support multiprocessors use a sequential consistency standard, while DSM systems use a weak or release consistency standard. The following properties are used for cache coherence management and consistency in read/write (R/W) and instantaneous operations:
- Writes to a given data location are sequenced.
- Write operations appear to occur instantaneously.
- Program order is preserved for R/W data.
- A coherent memory view is maintained, in which consistent values are provided through shared memory.
Several types of cache coherency may be used by different structures:
- Directory-based coherence: references a filter through which memory data is made accessible to all processors; when data in a memory area changes, the caches are updated or invalidated.
- Bus snooping: monitors and manages all cache memory and notifies the processor when there is a write operation; used in smaller systems with fewer processors.
- Snarfing: each cache self-monitors and updates its own copy of the address and data; requires large amounts of bandwidth and resources compared to directory-based coherence and bus snooping.
  • 9. Issues in cache memory: memory hierarchy issues. We first illustrate the issues involved in optimizing memory system performance on multiprocessors, and define the terms used in this paper. True sharing cache misses occur whenever two processors access the same data word. True sharing requires the processors involved to synchronize with each other explicitly to ensure program correctness. A computation is said to have temporal locality if it re-uses much of the data it has been accessing; programs with high temporal locality tend to have less true sharing. The amount of true sharing in a program is a critical factor for performance on multiprocessors: high levels of true sharing and synchronization can easily overwhelm the advantage of parallelism. It is therefore important to take synchronization and sharing into consideration when deciding how to parallelize a loop nest and how to assign its iterations to processors. Consider the code shown in Figure 1(a). While all the iterations in the first two-deep loop nest can run in parallel, only the inner loop of the second loop nest is parallelizable. To minimize synchronization and sharing, we should also parallelize only the inner loop in the first loop nest. By assigning the ith iteration of each inner loop to the same processor, each processor always accesses the same rows of the arrays throughout the entire computation. Figure 1(b) shows the data accessed by each processor when each processor is assigned a block of rows. In this way, no interprocessor communication or synchronization is necessary.
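Figure 1's code is not reproduced in this transcript, so the sketch below is a hypothetical loop nest of my own that illustrates the strategy the slide describes: parallelize only the inner loop in both nests, with a static schedule, so the same iterations i (and therefore the same rows of the hypothetical arrays A and B) go to the same thread throughout. The dependence in the second nest is invented for illustration.

/* Hedged OpenMP sketch of "parallelize only the inner loop, same rows to the
 * same processor". Compile with: cc -O2 -fopenmp inner.c */
#include <omp.h>

#define N 1024
static double A[N][N], B[N][N];

void compute(void)
{
    /* First nest: every iteration is independent, but we still parallelize
     * only the inner (i) loop so the i -> thread mapping matches the second nest. */
    for (int j = 0; j < N; j++) {
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < N; i++)
            A[i][j] = 0.5 * B[i][j];
    }

    /* Second nest: the j loop carries a dependence (column j needs column j-1),
     * so only the inner (i) loop can run in parallel. With the same static
     * schedule and trip count, each thread gets the same rows i as above. */
    for (int j = 1; j < N; j++) {
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < N; i++)
            A[i][j] += A[i][j - 1];
    }
}

int main(void) { compute(); return 0; }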
  • 10. Figure 1: A simple example: (a) sample code, (b) original data mapping, and (c) optimized data mapping. The light grey arrows show the memory layout order.
Due to the characteristics of typical data caches, it is not sufficient just to minimize sharing between processors. First, data are transferred in fixed-size units known as cache lines, which are typically 4 to 128 bytes long[16]. A computation is said to have spatial locality if it uses multiple words in a cache line before the line is displaced from the cache. While spatial locality is a consideration for both uniprocessors and multiprocessors, false sharing is unique to multiprocessors. False sharing results when different processors use different data that happen to be co-located on the same cache line. Even if a processor re-uses a data item, the item may no longer be in the cache due to an intervening access by another processor to another word in the same cache line.
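False sharing as defined above is easy to provoke deliberately. The sketch below is my own example, not code from the paper: four threads increment four different counters, but the un-padded counters sit on the same cache line (the 64-byte line size is an assumption), so the line ping-pongs between cores; padding each counter onto its own line removes the sharing without changing the computation. Run both variants under `time` to compare.

/* False-sharing demo. Compile with: cc -O2 -pthread false_sharing.c */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define ITERS    50000000L

/* Un-padded: the four counters can share one cache line. volatile keeps the
 * compiler from collapsing the loop into a single add. */
static volatile long counters[NTHREADS];

/* Padded: each counter occupies (at least) one assumed 64-byte line. */
struct padded { volatile long value; char pad[64 - sizeof(long)]; };
static struct padded padded_counters[NTHREADS];

static void *work_shared(void *arg)
{
    long id = (long)arg;
    for (long i = 0; i < ITERS; i++)
        counters[id]++;                /* neighbours' writes keep invalidating this line */
    return NULL;
}

static void *work_padded(void *arg)
{
    long id = (long)arg;
    for (long i = 0; i < ITERS; i++)
        padded_counters[id].value++;   /* this line is private to one thread */
    return NULL;
}

static void run(void *(*fn)(void *), const char *label)
{
    pthread_t t[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, fn, (void *)i);
    for (long i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("%s finished\n", label);
}

int main(void)
{
    run(work_shared, "un-padded counters (false sharing)");
    run(work_padded, "padded counters (no false sharing)");
    return 0;
}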
  • 11. Assuming the FORTRAN convention that arrays are allocated in column-major order, there is a significant amount of false sharing in our example, as shown in Figure 1(b). If the number of rows accessed by each processor is smaller than the number of words in a cache line, every cache line is shared by at least two processors. Each time one of these lines is accessed, unwanted data are brought into the cache, and when one processor writes part of the cache line, that line is invalidated in the other processors' caches. This particular combination of computation mapping and data layout results in poor cache performance. Another problematic characteristic of data caches is that they typically have small set-associativity; that is, each memory location can be cached in only a small number of cache locations. Conflict misses occur whenever different memory locations contend for the same cache location. Since each processor operates on only a subset of the data, the addresses accessed by each processor may be distributed throughout the shared address space. Consider what happens to the example in Figure 1(b) if the arrays are of size 1024 x 1024 and the target machine has a direct-mapped cache of size 64KB. Assuming that REALs are 4 bytes long, the elements in every 16th column map to the same cache location and cause conflict misses. This problem exists even if the caches are set-associative, given that existing caches usually have only a small degree of associativity. As shown above, the cache performance of multiprocessor code depends on how the computation is distributed as well as on how the data are laid out. Instead of simply obeying the data layout convention used by the input language (e.g., column-major in FORTRAN and row-major in C), we can improve cache performance by customizing the data layout for the specific program. We observe that multiprocessor cache performance problems can be minimized by making the data accessed by each processor contiguous in the shared address space, an example of which is shown in Figure 1(c). Such a layout enhances spatial locality, minimizes false sharing, and also minimizes conflict misses. The importance of optimizing memory subsystem performance for multiprocessors has also been confirmed by several studies of hand optimizations on real applications. Singh et al. explored performance issues on scalable shared address space architectures; they improved cache behavior by transforming two-dimensional arrays into four-dimensional arrays so that each processor's local data are contiguous in memory[28]. Torrellas et al.[30] and Eggers et al.[11,12] also showed that improving spatial locality and reducing false sharing resulted in significant speedups for a set of programs on shared-memory machines.
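The "every 16th column" conflict-miss arithmetic above can be checked with a few lines of address arithmetic. The sketch below is my own verification under the stated assumptions (64 KB direct-mapped cache, 4-byte REALs, 1024 rows, column-major storage) plus an assumed 32-byte line size: columns 0, 16, 32, and 48 all start at the same cache line index.

/* Verify the conflict-miss mapping for a 64 KB direct-mapped cache. */
#include <stdio.h>

#define CACHE_BYTES (64 * 1024)
#define LINE_BYTES  32                 /* assumed line size */
#define ELEM_BYTES  4                  /* a REAL            */
#define ROWS        1024               /* one column = 1024 * 4 B = 4 KB */

int main(void)
{
    for (int col = 0; col < 64; col += 16) {
        unsigned long byte_addr  = (unsigned long)col * ROWS * ELEM_BYTES;
        unsigned long line_index = (byte_addr / LINE_BYTES) % (CACHE_BYTES / LINE_BYTES);
        printf("column %2d -> cache line index %lu\n", col, line_index);
    }
    /* Columns 0, 16, 32, 48 all print index 0: their elements contend for the
     * same cache locations even though the cache is far larger than a column. */
    return 0;
}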
  • 12. In summary, not only must we minimize sharing to achieve efficient parallelization; it is also important to optimize for the multi-word cache line and the small set-associativity. Cache behavior depends on both the computation mapping and the data layout. Thus, besides choosing a good parallelization scheme and a good computation mapping, we may also wish to change the data structures in the program.
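One way to "change the data structures" along the lines of Figure 1(c) and the Singh et al. transformation mentioned above is to add a thread dimension so each processor's rows become contiguous. The sketch below is a hypothetical illustration with invented names and sizes, not the paper's actual transformation.

/* Hedged sketch: per-thread contiguous blocks instead of interleaved rows. */
#include <stdlib.h>

#define N        1024
#define NTHREADS 4
#define ROWS_PER (N / NTHREADS)

/* Original layout: one N x N array; thread t owns rows [t*ROWS_PER, (t+1)*ROWS_PER),
 * which are interleaved with other threads' data at cache-line granularity. */
typedef double original_t[N][N];

/* Blocked layout: a leading thread dimension, so the ROWS_PER x N elements a
 * thread touches are contiguous, improving spatial locality and removing
 * false sharing and many inter-thread conflict misses. */
typedef double blocked_t[NTHREADS][ROWS_PER][N];

/* Accessor mapping the original (i, j) index onto the blocked layout. */
static inline double *elem(blocked_t *a, int i, int j)
{
    return &(*a)[i / ROWS_PER][i % ROWS_PER][j];
}

int main(void)
{
    blocked_t *a = malloc(sizeof *a);
    if (!a) return 1;
    *elem(a, 513, 7) = 1.0;   /* element (513, 7) lives in thread 2's contiguous block */
    free(a);
    return 0;
}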