Characteristics of an On-Chip Cache on NEC SX Vector Architecture

Akihiro MUSA(1,3,*), Yoshiei SATO(1,†), Ryusuke EGAWA(1,2,‡), Hiroyuki TAKIZAWA(1,2,§), Koki OKABE(2,¶) and Hiroaki KOBAYASHI(1,2,‖)

1 Graduate School of Information Sciences, Tohoku University, Sendai 980-8578, Japan
2 Cyberscience Center, Tohoku University, Sendai 980-8578, Japan
3 NEC Corporation, Tokyo 108-8001, Japan

Received May 6, 2008; final version accepted December 10, 2008.
Interdisciplinary Information Sciences Vol. 15, No. 1 (2009), 51-66. ISSN 1340-9050 print / 1347-6157 online. DOI 10.4036/iis.2009.51.

Thanks to their high effective memory bandwidth, vector systems can achieve high computation efficiency for computation-intensive scientific applications. However, they have been encountering the memory wall problem, and the effective memory bandwidth rate has decreased: the bytes per flop rates of recent vector systems have fallen from 4 (SX-7 and SX-8) to 2 (SX-8R) and 2.5 (SX-9). The situation is getting worse as more function units and/or cores are brought onto a single chip, because the pin bandwidth is limited and does not scale. To solve this problem, we propose an on-chip cache, called the vector cache, to maintain the effective memory bandwidth rate of future vector supercomputers. The vector cache employs a bypass mechanism between the main memory and the register files under software control. We evaluate the performance of the vector cache on NEC SX vector processor architectures with bytes per flop rates of 2 B/FLOP and 1 B/FLOP to clarify its basic characteristics, using the NEC SX-7 simulator extended with the vector cache mechanism. The benchmark programs are two DAXPY-like loops and five leading scientific applications. The results indicate that the vector cache boosts the computational efficiencies of the 2 B/FLOP and 1 B/FLOP systems up to the level of the 4 B/FLOP system. In particular, when cache hit rates exceed 50%, the 2 B/FLOP system can achieve a performance comparable to that of the 4 B/FLOP system, because the vector cache with the bypass mechanism can provide data from the main memory and the cache simultaneously. In addition, from the viewpoint of cache design, we investigate the impact of cache associativity on the cache hit rate and the relationship between cache latency and performance. The results suggest that the associativity hardly affects the cache hit rate, and that the effect of the cache latency depends on the vector loop lengths of the applications: a shorter cache latency contributes to the performance improvement of applications with shorter loop lengths, even in the case of the 4 B/FLOP system, whereas at loop lengths of 256 or more the latency can effectively be hidden and the performance is not sensitive to it. Finally, we discuss the effects of selective caching using the bypass mechanism and of loop unrolling on the vector cache performance for the scientific applications. Selective caching is effective for efficient use of the limited cache capacity. Loop unrolling is also effective for improving performance, resulting in a synergistic effect with caching. However, there are exceptional cases in which loop unrolling worsens the cache hit rate, because the working space needed to process the unrolled loops outgrows the cache; in such cases, the increase in the cache miss rate cancels the gain obtained by unrolling.

KEYWORDS: Vector processing, Vector cache, Performance characterization, Memory system, Scientific application

* E-mail: musa@sc.isc.tohoku.ac.jp  † E-mail: yoshiei@sc.isc.tohoku.ac.jp  ‡ E-mail: egawa@isc.tohoku.ac.jp  § E-mail: tacky@isc.tohoku.ac.jp  ¶ E-mail: okabe@isc.tohoku.ac.jp  ‖ E-mail: koba@isc.tohoku.ac.jp
1. Introduction

Vector supercomputers have high computation efficiency for computation-intensive scientific applications. In our previous work (Kobayashi (2006), Musa et al. (2006)), we showed that NEC SX vector supercomputers can achieve high sustained performance on several leading applications, owing to their high bytes per flop rates compared with scalar systems. However, as recent advances in VLSI technology have also been accelerating processor speeds, vector supercomputers have been encountering the memory wall problem. As a result, it is getting harder for vector supercomputers to keep a memory bandwidth balanced with the improvement of their flop/s performance. On the NEC SX systems, the bytes per flop rate has decreased from 8 B/FLOP to 2 B/FLOP during the last twenty years.
We have proposed an on-chip cache, called the vector cache, to maintain the effective memory bandwidth rate of future vector supercomputers (Kobayashi et al. (2007)) and provided an early evaluation of the vector cache using the NEC SX simulator (Musa et al. (2007)). The vector cache employs a bypass mechanism between the vector registers and the main memory under software control, and load/store data are selectively cached. The bypass mechanism and the cache work complementarily to provide data to the register file, resulting in a high effective memory bandwidth rate. However, we did not investigate the fundamental configuration of the vector cache, i.e., its associativity and latency. Moreover, our previous work did not discuss the relationship between loop unrolling and caching.

The main contribution of this paper is to clarify the characteristics of the on-chip vector cache on vector supercomputers. We investigate the relationship among cache associativity, cache latency and performance using five practical applications selected from the areas of advanced computational sciences. In addition, we discuss in detail selective caching and loop unrolling under the vector cache to improve the performance.

The rest of the paper is organized as follows. Section 2 presents related work. Section 3 describes a vector architecture with an on-chip vector cache. Section 4 provides our experimental methodology and benchmark programs for performance evaluation. Section 5 presents experimental results from executing the kernel loops and five real applications on our architecture; through these results, we clarify the characteristics of the vector cache. In Section 6, we examine the effect of selective caching on the performance, in which only the data with high locality of reference are cached, and we discuss the relationship between loop unrolling and the vector cache. Finally, Section 7 summarizes the paper with some future work.

2. Related Work

Vector caches were studied by a number of researchers in the 1990s using trace-driven simulators of conventional vector architectures. Gee et al. provided an evaluation of cache performance in the Cray X-MP and Ardent Titan vector architectures (Gee and Smith (1992), Gee and Smith (1994)). Their caches employ a fully associative cache and an n-way set-associative cache; the line sizes range from 16 to 128 bytes, and the maximum cache size is 4 MB. Their benchmark programs are selected from the Livermore Loops, the NAS kernels, the Los Alamos benchmark set, and real applications in chemistry, fluid dynamics and linear analysis. They showed that vector references contain somewhat less temporal locality, but large amounts of spatial locality, compared with instruction and scalar references, and that the cache improved the computational performance of the vector processors. Kontothanassis et al. evaluated the cache performance of the Cray C90 architecture using the NAS parallel benchmarks (Kontothanassis et al. (1994)). Their cache is direct-mapped with a 128-byte line size, and the maximum cache size is 8 MB. They showed that the cache reduced memory traffic to off-chip memory, and that a DRAM memory system with the cache was competitive with an SRAM memory system. The modern vector supercomputer Cray X1 has a 2 MB vector cache, called the Ecache, organized as a 2-way set-associative write-back cache with a 32-byte line size and the least recently used (LRU) replacement policy (Abts et al. (2003)).
The performance of the Cray X1 has been evaluated on real scientific applications by several researchers (Dunigan Jr. et al. (2005), Oliker et al. (2004), Shan and Strohmaier (2004)). These studies compared the performance of the Cray X1 with that of other platforms (IBM Power, SGI Altix and NEC SX); however, they did not quantitatively discuss the effect of the cache on its performance. The aim of our research is to quantitatively investigate the effect of the vector cache on the modern vector supercomputer NEC SX architecture. We clarify the basic characteristics of the vector cache and discuss its effective usage.

3. On-Chip Cache Memory for Vector Architecture

We design an on-chip vector cache for a vector processor architecture as shown in Figure 1. Our vector processor has a superscalar unit, a vector unit and an address control unit. The vector unit contains five types of vector arithmetic pipes (Mask, Logical, Multiply, Add/Shift and Divide), vector registers and a vector cache.

Figure 1. Vector architecture with vector cache and memory system block diagram.
Four parallel vector pipes are provided for each type; in total, 20 pipes are included in the vector unit. The address control unit queues and issues vector operation instructions to the vector unit. The vector operation instructions comprise vector memory instructions and vector computational instructions, and each instruction concurrently deals with 256 double-precision floating-point data.

The main memory unit employs an interleaved memory system. The main memory is divided into 32 parts, each of which operates independently and supplies data to the vector processor; thus, the main memory and the vector processor are interconnected through 32 memory ports. Moreover, we implement the vector cache in each memory part: the vector cache consists of 32 sub-caches, and each sub-cache is a non-blocking cache with a bypass mechanism between the vector register file and the main memory. The bypass mechanism is software-controlled, and vector load/store instructions carry a flag for cache control, i.e., cache on/off. When the flag indicates "cache on," the data supplied by the vector load/store instruction are cached. When the flag is "cache off," the vector load/store instruction provides data via the bypass mechanism, and the data are not cached. The sub-cache is a set-associative write-through cache with the LRU replacement policy. The line size is 8 bytes, the unit of memory access.

4. Experimental Environment

4.1 Methodology

We use a trace-driven simulator that models the behavior of the proposed vector architecture at the register transfer level. It is based on the NEC SX simulator, which accurately models a single processor of the SX architecture: the vector unit, the scalar unit and the memory system. The simulator takes a system parameter file and a trace file as input, and its output contains the instruction cycle counts of a benchmark program and cache hit information. The system parameter file holds the configuration parameters of a vector architecture and the setting parameters, i.e., the cache size, the associativity, the cache latency and the memory bandwidth. The trace file contains the instruction sequence of a benchmark program and directives for cache control (cache on/off). These directives are used in selective caching and set the cache control flags in the vector load/store instructions. A benchmark program is compiled by the NEC FORTRAN compiler FORTRAN90/SX, which supports ANSI/ISO Fortran95 in addition to automatic vectorization and automatic parallelization. The executable program runs on the SX trace generator to produce the trace file. In this work, we use two kernel loops and the original source codes of five applications, and compile them with the highest optimization option (-C hopt) and subroutine inlining.

To evaluate on-chip vector caching for future vector processors with a higher flop/s rate but a relatively lower off-chip memory bandwidth, we evaluate the effect of the cache on the vector processor by limiting its memory bandwidth per flop/s rate from 4 B/FLOP down to 1 B/FLOP. Here, we adopt the NEC SX-7 architecture (Kitagawa et al. (2003)) for our vector processor. The setting parameters are shown in Table 1.

4.2 Benchmark Programs

To clarify the basic characteristics and validity of the vector cache, we select basic kernel loops and five leading applications. In this section, we describe our benchmark programs in detail. The following kernel loops are used to evaluate the performance of the vector processor with the cache.
A(i) = X(i) + Y(i)    (1)
A(i) = X(i) × Y(i) + Z(i)    (2)

Table 1. Summary of setting parameters.
  Base System Architecture: NEC SX-7
  Main Memory: DDR-SDRAM
  Vector Cache: SRAM
  Total Size (Sub-cache Size): 256 KB-8 MB (8 KB-256 KB)
  Cache Policy: LRU, write-through
  Associativity: 2-way, 4-way, 8-way
  Cache Latency: 15%, 50%, 85%, 100% (rate of main memory latency)
  Cache Bank Cycle: 5% of memory cycle
  Line Size: 8 B
  Memory-Cache bandwidth per flop/s: 1 B/FLOP, 2 B/FLOP, 4 B/FLOP
  Cache-Register bandwidth per flop/s: 4 B/FLOP
  Note: The absolute values of memory-related parameters such as latency and bank cycle cannot be disclosed in this paper because of their confidentiality.
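For reference, Kernels (1) and (2) correspond to the following Fortran loops. This is our sketch; the subroutine wrapper and the name kernels are ours, added only to make the fragment self-contained, with the loop length n set per experiment (e.g., n = 25600 in Section 5.2).

  subroutine kernels(n, a, x, y, z)
    integer :: n, i
    real(8) :: a(n), x(n), y(n), z(n)
    ! Kernel (1): A(i) = X(i) + Y(i), two load streams, one add
    do i = 1, n
       a(i) = x(i) + y(i)
    end do
    ! Kernel (2): A(i) = X(i)*Y(i) + Z(i), three load streams, one multiply-add
    do i = 1, n
       a(i) = x(i)*y(i) + z(i)
    end do
  end subroutine kernels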
As leading applications, we have selected five applications in three scientific computing areas. The applications have been developed by researchers of Tohoku University for the SX-7 system and are representative of each research area. Table 2 summarizes the evaluated applications, whose methods are standard in their individual areas (Kobayashi (2006)).

The GPR simulation code evaluates the performance of SAR-GPR (Synthetic Aperture Radar-Ground Penetrating Radar) in the detection of buried anti-personnel mines under conditions of inhomogeneous subsurface media (Kobayashi et al. (2003)). The simulation method is based on the three-dimensional FDTD (Finite Difference Time Domain) method (Kunz and Luebbers (1993)). The vector operation ratio is 99.7% and the vector loop length is over 500.

The APFA simulation is applied to the design of a high-gain antenna, an Anti-Podal Fermi Antenna (APFA) (Takagi et al. (2004)). The simulation calculates the directional characteristics of the APFA using the FDTD method and the Fourier transform. The vector operation ratio is 99.9% and the vector loop length is 255.

The PRF simulation provides numerical simulations of two-dimensional Premixed Reactive Flow (PRF) in combustion for the design of aircraft engines (Tsuboi and Masuya (2003)). The simulation uses the sixth-order compact finite difference scheme and the third-order Runge-Kutta method for time advancement. The vector operation ratio is 99.3% and the vector loop length is 513.

The SFHT simulation realizes direct numerical simulations of three-dimensional laminar Separated Flow and Heat Transfer (SFHT) over plane surfaces (Nakajima et al. (2005)). The finite-difference forms are the fifth-order upwind difference scheme for the space derivatives and the Crank-Nicolson method for the time derivative, and the resulting finite-difference equations are solved using the SMAC method. The vector operation ratio is 99.4% and the vector loop length is 349.

The PBM simulation uses three-dimensional numerical Plate Boundary Models (PBM) to explain an observed variation in the propagation speed of postseismic slip (Ariyoshi et al. (2007)). This is a quasi-static simulation in an elastic half-space including a rate- and state-dependent friction law. The vector operation ratio is 99.5% and the vector loop length is 32400.

5. Performance Evaluation of Vector Cache

We evaluate the performance of the proposed vector processor architecture using the benchmark programs and clarify the basic characteristics of the vector cache.

5.1 Impact of Memory Bandwidth on Effective Performance

To evaluate the impact of the memory bandwidth on the performance of the vector processor without the vector cache, we vary the memory bandwidth per flop/s rate from 4 B/FLOP down to 1 B/FLOP. Figure 2 shows the computation efficiency, the ratio of the sustained performance to the peak performance, in the execution of the benchmark programs. Here, the 4 B/FLOP case is equal to the memory bandwidth per flop/s rate of the SX-8 and earlier systems, 2 B/FLOP is the same as the Cray X1 and BlackWidow (Abts et al. (2007)), and 1 B/FLOP is the same level as commodity-based high performance scalar systems such as the NEC TX7 series (Senta et al. (2003)) and the SGI Altix series (SGI (2007)). Figure 2 indicates that the computational performance is seriously affected by the B/FLOP rate.
In particular, the performances of GPR and PBM, which are memory-intensive programs, are halved and quartered when the memory bandwidth per flop/s rate is reduced from 4 B/FLOP to 2 B/FLOP and 1 B/FLOP, respectively. Even though APFA is computation-intensive, its performance on the 1 B/FLOP system decreases to 69% of that on the 4 B/FLOP system. Therefore, the vector supercomputer needs a memory bandwidth per flop/s rate of 4 B/FLOP.

Table 2. Summary of benchmark programs.
  Area | Name and Description | Method | Subdivision | Memory Size
  Electromagnetic Analysis | GPR simulation: simulation of array-antenna ground penetrating radar | FDTD | 50 × 750 × 750 | 8.9 GB
  Electromagnetic Analysis | APFA simulation: simulation of anti-podal Fermi antenna | FDTD | 612 × 105 × 505 | 12 GB
  CFD/Heat Analysis | PRF simulation: simulation of premixed reactive flow in combustion | DNS | 513 × 513 | 1.4 GB
  CFD/Heat Analysis | SFHT simulation: simulation of separated flow and heat transfer | SMAC | 711 × 91 × 221 | 6.6 GB
  Seismology | PBM simulation: simulation of plate boundary model on seismic slow slip | Friction Law | 32400 × 32400 | 8 GB
5.2 Relationship between Efficiency and Cache Hit Rate on Kernel Loops

We simulate the execution of kernel loop (1) on the vector processor with the vector cache. Since this loop has no temporal locality, we store a part of the X(i) and Y(i) data in the vector cache in advance of the loop execution. We select the data for caching in the following two ways to vary the cache hit rate, and examine the effects on performance. One way is to store both X(i) and Y(i) with the same index i in the cache, with the load instructions of both X(i) and Y(i) set to "cache on"; this is Case 1. The other is to cache only X(i), with the load instructions of X(i) and Y(i) set to "cache on" and "cache off," respectively; this is Case 2. Here, the range of index i is 1 ≤ i ≤ 25600.

Figure 3 shows the relationship between the cache hit rate and the relative memory bandwidth, which is obtained by normalizing the effective memory bandwidth of the systems with the vector cache by that of the 4 B/FLOP system without the cache. Figure 4 shows the relationship between the cache hit rate and the computation efficiency. Figure 3 indicates that the relative memory bandwidth of each case increases as the cache hit rate increases; the vector cache is therefore a promising means to cover a lack of memory bandwidth. Similarly to Figure 3, the computation efficiency of each case in Figure 4 improves as the cache hit rate increases, and Figures 3 and 4 show the correlation between the relative memory bandwidth and the computation efficiency.

On both the 2 B/FLOP and 1 B/FLOP systems, the performance of Case 2 is greater than that of Case 1. In particular, Case 2 with a 50% cache hit rate on the 2 B/FLOP system achieves the same performance as the 4 B/FLOP system. On the 4 B/FLOP system, each of X(i) and Y(i) is provided at a 2 B/FLOP bandwidth rate on average. When the cache hit rate is 50% in Case 2, all data of X(i) are supplied directly from the vector cache and all data of Y(i) are supplied from the memory at the 2 B/FLOP bandwidth rate through the bypass mechanism. Consequently, each of X(i) and Y(i) is provided to the vector registers at the 2 B/FLOP bandwidth rate, and the total amount of data provided to the vector registers per unit time is equal to that of the 4 B/FLOP system. However, the 1 B/FLOP system with a 50% cache hit rate does not achieve the performance of the 4 B/FLOP system: Y(i) is provided at a 1 B/FLOP bandwidth rate, so the total bandwidth rate at the vector registers does not reach 4 B/FLOP. On the other hand, in Case 1, the data of X(i) and Y(i) are supplied from either the vector cache or the memory with the same index i, and while the data are supplied from the memory, X(i) and Y(i) have to share the memory bandwidth. Therefore, the 1 B/FLOP and 2 B/FLOP systems need a cache hit rate of 100% to achieve a performance comparable to that of the 4 B/FLOP system.

Figure 2. Effect of memory bandwidth per flop/s rate.
Figure 3. Relationship between cache hit rate and relative memory bandwidth on Kernel loop (1).
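The Case 2 behavior fits a simple first-order model (our reading of the above discussion, not a formula from the measurements): suppose hits are served from the cache concurrently with misses streamed from memory at the full rate of b B/FLOP, and the cache-register path is capped at 4 B/FLOP. For hit rate h, the register-side bandwidth relative to the 4 B/FLOP baseline is then roughly

\[ \tilde{B}(h) \;\approx\; \frac{1}{4}\,\min\!\left(\frac{b}{1-h},\; 4\right). \]

With b = 2 and h = 0.5 this gives \(\tilde{B} = 1\), matching the observation that Case 2 reaches the 4 B/FLOP system at a 50% hit rate; with b = 1 and h = 0.5 it gives 0.5, which is why the 1 B/FLOP system falls short.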
Similarly, we examine the performance of Kernel (2) on the 2 B/FLOP system in the following three cases. In Case 1, all of X(i), Y(i) and Z(i) are cached with the same index i in advance. In Case 2, both X(i) and Y(i) are provided from the vector cache, so the maximum cache hit rate is 66%. In Case 3, only X(i) is in the cache, so the maximum cache hit rate is 33%. Figure 5 shows the relationship between the cache hit rate and the relative memory bandwidth for Kernel (2). On the 4 B/FLOP system, each of X(i), Y(i) and Z(i) is provided at a 4/3 B/FLOP bandwidth rate on average. The performance of Case 2 is comparable to that of the 4 B/FLOP system when the cache hit rate is 66%, because the bandwidth per flop/s rate of each array is then 4/3 B/FLOP on average. In Case 3, however, both Y(i) and Z(i) are provided from the memory and have to share the 2 B/FLOP bandwidth rate at the vector registers; as a result, the relative memory bandwidth never reaches that of the 4 B/FLOP system.

These results indicate that the vector cache has the potential to cover the shortage of memory bandwidth in future SX systems. In addition, the discussions on Figures 3 and 5 show that not all data need to be cached: the key data determining the performance should be cached to make good use of the limited on-chip capacity.

5.3 Relationship between Efficiency and Cache Hit Rate on Five Scientific Applications

We simulate the execution of the five scientific applications with the vector cache at memory bandwidth per flop/s rates of 1 B/FLOP and 2 B/FLOP. Here, all vector load/store instructions are set to "cache on." Figure 6 shows the relationship between the cache hit rate of the 8 MB vector cache and the relative memory bandwidth. Figure 6 indicates that the vector cache can improve the effective memory bandwidth, depending on the cache hit rate of the application. Cache hit rates vary from 13% in SFHT to 96% in APFA, because they depend on the locality of reference in the individual applications. In APFA, the relative memory bandwidth is 0.96 on the 2 B/FLOP system with an almost 100% hit rate in the 8 MB cache, resulting in an effective memory bandwidth almost equal to that of the 4 B/FLOP system; in this case, most data are provided from the cache.

Figure 4. Relationship between cache hit rate and computation efficiency on Kernel loop (1).
Figure 5. Relationship between cache hit rate and relative memory bandwidth on Kernel loop (2).
PBM on the 2 B/FLOP system reaches the effective memory bandwidth of the 4 B/FLOP system at a cache hit rate of 50%. Figure 7 shows the main routine of PBM. The inner loop, of index j, is vectorized, and the arrays gd_dip and wary are stored in the vector cache by vector load instructions. However, gd_dip is spilled from the cache, because it needs 7.6 GB. On the other hand, wary can be held in the cache, because it needs only 250 KB and the array is defined in the preceding loop. Therefore, the cache hit rate of this loop is 50%. All data of wary are then supplied directly from the vector cache at the 2 B/FLOP rate and all data of gd_dip are supplied from the memory at the 2 B/FLOP rate through the cache, so the total amount of data provided to the vector registers per unit time is equal to that of the 4 B/FLOP system. The relative memory bandwidths of the other applications, SFHT, GPR and PRF, are over 0.6 on the 2 B/FLOP system and 0.3 on the 1 B/FLOP system. Since the relative memory bandwidth without the cache is 0.5 on the 2 B/FLOP system and 0.25 on the 1 B/FLOP system, the improvement of the relative memory bandwidth is over 20% with the 8 MB cache.

Figure 8 shows the computation efficiency of the five scientific applications on the 4 B/FLOP system without the cache and on the 2 B/FLOP and 1 B/FLOP systems with and without the 8 MB cache. Figure 8 indicates that the efficiency of the 2 B/FLOP and 1 B/FLOP systems is increased by the cache; in particular, the 2 B/FLOP system with the cache achieves the same efficiency as the 4 B/FLOP system in APFA and PBM. Figure 9 shows the relationship between the cache hit rate of the 8 MB cache and the recovery rate of the performance by the cache from the 2 B/FLOP and 1 B/FLOP systems to the 4 B/FLOP system. The recovery rate goes up as the cache hit rate increases: the vector cache can increase the recovery rate by 18 to 99% on the 2 B/FLOP system and by 9 to 96% on the 1 B/FLOP system, depending on the data access locality of each application. These results indicate that the vector cache has the potential to cover the shortage of memory bandwidth of future SX systems on real scientific applications.

5.4 Relationship between Associativity and Cache Hit Rate

In this section, we examine the effect of the associativity, ranging from 2 to 8 ways, on the performance. Figure 10 shows the cache hit rate of each set-associative cache, covering cache sizes from 256 KB to 8 MB. Four applications, APFA, PRF, SFHT and PBM, have approximately constant cache hit rates across the three associativity cases. In GPR, the cache hit rate varies slightly with the associativity. The cache hit rate depends on the placement of the arrays in the memory address space, the memory access patterns of the program, and the associativity. The basic computational structure of GPR consists of triply nested loops that access many arrays with a 584-byte stride, so the cached data are frequently replaced; therefore, the cache hit rate varies with the cache capacity and the associativity.

Figure 6. Relationship between cache hit rate and relative memory bandwidth on five applications.

01 do i=1,ncells
02   do j=1,ncells
03     wf_dip(i)=wf_dip(i)+gd_dip(j,i)*wary(j)
04   end do
05 end do
Figure 7. Main routine of PBM.
In SFHT, the memory accesses have an 8-byte stride. In the 2 MB case, cached data that would be reused on the 2-way set-associative cache are evicted by other data; this is due to the placement of the arrays in the memory address space. The other programs have constant cache hit rates across the three associativity cases: their memory accesses have 8-byte or 16-byte strides, and the placement of the arrays in memory does not cause conflicts among reused data. These results show that the associativity hardly affects the cache hit rate, and that the cache capacity is the main influence on the cache hit rate across the five applications.

5.5 Effects of Vector Cache Latency on Performance

In general, the latency of an on-chip cache is considerably shorter than that of off-chip memory. We investigate the effects of the cache access latency on the performance of the five applications. Figure 11 shows the relationship between the computation efficiency of the applications and four cache latencies, 15, 50, 85 and 100% of the memory access latency, on the 2 B/FLOP system with the 2 MB cache. The five applications have almost the same efficiency across the cache latencies. This is because the vector architecture can hide the memory access time by pipelined vector operations when the vector operation ratio and the vector loop length of an application are large enough. In addition, the vector cache consists of 32 sub-caches between the vector register file and the main memory, one per memory port. The sub-caches connected to the 32 memory ports provide multiple words at a time, and the cache latency of the second and later memory accesses is hidden in the sub-cache system.

For comparison, we examine the performance at shorter loop lengths using Kernel loops (1) and (2) on the 2 B/FLOP system while changing the cache access latency. Figure 12 shows the relationship between the computation efficiency of the kernel loops and the four cache latency cases. Here, we examine three loop lengths at a 100% cache hit rate: 1 ≤ i ≤ 64, 1 ≤ i ≤ 128 and 1 ≤ i ≤ 256. For the shorter loop lengths, Figure 12 indicates that the vector cache latency is an important factor in the performance. In particular, at the loop length of 64, the 15% latency case has twice the efficiency of the 100% case in Kernel (1) and 12% higher efficiency in Kernel (2).

Figure 8. Computation efficiency of five applications with/without 8 MB vector cache.
Figure 9. Recovery rate of performance on 8 MB cache.
However, the longer the loop length, the smaller the effect of the cache latency. Figure 13 shows the computation efficiency of the kernel loops on the 4 B/FLOP and 2 B/FLOP systems with and without the vector cache. For both loop lengths 64 and 128 of Kernel (1), the efficiency of the 4 B/FLOP system with the cache is higher than that of the 4 B/FLOP system without the cache, and the loop length 128 case of the 4 B/FLOP system with the cache achieves the highest efficiency. Meanwhile, in Kernel (2), the loop length 64 case of the 4 B/FLOP system with the cache has the highest efficiency. This is because the ratio of memory access time to processing time becomes the smallest due to the decrease in memory latency provided by the cache; in these kernel loops, the memory access time at the shorter loop lengths cannot be entirely hidden by pipelined vector operations alone.

Figure 10. Cache hit rate vs. associativity on five applications.
Figure 11. Computation efficiency of five applications on four cache latency cases.
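A first-order vector timing model (again ours, not from the paper) makes this loop-length dependence concrete: a vector memory access of length n costs roughly a startup latency L plus n/r cycles of streaming at a rate of r elements per cycle, so the latency's share of the access time is

\[ \frac{L}{L + n/r}, \]

which is substantial at n = 64 but small by n = 256. Shortening L, which is exactly what a low-latency cache does, therefore pays off mainly at short loop lengths.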
Therefore, the performance at shorter loop lengths is sensitive to the vector cache latency, and the vector cache can boost the performance of applications with short vector lengths even on the 4 B/FLOP system. Moreover, the cache latency is not hidden when bank conflicts occur; in this work, bank conflicts hardly occur in the five applications, and a quantitative discussion of their influence is left for future work.

6. Optimizations for Vector Caching: Selective Caching and Loop Unrolling

In this section, we discuss optimization techniques for the vector cache to reduce cache miss rates in scientific applications. In particular, we show the relationship between loop unrolling and caching; many compilers automatically optimize programs using loop unrolling, the most basic loop optimization.

6.1 Effects of Selective Caching

The vector cache has a bypass mechanism, so data can be cached selectively: selective caching. The simulation methods of four of the scientific applications, GPR, APFA, PRF and SFHT, are based on difference schemes, the Finite Difference Time Domain (FDTD) method and Direct Numerical Simulation of flow and heat (DNS), in which a part of the arrays has high locality in many cases. As the on-chip cache size is limited, selective caching may be an effective approach to efficient use of the on-chip cache. We evaluate selective caching on two applications, GPR and PRF, which are typical examples of FDTD and DNS, respectively.

Figure 14 shows one of the kernel loops of GPR; the loop is a main routine of the FDTD method, and many routines of GPR are similar to it. The innermost loop is indexed with j. Usually, FORTRAN compilers interchange loops j and i; however, the FORTRAN90/SX compiler does not interchange these loops, because it decides that the computational performance cannot be improved due to the short length of loop i. Loop i has a length of 50, and loop j a length of 750. As the size of each array used in GPR is 330 MB, the cache cannot store all the arrays. However, the difference scheme shown in Figure 14 generally has temporal locality in the accesses to the arrays H_x, H_y and H_z, and we select these arrays for selective caching. Figure 15 shows the effect of selective caching in GPR on the 2 B/FLOP system with the vector cache, and Figure 16 shows the recovery rate of the performance from the 2 B/FLOP system without the cache to the 4 B/FLOP system by selective caching in GPR. We discuss cache sizes from 256 KB up to 8 MB. Here, "ALL" in the figures indicates that all the arrays in the loop are cached, "Selective" indicates that some of the arrays are selectively cached, and "Cache Usage Rate" indicates the ratio of the number of cache hit references to the total number of memory references.

Figure 12. Computation efficiency of two kernel loops on four cache latency cases.
Figure 13. Computation efficiency of two kernel loops on 4 B/FLOP and 2 B/FLOP.
In the 256 KB cache case, the efficiencies of "ALL" and "Selective" are 33.3 and 34.1%, and the cache usage rates are 9.6 and 16.6%, respectively. The arrays H_y(i,j,k), H_y(i-1,j,k) (line 08 in Figure 14), H_x(i,j,k) (line 11) and H_z(i-1,j,k) (line 12) can load data from the cache under "Selective." Under "ALL," H_y(i-1,j,k) (line 08) and H_z(i-1,j,k) (line 12) cause cache misses, because the reused data of these arrays are replaced by other array data.

01 DO 10 k=0,Nz
02 DO 10 i=0,Nx
03 DO 10 j=0,Ny
04   E_x(i,j,k) = C_x_a(i,j,k)*E_x(i,j,k)
05  &           + C_x_b(i,j,k) * ((H_z(i,j,k) - H_z(i,j-1,k))/dy
06  &           - (H_y(i,j,k) - H_y(i,j,k-1))/dz - E_x_Current(i,j,k))
07   E_z(i,j,k) = C_z_a(i,j,k)*E_z(i,j,k)
08  &           + C_z_b(i,j,k) * ((H_y(i,j,k) - H_y(i-1,j,k))/dx
09  &           - (H_x(i,j,k) - H_x(i,j-1,k))/dy - E_z_Current(i,j,k))
10   E_y(i,j,k) = C_y_a(i,j,k)*E_y(i,j,k)
11  &           + C_y_b(i,j,k) * ((H_x(i,j,k) - H_x(i,j,k-1))/dz
12  &           - (H_z(i,j,k) - H_z(i-1,j,k))/dx - E_y_Current(i,j,k))
13 10 CONTINUE
Figure 14. A kernel loop of GPR.

Figure 15. Efficiency of selective caching and cache hit rate on 2 B/FLOP system (GPR).
Figure 16. Recovery rate of performance and cache hit rate on selective caching (GPR).
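In our environment, the cache on/off flags are attached to individual vector load/store instructions through directives in the trace file (Section 4.1); the paper does not expose a source-level syntax for them. Purely as an illustration of the idea on a one-dimensional analogue of the Figure 14 update, a hypothetical per-array directive might look as follows. The !dir$ vector_cache form is invented for this sketch and is not FORTRAN90/SX syntax; as a comment line, it is ignored by any Fortran compiler.

  subroutine update_e(e, h, c, ny, dy)
    integer :: ny, j
    real(8) :: e(0:ny), h(-1:ny), c(0:ny), dy
    !dir$ vector_cache on  (h)    ! hypothetical: cache h, reused at j and j-1
    !dir$ vector_cache off (e, c) ! hypothetical: bypass the single-use arrays
    do j = 0, ny
       e(j) = c(j)*e(j) + (h(j) - h(j-1))/dy
    end do
  end subroutine update_e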
Figures 15 and 16 indicate that the performance and the cache usage rate increase as the cache size increases. With the 4 MB cache, the cache usage rate of "Selective" is 24% and its efficiency is 3% higher than that of "ALL"; the recovery rate of performance is a mere 34%. Thus, GPR cannot achieve high cache usage rates and recovery rates of performance by selective caching, because this loop has many arrays without locality: C_x_a, E_x, E_x_Current, etc.

Figure 17 shows one of the kernel loops of PRF; the loop is a high-cost routine. The size of each array in PRF is 18 MB, and many loops are doubly nested. Thus, the arrays are treated as two-dimensional arrays, and the size of cached data per array is only 2 MB. The arrays wDY1YC1, wDY1YC2 and wDY1YC3 in Figure 17 are defined in the preceding loop, and wDY1YC1 and wDY1YC2 have spatial locality. We select these arrays for selective caching. Here, wDY1YC3(I,KK-1,L) (line 04 in Figure 17), wDY1YC2(I,KK,L) (line 05), wDY1YC1(I,KK,L) (line 06) and wDY1YC2(I,KK,L) (line 06) can reuse data in the register files; thus, these references do not access the cache.

Figures 18 and 19 show the efficiency and the recovery rate of the performance from the 2 B/FLOP system to the 4 B/FLOP system by selective caching in PRF. Figure 18 indicates that the cache hit rate and the efficiency of "Selective" are the same as those of "ALL" up to the 2 MB cache size; wDY1YC2(I,KK-1,L) (line 03) and wDY1YC1(I,KK-1,L) (line 08) are the cache-hit arrays in each case, and the cache hit rate is 33%. At cache sizes of 4 MB and above, "Selective" achieves a 66% cache hit rate, resulting in an efficiency of more than 30%; moreover, Figure 19 indicates that the recovery rate is 78%.

The results of GPR and PRF show that selective caching improves the performance of the applications. Compared with GPR, PRF has higher values of both the cache usage rate and the improved efficiency, because the ratio of arrays with locality per loop is higher in PRF than in GPR. In GPR, the reused data are defined only within the loop itself, so cold misses always occur in this loop: for example, the reference H_y(i,j,k) (line 08 in Figure 14) hits the data loaded by H_y(i,j,k) (line 06), but H_y(i,j,k) (line 06) cannot hit because it is the first access (cold miss). On the other hand, the cached data of PRF are produced in the immediately preceding loop, so such first-access misses do not occur in PRF.

01 DO KK = 2,NJ
02   DO I = 1,NI
03     wDY1YC3(I,KK-1,L) = wDY1YC3(I,KK-1,L) * wDY1YC2(I,KK-1,L)
04     wDY1YC2(I,KK,L) = wDY1YC2(I,KK,L) - wDY1YC1(I,KK,L) * wDY1YC3(I,KK-1,L)
05     wDY1YC2(I,KK,L) = 1.D0 / wDY1YC2(I,KK,L)
06     wDY1YC(I,KK,L) = (wPHIY12(I,KK) - wDY1YC1(I,KK,L) * wDY1YC1(I,KK-1,L)) * wDY1YC2(I,KK,L)
07   END DO
08 END DO
Figure 17. High-cost kernel loop of PRF.

Figure 18. Efficiency and cache hit rate of selective caching on performance (PRF).
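As a sanity check (ours, using the 513 × 513 subdivision from Table 2), one double-precision plane of a PRF array occupies

\[ 513 \times 513 \times 8\ \text{B} \approx 2.1\ \text{MB}, \]

which matches the 2 MB of cached data per array quoted above and is consistent with the "Selective" hit rate rising only once a few such planes fit in the cache, i.e., in the 4 MB and 8 MB cases of Figure 18.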
6.2 Effects of Loop-unrolling and Caching

Loop unrolling is effective for higher utilization of the vector pipelines; however, it also needs more cache capacity to hold all the arrays of the unrolled loops. Therefore, we discuss the relationship between loop unrolling and vector caching on the performance. Figure 20 shows a code outer-unrolled four times, whose original code is shown in Figure 7. In the outer-unrolling case, the array wary at outer loop index i (line 03 in Figure 20) reuses the data of index i-1 in the cache, and the wary references at lines 04, 05 and 06 reuse the data of wary (line 03) in the register files; hence, memory references are reduced. However, the cache capacity needed for gd_dip increases, due to the increase in the number of array references in the innermost loop.

Figure 21 shows the efficiency and the cache hit rate of the outer-unrolling case in PBM on the 2 B/FLOP system. "2B/F" indicates the case without the cache; its efficiency increases steadily as the degree of unrolling increases. The cache hit rate of the 0.5 MB cache is 49% in the non-unrolled case and becomes poor as the loop unrolling proceeds, because the cached data of wary are evicted from the cache by gd_dip. Thus, the efficiency of the 0.5 MB cache case reaches 49% without unrolling, which is similar to that of the 2 B/FLOP system with unrolling; here, a conflicting effect between caching and unrolling appears. Moreover, the cache hit rate of the 8 MB cache case also decreases as the degree of unrolling increases: the cached data remain in the cache, but the hit rate decreases due to the decrease in the number of wary (line 03) references. However, the efficiency stays approximately constant at 48 or 49%, because unrolling compensates for the diminishing contribution of caching. This case shows that the effects of caching are comparable to those of unrolling.

Figure 22 shows a code obtained by unrolling the code of Figure 17 two times. Outer unrolling reduces the memory references of the arrays wDY1YC2(I,KK,L) (line 07 in Figure 22) and wDY1YC1(I,KK,L) (line 10). Figure 23 shows the efficiency and the cache hit rate of outer unrolling in PRF on the 2 B/FLOP system. The cache hit rates of both the 0.5 and 8 MB caches gradually decrease as the degree of unrolling increases, due to the rising cache miss rate. In this case, the highest efficiency is 33% on the 8 MB cache with an unrolling degree of 4, since the gain from loop unrolling covers the loss due to the increased miss rate. This is higher than the selective caching case, which is 30.7% (cf. Figure 18). Thus, this case shows a synergistic effect of caching and unrolling.

Figure 19. Recovery rate of performance and cache hit rate on selective caching (PRF).

01 do i=1,ncells,4
02   do j=1,ncells
03     wf_dip(i)=wf_dip(i)+gd_dip(j,i)*wary(j)
04     wf_dip(i+1)=wf_dip(i+1)+gd_dip(j,i+1)*wary(j)
05     wf_dip(i+2)=wf_dip(i+2)+gd_dip(j,i+2)*wary(j)
06     wf_dip(i+3)=wf_dip(i+3)+gd_dip(j,i+3)*wary(j)
07   end do
08 end do
Figure 20. Outer unrolling loop of PBM.
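The eviction of wary on the 0.5 MB cache can be checked with a rough working-set estimate (ours, assuming one inner-loop sweep must keep wary plus the u active columns of gd_dip resident, with n = ncells = 32400 eight-byte elements per column):

\[ W(u) \approx (u+1)\,n \cdot 8\ \text{B}; \quad W(1) \approx 0.52\ \text{MB},\ \ W(4) \approx 1.3\ \text{MB},\ \ W(16) \approx 4.4\ \text{MB}. \]

Even the non-unrolled loop sits at the 0.5 MB boundary, and any unrolling pushes the working set past it, evicting wary; the 8 MB cache, by contrast, accommodates the working set at all unroll degrees studied.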
These results indicate that loop unrolling has both conflicting and synergistic effects with caching. Loop unrolling is the most basic optimization and has beneficial effects in various applications. Therefore, caching has to be used carefully, as a tuning option complementary to loop unrolling.

Figure 21. Efficiency and cache hit rate of outer unrolling on performance (PBM).

01 DO KK = 2,NJ,2
02   DO I = 1,NI
03     wDY1YC3(I,KK-1,L) = wDY1YC3(I,KK-1,L) * wDY1YC2(I,KK-1,L)
04     wDY1YC2(I,KK,L) = wDY1YC2(I,KK,L) - wDY1YC(I,KK,L) * wDY1YC3(I,KK-1,L)
05     wDY1YC2(I,KK,L) = 1.D0 / wDY1YC2(I,KK,L)
06     wDY1YC(I,KK,L) = (wPHIY12(I,KK) - wDY1YC1(I,KK,L) * wDY1YC1(I,KK-1,L)) * wDY1YC2(I,KK,L)
07     wDY1YC3(I,KK,L) = wDY1YC3(I,KK,L) * wDY1YC2(I,KK,L)
08     wDY1YC2(I,KK+1,L) = wDY1YC2(I,KK+1,L) - wDY1YC(I,KK+1,L) * wDY1YC3(I,KK,L)
09     wDY1YC2(I,KK+1,L) = 1.D0 / wDY1YC2(I,KK+1,L)
10     wDY1YC(I,KK+1,L) = (wPHIY12(I,KK+1) - wDY1YC1(I,KK+1,L) * wDY1YC1(I,KK,L)) * wDY1YC2(I,KK+1,L)
11   END DO
12 END DO
Figure 22. Outer unrolling loop of PRF.

Figure 23. Efficiency and cache hit rate of outer unrolling on performance (PRF).

7. Conclusions

This paper has presented the characteristics of an on-chip vector cache with a bypass mechanism on the vector supercomputer NEC SX architecture. We have demonstrated the relationship between the cache hit rate and the performance using two DAXPY-like loops.
Moreover, we have shown that the same tendency holds for the five scientific applications. The vector cache can recover the lack of memory bandwidth and boosts the computation efficiency of the 2 B/FLOP and 1 B/FLOP systems. In particular, when cache hit rates are 50% or more, the 2 B/FLOP system can achieve a performance comparable to that of the 4 B/FLOP system. The vector cache with the bypass mechanism can provide data from both the memory and the cache at the same time, and the sustained memory bandwidth at the registers is thereby increased. Therefore, the vector cache has the potential to cover the shortage of memory bandwidth of future SX systems.

We have also discussed the relationship between performance and cache design parameters such as cache associativity and cache latency. We examined the effect of the cache associativity from 2-way to 8-way: when the associativity is two or more, the cache hit rate is not sensitive to it in the five applications. In addition, we examined the effect of the cache latency on the performance when changing it to 15, 50, 85 and 100% of the memory latency. The computational efficiencies of the five applications are constant across these latency changes when the vector loop lengths of the applications are 256 or more; in these cases, the latency is hidden by pipelined vector operations. However, at shorter vector loop lengths the cache latency affects the performance, and the 15% latency case of Kernel (1) has twice the efficiency of the 100% case on the 2 B/FLOP system. In addition, the 4 B/FLOP system is also boosted by the short latency of the cache. These results indicate that the performance of the cache does not largely depend on the cache associativity, whereas the cache latency affects the performance of applications with short vector loop lengths.

Finally, we have discussed selective caching and the relationship between loop unrolling and caching. We have shown that selective caching, controlled by means of the bypass mechanism, is effective for efficient use of the limited on-chip cache. We examined two cases, in which the ratio of arrays with locality per loop is higher and lower; in each case, higher performance is obtained by selective caching than by caching all data. In addition, loop unrolling is useful for improving performance, and caching is complementary to its effect. In future work, we will evaluate other scientific applications using the vector cache, quantitatively discuss the relationship between bank conflicts and the cache latency, and investigate the mechanism of the vector cache on multi-core vector processors.

Acknowledgments

The authors would like to gratefully thank Dr. M. Sato, Dr. A. Hasegawa, Professor G. Masuya, Professor T. Ota and Professor K. Sawaya of Tohoku University for their advice about the applications. We also would like to thank Matsuo Aoyama and Hirofumi Akiba of NEC Software Tohoku for their assistance in the experiments.

REFERENCES

[1] Abts, D., et al., "The Cray BlackWidow: A Highly Scalable Vector Multiprocessor," In Proceedings of the ACM/IEEE SC2007 Conference (2007).
[2] Abts, D., Scott, S., and Lilja, D. J., "So Many States, So Little Time: Verifying Memory Coherence in the Cray X1," In International Parallel and Distributed Processing Symposium (April 2003).
[3] Ariyoshi, K., Matsuzawa, T., and Hasegawa, A., "The key frictional parameters controlling spatial variation in the speed of postseismic slip propagation on a subduction plate boundary," Earth Planet. Sci. Lett., 256: 136-146 (2007).
[4] Dunigan Jr., T. H., Vetter, J. S., White III, J. B., and Worley, P. H., "Performance Evaluation of the Cray X1 Distributed Shared-Memory Architecture," IEEE Micro, 25: 30-40 (2005).
[5] Gee, J. D., and Smith, A. J., "Vector Processor Caches," Technical Report UCB/CSD-92-707, University of California (October 1992).
[6] Gee, J. D., and Smith, A. J., "The Effectiveness of Caches for Vector Processors," In Proceedings of the 8th International Conference on Supercomputing 1994, pp. 333-343 (July 1994).
[7] Kitagawa, K., Tagaya, S., Hagihara, Y., and Kanoh, Y., "A Hardware Overview of SX-6 and SX-7 Supercomputer," NEC Research & Development, 44: 2-7 (2003).
[8] Kobayashi, H., "Implication of Memory Performance in Vector-Parallel and Scalar-Parallel HEC," In M. Resch et al. (Eds.), High Performance Computing on Vector Systems 2006, pp. 21-50, Springer-Verlag (2006).
[9] Kobayashi, H., Musa, A., Sato, Y., Takizawa, H., and Okabe, K., "The Potential of On-Chip Memory Systems for Future Vector Architectures," In M. Resch et al. (Eds.), High Performance Computing on Vector Systems 2007, pp. 247-264, Springer-Verlag (2007).
[10] Kobayashi, T., Feng, X., and Sato, M., "FDTD simulation on array antenna SAR-GPR for land mine detection," In Proceedings of SSR2003: 1st International Symposium on Systems and Human Science, Osaka, Japan, pp. 279-283 (November 2003).
[11] Kontothanassis, L. I., Sugumar, R. A., Faanes, G. J., Smith, J. E., and Scott, M. L., "Cache Performance in Vector Supercomputers," In Proceedings of the 1994 Conference on Supercomputing, Washington, D.C., pp. 255-264 (1994).
[12] Kunz, K. S., and Luebbers, R. J., The Finite Difference Time Domain Method for Electromagnetics, CRC Press, Boca Raton, FL (1993).
[13] Musa, A., Sato, Y., Egawa, R., Takizawa, H., Okabe, K., and Kobayashi, H., "An On-Chip Cache Design for Vector Processors," In Proceedings of the 8th MEDEA Workshop on Memory Performance: Dealing with Applications, Systems and Architecture, Romania, pp. 17-23 (2007).
[14] Musa, A., Takizawa, H., Okabe, K., Soga, T., and Kobayashi, H., "Implications of Memory Performance for Highly Efficient Supercomputing of Scientific Applications," In Proceedings of the 4th International Symposium on Parallel and Distributed Processing and Applications 2006, Italy, pp. 845-858 (2006).
[15] Nakajima, M., Yanaoka, H., Yoshikawa, H., and Ota, T., "Numerical Simulation of Three-Dimensional Separated Flow and Heat Transfer around Staggered Surface-Mounted Rectangular Blocks in a Channel," Numerical Heat Transfer (Part A), 47: 691-708 (2005).
[16] Oliker, L., et al., "A Performance Evaluation of the Cray X1 for Scientific Applications," In VECPAR'04: 6th International Meeting on High Performance Computing for Computational Science, Valencia, Spain (June 2004).
[17] Senta, T., Takahashi, A., Kadoi, T., Itoh, T., Kawaguchi, S., and Kishida, Y., "Itanium2 32-way Server System Architecture," NEC Research & Development, 44: 8-12 (2003).
[18] SGI, "SGI Altix 4700 Servers and Supercomputers," Technical Report, SGI Inc., http://www.sgi.com/pdfs/3867.pdf (2007).
[19] Shan, H., and Strohmaier, E., "Performance Characteristics of the Cray X1 and Their Implications for Application Performance Tuning," In 18th Annual International Conference on Supercomputing, Saint-Malo, France, pp. 175-183 (2004).
[20] Takagi, Y., Sato, H., Wagatsuma, Y., Mizuno, K., and Sawaya, K., "Study of High Gain and Broadband Antipodal Fermi Antenna with Corrugation," In 2004 International Symposium on Antennas and Propagation, Vol. 1, pp. 69-72 (2004).
[21] Tsuboi, K., and Masuya, G., "Direct Numerical Simulations for Instabilities of Premixed Planar Flames," In The Fourth Asia-Pacific Conference on Combustion, Nanjing, China, pp. 136-139 (November 2003).