A Buffering Approach to Manage I/O in a Normalized Cross-Correlation Earthquake Detection Code for Large Seismic Datasets

CUDA based software designed to calculate the normalized cross- correlation coefficient (NCC) between a collection of selected template waveforms and the continuous waveform recordings of seismic instruments to evaluate the waveform similarity among the waveforms and/or the relative travel-time differences.

  1. A Buffering Approach to Manage I/O in a Normalized Cross-Correlation Earthquake Detection Code for Large Seismic Datasets
     Dawei Mu, Pietro Cicotti, Yifeng Cui, Enjui Lee, Po Chen
  2. Outline
     1. Introduction of the cuNCC code
     2. Realistic Application
     3. Performance Analysis
     4. Memory Buffer Approach and I/O Analysis
     5. Future Work
  3. 1. Introduction of cuNCC
     What is cuNCC? CUDA-based software designed to calculate the normalized cross-correlation coefficient (NCC) between a collection of selected template waveforms and the continuous waveform recordings of seismic instruments, in order to evaluate the waveform similarity among the waveforms and/or the relative travel-time differences.
     Feb 05, 2016 M6.6 Meinong aftershock detection:
     • more uncatalogued aftershocks were detected
     • contributed to earthquake location detection and earthquake source parameter estimation
  4. 2. Realistic Application: M6.6 Meinong aftershock hypocenter re-location
     • The traditional method uses short-term/long-term averages (STA/LTA) to detect events and a 1-D model to locate hypocenters; fewer aftershocks are detected, and the result contains less information due to inaccuracy.
     • 3-D waveform template matching detects events and uses a 3-D model and waveform travel-time differences to re-locate the hypocenters; the result shows more events and more clustered hypocenters, which give us detailed fault geometry.
     • Over 4 trillion NCC calculations were involved.
  5. 3. Performance Analysis: optimization scheme
     • cuNCC is bounded by memory bandwidth.
     • Constant memory is used to stack multiple events into a single computational kernel.
     • Shared memory is used to improve memory bandwidth utilization.
     Profiler guided analysis:
     1. Compute, Bandwidth, or Latency Bound. The first step in analyzing an individual kernel is to determine if the performance of the kernel is bounded by compute, bandwidth, or instruction/memory latency. The results below indicate that the performance of kernel "cuNCC_04" is limited by memory bandwidth. You should first examine the information in the "Memory Bandwidth" section to determine what is limiting performance.
     1.1. Kernel Performance Is Bound By Memory Bandwidth. For device "GeForce GTX 980" the kernel's compute utilization is significantly lower than its memory utilization. These utilization levels indicate that the performance of the kernel is most likely being limited by the memory system. For this kernel the limiting factor in the memory system is the bandwidth of the shared memory.
  6. 3. Performance Analysis: performance benchmark
     The cuNCC code achieves high performance without high-end hardware or expensive clusters: an optimized CPU-based NCC code needs 21 hours on one E7-8867 CPU (all 18 cores) to finish the example above, while an NVIDIA GTX980 takes only 53 minutes.

     Hardware            Runtime (ms)  SP FLOP (×1e11)  Achieved GFLOPS  Max GFLOPS  Achieved %  Speedup
     E7-8867 (18 cores)  2968          1.23             41.36            237.6       17.4%       1.0x
     C2075 (Fermi)       495           1.8              363.83           1030        35.3%       6.0x
     GTX980 (Maxwell)    116           1.8              1552.80          5000        31.0%       25.6x
     M40 (Maxwell)       115           1.8              1569.86          7000        22.4%       25.8x
     P100 (Pascal)       62            1.8              2911.84          10600       27.5%       47.9x
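The "Achieved GFLOPS" and "Speedup" columns follow directly from the runtime and FLOP-count columns; a quick sketch of that arithmetic (helper names are ours, not from the slides):

```python
def achieved_gflops(sp_flop, runtime_ms):
    """Achieved GFLOPS = floating-point operations / runtime (s) / 1e9."""
    return sp_flop / (runtime_ms / 1000.0) / 1e9

def speedup(baseline_runtime_ms, runtime_ms):
    """Speedup relative to the 18-core E7-8867 CPU baseline (2968 ms)."""
    return baseline_runtime_ms / runtime_ms
```

For example, the GTX980 row gives 1.8e11 FLOP / 0.116 s ≈ 1552 GFLOPS and 2968 / 116 ≈ 25.6x, matching the table.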
  7. 4. Memory Buffer Approach and I/O Analysis: the I/O bottleneck
     After improving the computational performance with GPU acceleration, I/O efficiency became the new bottleneck of cuNCC's overall performance. The output file of cuNCC is a 1-D vector of similarity coefficients saved in binary format, whose size is equal to that of the seismic data file. I/O operations cost roughly 10% of the total runtime of the CPU NCC code, but more than 75% of the total runtime of the GPU code.
     [Chart: runtime breakdown (I/O vs. compute) for NCC(CPU) and cuNCC, 0-500 ms scale]
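The 10% vs. 75% figures come from timing the compute and output phases separately; a minimal harness for that kind of measurement might look like this (function names are placeholders, not cuNCC's API):

```python
import time

def phase_fractions(compute_fn, write_fn):
    """Time the compute phase and the output phase separately and return
    each phase's share of the total runtime."""
    t0 = time.perf_counter()
    result = compute_fn()          # NCC computation (CPU or GPU)
    t1 = time.perf_counter()
    write_fn(result)               # binary output of the coefficient vector
    t2 = time.perf_counter()
    compute_t, io_t = t1 - t0, t2 - t1
    total = compute_t + io_t
    return compute_t / total, io_t / total
```

When GPU acceleration shrinks the compute phase by an order of magnitude while the write phase stays fixed, the I/O fraction necessarily dominates, which is the bottleneck the chart illustrates.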
  8. 4. Memory Buffer Approach and I/O Analysis: test environment
     The SGI UV300 system has 8 sockets with 18-core Intel Xeon E7-8867 V4 processors and 16 32 GB DDR4 DIMMs per socket running at 1600 MHz, for 4 TB of DRAM in Non-Uniform Memory Access (NUMA) via SGI's NUMALink technology. 4x PCIe Intel flash cards (8 TB total) are configured as a RAID 0 device and mounted as "/scratch", with 975.22 MB/s achieved I/O bandwidth (measured with IOR). 2x 400 GB Intel SSDs are configured as a RAID 1 device and mounted as "/home", with 370.07 MB/s achieved I/O bandwidth. The software used was GCC-4.4.7 and CUDA-7.5, along with the MPI package MPICH-3.2.
  9. 4. Memory Buffer Approach and I/O Analysis: use CPU memory as a buffer
     • Most GPU-enabled computers have more CPU memory than GPU memory (in our case 48 GB << 4 TB).
     • Fixed data chunk size (120 days' worth) with different total workloads.
     • On the "/scratch" partition, for every data size, the buffering technique costs more overall runtime than no buffering.
     • On the "/home" partition, buffering starts to help after reaching the 2400-day total workload.
     • On the high-I/O-bandwidth filesystem, the improvement brought by buffering cannot cover the overhead of the memory transfer.
  10. 4. Memory Buffer Approach and I/O Analysis: use a shared-memory virtual filesystem as a buffer
     • We set up 2 TB of DRAM as a shared-memory virtual filesystem and measured an achieved I/O bandwidth of 2228.05 MB/s.
     • On the "/dev/shm" partition, the high bandwidth of shared memory improves performance greatly by reducing the time used for output.
     • We gathered the runtime results without the buffering scheme from all three storage partitions, and the shared-memory partition obtains the best performance.
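Using a tmpfs partition such as "/dev/shm" as the output target needs no special API, only a path choice; a hedged sketch of selecting it when present (default paths here are illustrative, not cuNCC options):

```python
import os

def staging_dir(shm_dir="/dev/shm", fallback="/tmp"):
    """Pick a tmpfs-backed directory (e.g. /dev/shm on Linux) for
    intermediate output when it exists and is writable, else fall back
    to ordinary scratch space."""
    if os.path.isdir(shm_dir) and os.access(shm_dir, os.W_OK):
        return shm_dir
    return fallback
```

Because tmpfs writes go to DRAM, output lands at memory bandwidth (2228.05 MB/s measured here) rather than storage bandwidth, at the cost of consuming DRAM and losing the data on reboot.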
  11. 4. Memory Buffer Approach and I/O Analysis: I/O test conclusions
     • For machines that support a shared-memory virtual filesystem, we recommend using shared memory as the buffer for cuNCC output, especially when the similarity coefficients are an intermediate result for a following computation.
     • For machines without shared memory but with a high-bandwidth I/O device, we recommend writing the result directly to storage without the buffering scheme.
     • For machines without shared memory and with a low-bandwidth I/O device, we should consider using CPU memory as a buffer to reduce disk access frequency.
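The three recommendations above amount to a simple decision rule; a sketch, where the 900 MB/s cutoff is our illustrative threshold between this system's fast "/scratch" (975.22 MB/s) and slow "/home" (370.07 MB/s), not a value from the slides:

```python
def choose_output_strategy(has_shm, io_bandwidth_mbs, high_bw_mbs=900.0):
    """Encode the I/O test conclusions as a decision rule.
    The bandwidth threshold is an assumed, illustrative cutoff."""
    if has_shm:
        return "buffer in shared memory"          # tmpfs available
    if io_bandwidth_mbs >= high_bw_mbs:
        return "write directly to storage"        # fast disk, no buffer
    return "buffer in CPU memory"                 # slow disk, no tmpfs
```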
  12. 5. Future Work
     • Further optimize the cuNCC code on the Pascal GPU platform.
     • Implement cuNCC with the "SEISM-IO" library, whose interface allows the user to switch among "MPI-IO", "PHDF5", "NETCDF4", and "ADIOS" as low-level I/O libraries.
  13. Thank you for your time!
