A Buffering Approach to Manage I/O in a
Normalized Cross-Correlation Earthquake
Detection Code for Large Seismic Datasets
Dawei Mu, Pietro Cicotti, Yifeng Cui, Enjui Lee, Po Chen
Outline
1. Introduction to the cuNCC code
2. Realistic Application
3. Performance Analysis
4. Memory Buffer Approach and I/O Analysis
5. Future Work
1. Introduction to cuNCC
What is cuNCC?
cuNCC is CUDA-based software that calculates the normalized cross-correlation (NCC) coefficient between a collection of selected template waveforms and the continuous waveform recordings of seismic instruments, in order to evaluate waveform similarity and/or relative travel-time differences.
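The NCC coefficient described above can be sketched in a few lines. This is a plain-Python illustration of the sliding-window computation, not the actual cuNCC implementation; the function name and structure are assumptions for illustration:

```python
import math

def ncc(template, data):
    """Slide a template over continuous data and return the normalized
    cross-correlation coefficient (in [-1, 1]) at each lag."""
    n = len(template)
    t_mean = sum(template) / n
    t_dev = [t - t_mean for t in template]
    t_norm = math.sqrt(sum(d * d for d in t_dev))
    out = []
    for lag in range(len(data) - n + 1):
        window = data[lag:lag + n]
        w_mean = sum(window) / n
        w_dev = [w - w_mean for w in window]
        w_norm = math.sqrt(sum(d * d for d in w_dev))
        num = sum(a * b for a, b in zip(t_dev, w_dev))
        out.append(num / (t_norm * w_norm) if t_norm and w_norm else 0.0)
    return out
```

A lag where the data reproduces the template exactly yields a coefficient of 1.0, which is why peaks in this vector flag candidate event detections.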
Feb 05, 2016 M6.6 Meinong aftershock detection
• more uncatalogued aftershocks were detected
• contributed to earthquake location determination and earthquake source parameter estimation
2. Realistic Application
M6.6 Meinong aftershock hypocenter re-location
• The traditional method uses a short-term/long-term average to detect events and a 1-D model to locate hypocenters; fewer aftershocks are detected, and the result carries less information due to inaccuracy.
• 3-D waveform template matching detects events and uses a 3-D model and waveform travel-time differences to re-locate the hypocenters; the result shows more events and more tightly clustered hypocenters, which reveal detailed fault geometry.
• Over 4 trillion NCC calculations were involved.
3. Performance Analysis
Optimization scheme
• cuNCC is bound by memory bandwidth
• Constant memory is used to stack multiple events into a single computational kernel
• Shared memory is used to improve memory-bandwidth utilization
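The "stack multiple events into a single kernel" idea can be illustrated on the CPU side: each window of continuous data is loaded once and reused against every template, instead of re-reading the data once per template. This is a simplified sketch (plain dot products rather than full NCC, and the function name is an assumption), not the cuNCC kernel itself:

```python
def batched_correlate(templates, data):
    """Correlate several equal-length templates against the data in a
    single pass: each data window is read once and reused for every
    template (the CPU analogue of stacking events into one GPU kernel)."""
    n = len(templates[0])
    out = [[] for _ in templates]
    for lag in range(len(data) - n + 1):
        window = data[lag:lag + n]          # load the window once
        for i, t in enumerate(templates):   # reuse it for all templates
            out[i].append(sum(a * b for a, b in zip(t, window)))
    return out
```

On the GPU, keeping the templates in constant memory and staging the data windows in shared memory serves the same purpose: each byte fetched from global memory is amortized over many correlations.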
Profiler analysis:
1. Compute, Bandwidth, or Latency Bound
The first step in analyzing an individual kernel is to determine if the performance of the kernel is bounded by compute, bandwidth, or instruction/memory latency. The results below indicate that the performance of kernel "cuNCC_04" is limited by memory bandwidth. You should first examine the information in the "Memory Bandwidth" section to determine what is limiting performance.
1.1. Kernel Performance Is Bound By Memory Bandwidth
For device "GeForce GTX 980" the kernel's compute utilization is significantly lower than its memory utilization. These utilization levels indicate that the performance of the kernel is most likely being limited by the memory system. For this kernel the limiting factor in the memory system is the bandwidth of the Shared memory.
3. Performance Analysis
Performance benchmark
This cuNCC code achieves high performance without high-end hardware or expensive clusters: an optimized CPU-based NCC code needs 21 hours on one E7-8867 CPU (all 18 cores) to finish the example above, while an NVIDIA GTX980 takes only 53 minutes.
| Hardware           | Runtime (ms) | SP FLOP (×1e11) | Achieved GFLOPS | Max GFLOPS | Achieved Percentage | Speedup |
|--------------------|--------------|-----------------|-----------------|------------|---------------------|---------|
| E7-8867 (18 cores) | 2968         | 1.23            | 41.36           | 237.6      | 17.4%               | 1.0x    |
| C2075 (Fermi)      | 495          | 1.8             | 363.83          | 1030       | 35.3%               | 6.0x    |
| GTX980 (Maxwell)   | 116          | 1.8             | 1552.80         | 5000       | 31.0%               | 25.6x   |
| M40 (Maxwell)      | 115          | 1.8             | 1569.86         | 7000       | 22.4%               | 25.8x   |
| P100 (Pascal)      | 62           | 1.8             | 2911.84         | 10600      | 27.5%               | 47.9x   |
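The derived columns of the table follow directly from the runtime and FLOP counts; a quick check (small deviations come from the table's rounded FLOP values):

```python
def gflops(flop, runtime_ms):
    """Achieved GFLOPS = floating-point operations / runtime in seconds / 1e9."""
    return flop / (runtime_ms / 1000.0) / 1e9

def speedup(cpu_ms, device_ms):
    """Speedup relative to the CPU baseline runtime."""
    return cpu_ms / device_ms
```

For example, gflops(1.23e11, 2968) reproduces the E7-8867 row's ~41 GFLOPS, and speedup(2968, 62) reproduces the P100's ~47.9x.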
4. Memory Buffer Approach and I/O Analysis
The I/O bottleneck
After the computational performance was improved with GPU acceleration, I/O efficiency became the new bottleneck of cuNCC's overall performance.
The output file of cuNCC is a 1-D vector of similarity coefficients saved in binary format, whose size is equal to that of the seismic data file.
I/O operations cost roughly 10% of the total runtime for the CPU NCC code, but more than 75% for the GPU code.
[Chart: runtime breakdown (Compute vs. I/O, 0-500 ms scale) for NCC (CPU) and cuNCC]
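The jump from ~10% to >75% I/O is what Amdahl's law predicts when compute shrinks while I/O stays fixed. The numbers below are illustrative, not the measured ones: if compute drops by roughly the observed 25x speedup and the output volume is unchanged, I/O comes to dominate:

```python
def io_fraction(compute_s, io_s):
    """Fraction of total runtime spent in I/O."""
    return io_s / (compute_s + io_s)

# Hypothetical split: 450 s compute + 50 s I/O on the CPU (10% I/O).
# Shrink compute 25x with the GPU and the same 50 s of I/O now dominates.
cpu_frac = io_fraction(450, 50)       # ~0.10
gpu_frac = io_fraction(450 / 25, 50)  # ~0.74
```

This is why the rest of the section focuses on the output path rather than on further kernel optimization.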
4. Memory Buffer Approach and I/O Analysis
Test environment
The SGI UV300 system has 8 sockets of 18-core Intel Xeon E7-8867 v4 processors, each socket with 16 DDR4 32 GB DRAM modules running at 1600 MHz.
4 TB of DRAM in a Non-Uniform Memory Access (NUMA) configuration via SGI's NUMALink technology.
4x PCIe Intel flash cards, 8 TB in total, configured as a RAID 0 device and mounted as "/scratch", with a 975.22 MB/s achieved I/O bandwidth (measured with IOR).
2x 400 GB Intel SSDs configured as a RAID 1 device and mounted as "/home", with a 370.07 MB/s achieved I/O bandwidth.
The software used was GCC-4.4.7 and CUDA-7.5, along with the MPI package MPICH-3.2.
4. Memory Buffer Approach and I/O Analysis
Use CPU memory as a buffer
• Most GPU-enabled computers have more CPU memory than GPU memory (in our case 48 GB << 4 TB).
• Fixed data chunk size (120 days of data) with different total workloads.
• On the "/scratch" partition, for every data size, the buffering technique costs more overall runtime than no buffering.
• On the "/home" partition, buffering starts to help after the total workload reaches 2400 days.
• On a high-I/O-bandwidth filesystem, the improvement brought by buffering cannot offset the overhead of the extra memory transfer.
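A minimal sketch of the two output paths being compared, assuming the similarity coefficients arrive as binary chunks (the function names and chunked interface are assumptions for illustration):

```python
import os
import tempfile

def write_direct(path, chunks):
    """No buffering: one write (and its disk-access cost) per chunk."""
    with open(path, "wb") as f:
        for c in chunks:
            f.write(c)

def write_buffered(path, chunks):
    """Buffering: join all chunks in CPU memory first, then write once.
    Pays an extra memory-copy cost to reduce disk-access frequency."""
    buf = b"".join(chunks)
    with open(path, "wb") as f:
        f.write(buf)
```

The trade-off matches the measurements above: on a fast filesystem like "/scratch" the extra copy outweighs the saved disk accesses, while on the slower "/home" it pays off once the workload is large enough.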
4. Memory Buffer Approach and I/O Analysis
Use a shared-memory virtual filesystem as a buffer
• We set up 2 TB of DRAM as a shared-memory virtual filesystem and measured an achieved I/O bandwidth of 2228.05 MB/s.
• On the "/dev/shm" partition, the high bandwidth of shared memory improves performance greatly by reducing the time used for output.
• We gathered the runtime results without the buffering scheme from all three storage partitions; the shared-memory partition obtains the best performance.
4. Memory Buffer Approach and I/O Analysis
I/O test conclusions
• For machines that support a shared-memory virtual filesystem, we recommend using shared memory as the buffer for cuNCC output, especially when the similarity coefficients are intermediate results for a subsequent computation.
• For machines without shared memory but with a high-bandwidth I/O device, we recommend writing the results directly to storage without the buffering scheme.
• For machines without shared memory and with a low-bandwidth I/O device, consider using CPU memory as a buffer to reduce disk-access frequency.
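The three recommendations above amount to a simple decision rule. A sketch of that rule, where the bandwidth threshold separating "fast" from "slow" storage is an assumption (chosen between the measured /home and /scratch bandwidths), not a value given on the slide:

```python
def choose_output_strategy(has_shm, io_bandwidth_mbps, fast_io_threshold=900.0):
    """Map the slide's three recommendations onto machine properties.
    fast_io_threshold (MB/s) is a hypothetical cutoff for illustration."""
    if has_shm:
        return "buffer in shared-memory filesystem (/dev/shm)"
    if io_bandwidth_mbps >= fast_io_threshold:
        return "write directly to storage, no buffering"
    return "buffer in CPU memory to reduce disk accesses"
```

For the test machine this picks /dev/shm; without shared memory it would pick direct output for the /scratch-class device and CPU-memory buffering for the /home-class device.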
5. Future Work
• Further optimize the cuNCC code on the Pascal GPU platform.
• Integrate cuNCC with the "SEISM-IO" library, whose interface allows users to switch among "MPI-IO", "PHDF5", "NETCDF4", and "ADIOS" as low-level I/O libraries.
Thank you for your time!
