Big Data solution for CTBT monitoring: CEA-IDC joint global cross correlation project

15 May 2014 | CEA

Presenters
Dmitry Bobrov (1), Randy Bell (1), Nicolas Brachet (2), Pierre Gaillard (2), Jocelyn Guilbert (2), Ivan Kitov (3), Mikhail Rozhkov (1)
(1) International Data Centre, CTBTO
(2) Commissariat à l'Énergie Atomique
(3) Institute for Dynamics of Geospheres

Scientia potestas est
Informatio potestas est

Cross-correlation
Scientia potestas est
Informatio potestas est
Tremendous seismic data growth dictates:

Repeating seismicity: the IDC view
Dozens to hundreds of events from the same Earth cell.
But how can we populate aseismic areas with quality master events?

IMS seismic network
Blue circles – primary arrays, blue triangles – primary 3-C stations.
Yellow circles – auxiliary arrays, yellow triangles – auxiliary 3-C stations.
Red stars – underground nuclear explosions.
The primary network includes 25 arrays.

Global Cross Correlation Grid +
Aftershock Sequence Processing
What is the Grid?
• The Grid is a set of loci of hypothetical master events.
• A master is a set of waveform templates linking an array station and the locus.
• Spacing between masters is ~140 km.
• P-wave templates from three to ten IMS primary arrays per master.
• At least three IMS stations are needed to create an REB event.
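
As a rough illustration of the grid idea above, the sketch below lays out candidate master loci on a sphere at roughly 140 km spacing. The latitude-band layout, the function name `build_grid`, and the Earth radius value are assumptions made for illustration; the slides do not say how the actual grid was constructed.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed value)

def build_grid(spacing_km=140.0):
    """Return (lat, lon) loci in degrees, roughly spacing_km apart."""
    step_deg = math.degrees(spacing_km / EARTH_RADIUS_KM)  # latitude step
    n_bands = int(round(180.0 / step_deg))
    loci = []
    for i in range(n_bands):
        # Place each band at its centre latitude, so the poles are never hit.
        lat = -90.0 + (i + 0.5) * 180.0 / n_bands
        # The parallel's circumference shrinks with cos(latitude),
        # so bands near the poles get fewer nodes.
        circumference = 2.0 * math.pi * EARTH_RADIUS_KM * math.cos(math.radians(lat))
        n_lon = max(1, int(round(circumference / spacing_km)))
        for j in range(n_lon):
            loci.append((lat, -180.0 + j * 360.0 / n_lon))
    return loci

grid = build_grid()
print(f"{len(grid)} candidate master loci")
```

At this spacing the sketch yields a few tens of thousands of loci, which gives a feel for how many masters a global grid would need.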

Templates needed:
Real waveforms – for seismic areas
Grand masters – for adjacent territories
Synthetic waveforms – for aseismic areas

Building Masters:
The IDC database comprises hundreds of thousands of seismic events. Building a comprehensive master event database would require:
1. Cross-correlating every event with every other event (low-cost effort).
2. Cross-correlating each event with the 10-year event history of the IDC database (extremely high-cost effort).
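
To make the cost gap between the two items concrete, here is a back-of-the-envelope count. It assumes 450,000 events (the REB size quoted later in this deck), reads item 2 as sliding every template over ten years of continuous data on a single channel, and assumes a 40 Hz sampling rate. The reading of item 2, the sampling rate, and the function names are all assumptions, not figures from the slides.

```python
def event_pairs(n_events):
    """Number of distinct event-to-event correlations."""
    return n_events * (n_events - 1) // 2

def sliding_positions(n_templates, years, sample_rate_hz):
    """Template-vs-stream lag positions, one per sample of continuous data."""
    samples = int(years * 365.25 * 86400 * sample_rate_hz)
    return n_templates * samples

n = 450_000                      # REB event count quoted in this deck
pairs = event_pairs(n)
scans = sliding_positions(n, years=10, sample_rate_hz=40)  # 40 Hz is assumed

print(f"event pairs: {pairs:.3e}")
print(f"stream lags: {scans:.3e}")
print(f"ratio:       {scans / pairs:.0f}x")
```

Under these assumptions the continuous scan is several orders of magnitude more work than the event-by-event pass, and that is before multiplying by the number of stations and channels.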

Template dimensionality reduction is crucial
• A repeating seismicity map showed that one point on the grid may correspond to dozens or even hundreds of templates. An effective dimensionality reduction technique must be applied to the clusters of such events to pick a limited number of master events for each cluster.
• These techniques must also be applied to the sets of synthetic events generated for the aseismic areas.
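
One simple way to pick a representative master from a cluster, sketched below, is to choose the template with the highest mean zero-lag correlation with all the others (a medoid). This is an illustrative stand-in: the slides do not name a specific reduction technique here. The function name `pick_master`, the synthetic wavelet, and the noise levels are hypothetical, and templates are assumed to be aligned and of equal length.

```python
import numpy as np

def pick_master(templates):
    """Return (index, scores) of the template best correlated with the rest.

    templates: 2-D array, one aligned, equal-length waveform per row.
    """
    x = np.asarray(templates, dtype=float)
    x = x - x.mean(axis=1, keepdims=True)             # remove each row's mean
    x = x / np.linalg.norm(x, axis=1, keepdims=True)  # unit energy per row
    cc = x @ x.T                                      # zero-lag correlation coefficients
    np.fill_diagonal(cc, 0.0)                         # ignore self-correlation
    mean_cc = cc.sum(axis=1) / (len(x) - 1)
    return int(np.argmax(mean_cc)), mean_cc

# Synthetic cluster: one wavelet observed under different noise levels.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 1000)
base = np.sin(2 * np.pi * 5 * t) * np.exp(-3 * t)
noise_levels = [0.8, 0.05, 0.5, 0.6]
cluster = np.array([base + s * rng.standard_normal(t.size) for s in noise_levels])

best, scores = pick_master(cluster)
print("selected template:", best)
print("mean correlations:", np.round(scores, 2))
```

Keeping several of the top-scoring templates per cluster, rather than one, would be a natural extension when a single master does not represent the whole cluster well.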

BIG DATA
Solution
needed

What is Big Data
Data is everything
Data centers (IDC, NDCs) collect, process, analyze and produce data 24 hours a day, 7 days a week
Data is the cornerstone: full of information and a source of knowledge
Data sets are:
+ Large and growing: Volume
+ Complex and heterogeneous: Variety
+ Continuous stream and real time: Velocity
+ Sometimes imprecise: Veracity
= Big Data 4V
A (big) technological problem
Intrinsic mismatch between Data and IT (Information Technology):
Data volume increases 100x in 10 years
I/O bandwidth improves ~3x in 10 years
It is difficult to process all the data with traditional applications within a tolerable elapsed time
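
The mismatch between data volume and I/O bandwidth quoted on this slide can be turned into a slowdown factor with simple arithmetic. The sketch below uses only the two growth figures from the slide; the function name is hypothetical, and steady compound growth is assumed for the annualized rates.

```python
def relative_scan_time(volume_growth, bandwidth_growth):
    """Factor by which one full pass over the archive slows down."""
    return volume_growth / bandwidth_growth

slowdown = relative_scan_time(volume_growth=100.0, bandwidth_growth=3.0)
print(f"A full read takes about {slowdown:.0f}x longer after 10 years")

# Equivalent annual growth rates, assuming steady compound growth.
volume_per_year = 100.0 ** (1 / 10)
bandwidth_per_year = 3.0 ** (1 / 10)
print(f"volume: x{volume_per_year:.2f}/year, bandwidth: x{bandwidth_per_year:.2f}/year")
```

In other words, a job that reads the whole archive becomes roughly 33 times slower over a decade unless the processing is distributed.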

DataScale
The question
How can we bring a very practical solution to the challenge raised by the exponential growth of the volume of data to be processed?
DataScale project
Consortium of 9 partners, from large research laboratories (CEA/DAM, IPGP) to SMEs, also including large companies (Bull)
A two-year project, started in September 2013
Supported by the French government: selected and funded by the "Investments for the Future" program
DataScale objective
Design efficient Big Data solutions suited to real use cases

Technological Solutions
High-Performance Computing
HPC already deals with data sets from large-scale simulation of physical phenomena
Enrich / extend HPC solutions with specific Big Data technological building blocks
Building blocks
Efficient data processing (Distributed Mining of Data)
- Distribute, parallelize and deploy the application on the HPC platform
Efficient data management (Mining of Distributed Data)
- Define a hierarchy of data storage (data life cycle, reuse process)
NoSQL Database Management System (DBMS) with data mining technologies
- Handle very large data volumes and different types of data
Figure labels: TGCC, Mka3D

CEA Use Cases
A data-driven project
- Evaluation of the relevance of the technological solutions by implementing demonstrators.
- 3 areas, 4 real-world applications at real scale:

Area | Application | Description
Cluster management (CEA/DSSI) | Monitoring and enhancement of HPC platform | Analysis of HPC log journals with data mining techniques (detection and correlation of failure patterns)
Social Media Monitoring (Linkfluence) | Measuring and reporting daily web activities (companies, user, topic, …) | Analysis of millions of conversations and images (100 countries and 50 languages) through social accounts (e.g. Twitter, Facebook, Google+)
Seismology (IPGP) | Tomography of Europe | Seismic noise correlation of 200 European stations (5 years of records)
Seismology (CEA/DASE) | Event detection | Massive correlation between continuous data stream and event template (Master Event algorithm)
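
The core operation in the event-detection use case, correlating a continuous stream against an event template, can be sketched with a normalized cross-correlation on a single channel. This is a minimal illustration under stated assumptions, not the project's Master Event implementation, which also involves array processing and association across stations. The function names, the 0.7 threshold, and the synthetic data are hypothetical.

```python
import numpy as np

def normalized_cc(stream, template):
    """Normalized cross-correlation of template against every lag of stream."""
    stream = np.asarray(stream, dtype=float)
    tpl = np.asarray(template, dtype=float)
    tpl = tpl - tpl.mean()
    tpl = tpl / np.linalg.norm(tpl)
    # One row per lag; each row is a window as long as the template.
    windows = np.lib.stride_tricks.sliding_window_view(stream, tpl.size)
    windows = windows - windows.mean(axis=1, keepdims=True)
    norms = np.linalg.norm(windows, axis=1)
    norms[norms == 0.0] = np.inf       # flat windows give cc = 0
    return (windows @ tpl) / norms

def detect(stream, template, threshold=0.7):
    """Return the lags where the correlation reaches the threshold, and cc."""
    cc = normalized_cc(stream, template)
    return np.flatnonzero(cc >= threshold), cc

# Synthetic test: bury a scaled copy of the template in noise at sample 300.
rng = np.random.default_rng(1)
k = np.arange(50) / 50.0
template = np.sin(2 * np.pi * 3 * k) * np.hanning(50)
stream = 0.1 * rng.standard_normal(1000)
stream[300:350] += 2.0 * template

lags, cc = detect(stream, template)
print("best lag:", int(np.argmax(cc)), "cc:", round(float(cc.max()), 3))
```

Because the coefficient is normalized, the detector responds to waveform shape rather than amplitude, which is what lets a single master match events of different magnitudes from the same location.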

CEA-PTS Collaboration
A unique data analysis to revise the seismicity:
- of the last 10 years
- at global scale, with a network of seismic stations distributed globally
The IDC high-quality dataset is a natural candidate for an extensive cross-correlation study:
- continuous seismic data from the primary IMS stations since 2000
- 450,000 seismic events in the REB
- tens of millions of raw detections
Collaboration with IDC teams to:
- enhance the Master Event algorithm (use of station 3CP, association, synthetic master event, subspacing)
- test and deploy the application on the secure and powerful HPC infrastructure of the CEA

Roadmap
Date | Phase | Activities
Sep. 2013 | Kick-Off |
Oct. 2013 | Design | Specification: workflow and NoSQL database
Mar. 2014 | Development | NoSQL DBMS (Armadillo); algorithm enhancement; workflow integration
Sep. 2014 | Test | Deployment; run at reduced scale (3 years, regional network); result analysis
Apr. 2015 | Demonstration | Run at full scale (10 years, global network); result analysis
Aug. 2015 | Assessment | Reflection on integrating the new components into the operational chain

DATASCALE Partners
The DataScale project partners are:
ActiveEon
Armadillo
Bull
CEA (DASE)
CEA (LIST)
CEA (DSSI)
INRIA
IPGP
Linkfluence

CONCLUSION
We are:
Facing a BIG challenge.
Preparing a decisive turn toward a new data management infrastructure.
Not alone: surrounded by extremely valuable partners.
A new approach to nuclear monitoring
Thank you for your attention!
