Deep Learning on Apache Spark at CERN’s Large Hadron Collider with Intel Technologies

Deep Learning on
Apache Spark at CERN
Large Hadron Collider
with Intel Technologies
24th April 2019
Matteo Migliorini, Riccardo
Castellotti, Luca Canali,
Viktor Khristenko, Maria
Girone

International organisation straddling Swiss-French border, founded 1954
Facilities for fundamental research in particle physics
23 members states
1.1 B CHF budget (~1.1B USD)
3’000 members of personnel
15’ 000 associated members from 90 countries
Curiosity: CERN has its own TLD!
http://home.cern

Largest machine in the world
27 km-long Large Hadron Collider (LHC)
Fastest racetrack on Earth
Protons travel at 99.9999991% of the speed of light
Emptiest place in the solar system
Particules circulate in the highest vacuum
Hottest spot in the galaxy
Lead ion collisions create temperatures 100,000x hotter than the hearth of the sun

CMS
ALICE
ATLAS
LHCb
CMS
ATLAS
ALICE
LHCb

CMS
22m long, 15m wide
weighs 14’000 tonnes
Most powerful superconducting solenoid ever built
182 institutes from 42 countries

This can generate up to a petabyte
of data per second.
Filtering the data in real time, selecting potentially interesting events (trigger).
PB/s
40 million
collisions
per
second
100,000
selections
per
second
TB/s
1,000
selections
per
second
GB/s

The CERN Data Centre in Numbers
15 000 Servers 280 000 Cores
280 PB Hot
Storage
350 PB on tape
(+88PB from
LHC in 2018)
50 000 km Fiber
Optics

Worldwide LHC Computing Grid
Tier-0 (CERN):
•Data recording
•Initial data reconstruction
•Data distribution
Tier-1 (14 centres):
•Permanent storage
•Re-processing
•Analysis
Tier-2 (72
Federations, ~150
centres):
• Simulation
• End-user analysis
•900,000 cores
•1 EB

Hadoop and Spark Clusters at CERN
• Clusters: YARN/Hadoop and Spark on Kubernetes
• Hardware: Intel based servers, continuous refresh and capacity
expansion
Accelerator logging
(part of LHC
infrastructure)
30 nodes
(Cores - 800, Mem - 13 TB, Storage – 7.5 PB)
General Purpose 65 nodes
(Cores – 1.3k, Mem – 20 TB, Storage – 12.5 PB)
Cloud containers 60 nodes (Cores - 240, Mem – 480 GB)
Storage: remote HDFS or EOS (for physics data)

Analytics Pipelines – Use Cases
• Physics:
• Analysis on computing metadata
• Development of new ways to process Physics data, scale-out and ML
• IT:
• Analytics on IT monitoring data
• Computer security
• BE (Accelerators):
• NXCALS – next generation accelerator logging platform
• Industrial controls data and analytics

Analytics Platform Outlook
HEP software
Experiments storage
HDFS
Personal storage
Integrating with existing
infrastructure:
• Software
• Data

Code
Monitoring
Visualizations
Analytics with SWAN
Text
All the required tools,
software and data
available in a single
window!

Extending Spark to Read Physics Data
• Physics data is stored in EOS system, accessible with
xrootd protocol: extended HDFS APIs
• Stored in ROOT format: developed a Spark Datasource
• Currently: 300 PBs
• ~90 PBs per year of
operation
• https://github.com/cerndb/hadoop-xrootd
• https://github.com/diana-hep/spark-root
JNI
Hadoop
HDFS
APIHadoop-
XRootD
Connector
EOS
Storage
Service XRootD
Client
C++ Java

Deep Learning Pipeline for Physics Data
1
63%
2
36%
3
1%
Particle
Classifier
W + j
QCD
t-t̅

Data challenges in physics
• Proton-proton collisions in LHC happen at 40MHz.
• Hundreds of TB/s of electrical signals that allow physicists to investigate
particle collision events.
• Storage, limited by bandwidth
• Currently, only 1 every 40K events stored to disk (~10 GB/s).
2018: 5 collisions/beam cross
LHC
2026: 400 collisions/beam cross
High Luminosity-LHC

R&D
• Improve the quality of filtering systems
• Reduce false positives rate
• From rule-based algorithms to Deep Learning classifiers
• Advanced analytics at the edge
• Avoid wasting resources in offline computing
• Reduction of operational costs

Engineering Efforts to Enable Effective ML
• From “Hidden Technical Debt in Machine Learning
Systems”, D. Sculley at al. (Google), paper at NIPS 2015

Spark, Analytics Zoo & BigDL
• Apache Spark
• Leading tools and API for data processing at scale
• Analytics Zoo is a platform for unified analytics
and AI on Apache Spark leveraging BigDL /
Tensorflow
• For service developers: integration with infrastructure
(hardware, data access, operations)
• For users: Keras APIs to run user models, integration
with Spark data structures and pipelines
• BigDL is a distributed deep learning framework for
Apache Spark

Data Pipeline
Data
Ingestion
Feature
Preparation
Model
Development
Training
Read physics
data and feature
engineering
Prepare input
for Deep
Learning
network
1. Specify model
topology
2. Tune model
topology on
small dataset
Train the best
model
Leveraging Apache Spark and Analytics Zoo in Python Notebooks

The Dataset
● Software simulators generate events and
calculate the detector response
● Every event is a 801x19 matrix: for every
particle momentum, position, energy, charge
and particle type are given

Data Ingestion
• Read input files (4 TB) from custom format
• Compute physics-motivated features
• Store to parquet format
54 M events
~4TB
750 GBs
Stored on HDFS
Physics data
storage

Features Engineering
• From the 19 features recorded in the
experiment, 14 more are calculated based
on domain specific knowledge: these are
called High Level Features (HLF)
• A sorting metric to create a sequence of
particles to be fed to a sequence based
classifier

Feature Preparation
• All features need to be
converted to a format
consumable by the network
• One Hot Encoding of
categories
• Sort the particles for the
sequence classifier with a UDF
• Executed in PySpark using
Spark SQL and ML

Models Investigated
1. Fully connected feed-forward
DNN with High Level Features
2. DNN with a recursive layer
(based on GRUs)
3. Combination of (1) + (2)
Complexity
Performance

Hyper-Parameter Tuning– DNN
• Once the network topology is chosen, hyper-parameter
tuning is done with scikit-learn+Keras and parallelized
with Spark
*Other names and brands may be claimed as the property of others.

Model Development – DNN
• Model is instantiated with the Keras-
compatible API provided by Analytics Zoo

Model Development – GRU+HLF
A more complex topology for the network

Distributed Training
Instantiate the estimator using Analytics Zoo / BigDL
The actual training is distributed to Spark executors
Storing the model for later use

Performance and Scalability of Analytics Zoo & BigDL
Analytics Zoo & BigDL
scales very well in the
range tested

Results
• Trained models with
Analytics Zoo and BigDL
• Met the expected
accuracy results

Model Serving
• Using Apache Kafka
and Spark
• In FPGA replacing/integrating
current rule-based algorithms
MODEL
Streaming
platform
MODEL
RTL
translation
FPGA To
storage /
further
online
analysis

Summary
• We have successfully developed a Deep Learning pipeline
using Apache Spark and Analytics Zoo on Intel Xeon
servers
• The use case developed addresses the needs for higher
efficiency in event filtering at LHC experiments
• Spark, Python notebooks and Analytics Zoo provide intuitive APIs
for data preparation at scale on existing Hadoop cluster and cloud
• Analytics Zoo & BigDL solve the problems of scaling DL on Spark
clusters running on Intel Xeon servers, and offering familiar APIs
to researchers

Acknowledgements
• CERN Colleagues
• Hadoop, Spark and Streaming service
• CERN Openlab Data analytics project with Intel and CMS Big Data
project
• Open source Analytics Zoo & BigDL teams
• Colleagues from physics, authors of “Topology classification with deep
learning to improve real-time event selection at the LHC”,
https://arxiv.org/abs/1807.00083 - for discussions and sharing data
*Other names and brands may be claimed as the property of others.

Thank you
References:
Analytics Zoo: https://github.com/intel-analytics/analytics-zoo
BigDL: software.intel.com/bigdl
Code at: https://github.com/cerndb/SparkDLTrigger

Deep Learning on Apache Spark at CERN’s Large Hadron Collider with Intel Technologies

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Deep Learning on Apache Spark at CERN’s Large Hadron Collider with Intel Technologies

Similar to Deep Learning on Apache Spark at CERN’s Large Hadron Collider with Intel Technologies (20)

More from Databricks

More from Databricks (20)

Recently uploaded

Recently uploaded (20)

Deep Learning on Apache Spark at CERN’s Large Hadron Collider with Intel Technologies