SlideShare a Scribd company logo
Enric Tejedor, Prasanth Kothuri, CERN
CERN´s Platform for
Data Analysis with Spark
#SAISExp1
Outline
1. What we do at CERN
2. Interactive data analysis with Spark
3. Use cases
1. Physics data
2. Controls data
2#SAISExp1
##SAISExp1
Founded in 1954
Mission: fundamental research in Physics
3
##SAISExp1
Alice
The Large Hadron Collider (LHC)
The largest scientific experiment to date
1 PB / second of raw physics data 4
TOTEM
##SAISExp1
Discovery of the Higgs boson
ATLAS & CMS 2012
5
##SAISExp1
Invention of the WWW
Sir Tim Berners-Lee 1989
http://info.cern.ch
6
##SAISExp1 7
>15,000 scientists
>100 nationalities
Physics Data Processing
and Analysis at CERN
#SAISExp1
LHC Data Pipeline at CERN
9#SAISExp1
Data collection Data analysis
Offline processing
Event selection,
stat. treatment…
Reconstruction
Processing,
skimming
Event filtering
Reco
Results
Raw
data
Analysis
formats
CMS
ALICE
LHCb
ATLAS
ROOT
• The data analysis framework for
high-energy physics (HEP)
• Data processing, statistical
analysis, visualisation, I/O and
storage
• ~1 EB stored in ROOT format
– Binary, compressed
– Columnar
10#SAISExp1
https://root.cern
ROOT Dataset
11#SAISExp1
px py pz e
Row:
collision event
ROOT dataset stored in
one or multiple files
Column:
physics entity
Can be nested
structures
(tree-like)
##SAISExp1
The LHC
Computing Grid
Started in 2002
Provide processing power and data
access to physicists
~170 centres in 42 countries
Running 24/7/365
12
Distributed Computing in HEP
• Physicists use Grid and batch resources to process
LHC data in parallel
13#SAISExp1
Initial
Dataset
Merge
Merge job
Data processing
Job Range 1
Job Range 2
Job Range N
…
Final
Result
…
Partial results
HL-LHC: Even More Data!
• Coming upgrade:
High-Luminosity LHC
• 30x more data
collected
• Big challenge for
software and
computing
14#SAISExp1
LHC HL-LHC
CMS estimated disk space required by tier
Source:HSFcommunitywhitepaper
LHC Data Pipeline at CERN
15#SAISExp1
Data collection Data analysis
Offline processing
Event selection,
stat. treatment…
Reconstruction
Processing,
skimming
Event filtering
Reco
Results
Raw
data
Analysis
formats
CMS
ALICE
LHCb
ATLAS
Final Steps of Data Analysis
16#SAISExp1
Data collection Data analysis
Offline processing
Event selection,
stat. treatment…
Reconstruction
Processing,
skimming
Event filtering
Reco
Results
Raw
data
Analysis
formats
CMS
ALICE
LHCb
ATLAS
Interactive Data Analysis
##SAISExp1
The SWAN Service
• SWAN: Service for Web-based Analysis
• Interactive computing platform for scientists
– Based on Jupyter notebooks
• Analysis with only a web browser
• Easy sharing of results
• Integrated with CERN resources
– Storage, software and computing
18#SAISExp1
https://swan.web.cern.ch
SWAN Pillars
19#SAISExp1
Storage
Computing
Software
SWAN Architecture
20#SAISExp1
Web Portal
User Session Scheduler
CERN Resources
EOS/CERNBox
(Data/User files)
CVMFS
(Software)
User 1 User 2 User n...
CERN
Credentials
SWAN Interface: Notebooks
21#SAISExp1
Sharing in SWAN
22#SAISExp1
Integration with Spark
23
Spark Cluster
Spark Master
Spark Executor
Task
User Notebook
Task Task
Spark Driver
Offload computations to
pluggable resources
#SAISExp1
Spark Connector
24
Configure Spark and
connect to cluster
with a click
#SAISExp1
Spark Monitor
• Bridge the gap between
interactive computing and
distributed data
processing
• Automatically appears
when a Spark job is
submitted from a cell
• Progress bars, task
timeline, resource
utilisation
25#SAISExp1
Code here!
Physics Data
Use Case
#SAISExp1
The HEP DataFrame
• RDataFrame
– Implemented in C++, interfaced also to Python
– Tailored for ROOT and HEP
27
df = RDataFrame(dataset)
df2 = df.Filter('x > 0')
.Define('r2', 'x*x + y*y')
h = df2.Histo1D('r2')
g = df2.Graph('x', 'y')
data
x,y
Filter
x>0
Define
r²=x²+y²
Histo1D
r²
Graph
x,y
#SAISExp1
Distributed RDataFrame
• Exploratory work to parallelise RDataFrame computations with
multiple backends
28
d = RDataFrame(dataset)
f = d.Define(...)
.Define(...)
.Filter(...)
h1 = f.Histo1D(...)
h2 = f.Histo2D(...)
g = f.Graph(...)
Local
…
…
…
…
…
CPU
…
…
…
…
CPU
…
…
…
…
CPU
…
…
…
…
CPU
…
…
…
…
CPU
…
…
…
…
CPU
…
…
…
…
CPU
Computation
Graph
#SAISExp1
Spark Cluster
Executor
Spark Backend for RDataFrame
• Map-reduce workflow where every mapper runs the RDataFrame computation
graph on a range of collision events
• Run analysis in C++ with Spark
– Through Python bindings
29
Spark Backend
Mapper
Driver
px py pz eMake ranges
Read ranges
Schedule tasks
Reducer
#SAISExp1
Real Example: TOTEM Analysis
• TOTEM experiment analysis
coded with RDataFrame
• Spark backend
• 4.7 TB dataset on EOS
• Launched from SWAN to a
dedicated Spark cluster
• Get to physics results faster!
30#SAISExp1
Controls Data
Use Case
#SAISExp1
LHC: Huge machine, highly sophisticated
Control and monitoring are crucial
32
Hardware Controls
• Complex control system for monitoring
• 1000s of devices, 100s of properties each
33#SAISExp1
Controls Log Data Growth
34
Now
1.5 TB/day
#SAISExp1
CERN Accelerator Logging Service
• Old system based on SQL databases
– Hard to scale horizontally
– Slow data extraction
• New system (NXCALS)
– Data pumped into HBase and HDFS (Parquet)
– Spark to extract and process data
– SWAN to visualise + analyse
35#SAISExp1
NXCALS Data Analysis in SWAN
• NXCALS rely on
SWAN as their data
analysis platform
• Connection to Spark
clusters
• Access to software
(data science Python
ecosystem)
36#SAISExp1
Summary
#SAISExp1
Challenges
• The increase in physics and controls data volumes
and complexity is pushing software at CERN
– Adoption of Spark and other big data technologies still in its
early stages
• Large codebase developed over decades
– Cannot change overnight
• Changing the mindset of programmers takes time
– Declarative analysis
– Pushing computations to data
38#SAISExp1
Future Directions
• Bridge the gap between data processing needs and
technology evolution
– Complement traditional ways with new strategies
• Combine interactive analysis with easy access to
more processing power
– Higher-level programming models
– Pluggable computing resources
More on CERN and Spark: stay tuned for Luca’s and
Prasanth’s presentations
39#SAISExp1
Backup Slides
#SAISExp1
JupyterLab
• Jupyter is evolving
towards a desktop-
like environment
– Notebook, terminal,
file browser,
editors, …
– Highly
customisable
41#SAISExp1

More Related Content

What's hot

Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
hadooparchbook
 
Adaptive Query Execution: Speeding Up Spark SQL at Runtime
Adaptive Query Execution: Speeding Up Spark SQL at RuntimeAdaptive Query Execution: Speeding Up Spark SQL at Runtime
Adaptive Query Execution: Speeding Up Spark SQL at Runtime
Databricks
 
Building a Versatile Analytics Pipeline on Top of Apache Spark with Mikhail C...
Building a Versatile Analytics Pipeline on Top of Apache Spark with Mikhail C...Building a Versatile Analytics Pipeline on Top of Apache Spark with Mikhail C...
Building a Versatile Analytics Pipeline on Top of Apache Spark with Mikhail C...
Databricks
 
Scaling Apache Spark at Facebook
Scaling Apache Spark at FacebookScaling Apache Spark at Facebook
Scaling Apache Spark at Facebook
Databricks
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark Applications
Spark Summit
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark Meetup
Databricks
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
Vadim Y. Bichutskiy
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
Rahul Jain
 
Apache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudApache Spark At Scale in the Cloud
Apache Spark At Scale in the Cloud
Databricks
 
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
Spark Summit
 
Spark Performance Tuning .pdf
Spark Performance Tuning .pdfSpark Performance Tuning .pdf
Spark Performance Tuning .pdf
Amit Raj
 
Introduction to Structured Streaming
Introduction to Structured StreamingIntroduction to Structured Streaming
Introduction to Structured Streaming
Knoldus Inc.
 
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Spark Summit
 
Apache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationApache Spark Core – Practical Optimization
Apache Spark Core – Practical Optimization
Databricks
 
ksqlDB: A Stream-Relational Database System
ksqlDB: A Stream-Relational Database SystemksqlDB: A Stream-Relational Database System
ksqlDB: A Stream-Relational Database System
confluent
 
Dive into PySpark
Dive into PySparkDive into PySpark
Dive into PySpark
Mateusz Buśkiewicz
 
Understanding and Improving Code Generation
Understanding and Improving Code GenerationUnderstanding and Improving Code Generation
Understanding and Improving Code Generation
Databricks
 
Iceberg: a fast table format for S3
Iceberg: a fast table format for S3Iceberg: a fast table format for S3
Iceberg: a fast table format for S3
DataWorks Summit
 
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a ServiceZeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Databricks
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
Spark Summit
 

What's hot (20)

Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
 
Adaptive Query Execution: Speeding Up Spark SQL at Runtime
Adaptive Query Execution: Speeding Up Spark SQL at RuntimeAdaptive Query Execution: Speeding Up Spark SQL at Runtime
Adaptive Query Execution: Speeding Up Spark SQL at Runtime
 
Building a Versatile Analytics Pipeline on Top of Apache Spark with Mikhail C...
Building a Versatile Analytics Pipeline on Top of Apache Spark with Mikhail C...Building a Versatile Analytics Pipeline on Top of Apache Spark with Mikhail C...
Building a Versatile Analytics Pipeline on Top of Apache Spark with Mikhail C...
 
Scaling Apache Spark at Facebook
Scaling Apache Spark at FacebookScaling Apache Spark at Facebook
Scaling Apache Spark at Facebook
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark Applications
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark Meetup
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Apache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudApache Spark At Scale in the Cloud
Apache Spark At Scale in the Cloud
 
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
 
Spark Performance Tuning .pdf
Spark Performance Tuning .pdfSpark Performance Tuning .pdf
Spark Performance Tuning .pdf
 
Introduction to Structured Streaming
Introduction to Structured StreamingIntroduction to Structured Streaming
Introduction to Structured Streaming
 
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
 
Apache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationApache Spark Core – Practical Optimization
Apache Spark Core – Practical Optimization
 
ksqlDB: A Stream-Relational Database System
ksqlDB: A Stream-Relational Database SystemksqlDB: A Stream-Relational Database System
ksqlDB: A Stream-Relational Database System
 
Dive into PySpark
Dive into PySparkDive into PySpark
Dive into PySpark
 
Understanding and Improving Code Generation
Understanding and Improving Code GenerationUnderstanding and Improving Code Generation
Understanding and Improving Code Generation
 
Iceberg: a fast table format for S3
Iceberg: a fast table format for S3Iceberg: a fast table format for S3
Iceberg: a fast table format for S3
 
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a ServiceZeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
 

Similar to CERN’s Next Generation Data Analysis Platform with Apache Spark with Enric Tejedor

Experience of Running Spark on Kubernetes on OpenStack for High Energy Physic...
Experience of Running Spark on Kubernetes on OpenStack for High Energy Physic...Experience of Running Spark on Kubernetes on OpenStack for High Energy Physic...
Experience of Running Spark on Kubernetes on OpenStack for High Energy Physic...
Databricks
 
How HPC and large-scale data analytics are transforming experimental science
How HPC and large-scale data analytics are transforming experimental scienceHow HPC and large-scale data analytics are transforming experimental science
How HPC and large-scale data analytics are transforming experimental science
inside-BigData.com
 
Deep Learning on Apache Spark at CERN’s Large Hadron Collider with Intel Tech...
Deep Learning on Apache Spark at CERN’s Large Hadron Collider with Intel Tech...Deep Learning on Apache Spark at CERN’s Large Hadron Collider with Intel Tech...
Deep Learning on Apache Spark at CERN’s Large Hadron Collider with Intel Tech...
Databricks
 
NERSC, AI and the Superfacility, Debbie Bard
NERSC, AI and the Superfacility, Debbie BardNERSC, AI and the Superfacility, Debbie Bard
NERSC, AI and the Superfacility, Debbie Bard
PacificResearchPlatform
 
Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...
Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...
Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...
Databricks
 
Large Infrastructure Monitoring At CERN by Matthias Braeger at Big Data Spain...
Large Infrastructure Monitoring At CERN by Matthias Braeger at Big Data Spain...Large Infrastructure Monitoring At CERN by Matthias Braeger at Big Data Spain...
Large Infrastructure Monitoring At CERN by Matthias Braeger at Big Data Spain...
Big Data Spain
 
Using the Open Science Data Cloud for Data Science Research
Using the Open Science Data Cloud for Data Science ResearchUsing the Open Science Data Cloud for Data Science Research
Using the Open Science Data Cloud for Data Science Research
Robert Grossman
 
Big Data Processing Beyond MapReduce by Dr. Flavio Villanustre
Big Data Processing Beyond MapReduce by Dr. Flavio VillanustreBig Data Processing Beyond MapReduce by Dr. Flavio Villanustre
Big Data Processing Beyond MapReduce by Dr. Flavio Villanustre
HPCC Systems
 
DA-JPL-final
DA-JPL-finalDA-JPL-final
DA-JPL-final
Manuel Martin Marquez
 
Stories About Spark, HPC and Barcelona by Jordi Torres
Stories About Spark, HPC and Barcelona by Jordi TorresStories About Spark, HPC and Barcelona by Jordi Torres
Stories About Spark, HPC and Barcelona by Jordi Torres
Spark Summit
 
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
 Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov... Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
Databricks
 
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena EdelsonStreaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson
Spark Summit
 
(R)Evolution in the CERN IT Department: A 5 year perspective on the Agile Inf...
(R)Evolution in the CERN IT Department: A 5 year perspective on the Agile Inf...(R)Evolution in the CERN IT Department: A 5 year perspective on the Agile Inf...
(R)Evolution in the CERN IT Department: A 5 year perspective on the Agile Inf...
Arne Wiebalck
 
The Discovery Cloud: Accelerating Science via Outsourcing and Automation
The Discovery Cloud: Accelerating Science via Outsourcing and AutomationThe Discovery Cloud: Accelerating Science via Outsourcing and Automation
The Discovery Cloud: Accelerating Science via Outsourcing and Automation
Ian Foster
 
Streaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and AkkaStreaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and Akka
Helena Edelson
 
OSMC 2012 | Monitoring at CERN by Christophe Haen
OSMC 2012 | Monitoring at CERN by Christophe HaenOSMC 2012 | Monitoring at CERN by Christophe Haen
OSMC 2012 | Monitoring at CERN by Christophe Haen
NETWAYS
 
The OpenStack Cloud at CERN - OpenStack Nordic
The OpenStack Cloud at CERN - OpenStack NordicThe OpenStack Cloud at CERN - OpenStack Nordic
The OpenStack Cloud at CERN - OpenStack Nordic
Tim Bell
 
NASA Advanced Computing Environment for Science & Engineering
NASA Advanced Computing Environment for Science & EngineeringNASA Advanced Computing Environment for Science & Engineering
NASA Advanced Computing Environment for Science & Engineering
inside-BigData.com
 
C2MON - A highly scalable monitoring platform for Big Data scenarios @CERN by...
C2MON - A highly scalable monitoring platform for Big Data scenarios @CERN by...C2MON - A highly scalable monitoring platform for Big Data scenarios @CERN by...
C2MON - A highly scalable monitoring platform for Big Data scenarios @CERN by...
J On The Beach
 
Using Containers and HPC to Solve the Mysteries of the Universe by Deborah Bard
Using Containers and HPC to Solve the Mysteries of the Universe by Deborah BardUsing Containers and HPC to Solve the Mysteries of the Universe by Deborah Bard
Using Containers and HPC to Solve the Mysteries of the Universe by Deborah Bard
Docker, Inc.
 

Similar to CERN’s Next Generation Data Analysis Platform with Apache Spark with Enric Tejedor (20)

Experience of Running Spark on Kubernetes on OpenStack for High Energy Physic...
Experience of Running Spark on Kubernetes on OpenStack for High Energy Physic...Experience of Running Spark on Kubernetes on OpenStack for High Energy Physic...
Experience of Running Spark on Kubernetes on OpenStack for High Energy Physic...
 
How HPC and large-scale data analytics are transforming experimental science
How HPC and large-scale data analytics are transforming experimental scienceHow HPC and large-scale data analytics are transforming experimental science
How HPC and large-scale data analytics are transforming experimental science
 
Deep Learning on Apache Spark at CERN’s Large Hadron Collider with Intel Tech...
Deep Learning on Apache Spark at CERN’s Large Hadron Collider with Intel Tech...Deep Learning on Apache Spark at CERN’s Large Hadron Collider with Intel Tech...
Deep Learning on Apache Spark at CERN’s Large Hadron Collider with Intel Tech...
 
NERSC, AI and the Superfacility, Debbie Bard
NERSC, AI and the Superfacility, Debbie BardNERSC, AI and the Superfacility, Debbie Bard
NERSC, AI and the Superfacility, Debbie Bard
 
Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...
Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...
Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...
 
Large Infrastructure Monitoring At CERN by Matthias Braeger at Big Data Spain...
Large Infrastructure Monitoring At CERN by Matthias Braeger at Big Data Spain...Large Infrastructure Monitoring At CERN by Matthias Braeger at Big Data Spain...
Large Infrastructure Monitoring At CERN by Matthias Braeger at Big Data Spain...
 
Using the Open Science Data Cloud for Data Science Research
Using the Open Science Data Cloud for Data Science ResearchUsing the Open Science Data Cloud for Data Science Research
Using the Open Science Data Cloud for Data Science Research
 
Big Data Processing Beyond MapReduce by Dr. Flavio Villanustre
Big Data Processing Beyond MapReduce by Dr. Flavio VillanustreBig Data Processing Beyond MapReduce by Dr. Flavio Villanustre
Big Data Processing Beyond MapReduce by Dr. Flavio Villanustre
 
DA-JPL-final
DA-JPL-finalDA-JPL-final
DA-JPL-final
 
Stories About Spark, HPC and Barcelona by Jordi Torres
Stories About Spark, HPC and Barcelona by Jordi TorresStories About Spark, HPC and Barcelona by Jordi Torres
Stories About Spark, HPC and Barcelona by Jordi Torres
 
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
 Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov... Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
 
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena EdelsonStreaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson
 
(R)Evolution in the CERN IT Department: A 5 year perspective on the Agile Inf...
(R)Evolution in the CERN IT Department: A 5 year perspective on the Agile Inf...(R)Evolution in the CERN IT Department: A 5 year perspective on the Agile Inf...
(R)Evolution in the CERN IT Department: A 5 year perspective on the Agile Inf...
 
The Discovery Cloud: Accelerating Science via Outsourcing and Automation
The Discovery Cloud: Accelerating Science via Outsourcing and AutomationThe Discovery Cloud: Accelerating Science via Outsourcing and Automation
The Discovery Cloud: Accelerating Science via Outsourcing and Automation
 
Streaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and AkkaStreaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and Akka
 
OSMC 2012 | Monitoring at CERN by Christophe Haen
OSMC 2012 | Monitoring at CERN by Christophe HaenOSMC 2012 | Monitoring at CERN by Christophe Haen
OSMC 2012 | Monitoring at CERN by Christophe Haen
 
The OpenStack Cloud at CERN - OpenStack Nordic
The OpenStack Cloud at CERN - OpenStack NordicThe OpenStack Cloud at CERN - OpenStack Nordic
The OpenStack Cloud at CERN - OpenStack Nordic
 
NASA Advanced Computing Environment for Science & Engineering
NASA Advanced Computing Environment for Science & EngineeringNASA Advanced Computing Environment for Science & Engineering
NASA Advanced Computing Environment for Science & Engineering
 
C2MON - A highly scalable monitoring platform for Big Data scenarios @CERN by...
C2MON - A highly scalable monitoring platform for Big Data scenarios @CERN by...C2MON - A highly scalable monitoring platform for Big Data scenarios @CERN by...
C2MON - A highly scalable monitoring platform for Big Data scenarios @CERN by...
 
Using Containers and HPC to Solve the Mysteries of the Universe by Deborah Bard
Using Containers and HPC to Solve the Mysteries of the Universe by Deborah BardUsing Containers and HPC to Solve the Mysteries of the Universe by Deborah Bard
Using Containers and HPC to Solve the Mysteries of the Universe by Deborah Bard
 

More from Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
Databricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
Databricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Recently uploaded

一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理
一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理
一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理
nyvan3
 
Experts live - Improving user adoption with AI
Experts live - Improving user adoption with AIExperts live - Improving user adoption with AI
Experts live - Improving user adoption with AI
jitskeb
 
一比一原版南昆士兰大学毕业证如何办理
一比一原版南昆士兰大学毕业证如何办理一比一原版南昆士兰大学毕业证如何办理
一比一原版南昆士兰大学毕业证如何办理
ugydym
 
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
ytypuem
 
Sample Devops SRE Product Companies .pdf
Sample Devops SRE  Product Companies .pdfSample Devops SRE  Product Companies .pdf
Sample Devops SRE Product Companies .pdf
Vineet
 
一比一原版(UofT毕业证)多伦多大学毕业证如何办理
一比一原版(UofT毕业证)多伦多大学毕业证如何办理一比一原版(UofT毕业证)多伦多大学毕业证如何办理
一比一原版(UofT毕业证)多伦多大学毕业证如何办理
exukyp
 
Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024
ElizabethGarrettChri
 
一比一原版(Sheffield毕业证书)谢菲尔德大学毕业证如何办理
一比一原版(Sheffield毕业证书)谢菲尔德大学毕业证如何办理一比一原版(Sheffield毕业证书)谢菲尔德大学毕业证如何办理
一比一原版(Sheffield毕业证书)谢菲尔德大学毕业证如何办理
1tyxnjpia
 
一比一原版悉尼大学毕业证如何办理
一比一原版悉尼大学毕业证如何办理一比一原版悉尼大学毕业证如何办理
一比一原版悉尼大学毕业证如何办理
keesa2
 
Data Scientist Machine Learning Profiles .pdf
Data Scientist Machine Learning  Profiles .pdfData Scientist Machine Learning  Profiles .pdf
Data Scientist Machine Learning Profiles .pdf
Vineet
 
Cell The Unit of Life for NEET Multiple Choice Questions.docx
Cell The Unit of Life for NEET Multiple Choice Questions.docxCell The Unit of Life for NEET Multiple Choice Questions.docx
Cell The Unit of Life for NEET Multiple Choice Questions.docx
vasanthatpuram
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
bmucuha
 
Overview IFM June 2024 Consumer Confidence INDEX Report.pdf
Overview IFM June 2024 Consumer Confidence INDEX Report.pdfOverview IFM June 2024 Consumer Confidence INDEX Report.pdf
Overview IFM June 2024 Consumer Confidence INDEX Report.pdf
nhutnguyen355078
 
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
ihavuls
 
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
Vietnam Cotton & Spinning Association
 
How To Control IO Usage using Resource Manager
How To Control IO Usage using Resource ManagerHow To Control IO Usage using Resource Manager
How To Control IO Usage using Resource Manager
Alireza Kamrani
 
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docxDATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
SaffaIbrahim1
 
Build applications with generative AI on Google Cloud
Build applications with generative AI on Google CloudBuild applications with generative AI on Google Cloud
Build applications with generative AI on Google Cloud
Márton Kodok
 
一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理
一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理
一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理
hqfek
 
Digital Marketing Performance Marketing Sample .pdf
Digital Marketing Performance Marketing  Sample .pdfDigital Marketing Performance Marketing  Sample .pdf
Digital Marketing Performance Marketing Sample .pdf
Vineet
 

Recently uploaded (20)

一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理
一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理
一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理
 
Experts live - Improving user adoption with AI
Experts live - Improving user adoption with AIExperts live - Improving user adoption with AI
Experts live - Improving user adoption with AI
 
一比一原版南昆士兰大学毕业证如何办理
一比一原版南昆士兰大学毕业证如何办理一比一原版南昆士兰大学毕业证如何办理
一比一原版南昆士兰大学毕业证如何办理
 
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
 
Sample Devops SRE Product Companies .pdf
Sample Devops SRE  Product Companies .pdfSample Devops SRE  Product Companies .pdf
Sample Devops SRE Product Companies .pdf
 
一比一原版(UofT毕业证)多伦多大学毕业证如何办理
一比一原版(UofT毕业证)多伦多大学毕业证如何办理一比一原版(UofT毕业证)多伦多大学毕业证如何办理
一比一原版(UofT毕业证)多伦多大学毕业证如何办理
 
Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024
 
一比一原版(Sheffield毕业证书)谢菲尔德大学毕业证如何办理
一比一原版(Sheffield毕业证书)谢菲尔德大学毕业证如何办理一比一原版(Sheffield毕业证书)谢菲尔德大学毕业证如何办理
一比一原版(Sheffield毕业证书)谢菲尔德大学毕业证如何办理
 
一比一原版悉尼大学毕业证如何办理
一比一原版悉尼大学毕业证如何办理一比一原版悉尼大学毕业证如何办理
一比一原版悉尼大学毕业证如何办理
 
Data Scientist Machine Learning Profiles .pdf
Data Scientist Machine Learning  Profiles .pdfData Scientist Machine Learning  Profiles .pdf
Data Scientist Machine Learning Profiles .pdf
 
Cell The Unit of Life for NEET Multiple Choice Questions.docx
Cell The Unit of Life for NEET Multiple Choice Questions.docxCell The Unit of Life for NEET Multiple Choice Questions.docx
Cell The Unit of Life for NEET Multiple Choice Questions.docx
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
 
Overview IFM June 2024 Consumer Confidence INDEX Report.pdf
Overview IFM June 2024 Consumer Confidence INDEX Report.pdfOverview IFM June 2024 Consumer Confidence INDEX Report.pdf
Overview IFM June 2024 Consumer Confidence INDEX Report.pdf
 
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
 
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
 
How To Control IO Usage using Resource Manager
How To Control IO Usage using Resource ManagerHow To Control IO Usage using Resource Manager
How To Control IO Usage using Resource Manager
 
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docxDATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
 
Build applications with generative AI on Google Cloud
Build applications with generative AI on Google CloudBuild applications with generative AI on Google Cloud
Build applications with generative AI on Google Cloud
 
一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理
一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理
一比一原版爱尔兰都柏林大学毕业证(本硕)ucd学位证书如何办理
 
Digital Marketing Performance Marketing Sample .pdf
Digital Marketing Performance Marketing  Sample .pdfDigital Marketing Performance Marketing  Sample .pdf
Digital Marketing Performance Marketing Sample .pdf
 

CERN’s Next Generation Data Analysis Platform with Apache Spark with Enric Tejedor

  • 1. Enric Tejedor, Prasanth Kothuri, CERN CERN´s Platform for Data Analysis with Spark #SAISExp1
  • 2. Outline 1. What we do at CERN 2. Interactive data analysis with Spark 3. Use cases 1. Physics data 2. Controls data 2#SAISExp1
  • 3. ##SAISExp1 Founded in 1954 Mission: fundamental research in Physics 3
  • 4. ##SAISExp1 Alice The Large Hadron Collider (LHC) The largest scientific experiment to date 1 PB / second of raw physics data 4 TOTEM
  • 5. ##SAISExp1 Discovery of the Higgs boson ATLAS & CMS 2012 5
  • 6. ##SAISExp1 Invention of the WWW Sir Tim Berners-Lee 1989 http://info.cern.ch 6
  • 8. Physics Data Processing and Analysis at CERN #SAISExp1
  • 9. LHC Data Pipeline at CERN 9#SAISExp1 Data collection Data analysis Offline processing Event selection, stat. treatment… Reconstruction Processing, skimming Event filtering Reco Results Raw data Analysis formats CMS ALICE LHCb ATLAS
  • 10. ROOT • The data analysis framework for high-energy physics (HEP) • Data processing, statistical analysis, visualisation, I/O and storage • ~1 EB stored in ROOT format – Binary, compressed – Columnar 10#SAISExp1 https://root.cern
  • 11. ROOT Dataset 11#SAISExp1 px py pz e Row: collision event ROOT dataset stored in one or multiple files Column: physics entity Can be nested structures (tree-like)
  • 12. ##SAISExp1 The LHC Computing Grid Started in 2002 Provide processing power and data access to physicists ~170 centres in 42 countries Running 24/7/365 12
  • 13. Distributed Computing in HEP • Physicists use Grid and batch resources to process LHC data in parallel 13#SAISExp1 Initial Dataset Merge Merge job Data processing Job Range 1 Job Range 2 Job Range N … Final Result … Partial results
  • 14. HL-LHC: Even More Data! • Coming upgrade: High-Luminosity LHC • 30x more data collected • Big challenge for software and computing 14#SAISExp1 LHC HL-LHC CMS estimated disk space required by tier Source:HSFcommunitywhitepaper
  • 15. LHC Data Pipeline at CERN 15#SAISExp1 Data collection Data analysis Offline processing Event selection, stat. treatment… Reconstruction Processing, skimming Event filtering Reco Results Raw data Analysis formats CMS ALICE LHCb ATLAS
  • 16. Final Steps of Data Analysis 16#SAISExp1 Data collection Data analysis Offline processing Event selection, stat. treatment… Reconstruction Processing, skimming Event filtering Reco Results Raw data Analysis formats CMS ALICE LHCb ATLAS
  • 18. The SWAN Service • SWAN: Service for Web-based Analysis • Interactive computing platform for scientists – Based on Jupyter notebooks • Analysis with only a web browser • Easy sharing of results • Integrated with CERN resources – Storage, software and computing 18#SAISExp1 https://swan.web.cern.ch
  • 20. SWAN Architecture 20#SAISExp1 Web Portal User Session Scheduler CERN Resources EOS/CERNBox (Data/User files) CVMFS (Software) User 1 User 2 User n... CERN Credentials
  • 23. Integration with Spark 23 Spark Cluster Spark Master Spark Executor Task User Notebook Task Task Spark Driver Offload computations to pluggable resources #SAISExp1
  • 24. Spark Connector 24 Configure Spark and connect to cluster with a click #SAISExp1
  • 25. Spark Monitor • Bridge the gap between interactive computing and distributed data processing • Automatically appears when a Spark job is submitted from a cell • Progress bars, task timeline, resource utilisation 25#SAISExp1 Code here!
  • 27. The HEP DataFrame • RDataFrame – Implemented in C++, interfaced also to Python – Tailored for ROOT and HEP 27 df = RDataFrame(dataset) df2 = df.Filter('x > 0') .Define('r2', 'x*x + y*y') h = df2.Histo1D('r2') g = df2.Graph('x', 'y') data x,y Filter x>0 Define r²=x²+y² Histo1D r² Graph x,y #SAISExp1
  • 28. Distributed RDataFrame • Exploratory work to parallelise RDataFrame computations with multiple backends 28 d = RDataFrame(dataset) f = d.Define(...) .Define(...) .Filter(...) h1 = f.Histo1D(...) h2 = f.Histo2D(...) g = f.Graph(...) Local … … … … … CPU … … … … CPU … … … … CPU … … … … CPU … … … … CPU … … … … CPU … … … … CPU Computation Graph #SAISExp1
  • 29. Spark Cluster Executor Spark Backend for RDataFrame • Map-reduce workflow where every mapper runs the RDataFrame computation graph on a range of collision events • Run analysis in C++ with Spark – Through Python bindings 29 Spark Backend Mapper Driver px py pz eMake ranges Read ranges Schedule tasks Reducer #SAISExp1
  • 30. Real Example: TOTEM Analysis • TOTEM experiment analysis coded with RDataFrame • Spark backend • 4.7 TB dataset on EOS • Launched from SWAN to a dedicated Spark cluster • Get to physics results faster! 30#SAISExp1
  • 32. LHC: Huge machine, highly sophisticated Control and monitoring are crucial 32
  • 33. Hardware Controls • Complex control system for monitoring • 1000s of devices, 100s of properties each 33#SAISExp1
  • 34. Controls Log Data Growth 34 Now 1.5 TB/day #SAISExp1
  • 35. CERN Accelerator Logging Service • Old system based on SQL databases – Hard to scale horizontally – Slow data extraction • New system (NXCALS) – Data pumped into HBase and HDFS (Parquet) – Spark to extract and process data – SWAN to visualise + analyse 35#SAISExp1
  • 36. NXCALS Data Analysis in SWAN • NXCALS rely on SWAN as their data analysis platform • Connection to Spark clusters • Access to software (data science Python ecosystem) 36#SAISExp1
  • 38. Challenges • The increase in physics and controls data volumes and complexity is pushing software at CERN – Adoption of Spark and other big data technologies still in its early stages • Large codebase developed over decades – Cannot change overnight • Changing the mindset of programmers takes time – Declarative analysis – Pushing computations to data 38#SAISExp1
  • 39. Future Directions • Bridge the gap between data processing needs and technology evolution – Complement traditional ways with new strategies • Combine interactive analysis with easy access to more processing power – Higher-level programming models – Pluggable computing resources More on CERN and Spark: stay tuned for Luca’s and Prasanth’s presentations 39#SAISExp1
  • 41. JupyterLab • Jupyter is evolving towards a desktop- like environment – Notebook, terminal, file browser, editors, … – Highly customisable 41#SAISExp1