Big Data from the LHC Commissioning
Practical Lessons from Big Science
Simon/@drsm79
Hello!
Time at places I’ve worked
[Figure: chart over 2002-2013, vertical axis 0-100. Series labels: Bristol University, Cloudant; Python, Perl, Bash, C++, Java, Javascript, Fortran]
The formula
G*E
The formula
Fixed

G* E

Fixed

Usually fixed
The formula
Grant * Effectiveness
The life of LHC data
1. Detected by experiment
2. “Online” filtering (hardware and software)
3. Transferred to CERN main campus, archived & reconstructed
4. Transferred to T1 sites, archived, reconstructed & skimmed
5. Transferred to T2 sites, reconstructed, skimmed, filtered & analysed
6. Written into locally analysable files, put on laptops
7. Turned into a plot in a paper
The life of LHC data
1. Detected by experiment
Dig big tunnels
Chain up series of “atom smashers”
Put sensitive cameras in awkward places
Record events
Process data on high end machines
http://www.chilton-computing.org.uk
The life of LHC data
2. “Online” filtering (hardware and software)
CMS online data flow
We have a big digital camera
It takes photos of this

courtesy of James Jackson
which come out like this

courtesy of James Jackson
CMS online data flow
We have a big digital camera

Which goes into lots of
computers (the HLT)
CMS online data flow
We have a big digital camera

Which goes into lots of
computers (the HLT)
Which goes into lots of
disk (the Storage Manager)
CMS data flow
We have a big digital camera
Write to HLT at ~200GB/s
Which goes into lots of computers (the HLT)
Write to Storage Manager at ~2GB/s
Which goes into lots of disk (the Storage Manager)
Write to T0 at ~2GB/s
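The rates on this slide imply a large reduction in data volume between the detector and storage. A minimal sketch of that arithmetic, assuming the ~200GB/s and ~2GB/s figures can be treated as sustained averages; the dictionary keys and the function name are illustrative and not taken from the talk:

```python
# Approximate write rates from the slide, in GB/s.
# Assumption: treated here as sustained averages rather than peaks.
RATES_GB_PER_S = {
    "camera -> HLT": 200.0,
    "HLT -> Storage Manager": 2.0,
    "Storage Manager -> T0": 2.0,
}


def reduction_factor(rate_in: float, rate_out: float) -> float:
    """Return how many times smaller the outgoing stream is than the incoming one."""
    return rate_in / rate_out


hlt_reduction = reduction_factor(
    RATES_GB_PER_S["camera -> HLT"],
    RATES_GB_PER_S["HLT -> Storage Manager"],
)
print(f"By volume, the HLT passes on roughly 1 part in {hlt_reduction:.0f}")
```

Under these assumptions the online filtering step discards about 99% of the incoming data volume before anything reaches disk.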
The life of LHC data
3. Transferred to CERN main campus, archived & reconstructed
10PB of data/year
The life of LHC data
4. Transferred to T1 sites, archived, reconstructed & skimmed
5. Transferred to T2 sites, reconstructed, skimmed, filtered & analysed
1PB/week
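To put the quoted volumes (10PB of data per year, 1PB per week of transfers) on the same scale as the GB/s rates earlier in the talk, they can be converted to average throughput. A rough sketch, assuming decimal units (1 PB = 10^15 bytes) and a perfectly even spread over time, which real transfers would not have; the names are illustrative:

```python
SECONDS_PER_WEEK = 7 * 24 * 3600
SECONDS_PER_YEAR = 365 * 24 * 3600
PB = 1e15  # bytes; assumption: the slides use decimal (SI) petabytes
GB = 1e9   # bytes


def average_rate_gb_per_s(volume_bytes: float, seconds: float) -> float:
    """Average throughput in GB/s needed to move volume_bytes in the given time."""
    return volume_bytes / seconds / GB


weekly_transfer_rate = average_rate_gb_per_s(1 * PB, SECONDS_PER_WEEK)
yearly_archive_rate = average_rate_gb_per_s(10 * PB, SECONDS_PER_YEAR)

print(f"1 PB/week  is about {weekly_transfer_rate:.2f} GB/s on average")
print(f"10 PB/year is about {yearly_archive_rate:.2f} GB/s on average")
```

With these assumptions, 1PB/week works out to roughly 1.65 GB/s sustained, and 10PB/year to roughly 0.32 GB/s.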
Why transfer so much
data?
To process all the data
taken in one year on
one computer would
take ~64,000 years
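The ~64,000 year figure gives a quick feel for how much parallelism is needed. A back-of-the-envelope sketch, assuming the work splits perfectly across identical machines with no transfer or scheduling overhead (an idealisation, not a claim about the real Grid); the function name is hypothetical:

```python
import math

SINGLE_MACHINE_YEARS = 64_000  # from the slide: one year of data on one computer


def machines_needed(target_years: float,
                    single_machine_years: float = SINGLE_MACHINE_YEARS) -> int:
    """Machines required to finish in target_years, assuming perfect parallelism."""
    return math.ceil(single_machine_years / target_years)


keep_pace = machines_needed(1.0)     # process a year of data within a year
twice_as_fast = machines_needed(0.5)  # process it within six months

print(f"To keep pace with data taking: {keep_pace} machines")
print(f"To finish in six months:       {twice_as_fast} machines")
```

Even under this idealised model, keeping pace with data taking needs tens of thousands of machines, which is more than a single site would typically host.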
The life of LHC data
6. Written into locally analysable files, put on laptops
7. Turned into a plot in a paper
Analysis
• Each analysis is ~unique
• Query language is C++
• Runs on distributed system and local resources
• Series of “cut” selections to identify interesting events
• Data in the final plot may be substantially reduced from the original dataset
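The idea of a series of “cut” selections can be sketched as a simple cut flow. The talk says the real query language is C++; Python is used here only for brevity. The event fields, thresholds and cut labels below are invented for illustration and do not come from any real analysis:

```python
import random

random.seed(42)  # fixed seed so the toy run is repeatable

# Toy events: field names and value ranges are made up for this example.
events = [
    {"n_muons": random.randint(0, 3), "pt": random.uniform(0.0, 100.0)}
    for _ in range(10_000)
]

# Each cut is a (label, predicate) pair, applied in sequence.
cuts = [
    ("at least two muons", lambda e: e["n_muons"] >= 2),
    ("pt above 50", lambda e: e["pt"] > 50.0),
]


def cut_flow(events, cuts):
    """Apply cuts in order and record how many events survive each one."""
    counts = [("all events", len(events))]
    selected = events
    for label, passes in cuts:
        selected = [e for e in selected if passes(e)]
        counts.append((label, len(selected)))
    return selected, counts


selected, counts = cut_flow(events, cuts)
for label, n in counts:
    print(f"{label:>20}: {n}")
```

Each cut can only shrink the sample, so the events that reach the final plot may be a small fraction of the starting dataset.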
Workflow ladder
Axis: Number of users
• Large datasets (>100 TB), complex computation
• Large datasets (>100 TB), simple computation
• Shared datasets (>500 GB), complex computation
• Shared datasets (10-500 GB), complex computation
• Shared datasets (10-100 GB), simple computation
• Shared datasets (0.1-10 GB), simple computation
• Private datasets (0.1-10 GB), simple computation
The rungs are bracketed into three groups in the original figure:
• Use Grid compute and storage exclusively
• Work on departmental resources, store resulting datasets to Grid storage
• Work on laptop/desktop machine, store resulting datasets to Grid storage
The life of LHC simulated data
1. Simulated by experimentalists at T0/T1/T2 sites
2. Transferred to T1 sites, archived, possibly reconstructed & skimmed
3. Transferred to T2 sites, reconstructed, skimmed, filtered & analysed
4. Written into locally analysable files, put on laptops
5. Turned into a plot in a paper
Most events get cut
“We are going to die, and that makes us the lucky ones. Most people are never going to die because they are never going to be born.”
- Richard Dawkins
Adoption & Use
Setup
• Maybe a bit different to other people
• Many sites (>100) with >100s of TB of storage and 10,000s of worker nodes
• Global system
• Why not at one site?
• Politics, power budget, cost
The grid
We Have a “Big Data”
Problem
We Have a Big “Data
Problem”
Do what you do best,
outsource the rest
What's interesting is
that big data isn't
interesting any more
NIH (Not Invented Here)
Define and refine
workflows
Our situation
• Expert users, who are not interested in infrastructure
• Will work around things they perceive as unnecessary limitations
Disruptive users
How to engage
disruptive users?
Open access
1PB/week
Open access
Our situation
• Limited resources for integration/testbed style activities
• Strange organisation
Data temperature
There is no such thing
as now
Keep things as local as
possible
Defining monitoring is
difficult
Small files are bad,
m'kay
Compartmentalise
metadata
Recognise, embrace and
communicate failures
People are harder than
computers
People are important
The formula

Consequences
• Automate all the things
• Learn to love a configuration management system
• Make sure everyone in the team knows how to interact with it
• Simple human solutions go a long way
Build good abstractions
Encourage
collaboration
Workflow ladder
Axis: Number of users
• Large datasets (>100 TB), complex computation
• Large datasets (>100 TB), simple computation
• Shared datasets (>500 GB), complex computation
• Shared datasets (10-500 GB), complex computation
• Shared datasets (10-100 GB), simple computation
• Shared datasets (0.1-10 GB), simple computation
• Private datasets (0.1-10 GB), simple computation
The rungs are bracketed into three groups in the original figure:
• Use Grid compute and storage exclusively
• Work on departmental resources, store resulting datasets to Grid storage
• Work on laptop/desktop machine, store resulting datasets to Grid storage
Summary

More from jaxLondonConference

Big data from the LHC commissioning: practical lessons from big science - Simon Metson (Cloudant)

  • 1. Big Data from the LHC Commissioning: Practical Lessons from Big Science. Simon/@drsm79
  • 6. Time at places I’ve worked Bristol University Cloudant
  • 10. The formula Grant * Effectiveness
  • 11. The life of LHC data 1. Detected by experiment 2. “Online” filtering (hardware and software) 3. Transferred to CERN main campus, archived & reconstructed 4. Transferred to T1 sites, archived, reconstructed & skimmed 5. Transferred to T2 sites, reconstructed, skimmed, filtered & analysed 6. Written into locally analysable files, put on laptops 7. Turned into a plot in a paper
  • 14. Chain up series of “atom smashers”
  • 15. Put sensitive cameras in awkward places
  • 16. Record events
  • 17. Process data on high end machines http://www.chilton-computing.org.uk
  • 19. CMS online data flow We have a big digital camera
  • 20. It takes photos of this courtesy of James Jackson
  • 21. which come out like this courtesy of James Jackson
  • 22. CMS online data flow We have a big digital camera Which goes into lots of computers (the HLT)
  • 23. CMS online data flow We have a big digital camera Which goes into lots of computers (the HLT) Which goes into lots of disk (the Storage Manager)
  • 24. CMS data flow • We have a big digital camera: write to HLT at ~200GB/s • Which goes into lots of computers (the HLT): write to Storage Manager at ~2GB/s • Which goes into lots of disk (the Storage Manager): write to T0 at ~2GB/s
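
The rates on the CMS data flow slide imply that the HLT keeps roughly one byte in every hundred it receives. A minimal sketch of that arithmetic follows. The constant and function names are illustrative, and the daily volume assumes continuous running at the quoted rate, which the slides do not claim.

```python
# Rates quoted on the slide, in GB/s.
CAMERA_TO_HLT_GB_S = 200.0   # detector ("big digital camera") into the HLT
HLT_TO_STORAGE_GB_S = 2.0    # HLT into the Storage Manager
STORAGE_TO_T0_GB_S = 2.0     # Storage Manager on to T0


def reduction_factor(rate_in_gb_s, rate_out_gb_s):
    """How many times smaller the outgoing stream is than the incoming one."""
    return rate_in_gb_s / rate_out_gb_s


def volume_tb(rate_gb_s, seconds):
    """Data volume in TB (decimal, 1 TB = 1000 GB) written at a steady rate."""
    return rate_gb_s * seconds / 1000.0


if __name__ == "__main__":
    factor = reduction_factor(CAMERA_TO_HLT_GB_S, HLT_TO_STORAGE_GB_S)
    print(f"HLT reduction factor: {factor:.0f}x")
    # Assumption: one full day of uninterrupted data taking.
    print(f"Volume sent to T0 per day: {volume_tb(STORAGE_TO_T0_GB_S, 86_400):.1f} TB")
```

Even after a hundredfold reduction, a steady ~2GB/s stream adds up to well over a hundred terabytes per day of running, which is why the later steps spread the data across many sites.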
  • 29. Why transfer so much data?
  • 30. To process all the data taken in one year on one computer would take ~64,000 years
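
The ~64,000 year figure turns into a machine count once the work is split up. The sketch below assumes the processing is perfectly divisible across machines, with an optional efficiency factor for everything that is not. Both the function names and the efficiency parameter are assumptions for illustration, not numbers from the talk.

```python
import math

# Figure quoted on the slide: one year of data on one computer.
SINGLE_MACHINE_YEARS = 64_000


def wall_time_years(n_machines, efficiency=1.0):
    """Years needed with n_machines, assuming perfectly divisible work.

    efficiency (0 < efficiency <= 1) is a hypothetical fudge factor for
    scheduling gaps, failures and data transfer.
    """
    if n_machines <= 0 or not 0 < efficiency <= 1:
        raise ValueError("need n_machines > 0 and 0 < efficiency <= 1")
    return SINGLE_MACHINE_YEARS / (n_machines * efficiency)


def machines_needed(target_years, efficiency=1.0):
    """Smallest whole number of machines that finishes within target_years."""
    if target_years <= 0 or not 0 < efficiency <= 1:
        raise ValueError("need target_years > 0 and 0 < efficiency <= 1")
    return math.ceil(SINGLE_MACHINE_YEARS / (target_years * efficiency))


if __name__ == "__main__":
    print(machines_needed(1.0))        # keep pace with data taking
    print(machines_needed(1.0, 0.5))   # same target at 50% efficiency
```

Keeping pace with data taking under these assumptions needs on the order of tens of thousands of machines, more than a single site would typically host, hence the transfers.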
  • 33. Analysis • Each analysis is ~unique • Query language is C++ • Runs on distributed system and local resources • Series of “cut” selections to identify interesting events • Data in the final plot may be substantially reduced from the original dataset
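
A series of “cut” selections can be pictured as a chain of filters, each applied to the survivors of the previous one. The sketch below is a toy in Python (the slide notes that real analyses are written in C++). The `Event` fields, the cut names and the thresholds are all invented for illustration and are not taken from any CMS analysis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical, heavily simplified event record."""
    n_muons: int
    leading_pt_gev: float
    missing_et_gev: float


# Ordered list of (name, predicate) pairs; thresholds are made up.
CUTS = [
    ("at least two muons", lambda e: e.n_muons >= 2),
    ("leading pT > 20 GeV", lambda e: e.leading_pt_gev > 20.0),
    ("missing ET < 30 GeV", lambda e: e.missing_et_gev < 30.0),
]


def cut_flow(events, cuts):
    """Apply cuts in order; return survivors and the count after each cut."""
    surviving = list(events)
    flow = [("all events", len(surviving))]
    for name, passes in cuts:
        surviving = [e for e in surviving if passes(e)]
        flow.append((name, len(surviving)))
    return surviving, flow


if __name__ == "__main__":
    toy_events = [
        Event(2, 35.0, 10.0),
        Event(1, 50.0, 5.0),
        Event(2, 15.0, 12.0),
        Event(3, 40.0, 45.0),
        Event(2, 25.0, 29.0),
    ]
    selected, flow = cut_flow(toy_events, CUTS)
    for name, count in flow:
        print(f"{name:>22}: {count}")
```

The per-cut counts are what makes the last bullet concrete: each selection removes events, so the sample behind the final plot can be a small fraction of the original dataset.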
  • 34. Workflow ladder (axis: number of users) • Large datasets (>100 TB), complex computation • Large datasets (>100 TB), simple computation • Shared datasets (>500 GB), complex computation • Shared datasets (10-500 GB), complex computation • Shared datasets (10-100 GB), simple computation • Shared datasets (0.1-10 GB), simple computation • Private datasets (0.1-10 GB), simple computation • Three braces group the rungs into: use Grid compute and storage exclusively; work on departmental resources, store resulting datasets to Grid storage; work on laptop/desktop machine, store resulting datasets to Grid storage
  • 35. The life of LHC simulated data 1. Simulated by experimentalists at T0/T1/T2 sites 2. Transferred to T1 sites, archived, possibly reconstructed & skimmed 3. Transferred to T2 sites, reconstructed, skimmed, filtered & analysed 4. Written into locally analysable files, put on laptops 5. Turned into a plot in a paper
  • 37. “We are going to die, and that makes us the lucky ones. Most people are never going to die because they are never going to be born.” - Richard Dawkins
  • 40. Setup • Maybe a bit different to other people • Many sites (>100) with 100s of TB of storage and 10,000s of worker nodes • Global system • Why not at one site? • Politics, power budget, cost
  • 42. We Have a “Big Data” Problem
  • 43. We Have a Big “Data Problem”
  • 44. Do what you do best, outsource the rest
  • 45. What's interesting is that big data isn't interesting any more
  • 46. NIH
  • 48. Our situation • Expert users, who are not interested in infrastructure • Will work around things they perceive as unnecessary limitations
  • 54. Our situation • Limited resources for integration/testbed style activities • Strange organisation
  • 56. There is no such thing as now
  • 57. Keep things as local as possible
  • 59. Small files are bad, m'kay
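
One common response to the small-files problem is to merge files into larger batches before storing or transferring them, so that per-file overheads are paid less often. The slides do not say how this was done at the LHC experiments. The sketch below is a generic greedy planner with an invented function name and made-up sizes.

```python
def plan_merges(file_sizes_mb, target_mb):
    """Group files, in order, into batches of at most target_mb each.

    A file larger than target_mb ends up in a batch on its own.
    Returns a list of batches, each a list of the input sizes.
    """
    if target_mb <= 0:
        raise ValueError("target_mb must be positive")
    batches = []
    current, current_size = [], 0
    for size in file_sizes_mb:
        # Close the current batch if adding this file would overflow it.
        if current and current_size + size > target_mb:
            batches.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    sizes = [300, 400, 500, 200, 900, 100]   # hypothetical sizes in MB
    merged = plan_merges(sizes, target_mb=1000)
    print(f"{len(sizes)} files -> {len(merged)} merged files: {merged}")
```

The planner keeps files in their original order, which keeps it simple but not optimal. A real system would also have to decide when to stop waiting for more small files to arrive.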
  • 62. People are harder than computers
  • 65. Consequences • Automate all the things • Learn to love a configuration management system • Make sure everyone in the team knows how to interact with it • Simple human solutions go a long way
  • 68. Workflow ladder (axis: number of users) • Large datasets (>100 TB), complex computation • Large datasets (>100 TB), simple computation • Shared datasets (>500 GB), complex computation • Shared datasets (10-500 GB), complex computation • Shared datasets (10-100 GB), simple computation • Shared datasets (0.1-10 GB), simple computation • Private datasets (0.1-10 GB), simple computation • Three braces group the rungs into: use Grid compute and storage exclusively; work on departmental resources, store resulting datasets to Grid storage; work on laptop/desktop machine, store resulting datasets to Grid storage