Making fitting in RooFit faster
Automated Parallel Computation of Collaborative Statistical Models
Patrick Bos
Sarajevo, 10 Sep 2018
Physics: Wouter Verkerke (PI), Vince Croft, Carsten Burgard
eScience: Patrick Bos (yours truly), Inti Pelupessy, Jisk Attema
RooFit: Collaborative Statistical Modeling
Collaborative Statistical Modeling
• RooFit: build models together
• Teams of 10-100 physicists
• Collaborations of ~3000 → ~100 teams
• 1 goal
• Pretty impressive to an outsider
Collaborative Statistical Modeling with RooFit
Making RooFit faster (~30x; ~hours → ~minutes)
• More efficient collaboration
• Faster iteration/debugging
• Faster feedback between teams
• Next level physics modeling ambitions, retaining interactive workflow
1. Complex likelihood models, e.g.
   a) Higgs fit to all channels, ~200 datasets, O(1000) parameters, now O(few) hours
   b) EFT framework: again 10-100x more expensive
2. Unbinned ML fits with very large data samples
3. Unbinned ML fits with MC-style numeric integrals
Higgs @ ATLAS
20k+ nodes, 125k hours
Expression tree of C++ objects for mathematical components (variables, operators, functions, integrals, datasets, etc.)
Couple with data, event “observables”
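The expression-tree idea can be illustrated with a minimal sketch. The names `Node`, `Variable`, `Sum`, `Product` and `evaluate_linear` are hypothetical and are not the RooFit class hierarchy; the sketch only shows how a model built from small C++ objects is evaluated recursively.

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical node interface; not the actual RooFit classes.
struct Node {
    virtual ~Node() = default;
    virtual double evaluate() const = 0;
};
using NodePtr = std::shared_ptr<Node>;

// Leaf: a variable (parameter or observable) holding a value.
struct Variable : Node {
    double value;
    explicit Variable(double v) : value(v) {}
    double evaluate() const override { return value; }
};

// Operator node: sum of its children.
struct Sum : Node {
    std::vector<NodePtr> terms;
    explicit Sum(std::vector<NodePtr> t) : terms(std::move(t)) {}
    double evaluate() const override {
        double total = 0.0;
        for (const auto& term : terms) total += term->evaluate();
        return total;
    }
};

// Operator node: product of its children.
struct Product : Node {
    std::vector<NodePtr> factors;
    explicit Product(std::vector<NodePtr> f) : factors(std::move(f)) {}
    double evaluate() const override {
        assert(!factors.empty());
        double total = 1.0;
        for (const auto& factor : factors) total *= factor->evaluate();
        return total;
    }
};

// Builds the tree for a*x + b and evaluates it from the root.
double evaluate_linear(double a, double x, double b) {
    NodePtr va = std::make_shared<Variable>(a);
    NodePtr vx = std::make_shared<Variable>(x);
    NodePtr vb = std::make_shared<Variable>(b);
    std::vector<NodePtr> factors{va, vx};
    NodePtr ax = std::make_shared<Product>(factors);
    std::vector<NodePtr> terms{ax, vb};
    Sum model(terms);
    return model.evaluate();
}
```

Calling `evaluate_linear(2.0, 3.0, 1.0)` walks the tree from the `Sum` root down to the three `Variable` leaves.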
Goals and Design: Make fitting in RooFit faster
Making fitting in RooFit faster: how?
Serial:
benchmarks show no obvious bottlenecks
RooFit already highly optimized (pre-calculation/memoization, MPFE)
Parallel
Faster fitting: (how) can we do it?
Levels of parallelism
1. Gradient (parameter partial derivatives) in minimizer
2. Likelihood
3. Integrals (normalization) & other expensive shared components
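A minimal sketch of the likelihood level, assuming a unit-width Gaussian pdf as a stand-in model: the negative log-likelihood is a sum over events, so it can be cut into event ranges whose partial sums are computed independently and added afterwards. The function names are hypothetical and the tasks run serially here; only the decomposition is shown.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Negative log of a unit-width Gaussian pdf at x, for mean mu.
double gaussian_nll_term(double x, double mu) {
    const double two_pi = 6.283185307179586;
    return 0.5 * (x - mu) * (x - mu) + 0.5 * std::log(two_pi);
}

// NLL contribution of the events in [begin, end): one independent task.
double partial_nll(const std::vector<double>& events, double mu,
                   std::size_t begin, std::size_t end) {
    assert(begin <= end && end <= events.size());
    double sum = 0.0;
    for (std::size_t i = begin; i < end; ++i) {
        sum += gaussian_nll_term(events[i], mu);
    }
    return sum;
}

// Splits the events into n_tasks contiguous ranges and adds the partial sums.
double nll_in_tasks(const std::vector<double>& events, double mu,
                    std::size_t n_tasks) {
    assert(n_tasks > 0);
    const std::size_t n = events.size();
    double total = 0.0;
    for (std::size_t task = 0; task < n_tasks; ++task) {
        const std::size_t begin = task * n / n_tasks;
        const std::size_t end = (task + 1) * n / n_tasks;
        total += partial_nll(events, mu, begin, end);
    }
    return total;
}
```

The split result agrees with the single-range sum up to floating-point rounding, because only the order of additions changes.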
[Figure: levels of parallelism: likelihood over events; likelihood over (unequal) components; integrals etc.; label: “Vector”]
Faster fitting: (how) can we do it?
Heterogeneous: sizes, types
• Multiple strategies
• How to split up?
• Small components → need low latency/overhead
• Large components as well…
• How to divide over cores?
• Load balancing → task-based approach: work stealing
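Why a task-based approach helps with unequal components can be sketched with a small timing model. This is an illustration under simplifying assumptions (known task costs, zero communication overhead), not a measurement of the framework: a fixed split assigns contiguous blocks of tasks up front, while a shared queue hands the next task to whichever worker becomes idle first.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Total run time when tasks are split into contiguous blocks in advance.
double makespan_static(const std::vector<double>& cost, std::size_t n_workers) {
    assert(n_workers > 0);
    std::vector<double> busy(n_workers, 0.0);
    const std::size_t n = cost.size();
    for (std::size_t w = 0; w < n_workers; ++w) {
        for (std::size_t i = w * n / n_workers; i < (w + 1) * n / n_workers; ++i) {
            busy[w] += cost[i];
        }
    }
    return *std::max_element(busy.begin(), busy.end());
}

// Total run time when idle workers take the next task from a shared queue.
double makespan_dynamic(const std::vector<double>& cost, std::size_t n_workers) {
    assert(n_workers > 0);
    std::vector<double> busy(n_workers, 0.0);
    for (double c : cost) {
        // The worker that becomes idle first takes the next task.
        auto idle_first = std::min_element(busy.begin(), busy.end());
        *idle_first += c;
    }
    return *std::max_element(busy.begin(), busy.end());
}
```

For task costs {8, 1, 1, 1, 1, 1, 1, 2} on two workers, the fixed split finishes after 11 time units and the queue-based schedule after 8, because the second worker absorbs the small tasks while the first is busy with the large one.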
Design: MultiProcess task-stealing framework
Task-stealing, worker pool, executes Job tasks
No threads, process-based: “bipe” (BidirMMapPipe) handles fork, mmap, pipes
[Figure: Master, Queue, Worker 1, Worker 2, ...; connected by bipes]
Master: main RooFit process, submits Jobs to queue, waits for results (or does other things in between)
worker loop:
1. Worker requests Job task
2. Queue pops task
3. Worker executes task
4. Worker sends result to Queue
... repeat ...
Job done: Queue sends results to Master on request
queue loop: act on input from Master or Workers (mainly to avoid loop in Master / user code)
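The worker loop above can be sketched with plain POSIX `fork` and `pipe`. This is a simplified illustration with a single worker and no separate queue process, not the BidirMMapPipe protocol; the function name `square_in_worker` is hypothetical.

```cpp
#include <cassert>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <vector>

// Forks one worker that squares every value it receives over a pipe.
// Returns the results in task order (empty if pipe or fork fails).
std::vector<double> square_in_worker(const std::vector<double>& x) {
    const ssize_t n_bytes = static_cast<ssize_t>(sizeof(double));
    int to_worker[2];
    int to_master[2];
    if (pipe(to_worker) != 0 || pipe(to_master) != 0) return {};

    pid_t pid = fork();
    if (pid < 0) return {};
    if (pid == 0) {
        // Worker loop: read a task, execute it, send the result back.
        close(to_worker[1]);
        close(to_master[0]);
        double value = 0.0;
        while (read(to_worker[0], &value, sizeof value) == n_bytes) {
            double result = value * value;
            if (write(to_master[1], &result, sizeof result) != n_bytes) break;
        }
        _exit(0);  // leave without running the master's code
    }

    assert(pid > 0);  // from here on this is the master process
    close(to_worker[0]);
    close(to_master[1]);
    std::vector<double> results;
    for (double value : x) {
        double result = 0.0;
        if (write(to_worker[1], &value, sizeof value) != n_bytes) break;
        if (read(to_master[0], &result, sizeof result) != n_bytes) break;
        results.push_back(result);
    }
    close(to_worker[1]);  // end-of-file on the pipe ends the worker loop
    close(to_master[0]);
    int status = 0;
    waitpid(pid, &status, 0);
    return results;
}
```

The master here waits for each result before sending the next task; a queue process between master and workers, as in the design above, removes that blocking loop from user code.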
template <class T> class MP::Vector : public T, public MP::Job
[Figure: class diagram with Parallelized class, MP::Vector, MP::Job, MP::TaskManager and Serial class (likelihood, gradient, ...)]
MultiProcess usage for devs
template <class T> class MP::Vector : public T, public MP::Job
class Parallel : public MP::Vector<Serial>
MultiProcess usage for devs
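#include <cstddef>
#include <utility>
#include <vector>
using namespace std;  // the slides use unqualified vector, move, size_t

class xSquaredSerial {
 public:
  xSquaredSerial(vector<double> x_init)
      : x(move(x_init))
      , x_squared(x.size()) {}  // size the result container to match the input

  // square every element of x
  virtual void evaluate() {
    for (size_t ix = 0; ix < x.size(); ++ix) {
      x_squared[ix] = x[ix] * x[ix];
    }
  }

  vector<double> get_result() {
    evaluate();
    return x_squared;
  }

 protected:
  vector<double> x;
  vector<double> x_squared;
};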
// Note: result, results, id, get_manager() and gather_worker_results() are not
// declared in this snippet; presumably they come from the MultiProcess base classes.
class xSquaredParallel
    : public RooFit::MultiProcess::Vector<xSquaredSerial> {
 public:
  xSquaredParallel(size_t N_workers, vector<double> x_init) :
      RooFit::MultiProcess::Vector<xSquaredSerial>(N_workers, x_init)
  {}

 private:
  void evaluate_task(size_t task) override {
    result[task] = x[task] * x[task];
  }

 public:
  void evaluate() override {
    if (get_manager()->is_master()) {
      // do necessary synchronization before work_mode
      // enable work mode: workers will start stealing work from queue
      get_manager()->set_work_mode(true);
      // master fills queue with tasks
      for (size_t task_id = 0; task_id < x.size(); ++task_id) {
        get_manager()->to_queue(JobTask(id, task_id));
      }
      // wait for task results back from workers to master
      gather_worker_results();
      // end work mode
      get_manager()->set_work_mode(false);
      // put gathered results in desired container (same as used in serial class)
      for (size_t task_id = 0; task_id < x.size(); ++task_id) {
        x_squared[task_id] = results[task_id];
      }
    }
  }
};
MultiProcess for users
vector<double> x {1, 4, 5, 6.48074};
xSquaredSerial xsq_serial(x);
size_t N_workers = 4;
xSquaredParallel xsq_parallel(N_workers, x);
// get the same results, but now faster:
xsq_serial.get_result();
xsq_parallel.get_result();
// use parallelized version in your existing functions
void some_function(xSquaredSerial* xsq);
some_function(&xsq_parallel); // no problem!
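Why the parallel object can be passed to code written for the serial class follows from the inheritance pattern `Vector<T> : public T, public Job`. The sketch below mimics that pattern with hypothetical stand-ins (`MockJob`, `MockVector`, `SquareSerial`, `SquareParallel`) that run the tasks in-process; it does not use the real MultiProcess classes.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for MP::Job: something that can run numbered tasks.
class MockJob {
public:
    virtual ~MockJob() = default;
    virtual void evaluate_task(std::size_t task) = 0;
};

// Hypothetical stand-in for MP::Vector: inherits from the serial class T
// and from the job interface, forwarding extra arguments to T.
template <class T>
class MockVector : public T, public MockJob {
public:
    template <class... Args>
    explicit MockVector(std::size_t n_workers, Args&&... args)
        : T(std::forward<Args>(args)...), n_workers_(n_workers) {}
    std::size_t n_workers() const { return n_workers_; }
private:
    std::size_t n_workers_;  // kept for illustration; no workers are started
};

class SquareSerial {
public:
    explicit SquareSerial(std::vector<double> x_init)
        : x(std::move(x_init)), x_squared(x.size()) {}
    virtual ~SquareSerial() = default;
    virtual void evaluate() {
        for (std::size_t i = 0; i < x.size(); ++i) x_squared[i] = x[i] * x[i];
    }
    std::vector<double> get_result() {
        evaluate();  // virtual: dispatches to the parallel override if present
        return x_squared;
    }
protected:
    std::vector<double> x;
    std::vector<double> x_squared;
};

class SquareParallel : public MockVector<SquareSerial> {
public:
    SquareParallel(std::size_t n_workers, std::vector<double> x_init)
        : MockVector<SquareSerial>(n_workers, std::move(x_init)) {}
    void evaluate_task(std::size_t task) override {
        assert(task < x.size());
        x_squared[task] = x[task] * x[task];
    }
    void evaluate() override {
        // A real implementation would hand these tasks to worker processes.
        for (std::size_t task = 0; task < x.size(); ++task) evaluate_task(task);
    }
};

// Existing code written against the serial interface.
double sum_of_squares(SquareSerial* xsq) {
    double total = 0.0;
    for (double v : xsq->get_result()) total += v;
    return total;
}
```

Because `SquareParallel` is-a `SquareSerial`, `sum_of_squares(&parallel)` compiles unchanged and the virtual `evaluate()` selects the task-based path at run time.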
Parallel performance (MPFE & MP)
Likelihood fits (unbinned, binned)
Numerical integrals
Gradients
Parallel likelihood fits: unbinned, MPFE
Before: max ~2x
Now (with CPU affinity fixed): max ~20x (more for larger fits)
[Figure: run-time vs N(cores); actual performance compared with expected performance (ideal parallelization)]
Parallel likelihood fits: binned
[Figure: run-time vs N(cores) in binned fits; actual performance, expected performance (ideal parallelization) and CPU time (single core); annotation: room for improvement]
WIP
Gradient parallelization
0th step: get Minuit to use external derivative
1st step: replicate Minuit2 behavior
• NumericalDerivator (Lorenzo)
• Modified to exactly (floating point bit-wise) replicate Minuit2
• → RooGradMinimizer
2nd step: calculate partial derivative for each parameter in parallel
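The second step works because each partial derivative only needs function evaluations at shifted parameter values, so the parameters form independent tasks. The sketch below uses a plain central difference with a fixed step, which is an assumption made for brevity and not the Minuit2 derivative algorithm; the names are hypothetical and the tasks run serially.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

using Function = std::function<double(const std::vector<double>&)>;

// One task: central-difference estimate of df/dp_i.
// It works on its own copy of the parameters, so tasks do not interfere.
double partial_derivative(const Function& f, std::vector<double> params,
                          std::size_t i, double step = 1e-6) {
    assert(i < params.size());
    const double original = params[i];
    params[i] = original + step;
    const double up = f(params);
    params[i] = original - step;
    const double down = f(params);
    return (up - down) / (2.0 * step);
}

// Full gradient: one task per parameter (executed in a loop here).
std::vector<double> gradient(const Function& f, const std::vector<double>& params) {
    std::vector<double> grad(params.size());
    for (std::size_t i = 0; i < params.size(); ++i) {
        grad[i] = partial_derivative(f, params, i);
    }
    return grad;
}
```

With a worker pool, each call to `partial_derivative` would become one queued task and only the resulting numbers would travel back to the master.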
Gradient parallelization
First benchmarks (yesterday):
ggF workspace (Carsten), migrad fit
scaling not perfect and erratic (+/- 5s)
similar to what we saw for likelihoods without CPU pinning
probably due to too much synchronization
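Minimizer                    Workers   Fit time
RooMinimizer                 -         28s
MultiProcess GradMinimizer   1         33s
MultiProcess GradMinimizer   2         20s
MultiProcess GradMinimizer   3         15s
MultiProcess GradMinimizer   4         14s
MultiProcess GradMinimizer   6         17s (…)
MultiProcess GradMinimizer   8         11s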
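The timings above can be turned into speedup and parallel efficiency with two small helpers. The function names are hypothetical, and the numbers in the note assume the last timing (11s) belongs to the 8-worker run.

```cpp
#include <cassert>

// Speedup of a run relative to a reference run time.
double speedup(double t_reference, double t_parallel) {
    assert(t_parallel > 0.0);
    return t_reference / t_parallel;
}

// Parallel efficiency: speedup relative to one worker, divided by worker count.
double efficiency(double t_one_worker, double t_n_workers, int n_workers) {
    assert(n_workers > 0);
    return speedup(t_one_worker, t_n_workers) / n_workers;
}
```

Under that reading, `speedup(33.0, 11.0)` gives 3x for 8 workers relative to 1 worker, and `efficiency(33.0, 11.0, 8)` gives 0.375, which quantifies "scaling not perfect". Relative to the serial RooMinimizer (28s) the gain is roughly 2.5x.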
Let’s stay in touch
+31 (0)6 10 79 58 74
p.bos@esciencecenter.nl
www.esciencecenter.nl
egpbos
linkedin.com/in/egpbos
blog.esciencecenter.nl
Encore
Future work
Load balancing
PDF timings change dynamically due to RooFit precalculation strategies
… not a problem for numerical integrals
Analytical derivatives (automated? CLAD)
Numerical integrals
“Analytical” integrals
Forced numerical (Monte Carlo) integrals
(Higgs fits didn’t have them)
Numerical integrals
[Figure: individual NI timings (variation in runs and iterations), with maxima and minima; sum of slowest integrals/cores per iteration over the entire run (single core total runtime: 3.2s)]
Faster fitting: MultiProcess design
RooFit::MultiProcess::Vector<YourSerialClass>
• Serial class: likelihood (e.g. RooNLLVar) or gradient (Minuit)
• Interface: subclass + MP
• Define ”vector elements”
• Group elements into tasks (to be executed in parallel)
RooFit::MultiProcess::SharedArg<T>
• Normalization integrals or other shared expensive objects
• Parallel task definition specific to type of object
• … design in progress
RooFit::MultiProcess::TaskManager
• Queue gathers tasks and communicates with worker pool
• Workers steal tasks from queue
• Worker pool: forked processes (BidirMMapPipe)
  • performant and already used in RooFit
  • no thread-safety concerns
  • instead: communication concerns
  • … flexible design, implementation can be replaced (e.g. TBB)
Single core profiling and improvements
Faster fitting: single core profiling with Callgrind, Cachegrind, Instruments
Higgs ggF & 9 channel fits (workspaces by Lydia Brenner)
Most time spent on:
1. Memory access → RooVectorDataStore::get() (4% / 32%), 0.3% LL cache misses (expensive!)
   • Row-wise access pattern on column-wise data store (and std::vector<std::vector>)
2. Logarithms: 12%
3. Interpolation → RooStats::HistFactory::FlexibleInterpVar (10%)
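The access-pattern issue in item 1 can be made concrete with a small sketch. The layout below (one contiguous vector per observable) is a hypothetical stand-in for a column-wise data store, not the RooVectorDataStore implementation. Both functions compute the same sum; they differ only in traversal order, which is what determines how well the CPU cache is used.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Column-wise store: store[column][event], one contiguous vector per column.
using ColumnStore = std::vector<std::vector<double>>;

// Row-wise traversal: for each event, touch every column in turn,
// which jumps between separately allocated vectors on every access.
double sum_row_wise(const ColumnStore& store) {
    if (store.empty()) return 0.0;
    double total = 0.0;
    const std::size_t n_events = store.front().size();
    for (std::size_t event = 0; event < n_events; ++event) {
        for (const auto& column : store) {
            assert(column.size() == n_events);
            total += column[event];
        }
    }
    return total;
}

// Column-wise traversal: walk each contiguous column from start to end.
double sum_column_wise(const ColumnStore& store) {
    double total = 0.0;
    for (const auto& column : store) {
        for (double value : column) total += value;
    }
    return total;
}
```

On large stores the column-wise order is generally the cache-friendlier one for this layout; actual gains would need to be measured with a profiler such as Cachegrind.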
Faster fitting: single core improvements
RooLinkedList::findArg: ~ 5% of memory access instructions
RooLinkedList::At took considerable time in Gaussian test fit (Vince)
std::vector lookup → 1.6x speedup! WIP
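The difference between the two lookups can be sketched with standard containers. `std::list` is used here as a stand-in for a linked list and is an assumption; RooLinkedList has its own implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <list>
#include <vector>

// Index lookup in a linked list walks `index` links from the head: O(index).
int at_list(const std::list<int>& items, std::size_t index) {
    assert(index < items.size());
    auto it = std::next(items.begin(), static_cast<std::ptrdiff_t>(index));
    return *it;
}

// Index lookup in a vector is a single address computation: O(1).
int at_vector(const std::vector<int>& items, std::size_t index) {
    assert(index < items.size());
    return items[index];
}
```

Both return the same element; when lookups by index sit in a hot loop, the linked-list walk is repeated on every call.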
Faster fitting: future work
Reorder tree evaluation → CPU cache use, vectorization
Smarter fitting (stochastic minimizer, analytical gradient, CLAD)
Front-end / back-end separation (e.g. TensorFlow back-end)
Faster fitting: single core profiling meta-conclusions
profiling functions & classes
valgrind
gprof
Instruments
… etc.
profiling objects (e.g. call-trees, e.g. RooFit…)
… DIY?
More Multi-Core
Parallel likelihood fits: existing RooFit implementation details
RooRealMPFE / BidirMMapPipe
Custom multi-process message passing protocol
• POSIX fork, pipe, mmap
Communication “overhead” (delay between sending and receiving messages): ~ 1e-4 seconds
• serverLoop waits for message & runs server-side code
• messages used sparingly
• data transfer over memory-mapped pipes
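The building blocks listed above can be combined in a few lines. The sketch below shares one anonymous memory-mapped page between a parent and a forked child; it is a minimal illustration of `fork` plus `mmap`, not the BidirMMapPipe protocol, and the function name is hypothetical.

```cpp
#include <cassert>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// The child computes 1^2 + ... + n^2 and writes it into shared memory;
// the parent reads it after the child has exited. Returns -1.0 on failure.
double sum_squares_in_child(int n) {
    assert(n >= 0);
    void* mem = mmap(nullptr, sizeof(double), PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return -1.0;
    double* shared = static_cast<double*>(mem);
    *shared = 0.0;

    pid_t pid = fork();
    if (pid < 0) {
        munmap(mem, sizeof(double));
        return -1.0;
    }
    if (pid == 0) {
        // Child: the mapping is shared, so this write is visible to the parent.
        double total = 0.0;
        for (int i = 1; i <= n; ++i) total += static_cast<double>(i) * i;
        *shared = total;
        _exit(0);
    }

    int status = 0;
    waitpid(pid, &status, 0);  // the child has exited, so its write is complete
    const double result = *shared;
    munmap(mem, sizeof(double));
    return result;
}
```

Waiting for the child to exit is the simplest possible synchronization; a long-lived worker needs an explicit message, for example over a pipe, to signal that data is ready.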
TensorFlow experiments
Fits on identical model & data (single i7 machine)
TensorFlow: No pre-calculation / caching!
Major advantage of RooFit for binned fits (e.g. morphing histograms)
(feature request for memoization https://github.com/tensorflow/tensorflow/issues/5323)
N.B.: measured before CPU affinity fixing
RooFit now even faster (but limited to running on one machine)
               RooFit (MINUIT)   TensorFlow (BFGS)
Unbinned fit   0.1s              0.01 - 0.1s (dep. on precision)
Binned fit     0.7ms             2.3ms
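The pre-calculation / caching advantage mentioned above can be illustrated with a dirty-flag cache. `CachedLog` is a hypothetical class for illustration, not RooFit's caching mechanism: the expensive value is recomputed only when its input actually changed.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical cached node: recomputes log(input) only after the input changed.
class CachedLog {
public:
    void set_input(double x) {
        assert(x > 0.0);
        if (x != input_) {
            input_ = x;
            dirty_ = true;  // cached value is stale
        }
    }
    double value() {
        if (dirty_) {
            cached_ = std::log(input_);
            ++evaluations_;
            dirty_ = false;
        }
        return cached_;
    }
    int evaluations() const { return evaluations_; }
private:
    double input_ = 1.0;
    double cached_ = 0.0;
    bool dirty_ = true;
    int evaluations_ = 0;
};
```

In a fit, many components depend on parameters that stay constant between minimizer steps, so repeated `value()` calls on those components cost only a flag check.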

Nutraceutical market, scope and growth: Herbal drug technologyNutraceutical market, scope and growth: Herbal drug technology
Nutraceutical market, scope and growth: Herbal drug technology
 
Circulatory system_ Laplace law. Ohms law.reynaults law,baro-chemo-receptors-...
Circulatory system_ Laplace law. Ohms law.reynaults law,baro-chemo-receptors-...Circulatory system_ Laplace law. Ohms law.reynaults law,baro-chemo-receptors-...
Circulatory system_ Laplace law. Ohms law.reynaults law,baro-chemo-receptors-...
 
Structures and textures of metamorphic rocks
Structures and textures of metamorphic rocksStructures and textures of metamorphic rocks
Structures and textures of metamorphic rocks
 
GBSN- Microbiology (Lab 3) Gram Staining
GBSN- Microbiology (Lab 3) Gram StainingGBSN- Microbiology (Lab 3) Gram Staining
GBSN- Microbiology (Lab 3) Gram Staining
 

Making fitting in RooFit faster

  • 1. Making fitting in RooFit faster Automated Parallel Computation of Collaborative Statistical Models Patrick Bos Sarajevo, 10 Sep 2018
  • 2. Automated Parallel Computation of Collaborative Statistical Models Physics: Wouter Verkerke (PI), Vince Croft, Carsten Burgard eScience: Patrick Bos (yours truly), Inti Pelupessy, Jisk Attema
  • 4. Collaborative Statistical Modeling • RooFit: build models together • Teams 10-100 physicists • Collaborations ~3000 → ~100 teams • 1 goal • Pretty impressive to an outsider
  • 5. Collaborative Statistical Modeling with RooFit. Making RooFit faster (~30x; ~hours → ~minutes) • More efficient collaboration • Faster iteration/debugging • Faster feedback between teams • Next-level physics modeling ambitions, retaining interactive workflow: 1. Complex likelihood models, e.g. a) Higgs fit to all channels, ~200 datasets, O(1000) parameters, now O(few) hours; b) EFT framework: again 10-100x more expensive. 2. Unbinned ML fits with very large data samples. 3. Unbinned ML fits with MC-style numeric integrals. Higgs @ ATLAS: 20k+ nodes, 125k hours. RooFit models: expression tree of C++ objects for mathematical components (variables, operators, functions, integrals, datasets, etc.), coupled with data (event “observables”).
  • 6. Goals and Design: Make fitting in RooFit faster
  • 7. Making fitting in RooFit faster: how? Serial: benchmarks show no obvious bottlenecks; RooFit already highly optimized (pre-calculation/memoization, MPFE). Parallel.
  • 8. Faster fitting: (how) can we do it? Levels of parallelism: 1. Gradient (parameter partial derivatives) in minimizer; 2. Likelihood; 3. Integrals (normalization) & other expensive shared components. [Diagram labels: likelihood: events; likelihood: (unequal) components; integrals etc.; “Vector”]
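The "likelihood: events" level can be illustrated with a minimal sketch. It relies on the fact that an unbinned negative log-likelihood is a sum over events, so any event range can be evaluated as an independent task and the partial sums added afterwards. The unit-width Gaussian model and the names `neg_log_gauss`, `partial_nll` and `chunked_nll` are illustrative assumptions, not RooFit code.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Negative log of a unit-width Gaussian density at x with mean mu
// (hypothetical stand-in for a real PDF).
double neg_log_gauss(double x, double mu) {
    const double pi = 3.14159265358979323846;
    return 0.5 * (x - mu) * (x - mu) + 0.5 * std::log(2.0 * pi);
}

// Partial NLL over events [begin, end): one such range could be one task.
double partial_nll(const std::vector<double>& events, double mu,
                   std::size_t begin, std::size_t end) {
    double sum = 0.0;
    for (std::size_t i = begin; i < end; ++i) sum += neg_log_gauss(events[i], mu);
    return sum;
}

// Split the events into n_tasks (> 0) nearly equal ranges and add the partial
// sums. Here the ranges are evaluated one after another; in a parallel setup
// each call to partial_nll would run on a different worker.
double chunked_nll(const std::vector<double>& events, double mu, std::size_t n_tasks) {
    double total = 0.0;
    for (std::size_t t = 0; t < n_tasks; ++t) {
        const std::size_t begin = events.size() * t / n_tasks;
        const std::size_t end = events.size() * (t + 1) / n_tasks;
        total += partial_nll(events, mu, begin, end);
    }
    return total;
}
```

The chunked result agrees with the single-range sum up to floating-point summation order, which is why bit-wise reproducibility needs extra care once work is split.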
  • 9. Faster fitting: (how) can we do it? Heterogeneous: sizes, types • Multiple strategies • How to split up? • Small components → need low latency/overhead • Large components as well… • How to divide over cores? • Load balancing → task-based approach: work stealing. [Diagram labels: likelihood: events; likelihood: (unequal) components; integrals etc.]
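Why unequal components favor a task-based approach can be shown with a small cost model. This is only a sketch under simple assumptions: task costs are known numbers, `static_makespan` gives each worker a fixed contiguous block up front, and `dynamic_makespan` lets whichever worker is free first take the next task from a shared queue. Both function names are hypothetical, and `n_workers` must be at least 1.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Static split: worker w gets a fixed contiguous block of tasks up front.
// Returns the load of the busiest worker (the time until everything is done).
double static_makespan(const std::vector<double>& cost, std::size_t n_workers) {
    double worst = 0.0;
    for (std::size_t w = 0; w < n_workers; ++w) {
        const std::size_t begin = cost.size() * w / n_workers;
        const std::size_t end = cost.size() * (w + 1) / n_workers;
        double load = 0.0;
        for (std::size_t i = begin; i < end; ++i) load += cost[i];
        worst = std::max(worst, load);
    }
    return worst;
}

// Dynamic assignment: the worker that becomes free first takes the next task.
double dynamic_makespan(const std::vector<double>& cost, std::size_t n_workers) {
    std::vector<double> busy_until(n_workers, 0.0);
    for (double c : cost) {
        auto first_free = std::min_element(busy_until.begin(), busy_until.end());
        *first_free += c;
    }
    return *std::max_element(busy_until.begin(), busy_until.end());
}
```

With one task of cost 8 followed by seven tasks of cost 1 and two workers, the static split finishes at 11 while the dynamic assignment finishes at 8.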
  • 10. Design: MultiProcess task-stealing framework. Task-stealing worker pool executes Job tasks. No threads, process-based: “bipe” (BidirMMapPipe) handles fork, mmap, pipes. [Diagram: Master, Queue, Worker 1, Worker 2, ..., connected by bipes] Master: main RooFit process; submits Jobs to queue, waits for results (or does other things in between). Worker loop: Worker requests Job task; Queue pops task; Worker executes task; Worker sends result to Queue; ... repeat ...; Job done: Queue sends results to Master on request. Queue loop: act on input from Master or Workers (mainly to avoid a loop in Master / user code). template <class T> class MP::Vector : public T, public MP::Job. [Class diagram: Parallelized class; MP::Vector; MP::Job; MP::TaskManager; Serial class (likelihood, gradient, ...)]
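The request/execute/report cycle of the worker loop can be sketched with plain POSIX `fork`, `pipe` and `poll`. This is not BidirMMapPipe and not the MultiProcess API: the master doubles as the queue, the `Report` struct and both function names are hypothetical, error handling is minimal, and `n_workers` is assumed to be at least 1.

```cpp
#include <poll.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstddef>
#include <vector>

// Worker -> master message: the previous result plus an implicit request
// for the next task. task == -1 means "nothing to report yet".
struct Report { int task; double value; };

// Worker loop: request a task, execute it, send the result, repeat.
static void worker_loop(int from_master, int to_master, const std::vector<double>& x) {
    Report report{-1, 0.0};
    for (;;) {
        if (write(to_master, &report, sizeof report) != static_cast<ssize_t>(sizeof report)) _exit(1);
        int task = -1;
        if (read(from_master, &task, sizeof task) != static_cast<ssize_t>(sizeof task) || task < 0) _exit(0);
        report = Report{task, x[task] * x[task]};
    }
}

// Master: forks the workers, hands out task ids on request, gathers results.
std::vector<double> square_with_workers(const std::vector<double>& x, int n_workers) {
    std::vector<pollfd> fds(n_workers);
    std::vector<int> to_worker(n_workers);
    std::vector<pid_t> pids(n_workers);
    for (int w = 0; w < n_workers; ++w) {
        int down[2], up[2];  // down: master -> worker, up: worker -> master
        if (pipe(down) != 0 || pipe(up) != 0) return {};
        pids[w] = fork();
        if (pids[w] < 0) return {};  // sketch only: earlier workers are not cleaned up
        if (pids[w] == 0) {
            // Child: drop the ends it does not use, including those of earlier workers.
            for (int k = 0; k < w; ++k) { close(to_worker[k]); close(fds[k].fd); }
            close(down[1]); close(up[0]);
            worker_loop(down[0], up[1], x);
            _exit(0);
        }
        close(down[0]); close(up[1]);
        to_worker[w] = down[1];
        fds[w].fd = up[0]; fds[w].events = POLLIN; fds[w].revents = 0;
    }
    std::vector<double> result(x.size());
    std::size_t next = 0;  // index of the next task in the "queue"
    int active = n_workers;
    while (active > 0) {
        if (poll(fds.data(), fds.size(), -1) < 0) break;
        for (int w = 0; w < n_workers; ++w) {
            if (!(fds[w].revents & (POLLIN | POLLHUP))) continue;
            Report report;
            int task = -1;
            if (read(fds[w].fd, &report, sizeof report) == static_cast<ssize_t>(sizeof report)) {
                if (report.task >= 0) result[report.task] = report.value;
                if (next < x.size()) task = static_cast<int>(next++);
                if (write(to_worker[w], &task, sizeof task) != static_cast<ssize_t>(sizeof task)) task = -1;
            }
            if (task < 0) { close(fds[w].fd); fds[w].fd = -1; --active; }  // poll skips negative fds
        }
    }
    for (int w = 0; w < n_workers; ++w) { close(to_worker[w]); waitpid(pids[w], nullptr, 0); }
    return result;
}
```

Because each worker asks for a new task only when it has finished the previous one, fast workers naturally take more tasks, which is the load-balancing property the design is after.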
  • 11. MultiProcess usage for devs: template <class T> class MP::Vector : public T, public MP::Job; class Parallel : public MP::Vector<Serial>. [Class diagram: Parallelized class; MP::Vector; MP::Job; MP::TaskManager; Serial class]
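The inheritance pattern on this slide can be tried out in isolation. The sketch below uses stand-in classes named `Job`, `Vector`, `Serial` and `Parallel`; they only mimic the shape of the MP classes and are assumptions, not the RooFit implementation. In particular, `run_all_tasks` simply runs the tasks one after another where the real framework would send them to a queue.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Stand-in for MP::Job: the interface a task scheduler would call.
class Job {
public:
    virtual ~Job() = default;
    virtual void evaluate_task(std::size_t task) = 0;
    virtual std::size_t n_tasks() const = 0;
};

// Stand-in for MP::Vector<T>: inherits the serial class T and the Job interface.
template <class T>
class Vector : public T, public Job {
public:
    template <class... Args>
    explicit Vector(Args&&... args) : T(std::forward<Args>(args)...) {}
protected:
    // Placeholder for "send tasks to the queue and gather the results".
    void run_all_tasks() {
        for (std::size_t task = 0; task < n_tasks(); ++task) evaluate_task(task);
    }
};

// An existing serial class: doubles every element of x.
class Serial {
public:
    explicit Serial(std::vector<double> x_init) : x(std::move(x_init)), out(x.size()) {}
    virtual ~Serial() = default;
    virtual void evaluate() {
        for (std::size_t i = 0; i < x.size(); ++i) out[i] = 2.0 * x[i];
    }
    std::vector<double> get_result() { evaluate(); return out; }
protected:
    std::vector<double> x;
    std::vector<double> out;
};

// The parallelized version: same interface as Serial, work split into tasks.
class Parallel : public Vector<Serial> {
public:
    explicit Parallel(std::vector<double> x_init) : Vector<Serial>(std::move(x_init)) {}
    void evaluate() override { run_all_tasks(); }
    void evaluate_task(std::size_t task) override { out[task] = 2.0 * x[task]; }
    std::size_t n_tasks() const override { return x.size(); }
};

// Existing code written against the serial class keeps working unchanged.
double first_element(Serial* s) { return s->get_result()[0]; }
```

Since `Parallel` is-a `Serial`, a call such as `first_element(&parallel_object)` compiles and dispatches to the task-based `evaluate()` through the virtual call.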
  • 12. MultiProcess usage for devs

        #include <cstddef>
        #include <utility>
        #include <vector>
        // plus the RooFit MultiProcess header that provides Vector and JobTask
        // (its name is not given on the slide)
        using std::move; using std::size_t; using std::vector;

        // Serial version: squares every element of x.
        class xSquaredSerial {
        public:
          xSquaredSerial(vector<double> x_init)
            : x(move(x_init))
            , x_squared(x.size())
          {}

          virtual void evaluate() {
            for (size_t ix = 0; ix < x.size(); ++ix) {
              x_squared[ix] = x[ix] * x[ix];
            }
          }

          vector<double> get_result() {
            evaluate();
            return x_squared;
          }

        protected:
          vector<double> x;
          vector<double> x_squared;
        };

        // Parallel version: one task per element.
        // get_manager(), gather_worker_results(), id and results are not declared
        // here; presumably they come from the MP::Vector / MP::Job base classes.
        class xSquaredParallel : public RooFit::MultiProcess::Vector<xSquaredSerial> {
        public:
          xSquaredParallel(size_t N_workers, vector<double> x_init)
            : RooFit::MultiProcess::Vector<xSquaredSerial>(N_workers, x_init)
          {}

        private:
          // executed on a worker for a single task
          void evaluate_task(size_t task) override {
            x_squared[task] = x[task] * x[task];
          }

        public:
          void evaluate() override {
            if (get_manager()->is_master()) {
              // do necessary synchronization before work_mode
              // enable work mode: workers will start stealing work from queue
              get_manager()->set_work_mode(true);
              // master fills queue with tasks
              for (size_t task_id = 0; task_id < x.size(); ++task_id) {
                get_manager()->to_queue(JobTask(id, task_id));
              }
              // wait for task results back from workers to master
              gather_worker_results();
              // end work mode
              get_manager()->set_work_mode(false);
              // put gathered results in desired container (same as used in serial class)
              for (size_t task_id = 0; task_id < x.size(); ++task_id) {
                x_squared[task_id] = results[task_id];
              }
            }
          }
        };

    template <class T> class MP::Vector : public T, public MP::Job
  • 13. MultiProcess for users

        vector<double> x {1, 4, 5, 6.48074};

        xSquaredSerial xsq_serial(x);

        size_t N_workers = 4;
        xSquaredParallel xsq_parallel(N_workers, x);

        // get the same results, but now faster:
        xsq_serial.get_result();
        xsq_parallel.get_result();

        // use parallelized version in your existing functions
        void some_function(xSquaredSerial* xsq);
        some_function(&xsq_parallel);  // no problem!
  • 14. Parallel performance (MPFE & MP) Likelihood fits (unbinned, binned) Numerical integrals Gradients
  • 15. Parallel likelihood fits: unbinned, MPFE. Before: max ~2x. Now (with CPU affinity fixed): max ~20x (more for larger fits). [Plot: run-time vs N(cores); actual performance vs expected performance (ideal parallelization)]
  • 16. Parallel likelihood fits: binned. [Plot: run-time vs N(cores) in binned fits; actual performance, expected performance (ideal parallelization), CPU time (single core)] Room for improvement (WIP).
  • 17. Gradient parallelization. 0th step: get Minuit to use external derivative. 1st step: replicate Minuit2 behavior • NumericalDerivator (Lorenzo) • Modified to exactly (floating point bit-wise) replicate Minuit2 • → RooGradMinimizer. 2nd step: calculate partial derivative for each parameter in parallel.
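The 2nd step works because each partial derivative depends only on function evaluations around one parameter, so the components of the gradient are independent tasks. The sketch below uses a plain central difference with a fixed step; this is an assumption for illustration and deliberately simpler than the Minuit2 numerical derivative that the slide says is replicated. `Function`, `partial_derivative` and `gradient` are hypothetical names.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

using Function = std::function<double(const std::vector<double>&)>;

// One task: the partial derivative of f with respect to parameter i,
// estimated by a central difference with step h. It works on its own copy
// of the parameters, so tasks for different i do not interfere.
double partial_derivative(const Function& f, std::vector<double> p,
                          std::size_t i, double h = 1e-5) {
    const double original = p[i];
    p[i] = original + h;
    const double up = f(p);
    p[i] = original - h;
    const double down = f(p);
    return (up - down) / (2.0 * h);
}

// The full gradient: one independent task per parameter. The loop runs
// serially here; each iteration could be handed to a different worker.
std::vector<double> gradient(const Function& f, const std::vector<double>& p) {
    std::vector<double> g(p.size());
    for (std::size_t i = 0; i < p.size(); ++i) g[i] = partial_derivative(f, p, i);
    return g;
}
```

Each task costs two full function evaluations, so with O(1000) parameters the gradient dominates the cost of a minimizer step unless it is spread over workers.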
  • 18. Gradient parallelization. First benchmarks (yesterday): ggF workspace (Carsten), migrad fit. Scaling not perfect and erratic (+/- 5s), similar to what we saw for likelihoods without CPU pinning; probably due to too much synchronization. Timings: RooMinimizer (no workers): 28s. MultiProcess GradMinimizer: 1 worker 33s; 2 workers 20s; 3 workers 15s; 4 workers 14s; 6 workers 17s (…); 8 workers 11s.
  • 19. Let’s stay in touch +31 (0)6 10 79 58 74 p.bos@esciencecenter.nl www.esciencecenter.nl egpbos linkedin.com/in/egpbos blog.esciencecenter.nl
  • 21. Future work. Load balancing: PDF timings change dynamically due to RooFit precalculation strategies … not a problem for numerical integrals. Analytical derivatives (automated? CLAD).
  • 22. Numerical integrals “Analytical” integrals Forced numerical (Monte Carlo) integrals (Higgs fits didn’t have them)
  • 23. Numerical integrals. [Plots: individual NI timings (variation in runs and iterations), with maxima and minima; sum of slowest integrals/cores per iteration over the entire run (single core total runtime: 3.2s)]
  • 24. Faster fitting: MultiProcess design RooFit::MultiProcess::Vector<YourSerialClass> Serial class: likelihood (e.g. RooNLLVar) or gradient (Minuit) Interface: subclass + MP Define ”vector elements” Group elements into tasks (to be executed in parallel) RooFit::MultiProcess::SharedArg<T> RooFit::MultiProcess::TaskManager
  • 25. Faster fitting: MultiProcess design RooFit::MultiProcess::Vector<YourSerialClass> RooFit::MultiProcess::SharedArg<T> Normalization integrals or other shared expensive objects Parallel task definition specific to type of object … design in progress RooFit::MultiProcess::TaskManager
  • 26. Faster fitting: MultiProcess design RooFit::MultiProcess::Vector<YourSerialClass> RooFit::MultiProcess::SharedArg<T> RooFit::MultiProcess::TaskManager Queue gathers tasks and communicates with worker pool Workers steal tasks from queue Worker pool: forked processes (BidirMMapPipe) • performant and already used in RooFit • no thread-safety concerns • instead: communication concerns • … flexible design, implementation can be replaced (e.g. TBB)
  • 27. Single core profiling and improvements
  • 28. Faster fitting: single core profiling with Callgrind, Cachegrind, Instruments. Higgs ggF & 9 channel fits (workspaces by Lydia Brenner). Most time spent on: 1. Memory access → RooVectorDataStore::get() (4% / 32%), 0.3% LL cache misses (expensive!) • Row-wise access pattern on column-wise data store (and std::vector<std::vector>); 2. Logarithms: 12%; 3. Interpolation → RooStats::HistFactory::FlexibleInterpVar (10%).
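The access-pattern mismatch in item 1 can be made concrete with a toy column-wise store. `ColumnStore` and the two traversal functions are hypothetical and unrelated to the actual RooVectorDataStore layout; the sketch only shows that reading one event at a time touches every column array in turn, while reading one column at a time walks contiguous memory.

```cpp
#include <cstddef>
#include <vector>

// Toy column-wise store: one contiguous array per observable.
struct ColumnStore {
    std::vector<std::vector<double>> columns;  // columns[observable][event]
    std::size_t n_events() const { return columns.empty() ? 0 : columns[0].size(); }
};

// Row-wise traversal: for each event, visit every column.
// Consecutive reads jump between separately allocated arrays.
double sum_row_wise(const ColumnStore& store) {
    double sum = 0.0;
    for (std::size_t event = 0; event < store.n_events(); ++event)
        for (const auto& column : store.columns) sum += column[event];
    return sum;
}

// Column-wise traversal: finish one contiguous array before the next.
double sum_column_wise(const ColumnStore& store) {
    double sum = 0.0;
    for (const auto& column : store.columns)
        for (double value : column) sum += value;
    return sum;
}
```

Both functions return the same total; only the order of memory accesses differs, which is what cache-miss counters such as those from Cachegrind would pick up on large datasets.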
  • 29. Faster fitting: single core improvements. RooLinkedList::findArg: ~5% of memory access instructions. RooLinkedList::At took considerable time in Gaussian test fit (Vince); std::vector lookup → 1.6x speedup! (WIP)
  • 30. Faster fitting: future work. Reorder tree evaluation → CPU cache use, vectorization. Smarter fitting (stochastic minimizer, analytical gradient, CLAD). Front-end / back-end separation (e.g. TensorFlow back-end).
  • 31. Faster fitting: single core profiling meta-conclusions profiling functions & classes valgrind gprof Instruments … etc. profiling objects (e.g. call-trees, e.g. RooFit…) … DIY?
  • 33. Parallel likelihood fits: existing RooFit implementation details RooRealMPFE / BidirMMapPipe Custom multi-process message passing protocol • POSIX fork, pipe, mmap Communication “overhead” (delay between sending and receiving messages): ~ 1e-4 seconds • serverLoop waits for message & runs server-side code • messages used sparingly • data transfer over memory-mapped pipes
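The three POSIX ingredients named here can be combined in a few lines. This is a sketch under stated assumptions, not the BidirMMapPipe protocol: a shared anonymous `mmap` region carries the data, a pipe carries a one-byte "done" message, and the function name `square_in_child` is hypothetical. `MAP_ANONYMOUS` is available on Linux and macOS but is not part of the POSIX base specification.

```cpp
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstddef>
#include <vector>

// Squares x in a forked child process. Results travel through shared
// memory; the pipe only signals that they are ready.
std::vector<double> square_in_child(const std::vector<double>& x) {
    if (x.empty()) return {};
    const std::size_t bytes = x.size() * sizeof(double);
    void* mem = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return {};
    double* shared = static_cast<double*>(mem);

    int done[2];  // done[0]: read end (parent), done[1]: write end (child)
    if (pipe(done) != 0) { munmap(mem, bytes); return {}; }

    const pid_t pid = fork();
    if (pid == 0) {
        // Child ("server" side): compute into shared memory, then notify.
        close(done[0]);
        for (std::size_t i = 0; i < x.size(); ++i) shared[i] = x[i] * x[i];
        const char ok = 1;
        _exit(write(done[1], &ok, 1) == 1 ? 0 : 1);
    }

    // Parent: close its copy of the write end so read() cannot block forever
    // if the child exits without sending the message.
    close(done[1]);
    std::vector<double> result;
    char ok = 0;
    if (pid > 0 && read(done[0], &ok, 1) == 1) result.assign(shared, shared + x.size());
    if (pid > 0) waitpid(pid, nullptr, 0);
    close(done[0]);
    munmap(mem, bytes);
    return result;
}
```

Keeping the message to a single byte while the payload stays in shared memory mirrors the "messages used sparingly" point: the per-message delay is paid once per task, not once per value.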
  • 34. TensorFlow experiments. Fits on identical model & data (single i7 machine). Timings, RooFit (MINUIT) vs TensorFlow (BFGS): unbinned fit 0.1s vs 0.01 - 0.1s (dep. on precision); binned fit 0.7ms vs 2.3ms. TensorFlow: no pre-calculation / caching! This is a major advantage of RooFit for binned fits (e.g. morphing histograms) (feature request for memoization: https://github.com/tensorflow/tensorflow/issues/5323). N.B.: measured before CPU affinity fixing; RooFit is now even faster (but limited to running on one machine).
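What pre-calculation / caching buys can be shown with a minimal memoized component. `CachedComponent` is a hypothetical class, not RooFit's caching machinery: it recomputes its value only when its parameter has actually changed, and counts how often the expensive function runs.

```cpp
#include <cmath>

// A model component whose value is recomputed only after its parameter changes.
class CachedComponent {
public:
    void set_parameter(double value) {
        if (value != parameter_) {
            parameter_ = value;
            dirty_ = true;  // cached value is now stale
        }
    }

    double value() {
        if (dirty_) {
            cache_ = expensive(parameter_);
            dirty_ = false;
            ++n_evaluations_;
        }
        return cache_;
    }

    int n_evaluations() const { return n_evaluations_; }

private:
    // Stand-in for a costly calculation such as a normalization integral.
    static double expensive(double p) { return std::exp(-0.5 * p * p); }

    double parameter_ = 0.0;
    double cache_ = 0.0;
    bool dirty_ = true;
    int n_evaluations_ = 0;
};
```

During a fit, a minimizer typically changes one or a few parameters per step, so components that depend only on unchanged parameters can return their cached value instead of being recomputed.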