SlideShare a Scribd company logo
1 of 34
Download to read offline
Making fitting in RooFit faster
Automated Parallel Computation of Collaborative Statistical Models
Patrick Bos
Sarajevo, 10 Sep 2018
Automated Parallel Computation of Collaborative Statistical Models
Physics: Wouter Verkerke (PI), Vince Croft, Carsten Burgard
eScience: Patrick Bos (yours truly), Inti Pelupessy, Jisk Attema
RooFit: Collaborative Statistical Modeling
Collaborative Statistical Modeling
• RooFit: build models together
• Teams 10-100 physicists
• Collaborations ~3000
à ~100 teams
• 1 goal
• Pretty impressive to an
outsider
Collaborative Statistical Modeling with RooFit
Making RooFit faster (~30x; ~h à ~m)
• More efficient collaboration
• Faster iteration/debugging
• Faster feedback between teams
• Next level physics modeling ambitions,
retaining interactive workflow
1. Complex likelihood models, e.g.
a) Higgs fit to all channels, ~200 datasets, O(1000)
parameter, now O(few) hours
b) EFT framework: again 10-100x more expensive
2. Unbinned ML fits with very large data samples
3. Unbinned ML fits with MC-style numeric integrals
Higgs @ ATLAS
20k+ nodes, 125k hours
Expression tree of C++ objects for mathematical components (variables,
operators, functions, integrals, datasets, etc.)
Couple with data, event “observables”
Goals and Design: Make fitting in RooFit faster
Making fitting in RooFit faster: how?
Serial:
benchmarks show no obvious bottlenecks
RooFit already highly optimized (pre-calculation/memoization, MPFE)
Parallel
Faster fitting: (how) can we do it?
Levels of parallelism
1. Gradient (parameter partial
derivatives) in minimizer
2. Likelihood
3. Integrals (normalization) &
other expensive shared
components
likelihood:
events
likelihood:
(unequal)
components
integrals etc.
“Vector”
Faster fitting: (how) can we do it?
Heterogeneous: sizes, types
• Multiple strategies
• How to split up?
• Small components à need low
latency/overhead
• Large components as well…
• How to divide over cores?
• Load balancing à task-based
approach: work stealing
likelihood:
events
likelihood:
(unequal)
components
integrals etc.
Design: MultiProcess task-stealing framework
Task-stealing, worker pool, executes Job tasks
No threads, process-based: “bipe”
(BidirMMapPipe) handles fork, mmap, pipes
Master Queue
Worker 1
Worker 2
...
bipesbipe
Master: main RooFit process, submits Jobs to queue, waits for results (or does other things in between)
Worker requests
Job task
Queue pops task
Worker executes
task
Worker sends
result Queue
... repeat ...
Job done: Queue
sends to Master
on request
worker loop:
queue loop: act on input from Master or Workers (mainly to avoid loop in Master / user code)
template <class T> class MP::Vector :
public T, public MP::Job
Parallelized
class
MP::Vector
MP::Job
MP::TaskManager
Serial class
likelihood, gradient..
MultiProcess usage for devs
template <class T> class MP::Vector : public T, public MP::Job
class Parallel : public MP:Vector<Serial>
Parallelized
class
MP::Vector
MP::Job
MP::TaskManager
Serial class
MultiProcess usage for devs
class xSquaredSerial {
public:
xSquaredSerial(vector<double> x_init)
: x(move(x_init))
, result(x.size()) {}
virtual void evaluate() {
for (size_t ix = 0; ix < x.size(); ++ix) {
x_squared[ix] = x[ix] * x[ix];
}
}
vector<double> get_result() {
evaluate();
return x_squared;
}
protected:
vector<double> x;
vector<double> x_squared;
};
class xSquaredParallel
: public RooFit::MultiProcess::Vector<xSquaredSerial> {
public:
xSquaredParallel(size_t N_workers, vector<double> x_init) :
RooFit::MultiProcess::Vector<xSquaredSerial>(N_workers, x_init)
{}
private:
void evaluate_task(size_t task) override {
result[task] = x[task] * x[task];
}
public:
void evaluate() override {
if (get_manager()->is_master()) {
// do necessary synchronization before work_mode
// enable work mode: workers will start stealing work from queue
get_manager()->set_work_mode(true);
// master fills queue with tasks
for (size_t task_id = 0; task_id < x.size(); ++task_id) {
get_manager()->to_queue(JobTask(id, task_id));
}
// wait for task results back from workers to master
gather_worker_results();
// end work mode
get_manager()->set_work_mode(false);
// put gathered results in desired container (same as used in serial class)
for (size_t task_id = 0; task_id < x.size(); ++task_id) {
x_squared[task_id] = results[task_id];
}
}
}
};
template <class T> class MP::Vector : public T, public MP::Job
MultiProcess for users
vector<double> x {1, 4, 5, 6.48074};
xSquaredSerial xsq_serial(x);
size_t N_workers = 4;
xSquaredParallel xsq_parallel(N_workers, x);
// get the same results, but now faster:
xsq_serial.get_result();
xsq_parallel.get_result();
// use parallelized version in your existing functions
void some_function(xSquaredSerial* xsq);
some_function(&xsq_parallel); // no problem!
Parallel performance (MPFE & MP)
Likelihood fits (unbinned, binned)
Numerical integrals
Gradients
Parallel likelihood fits: unbinned, MPFE
Before: max ~2x
Now (with CPU affinity fixed):
max ~20x (more for larger fits)
Run-time vs N(cores)
Actual performance
Expected
performance (ideal
parallelization)
Parallel likelihood fits: binned
Run-time vs N(cores) in binned fits
Actual performance
Expected
performance (ideal
parallelization)
CPU time (single core)
Room for
improvement
WIP
Gradient parallelization
0th step: get Minuit to use external derivative
1st step: replicate Minuit2 behavior
• NumericalDerivator (Lorenzo)
• Modified to exactly (floating point bit-wise) replicate Minuit2
• à RooGradMinimizer
2nd step: calculate partial derivative for each parameter in
parallel
Gradient parallelization
First benchmarks (yesterday):
ggF workspace (Carsten), migrad fit
scaling not perfect and erratic (+/- 5s)
similar as we saw for likelihoods without CPU pinning
probably due to too much synchronization
RooMinimizer MultiProcess GradMinimizer
- 1 worker 2 workers 3 workers 4 workers 6 workers 8 workers
28s 33s 20s 15s 14s 17s (…) 11s
Let’s stay in touch
+31 (0)6 10 79 58 74
p.bos@esciencecenter.nl
www.esciencecenter.nl
egpbos
linkedin.com/in/egpbos
blog.esciencecenter.nl
Encore
Future work
Load balancing
PDF timings change dynamically due to RooFit precalculation strategies
… not a problem for numerical integrals
Analytical derivatives (automated? CLAD)
Numerical integrals
“Analytical” integrals
Forced numerical (Monte
Carlo) integrals
(Higgs fits didn’t have them)
Numerical integrals
Maxima
Individual NI timings
(variation in runs and iterations)
Minima
Sum of slowest integrals/cores
per iteration over the entire run
(single core total runtime: 3.2s)
Faster fitting: MultiProcess design
RooFit::MultiProcess::Vector<YourSerialClass>
Serial class: likelihood (e.g. RooNLLVar) or gradient (Minuit)
Interface: subclass + MP
Define ”vector elements”
Group elements into tasks (to be executed in parallel)
RooFit::MultiProcess::SharedArg<T>
RooFit::MultiProcess::TaskManager
Faster fitting: MultiProcess design
RooFit::MultiProcess::Vector<YourSerialClass>
RooFit::MultiProcess::SharedArg<T>
Normalization integrals or other shared expensive objects
Parallel task definition specific to type of object
… design in progress
RooFit::MultiProcess::TaskManager
Faster fitting: MultiProcess design
RooFit::MultiProcess::Vector<YourSerialClass>
RooFit::MultiProcess::SharedArg<T>
RooFit::MultiProcess::TaskManager
Queue gathers tasks and communicates with worker pool
Workers steal tasks from queue
Worker pool: forked processes (BidirMMapPipe)
• performant and already used in RooFit
• no thread-safety concerns
• instead: communication concerns
• … flexible design, implementation can be replaced (e.g. TBB)
Single core profiling and improvements
Faster fitting: single core profiling with Callgrind, Cachegrind, Instruments !
Higgs ggf & 9 channel fits (workspaces by Lydia Brenner)
Most time spent on:
1. Memory access à RooVectorDataStore::get() (4% / 32%), 0.3% LL
cache misses (expensive!)
• Row-wise access pattern on column-wise data store (and std::vector<std::vector>)
2. Logarithms: 12%
3. Interpolation à RooStats::HistFactory::FlexibleInterpVar (10%)
Faster fitting: single core improvements
RooLinkedList::findArg: ~ 5% of memory access instructions
RooLinkedList::At took considerable time in Gaussian test fit (Vince)
std::vector lookup à 1.6x speedup! WIP
Faster fitting: future work
Reorder tree evaluation à CPU cache use, vectorization
Smarter fitting (stochastic minimizer, analytical gradient, CLAD)
Front-end / back-end separation (e.g. TensorFlow back-end)
Faster fitting: single core profiling meta-conclusions
profiling functions & classes
valgrind
gprof
Instruments
… etc.
profiling objects (e.g. call-trees, e.g. RooFit…)
… DIY?
More Multi-Core
Parallel likelihood fits: existing RooFit implementation details
RooRealMPFE / BidirMMapPipe
Custom multi-process message passing protocol
• POSIX fork, pipe, mmap
Communication “overhead” (delay between sending and receiving
messages): ~ 1e-4 seconds
• serverLoop waits for message & runs server-side code
• messages used sparingly
• data transfer over memory-mapped pipes
TensorFlow experiments
Fits on identical model & data (single i7 machine)
TensorFlow: No pre-calculation / caching!
Major advantage of RooFit for binned fits (e.g. morphing histograms)
(feature request for memoization https://github.com/tensorflow/tensorflow/issues/5323)
N.B.: measured before CPU affinity fixing
RooFit now even faster (but limited to running one machine)
RooFit (MINUIT) TensorFlow (BFGS)
Unbinned fit 0.1s 0.01 - 0.1s (dep. on precision)
Binned fit 0.7ms 2.3ms

More Related Content

What's hot

Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016
Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016
Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016MLconf
 
Braxton McKee, CEO & Founder, Ufora at MLconf NYC - 4/15/16
Braxton McKee, CEO & Founder, Ufora at MLconf NYC - 4/15/16Braxton McKee, CEO & Founder, Ufora at MLconf NYC - 4/15/16
Braxton McKee, CEO & Founder, Ufora at MLconf NYC - 4/15/16MLconf
 
Apache Storm and twitter Streaming API integration
Apache Storm and twitter Streaming API integrationApache Storm and twitter Streaming API integration
Apache Storm and twitter Streaming API integrationUday Vakalapudi
 
Swift for TensorFlow - Tanmay Bakshi - Advanced Spark and TensorFlow Meetup -...
Swift for TensorFlow - Tanmay Bakshi - Advanced Spark and TensorFlow Meetup -...Swift for TensorFlow - Tanmay Bakshi - Advanced Spark and TensorFlow Meetup -...
Swift for TensorFlow - Tanmay Bakshi - Advanced Spark and TensorFlow Meetup -...Chris Fregly
 
Introduction to Deep Learning with Python
Introduction to Deep Learning with PythonIntroduction to Deep Learning with Python
Introduction to Deep Learning with Pythonindico data
 
Buzzwords Numba Presentation
Buzzwords Numba PresentationBuzzwords Numba Presentation
Buzzwords Numba Presentationkammeyer
 
An Introduction to TensorFlow architecture
An Introduction to TensorFlow architectureAn Introduction to TensorFlow architecture
An Introduction to TensorFlow architectureMani Goswami
 
Numba: Array-oriented Python Compiler for NumPy
Numba: Array-oriented Python Compiler for NumPyNumba: Array-oriented Python Compiler for NumPy
Numba: Array-oriented Python Compiler for NumPyTravis Oliphant
 
DIY Deep Learning with Caffe Workshop
DIY Deep Learning with Caffe WorkshopDIY Deep Learning with Caffe Workshop
DIY Deep Learning with Caffe Workshopodsc
 
Data Science at the Command Line
Data Science at the Command LineData Science at the Command Line
Data Science at the Command LineHéloïse Nonne
 
Numba: Flexible analytics written in Python with machine-code speeds and avo...
Numba:  Flexible analytics written in Python with machine-code speeds and avo...Numba:  Flexible analytics written in Python with machine-code speeds and avo...
Numba: Flexible analytics written in Python with machine-code speeds and avo...PyData
 
Braxton McKee, Founder & CEO, Ufora at MLconf SF - 11/13/15
Braxton McKee, Founder & CEO, Ufora at MLconf SF - 11/13/15Braxton McKee, Founder & CEO, Ufora at MLconf SF - 11/13/15
Braxton McKee, Founder & CEO, Ufora at MLconf SF - 11/13/15MLconf
 
190111 tf2 preview_jwkang_pub
190111 tf2 preview_jwkang_pub190111 tf2 preview_jwkang_pub
190111 tf2 preview_jwkang_pubJaewook. Kang
 
SciPy 2019: How to Accelerate an Existing Codebase with Numba
SciPy 2019: How to Accelerate an Existing Codebase with NumbaSciPy 2019: How to Accelerate an Existing Codebase with Numba
SciPy 2019: How to Accelerate an Existing Codebase with Numbastan_seibert
 
The Flow of TensorFlow
The Flow of TensorFlowThe Flow of TensorFlow
The Flow of TensorFlowJeongkyu Shin
 
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...Chris Fregly
 

What's hot (20)

Numba
NumbaNumba
Numba
 
Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016
Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016
Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016
 
Braxton McKee, CEO & Founder, Ufora at MLconf NYC - 4/15/16
Braxton McKee, CEO & Founder, Ufora at MLconf NYC - 4/15/16Braxton McKee, CEO & Founder, Ufora at MLconf NYC - 4/15/16
Braxton McKee, CEO & Founder, Ufora at MLconf NYC - 4/15/16
 
Apache Storm and twitter Streaming API integration
Apache Storm and twitter Streaming API integrationApache Storm and twitter Streaming API integration
Apache Storm and twitter Streaming API integration
 
Swift for TensorFlow - Tanmay Bakshi - Advanced Spark and TensorFlow Meetup -...
Swift for TensorFlow - Tanmay Bakshi - Advanced Spark and TensorFlow Meetup -...Swift for TensorFlow - Tanmay Bakshi - Advanced Spark and TensorFlow Meetup -...
Swift for TensorFlow - Tanmay Bakshi - Advanced Spark and TensorFlow Meetup -...
 
Introduction to Deep Learning with Python
Introduction to Deep Learning with PythonIntroduction to Deep Learning with Python
Introduction to Deep Learning with Python
 
Buzzwords Numba Presentation
Buzzwords Numba PresentationBuzzwords Numba Presentation
Buzzwords Numba Presentation
 
Storm
StormStorm
Storm
 
An Introduction to TensorFlow architecture
An Introduction to TensorFlow architectureAn Introduction to TensorFlow architecture
An Introduction to TensorFlow architecture
 
Numba: Array-oriented Python Compiler for NumPy
Numba: Array-oriented Python Compiler for NumPyNumba: Array-oriented Python Compiler for NumPy
Numba: Array-oriented Python Compiler for NumPy
 
Numba Overview
Numba OverviewNumba Overview
Numba Overview
 
DIY Deep Learning with Caffe Workshop
DIY Deep Learning with Caffe WorkshopDIY Deep Learning with Caffe Workshop
DIY Deep Learning with Caffe Workshop
 
Data Science at the Command Line
Data Science at the Command LineData Science at the Command Line
Data Science at the Command Line
 
Numba: Flexible analytics written in Python with machine-code speeds and avo...
Numba:  Flexible analytics written in Python with machine-code speeds and avo...Numba:  Flexible analytics written in Python with machine-code speeds and avo...
Numba: Flexible analytics written in Python with machine-code speeds and avo...
 
Braxton McKee, Founder & CEO, Ufora at MLconf SF - 11/13/15
Braxton McKee, Founder & CEO, Ufora at MLconf SF - 11/13/15Braxton McKee, Founder & CEO, Ufora at MLconf SF - 11/13/15
Braxton McKee, Founder & CEO, Ufora at MLconf SF - 11/13/15
 
Introduction to Apache Storm
Introduction to Apache StormIntroduction to Apache Storm
Introduction to Apache Storm
 
190111 tf2 preview_jwkang_pub
190111 tf2 preview_jwkang_pub190111 tf2 preview_jwkang_pub
190111 tf2 preview_jwkang_pub
 
SciPy 2019: How to Accelerate an Existing Codebase with Numba
SciPy 2019: How to Accelerate an Existing Codebase with NumbaSciPy 2019: How to Accelerate an Existing Codebase with Numba
SciPy 2019: How to Accelerate an Existing Codebase with Numba
 
The Flow of TensorFlow
The Flow of TensorFlowThe Flow of TensorFlow
The Flow of TensorFlow
 
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...
 

Similar to Making fitting in RooFit faster

introduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pigintroduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and PigRicardo Varela
 
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016Holden Karau
 
Spark Summit EU talk by Ram Sriharsha and Vlad Feinberg
Spark Summit EU talk by Ram Sriharsha and Vlad FeinbergSpark Summit EU talk by Ram Sriharsha and Vlad Feinberg
Spark Summit EU talk by Ram Sriharsha and Vlad FeinbergSpark Summit
 
Online learning, Vowpal Wabbit and Hadoop
Online learning, Vowpal Wabbit and HadoopOnline learning, Vowpal Wabbit and Hadoop
Online learning, Vowpal Wabbit and HadoopHéloïse Nonne
 
OpenMP tasking model: from the standard to the classroom
OpenMP tasking model: from the standard to the classroomOpenMP tasking model: from the standard to the classroom
OpenMP tasking model: from the standard to the classroomFacultad de Informática UCM
 
Travis Oliphant "Python for Speed, Scale, and Science"
Travis Oliphant "Python for Speed, Scale, and Science"Travis Oliphant "Python for Speed, Scale, and Science"
Travis Oliphant "Python for Speed, Scale, and Science"Fwdays
 
Modern classification techniques
Modern classification techniquesModern classification techniques
Modern classification techniquesmark_landry
 
Go from a PHP Perspective
Go from a PHP PerspectiveGo from a PHP Perspective
Go from a PHP PerspectiveBarry Jones
 
Distributed computing and hyper-parameter tuning with Ray
Distributed computing and hyper-parameter tuning with RayDistributed computing and hyper-parameter tuning with Ray
Distributed computing and hyper-parameter tuning with RayJan Margeta
 
Role of python in hpc
Role of python in hpcRole of python in hpc
Role of python in hpcDr Reeja S R
 
Golang Performance : microbenchmarks, profilers, and a war story
Golang Performance : microbenchmarks, profilers, and a war storyGolang Performance : microbenchmarks, profilers, and a war story
Golang Performance : microbenchmarks, profilers, and a war storyAerospike
 
High Performance & High Throughput Computing - EUDAT Summer School (Giuseppe ...
High Performance & High Throughput Computing - EUDAT Summer School (Giuseppe ...High Performance & High Throughput Computing - EUDAT Summer School (Giuseppe ...
High Performance & High Throughput Computing - EUDAT Summer School (Giuseppe ...EUDAT
 
Spark Summit EU talk by Sameer Agarwal
Spark Summit EU talk by Sameer AgarwalSpark Summit EU talk by Sameer Agarwal
Spark Summit EU talk by Sameer AgarwalSpark Summit
 
On the Necessity and Inapplicability of Python
On the Necessity and Inapplicability of PythonOn the Necessity and Inapplicability of Python
On the Necessity and Inapplicability of PythonTakeshi Akutsu
 
On the necessity and inapplicability of python
On the necessity and inapplicability of pythonOn the necessity and inapplicability of python
On the necessity and inapplicability of pythonYung-Yu Chen
 
Online learning with structured streaming, spark summit brussels 2016
Online learning with structured streaming, spark summit brussels 2016Online learning with structured streaming, spark summit brussels 2016
Online learning with structured streaming, spark summit brussels 2016Ram Sriharsha
 
Resource-Efficient Deep Learning Model Selection on Apache Spark
Resource-Efficient Deep Learning Model Selection on Apache SparkResource-Efficient Deep Learning Model Selection on Apache Spark
Resource-Efficient Deep Learning Model Selection on Apache SparkDatabricks
 
From Hours to Minutes: The Journey of Optimizing Mask-RCNN and BERT Using MXNet
From Hours to Minutes: The Journey of Optimizing Mask-RCNN and BERT Using MXNetFrom Hours to Minutes: The Journey of Optimizing Mask-RCNN and BERT Using MXNet
From Hours to Minutes: The Journey of Optimizing Mask-RCNN and BERT Using MXNetEric Haibin Lin
 

Similar to Making fitting in RooFit faster (20)

introduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pigintroduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pig
 
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
 
Spark Summit EU talk by Ram Sriharsha and Vlad Feinberg
Spark Summit EU talk by Ram Sriharsha and Vlad FeinbergSpark Summit EU talk by Ram Sriharsha and Vlad Feinberg
Spark Summit EU talk by Ram Sriharsha and Vlad Feinberg
 
Online learning, Vowpal Wabbit and Hadoop
Online learning, Vowpal Wabbit and HadoopOnline learning, Vowpal Wabbit and Hadoop
Online learning, Vowpal Wabbit and Hadoop
 
OpenMP tasking model: from the standard to the classroom
OpenMP tasking model: from the standard to the classroomOpenMP tasking model: from the standard to the classroom
OpenMP tasking model: from the standard to the classroom
 
Travis Oliphant "Python for Speed, Scale, and Science"
Travis Oliphant "Python for Speed, Scale, and Science"Travis Oliphant "Python for Speed, Scale, and Science"
Travis Oliphant "Python for Speed, Scale, and Science"
 
Modern classification techniques
Modern classification techniquesModern classification techniques
Modern classification techniques
 
Go from a PHP Perspective
Go from a PHP PerspectiveGo from a PHP Perspective
Go from a PHP Perspective
 
Data Parallel Deep Learning
Data Parallel Deep LearningData Parallel Deep Learning
Data Parallel Deep Learning
 
Distributed computing and hyper-parameter tuning with Ray
Distributed computing and hyper-parameter tuning with RayDistributed computing and hyper-parameter tuning with Ray
Distributed computing and hyper-parameter tuning with Ray
 
Role of python in hpc
Role of python in hpcRole of python in hpc
Role of python in hpc
 
Golang Performance : microbenchmarks, profilers, and a war story
Golang Performance : microbenchmarks, profilers, and a war storyGolang Performance : microbenchmarks, profilers, and a war story
Golang Performance : microbenchmarks, profilers, and a war story
 
High Performance & High Throughput Computing - EUDAT Summer School (Giuseppe ...
High Performance & High Throughput Computing - EUDAT Summer School (Giuseppe ...High Performance & High Throughput Computing - EUDAT Summer School (Giuseppe ...
High Performance & High Throughput Computing - EUDAT Summer School (Giuseppe ...
 
Spark Summit EU talk by Sameer Agarwal
Spark Summit EU talk by Sameer AgarwalSpark Summit EU talk by Sameer Agarwal
Spark Summit EU talk by Sameer Agarwal
 
On the Necessity and Inapplicability of Python
On the Necessity and Inapplicability of PythonOn the Necessity and Inapplicability of Python
On the Necessity and Inapplicability of Python
 
On the necessity and inapplicability of python
On the necessity and inapplicability of pythonOn the necessity and inapplicability of python
On the necessity and inapplicability of python
 
Online learning with structured streaming, spark summit brussels 2016
Online learning with structured streaming, spark summit brussels 2016Online learning with structured streaming, spark summit brussels 2016
Online learning with structured streaming, spark summit brussels 2016
 
Resource-Efficient Deep Learning Model Selection on Apache Spark
Resource-Efficient Deep Learning Model Selection on Apache SparkResource-Efficient Deep Learning Model Selection on Apache Spark
Resource-Efficient Deep Learning Model Selection on Apache Spark
 
From Hours to Minutes: The Journey of Optimizing Mask-RCNN and BERT Using MXNet
From Hours to Minutes: The Journey of Optimizing Mask-RCNN and BERT Using MXNetFrom Hours to Minutes: The Journey of Optimizing Mask-RCNN and BERT Using MXNet
From Hours to Minutes: The Journey of Optimizing Mask-RCNN and BERT Using MXNet
 
Cliff sugerman
Cliff sugermanCliff sugerman
Cliff sugerman
 

Recently uploaded

OECD bibliometric indicators: Selected highlights, April 2024
OECD bibliometric indicators: Selected highlights, April 2024OECD bibliometric indicators: Selected highlights, April 2024
OECD bibliometric indicators: Selected highlights, April 2024innovationoecd
 
User Guide: Capricorn FLX™ Weather Station
User Guide: Capricorn FLX™ Weather StationUser Guide: Capricorn FLX™ Weather Station
User Guide: Capricorn FLX™ Weather StationColumbia Weather Systems
 
Environmental Biotechnology Topic:- Microbial Biosensor
Environmental Biotechnology Topic:- Microbial BiosensorEnvironmental Biotechnology Topic:- Microbial Biosensor
Environmental Biotechnology Topic:- Microbial Biosensorsonawaneprad
 
Neurodevelopmental disorders according to the dsm 5 tr
Neurodevelopmental disorders according to the dsm 5 trNeurodevelopmental disorders according to the dsm 5 tr
Neurodevelopmental disorders according to the dsm 5 trssuser06f238
 
Pests of castor_Binomics_Identification_Dr.UPR.pdf
Pests of castor_Binomics_Identification_Dr.UPR.pdfPests of castor_Binomics_Identification_Dr.UPR.pdf
Pests of castor_Binomics_Identification_Dr.UPR.pdfPirithiRaju
 
User Guide: Magellan MX™ Weather Station
User Guide: Magellan MX™ Weather StationUser Guide: Magellan MX™ Weather Station
User Guide: Magellan MX™ Weather StationColumbia Weather Systems
 
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubai
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In DubaiDubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubai
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubaikojalkojal131
 
Citronella presentation SlideShare mani upadhyay
Citronella presentation SlideShare mani upadhyayCitronella presentation SlideShare mani upadhyay
Citronella presentation SlideShare mani upadhyayupadhyaymani499
 
The dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptxThe dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptxEran Akiva Sinbar
 
Pests of safflower_Binomics_Identification_Dr.UPR.pdf
Pests of safflower_Binomics_Identification_Dr.UPR.pdfPests of safflower_Binomics_Identification_Dr.UPR.pdf
Pests of safflower_Binomics_Identification_Dr.UPR.pdfPirithiRaju
 
User Guide: Pulsar™ Weather Station (Columbia Weather Systems)
User Guide: Pulsar™ Weather Station (Columbia Weather Systems)User Guide: Pulsar™ Weather Station (Columbia Weather Systems)
User Guide: Pulsar™ Weather Station (Columbia Weather Systems)Columbia Weather Systems
 
《Queensland毕业文凭-昆士兰大学毕业证成绩单》
《Queensland毕业文凭-昆士兰大学毕业证成绩单》《Queensland毕业文凭-昆士兰大学毕业证成绩单》
《Queensland毕业文凭-昆士兰大学毕业证成绩单》rnrncn29
 
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptxLIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptxmalonesandreagweneth
 
Pests of Blackgram, greengram, cowpea_Dr.UPR.pdf
Pests of Blackgram, greengram, cowpea_Dr.UPR.pdfPests of Blackgram, greengram, cowpea_Dr.UPR.pdf
Pests of Blackgram, greengram, cowpea_Dr.UPR.pdfPirithiRaju
 
Harmful and Useful Microorganisms Presentation
Harmful and Useful Microorganisms PresentationHarmful and Useful Microorganisms Presentation
Harmful and Useful Microorganisms Presentationtahreemzahra82
 
Davis plaque method.pptx recombinant DNA technology
Davis plaque method.pptx recombinant DNA technologyDavis plaque method.pptx recombinant DNA technology
Davis plaque method.pptx recombinant DNA technologycaarthichand2003
 
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.PraveenaKalaiselvan1
 
FREE NURSING BUNDLE FOR NURSES.PDF by na
FREE NURSING BUNDLE FOR NURSES.PDF by naFREE NURSING BUNDLE FOR NURSES.PDF by na
FREE NURSING BUNDLE FOR NURSES.PDF by naJASISJULIANOELYNV
 
Microteaching on terms used in filtration .Pharmaceutical Engineering
Microteaching on terms used in filtration .Pharmaceutical EngineeringMicroteaching on terms used in filtration .Pharmaceutical Engineering
Microteaching on terms used in filtration .Pharmaceutical EngineeringPrajakta Shinde
 
Call Girls in Majnu Ka Tilla Delhi 🔝9711014705🔝 Genuine
Call Girls in Majnu Ka Tilla Delhi 🔝9711014705🔝 GenuineCall Girls in Majnu Ka Tilla Delhi 🔝9711014705🔝 Genuine
Call Girls in Majnu Ka Tilla Delhi 🔝9711014705🔝 Genuinethapagita
 

Recently uploaded (20)

OECD bibliometric indicators: Selected highlights, April 2024
OECD bibliometric indicators: Selected highlights, April 2024OECD bibliometric indicators: Selected highlights, April 2024
OECD bibliometric indicators: Selected highlights, April 2024
 
User Guide: Capricorn FLX™ Weather Station
User Guide: Capricorn FLX™ Weather StationUser Guide: Capricorn FLX™ Weather Station
User Guide: Capricorn FLX™ Weather Station
 
Environmental Biotechnology Topic:- Microbial Biosensor
Environmental Biotechnology Topic:- Microbial BiosensorEnvironmental Biotechnology Topic:- Microbial Biosensor
Environmental Biotechnology Topic:- Microbial Biosensor
 
Neurodevelopmental disorders according to the dsm 5 tr
Neurodevelopmental disorders according to the dsm 5 trNeurodevelopmental disorders according to the dsm 5 tr
Neurodevelopmental disorders according to the dsm 5 tr
 
Pests of castor_Binomics_Identification_Dr.UPR.pdf
Pests of castor_Binomics_Identification_Dr.UPR.pdfPests of castor_Binomics_Identification_Dr.UPR.pdf
Pests of castor_Binomics_Identification_Dr.UPR.pdf
 
User Guide: Magellan MX™ Weather Station
User Guide: Magellan MX™ Weather StationUser Guide: Magellan MX™ Weather Station
User Guide: Magellan MX™ Weather Station
 
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubai
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In DubaiDubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubai
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubai
 
Citronella presentation SlideShare mani upadhyay
Citronella presentation SlideShare mani upadhyayCitronella presentation SlideShare mani upadhyay
Citronella presentation SlideShare mani upadhyay
 
The dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptxThe dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptx
 
Pests of safflower_Binomics_Identification_Dr.UPR.pdf
Pests of safflower_Binomics_Identification_Dr.UPR.pdfPests of safflower_Binomics_Identification_Dr.UPR.pdf
Pests of safflower_Binomics_Identification_Dr.UPR.pdf
 
User Guide: Pulsar™ Weather Station (Columbia Weather Systems)
User Guide: Pulsar™ Weather Station (Columbia Weather Systems)User Guide: Pulsar™ Weather Station (Columbia Weather Systems)
User Guide: Pulsar™ Weather Station (Columbia Weather Systems)
 
《Queensland毕业文凭-昆士兰大学毕业证成绩单》
《Queensland毕业文凭-昆士兰大学毕业证成绩单》《Queensland毕业文凭-昆士兰大学毕业证成绩单》
《Queensland毕业文凭-昆士兰大学毕业证成绩单》
 
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptxLIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
 
Pests of Blackgram, greengram, cowpea_Dr.UPR.pdf
Pests of Blackgram, greengram, cowpea_Dr.UPR.pdfPests of Blackgram, greengram, cowpea_Dr.UPR.pdf
Pests of Blackgram, greengram, cowpea_Dr.UPR.pdf
 
Harmful and Useful Microorganisms Presentation
Harmful and Useful Microorganisms PresentationHarmful and Useful Microorganisms Presentation
Harmful and Useful Microorganisms Presentation
 
Davis plaque method.pptx recombinant DNA technology
Davis plaque method.pptx recombinant DNA technologyDavis plaque method.pptx recombinant DNA technology
Davis plaque method.pptx recombinant DNA technology
 
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
 
FREE NURSING BUNDLE FOR NURSES.PDF by na
FREE NURSING BUNDLE FOR NURSES.PDF by naFREE NURSING BUNDLE FOR NURSES.PDF by na
FREE NURSING BUNDLE FOR NURSES.PDF by na
 
Microteaching on terms used in filtration .Pharmaceutical Engineering
Microteaching on terms used in filtration .Pharmaceutical EngineeringMicroteaching on terms used in filtration .Pharmaceutical Engineering
Microteaching on terms used in filtration .Pharmaceutical Engineering
 
Call Girls in Majnu Ka Tilla Delhi 🔝9711014705🔝 Genuine
Call Girls in Majnu Ka Tilla Delhi 🔝9711014705🔝 GenuineCall Girls in Majnu Ka Tilla Delhi 🔝9711014705🔝 Genuine
Call Girls in Majnu Ka Tilla Delhi 🔝9711014705🔝 Genuine
 

Making fitting in RooFit faster

  • 1. Making fitting in RooFit faster Automated Parallel Computation of Collaborative Statistical Models Patrick Bos Sarajevo, 10 Sep 2018
  • 2. Automated Parallel Computation of Collaborative Statistical Models Physics: Wouter Verkerke (PI), Vince Croft, Carsten Burgard eScience: Patrick Bos (yours truly), Inti Pelupessy, Jisk Attema
  • 4. Collaborative Statistical Modeling • RooFit: build models together • Teams 10-100 physicists • Collaborations ~3000 à ~100 teams • 1 goal • Pretty impressive to an outsider
  • 5. Collaborative Statistical Modeling with RooFit Making RooFit faster (~30x; ~h à ~m) • More efficient collaboration • Faster iteration/debugging • Faster feedback between teams • Next level physics modeling ambitions, retaining interactive workflow 1. Complex likelihood models, e.g. a) Higgs fit to all channels, ~200 datasets, O(1000) parameter, now O(few) hours b) EFT framework: again 10-100x more expensive 2. Unbinned ML fits with very large data samples 3. Unbinned ML fits with MC-style numeric integrals Higgs @ ATLAS 20k+ nodes, 125k hours Expression tree of C++ objects for mathematical components (variables, operators, functions, integrals, datasets, etc.) Couple with data, event “observables”
  • 6. Goals and Design: Make fitting in RooFit faster
  • 7. Making fitting in RooFit faster: how? Serial: benchmarks show no obvious bottlenecks RooFit already highly optimized (pre-calculation/memoization, MPFE) Parallel
  • 8. Faster fitting: (how) can we do it? Levels of parallelism 1. Gradient (parameter partial derivatives) in minimizer 2. Likelihood 3. Integrals (normalization) & other expensive shared components likelihood: events likelihood: (unequal) components integrals etc. “Vector”
  • 9. Faster fitting: (how) can we do it? Heterogeneous: sizes, types • Multiple strategies • How to split up? • Small components à need low latency/overhead • Large components as well… • How to divide over cores? • Load balancing à task-based approach: work stealing likelihood: events likelihood: (unequal) components integrals etc.
  • 10. Design: MultiProcess task-stealing framework Task-stealing, worker pool, executes Job tasks No threads, process-based: “bipe” (BidirMMapPipe) handles fork, mmap, pipes Master Queue Worker 1 Worker 2 ... bipesbipe Master: main RooFit process, submits Jobs to queue, waits for results (or does other things in between) Worker requests Job task Queue pops task Worker executes task Worker sends result Queue ... repeat ... Job done: Queue sends to Master on request worker loop: queue loop: act on input from Master or Workers (mainly to avoid loop in Master / user code) template <class T> class MP::Vector : public T, public MP::Job Parallelized class MP::Vector MP::Job MP::TaskManager Serial class likelihood, gradient..
  • 11. MultiProcess usage for devs template <class T> class MP::Vector : public T, public MP::Job class Parallel : public MP:Vector<Serial> Parallelized class MP::Vector MP::Job MP::TaskManager Serial class
  • 12. MultiProcess usage for devs class xSquaredSerial { public: xSquaredSerial(vector<double> x_init) : x(move(x_init)) , result(x.size()) {} virtual void evaluate() { for (size_t ix = 0; ix < x.size(); ++ix) { x_squared[ix] = x[ix] * x[ix]; } } vector<double> get_result() { evaluate(); return x_squared; } protected: vector<double> x; vector<double> x_squared; }; class xSquaredParallel : public RooFit::MultiProcess::Vector<xSquaredSerial> { public: xSquaredParallel(size_t N_workers, vector<double> x_init) : RooFit::MultiProcess::Vector<xSquaredSerial>(N_workers, x_init) {} private: void evaluate_task(size_t task) override { result[task] = x[task] * x[task]; } public: void evaluate() override { if (get_manager()->is_master()) { // do necessary synchronization before work_mode // enable work mode: workers will start stealing work from queue get_manager()->set_work_mode(true); // master fills queue with tasks for (size_t task_id = 0; task_id < x.size(); ++task_id) { get_manager()->to_queue(JobTask(id, task_id)); } // wait for task results back from workers to master gather_worker_results(); // end work mode get_manager()->set_work_mode(false); // put gathered results in desired container (same as used in serial class) for (size_t task_id = 0; task_id < x.size(); ++task_id) { x_squared[task_id] = results[task_id]; } } } }; template <class T> class MP::Vector : public T, public MP::Job
  • 13. MultiProcess for users vector<double> x {1, 4, 5, 6.48074}; xSquaredSerial xsq_serial(x); size_t N_workers = 4; xSquaredParallel xsq_parallel(N_workers, x); // get the same results, but now faster: xsq_serial.get_result(); xsq_parallel.get_result(); // use parallelized version in your existing functions void some_function(xSquaredSerial* xsq); some_function(&xsq_parallel); // no problem!
  • 14. Parallel performance (MPFE & MP) Likelihood fits (unbinned, binned) Numerical integrals Gradients
  • 15. Parallel likelihood fits: unbinned, MPFE Before: max ~2x Now (with CPU affinity fixed): max ~20x (more for larger fits) Run-time vs N(cores) Actual performance Expected performance (ideal parallelization)
  • 16. Parallel likelihood fits: binned Run-time vs N(cores) in binned fits Actual performance Expected performance (ideal parallelization) CPU time (single core) Room for improvement WIP
  • 17. Gradient parallelization 0th step: get Minuit to use external derivative 1st step: replicate Minuit2 behavior • NumericalDerivator (Lorenzo) • Modified to exactly (floating point bit-wise) replicate Minuit2 • à RooGradMinimizer 2nd step: calculate partial derivative for each parameter in parallel
  • 18. Gradient parallelization First benchmarks (yesterday): ggF workspace (Carsten), migrad fit scaling not perfect and erratic (+/- 5s) similar as we saw for likelihoods without CPU pinning probably due to too much synchronization RooMinimizer MultiProcess GradMinimizer - 1 worker 2 workers 3 workers 4 workers 6 workers 8 workers 28s 33s 20s 15s 14s 17s (…) 11s
  • 19. Let’s stay in touch +31 (0)6 10 79 58 74 p.bos@esciencecenter.nl www.esciencecenter.nl egpbos linkedin.com/in/egpbos blog.esciencecenter.nl
  • 21. Future work Load balancing PDF timings change dynamically due to RooFit precalculation strategies … not a problem for numerical integrals Analytical derivatives (automated? CLAD)
  • 22. Numerical integrals “Analytical” integrals Forced numerical (Monte Carlo) integrals (Higgs fits didn’t have them)
  • 23. Numerical integrals Maxima Individual NI timings (variation in runs and iterations) Minima Sum of slowest integrals/cores per iteration over the entire run (single core total runtime: 3.2s)
  • 24. Faster fitting: MultiProcess design RooFit::MultiProcess::Vector<YourSerialClass> Serial class: likelihood (e.g. RooNLLVar) or gradient (Minuit) Interface: subclass + MP Define ”vector elements” Group elements into tasks (to be executed in parallel) RooFit::MultiProcess::SharedArg<T> RooFit::MultiProcess::TaskManager
  • 25. Faster fitting: MultiProcess design RooFit::MultiProcess::Vector<YourSerialClass> RooFit::MultiProcess::SharedArg<T> Normalization integrals or other shared expensive objects Parallel task definition specific to type of object … design in progress RooFit::MultiProcess::TaskManager
  • 26. Faster fitting: MultiProcess design RooFit::MultiProcess::Vector<YourSerialClass> RooFit::MultiProcess::SharedArg<T> RooFit::MultiProcess::TaskManager Queue gathers tasks and communicates with worker pool Workers steal tasks from queue Worker pool: forked processes (BidirMMapPipe) • performant and already used in RooFit • no thread-safety concerns • instead: communication concerns • … flexible design, implementation can be replaced (e.g. TBB)
  • 27. Single core profiling and improvements
  • 28. Faster fitting: single core profiling with Callgrind, Cachegrind, Instruments ! Higgs ggf & 9 channel fits (workspaces by Lydia Brenner) Most time spent on: 1. Memory access à RooVectorDataStore::get() (4% / 32%), 0.3% LL cache misses (expensive!) • Row-wise access pattern on column-wise data store (and std::vector<std::vector>) 2. Logarithms: 12% 3. Interpolation à RooStats::HistFactory::FlexibleInterpVar (10%)
  • 29. Faster fitting: single core improvements RooLinkedList::findArg: ~ 5% of memory access instructions RooLinkedList::At took considerable time in Gaussian test fit (Vince) std::vector lookup à 1.6x speedup! WIP
  • 30. Faster fitting: future work Reorder tree evaluation à CPU cache use, vectorization Smarter fitting (stochastic minimizer, analytical gradient, CLAD) Front-end / back-end separation (e.g. TensorFlow back-end)
  • 31. Faster fitting: single core profiling meta-conclusions profiling functions & classes valgrind gprof Instruments … etc. profiling objects (e.g. call-trees, e.g. RooFit…) … DIY?
  • 33. Parallel likelihood fits: existing RooFit implementation details RooRealMPFE / BidirMMapPipe Custom multi-process message passing protocol • POSIX fork, pipe, mmap Communication “overhead” (delay between sending and receiving messages): ~ 1e-4 seconds • serverLoop waits for message & runs server-side code • messages used sparingly • data transfer over memory-mapped pipes
  • 34. TensorFlow experiments Fits on identical model & data (single i7 machine) TensorFlow: No pre-calculation / caching! Major advantage of RooFit for binned fits (e.g. morphing histograms) (feature request for memoization https://github.com/tensorflow/tensorflow/issues/5323) N.B.: measured before CPU affinity fixing RooFit now even faster (but limited to running one machine) RooFit (MINUIT) TensorFlow (BFGS) Unbinned fit 0.1s 0.01 - 0.1s (dep. on precision) Binned fit 0.7ms 2.3ms