Non-trivial applications of
boosting
Tatiana Likhomanenko
Lund, MLHEP 2016
*many slides are taken from Alex Rogozhnikov’s presentations
Boosting recapitulation
2
Boosting combines weak learners to obtain a strong one
It is usually built over decision trees
State-of-the-art results in many areas
General-purpose implementations are used for classification and regression
Reweighting problem in HEP
Data/MC disagreement
4
Monte Carlo (MC) simulated samples are used for training and tuning a model
Afterwards, the trained model is applied to real data (RD)
Real data and Monte Carlo have different distributions
Thus, the trained model is biased (and its quality is overestimated on MC samples)
Distributions reweighting
5
Reweighting in HEP is used to minimize the
difference between RD and MC samples
The goal of reweighting: assign weights to MC s.t.
MC and RD distributions coincide
A known process is used, for which RD can be obtained (MC samples are also available)
MC distribution is original, RD distribution is target
Applications beyond physics
6
Introducing corrections to fight non-response bias: assigning higher weight to
answers from groups with low response.
See e.g. R. Kizilcec, "Reducing non-response bias with survey reweighting:
Applications for online learning researchers", 2014.
Typical approach: histogram reweighting
7
the variable(s) of interest is split into bins
in each bin the MC weight is multiplied by
$\text{multiplier}_\text{bin} = \dfrac{w_\text{bin, target}}{w_\text{bin, original}}$
where $w_\text{bin, target}$, $w_\text{bin, original}$ are the total weights of events in the bin for the target and original distributions
1. simple and fast
2. the number of variables is severely limited by statistics (typically only one or two)
3. reweighting in one variable may introduce disagreement in others
4. which variable is preferable for reweighting?
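A minimal 1D sketch of this multiplier in NumPy (array names are illustrative, assuming per-event values and weights for MC and real data):

```python
import numpy as np

def histogram_reweight(mc_values, mc_weights, rd_values, rd_weights, n_bins=20):
    """Multiply each MC weight by w_bin,target / w_bin,original of its bin (1D case)."""
    lo = min(mc_values.min(), rd_values.min())
    hi = max(mc_values.max(), rd_values.max())
    edges = np.linspace(lo, hi, n_bins + 1)
    w_original, _ = np.histogram(mc_values, bins=edges, weights=mc_weights)
    w_target, _ = np.histogram(rd_values, bins=edges, weights=rd_weights)
    # per-bin multiplier; leave the weight unchanged in bins with no MC events
    multiplier = np.where(w_original > 0, w_target / np.maximum(w_original, 1e-12), 1.0)
    bin_index = np.clip(np.digitize(mc_values, edges) - 1, 0, n_bins - 1)
    return mc_weights * multiplier[bin_index]
```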
Typical approach: example
8
Typical approach: example
9
Problems arise when there are too few events in a bin
This can be detected on a holdout (see the last row)
Issues:
1. few bins: the rule is too coarse
2. many bins: the rule is not reliable
Reweighting rule must be checked on a holdout!
Reweighting quality
10
How to check the quality of reweighting?
One-dimensional case: two-sample tests (Kolmogorov-Smirnov test, Mann-Whitney test, …)
Two or more dimensions?
Comparing 1D projections is not sufficient
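For the one-dimensional check, the two-sample Kolmogorov-Smirnov test is available in SciPy; a small sketch (the holdout arrays are hypothetical, and note that scipy's ks_2samp does not accept event weights):

```python
from scipy.stats import ks_2samp

# compare one variable between real data and (reweighted) MC on a holdout
statistic, p_value = ks_2samp(rd_holdout_variable, mc_holdout_variable)
print("KS statistic = %.4f, p-value = %.3f" % (statistic, p_value))
```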
Comparing nDim distributions using ML
11
Final goal: classifier doesn’t use data/MC disagreement information
= classifier cannot discriminate data and MC
Comparison of distributions should be done using ML:
train a classifier to discriminate data and MC
the output of the classifier is a one-dimensional variable
look at the ROC curve AUC (an alternative to a two-sample test) on a holdout
(it should be 0.5 if the classifier cannot discriminate data and MC)
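A sketch of this check with scikit-learn (the feature matrices mc and rd are hypothetical): label the samples, train a classifier, and inspect the ROC AUC on the held-out half.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X = np.vstack([mc, rd])                                  # MC and real-data feature matrices
y = np.hstack([np.zeros(len(mc)), np.ones(len(rd))])     # 0 = MC, 1 = RD
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

clf = GradientBoostingClassifier(n_estimators=100, max_depth=3)
clf.fit(X_train, y_train)
auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
print("holdout ROC AUC = %.3f (0.5 means indistinguishable)" % auc)
```

To check a reweighting, the MC event weights would additionally be passed via sample_weight to both fit and roc_auc_score.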
Density ratio estimation approach
12
We need to estimate the density ratio $\dfrac{f_\text{RD}(x)}{f_\text{MC}(x)}$

A classifier trained to discriminate MC and RD reconstructs the probabilities $p_\text{MC}(x)$ and $p_\text{RD}(x)$

For reweighting we can use $\dfrac{f_\text{RD}(x)}{f_\text{MC}(x)} \sim \dfrac{p_\text{RD}(x)}{p_\text{MC}(x)}$

1. The approach is able to reweight in many variables
2. It has been successfully tried in HEP, see D. Martschei et al., "Advanced event reweighting using multivariate analysis", 2012
3. The ratio is poorly reconstructed when it is very small or very large
4. It is slower than the histogram approach
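Continuing the previous sketch, the per-event weight for MC can be taken as the ratio of predicted probabilities (the clipping is a practical guard, reflecting point 3 above):

```python
# clf was trained with label 1 = RD, 0 = MC (see the previous sketch)
p_rd = clf.predict_proba(mc)[:, 1]          # estimate of p_RD(x) for each MC event
p_rd = np.clip(p_rd, 1e-4, 1 - 1e-4)        # avoid exploding weights where the ratio is extreme
new_mc_weights = p_rd / (1.0 - p_rd)        # ~ f_RD(x) / f_MC(x)
```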
…
13
Write an ML algorithm that solves the reweighting problem directly
Recall that in the histogram approach few bins are bad, and many bins are bad too.
What can we do?
Better idea…
Split the space of variables into several large regions
Find these regions 'intelligently'
Decision tree for reweighting
14
Write an ML algorithm that solves the reweighting problem directly:
A tree splits the space of variables with orthogonal cuts (each tree leaf is a region, or bin)
There are different criteria to construct a tree (MSE, Gini index, entropy, …)
Find the regions with the highest difference between the original and target distributions
Splitting criterion
15
Find regions with a high difference between the original and target distributions by maximizing the symmetrized $\chi^2$:

$$\chi^2 = \sum_\text{leaf} \frac{(w_\text{leaf, original} - w_\text{leaf, target})^2}{w_\text{leaf, original} + w_\text{leaf, target}}$$

A tree leaf may be considered as 'a bin'; $w_\text{leaf, original}$, $w_\text{leaf, target}$ are the total weights of events in a leaf for the original and target distributions.
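Written out directly (a small NumPy sketch with illustrative argument names, one entry per leaf):

```python
import numpy as np

def symmetrized_chi2(w_leaf_original, w_leaf_target):
    """Symmetrized chi^2 between original and target total weights, summed over leaves."""
    w_orig = np.asarray(w_leaf_original, dtype=float)
    w_targ = np.asarray(w_leaf_target, dtype=float)
    denom = np.maximum(w_orig + w_targ, 1e-12)      # guard against empty leaves
    return np.sum((w_orig - w_targ) ** 2 / denom)
```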
AdaBoost (Adaptive Boosting) recall
16
weak learners are built one by one, and their predictions are summed: $D(x) = \sum_j \alpha_j d_j(x)$
each time, increase the weights of events incorrectly classified by the tree $d(x)$:
$$w_i \leftarrow w_i \exp(-\alpha\, y_i d(x_i)), \quad y_i = \pm 1$$
main idea: provide the base estimator (weak learner) with information about which samples have higher importance
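A minimal sketch of one AdaBoost iteration under these formulas (the fitted tree and its ±1 predictions are assumed inputs):

```python
import numpy as np

def adaboost_step(weights, y, tree_predictions):
    """One boosting step: compute the tree coefficient alpha and update event weights."""
    err = np.sum(weights[tree_predictions != y]) / np.sum(weights)
    alpha = 0.5 * np.log((1.0 - err) / max(err, 1e-12))
    # misclassified events (y * d(x) = -1) get larger weights, correct ones get smaller
    new_weights = weights * np.exp(-alpha * y * tree_predictions)
    return new_weights / new_weights.sum(), alpha
```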
BDT reweighter
17
Repeat the following steps many times:
build a shallow tree to maximize the symmetrized $\chi^2$
compute predictions in the leaves: $\text{leaf pred} = \log \dfrac{w_\text{leaf, target}}{w_\text{leaf, original}}$
reweight the distributions (compare with AdaBoost):
$$w \leftarrow \begin{cases} w, & \text{if the event is from the target (RD) distribution} \\ w \cdot e^{\text{pred}}, & \text{if the event is from the original (MC) distribution} \end{cases}$$
Comparison with GBDT:
different tree splitting criterion
different boosting procedure
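This procedure is implemented in the hep_ml package as GBReweighter; a usage sketch (parameter values are indicative, and the DataFrames mc and rd with the chosen variables, plus the weight arrays, are placeholders):

```python
from hep_ml.reweight import GBReweighter

reweighter = GBReweighter(n_estimators=50, learning_rate=0.1,
                          max_depth=3, min_samples_leaf=1000)
reweighter.fit(original=mc, target=rd,
               original_weight=mc_weights, target_weight=rd_weights)
new_mc_weights = reweighter.predict_weights(mc)
```

The same module also provides BinsReweighter, the histogram baseline compared against on the following slides.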
BDT reweighter DEMO
before BDT reweighting | after BDT reweighting
18
KS for 1d projections
19
The bins reweighter uses only the last 2 variables (60 × 60 bins); the BDT reweighter uses all variables.
Comparing reweighting with ML
20
hep_ml library
21
Being a variation of GBDT, the BDT reweighter is able to calculate feature importances. The two features used in the bins reweighting are indeed the most important.
Summary
22
1. Comparison of multidimensional distributions is an ML problem
2. Reweighting of distributions is an ML problem
3. Check the reweighting rule on a holdout
BDT reweighter
uses a few large bins at each step (constructed intelligently)
is able to handle many variables
requires less data (for the same performance)
... but is slow (being an ML algorithm)
Boosting to uniformity
Uniformity
24
Uniformity means that we have constant efficiency (FPR/TPR) against some variable.
Applications:
trigger system (flight time): flat signal efficiency
particle identification (momentum): flat signal efficiency
rare decays (mass): flat background efficiency
Dalitz analysis (Dalitz variables): flat signal efficiency
Non-flatness along the mass
25
High correlation with the mass can create a false peaking signal from pure background (especially if we use mass sidebands for training)
Goal: FPR = const for different regions in mass
FPR = background efficiency
Basic approach
26
reduce the number of features used in training
leave only the features that do not give enough information to reconstruct the mass of the particle
simple and it works
sometimes we have to lose information
Can we modify ML to use all features, but provide uniform background efficiency (FPR) / signal efficiency (TPR) along the mass?
Gradient boosting recall
27
Gradient boosting greedily builds an ensemble of estimators $D(x) = \sum_j \alpha_j d_j(x)$ by optimizing some loss function. Those could be:
MSE: $L = \sum_i (y_i - D(x_i))^2$
AdaLoss: $L = \sum_i e^{-y_i D(x_i)}, \quad y_i = \pm 1$
LogLoss: $L = \sum_i \log(1 + e^{-y_i D(x_i)}), \quad y_i = \pm 1$
The next estimator in the series approximates the gradient of the loss in the space of functions
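The three losses written as functions of the ensemble output D(x_i) (a small NumPy sketch, with y_i = ±1 for the two classification losses):

```python
import numpy as np

def mse_loss(y, D):
    return np.sum((y - D) ** 2)

def ada_loss(y, D):                    # y in {-1, +1}
    return np.sum(np.exp(-y * D))

def log_loss(y, D):                    # y in {-1, +1}
    return np.sum(np.log1p(np.exp(-y * D)))
```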
uBoostBDT
28
Aims to get $\mathrm{FPR}_\text{region} = \text{const}$
Fix a target efficiency, for example $\mathrm{FPR}_\text{target} = 30\%$, and find the corresponding threshold
train a tree, its decision function is $d(x)$
increase the weights of misclassified events: $w_i \leftarrow w_i \exp(-\alpha\, y_i d(x_i)), \quad y_i = \pm 1$
increase the weights of background events in regions with high FPR: $w_i \leftarrow w_i \exp\left(\beta\,(\mathrm{FPR}_\text{region} - \mathrm{FPR}_\text{target})\right)$
This way we achieve $\mathrm{FPR}_\text{region} = 30\%$ in all regions, but only for this particular threshold and on the training dataset
uBoost
29
uBoost is an ensemble of uBoostBDTs; each uBoostBDT uses its own FPR_target (all possible FPRs with a step of 1%)
each uBoostBDT returns 0 or 1 (whether the event passed the threshold corresponding to its FPR_target); simple averaging is used to obtain predictions.
drives towards a uniform selection
very complex training
many trees
the estimation of the threshold in uBoostBDT may be biased
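uBoost is available in hep_ml; a usage sketch (names are indicative and assume X contains the mass column alongside the training features, with uniformity requested for the background, label 0):

```python
from hep_ml.uboost import uBoostClassifier

clf = uBoostClassifier(uniform_features=['mass'], uniform_label=0,
                       train_features=train_features,   # discriminating features only
                       n_estimators=50, efficiency_steps=20)
clf.fit(X, y)                                            # X must also contain 'mass'
probabilities = clf.predict_proba(X_test)[:, 1]
```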
Non-uniformity measure
30
A difference in the efficiency can be detected by analyzing distributions
uniformity = no dependence between the mass and the predictions
Uniform predictions
Non-uniform predictions (peak in the highlighted region)
Non-uniformity measure
31
Average the contributions (differences between the global and local distributions) from different regions in the mass: use for this the Cramer-von Mises measure (an integral characteristic)

$$\mathrm{CvM} = \sum_\text{region} \int |F_\text{region}(s) - F_\text{global}(s)|^2 \, dF_\text{global}(s)$$
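A direct (if slow) estimate of this measure from predictions and masses, with equal-statistics mass regions (illustrative sketch, unweighted events):

```python
import numpy as np

def cvm_nonuniformity(predictions, mass, n_regions=10):
    """Sum over mass regions of the integral |F_region - F_global|^2 dF_global."""
    preds_sorted = np.sort(predictions)
    F_global = np.arange(1, len(preds_sorted) + 1) / float(len(preds_sorted))
    edges = np.percentile(mass, np.linspace(0, 100, n_regions + 1))
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        region_preds = np.sort(predictions[(mass >= lo) & (mass <= hi)])
        if len(region_preds) == 0:
            continue
        # empirical CDF of the region, evaluated at every global prediction value
        F_region = np.searchsorted(region_preds, preds_sorted, side='right') / float(len(region_preds))
        total += np.mean((F_region - F_global) ** 2)    # Monte Carlo estimate of the dF_global integral
    return total
```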
Minimizing non-uniformity
32
why not minimize CvM as a loss function with GB?
… because we can't compute the gradient
ROC AUC and classification accuracy are not differentiable either
also, minimizing CvM does not address the classification problem:
the minimum of CvM is achieved, e.g., by a classifier with random predictions
Flatness loss (FL)
33
Put an additional term in the loss function which penalizes non-uniformity of predictions:

$$L = L_\text{adaloss} + \alpha L_\text{FL}$$

Flatness loss approximates the non-differentiable CvM measure:

$$L_\text{FL} = \sum_\text{region} \int |F_\text{region}(s) - F_\text{global}(s)|^2 \, ds$$

$$\frac{\partial}{\partial D(x_i)} L_\text{FL} \sim 2\,\bigl(F_\text{region}(s) - F_\text{global}(s)\bigr) \Big|_{s = D(x_i)}$$
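In hep_ml this corresponds to gradient boosting with a flatness loss; a usage sketch (names and parameter values are indicative; uniformity in 'mass' for the background class, label 0, with fl_coefficient roughly playing the role of the α above):

```python
from hep_ml.losses import BinFlatnessLossFunction
from hep_ml.gradientboosting import UGradientBoostingClassifier

loss = BinFlatnessLossFunction(uniform_features=['mass'], uniform_label=0,
                               n_bins=10, fl_coefficient=3.0)
clf = UGradientBoostingClassifier(loss=loss, train_features=train_features,
                                  n_estimators=200, max_depth=4, learning_rate=0.1)
clf.fit(X, y)        # X contains 'mass' in addition to the training features
```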
Rare decay analysis DEMO
34
when we train on a sideband vs MC using many features, we can easily run into problems (several features depend on the mass)
Rare decay analysis DEMO
35
all models use the same set of features for discrimination, but AdaBoost acquires a serious dependence on the mass

PID DEMO
36
Features strongly depend on the momentum and transverse
momentum. Both algorithms use the same set of features.

The MVA used is a specific BDT implementation with flatness loss.
Trigger DEMO
37
Both algorithms use the same set of features. The right
one is uGB+FL.
Dalitz analysis DEMO
38
The right one is the uBoost algorithm. The global efficiency is set to 70%.
hep_ml library
39
Summary
40
1. uBoost approach
2. Non-uniformity measure
3. uGB+FL approach: gradient boosting with flatness loss (FL)
uBoost, uGB+FL:
produce predictions that are flat along the chosen set of features
there is a trade-off between classification quality and uniformity
Boosting summary
41
powerful general-purpose algorithm
best-known applications: classification, regression and ranking
widely used, considered to be well-studied
can be adapted to different specific scientific problems
Thanks for attention
Likhomanenko Tatiana
researcher-developer
Contacts
antares@yandex-team.ru, tatiana.likhomanenko@cern.ch
