SlideShare a Scribd company logo
1 of 17
Download to read offline
Scikit-Learn in particle physics 
Gilles Louppe 
CERN, Switzerland 
November 18, 2014 
1 / 13
High Energy Physics (HEP) 
c CERN 
2 / 13
High Energy Physics (HEP) 
c CERN 
Study the nature of the 
constituents of matter 
2 / 13
High Energy Physics (HEP) 
c CERN 
Study the nature of the 
constituents of matter 
2 / 13
Particle detector 101 
c ATLAS CERN 
3 / 13
Particle detector 101 
c ATLAS CERN 
3 / 13
Data analysis tasks in detectors 
1 Track
nding 
Reconstruction of particle trajectories from hits in detectors 
2 Budgeted classi
cation 
Real-time classi
cation of events in triggers 
3 Classi
cation of signal / background events 
Oine statistical analysis for discovery of new particles 
4 / 13
The Kaggle Higgs Boson challenge (in HEP terms) 
 Data comes as a
nite set 
D = f(xi ; yi ;wi )ji = 0; : : : ;N  1g; 
where xi 2 Rd , yi 2 fsignal; backgroundg and wi 2 R+. 
 The goal is to
nd a region G = fxjg(x) = signalg  Rd , 
de
ned from a binary function g, for which the 
background-only hypothesis can be rejected at a strong 
signi
cance level (p = 2:87  107, i.e., 5 sigma). 
 Empirically, this is approximately equivalent to
nding g from 
D so as to maximize AMS  ps 
b 
, where 
s = 
P 
fi jyi =signal;g(xi)=signalg wi 
b = 
P 
fi jyi =background;g(xi)=signalg wi 
5 / 13

More Related Content

Viewers also liked

Bias-variance decomposition in Random Forests
Bias-variance decomposition in Random ForestsBias-variance decomposition in Random Forests
Bias-variance decomposition in Random ForestsGilles Louppe
 
Understanding Random Forests: From Theory to Practice
Understanding Random Forests: From Theory to PracticeUnderstanding Random Forests: From Theory to Practice
Understanding Random Forests: From Theory to PracticeGilles Louppe
 
Scikit-Learn - Or why I joined an open source software project
Scikit-Learn - Or why I joined an open source software projectScikit-Learn - Or why I joined an open source software project
Scikit-Learn - Or why I joined an open source software projectGilles Louppe
 
Understanding variable importances in forests of randomized trees
Understanding variable importances in forests of randomized treesUnderstanding variable importances in forests of randomized trees
Understanding variable importances in forests of randomized treesGilles Louppe
 
Tree models with Scikit-Learn: Great models with little assumptions
Tree models with Scikit-Learn: Great models with little assumptionsTree models with Scikit-Learn: Great models with little assumptions
Tree models with Scikit-Learn: Great models with little assumptionsGilles Louppe
 
Accelerating Random Forests in Scikit-Learn
Accelerating Random Forests in Scikit-LearnAccelerating Random Forests in Scikit-Learn
Accelerating Random Forests in Scikit-LearnGilles Louppe
 

Viewers also liked (6)

Bias-variance decomposition in Random Forests
Bias-variance decomposition in Random ForestsBias-variance decomposition in Random Forests
Bias-variance decomposition in Random Forests
 
Understanding Random Forests: From Theory to Practice
Understanding Random Forests: From Theory to PracticeUnderstanding Random Forests: From Theory to Practice
Understanding Random Forests: From Theory to Practice
 
Scikit-Learn - Or why I joined an open source software project
Scikit-Learn - Or why I joined an open source software projectScikit-Learn - Or why I joined an open source software project
Scikit-Learn - Or why I joined an open source software project
 
Understanding variable importances in forests of randomized trees
Understanding variable importances in forests of randomized treesUnderstanding variable importances in forests of randomized trees
Understanding variable importances in forests of randomized trees
 
Tree models with Scikit-Learn: Great models with little assumptions
Tree models with Scikit-Learn: Great models with little assumptionsTree models with Scikit-Learn: Great models with little assumptions
Tree models with Scikit-Learn: Great models with little assumptions
 
Accelerating Random Forests in Scikit-Learn
Accelerating Random Forests in Scikit-LearnAccelerating Random Forests in Scikit-Learn
Accelerating Random Forests in Scikit-Learn
 

Similar to Scikit-Learn in Particle Physics

Chap 8. Optimization for training deep models
Chap 8. Optimization for training deep modelsChap 8. Optimization for training deep models
Chap 8. Optimization for training deep modelsYoung-Geun Choi
 
Learning do discover: machine learning in high-energy physics
Learning do discover: machine learning in high-energy physicsLearning do discover: machine learning in high-energy physics
Learning do discover: machine learning in high-energy physicsBalázs Kégl
 
When The New Science Is In The Outliers
When The New Science Is In The OutliersWhen The New Science Is In The Outliers
When The New Science Is In The Outliersaimsnist
 
Engineering Data Science Objectives for Social Network Analysis
Engineering Data Science Objectives for Social Network AnalysisEngineering Data Science Objectives for Social Network Analysis
Engineering Data Science Objectives for Social Network AnalysisDavid Gleich
 
PAISS (PRAIRIE AI Summer School) Digest July 2018
PAISS (PRAIRIE AI Summer School) Digest July 2018 PAISS (PRAIRIE AI Summer School) Digest July 2018
PAISS (PRAIRIE AI Summer School) Digest July 2018 Natalia Díaz Rodríguez
 
New Rough Set Attribute Reduction Algorithm based on Grey Wolf Optimization
New Rough Set Attribute Reduction Algorithm based on Grey Wolf OptimizationNew Rough Set Attribute Reduction Algorithm based on Grey Wolf Optimization
New Rough Set Attribute Reduction Algorithm based on Grey Wolf OptimizationAboul Ella Hassanien
 
Introduction to LLMs
Introduction to LLMsIntroduction to LLMs
Introduction to LLMsLoic Merckel
 
Optimal transport driven CycleGAN for unsupervised learning in inverse problems
Optimal transport driven CycleGAN for unsupervised learning in inverse problemsOptimal transport driven CycleGAN for unsupervised learning in inverse problems
Optimal transport driven CycleGAN for unsupervised learning in inverse problemsByeongsuSim1
 
Big Data Analytics with Storm, Spark and GraphLab
Big Data Analytics with Storm, Spark and GraphLabBig Data Analytics with Storm, Spark and GraphLab
Big Data Analytics with Storm, Spark and GraphLabImpetus Technologies
 
Machine Learning: Foundations Course Number 0368403401
Machine Learning: Foundations Course Number 0368403401Machine Learning: Foundations Course Number 0368403401
Machine Learning: Foundations Course Number 0368403401butest
 
Individual Brain Charting: third-release dataset validation
Individual Brain Charting: third-release dataset validationIndividual Brain Charting: third-release dataset validation
Individual Brain Charting: third-release dataset validationAna Luísa Pinho
 
Enabling Congestion Control Using Homogeneous Archetypes
Enabling Congestion Control Using Homogeneous ArchetypesEnabling Congestion Control Using Homogeneous Archetypes
Enabling Congestion Control Using Homogeneous ArchetypesJames Johnson
 
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...inside-BigData.com
 
Error of Multileaf collimator prediction using recurrent neural network (LSTM)
Error of Multileaf collimator prediction using recurrent neural network (LSTM)Error of Multileaf collimator prediction using recurrent neural network (LSTM)
Error of Multileaf collimator prediction using recurrent neural network (LSTM)WonjoongCheon
 
[David a. coley]_an_introduction_to_genetic_algori(book_fi.org)
[David a. coley]_an_introduction_to_genetic_algori(book_fi.org)[David a. coley]_an_introduction_to_genetic_algori(book_fi.org)
[David a. coley]_an_introduction_to_genetic_algori(book_fi.org)swapnatoya
 
Asynchronous Stochastic Optimization, New Analysis and Algorithms
Asynchronous Stochastic Optimization, New Analysis and AlgorithmsAsynchronous Stochastic Optimization, New Analysis and Algorithms
Asynchronous Stochastic Optimization, New Analysis and AlgorithmsFabian Pedregosa
 
Tutorial of topological data analysis part 3(Mapper algorithm)
Tutorial of topological data analysis part 3(Mapper algorithm)Tutorial of topological data analysis part 3(Mapper algorithm)
Tutorial of topological data analysis part 3(Mapper algorithm)Ha Phuong
 
Global Grid of Grapes
Global Grid of GrapesGlobal Grid of Grapes
Global Grid of GrapesDerek Groen
 

Similar to Scikit-Learn in Particle Physics (20)

Chap 8. Optimization for training deep models
Chap 8. Optimization for training deep modelsChap 8. Optimization for training deep models
Chap 8. Optimization for training deep models
 
Learning do discover: machine learning in high-energy physics
Learning do discover: machine learning in high-energy physicsLearning do discover: machine learning in high-energy physics
Learning do discover: machine learning in high-energy physics
 
When The New Science Is In The Outliers
When The New Science Is In The OutliersWhen The New Science Is In The Outliers
When The New Science Is In The Outliers
 
Engineering Data Science Objectives for Social Network Analysis
Engineering Data Science Objectives for Social Network AnalysisEngineering Data Science Objectives for Social Network Analysis
Engineering Data Science Objectives for Social Network Analysis
 
PAISS (PRAIRIE AI Summer School) Digest July 2018
PAISS (PRAIRIE AI Summer School) Digest July 2018 PAISS (PRAIRIE AI Summer School) Digest July 2018
PAISS (PRAIRIE AI Summer School) Digest July 2018
 
New Rough Set Attribute Reduction Algorithm based on Grey Wolf Optimization
New Rough Set Attribute Reduction Algorithm based on Grey Wolf OptimizationNew Rough Set Attribute Reduction Algorithm based on Grey Wolf Optimization
New Rough Set Attribute Reduction Algorithm based on Grey Wolf Optimization
 
Introduction to LLMs
Introduction to LLMsIntroduction to LLMs
Introduction to LLMs
 
Optimal transport driven CycleGAN for unsupervised learning in inverse problems
Optimal transport driven CycleGAN for unsupervised learning in inverse problemsOptimal transport driven CycleGAN for unsupervised learning in inverse problems
Optimal transport driven CycleGAN for unsupervised learning in inverse problems
 
Big Data Analytics with Storm, Spark and GraphLab
Big Data Analytics with Storm, Spark and GraphLabBig Data Analytics with Storm, Spark and GraphLab
Big Data Analytics with Storm, Spark and GraphLab
 
Machine Learning: Foundations Course Number 0368403401
Machine Learning: Foundations Course Number 0368403401Machine Learning: Foundations Course Number 0368403401
Machine Learning: Foundations Course Number 0368403401
 
Individual Brain Charting: third-release dataset validation
Individual Brain Charting: third-release dataset validationIndividual Brain Charting: third-release dataset validation
Individual Brain Charting: third-release dataset validation
 
Enabling Congestion Control Using Homogeneous Archetypes
Enabling Congestion Control Using Homogeneous ArchetypesEnabling Congestion Control Using Homogeneous Archetypes
Enabling Congestion Control Using Homogeneous Archetypes
 
12_applications.pdf
12_applications.pdf12_applications.pdf
12_applications.pdf
 
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
 
Error of Multileaf collimator prediction using recurrent neural network (LSTM)
Error of Multileaf collimator prediction using recurrent neural network (LSTM)Error of Multileaf collimator prediction using recurrent neural network (LSTM)
Error of Multileaf collimator prediction using recurrent neural network (LSTM)
 
[David a. coley]_an_introduction_to_genetic_algori(book_fi.org)
[David a. coley]_an_introduction_to_genetic_algori(book_fi.org)[David a. coley]_an_introduction_to_genetic_algori(book_fi.org)
[David a. coley]_an_introduction_to_genetic_algori(book_fi.org)
 
Asynchronous Stochastic Optimization, New Analysis and Algorithms
Asynchronous Stochastic Optimization, New Analysis and AlgorithmsAsynchronous Stochastic Optimization, New Analysis and Algorithms
Asynchronous Stochastic Optimization, New Analysis and Algorithms
 
Tutorial of topological data analysis part 3(Mapper algorithm)
Tutorial of topological data analysis part 3(Mapper algorithm)Tutorial of topological data analysis part 3(Mapper algorithm)
Tutorial of topological data analysis part 3(Mapper algorithm)
 
Global Grid of Grapes
Global Grid of GrapesGlobal Grid of Grapes
Global Grid of Grapes
 
Machine Learning Tools and Particle Swarm Optimization for Content-Based Sear...
Machine Learning Tools and Particle Swarm Optimization for Content-Based Sear...Machine Learning Tools and Particle Swarm Optimization for Content-Based Sear...
Machine Learning Tools and Particle Swarm Optimization for Content-Based Sear...
 

Recently uploaded

08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 

Recently uploaded (20)

08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 

Scikit-Learn in Particle Physics

  • 1. Scikit-Learn in particle physics Gilles Louppe CERN, Switzerland November 18, 2014 1 / 13
  • 2. High Energy Physics (HEP) c CERN 2 / 13
  • 3. High Energy Physics (HEP) c CERN Study the nature of the constituents of matter 2 / 13
  • 4. High Energy Physics (HEP) c CERN Study the nature of the constituents of matter 2 / 13
  • 5. Particle detector 101 c ATLAS CERN 3 / 13
  • 6. Particle detector 101 c ATLAS CERN 3 / 13
  • 7. Data analysis tasks in detectors 1 Track
  • 8. nding Reconstruction of particle trajectories from hits in detectors 2 Budgeted classi
  • 10. cation of events in triggers 3 Classi
  • 11. cation of signal / background events Oine statistical analysis for discovery of new particles 4 / 13
  • 12. The Kaggle Higgs Boson challenge (in HEP terms) Data comes as a
  • 13. nite set D = f(xi ; yi ;wi )ji = 0; : : : ;N 1g; where xi 2 Rd , yi 2 fsignal; backgroundg and wi 2 R+. The goal is to
  • 14. nd a region G = fxjg(x) = signalg Rd , de
  • 15. ned from a binary function g, for which the background-only hypothesis can be rejected at a strong signi
  • 16. cance level (p = 2:87 107, i.e., 5 sigma). Empirically, this is approximately equivalent to
  • 17. nding g from D so as to maximize AMS ps b , where s = P fi jyi =signal;g(xi)=signalg wi b = P fi jyi =background;g(xi)=signalg wi 5 / 13
  • 18. The Kaggle Higgs Boson challenge (in ML terms) Find a binary classi
  • 19. er g : Rd7! fsignal; backgroundg maximizing the objective function AMS s p b ; where s is the weighted number of true positives b is the weighted number of false positives. 6 / 13
  • 20. Winning methods Ensembles of neural networks (1st and 3rd) ; Ensembles of regularized greedy forests (2nd) ; Boosting with regularization (XGBoost package). Most contestants dit not optimize AMS directly ; But chosed the prediction cut-o maximizing AMS in CV. 7 / 13
  • 21. Lessons learned (for machine learning) AMS is highly unstable, hence the need for Rigorous and stable cross-validation to avoid over
  • 22. tting. Ensembles to reduce variance ; Regularized base models. Support of samples weights wi in classi
  • 23. cation models was key for this challenge. Feature engineering hardly helped. (Because features already incorporated physics knowledge.) 8 / 13
  • 24. Lessons learned (for physicists) Domain knowledge hardly helped. Standard machine learning techniques, run on a single laptop, beat benchmarks without much eorts. Physicists started to realize that collaborations with machine learning experts is likely to be bene
  • 25. cial. I worked on the ATLAS experiment for over a decade [...] It is rather depressing to see how badly I scored. The
  • 26. nal results seem to reinforce the idea that the machine learning experience is vastly more important in a similar contest than the knowledge of particle physics. I think that many people underestimate the computers. It is probably the reason why ML experts and physicists should work together for
  • 29. c software in HEP ROOT and TMVA are standard data analysis tools in HEP. Surprisingly, this HEP software ecosystem proved to be rather limited and easily outperformed (at least in the context of the Kaggle challenge). 10 / 13
  • 30. Scikit-Learn in Particle Physics ? The main technical blocker for the larger adoption of Scikit-Learn in HEP remains the full support of sample weights throughout all essential modules. Since 0.16, weights are supported in all ensembles and in most metrics. Next step is to add support in grid search. In parallel, domain-speci
  • 31. c packages are getting traction ROOTpy, for bridging the gap between ROOT data format and NumPy ; lhcb trigger ml, implementing ML algorithms for HEP (mostly Boosting variants), on top of scikit-learn. 11 / 13
  • 32. Major blocker : social reasons ? The adoption of external solutions (e.g., the scienti
  • 33. c Python stack or Scikit-Learn) appears to be slow and dicult in the HEP community because of No ground-breaking added-value ; The learning curve of new tools ; Lack of understanding of non-HEP methods ; Isolation from the community ; Genuine ignorance. 12 / 13
  • 34. Conclusions Scikit-Learn has the potential to become an important tool in HEP. But we are not there yet [WIP]. Overall, both for data analysis and software aspects, this calls for a larger collaboration between data sciences and HEP. The process of attempting as a physicist to compete against ML experts has given us a new respect for a
  • 35. eld that (through ignorance) none of us held in as high esteem as we do now. 13 / 13
  • 36. c xkcd Questions ? 14 / 13