defined from a binary function g, for which the
background-only hypothesis can be rejected at a strong
significance level (p = 2.87 × 10⁻⁷, i.e., 5 sigma).
Empirically, this is approximately equivalent to finding g from
D so as to maximize

    AMS ≈ s / √b,

where

    s = Σ_{i : y_i = signal, g(x_i) = signal} w_i
    b = Σ_{i : y_i = background, g(x_i) = signal} w_i
The Kaggle Higgs Boson challenge (in ML terms)

Find a binary classifier

    g : R^d → {signal, background}

maximizing the objective function

    AMS ≈ s / √b,

where
s is the weighted number of true positives;
b is the weighted number of false positives.
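This objective can be computed directly from labels, predictions, and event weights. A minimal Python sketch (the helper name `ams` and the toy arrays are illustrative, not from the challenge code):

```python
import numpy as np

def ams(y_true, y_pred, weights):
    """Approximate median significance, AMS ~ s / sqrt(b), where s (b)
    is the weighted count of true (false) positives."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    weights = np.asarray(weights, dtype=float)
    s = weights[y_true & y_pred].sum()    # weighted true positives
    b = weights[~y_true & y_pred].sum()   # weighted false positives
    return s / np.sqrt(b)

# Toy example: True = signal, False = background.
y_true = [True, True, False, False, True]
y_pred = [True, False, True, False, True]
w = [2.0, 1.0, 4.0, 1.0, 2.0]
print(ams(y_true, y_pred, w))  # s = 4.0, b = 4.0 -> prints 2.0
```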
Winning methods
Ensembles of neural networks (1st and 3rd);
Ensembles of regularized greedy forests (2nd);
Boosting with regularization (XGBoost package).

Most contestants did not optimize AMS directly;
they instead chose the prediction cut-off maximizing AMS in cross-validation.
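That cut-off selection can be sketched as a threshold scan over held-out classifier scores, maximizing the approximate AMS = s / √b; the helper below is a hypothetical illustration, not any contestant's code:

```python
import numpy as np

def best_cutoff(scores, y_true, weights):
    """Scan every candidate threshold on validation scores and keep
    the one maximizing the approximate AMS = s / sqrt(b)."""
    scores = np.asarray(scores, dtype=float)
    y_true = np.asarray(y_true, dtype=bool)
    weights = np.asarray(weights, dtype=float)
    best_t, best_ams = None, -np.inf
    for t in np.unique(scores):
        pred = scores >= t
        s = weights[y_true & pred].sum()    # weighted true positives
        b = weights[~y_true & pred].sum()   # weighted false positives
        if b > 0 and s / np.sqrt(b) > best_ams:
            best_t, best_ams = t, s / np.sqrt(b)
    return best_t, best_ams

# Toy validation fold: the scan picks the cut keeping both signal
# events while rejecting one background event.
t, a = best_cutoff([0.1, 0.4, 0.35, 0.8],
                   [False, False, True, True],
                   [1.0, 1.0, 1.0, 1.0])
print(t, a)  # 0.35 2.0
```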
Lessons learned (for machine learning)

AMS is highly unstable, hence the need for:
rigorous and stable cross-validation to avoid overfitting;
ensembles to reduce variance;
regularized base models.

Support for sample weights w_i in classification models was
key for this challenge.

Feature engineering hardly helped (because the features already
incorporated physics knowledge).
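The sample-weight support mentioned above amounts to passing the per-event weights w_i into training; a minimal scikit-learn sketch on synthetic data (model choice and parameters are illustrative):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # synthetic "signal" label
w = rng.uniform(0.5, 2.0, size=200)       # per-event importance weights

clf = GradientBoostingClassifier(n_estimators=50, random_state=0)
clf.fit(X, y, sample_weight=w)            # weights enter the boosting loss
scores = clf.predict_proba(X)[:, 1]       # continuous signal score in [0, 1]
```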
Lessons learned (for physicists)

Domain knowledge hardly helped.
Standard machine learning techniques, run on a single laptop,
beat the benchmarks without much effort.
Physicists started to realize that collaborations with machine
learning experts are likely to be beneficial.
"I worked on the ATLAS experiment for over a decade [...] It is rather
depressing to see how badly I scored."

"The final results seem to reinforce the idea that the machine learning
experience is vastly more important in a similar contest than the
knowledge of particle physics. I think that many people underestimate the
computers. It is probably the reason why ML experts and physicists should
work together for [...]"
Scientific software in HEP
ROOT and TMVA are standard data analysis tools in HEP.
Surprisingly, this HEP software ecosystem proved to be rather
limited and easily outperformed (at least in the context of the
Kaggle challenge).
Scikit-Learn in Particle Physics?
The main technical blocker for the larger adoption of
Scikit-Learn in HEP remains the full support of sample
weights throughout all essential modules.
Since 0.16, weights are supported in all ensembles and in most metrics.
Next step is to add support in grid search.
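For example, most scikit-learn metrics accept a `sample_weight` argument, so weighted and unweighted evaluations can diverge when a few events carry large weights (the toy numbers below are illustrative):

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

y_true = np.array([0, 0, 1, 1])
scores = np.array([0.2, 0.6, 0.4, 0.9])
w = np.array([1.0, 3.0, 1.0, 1.0])   # one heavy background event

# The heavy background event is misclassified at a 0.5 cut,
# so the weighted scores fall well below the unweighted ones.
print(accuracy_score(y_true, scores > 0.5))                   # 0.5
print(accuracy_score(y_true, scores > 0.5, sample_weight=w))  # ~0.333
print(roc_auc_score(y_true, scores))                          # 0.75
print(roc_auc_score(y_true, scores, sample_weight=w))         # 0.625
```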
In parallel, domain-specific packages are gaining traction:
ROOTpy, for bridging the gap between the ROOT data format and NumPy;
lhcb_trigger_ml, implementing ML algorithms for HEP (mostly boosting
variants) on top of scikit-learn.
Major blocker: social reasons?

The adoption of external solutions (e.g., the scientific Python
stack or Scikit-Learn) appears to be slow and difficult in the HEP
community because of:
no ground-breaking added value;
the learning curve of new tools;
lack of understanding of non-HEP methods;
isolation from the community;
genuine ignorance.
Conclusions
Scikit-Learn has the potential to become an important tool in
HEP. But we are not there yet [WIP].
Overall, both for the data analysis and the software aspects, this calls
for a larger collaboration between data science and HEP.
"The process of attempting as a physicist to compete against ML
experts has given us a new respect for a field that (through
ignorance) none of us held in as high esteem as we do now."