High energy physics experiments, such as the currently running Large Hadron Collider (LHC) or the proposed future colliders (CEPC, CLIC, ILC, FCC), rely strongly on data science. From the four LHC experiments alone, the CERN Data Centre stores more than thirty petabytes of data per year, and over one hundred petabytes of data are archived permanently. Collider experiments are characterized not only by the vast amount of data, but also by the need for high-precision measurements and by an unfavorable signal-to-background ratio: tiny signals are buried under a huge pile of background events, with a ratio of one per million or less. In Higgs physics, studies with purely hadronic final states (jets) present a special challenge, because the lack of sharp tagging variables makes signal and background separation strenuous. The presentation gives an overview of the use of data science in Higgs boson physics at the future Circular Electron Positron Collider (CEPC) in China.
2. Mila Pandurovic Data Science Conference EUROPE 2023
Why colliders? High energy particle physics
On 8 October 2013, it was announced that Peter Higgs and François Englert would share the 2013 Nobel Prize in
Physics "for the theoretical discovery of a mechanism that contributes to our understanding of the origin of
mass of subatomic particles, and which recently was confirmed through the discovery of the predicted
fundamental particle, by the ATLAS and CMS experiments at CERN’s Large Hadron Collider".
DATA SCIENCE IN HIGH ENERGY PHYSICS
3. The discovery machine! But lots of data to handle
• The biggest of these experiments, ATLAS and
CMS, use multipurpose detectors to investigate
the largest range of physics possible.
The Large Hadron Collider (LHC) is the world’s largest and most powerful
particle accelerator. It consists of a 27-kilometre ring of superconducting magnets
with a number of accelerating structures to boost the energy of the particles
along the way.
Nine experiments at the Large Hadron Collider (LHC) use particle detectors to analyze
“zillions” of particles produced by collisions.
A Toroidal LHC ApparatuS (ATLAS)
The Compact Muon Solenoid (CMS)
11/20/2023
Particles collide in the Large Hadron Collider (LHC) detectors
approximately one billion times per second, generating about
one petabyte of collision data per second.
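The two rates quoted above can be combined into a rough per-collision data volume. The following is a minimal back-of-the-envelope sketch, assuming a decimal petabyte (10^15 bytes) and taking both figures at face value; the function name is illustrative only.

```python
# Back-of-the-envelope estimate from the two figures quoted on the slide.
COLLISIONS_PER_SECOND = 1e9   # "approximately one billion times per second"
BYTES_PER_SECOND = 1e15       # "about one petabyte per second" (decimal petabyte assumed)


def bytes_per_collision(rate_bytes: float, rate_collisions: float) -> float:
    """Average raw data volume attributed to a single collision."""
    return rate_bytes / rate_collisions


per_collision = bytes_per_collision(BYTES_PER_SECOND, COLLISIONS_PER_SECOND)
print(f"{per_collision / 1e6:.1f} MB per collision")
```

Under these assumptions the average comes out at about one megabyte per collision, which is why most of the data must be filtered before anything is written to storage.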
4. Taming the collision particle ZOO…..
Detectors consist of several subdetectors, each divided into tiny segments, for the most precise measurement of the decay position, energy, …
Millions of hits in every subdetector.
One petabyte of collision data per second, in nine of these detectors
DATA science on the run!
[Figure: detector drawing with dimension labels 15 m and 12 m]
5. Standard model holds - limits, limits, limits…
In 2020/2021 the CMS and ATLAS Collaborations each published their 1000th paper using LHC data
6. Precision!
the strength of the Strong Force …….And many many more
7. Where are we currently with LHC …..Already near the upgrade
8. STANDARD MODEL
Composite
Higgs
But let's start from the beginning! Open questions in particle physics
• Many unexplained phenomena….
• Dark/missing matter
• Matter/antimatter asymmetry
• Are neutrinos their own antiparticles?
• Why are there three generations of fermions?
• What is the origin of the hierarchy of fermion masses?
• Do forces unify? Is the proton (ordinary matter) stable?
• What about Dark Energy?
• What about the Strong CP problem? …
• “The Standard Model is scientists’ best guess at explaining the
universe,” … Don Lincoln
• Full mathematical quantum field theory
9. We test various theories to address the open questions of particle physics!
EVERY STEP IS DATA SCIENCE !!!
DESIGN PHASE
R&D future experiments Building phase
Running phase experiments
Monte Carlo event generation
R&D RESULTS: “feasibility studies” for the
specific collider/detector for revealing
processes of interest:
RELATIVE STATISTICAL PRECISION
Detector simulation (Geant4)
Event reconstruction
Analysis (Study)
Triggers / or triggerless ?
Reconstruction
Analysis
IF
REAL DATA RESULTS:
DISCOVERY!!
or
We keep searching….
10. MORE AND MORE DATA COMING
ENERGY!
• Theory leads the way: many theories try to
explain the observed phenomena
• But the one that we experimentalists confirm is the
one that is valid: the theory predictions must be
proved experimentally!
• In order to reach down to the unknown,
one must increase the energy
STATISTICS: LUMINOSITY!
• In order to reveal the processes of interest we need
Higgs !!!
11. FUTURE COLLIDER OPTIONS
TWO LINEAR OPTIONS
TWO CIRCULAR OPTIONS
CHINA
Japan
12. RESEARCH & DEVELOPMENT PHASE “FEASIBILITY STUDIES!”
Due to the complexity of the experiments, the life span from idea to commissioning and start of data taking is … long
….
13. We are after Higgs boson properties!
….Higgs boson couplings
Putting it all together… R&D phase for electron-positron colliders
• The new machine is going to be an electron-positron collider
• Three main pillars of the new machine's physics:
• The Higgs boson
• The top quark
• Beyond Standard Model physics
14. CEPC DATA FLOW
THEORY
15. How does it all work …. Coupling of the Higgs to the W boson
• Analyzed HZ fully hadronic decay, signal: Z→qq, H→WW*→qqqq
• BF(H126→WW) ~ 23.0%, BF(WW→qqqq) ~ 45.4%: signal ~ 10% of Higgs decays
• σ(HZ, Z→qq) ~ 143.39 fb: Higgs production cross section
• SIGNAL: σ(HZ, Z→qq, H→WW*→qqqq) ~ 16.12 fb
• Measurement of the relative branching fraction
• Signal signature: 6 central jets in the final state
• Goal of the analysis:
• Calculate the statistical potential for the determination of the specific Higgs couplings
• Verify the analysis strategy
Studied the statistical potential (the statistical error in this type
of measurement) of future electron-positron colliders (CLIC,
ILC, CEPC) for the measurement of the
coupling of the Higgs to the W boson!
If that is below the necessary theoretical limit, then this
measurement is “worthwhile”:
one more YES for planning the collider
[Diagram: H → WW* decay with quarks (q) in the final state]
16. … JETS
• A jet in high energy physics is a term that describes the
product of the fragmentation of objects that contain
“colored” objects: quarks or gluons
• The fragmentation of these objects has to obey the
property of QCD confinement: these particles cannot
exist freely, that is, only colorless states can exist
“unbound”.
• When an object containing color charge fragments,
each fragment carries away some of the color charge.
In order to obey confinement, these fragments create
other colored objects around them to form colorless
objects, so that a narrow cone of hadrons and other
particles is formed.
• Many jet algorithms: track by track, particle by particle
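As one concrete instance of a "particle by particle" jet algorithm, here is a minimal pure-Python sketch of exclusive sequential clustering. It assumes the Durham (e+e- kT) distance measure, d_ij = 2 min(E_i², E_j²)(1 − cos θ_ij), and simple four-vector addition when merging. The function names and the toy particles are illustrative. A real analysis would use a dedicated package rather than this O(n³) loop.

```python
import math


def cos_angle(a, b):
    """Cosine of the opening angle between the 3-momenta of a and b.

    Four-vectors are (E, px, py, pz); particles with zero momentum are not handled.
    """
    dot = a[1] * b[1] + a[2] * b[2] + a[3] * b[3]
    norm_a = math.sqrt(a[1] ** 2 + a[2] ** 2 + a[3] ** 2)
    norm_b = math.sqrt(b[1] ** 2 + b[2] ** 2 + b[3] ** 2)
    return dot / (norm_a * norm_b)


def durham_distance(a, b):
    """Assumed distance measure: 2 min(Ea^2, Eb^2) (1 - cos theta_ab)."""
    return 2.0 * min(a[0], b[0]) ** 2 * (1.0 - cos_angle(a, b))


def cluster_exclusive(particles, n_jets):
    """Merge the closest pair repeatedly until exactly n_jets objects remain."""
    jets = [tuple(p) for p in particles]
    while len(jets) > n_jets:
        i, j = min(
            ((i, j) for i in range(len(jets)) for j in range(i + 1, len(jets))),
            key=lambda ij: durham_distance(jets[ij[0]], jets[ij[1]]),
        )
        merged = tuple(x + y for x, y in zip(jets[i], jets[j]))  # add four-vectors
        jets = [jet for k, jet in enumerate(jets) if k not in (i, j)] + [merged]
    return jets


# Toy event (hypothetical numbers): two particles roughly along +x, two along -x.
example_particles = [
    (10.0, 10.0, 0.0, 0.0),
    (5.0, 4.9, 0.9, 0.0),
    (8.0, -8.0, 0.0, 0.0),
    (3.0, -2.9, 0.0, 0.5),
]
example_jets = cluster_exclusive(example_particles, 2)
```

Forcing the toy event into two jets groups the two forward and the two backward particles. Forcing an event into six jets works the same way with `n_jets=6`.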
Our analysis has six jets (q) in the final state that have to be reconstructed.
17. CEPC RECONSTRUCTION software
Arbor
Reconstruction of the ‘tree’-like structures of the shower development
TPC hits → trees, forest
Tree decomposition into track segments → merge → TRACKS
Track refitting → track parameters
Track reconstruction combinatorics, graphs, neural networks …
“Catch me if you can”
Event reconstruction is the process of
interpreting the electronic signals
produced in a high-energy physics (HEP)
experiment’s detector to determine what
original particles passed through the
detector and their characteristics.
18. On the top of physics case …. machine background
Working on it
severe machine background Now it looks easy…well… easier
After dedicated background removal
19. Analysis flow again DATA science in every bit
Future colliders: common standards
• High energy physicists “have agreed” to
work on common tools, most of which
are freeware developed
by particle physicists
• Working on Linux platforms:
Scientific Linux, Ubuntu, whatever..
• We built common computing resources,
the GRID, throughout the world for
computational purposes
• (not publicly available, only for the
computational purposes of particle
physics people)
“Fast Jet”: gathering tracks, making 6 jets
“LCFI vertex”
Finding the exact positions where the collision happened
Preselection
Multivariate analysis
Relative statistical precision!!!!
20. Statistics and combinatorics!
• Jet reconstruction: kT exclusive, particle flow with Arbor v3.1
• Jet formation: force events into 6 jets
Reconstruction of the Higgs, Z and W bosons
• The obtained jets are grouped into three pairs to form the
W, W* and Z bosons; from the WW* pair, the Higgs boson
• The combination which minimizes the χ² is chosen:
$$\chi^2 = \frac{(m_{ijmn} - m_H)^2}{\sigma_H^2} + \frac{(m_{kl} - m_Z)^2}{\sigma_Z^2} + \frac{(m_{ij} - m_W)^2}{\sigma_W^2}$$
• For $\sigma_{H,W,Z}$ the W.A. width was taken
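The jet pairing can be sketched as a brute-force search over all assignments of six jets to the W (jets i, j), the W* (jets m, n) and the Z (jets k, l): there are C(6,2) × C(4,2) = 90 of them. This is a minimal illustration, not the analysis code. The reference masses and the resolutions below are assumptions (the slide does not list numerical values), and the toy event is constructed by hand so that the correct pairing is known.

```python
import math
from itertools import combinations

# Assumed reference masses in GeV and placeholder resolutions (not from the slides).
M_W, M_Z, M_H = 80.4, 91.2, 125.0
SIGMA_W, SIGMA_Z, SIGMA_H = 5.0, 5.0, 5.0


def invariant_mass(jets):
    """Invariant mass of the summed four-vectors (E, px, py, pz)."""
    e, px, py, pz = (sum(j[k] for j in jets) for k in range(4))
    return math.sqrt(max(e * e - px * px - py * py - pz * pz, 0.0))


def chi2(jets, w_pair, wstar_pair, z_pair):
    """Chi-square of one assignment: Higgs, Z and W mass terms."""
    m_ij = invariant_mass([jets[k] for k in w_pair])
    m_kl = invariant_mass([jets[k] for k in z_pair])
    m_ijmn = invariant_mass([jets[k] for k in w_pair + wstar_pair])
    return (
        ((m_ijmn - M_H) / SIGMA_H) ** 2
        + ((m_kl - M_Z) / SIGMA_Z) ** 2
        + ((m_ij - M_W) / SIGMA_W) ** 2
    )


def best_pairing(jets):
    """Return ((chi2, w_pair, wstar_pair, z_pair), number of assignments tried)."""
    if len(jets) != 6:
        raise ValueError("exactly six jets are required")
    candidates = []
    indices = range(6)
    for w_pair in combinations(indices, 2):
        rest = [k for k in indices if k not in w_pair]
        for wstar_pair in combinations(rest, 2):
            z_pair = tuple(k for k in rest if k not in wstar_pair)
            candidates.append((chi2(jets, w_pair, wstar_pair, z_pair), w_pair, wstar_pair, z_pair))
    return min(candidates), len(candidates)


# Toy event: each boson decays at rest into two back-to-back massless jets.
toy_jets = [
    (45.6, 45.6, 0.0, 0.0), (45.6, -45.6, 0.0, 0.0),  # Z
    (40.2, 0.0, 40.2, 0.0), (40.2, 0.0, -40.2, 0.0),  # W
    (22.3, 0.0, 0.0, 22.3), (22.3, 0.0, 0.0, -22.3),  # W*
]
best, n_tried = best_pairing(toy_jets)
```

For the toy event the search recovers jets 2 and 3 as the W, jets 4 and 5 as the W*, and jets 0 and 1 as the Z.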
21. Signal and background samples
• Quark energy vs
sample                                          σ [fb]      # events / 5 ab⁻¹
qqh → qqWW* → qqqqqq                            16.12       80600
other Higgs decays (non qqh → qqWW* → qqqqqq)   127.27      636350
2f                                              49561.30    247806500
4f_ww_cuxx                                      3395.48     16977400
4f_ww_ccbs                                      5.74        28700
4f_ww_ccds                                      165.57      827850
4f_ww_uubd                                      0.05        250
4f_ww_uusd                                      165.94      829700
4f_Mix_udud                                     1570.40     7852000
4f_Mix_cscs                                     1568.94     7844700
4f_zz_utut                                      83.09       415450
4f_zz_dtdt                                      226.20      1131000
4f_zz_uu_notd                                   95.65       478250
4f_zz_cc_nots                                   96.04       480200
+..2fermion background, 6 fermion background…
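The event counts in the table follow from the cross sections as N = σ × L, with the integrated luminosity L = 5 ab⁻¹ = 5000 fb⁻¹. A minimal sketch of that arithmetic for three of the rows (the function and dictionary names are illustrative):

```python
FB_PER_AB = 1000.0  # 1 ab^-1 = 1000 fb^-1


def expected_events(cross_section_fb: float, luminosity_ab: float) -> float:
    """Expected number of produced events, N = sigma * L."""
    return cross_section_fb * luminosity_ab * FB_PER_AB


# Cross sections in fb copied from the table above.
samples = {
    "qqh -> qqWW* -> qqqqqq": 16.12,
    "other Higgs decays": 127.27,
    "2f": 49561.30,
}
for name, sigma_fb in samples.items():
    print(f"{name:26s} {expected_events(sigma_fb, 5.0):>14,.0f}")
```

The signal sample alone is outnumbered by the two-fermion background by more than three orders of magnitude before any selection.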
22. How do we find our Event … separation of signal and background
• Invariant masses: mHiggs, mZ, mW, mW*
• Number of particle flow objects: NPFO
• Visible energy: Evis
• The highest transverse momentum of a jet in the event: highestPtJet
• Transverse momentum of the Higgs boson: PtOfHiggsJets
• Event shape variables, jet transitions: y12, y23, y34, y45, y56, y67
• Force event into 2 jets: btag1, btag2, btag1*btag2
• ctag1, ctag2
• Force event into 6 jets: btagi, ctagi
• Angle between the jets that comprise the W boson: ThetaWqq
• Angle between the jets that comprise the Z boson: ThetaZqq
• Angle between the W and W* that comprise the Higgs boson: ThetaHiggsW1W2
• Arithmetic variable Energy*Theta of the W, Higgs and Z bosons…
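A few of the simpler variables in the list can be computed directly from four-vectors. The sketch below assumes four-vectors of the form (E, px, py, pz) with the beam along the z axis. The function names are illustrative and do not correspond to the variable names used in the analysis code.

```python
import math


def transverse_momentum(p):
    """pT with respect to the z (beam) axis, for p = (E, px, py, pz)."""
    return math.hypot(p[1], p[2])


def visible_energy(pfos):
    """Sum of the energies of all reconstructed particle flow objects."""
    return sum(p[0] for p in pfos)


def highest_pt(jets):
    """Largest transverse momentum among the jets of the event."""
    return max(transverse_momentum(j) for j in jets)


# Hypothetical two-jet example.
demo_jets = [(5.0, 3.0, 4.0, 0.0), (10.0, 0.0, 6.0, 8.0)]
demo_evis = visible_energy(demo_jets)
demo_highest_pt = highest_pt(demo_jets)
```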
23. HOW ABOUT SOME INPUT
EVENT SHAPE, ENERGY AND TOPOLOGY
• thrust, oblateness, sphericity, aplanarity
INVARIANT MASSES
• Z, W, Higgs
FINAL STATE PARTICLES
24. SIGNAL VS BACKGROUND
• The static cut analysis removes ~98% of the
background.
• The obtained relative statistical
uncertainty is 1.7% with the MVA and 3.6% with
static cuts
• The corresponding signal efficiency is 29%
sample                                          σ [fb]      ε_tot (MVA) [%]
qqh → qqWW* → qqqqqq                            16.12       28.85
other Higgs decays (non qqh → qqWW* → qqqqqq)   127.27      6.1
2f                                              49561.30    0.002
4f_ww_cuxx                                      3395.48     0.24
4f_ww_ccbs                                      5.74        0.38
4f_ww_ccds                                      165.57      0.28
4f_ww_uubd                                      0.05        0.4
4f_ww_uusd                                      165.94      0.13
4f_ww_zz_udud                                   1570.40     0.24
4f_ww_zz_cscs                                   1568.94     0.29
4f_zz_utut                                      83.09       1.2
4f_zz_dtdt                                      226.20      1.8
4f_zz_uu_notd                                   95.65       1.4
4f_zz_cc_nots                                   96.04       1.7
• Final selection removes ~99% of the background
25. TOOOLSSS… ROOT
• To perform the analysis, the standard HEP
tools include ROOT, the CERN-developed software
package for the analysis and visual presentation of
measurement results
• It is freeware/open source, with the constant
support of its creator Rene Brun and the ROOT
development team
• It is C++-based software and the successor of
PAW, which was Fortran-based
• So everything that you will see was produced in
some way using ROOT
26. MACHINE LEARNING TMVA
• TMVA is the ROOT library that provides the interfaces and implementations of the above mentioned machine learning
techniques.
The package includes:
• Neural networks
• Deep networks
• Multilayer perceptron
• Boosted/Bagged decision trees
• Function discriminant analysis (FDA)
• Multidimensional probability density estimation (PDE - range-search approach)
• Multidimensional k-nearest neighbor classifier
• Predictive learning via rule ensembles (Rule Fit)
• Projective likelihood estimation (PDE approach)
• Rectangular cut optimization
• Support Vector Machine (SVM)
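To give a feel for the simplest entry in this list, rectangular cut optimization, here is a pure-Python sketch for a single variable. It is not the TMVA API: it only illustrates the idea of scanning a threshold and keeping the one that maximizes a figure of merit. The figure of merit S/√(S+B), the function names and the toy values are all assumptions made for the example.

```python
import math


def significance(s: float, b: float) -> float:
    """Figure of merit S / sqrt(S + B); zero when no event passes."""
    return s / math.sqrt(s + b) if s + b > 0 else 0.0


def best_lower_cut(signal, background, w_sig=1.0, w_bkg=1.0):
    """Scan thresholds t, keep events with value >= t, return (t, significance).

    signal and background are lists of one discriminating variable;
    w_sig and w_bkg are per-event weights (for example sigma * L / N_generated).
    """
    best_t, best_z = None, -1.0
    for t in sorted(set(signal) | set(background)):
        s = w_sig * sum(1 for x in signal if x >= t)
        b = w_bkg * sum(1 for x in background if x >= t)
        z = significance(s, b)
        if z > best_z:
            best_t, best_z = t, z
    return best_t, best_z


# Hypothetical toy values of one variable for signal and background events.
toy_signal = [0.9, 0.8, 0.75, 0.6, 0.4]
toy_background = [0.7, 0.5, 0.3, 0.2, 0.1, 0.05]
cut, z_best = best_lower_cut(toy_signal, toy_background)
```

Multivariate methods such as boosted decision trees generalize this by combining many variables into one classifier output, on which a single cut is then placed.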
27. THE WORLDWIDE LHC COMPUTING GRID (WLCG)
• The mission of the Worldwide LHC Computing Grid (WLCG) is to provide global computing resources
for the storage, distribution and analysis of the data generated by the LHC.
• WLCG combines about 1.4 million computer cores and 1.5 exabytes of storage from over 170 sites in
42 countries. This massive distributed computing infrastructure provides more than 12 000 physicists
around the world with near real-time access to LHC data, and the power to process it.
• It runs over 2 million tasks per day and, at the end of the LHC’s LS2, global transfer rates exceeded
260 GB/s.
• These numbers will increase as time goes on and as computing resources and new technologies become
ever more available across the world.
• CERN provides about 20% of the resources of WLCG
28. IT WAS “WORTH YOUR WHILE”
• The obtained relative statistical precision is 1.7 %
with the corresponding signal efficiency of 29%
$$\frac{\Delta\sigma}{\sigma} = \frac{\sqrt{S + B}}{S} \approx 1.7\%$$
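The relative statistical uncertainty √(S+B)/S can be evaluated from the cross sections and total MVA efficiencies listed in the signal-versus-background table, with S and B the numbers of selected signal and background events at 5 ab⁻¹. The sketch below is only an illustration: the efficiencies on the slide are rounded and the background list may be incomplete, so the result should be of the same order as the quoted value without necessarily reproducing it exactly.

```python
import math

LUMINOSITY_FB = 5000.0  # 5 ab^-1, as in the sample table

# (cross section [fb], total MVA selection efficiency [%]) copied from the slides.
SIGNAL = (16.12, 28.85)
BACKGROUNDS = {
    "other Higgs decays": (127.27, 6.1),
    "2f": (49561.30, 0.002),
    "4f_ww_cuxx": (3395.48, 0.24),
    "4f_ww_ccbs": (5.74, 0.38),
    "4f_ww_ccds": (165.57, 0.28),
    "4f_ww_uubd": (0.05, 0.4),
    "4f_ww_uusd": (165.94, 0.13),
    "4f_ww_zz_udud": (1570.40, 0.24),
    "4f_ww_zz_cscs": (1568.94, 0.29),
    "4f_zz_utut": (83.09, 1.2),
    "4f_zz_dtdt": (226.20, 1.8),
    "4f_zz_uu_notd": (95.65, 1.4),
    "4f_zz_cc_nots": (96.04, 1.7),
}


def selected_events(sigma_fb, eff_percent, lumi_fb=LUMINOSITY_FB):
    """Events surviving the selection: sigma * L * efficiency."""
    return sigma_fb * lumi_fb * eff_percent / 100.0


def relative_uncertainty(s, b):
    """Relative statistical uncertainty sqrt(S + B) / S."""
    return math.sqrt(s + b) / s


S = selected_events(*SIGNAL)
B = sum(selected_events(*row) for row in BACKGROUNDS.values())
print(f"S = {S:.0f}, B = {B:.0f}, sqrt(S+B)/S = {100 * relative_uncertainty(S, B):.2f}%")
```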
After preselection After final selection
29. ALL THE WAY IT IS A COLLABORATIVE EFFORT
But… we are all in this together, science!!
HL-LHC , FCC, CEPC, CLIC, ILC, …
Higgs couplings below specified limits!!
30. SUMMARY
• Data science plays a crucial role in modern particle physics, which studies the fundamental building blocks of
the Universe.
• It is applied in every step of collecting, triggering, storing, calibrating, reconstructing and analyzing data
from running particle accelerators, such as the Large Hadron Collider (LHC).
• It is used for “feasibility studies”, that is, for the research and development of future colliders.
• Machine learning algorithms are used to identify and reconstruct particles from the vast amounts of data
generated by detectors.
• The presented material is “bits and pieces” from the running LHC and the future colliders CEPC, CLIC, ILC.
• A more comprehensive and detailed presentation would take much more of our time and multiple
lecturers!
31. As for the origin of mass and Higgs …
What did you
have for
breakfast ???
Not-a-thing!!!
Higgs is just
sticking to me!
… for now … we just have to put up with it
33. Data Preservation
• Besides collecting data for the ongoing analyses, the long term data storage and preservation is of high
importance: re-analysis, novel ideas
• Three pillars to keep:
• the data itself
• "documentation",
• together with the necessary software + environment
• DPHEP ("Data Preservation in HEP") is an international collaboration of institutes, experiments,
funding agencies and other interested parties, formed to implement the recommendations of the DPHEP study
group.