QUARK/GLUON JET TAGGING FOR ALICE:
MACHINE LEARNING FOR PARTICLE PHYSICS
ANDREW JOHN LOWE
Wigner Research Centre for Physics, Hungarian Academy of Sciences
INTRODUCTION
Search strategies for new subatomic particles often depend
on being able to efficiently discriminate between signal and back-
ground processes. Particle physics experiments are expensive, the
competition between rival experiments is intense, and the stakes
are high. This has led to increased interest in advanced statistical
methods to extend the discovery reach of the experiments. We
present a new method that could be used for differentiating be-
tween decays of quarks and gluons at experiments like those at the
Large Hadron Collider (LHC) at CERN. The power to discriminate
between these two types of particle would have a huge impact on
many new physics searches at CERN and beyond.
THE ALICE EXPERIMENT
ALICE (A Large Ion Collider Experiment) is one of seven de-
tector experiments at the LHC. ALICE is focusing on the physics
of strongly interacting matter in heavy-ion (lead nuclei) collisions.
The resulting temperature and energy density are expected to be
high enough to produce quark-gluon plasma, a state of matter wherein
quarks and gluons are freed. Similar conditions are believed to have
existed a fraction of a second after the Big Bang. Recreating this
primordial form of matter and understanding how it evolves is ex-
pected to shed light on questions about how matter is organized,
the mechanism that confines quarks and gluons, and the nature of
strong interactions and how they result in generating the bulk of the
mass of ordinary matter.
Figure 1: Computer generated cut-away view of ALICE.
WHAT IS A JET?
The production of quarks and gluons (collectively known as
partons) via strong interactions is the dominant high-momentum-
transfer process at the LHC. Quarks and gluons are not observed
individually. Instead, we can only measure their decay products.
What we observe is a cone-shaped spray of particles called a jet. The
measured particles are grouped together by a jet algorithm, and the
resultant jets are viewed as a proxy to the initial quarks and gluons
that we can’t measure.
Figure 2: When two high-energy protons collide, the partons that compose them (here only
quarks are depicted in green, red, and blue) can hit each other. Some of these partons (pink
balls) can fly away and "hadronize", forming directional jets of energetic particles (white balls).
From [1].
THE PROBLEM IN A NUTSHELL
Inside ALICE, beams of energetic protons and/or heavy ions
collide. Quarks and gluons emerge and decay into collimated sprays
of particles, and algorithms cluster these decay products into jets.
For each jet, we’d like to know what initiated it: was it a quark or
a gluon? This is an archetypal classification problem that might be
amenable to machine learning.
FEATURE ENGINEERING
There are several differences between quarks and gluons that
prove useful in motivating observables that might distinguish be-
tween jets initiated by quarks as compared to gluons. Specifically,
we wish to leverage differences in jet substructure to construct dis-
criminant variables. Many candidate discriminant variables (fea-
tures) were found during a thorough and extensive literature search,
but we also consider various unintuitive combinations of variables,
following the example in [2]. Combining particle attributes with
each other (to form sums, differences, or products, for example)
leads to a rapid proliferation of features. Consequently, we ex-
plore hundreds of experimentally motivated, physically motivated,
and unmotivated single-variable discriminants.
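The proliferation of features described above can be illustrated with a minimal sketch. It assumes each jet is a flat mapping from attribute name to value; the attribute names and values are hypothetical and are not taken from the analysis code.

```python
from itertools import combinations


def combine_features(jet):
    """Return pairwise sums, differences and products of a jet's attributes."""
    combined = {}
    # Sort by name so the derived feature names are reproducible.
    for (name_a, a), (name_b, b) in combinations(sorted(jet.items()), 2):
        combined[f"{name_a}+{name_b}"] = a + b
        combined[f"{name_a}-{name_b}"] = a - b
        combined[f"{name_a}*{name_b}"] = a * b
    return combined


# Hypothetical attributes for one jet (illustrative names and values only).
jet = {"n_tracks": 4.0, "width": 0.25, "pt": 20.0}
features = combine_features(jet)
print(len(features))  # 3 pairs x 3 operations = 9 derived features
```

With only three base attributes and three operations there are already nine derived features; the count grows quadratically with the number of attributes, which is why a later feature-selection step is needed.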
GETTING & CLEANING DATA
The ALICE analysis software framework ALIROOT was used
to process Monte Carlo simulated data containing a large number of jets. We
inserted our own C++ code with handcrafted features into ALIROOT.
Unphysical (missing) values are denoted by NaNs. We require
that jets have at least two tracks, are fully contained within
the tracker’s geometrical acceptance, and are isolated. We then analyse
the floating-point types contained in the data. This is a new subprocess
in particle physics data analysis that we have invented. The
data is contained in a two-dimensional array-like structure, in which
each column contains measurements of one feature, and each row
contains one jet. We plot this below:
[Figure 3 image. Panels: 2 tracks, 3 tracks, 4 tracks, ≥ 5 tracks. Axes: Feature and Jet. Legend (number types): Zero; Normal < ε; Normal > ε; Large unnormal or ∞; NaN.]
Figure 3: "Missingness" and floating-point type map of the data. Feature names and jet ID
numbers have been omitted for clarity.
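A map like the one in Figure 3 could be built by classifying every cell of the jet-by-feature table. The sketch below is an illustration only: it assumes 64-bit floats, takes ε to be the machine epsilon, and treats only infinities as "large", since the exact criterion for the "large" category is not stated here.

```python
import math
import sys

EPS = sys.float_info.epsilon  # machine epsilon for 64-bit floats (assumed to be the ε of the legend)


def float_type(x):
    """Classify one value into the categories used in the map."""
    if math.isnan(x):
        return "nan"
    if math.isinf(x):
        return "inf"  # assumption: only infinities are flagged as "large"
    if x == 0.0:
        return "zero"
    if abs(x) < EPS:
        return "below eps"
    return "above eps"


# Hypothetical table: each row is one jet, each column one feature.
data = [
    [0.0, 1e-20, 0.5],
    [float("nan"), float("inf"), -2.0],
]
type_map = [[float_type(v) for v in row] for row in data]
```

Each entry of `type_map` can then be mapped to a colour to draw the figure.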
We observe that several features appear to be duplicates, and
several are overwhelmingly NaN or zero. These have no predictive
value and are removed. We reset large unnormalised and
infinite values to the largest representable normalised number on
our hardware. Several features have values below the machine
epsilon (ε); these are due to rounding error in floating-point arithmetic,
and are essentially equal to zero. The variation in their values is not
real, and could be misleading to a classifier. We note that processing
very small floating-point values may significantly slow computation;
in extreme cases, instructions may be as much as 100 times
slower [3, 4]. We flush these values to zero, which should speed up
classifier training.
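The cleaning steps above can be sketched as follows. This is a minimal illustration, not the analysis code: it assumes 64-bit floats, handles only infinities in the "large value" step, and uses a hypothetical 95% threshold for "overwhelmingly NaN or zero".

```python
import math
import sys

EPS = sys.float_info.epsilon  # values smaller than this are flushed to zero
MAX = sys.float_info.max      # largest representable finite 64-bit float


def clean_value(x):
    """Clip infinities to the largest finite float and flush tiny values to zero."""
    if math.isnan(x):
        return x  # NaN marks a missing value and is left untouched
    if math.isinf(x):
        return math.copysign(MAX, x)
    if abs(x) < EPS:
        return 0.0
    return x


def uninformative(column, threshold=0.95):
    """True if the column is overwhelmingly NaN or zero (threshold is an assumption)."""
    n_bad = sum(1 for v in column if math.isnan(v) or v == 0.0)
    return n_bad / len(column) >= threshold


def clean_table(columns, threshold=0.95):
    """Clean every column, then drop uninformative and duplicate columns."""
    kept, seen = {}, set()
    for name, values in columns.items():
        cleaned = [clean_value(v) for v in values]
        if uninformative(cleaned, threshold):
            continue
        # NaN never compares equal to itself, so use a string stand-in in the key.
        key = tuple("nan" if math.isnan(v) else v for v in cleaned)
        if key in seen:
            continue
        seen.add(key)
        kept[name] = cleaned
    return kept


# Hypothetical feature columns (names and values are illustrative only).
columns = {
    "a": [1.0, 2.0, float("inf")],
    "a_copy": [1.0, 2.0, float("inf")],
    "mostly_zero": [0.0, 1e-20, float("nan")],
    "b": [0.5, 1e-300, -3.0],
}
cleaned = clean_table(columns)
```

In this toy input, `a_copy` is dropped as a duplicate and `mostly_zero` as uninformative, leaving `a` and `b`.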
JET TRUTH LABELLING
The jets are assigned a "ground-truth" label using information
in the data simulator event record. However, the labelling procedure
is not unambiguous, and there is significant class noise (i.e., there
are mislabelled jets) in the assignments. We have devised a new la-
belling scheme to address the problem of mislabelled jets. We adapt
the method in [5] by extending it to form an ensemble of multiple
different labelling schemes. We then reject all jets for which the en-
semble does not reach a consensus, i.e., we employ "the wisdom of
the crowds". Limiting class noise is critical for two reasons: firstly,
many machine learning classifiers are confused by mislabelled ob-
servations, and this will damage performance; secondly, the perfor-
mance of a classifier is measured with respect to its ability to cor-
rectly predict the assigned labels, so performance estimates are less
meaningful if the labels are uncertain.
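The consensus rule can be sketched in a few lines. The sketch assumes that "consensus" means unanimous agreement among all ensemble members; the jet IDs and label strings are hypothetical.

```python
def consensus_label(labels):
    """Return the common label if every scheme agrees, otherwise None."""
    first = labels[0]
    return first if all(label == first for label in labels) else None


def filter_by_consensus(jets):
    """Keep only jets whose labelling schemes all agree."""
    kept = []
    for jet_id, labels in jets:
        label = consensus_label(labels)
        if label is not None:
            kept.append((jet_id, label))
    return kept


# Hypothetical output of five labelling schemes for three jets.
jets = [
    (0, ["q", "q", "q", "q", "q"]),
    (1, ["g", "g", "g", "g", "gamma"]),  # one scheme disagrees, so the jet is rejected
    (2, ["g", "g", "g", "g", "g"]),
]
selected = filter_by_consensus(jets)
```

A majority vote would retain more jets at the cost of more residual class noise; unanimity is the stricter choice assumed here.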
We use an ensemble labeller with five members. To tabulate
their outputs would require ten tables corresponding
to the $\binom{5}{2}$ possible adjacency matrices which, in turn, correspond
to the margins of the 5-dimensional adjacency matrix
that fully describes the relationships between the schemes.
We use a chord diagram to examine these relationships,
which are in agreement with our expectations:
[Figure 4 image: chord diagram with one group per labelling scheme (Cone max-pT, Cone QCD-aware, GA max-pT, GA QCD-aware, Reclustered); each group has sectors ∅, b, c, g, q, γ and axis ticks at 0, 70, 140, 210.]
∅: no label
q: light quark
g: gluon
c: charm
b: bottom
γ: photon
Figure 4: Chord diagram showing the relationships between five chosen labelling schemes.
There is overwhelming agreement in the label assignments from each scheme, and (as expected)
variations of the "max-pT" scheme are prone to label a jet as photon-initiated (γ) when a QCD-
aware scheme would label the jet as a gluon or assign no label (as noted in [5]).
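The ten pairwise tables that the chord diagram summarises could be computed as below. This is a minimal sketch with placeholder scheme names and invented labels; it is not the code used to produce Figure 4.

```python
from collections import Counter
from itertools import combinations


def pairwise_tables(assignments):
    """Count label pairs for every pair of schemes.

    assignments maps a scheme name to its list of labels, one per jet.
    Returns a dict mapping (scheme_a, scheme_b) to a Counter of (label_a, label_b).
    """
    tables = {}
    for a, b in combinations(sorted(assignments), 2):
        tables[(a, b)] = Counter(zip(assignments[a], assignments[b]))
    return tables


# Placeholder scheme names and invented labels for four jets.
assignments = {
    "scheme_1": ["q", "g", "g", "gamma"],
    "scheme_2": ["q", "g", "g", "g"],
    "scheme_3": ["q", "g", "g", "gamma"],
    "scheme_4": ["q", "g", "g", "none"],
    "scheme_5": ["q", "g", "g", "g"],
}
tables = pairwise_tables(assignments)
print(len(tables))  # 5 choose 2 = 10 pairwise tables
```

Off-diagonal entries such as `("gamma", "g")` in a table flag the jets on which two schemes disagree.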
FEATURE RANKING RESULTS
Our data initially contains more than 300 features. Removing
duplicate and highly correlated features halves the number of features;
first we filter on the absolute value of Pearson correlation to
identify linearly correlated features, then we filter on the absolute
value of Spearman correlation to identify monotonically related features.
To search the remaining feature space for the
variables that provide the best predictive power, we invented a fast
filter-based method that involves ranking variables by information
gain (Kullback-Leibler divergence) or Gini impurity and comparing
their rank with that of random probes injected into the data. We do
this repeatedly for a large number of bootstrap resamplings to yield
a median (or, optionally, a mean) and nonparametric confidence
interval estimate for the chosen metric for each feature. We then
remove features whose value of the metric is less than that for random
probes, or within one standard deviation of it:
Figure 5: Box-and-whisker plot showing median information gain for features and six ran-
dom probes (denoted "BOGUS"). To the left is worse, to the right is better.
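The random-probe idea can be sketched with the standard library alone. Everything specific here is an assumption made for illustration: information gain is estimated with four equal-width bins, the cut is the best probe median plus one standard deviation of all probe scores, and the toy data are invented.

```python
import math
import random
import statistics
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels, n_bins=4):
    """Entropy reduction from splitting a feature into equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # guard against constant columns
    bins = {}
    for v, y in zip(values, labels):
        idx = min(int((v - lo) / width), n_bins - 1)
        bins.setdefault(idx, []).append(y)
    conditional = sum(len(ys) / len(labels) * entropy(ys) for ys in bins.values())
    return entropy(labels) - conditional


def probe_filter(features, labels, n_boot=50, n_probes=3, seed=0):
    """Keep features whose median gain beats the random probes by a margin."""
    rng = random.Random(seed)
    n = len(labels)
    table = dict(features)
    probes = [f"BOGUS_{i}" for i in range(n_probes)]
    for p in probes:
        table[p] = [rng.random() for _ in range(n)]  # pure-noise columns
    scores = {name: [] for name in table}
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap resample of the jets
        ys = [labels[i] for i in idx]
        for name, col in table.items():
            scores[name].append(information_gain([col[i] for i in idx], ys))
    medians = {name: statistics.median(s) for name, s in scores.items()}
    probe_scores = [x for p in probes for x in scores[p]]
    # Assumed cut: best probe median plus one standard deviation of probe scores.
    cut = max(medians[p] for p in probes) + statistics.stdev(probe_scores)
    keep = [name for name in features if medians[name] > cut]
    return keep, medians


# Invented toy data: one informative feature and one uninformative feature.
data_rng = random.Random(1)
labels = [data_rng.randrange(2) for _ in range(200)]
features = {
    "useful": [y + data_rng.gauss(0.0, 0.2) for y in labels],
    "noise": [data_rng.random() for _ in labels],
}
keep, medians = probe_filter(features, labels)
```

Because the probes are scored on the same bootstrap samples as the real features, they absorb the finite-sample bias of the information-gain estimate, so the comparison is made on equal terms.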
In addition to confirming the power of features already
proposed in past work, this method has found intriguing
new variables that promise better discrimination between
quark- and gluon-initiated jets and are therefore ideal can-
didates for further study.
REFERENCES
[1] C. Manuel. The Stopping Power of Hot Nuclear Matter. Physics, 7(97), 2014. doi: 10.1103/Physics.7.97. URL http://link.aps.org/doi/10.1103/Physics.7.97.
[2] J. Gallicchio, J. Huth, M. Kagan, M. D. Schwartz, K. Black, and B. Tweedie. Multivariate discrimination and the Higgs + W/Z search. JHEP, 04:069, 2011. doi: 10.1007/JHEP04(2011)069.
[3] E. M. Schwarz, M. Schmookler, and S. D. Trong. FPU implementations with denormalized numbers. IEEE Transactions on Computers, 54(7):825–836, July 2005. ISSN 0018-9340. doi: 10.1109/TC.2005.118.
[4] I. Dooley and L. Kale. Quantifying the interference caused by subnormal floating-point values. In Proceedings of the Workshop on Operating System Interference in High Performance Applications, 2006.
[5] A. Buckley and C. Pollard. QCD-aware partonic jet clustering for truth-jet flavour labelling. Eur. Phys. J., C76(2):71, 2016. doi: 10.1140/epjc/s10052-016-3925-z.
This work was supported by the Hungarian National Research Fund (OTKA) NK106119 and the Wigner GPU Laboratory of the Wigner RCP, Hungarian Academy of Sciences.