The document summarizes a summer internship project to parallelize the TopFitter program, which calculates constraints on deviations from the Standard Model in the top-quark sector, in order to speed up computations. The goal of creating a GPU-parallelized version of TopFitter was accomplished, achieving a 3.5x speedup, although an unidentified bug remains when running on the largest dataset. As a side project, analysis found that interference prevents detection of non-Standard-Model effects from events with non-standard colour flows, implying that a different approach is needed.
Project Report
Summer 2016 Internship – TopFitter Parallel Scan
Thomas Fletcher
2091233F@student.gla.ac.uk
T-Fletcher@outlook.com
Abstract: The TopFitter program calculates constraints on higher-dimensional operators
modelling deviations from the Standard Model, specifically with regard to top quarks. The
aim of the summer project was to create a version of TopFitter which could be massively
parallelised on GPUs. The goal was accomplished: a 3.5x speedup over the original version
was achieved on the data used in previously published papers. At the end of the internship
there remains an unidentified bug when running on the largest data set, and a considerably
more efficient version is formally complete but not yet working. As a separate task, an
analysis of events generated with models increasingly divergent from the SM yielded the
result that interference with SM interactions prevents detection of non-SM effects.
1. Introduction
1.1. Beyond the Standard Model
In the search for Beyond-Standard-Model implementations of the breaking of electroweak
symmetry, all data produced by the Large Hadron Collider (LHC) is usually parametrised with
model-independent parameters representing its deviation from Standard Model predictions[1][2].
So far the data has consistently matched these predictions (although not definitively excluding
new degrees of freedom at those energies), leading to the conclusion that, if present, larger
deviations will have to occur at higher energies.
In trying to parametrise all BSM interactions, the SM Lagrangian ℒSM will be just the first term
in an infinite series of Lagrangian terms ℒi constructed from SM operators, constituting an
Effective Lagrangian ℒeff. Note that it is the terms of higher mass dimension (not space-time
dimension) that will be suppressed, by powers of the high energy scale Λ in Equation 1 below:
ℒeff = ℒSM + (1/Λ) ℒ1 + (1/Λ²) ℒ2 + (1/Λ³) ℒ3 + ⋯
Equation 1: Effective Field Lagrangian [1][2]
Modelling the new physics with an infinite series of higher-dimensional effective operators is
an approach which, among others (such as anomalous couplings)[1][2], has the advantage of being
completely general, allowing the exploration of new physical effects without depending on
specific models regarding wider spectra than required (because of the higher-energy term
suppression)[2], and also that of preserving the SM SU(3)C × SU(2)L × U(1)Y gauge symmetry
(because the ℒi terms are combinations of SM operators)[1].
Furthermore, the infinite series collapses to a manageable finite number of terms by choosing
a dimension to model, making the simple assumptions of minimal flavour violation and baryon
number conservation and focusing on a specific set of observables[1][2].
The number of operators for dimension-six (where the relevant leading ℒeff contributions
appear) with the above (and more)[2] assumptions taken into account and focusing on top quarks
is just 14, making the Effective Lagrangian of the form in Equation 2 below:
ℒeff = ℒSM + (1/Λ²) Σi Ci Oi + 𝒪(Λ⁻⁴)
Equation 2: Specific Effective Field Lagrangian, where the Ci are arbitrary Wilson coefficients and the Oi
the 14 relevant operators [1][2]
1.2. The TopFitter Collaboration
Given the great abundance of top quark data from LHC and Tevatron and their important role
in most Standard Model deviations, the TopFitter Collaboration was set up to compute constraints
on the operators which contribute to top quark events.
The Collaboration’s previous work constrained dimension-six operators contributing to single
and pair top quark production; the number of relevant operators (14), although greatly narrowed
down by the aforementioned (and more) assumptions and choices, was still not manageable by
the original TopFitter, which had to set at least half of them to 0 in order to be able to run.[1][2]
Moving forward with this research we cannot afford to ignore further dimensions, but the
computation scaling represents a significant obstacle; this is why the original code needed to be
optimised and then either run on supercomputers or heavily parallelised and run on GPUs.
1.3. Data flow from source to TopFitter
The data TopFitter works on comes from a multi-step data flow through software packages
performing Monte Carlo event generation and analysis, as shown in Diagram 1 below:
Diagram 1: Data Flow into TopFitter
2. Side Project: Colour Flow Analysis
2.1. Non-Standard Colour Flow
A separate task in the project was that of analysing event files in order to confirm that they indeed
contained the intended non-standard colour flows which were supposed to be generated by
Monte Carlo engines using models increasingly divergent from the Standard Model.
Naturally, colour has to be conserved between inputs and outputs of Feynman diagram
vertices in the same way baryon number and similar quantities are; non-standard colour flow
occurs when the total colour before and after an event is not conserved, or it is conserved but
the pairings and allocations of colours among the outputs do not reflect the standard event
reconstruction.
A comparison of a SM and a non-SM event resulting in non-SM colour flow can be seen in
Diagram 2 below, where the first Feynman diagram has a vertex producing a bottom quark and a
(colourless) W+ boson producing a colour-matched quark pair, while the second diagram has a
“black box” vertex in its place, directly outputting the bottom quark and a non-colour-matched
quark pair (note that, instead, it is the bottom quark which matches one of the pair quarks’ colour;
regardless of the match, total colour is not conserved).
Diagram 2: Comparison of Standard and Non-Standard events with an emphasis on colour flow
(See colour coded legend at the top of diagram for indices)
2.2. Analysis Results
A C++ program making use of the LHEF library[3] was written in order to isolate only the relevant
pairs of quarks from each event and then analyse their colours.
Two kinds of event files were fed to the program: some generated from only non-Standard
Models and others generated from both Standard and non-Standard Models at the same time.
Many instances of the expected non-Standard colour flows were found in the former, but
surprisingly, none of them were found in the latter.
A brief discussion with the Theory Group confirmed that this empirically found absence of non-
SM colour flows is backed by theory: the Standard Model interactions interfere heavily with the
non-SM ones, effectively preventing their effects from manifesting.
This implies that the flows in question will never be observed, meaning that a different
approach will have to be used to test these non-SM models.
3. Project Steps
3.1. TopFitter Structure
The general structure of the main TopFitter script (both before and after parallelisation) is the
following:
1. Extract data from the input files and package it into useful objects
2. Choose a pair of dimensions to slice through the given n-dimensional space
3. Generate a 2D slice grid in the chosen dimensions for each pixel of the intended output
4. Pre-scan each grid point with a marginalisation function in order to find local minima
5. Find the global minimum starting from the smallest local minimum
Step 4 involves picking evenly spaced points in the resulting (n-2)-dimensional space, and
computing the chi squared function on each of them, making it the most computationally
expensive step, and therefore the parallelisation target.
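The five-step structure above can be sketched in plain Python. This is an illustrative outline only (the function names and the toy paraboloid standing in for the real chi squared are hypothetical, not TopFitter's actual API):

```python
import itertools

def grid_points(n_dims, points_per_dim, lo=-1.0, hi=1.0):
    """Evenly spaced points in an n-dimensional cube (step 4's pre-scan grid)."""
    axis = [lo + i * (hi - lo) / (points_per_dim - 1) for i in range(points_per_dim)]
    return itertools.product(axis, repeat=n_dims)

def chi2(point):
    """Stand-in for the real chi-squared; here just a paraboloid."""
    return sum(c * c for c in point)

def scan_slice(n_dims, pixels_per_axis=3, points_per_dim=5):
    """Steps 3-5: for each output pixel, pre-scan the (n-2)-D grid for minima."""
    results = {}
    for px, py in itertools.product(range(pixels_per_axis), repeat=2):
        # Step 4: evaluate chi2 on every point of the (n-2)-dimensional grid
        best = min(chi2(p) for p in grid_points(n_dims - 2, points_per_dim))
        results[(px, py)] = best  # step 5 would refine this local minimum further
    return results

pixels = scan_slice(n_dims=4)
print(len(pixels))  # 9: one chi2 minimum per output pixel
```

In the real program the inner `min(...)` over the grid is the part that is farmed out to the GPU.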
3.2. Complexity and Scaling
Table 1 below shows the step-by-step algorithmic complexity of TopFitter’s scans, i.e. how quickly
the computation increases for increasing input size.
Definition                                        Expression
Dimensions                                        D
Pixels = Number of Slice Grids                    P
Scanned Points per Dimension (= 5 by default)     S
Number of Slice Grid Points                       N = S^(D-2)
Operations per Point                              O = f((D-2)!)
Total Operations                                  T = P · N · O = P · S^(D-2) · f((D-2)!)
Table 1: Algorithmic complexity of TopFitter’s scanning step
For reference, using P = 121 and S = 5:
D = 7 → N = 3125 and O = f(120), making T = 378125 ∙ f(120)
D = 12 → N = 9765625 and O = f(3628800), making T = 1181640625 ∙ f(3628800)
It is obvious that an algorithmic complexity of an exponential times a function of a factorial
leads to impossibly long computations extremely quickly, and even the GPU used for this project
(6 GB of RAM, 2816 Cores, 1.19 GHz Max Clock rate and 1024 Max Threads per block) will struggle
at D = 12.
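The reference figures above can be reproduced with a few lines of Python (f itself is left abstract, so only its prefactor P · S^(D-2) and the factorial argument (D-2)! are computed):

```python
from math import factorial

def scan_operations(D, P=121, S=5):
    """Return the operation-count prefactor P * S**(D-2) and the argument
    (D-2)! entering the per-point cost f((D-2)!) from Table 1."""
    N = S ** (D - 2)          # slice-grid points per pixel
    fac = factorial(D - 2)    # argument of f in the per-point cost
    return P * N, fac

print(scan_operations(7))   # (378125, 120)
print(scan_operations(12))  # (1181640625, 3628800)
```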
3.3. Used Frameworks
TopFitter is written in Python, and makes use of the Professor2 package[4] (which was developed
alongside TopFitter and shares some code with it), a tuning tool for Monte Carlo event
generators written in Python, Cython and C++, in order to extract data from the statistical-analysis
input files and present it in the form of a variety of easy-to-query histogram objects.
The marginalisation of the function at each data point is carried out with the iMinuit package,
which needs very specific inputs (such as pre-emptively known variable names), restricting the
flexibility of the whole codebase.
The parallelisation over GPU cores is made possible by the PyOpenCL package[5], which is an
interface to OpenCL drivers for multi-core hardware; the useful features of PyOpenCL are:
- It interfaces very well with the Numpy package (which TopFitter already makes use of),
allowing easy transfer of Python data onto the parallel device
- It gets rid of most of the boilerplate code of the equivalent OpenCL C code, handling
all the setting up and environment details of the device
- It provides a few common parallel-algorithm-building tools which leave only the
innermost kernel of computation to be written by the user
OpenCL itself imposes many restrictions on the C code that can run on the device, the most
important ones being:
- Pointers to pointers (most importantly multidimensional arrays) are not allowed,
meaning that the user has to flatten data structures (PyOpenCL takes care of that
automatically for Numpy arrays) and then go through them with size modulo arithmetic.
- Variable-length declarations are not allowed, meaning that if necessary the user has to
use the precompiler or string substitution from the Python side in order to get around this.
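The flattening restriction is easy to illustrate with Numpy on the host side; the index arithmetic shown here in plain Python is exactly what the kernel performs on the device:

```python
import numpy as np

# A 2D array as TopFitter would hold it (e.g. binsLen x polyLen)...
a = np.arange(12, dtype=np.float64).reshape(3, 4)

# ...is passed to the device flattened; element [i][j] of the logical 2D
# array is then recovered with row-major index arithmetic:
flat = a.ravel()
rows, cols = a.shape
i, j = 2, 1
assert flat[i * cols + j] == a[i, j]

# The inverse mapping, as a kernel would do with its global id:
gid = 9
i, j = gid // cols, gid % cols
print(i, j, flat[gid])  # 2 1 9.0
```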
3.4. Generalisation to N dimensions
The original TopFitter had hardcoded blocks dealing with each specific number of input
dimensions because of the aforementioned requirements of iMinuit with regards to pre-emptive
input knowledge; the first task was therefore that of generalising the code to N dimensions.
This consisted of procedurally generating variable names and argument counts, with the
interface to iMinuit becoming a locally generated and executed code string containing a function
declaration with all the required behaviours, while in fact delegating to more generic functions.
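A minimal sketch of this trick, with hypothetical names (`make_named_wrapper` and the `c0, c1, ...` parameter names are illustrative, not the actual TopFitter code): a generated code string declares a function whose parameter names are known up front, as iMinuit-style tools require, while delegating to a generic N-dimensional function.

```python
def make_named_wrapper(generic_fn, n_dims):
    """Build a function with explicitly named parameters c0..c(N-1) that
    forwards its arguments, as a tuple, to a generic N-dimensional function."""
    names = ["c%d" % i for i in range(n_dims)]
    src = "def wrapper({args}):\n    return generic_fn(({args},))".format(
        args=", ".join(names))
    scope = {"generic_fn": generic_fn}
    exec(src, scope)              # locally generate and execute the code string
    return scope["wrapper"]

def generic_chi2(coords):
    return sum(c * c for c in coords)

f = make_named_wrapper(generic_chi2, 3)
print(f.__code__.co_varnames[:3])  # ('c0', 'c1', 'c2')
print(f(1.0, 2.0, 2.0))            # 9.0
```

Tools that introspect parameter names now see `c0, c1, c2` instead of a generic `*args` signature.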
3.5. Data Extraction, Caching and Transfer to GPU
The marginalisation step uses data coming from two different Professor2 Histogram objects:
DataHisto and IpolHisto. The former contains static data for each bin, while the latter contains
all the interpolation information required to calculate a value for each n-dimensional coordinates
tuple, meaning that calling, for example, the value method on an IpolBin with the coordinates as
arguments triggers a series of computations going through Python, Cython, C++ and back again.
While the original TopFitter could afford to extract or calculate each data item from the
Professor2 Histogram objects in the same cycle as the marginalisation, the parallel version cannot
because all the data has to be cached on the GPU memory to be used by each core independently.
Since OpenCL does not allow pointers to pointers, the only ways to store the required data are
(eventually flattened) arrays or some form of OpenCL compliant C structs (which are allowed).
The former is simpler, and was therefore chosen as the preferred method; in the (common)
case of multiple histograms in the input files, all the data is concatenated into a single array per
item type, simplifying the parallelisation process; the length of these arrays is therefore just the
total number of bins irrespective of histograms, and its value is referred to as binsLen in the code.
If the --parallel flag is detected, TopFitter needs to extract all the required data and cache
it on the parallel device. For the DataBin objects this is straightforward method calling; for the
IpolBin objects, in order to be computationally efficient, instead of using the value methods,
some internal IpolBin data structures not originally exposed by the API were exposed in a newer
version of Professor2, specifically for the benefit of the parallel TopFitter implementation.
This allows the caching of all the constant data items required to compute the equivalent values
for IpolBin objects, with the only remaining variables being the coordinates.
All the extracted data is either stored as Numpy arrays first and then transferred onto the
device or it is generated directly on the device if its size is known in advance; the only data types
used are Numpy’s own intc and float64, as they are guaranteed to be the equivalent of C’s int
and double.
The final arrays transferred on the device are the following (the leading “a” indicates that the
variable is an array) (from TopFitter/tf/kernelCode.py):
# Array lengths:
# aChi2s: gridLen
# aGrid: 2D Array (gridLen x polyDim)
# aCoorMins, aCoorMaxs: polyDim
# aDBVals, aDBErrs, aMaxErrs, aIpolRelErrs: binsLen
# aPolyCoeffss: 2D Array (binsLen x polyLen)
# aPolyStruct: 2D Array (polyLen x polyDim)
# aErrsNums: binsLen
# aErr0Coeffss: 2D Array (binsLen x err0Len)
# aErr0Struct: 2D Array (err0Len x polyDim)
# aErr1Coeffss: 2D Array (binsLen x err1Len)
# aErr1Struct: 2D Array (err1Len x polyDim)
- aChi2s is the result array, containing the chi squared value for each slice grid point (in total
gridLen items)
- aGrid is the array of N-tuples of coordinates, polyDim being the dimension of the
interpolation polynomial (= N)
- aCoorMins, aCoorMaxs are the IpolHisto minimum and maximum coordinate values
- aDBVals, aDBErrs, aMaxErrs and aIpolRelErrs are the readily available DataBin values
- aPolyCoeffss is the list of each of the polyLen polynomial terms’ coefficients for each of the
binsLen IpolBin objects’ interpolation polynomials
- aPolyStruct is a list of lists of 0s and 1s representing whether each of the polyDim
coordinates is a factor of each of the polyLen interpolation polynomial terms; this
structure is shared by all bins
- aErrsNums is a list of 0s, 1s or 2s representing the number of error interpolation polynomials
for each of the binsLen IpolBins
- aErr0Coeffss, aErr0Struct, aErr1Coeffss and aErr1Struct are the same structures as
aPolyCoeffss and aPolyStruct but for the error interpolation polynomials, of which there
might be 0, 1 or 2.
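Assuming the 0/1 encoding of aPolyStruct described above, a bin's interpolation value could be rebuilt from the cached structures roughly as follows. This is an illustrative sketch, not Professor2's actual implementation (in particular, real Professor2 polynomials may encode higher coordinate powers differently):

```python
def ipol_value(coords, poly_coeffs, poly_struct):
    """value = sum over terms t of coeff[t] * product over dims d of
    coord[d] ** struct[t][d], with struct entries restricted to 0 or 1."""
    total = 0.0
    for coeff, term in zip(poly_coeffs, poly_struct):
        prod = coeff
        for coord, present in zip(coords, term):
            if present:          # coordinate d is a factor of this term
                prod *= coord
        total += prod
    return total

# A 2-D example: p(x, y) = 3 + 2*x + 5*x*y
coeffs = [3.0, 2.0, 5.0]
struct = [[0, 0], [1, 0], [1, 1]]
print(ipol_value([2.0, 4.0], coeffs, struct))  # 3 + 4 + 40 = 47.0
```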
All the scalar values used in the computations, including the lengths of the above arrays, are not
passed directly to the kernel as arguments; in order to reduce I/O, they are instead procedurally
hardcoded into the C code strings, in the same way the precompiler would use #define directives.
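As an illustration of this trick, the scalar constants can be substituted into the kernel source before compilation; the template and names below are hypothetical, not the actual kernelCode.py contents:

```python
# Scalar sizes become compile-time constants in the generated kernel source
# instead of kernel arguments (names here are illustrative).
KERNEL_TEMPLATE = """
#define BINSLEN %(binsLen)d
#define POLYDIM %(polyDim)d

__kernel void scan(__global const double *aDBVals, __global double *aChi2s) {
    int gid = get_global_id(0);
    /* ... loop bounds use BINSLEN / POLYDIM as compile-time constants ... */
}
"""

def build_kernel_source(binsLen, polyDim):
    """Substitute the scalar values into the C code string."""
    return KERNEL_TEMPLATE % {"binsLen": binsLen, "polyDim": polyDim}

src = build_kernel_source(binsLen=1024, polyDim=7)
print("#define BINSLEN 1024" in src)  # True
```

Besides saving I/O, baking the sizes in lets the OpenCL compiler unroll or optimise the loops over them.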
3.6. Parallel Kernel and Preamble
Having cached all the internal Professor2 IpolBin data structures as arrays, the Python, Cython
and C++ computations making use of them also had to be replicated in OpenCL-restricted C and
implemented in the parallel kernel.
The main structural differences between the original code and the kernel’s C stem from the use
of modulo arithmetic in order to get to specific elements of flattened multidimensional arrays.
In the end, the parallel kernel replicates the full TopFitter scan, including all the Professor2
background calculations on IpolBin values.
The current version of the program implements a map-chi2-then-find-minimum algorithm,
meaning that the chi squared calculations’ results are computed in parallel and stored on the
device, and after they are all done, a second pass finds their minimum.
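The two-pass structure can be emulated on the host with Numpy (a toy chi squared stands in for the real one; on the device, pass 1 would run one work-item per grid point):

```python
import numpy as np

# aGrid: one coordinate tuple per slice-grid point (gridLen x polyDim)
grid = np.array([[0.5, 0.5], [1.0, 0.0], [0.1, 0.2]])

# Pass 1 (map): one chi2 per grid point, stored in the results array aChi2s.
aChi2s = (grid ** 2).sum(axis=1)     # stand-in chi-squared

# Pass 2: reduce the stored results to their minimum.
best = int(np.argmin(aChi2s))
print(best, round(float(aChi2s[best]), 6))  # 2 0.05
```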
A considerably more efficient version is formally complete but not yet working (therefore
commented out throughout the codebase) and is mentioned in section 5.
4. Project Results
4.1. Final Project File Structure
Main program: tf-scan2d-chi2 (Python, 304 lines)
Imports: chi2.py (Python, 55 lines)
[Professor2 chi squared functions]
Imports: dataExtraction.py (Python, 103 lines)
[Professor2 data extraction and histogram objects building]
Imports: debugEffects.py (Python, 61 lines)
[Generic debugging prints and graphs]
Imports: parallelScanning.py (Python, 305 lines)
[Professor2 histogram objects data extraction and transformation for PyOpenCL,
PyOpenCL context setup, data transfer to GPU, computation & result retrieval]
o Imports: kernelCode.py (Python, 30 lines; OpenCL-restricted C, 238 lines)
[Parallel Kernel OpenCL-restricted C code & minor precompiler instructions]
4.2. Performance comparison with original version
Table 2 below compares the original and parallel algorithm structures:
Table 2: Algorithm structures comparison
Performance-wise, on the 7-dimensional test data, the parallel version achieved a 3.5x speedup
compared with the original, and the time benefit increases (asymptotically to a limit imposed by
the GPU’s specifications) with the given load, i.e. when the actual processing per core takes
significantly longer than its I/O.
Unfortunately, at the end of the summer project (15/07/2016), the parallel version did not yet
work on the largest (intended) 12-dimensional data set, returning a null value for each point;
there is probably some memory-related bug occurring at runtime on the GPU.
5. Conclusion & Future Steps
The project was successful, and it is now one bug away from running on the intended data set,
which the original TopFitter could not run on at all.
Apart from fixing said bug, the next obvious step is to fix a very low-level bug passed through
OpenCL and PyOpenCL when trying to run a more efficient version of the algorithm (see section
3.6 for the current version): a map-chi2-reduce-with-minimum algorithm, in which no extra
memory has to be allocated for a results array, since each value is processed as soon as it is
ready. When a chi squared is calculated in parallel, it is immediately compared to the current
minimum and then either replaces it or is discarded; this will save 1/D of the grid memory
and a considerable amount of I/O.
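The idea can be sketched sequentially in Python; on the GPU this becomes a parallel reduction across work-groups, but the per-value logic is the same (names here are illustrative):

```python
def scan_with_running_min(points, chi2):
    """Consume each chi2 value as soon as it is produced: compare it to the
    running minimum and either keep it or discard it, so no results array
    is ever allocated."""
    best_val, best_point = float("inf"), None
    for p in points:
        v = chi2(p)
        if v < best_val:
            best_val, best_point = v, p
    return best_val, best_point

val, point = scan_with_running_min(
    [(1.0, 1.0), (0.2, 0.1), (3.0, 0.0)],
    lambda p: sum(c * c for c in p))
print(point, round(val, 6))  # (0.2, 0.1) 0.05
```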
Looking further ahead, TopFitter could easily become a universal tool for fitting data in
parallel, beyond top quarks, perhaps being distributed along with its co-developed project
Professor2.
References
[1] Buckley, A., Englert, C., Ferrando, J., Miller, D. J., Moore, L., Russell, M., and White, C. D. (2015) Global
fit of top quark effective theory to data. Physical Review D, 92, 091501(R)
[2] Buckley, A., Englert, C., Ferrando, J., Miller, D. J., Moore, L., Russell, M., and White, C. D. (2016)
Constraining top quark effective theory in the LHC run II era. Journal of High Energy Physics,
2016, 15. (doi:10.1007/JHEP04(2016)015)
[3] http://home.thep.lu.se/~leif/LHEF/
[4] http://professor.hepforge.org/
[5] https://mathema.tician.de/software/pyopencl/