Summer 2016 Internship – TopFitter Parallel Scan
Thomas Fletcher
Abstract: The TopFitter program calculates constraints on higher-dimensional operators
modelling deviations from the Standard Model (SM), specifically with regard to top quarks. The
aim of the summer project was to create a version of TopFitter which could be massively
parallelised on GPUs. The goal was accomplished, and a 3.5x speedup was achieved
compared to the original version running on the data used in previously published papers.
At the end of the internship an unidentified bug remained when running on the largest
data set, and a considerably more efficient version was formally complete but not yet working. As a separate task,
an analysis of events generated with models increasingly divergent from the SM yielded the
result that interference with SM interactions prevents detection of non-SM effects.
1. Introduction
1.1. Beyond the Standard Model
In the search for Beyond-Standard-Model implementations of the breaking of electroweak
symmetry, all data produced by the Large Hadron Collider (LHC) is usually parametrised with
model-independent parameters representing its deviation from Standard Model predictions[1][2].
So far the data has consistently matched these predictions (although not definitely excluding
new degrees of freedom at those energies), leading to the conclusion that, if present, larger
deviations will have to occur at higher energies.
In trying to parametrise all BSM interactions, the SM Lagrangian ℒ_SM becomes just the first term
in an infinite series of Lagrangian terms ℒ_i constructed from SM operators, constituting an
Effective Lagrangian ℒ_eff. Note that terms of higher mass dimension (not space-time dimension)
are suppressed by the high energy scale, represented by Λ in Equation 1 below:
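\[ \mathcal{L}_{\text{eff}} = \mathcal{L}_{SM} + \frac{1}{\Lambda}\mathcal{L}_1 + \frac{1}{\Lambda^2}\mathcal{L}_2 + \frac{1}{\Lambda^3}\mathcal{L}_3 + \cdots \]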
Equation 1: Effective Field Lagrangian [1][2]
Modelling the new physics with an infinite series of higher-dimensional effective operators is
one approach among others (such as anomalous couplings)[1][2]. It has the advantage of being
completely general, allowing the exploration of new physical effects without depending on
specific models covering wider spectra than required (because the higher-dimensional terms
are suppressed)[2], and also that of preserving the SM SU(3)_C × SU(2)_L × U(1)_Y gauge symmetry
(because the ℒ_i terms are combinations of SM operators)[1].
Furthermore, the infinite series collapses to a manageable finite number of terms by choosing
a dimension to model, making the simple assumptions of minimal flavour violation and baryon
number conservation and focusing on a specific set of observables[1][2].
The number of operators for dimension-six (where the relevant leading ℒeff contributions
appear) with the above (and more)[2] assumptions taken into account and focusing on top quarks
is just 14, making the Effective Lagrangian of the form in Equation 2 below:
\[ \mathcal{L}_{\text{eff}} = \mathcal{L}_{SM} + \frac{1}{\Lambda^2}\sum_i C_i O_i + \mathcal{O}(\Lambda^{-4}) \]
Equation 2: Specific Effective Field Lagrangian, where the Ci are arbitrary Wilson Coefficients and Oi the
14 relevant operators [1][2]
1.2. The TopFitter Collaboration
Given the great abundance of top quark data from LHC and Tevatron and their important role
in most Standard Model deviations, the TopFitter Collaboration was set up to compute constraints
on the operators which contribute to top quark events.
The Collaboration’s previous work constrained dimension-six operators contributing to single
and pair top quark production. The number of relevant operators (14), although greatly narrowed
down by the aforementioned (and more) assumptions and choices, was still not manageable by
the original TopFitter, which had to set at least half of them to 0 in order to be able to run[1][2].
Moving forward with this research we cannot afford to ignore further dimensions, but the
computation scaling represents a significant obstacle; this is why the original code needed to be
optimised and then either run on supercomputers or heavily parallelised and run on GPUs.
1.3. Data flow from source to TopFitter
The data TopFitter works on comes from a multi-step data flow through software packages
performing Monte Carlo events generation and analyses as shown in Diagram 1 below:
Diagram 1: Data Flow into TopFitter
2. Side Project: Colour Flow Analysis
2.1. Non-Standard Colour Flow
A separate task in the project was that of analysing event files in order to confirm that they indeed
contained the intended non-standard colour flows which were supposed to be generated by
Monte Carlo engines using models increasingly divergent from the Standard Model.
Colour has to be conserved between the inputs and outputs of Feynman diagram
vertices, in the same way as baryon number and similar quantities. Non-standard colour flow
occurs when the total colour before and after an event is not conserved, or when it is conserved
but the pairings and allocations of colours among the outputs do not reflect the standard event
reconstruction.
A comparison of a SM and a non-SM event resulting in non-SM colour flow can be seen in
Diagram 2 below, where the first Feynman diagram has a vertex producing a bottom quark and a
(colourless) W+ boson producing a colour-matched quark pair, while the second diagram has a
“black box” vertex in its place, directly outputting the bottom quark and a non-colour-matched
quark pair (note that, instead, it is the bottom quark which matches one of the pair quarks’ colour;
regardless of the match, total colour is not conserved).
Diagram 2: Comparison of Standard and Non-Standard events with an emphasis on colour flow
(See colour coded legend at the top of diagram for indices)
2.2. Analysis Results
A C++ program making use of the LHEF library[3] was written in order to isolate only the relevant
pairs of quarks from each event and then analyse their colours.
Two kinds of event files were fed to the program: some generated from only non-Standard
Models and others generated from both Standard and non-Standard Models at the same time.
Many instances of the expected non-Standard colour flows were found in the former, but
surprisingly, none of them were found in the latter.
A brief discussion with the Theory Group confirmed that this empirically found absence of non-
SM colour flows is backed by theory: the Standard Model interactions interfere heavily with the
non-SM ones, effectively preventing their effects from manifesting.
This implies that the flows in question will never be observed, meaning that a different
approach will have to be used to test these non-SM models.
3. Project Steps
3.1. TopFitter Structure
The general structure of the main TopFitter script (both before and after parallelisation) is the
following:
1. Extract data from the input files and package it into useful objects
2. Choose a pair of dimensions to slice through the given n-dimensional space
3. Generate a 2D slice grid in the chosen dimensions for each pixel of the intended output
4. Pre-scan each grid point with a marginalisation function in order to find local minima
5. Find the global minimum starting from the smallest local minimum
Step 4 involves picking evenly spaced points in the resulting (n-2)-dimensional space, and
computing the chi squared function on each of them, making it the most computationally
expensive step, and therefore the parallelisation target.
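The pre-scan in step 4 can be pictured with the following minimal sketch. It is not the TopFitter code: the function name `prescan_minimum`, its arguments and the toy chi squared used in the example are all hypothetical, and the refinement of step 5 (carried out with a minimiser in the real program) is left out. For one pixel, the two sliced dimensions are held fixed and evenly spaced points are scanned in the remaining dimensions.

```python
import itertools
import numpy as np

def prescan_minimum(chi2, fixed, lows, highs, points_per_dim=5):
    """Brute-force pre-scan of the dimensions left free by one pixel.

    chi2   : callable taking a full coordinate tuple (hypothetical stand-in)
    fixed  : dict {dimension index: value} for the two sliced dimensions
    lows, highs : scan bounds for every dimension
    """
    ndim = len(lows)
    free = [d for d in range(ndim) if d not in fixed]
    # Evenly spaced points along each free dimension.
    axes = [np.linspace(lows[d], highs[d], points_per_dim) for d in free]
    best_val, best_point = float("inf"), None
    for combo in itertools.product(*axes):   # points_per_dim ** len(free) points
        point = [0.0] * ndim
        for d, v in fixed.items():
            point[d] = v
        for d, v in zip(free, combo):
            point[d] = float(v)
        val = chi2(tuple(point))
        if val < best_val:
            best_val, best_point = val, tuple(point)
    return best_val, best_point

# Toy example: 4 dimensions, the first two fixed by the pixel.
toy_chi2 = lambda p: sum(x * x for x in p)
best = prescan_minimum(toy_chi2, {0: 1.0, 1: 2.0}, [-1.0] * 4, [1.0] * 4)
```

The nested product over the free dimensions is what makes this step grow so quickly with the number of dimensions.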
3.2. Complexity and Scaling
Table 1 below shows the step-by-step algorithmic complexity of TopFitter’s scans, i.e. how quickly
the computation increases for increasing input size.
| Definition | Expression |
|---|---|
| Dimensions | $D$ |
| Pixels = Number of Slice Grids | $P$ |
| Scanned Points per Dimension (= 5 by default) | $S$ |
| Number of Slice Grid Points | $N = S^{D-2}$ |
| Operations per Point | $O = f((D-2)!)$ |
| Total Operations | $T = P \cdot N \cdot O = P \cdot S^{D-2} \cdot f((D-2)!)$ |
Table 1: Algorithmic complexity of TopFitter’s scanning step
For reference, using P = 121 and S = 5:
• D = 7 → N = 3125 and O = f(120), making T = 378125 ∙ f(120)
• D = 12 → N = 9765625 and O = f(3628800), making T = 1181640625 ∙ f(3628800)
It is obvious that an algorithmic complexity of an exponential times a function of a factorial
leads to impossibly long computations extremely quickly, and even the GPU used for this project
(6 GB of RAM, 2816 Cores, 1.19 GHz Max Clock rate and 1024 Max Threads per block) will struggle
at D = 12.
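The reference figures above can be reproduced with a few lines of Python. This is only an arithmetic check of the expressions in Table 1; the helper name `scan_counts` is made up for the illustration, and the unknown function f is left unevaluated, so only its argument (D − 2)! is computed.

```python
import math

def scan_counts(D, P=121, S=5):
    """Return (N, P*N) for D dimensions, P pixels and S points per dimension."""
    N = S ** (D - 2)          # slice grid points per pixel
    return N, P * N           # T = P * N * f((D - 2)!)

for D in (7, 12):
    N, PN = scan_counts(D)
    print(f"D={D}: N={N}, T={PN} * f({math.factorial(D - 2)})")
```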
3.3. Used Frameworks
TopFitter is written in Python and makes use of the Professor2 package[4], a tuning tool for
Monte Carlo event generators written in Python, Cython and C++, which was developed alongside
TopFitter and shares some code with it. Professor2 is used to extract data from the statistical
analyses input files and present it in the form of a variety of easy-to-query histogram objects.
The marginalisation for each data point is carried out with the iMinuit package, which
needs very specific inputs (such as variable names known in advance), restricting the
flexibility of the whole codebase.
The parallelisation over GPU cores is made possible by the PyOpenCL package[5], which is an
interface to OpenCL drivers for multi-core hardware; the useful features of PyOpenCL are:
• It interfaces very well with the NumPy package (which TopFitter already makes use
of), allowing easy transfer of Python data onto the parallel device
• It removes most of the boilerplate required by the equivalent OpenCL C code, handling
all the set-up and environment details of the device
• It provides a few common parallel-algorithm-building tools which leave only the
innermost kernel of computation to be written by the user
OpenCL itself imposes many restrictions on the C code that can run on the device, the most
important ones being:
• Pointers to pointers (most importantly multidimensional arrays) are not allowed,
meaning that the user has to flatten data structures (PyOpenCL takes care of that
automatically for NumPy arrays) and then index into them with modulo arithmetic on their sizes
• Variable-length declarations are not allowed, meaning that if necessary the user has to
use the preprocessor or string substitution from the Python side in order to get around this
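The index arithmetic needed for flattened arrays can be sketched in NumPy as follows. The array shape and the helper names `flat_index` and `unflatten` are chosen only for illustration; the same row-major formula is what a kernel would apply to a flattened buffer.

```python
import numpy as np

# A small 2D array: 4 points x 3 coordinates (illustrative shape).
grid = np.arange(12, dtype=np.float64).reshape(4, 3)
flat = grid.ravel()   # row-major (C order) 1D version of the same data

def flat_index(row, col, ncols):
    """Position of element [row][col] inside the flattened array."""
    return row * ncols + col

def unflatten(k, ncols):
    """Recover (row, col) from a flat position using division and modulo."""
    return divmod(k, ncols)

value = flat[flat_index(2, 1, 3)]   # same element as grid[2, 1]
row_col = unflatten(7, 3)           # flat position 7 is row 2, column 1
```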
3.4. Generalisation to N dimensions
The original TopFitter had hardcoded blocks dealing with each specific number of input
dimensions because of the aforementioned requirements of iMinuit with regards to pre-emptive
input knowledge; the first task was therefore that of generalising the code to N dimensions.
This consisted in procedurally generating variable names and argument counts, with the
interface to iMinuit becoming a locally generated and executed code string containing a function
declaration with all the required behaviours, while in fact making use of more generic functions.
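A minimal sketch of this technique, using only the standard library, is shown below. It is not the TopFitter interface: `make_named_wrapper`, the `c` name prefix and the use of `sum` as the generic function are assumptions made for the example. The point is that a function with explicitly named arguments is generated as a string and executed, while the actual work is delegated to a generic function taking a tuple.

```python
import inspect

def make_named_wrapper(generic_fn, n_params, prefix="c"):
    """Build a function with n_params (>= 1) explicitly named arguments.

    The wrapper forwards its arguments as one tuple to generic_fn, so a tool
    that inspects parameter names sees c0, c1, ... while the real logic
    stays dimension-agnostic.
    """
    names = [f"{prefix}{i}" for i in range(n_params)]
    args = ", ".join(names)
    source = f"def wrapper({args}):\n    return generic_fn(({args},))\n"
    namespace = {"generic_fn": generic_fn}
    exec(source, namespace)          # define the wrapper from the code string
    return namespace["wrapper"], names

# Illustration with a trivial generic function.
wrapper, names = make_named_wrapper(sum, 3)
param_names = list(inspect.signature(wrapper).parameters)
```

Here `wrapper(1, 2, 3)` calls `sum((1, 2, 3))`, and the generated signature carries the procedurally created names.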
3.5. Data Extraction, Caching and Transfer to GPU
The marginalisation step uses data coming from two different Professor2 Histogram objects:
DataHisto and IpolHisto. The former contains static data for each bin, while the latter contains
all the interpolation information required to calculate a value for each n-dimensional coordinates
tuple, meaning that calling, for example, the value method on an IpolBin with the coordinates as
arguments triggers a series of computations going through Python, Cython, C++ and back again.
While the original TopFitter could afford to extract or calculate each data item from the
Professor2 Histogram objects in the same cycle as the marginalisation, the parallel version cannot
because all the data has to be cached on the GPU memory to be used by each core independently.
Since OpenCL does not allow pointers to pointers, the only ways to store the required data are
(possibly flattened) arrays or some form of OpenCL-compliant C structs (which are allowed).
The former is simpler, and was therefore chosen as the preferred method; in the (common)
case of multiple histograms in the input files, all the data is concatenated into a single array per
item type, simplifying the parallelisation process; the length of these arrays is therefore just the
total number of bins irrespective of histograms, and its value is referred to as binsLen in the code.
If the --parallel flag is detected, TopFitter needs to extract all the required data and cache
it on the parallel device. For the DataBin objects this is a matter of straightforward method
calls. For the IpolBin objects, using the value methods would be computationally inefficient, so
some internal IpolBin data structures not originally exposed by the API were exposed in a newer
version of Professor2 specifically for the benefit of the parallel TopFitter implementation.
This allows the caching of all the constant data items required to compute the equivalent values for
IpolBin objects, with the only variables left being the coordinates.
All the extracted data is either stored as NumPy arrays first and then transferred onto the
device, or generated directly on the device if its size is known in advance; the only data types
used are NumPy’s own intc and float64, as they are guaranteed to be the equivalent of C’s int
and double.
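The concatenation into one array per item type, with the data types mentioned above, could look like the following NumPy sketch. The input dictionaries and their numbers are invented for the example and do not come from Professor2; only the array names mirror those used in the report, and the transfer to the device is not shown.

```python
import numpy as np

# Hypothetical per-histogram bin values and errors (made-up numbers).
histos = [
    {"vals": [1.0, 2.0, 3.0], "errs": [0.1, 0.2, 0.3]},
    {"vals": [4.0, 5.0], "errs": [0.4, 0.5]},
]

# One flat array per item type, irrespective of which histogram a bin came from.
aDBVals = np.concatenate([np.asarray(h["vals"], dtype=np.float64) for h in histos])
aDBErrs = np.concatenate([np.asarray(h["errs"], dtype=np.float64) for h in histos])
binsLen = aDBVals.shape[0]                      # total number of bins

# Integer data uses intc so that it matches the C int type on the device.
aErrsNums = np.zeros(binsLen, dtype=np.intc)
```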
The final arrays transferred on the device are the following (the leading “a” indicates that the
variable is an array) (from TopFitter/tf/
# Array lengths:
# aChi2s: gridLen
# aGrid: 2D Array (gridLen x polyDim)
# aCoorMins, aCoorMaxs: polyDim
# aDBVals, aDBErrs, aMaxErrs, aIpolRelErrs: binsLen
# aPolyCoeffss: 2D Array (binsLen x polyLen)
# aPolyStruct: 2D Array (polyLen x polyDim)
# aErrsNums: binsLen
# aErr0Coeffss: 2D Array (binsLen x err0Len)
# aErr0Struct: 2D Array (err0Len x polyDim)
# aErr1Coeffss: 2D Array (binsLen x err1Len)
# aErr1Struct: 2D Array (err1Len x polyDim)
• aChi2s is the result array, containing the chi squared value for each slice grid point (in total
gridLen items)
• aGrid is the array of N-tuples of coordinates, polyDim being the dimension of the
interpolation polynomial (= N)
• aCoorMins, aCoorMaxs are the IpolHisto minimum and maximum coordinate values
• aDBVals, aDBErrs, aMaxErrs and aIpolRelErrs are the readily available DataBin values
• aPolyCoeffss is the list of the coefficients of each of the polyLen polynomial terms for each of the
binsLen IpolBin objects’ interpolation polynomials
• aPolyStruct is a list of lists of 0s and 1s representing whether each of the polyDim
coordinates is a factor of each of the polyLen interpolation polynomial terms; this
structure is shared by all bins
• aErrsNums is a list of 0s, 1s or 2s representing the number of error interpolation polynomials
for each of the binsLen IpolBins
• aErr0Coeffss, aErr0Struct, aErr1Coeffss and aErr1Struct are the same structures as
aPolyCoeffss and aPolyStruct but for the error interpolation polynomials, of which there
might be 0, 1 or 2.
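To make the roles of the coefficient and structure arrays concrete, the sketch below evaluates one bin's polynomial from flattened arrays. It takes the 0/1 description above at face value (a coordinate either is or is not a factor of a term) and is an assumption about how the arrays combine, not a copy of the Professor2 or kernel code; the function name `ipol_value` and the example numbers are invented.

```python
import numpy as np

def ipol_value(bin_idx, point, coeffs_flat, struct_flat, poly_len, poly_dim):
    """Evaluate sum_t coeff[bin, t] * prod_d point[d] ** struct[t, d]
    using flat-index arithmetic, with struct entries restricted to 0 or 1."""
    total = 0.0
    for t in range(poly_len):
        term = coeffs_flat[bin_idx * poly_len + t]
        for d in range(poly_dim):
            if struct_flat[t * poly_dim + d]:
                term *= point[d]
        total += term
    return total

# Two coordinates (x, y) and four terms: 1, x, y, x*y.
aPolyStruct = np.array([[0, 0], [1, 0], [0, 1], [1, 1]], dtype=np.intc).ravel()
# Two bins: 1 + 2x + 3y + 4xy, and just x.
aPolyCoeffss = np.array([[1, 2, 3, 4], [0, 1, 0, 0]], dtype=np.float64).ravel()

v0 = ipol_value(0, (2.0, 3.0), aPolyCoeffss, aPolyStruct, 4, 2)
v1 = ipol_value(1, (2.0, 3.0), aPolyCoeffss, aPolyStruct, 4, 2)
```

Because the structure array is shared by all bins, only the coefficients need to be stored per bin.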
All the scalar values used in the computations, including the lengths of the above arrays, are not
passed directly to the kernel as arguments but, in order to reduce I/O, are instead procedurally
hardcoded into the C code strings, in the same way the preprocessor would use #define directives.
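One way to hardcode scalars into the kernel source from the Python side is sketched below. The helper `build_preamble`, the constant names and the tiny kernel body are hypothetical and only illustrate the string handling; the resulting string is not compiled here, although in a PyOpenCL program it would typically be handed to `pyopencl.Program(context, source).build()`.

```python
def build_preamble(constants):
    """Turn {name: value} into a block of #define lines."""
    lines = [f"#define {name} {value}" for name, value in constants.items()]
    return "\n".join(lines) + "\n"

# Illustrative kernel body that relies on the hardcoded constants.
KERNEL_BODY = """
__kernel void fill(__global int *out) {
    int i = get_global_id(0);
    if (i < GRID_LEN) out[i] = POLY_DIM;
}
"""

source = build_preamble({"GRID_LEN": 3125, "POLY_DIM": 7}) + KERNEL_BODY
```

Since the constants are part of the source text, they never have to be sent to the device as kernel arguments.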
3.6. Parallel Kernel and Preamble
Having cached all the internal Professor2 IpolBin data structures as arrays, the Python, Cython
and C++ computations making use of them also had to be replicated in OpenCL-restricted C and
implemented in the parallel kernel.
The main structural differences between the original code and the kernel’s C stem from the use
of modulo arithmetic in order to get to specific elements of flattened multidimensional arrays.
In the end, the parallel kernel replicates the whole normal TopFitter scan, together with all the
Professor2 background calculations on IpolBin values.
The current version of the program implements a map-chi2-then-find-minimum algorithm,
meaning that the chi squared calculations’ results are computed in parallel and stored on the
device, and after they are all done, a second pass finds their minimum.
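The two-pass structure can be illustrated on the CPU with NumPy. This is a sequential sketch under stated assumptions, not the kernel: `chi2_map`, `find_minimum` and the toy linear model are invented for the example, and the chi squared is the usual sum of squared residuals divided by the errors.

```python
import numpy as np

def chi2_map(aGrid, data, errs, model):
    """First pass: one chi squared per grid point, stored in a results array."""
    aChi2s = np.empty(aGrid.shape[0], dtype=np.float64)
    for i, point in enumerate(aGrid):      # each iteration is independent
        residuals = (model(point) - data) / errs
        aChi2s[i] = np.sum(residuals ** 2)
    return aChi2s

def find_minimum(aChi2s):
    """Second pass: index and value of the smallest stored chi squared."""
    i = int(np.argmin(aChi2s))
    return i, float(aChi2s[i])

# Toy data: two bins and a model linear in a single coordinate.
aGrid = np.array([[0.0], [1.0], [2.0]])
data = np.array([1.0, 2.0])
errs = np.array([1.0, 1.0])
model = lambda p: p[0] * np.array([1.0, 2.0])

chi2s = chi2_map(aGrid, data, errs, model)
best = find_minimum(chi2s)
```

The independence of the iterations in the first pass is what allows them to be distributed over GPU cores.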
A considerably more efficient version is formally complete but not yet working (therefore
commented out throughout the codebase) and is mentioned in section 5.
4. Project Results
4.1. Final Project File Structure
Main program: tf-scan2d-chi2 (Python, 304 lines)
• Imports: (Python, 55 lines)
  [Professor2 chi squared functions]
• Imports: (Python, 103 lines)
  [Professor2 data extraction and histogram objects building]
• Imports: (Python, 61 lines)
  [Generic debugging prints and graphs]
• Imports: (Python, 305 lines)
  [Professor2 histogram objects data extraction and transformation for PyOpenCL,
  PyOpenCL context setup, data transfer to GPU, computation & result retrieval]
  ◦ Imports: (Python, 30 lines; OpenCL-restricted C, 238 lines)
    [Parallel Kernel OpenCL-restricted C code & minor preprocessor instructions]
4.2. Performance comparison with original version
Table 2 below compares the original and parallel algorithm structures:
Table 2: Algorithm structures comparison
Performance-wise, on the 7-dimensional test data, the parallel version achieved a 3.5x speedup
compared with the original. The time benefit increases with the given load (asymptotically, up to
a limit imposed by the GPU’s specifications), as the actual processing per core comes to take
significantly longer than its I/O.
Unfortunately, at the end of the summer project (15/07/2016), the parallel version did not yet
work on the largest (intended) 12-dimensional data set, returning a null value for each point;
the cause is probably a memory-related bug occurring at runtime on the GPU.
5. Conclusion & Future Steps
The project was successful: the parallel version is now one bug away from running on the intended
data set, which the original TopFitter could not process at all.
Apart from fixing said bug, the next obvious step is to fix a very low-level bug surfacing through
OpenCL and PyOpenCL when trying to run a more efficient version of the algorithm (see section
3.6 for the current version). This is a map-chi2-reduce-with-minimum algorithm, meaning that no
extra memory has to be allocated for a results array, since each value is processed as soon as it is
ready: when a chi squared is calculated in parallel, it is immediately compared to the current
minimum and then either replaces it or is discarded. This will save 1/D of the grid memory
and a considerable amount of I/O.
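The idea of reducing with a running minimum can be sketched sequentially as follows. A real parallel reduction on the device is organised differently, so this is only an illustration of the memory saving; the function name `chi2_min_reduce` and the toy chi squared are assumptions.

```python
def chi2_min_reduce(points, chi2):
    """Keep only the running minimum instead of storing every chi squared."""
    best_index, best_value = -1, float("inf")
    for i, point in enumerate(points):
        value = chi2(point)                 # computed, compared, then dropped
        if value < best_value:
            best_index, best_value = i, value
    return best_index, best_value

# Toy example: the points may even come from a generator, since nothing is stored.
points = ((float(x),) for x in range(3))
result = chi2_min_reduce(points, lambda p: (p[0] - 1.0) ** 2)
```

Only two scalars are kept at any time, which is why no results array of gridLen items is needed.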
Looking further ahead, TopFitter could easily become a universal tool for fitting data in
parallel, beyond top quarks, perhaps being distributed along with its co-developed project
[1] Buckley, A., Englert, C., Ferrando, J., Miller, D. J., Moore, L., Russell, M., and White, C. D. (2015) Global
fit of top quark effective theory to data. Physical Review D, 92, 091501(R)
[2] Buckley, A., Englert, C., Ferrando, J., Miller, D. J., Moore, L., Russell, M., and White, C. D. (2016)
Constraining top quark effective theory in the LHC run II era. Journal of High Energy Physics,
2016, 15. (doi:10.1007/JHEP04(2016)015)

IJCER ( International Journal of computational Engineerin...IJCER ( International Journal of computational Engineerin...
IJCER ( International Journal of computational Engineerin...ijceronline
Iaetsd implementation of power efficient iterative logarithmic multiplier usi...
Iaetsd implementation of power efficient iterative logarithmic multiplier usi...Iaetsd implementation of power efficient iterative logarithmic multiplier usi...
Iaetsd implementation of power efficient iterative logarithmic multiplier usi...
Iaetsd Iaetsd
Chapter 1 Data structure.pptx
Chapter 1 Data structure.pptxChapter 1 Data structure.pptx
Chapter 1 Data structure.pptx
Mmc manual
Mmc manualMmc manual
Mmc manual
Urvi Surat
Data Structure and Algorithm chapter two, This material is for Data Structure...
Data Structure and Algorithm chapter two, This material is for Data Structure...Data Structure and Algorithm chapter two, This material is for Data Structure...
Data Structure and Algorithm chapter two, This material is for Data Structure...

Similar to Project Report (20)

PRETZEL: Opening the Black Box of Machine Learning Prediction Serving Systems
PRETZEL: Opening the Black Box of Machine Learning Prediction Serving SystemsPRETZEL: Opening the Black Box of Machine Learning Prediction Serving Systems
PRETZEL: Opening the Black Box of Machine Learning Prediction Serving Systems
High Performance MAC Unit for FFT Implementation
High Performance MAC Unit for FFT Implementation High Performance MAC Unit for FFT Implementation
High Performance MAC Unit for FFT Implementation
High Speed Low Power Veterbi Decoder Design for TCM Decoders
High Speed Low Power Veterbi Decoder Design for TCM DecodersHigh Speed Low Power Veterbi Decoder Design for TCM Decoders
High Speed Low Power Veterbi Decoder Design for TCM Decoders
Data Structures and Algorithm Analysis
Data Structures  and  Algorithm AnalysisData Structures  and  Algorithm Analysis
Data Structures and Algorithm Analysis
Pretzel: optimized Machine Learning framework for low-latency and high throug...
Pretzel: optimized Machine Learning framework for low-latency and high throug...Pretzel: optimized Machine Learning framework for low-latency and high throug...
Pretzel: optimized Machine Learning framework for low-latency and high throug...
Implementation of an arithmetic logic using area efficient carry lookahead adder
Implementation of an arithmetic logic using area efficient carry lookahead adderImplementation of an arithmetic logic using area efficient carry lookahead adder
Implementation of an arithmetic logic using area efficient carry lookahead adder
A Highly Parallel Semi-Dataflow FPGA Architecture for Large-Scale N-Body Simu...
A Highly Parallel Semi-Dataflow FPGA Architecture for Large-Scale N-Body Simu...A Highly Parallel Semi-Dataflow FPGA Architecture for Large-Scale N-Body Simu...
A Highly Parallel Semi-Dataflow FPGA Architecture for Large-Scale N-Body Simu...
Design and Implementation of Test Vector Generation using Random Forest Techn...
Design and Implementation of Test Vector Generation using Random Forest Techn...Design and Implementation of Test Vector Generation using Random Forest Techn...
Design and Implementation of Test Vector Generation using Random Forest Techn...
An efficient hardware logarithm generator with modified quasi-symmetrical app...
An efficient hardware logarithm generator with modified quasi-symmetrical app...An efficient hardware logarithm generator with modified quasi-symmetrical app...
An efficient hardware logarithm generator with modified quasi-symmetrical app...
International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)
Design of High Speed and Low Power Veterbi Decoder for Trellis Coded Modulati...
Design of High Speed and Low Power Veterbi Decoder for Trellis Coded Modulati...Design of High Speed and Low Power Veterbi Decoder for Trellis Coded Modulati...
Design of High Speed and Low Power Veterbi Decoder for Trellis Coded Modulati...
IJCER ( International Journal of computational Engineerin...
IJCER ( International Journal of computational Engineerin...IJCER ( International Journal of computational Engineerin...
IJCER ( International Journal of computational Engineerin...
Iaetsd implementation of power efficient iterative logarithmic multiplier usi...
Iaetsd implementation of power efficient iterative logarithmic multiplier usi...Iaetsd implementation of power efficient iterative logarithmic multiplier usi...
Iaetsd implementation of power efficient iterative logarithmic multiplier usi...
Chapter 1 Data structure.pptx
Chapter 1 Data structure.pptxChapter 1 Data structure.pptx
Chapter 1 Data structure.pptx
Mmc manual
Mmc manualMmc manual
Mmc manual
Data Structure and Algorithm chapter two, This material is for Data Structure...
Data Structure and Algorithm chapter two, This material is for Data Structure...Data Structure and Algorithm chapter two, This material is for Data Structure...
Data Structure and Algorithm chapter two, This material is for Data Structure...

Project Report

Summer 2016 Internship – TopFitter Parallel Scan
Thomas Fletcher

Abstract: The TopFitter program calculates constraints on higher-dimensional operators modelling deviations from the Standard Model (SM), specifically with regard to top quarks. The aim of the summer project was to create a version of TopFitter which could be massively parallelised on GPUs. The goal was accomplished and a 3.5x speedup was achieved compared with the original version running on the data used in previously published papers. At the end of the internship there was still an unidentified bug when running on the largest data set, and a considerably more efficient version was formally complete but not yet working. As a separate task, an analysis of events generated with models increasingly divergent from the SM yielded the result that interference with SM interactions prevents detection of non-SM effects.

1. Introduction
1.1. Beyond the Standard Model
In the search for Beyond-Standard-Model (BSM) implementations of the breaking of electroweak symmetry, all data produced by the Large Hadron Collider (LHC) is usually parametrised with model-independent parameters representing its deviation from Standard Model predictions[1][2]. So far the data has consistently matched these predictions (although without definitively excluding new degrees of freedom at those energies), leading to the conclusion that, if present, larger deviations will have to occur at higher energies.
In trying to parametrise all BSM interactions, the SM Lagrangian \mathcal{L}_{SM} is just the first term in an infinite series of Lagrangian terms \mathcal{L}_i constructed from SM operators, constituting an Effective Lagrangian \mathcal{L}_{eff}.
Note that Lagrangian terms of higher mass dimension (not space-time dimension) are suppressed by inverse powers of a high energy scale, represented by Λ in Equation 1 below:

\mathcal{L}_{eff} = \mathcal{L}_{SM} + \frac{1}{\Lambda} \mathcal{L}_1 + \frac{1}{\Lambda^2} \mathcal{L}_2 + \frac{1}{\Lambda^3} \mathcal{L}_3 + \cdots

Equation 1: Effective Field Lagrangian [1][2]

Modelling the new physics with an infinite series of higher-dimensional effective operators is one of several approaches (others include anomalous couplings)[1][2]. It has the advantage of being completely general, allowing the exploration of new physical effects without depending on specific models covering a wider spectrum than required (because the higher-order terms are suppressed)[2]. It also preserves the SM SU(3)_C × SU(2)_L × U(1)_Y gauge symmetry (because the \mathcal{L}_i terms are combinations of SM operators)[1]. Furthermore, the infinite series collapses to a manageable finite number of terms by choosing a dimension to model, making the simple assumptions of minimal flavour violation and baryon number conservation, and focusing on a specific set of observables[1][2].
The number of dimension-six operators (where the relevant leading \mathcal{L}_{eff} contributions appear), with the above (and further)[2] assumptions taken into account and focusing on top quarks, is just 14, giving the Effective Lagrangian the form in Equation 2 below:

\mathcal{L}_{eff} = \mathcal{L}_{SM} + \frac{1}{\Lambda^2} \sum_i C_i O_i + \mathcal{O}(\Lambda^{-4})

Equation 2: Specific Effective Field Lagrangian, where the C_i are arbitrary Wilson coefficients and the O_i are the 14 relevant operators [1][2]

1.2. The TopFitter Collaboration
Given the great abundance of top quark data from the LHC and the Tevatron, and the important role top quarks play in most Standard Model deviations, the TopFitter Collaboration was set up to compute constraints on the operators which contribute to top quark events.
The Collaboration's previous work constrained dimension-six operators contributing to single and pair top quark production. The number of relevant operators (14), although greatly narrowed down by the aforementioned (and further) assumptions and choices, was still not manageable by the original TopFitter, which had to set at least half of them to 0 in order to be able to run.[1][2]
Moving forward with this research we cannot afford to ignore further dimensions, but the scaling of the computation is a significant obstacle. This is why the original code needed to be optimised and then either run on supercomputers or heavily parallelised and run on GPUs.

1.3. Data flow from source to TopFitter
The data TopFitter works on comes from a multi-step data flow through software packages performing Monte Carlo event generation and analyses, as shown in Diagram 1 below:

Diagram 1: Data Flow into TopFitter

2. Side Project: Colour Flow Analysis
2.1. Non-Standard Colour Flow
A separate task in the project was to analyse event files in order to confirm that they did contain the intended non-standard colour flows, which were supposed to be generated by Monte Carlo engines using models increasingly divergent from the Standard Model.
Colour has to be conserved between the inputs and outputs of Feynman diagram vertices in the same way as baryon number and similar quantities. Non-standard colour flow occurs when the total colour before and after an event is not conserved, or when it is conserved but the pairings and allocations of colours among the outputs do not reflect the standard event reconstruction. A comparison of an SM and a non-SM event resulting in non-SM colour flow can be seen in Diagram 2 below. The first Feynman diagram has a vertex producing a bottom quark and a (colourless) W+ boson, which in turn produces a colour-matched quark pair. The second diagram has a "black box" vertex in its place, directly outputting the bottom quark and a non-colour-matched quark pair (note that here it is the bottom quark which matches the colour of one of the pair quarks; regardless of the match, total colour is not conserved).

Diagram 2: Comparison of Standard and Non-Standard events with an emphasis on colour flow (see the colour-coded legend at the top of the diagram for indices)

2.2. Analysis Results
A C++ program making use of the LHEF library[3] was written to isolate only the relevant pairs of quarks from each event and then analyse their colours.
Two kinds of event files were fed to the program: some generated from non-Standard Models only, and others generated from both Standard and non-Standard Models at the same time. Many instances of the expected non-Standard colour flows were found in the former but, surprisingly, none were found in the latter.
A brief discussion with the Theory Group confirmed that this empirically found absence of non-SM colour flows is backed by theory: the Standard Model interactions interfere heavily with the non-SM ones, effectively preventing their effects from manifesting. This implies that the flows in question will never be observed, meaning that a different approach will have to be used to test these non-SM models.

3. Project Steps
3.1. TopFitter Structure
The general structure of the main TopFitter script (both before and after parallelisation) is the following:
1. Extract data from the input files and package it into useful objects
2. Choose a pair of dimensions to slice through the given n-dimensional space
3. Generate a 2D slice grid in the chosen dimensions for each pixel of the intended output
4. Pre-scan each grid point with a marginalisation function in order to find local minima
5. Find the global minimum starting from the smallest local minimum
Step 4 involves picking evenly spaced points in the resulting (n-2)-dimensional space and computing the chi squared function on each of them. This makes it the most computationally expensive step, and therefore the parallelisation target.

3.2. Complexity and Scaling
Table 1 below shows the step-by-step algorithmic complexity of TopFitter's scans, i.e. how quickly the computation grows with increasing input size.

Definition                                       Expression
Dimensions                                       D
Pixels (= number of slice grids)                 P
Scanned points per dimension (= 5 by default)    S
Number of slice grid points                      N = S^(D-2)
Operations per point                             O = f((D-2)!)
Total operations                                 T = P · N · O = P · S^(D-2) · f((D-2)!)

Table 1: Algorithmic complexity of TopFitter's scanning step

For reference, using P = 121 and S = 5:
• D = 7 → N = 3125 and O = f(120), making T = 378125 · f(120)
• D = 12 → N = 9765625 and O = f(3628800), making T = 1181640625 · f(3628800)
An algorithmic complexity of an exponential times a function of a factorial leads to impossibly long computations extremely quickly, and even the GPU used for this project (6 GB of RAM, 2816 cores, 1.19 GHz maximum clock rate and 1024 maximum threads per block) will struggle at D = 12.
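The reference figures above can be reproduced with a short calculation. The following is a minimal sketch, not part of TopFitter: the function name scan_cost is hypothetical, and since the report leaves f unspecified, only its argument (D-2)! is computed.

```python
from math import factorial


def scan_cost(pixels, points_per_dim, dims):
    """Return (N, factorial_argument, P * N) following Table 1.

    N is the number of slice grid points S^(D-2); the cost per point is
    f((D-2)!), where f is left unspecified, so only (D-2)! is returned.
    """
    if dims < 2:
        raise ValueError("a 2D slice needs at least 2 dimensions")
    n_points = points_per_dim ** (dims - 2)
    fact_arg = factorial(dims - 2)
    return n_points, fact_arg, pixels * n_points


if __name__ == "__main__":
    for d in (7, 12):
        n, fact_arg, total = scan_cost(121, 5, d)
        print(f"D = {d}: N = {n}, O = f({fact_arg}), T = {total} * f({fact_arg})")
```

Going from D = 7 to D = 12 multiplies the number of grid points by 5^5 = 3125 before the per-point cost is even considered.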
3.3. Used Frameworks
TopFitter is written in Python and makes use of the Professor2 package[4], a tuning tool for Monte Carlo event generators written in Python, Cython and C++, which was developed alongside TopFitter and shares some code with it. Professor2 is used to extract data from the statistical analyses' input files and present it as a variety of easy-to-query histogram objects.
The function marginalisation for each data point is carried out with the iMinuit package, which needs very specific inputs (such as variable names known in advance), restricting the flexibility of the whole codebase.
The parallelisation over GPU cores is made possible by the PyOpenCL package[5], an interface to OpenCL drivers for multi-core hardware. The useful features of PyOpenCL are:
• It interfaces very well with the NumPy package (which TopFitter already makes use of), allowing easy transfer of Python data onto the parallel device
• It removes most of the boilerplate of the equivalent OpenCL C code, handling all the set-up and environment details of the device
• It provides a few common parallel-algorithm-building tools which leave only the innermost kernel of computation to be written by the user
OpenCL itself imposes many restrictions on the C code that can run on the device, the most important ones being:
• Pointers to pointers (most importantly multidimensional arrays) are not allowed, meaning that the user has to flatten data structures (PyOpenCL takes care of that automatically for NumPy arrays) and then index them with size and modulo arithmetic.
• Variable-length declarations are not allowed, meaning that if necessary the user has to use the preprocessor, or string substitution from the Python side, to get around this.

3.4. Generalisation to N dimensions
The original TopFitter had hardcoded blocks dealing with each specific number of input dimensions, because of the aforementioned requirement of iMinuit to know its inputs in advance. The first task was therefore to generalise the code to N dimensions. This consisted of procedurally generating variable names and argument counts, with the interface to iMinuit becoming a locally generated and executed code string containing a function declaration with all the required behaviours, while in fact making use of more generic functions.

3.5. Data Extraction, Caching and Transfer to GPU
The marginalisation step uses data coming from two different Professor2 histogram objects: DataHisto and IpolHisto. The former contains static data for each bin, while the latter contains all the interpolation information required to calculate a value for each n-dimensional coordinate tuple. Calling, for example, the value method on an IpolBin with the coordinates as arguments therefore triggers a series of computations going through Python, Cython, C++ and back again.
While the original TopFitter could afford to extract or calculate each data item from the Professor2 histogram objects in the same cycle as the marginalisation, the parallel version cannot, because all the data has to be cached in GPU memory to be used by each core independently.
Since OpenCL does not allow pointers to pointers, the only ways to store the required data are (possibly flattened) arrays or some form of OpenCL-compliant C structs (which are allowed). The former is simpler, and was therefore chosen as the preferred method. In the (common) case of multiple histograms in the input files, all the data is concatenated into a single array per item type, simplifying the parallelisation process; the length of these arrays is therefore just the total number of bins irrespective of histograms, and its value is referred to as binsLen in the code.
If the --parallel flag is detected, TopFitter needs to extract all the required data and cache it on the parallel device. For the DataBin objects this is straightforward method calling. For the IpolBin objects, using the value methods would not be computationally efficient, so some internal IpolBin data structures that were not originally part of the API were exposed in a newer version of Professor2 specifically for the benefit of the parallel TopFitter implementation. This allows the caching of all the constant data items required to compute the equivalent values for IpolBin objects, with the only variables left being the coordinates.
All the extracted data is either stored as NumPy arrays first and then transferred onto the device, or generated directly on the device if its size is known in advance. The only data types used are NumPy's own intc and float64, as they are guaranteed to be the equivalent of C's int and double.
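The flattening and concatenation described above can be illustrated with a small sketch. It is written in pure Python for portability rather than with NumPy, and the helper names (flatten_rows, flat_index, unflat_index, concatenate_histograms) are hypothetical and do not come from the TopFitter codebase.

```python
def flatten_rows(rows):
    """Flatten a rectangular list of rows into one list; return (flat, n_cols)."""
    n_cols = len(rows[0])
    if any(len(row) != n_cols for row in rows):
        raise ValueError("all rows must have the same length")
    return [x for row in rows for x in row], n_cols


def flat_index(row, col, n_cols):
    """Position of element (row, col) inside the flattened, row-major array."""
    return row * n_cols + col


def unflat_index(i, n_cols):
    """Recover (row, col) from a flat position using division and modulo."""
    return divmod(i, n_cols)


def concatenate_histograms(histos):
    """Join the per-histogram bin values into a single array.

    The length of the result plays the role of binsLen in the text.
    """
    return [value for histo in histos for value in histo]


if __name__ == "__main__":
    flat, n_cols = flatten_rows([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
    print(flat[flat_index(1, 2, n_cols)])                 # element (1, 2)
    print(unflat_index(4, n_cols))                        # (row, col) of flat item 4
    print(len(concatenate_histograms([[1, 2], [3, 4, 5]])))
```

On the device the same index arithmetic is what replaces multidimensional array access, since only one-dimensional buffers are available.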
The final arrays transferred onto the device are the following (the leading "a" indicates that the variable is an array) (from TopFitter/tf/):

    # Array lengths:
    # aChi2s: gridLen
    # aGrid: 2D Array (gridLen x polyDim)
    # aCoorMins, aCoorMaxs: polyDim
    # aDBVals, aDBErrs, aMaxErrs, aIpolRelErrs: binsLen
    # aPolyCoeffss: 2D Array (binsLen x polyLen)
    # aPolyStruct: 2D Array (polyLen x polyDim)
    # aErrsNums: binsLen
    # aErr0Coeffss: 2D Array (binsLen x err0Len)
    # aErr0Struct: 2D Array (err0Len x polyDim)
    # aErr1Coeffss: 2D Array (binsLen x err1Len)
    # aErr1Struct: 2D Array (err1Len x polyDim)

• aChi2s is the result array, containing the chi squared value for each slice grid point (gridLen items in total)
• aGrid is the array of N-tuples of coordinates, polyDim being the dimension of the interpolation polynomial (= N)
• aCoorMins and aCoorMaxs are the IpolHisto minimum and maximum coordinate values
• aDBVals, aDBErrs, aMaxErrs and aIpolRelErrs are the readily available DataBin values
• aPolyCoeffss holds, for each of the binsLen IpolBin objects' interpolation polynomials, the coefficient of each of its polyLen terms
• aPolyStruct is a list of lists of 0s and 1s representing whether each of the polyDim coordinates is a factor of each of the polyLen interpolation polynomial terms; this structure is shared by all bins
• aErrsNums is a list of 0s, 1s or 2s representing the number of error interpolation polynomials for each of the binsLen IpolBins
• aErr0Coeffss, aErr0Struct, aErr1Coeffss and aErr1Struct are the same structures as aPolyCoeffss and aPolyStruct but for the error interpolation polynomials, of which there might be 0, 1 or 2.
All the scalar values used in the computations, including the lengths of the above arrays, are not passed to the kernel as arguments. Instead, in order to reduce I/O, they are procedurally hardcoded into the C code strings, in the same way the preprocessor would handle define directives.
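To make the roles of aPolyCoeffss and aPolyStruct concrete, here is a minimal sketch of evaluating one bin's interpolation polynomial from the flattened arrays. It assumes that each term is a coefficient multiplied by the coordinates flagged with 1 in the structure array, and it ignores any coordinate rescaling with aCoorMins and aCoorMaxs; the actual Professor2 computation may differ. The function name eval_poly_flat is hypothetical.

```python
def eval_poly_flat(coeffs, struct, point, bin_idx, poly_len, poly_dim):
    """Evaluate one bin's polynomial at `point` using flattened arrays.

    coeffs: flattened (binsLen x polyLen) array of term coefficients
    struct: flattened (polyLen x polyDim) array of 0/1 flags, shared by all bins
    Assumption: term k equals coeffs[bin, k] times every coordinate d for
    which struct[k, d] == 1.
    """
    total = 0.0
    for k in range(poly_len):
        term = coeffs[bin_idx * poly_len + k]   # row-major access to (bin, k)
        for d in range(poly_dim):
            if struct[k * poly_dim + d]:        # row-major access to (k, d)
                term *= point[d]
        total += term
    return total


if __name__ == "__main__":
    # Terms for two coordinates: 1, x0, x1, x0*x1
    struct = [0, 0,  1, 0,  0, 1,  1, 1]
    # Two bins with four coefficients each, stored back to back
    coeffs = [1.0, 2.0, 3.0, 4.0,
              0.0, 0.0, 0.0, 1.0]
    print(eval_poly_flat(coeffs, struct, (2.0, 3.0), 0, 4, 2))
    print(eval_poly_flat(coeffs, struct, (2.0, 3.0), 1, 4, 2))
```

Because the structure array is shared by all bins, only the coefficient array grows with binsLen.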
3.6. Parallel Kernel and Preamble
Having cached all the internal Professor2 IpolBin data structures as arrays, the Python, Cython and C++ computations making use of them also had to be replicated in OpenCL-restricted C and implemented in the parallel kernel.
The main structural differences between the original code and the kernel's C stem from the use of modulo arithmetic to reach specific elements of flattened multidimensional arrays.
In the end, the parallel kernel replicates the whole normal TopFitter scan, together with all the Professor2 background calculations on IpolBin values.
The current version of the program implements a map-chi2-then-find-minimum algorithm: the chi squared values are computed in parallel and stored on the device and, once they are all done, a second pass finds their minimum. A considerably more efficient version is formally complete but not yet working (and therefore commented out throughout the codebase); it is described in section 5.

4. Project Results
4.1. Final Project File Structure
Main program: tf-scan2d-chi2 (Python, 304 lines)
• Imports: (Python, 55 lines) [Professor2 chi squared functions]
• Imports: (Python, 103 lines) [Professor2 data extraction and histogram objects building]
• Imports: (Python, 61 lines) [Generic debugging prints and graphs]
• Imports: (Python, 305 lines) [Professor2 histogram objects data extraction and transformation for PyOpenCL, PyOpenCL context setup, data transfer to GPU, computation & result retrieval]
  o Imports: (Python, 30 lines; OpenCL-restricted C, 238 lines) [Parallel Kernel OpenCL-restricted C code & minor preprocessor directives]

4.2. Performance comparison with original version
Table 2 below compares the original and parallel algorithm structures:

Table 2: Algorithm structures comparison
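The map-chi2-then-find-minimum structure described in section 3.6 can be sketched serially as follows. This is a simplified illustration and not the kernel itself: the chi squared here uses only data values and errors (the real computation also involves the interpolation errors), the model callable is a hypothetical stand-in for the IpolBin evaluation, and the map step runs in a plain loop instead of in parallel on the device.

```python
def chi2(data_vals, data_errs, predictions):
    """Simplified chi squared: sum of squared, error-normalised residuals."""
    return sum(((d - p) / e) ** 2
               for d, e, p in zip(data_vals, data_errs, predictions))


def map_then_min(grid, model, data_vals, data_errs):
    """First pass stores one chi squared per grid point; second pass finds the minimum.

    `model(point)` is a hypothetical callable returning one prediction per bin.
    Returns (index of the best grid point, its chi squared).
    """
    # Pass 1 (the part run in parallel on the GPU): fill the results array
    chi2s = [chi2(data_vals, data_errs, model(point)) for point in grid]
    # Pass 2: scan the stored results for the smallest value
    best = min(range(len(chi2s)), key=chi2s.__getitem__)
    return best, chi2s[best]


if __name__ == "__main__":
    grid = [(0.0,), (1.0,), (2.0,)]
    model = lambda point: [point[0], 2.0 * point[0]]
    print(map_then_min(grid, model, [1.0, 2.0], [1.0, 1.0]))
```

The results list in the first pass corresponds to the aChi2s array, which is the extra memory the more efficient variant aims to avoid.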
Performance-wise, on the 7-dimensional test data, the parallel version achieved a 3.5x speedup compared with the original. The time benefit increases with the given load (asymptotically, up to a limit imposed by the GPU's specifications), as the actual processing per core comes to take significantly longer than its I/O.
Unfortunately, at the end of the summer project (15/07/2016), the parallel version did not yet work on the largest (intended) 12-dimensional data set, returning a null value for each point; the cause is probably a memory-related bug occurring at runtime on the GPU.

5. Conclusion & Future Steps
The project was successful, and it is now one bug away from running on the intended data set, which the original TopFitter could not run on at all.
Apart from fixing said bug, the next obvious step is to fix a very low-level bug passed through OpenCL and PyOpenCL when trying to run a more efficient version of the algorithm (see section 3.6 for the current version). This version is a map-chi2-reduce-with-minimum algorithm, in which no extra memory has to be allocated for a results array, since each value is processed as soon as it is ready: when a chi squared is calculated in parallel, it is immediately compared to the current minimum and then either replaces it or is discarded. This will save 1/D of the grid memory and a considerable amount of I/O.
Looking further ahead, TopFitter could easily become a universal tool for fitting data in parallel, beyond top quarks, perhaps being distributed along with its co-developed project Professor2.

References
[1] Buckley, A., Englert, C., Ferrando, J., Miller, D. J., Moore, L., Russell, M., and White, C. D. (2015) Global fit of top quark effective theory to data. Physical Review D, 92, 091501(R)
[2] Buckley, A., Englert, C., Ferrando, J., Miller, D. J., Moore, L., Russell, M., and White, C. D. (2016) Constraining top quark effective theory in the LHC run II era. Journal of High Energy Physics, 2016, 15. (doi:10.1007/JHEP04(2016)015)
[3]
[4]
[5]