2. Introduction
Insight Data Science: developed
a machine learning pipeline in a
consulting project.
PhD Particle Physics, McGill
University, researcher on the Large
Hadron Collider.
Led the search for microscopic
black holes and exotic gravity
states in the ATLAS Collaboration.
3. Consulting Scenario
Company X wishes to maximize user engagement through
optimal placement of advertisements on content URLs.
Ad Type: Tourism -> Keywords: Cuba, Package Tour, Airplane
Ad Type X -> Keywords: Keyword 1, Keyword 2, Keyword 3, ..., Keyword N
Example: Tourism ads not ideal on investment content URL.
4. A Pipeline to Analyze Textual Features
Developed and implemented a pipeline to analyze the
importance of textual features on content URLs relative to
engagement.
Begin -> Scrape URL -> Process Text -> Model Features -> Extract Keywords -> Update Keywords
Loop: Collect Data, Reiterate
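The "Process Text" and "Extract Keywords" stages can be illustrated with a minimal sketch. It assumes plain-text input and a simple regex tokenizer; the names `tokenize`, `ngram_counts`, and `STOP_WORDS` are hypothetical and are not taken from the delivered pipeline.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real pipeline would likely use a larger one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "on", "for", "is"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens, dropping stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

def ngram_counts(tokens, n_max=2):
    """Count all n-grams of length 1..n_max in a token list."""
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

text = "Package tour to Cuba: book a package tour and an airplane ticket."
counts = ngram_counts(tokenize(text))
print(counts.most_common(3))
```

Because stop words are removed before n-grams are formed, bigrams in this sketch can span a dropped word; whether that is desirable depends on the application.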
6. Modeling
Attempted linear regression.
Classify engagement as yes/no.
[Figure: Logistic Classification Model. x-axis: Word Count (0 to 10); y-axis: Probability[%] (ticks 0 to 1); classes: Ad Clicked, Ad Not Clicked.]
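The yes/no classification of engagement against word count can be sketched as a one-feature logistic model. This is a minimal illustration fitted by plain gradient descent, not the deck's actual implementation; the toy data, learning rate, and function names are all assumptions made for the example.

```python
import math

def sigmoid(z):
    """Logistic function mapping a real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=5000):
    """Fit p(click | x) = sigmoid(w * x + b) by batch gradient descent on the log-loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum((sigmoid(w * x + b) - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy data (invented for illustration): keyword count per page, and whether the ad was clicked.
word_counts = [0, 1, 1, 2, 3, 4, 5, 6, 7, 9]
clicked = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

w, b = fit_logistic(word_counts, clicked)
for x in (0, 5, 10):
    print(f"word count {x:2d}: P(click) = {sigmoid(w * x + b):.2f}")
```

A positive fitted weight `w` means the click probability rises with the keyword count, which is the shape of the curve in the figure.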
7. Validation
Randomly split data into training/test sets.
- Distribution of validation scores (shown for 50/50 split).
[Figure: Ad Type 1, "Distribution of Precision vs Recall for 50.0% Test/Train". x-axis: Precision (0.55 to 0.85); y-axis: Recall (0.4 to 0.85); color scale: Number of MC Toys (0 to 80); marker: ⟨Precision, Recall⟩.]
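The validation procedure above (repeated random train/test splits, each yielding one precision/recall pair) can be sketched as follows. The stand-in threshold model, the synthetic data, and the names `run_toys`, `fit_threshold`, and `predict_threshold` are assumptions for illustration only; the deck's pipeline would use its own fitted classifier here.

```python
import random

def precision_recall(y_true, y_pred):
    """Precision = TP / (TP + FP), recall = TP / (TP + FN); 0.0 when undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def run_toys(data, fit, predict, n_toys=100, test_fraction=0.5, seed=0):
    """Repeat a random train/test split n_toys times and collect (precision, recall)."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_toys):
        shuffled = data[:]
        rng.shuffle(shuffled)
        n_test = int(len(shuffled) * test_fraction)
        test, train = shuffled[:n_test], shuffled[n_test:]
        model = fit(train)
        y_true = [y for _, y in test]
        y_pred = [predict(model, x) for x, _ in test]
        scores.append(precision_recall(y_true, y_pred))
    return scores

# Stand-in model (hypothetical): threshold halfway between the class means of the training set.
def fit_threshold(train):
    pos = [x for x, y in train if y == 1]
    neg = [x for x, y in train if y == 0]
    if not pos or not neg:  # degenerate split: fall back to the overall mean
        return sum(x for x, _ in train) / len(train)
    return 0.5 * (sum(pos) / len(pos) + sum(neg) / len(neg))

def predict_threshold(threshold, x):
    return 1 if x > threshold else 0

# Synthetic data (invented): engagement becomes more likely as the word count grows.
data_rng = random.Random(1)
data = []
for _ in range(200):
    x = data_rng.randint(0, 10)
    y = 1 if x + data_rng.gauss(0, 2) > 5 else 0
    data.append((x, y))

scores = run_toys(data, fit_threshold, predict_threshold)
mean_p = sum(p for p, _ in scores) / len(scores)
mean_r = sum(r for _, r in scores) / len(scores)
print(f"mean precision = {mean_p:.2f}, mean recall = {mean_r:.2f}")
```

Histogramming `scores` in two dimensions would give a plot of the kind shown on this slide, with the mean pair as the marker.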
8. Deliverables
Extracted keywords:
Rank  Ad Type 1  Ad Type 2       Ad Type 3    Ad Type 4
1     debt       coordinator     mortgage     gold
2     gift       administrative  home         0
3     profit     minimum         procurement  stock
4     check      minimum wage    loan         fund
5     balance    reports         trustee      event
Python pipeline delivered to the company for
implementation.
Project details: http://kuhanw.zohosites.com/.
10. Dissertation Project
Particle colliders recreate conditions in the early universe.
Searched for signatures of microscopic gravity at the Large Hadron
Collider.
11. The Large Hadron Collider
27 km ring, most powerful particle accelerator built to date.
- 13 TeV collisions.
ATLAS: a giant particle detector.
Black holes produced in collisions would evaporate, leaving debris inside the detector.
Figure 2.1 of 2008 JINST 3 S08001: Schematic layout of the LHC (Beam 1 clockwise, Beam 2 anticlockwise).
12. Data Processing
Developed complete analysis
pipeline in C++.
- Processed ∼10 TB of LHC data
using distributed computing
methods.
[Diagram: Raw Data From Detector -> Processed Data with Objects -> Analysis Data Structure -> Histogram Data for Final Fitting. Size labels: ~TB, ~100 GB, ~GB, ~TB, ~MB.]
13. Technical Analysis
[Figure labels: ~Energy of event (axis); Black Hole Signals; Background Prediction.]
Quantify compatibility with likelihood model.
\[ L(n_s \mid \mu, b, \theta) = P(n_s \mid s, \mu, b, \theta) \times \prod_i N_{\mathrm{syst}}(\theta^0, \theta, \sigma_\theta)_i \qquad (1) \]
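To make the structure of Eq. (1) concrete, the sketch below evaluates a log-likelihood of that form for a single counting channel. The slide does not define the terms, so everything specific here is an assumption: P is taken to be a Poisson probability with expectation `mu * s + b * (1 + sum(theta))`, each systematic term is taken to be a Gaussian constraint on a fractional background shift, and all function names are hypothetical.

```python
import math

def log_poisson(n, lam):
    """Log of the Poisson probability of observing n events given expectation lam > 0."""
    return n * math.log(lam) - lam - math.lgamma(n + 1)

def log_gauss(x, mean, sigma):
    """Log of the Gaussian density, used here as a nuisance-parameter constraint."""
    return -0.5 * ((x - mean) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))

def log_likelihood(n_s, mu, s, b, theta, theta0, sigma_theta):
    """Log of a likelihood with the product structure of Eq. (1).

    Assumption: each theta_i is a fractional shift of the background yield b,
    constrained around its nominal value theta0_i with width sigma_theta_i.
    """
    expected = mu * s + b * (1.0 + sum(theta))
    total = log_poisson(n_s, expected)
    for t, t0, sig in zip(theta, theta0, sigma_theta):
        total += log_gauss(t0, t, sig)
    return total

# Toy numbers (invented): 10 observed events, 10 expected background, 5 expected signal.
nominal = log_likelihood(10, 0.0, 5.0, 10.0, [0.0], [0.0], [0.1])
with_signal = log_likelihood(10, 1.0, 5.0, 10.0, [0.0], [0.0], [0.1])
print(f"log L (mu=0) = {nominal:.3f}, log L (mu=1) = {with_signal:.3f}")
```

With these toy numbers the background-only hypothesis (mu = 0) is preferred, since the observed count matches the background expectation.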
14. Results
Placed leading constraints on models of microscopic gravity physics.
[Figure labels: Models of n extra dimensions; Planck mass of theory; 95% CL Exclusion Contours; Black Hole Mass; Model Type.]
Public results: JHEP 07 (2015) 032, arXiv:1503.08988 [hep-ex]
17. Large Extra Spatial Dimensions
The size and number of extra spatial dimensions suppress the
observed gravitational strength.
Observed gravity is weaker than intrinsic gravity within the
bulk.
19. FeatureRank
Kuhan Wang, Insight Data Science
October 2, 2015
Abstract
FeatureRank is a software tool for extracting correlations between text
ngram features and user engagement, thereby optimizing the placement
of financial widgets on URL articles.
1 Directory Structure
• /
  processing.py
    Pre-processing to parse relevant information from engagement CSV files.
  crawl.py
    A simple web crawler that pulls the title and <p> tag text from URLs.
  FeatureRank.py
    Driver file to execute main functions.
  feature_extraction_model.py
    The core program that contains the machine learning algorithms.
  post_processing.py
    Post-processing to produce evaluation metrics and ngram rankings.
  web_text_data_set_1_2.json
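The crawler step described for crawl.py (pull the title and <p> tag text from a URL) can be sketched with the Python standard library alone. This is an illustrative sketch, not the contents of the actual crawl.py; the class and function names are hypothetical, and the page is assumed to be UTF-8 encoded.

```python
from html.parser import HTMLParser
from urllib.request import urlopen

class TitleAndParagraphParser(HTMLParser):
    """Collect the text inside <title> and <p> elements."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.paragraphs = []
        self._in_title = False
        self._in_p = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "p":
            self._in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "p":
            self._in_p = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._in_p:
            self.paragraphs[-1] += data

def extract_text(html):
    """Return (title, list of non-empty paragraph texts) from an HTML string."""
    parser = TitleAndParagraphParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip(), [p.strip() for p in parser.paragraphs if p.strip()]

def crawl(url, timeout=10):
    """Fetch a URL and return (title, paragraphs). Assumes a UTF-8 page."""
    with urlopen(url, timeout=timeout) as response:
        html = response.read().decode("utf-8", errors="replace")
    return extract_text(html)

# Offline example on an inline HTML string (invented for illustration).
sample = ("<html><head><title>Gold funds</title></head><body>"
          "<p>Buy <b>gold</b> now.</p><div>menu</div><p>Stock tips.</p></body></html>")
title, paragraphs = extract_text(sample)
print(title, paragraphs)
```

Text inside inline tags nested in a paragraph (such as `<b>`) is kept, while text outside `<p>` elements (such as the `<div>` above) is ignored.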
20. Precision vs Recall
[Figure: Ad Type 1, "Distribution of Precision vs Recall for 0.33% Test/Train". x-axis: Precision (0.55 to 0.85); y-axis: Recall (0.4 to 0.85); color scale: Number of MC Toys (0 to 25); marker: ⟨Precision, Recall⟩.]