Renaissance Technologies Presentation - Insight Data Science
Kuhan Wang
October 21st, 2015
1 / 20
Introduction
Insight Data Science: developed a machine learning pipeline in a consulting project.
PhD in Particle Physics, McGill University; researcher on the Large Hadron Collider.
Led the search for microscopic black holes and exotic gravity states in the ATLAS Collaboration.
2 / 20
Consulting Scenario
Company X wishes to maximize user engagement through
optimal placement of advertisements on content URLs.
Ad Type: Tourism
- Keyword: Cuba
- Keyword: Package Tour
- Keyword: Airplane
Ad Type X
- Keyword 1, Keyword 2, Keyword 3, ..., Keyword N
Example: Tourism ads not ideal on investment content URL.
3 / 20
A Pipeline to Analyze Textual Features
Developed and implemented a pipeline to analyze the importance of textual features on content URLs relative to engagement.
[Flowchart: Begin → Scrape URL → Process Text → Model Features → Extract Keywords → Update Keywords → Collect Data, Reiterate]
4 / 20
User Engagement Data
[Figure: Summary of Engagement Data. Axes: Occurrences, Counts. Categories: Page Loaded, Ad Viewed, Ad Clicked.]
5 / 20
Modeling
Attempted linear regression.
Classify engagement as yes/no.
- Features are bags of words from content URL.
[Figure: Logistic Classification Model. x-axis: Word Count (0 to 10); y-axis: Probability (0 to 1); classes: Ad Clicked, Ad Not Clicked.]
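The classification step on this slide can be illustrated with a minimal sketch. This is not the FeatureRank code: the toy documents, labels, learning rate, epoch count, and all function names are assumptions made for illustration. It builds bag-of-words counts, fits a logistic model by plain batch gradient descent, and reads off the words with the largest positive weights.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lower-case, split on whitespace, and count each word."""
    return Counter(text.lower().split())


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(docs, labels, epochs=500, lr=0.1):
    """Fit one weight per word plus a bias by batch gradient descent on log loss."""
    bags = [bag_of_words(d) for d in docs]
    vocab = sorted({w for bag in bags for w in bag})
    weights = {w: 0.0 for w in vocab}
    bias = 0.0
    n = len(docs)
    for _ in range(epochs):
        grad_w = {w: 0.0 for w in vocab}
        grad_b = 0.0
        for bag, y in zip(bags, labels):
            p = sigmoid(bias + sum(weights[w] * c for w, c in bag.items()))
            err = p - y  # derivative of the log loss with respect to the score
            grad_b += err
            for w, c in bag.items():
                grad_w[w] += err * c
        bias -= lr * grad_b / n
        for w in vocab:
            weights[w] -= lr * grad_w[w] / n
    return weights, bias


def predict_proba(text, weights, bias):
    """Probability of engagement (label 1) for a new piece of text."""
    bag = bag_of_words(text)
    return sigmoid(bias + sum(weights.get(w, 0.0) * c for w, c in bag.items()))


# Hypothetical toy data: 1 = ad clicked, 0 = ad not clicked.
docs = [
    "mortgage loan rates home",
    "home loan refinance mortgage",
    "football match score",
    "movie review actor",
]
labels = [1, 1, 0, 0]

weights, bias = train_logistic(docs, labels)
top_words = sorted(weights, key=weights.get, reverse=True)[:3]
print(top_words)
print(predict_proba("mortgage loan", weights, bias))
```

Ranking words by fitted weight is one simple way to turn a trained model into a keyword list; whether the original pipeline ranked keywords this way is not stated on the slides.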
6 / 20
Validation
Randomly split data into training/test sets.
- Distribution of validation scores (shown for 50/50 split).
[Figure: Distribution of Precision vs Recall for 50.0% Test/Train, Ad Type 1. x-axis: Precision (0.55 to 0.85); y-axis: Recall (0.4 to 0.85); z-axis: Number of MC Toys (0 to 80); marker: ⟨Precision, Recall⟩.]
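The repeated random-split validation can be sketched as follows. This is an illustrative outline only: the threshold "model", the toy samples, and every function name are hypothetical, and "toys" is read here as the number of random train/test splits, which is an interpretation of the plot's axis label.

```python
import random


def precision_recall(y_true, y_pred):
    """Precision and recall for binary labels (1 = engaged)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def random_split(samples, test_fraction, rng):
    """Shuffle a copy of the samples and cut it into (train, test)."""
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def validation_scores(samples, fit, n_toys=100, test_fraction=0.5, seed=0):
    """Repeat the random split n_toys times; return one (precision, recall) per split."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_toys):
        train, test = random_split(samples, test_fraction, rng)
        predict = fit(train)
        y_true = [y for _, y in test]
        y_pred = [predict(x) for x, _ in test]
        scores.append(precision_recall(y_true, y_pred))
    return scores


def fit_threshold(train):
    """Hypothetical stand-in model: cut halfway between the two class means."""
    pos = [x for x, y in train if y == 1]
    neg = [x for x, y in train if y == 0]
    if not pos or not neg:
        return lambda x: 1 if pos else 0
    cut = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1 if x > cut else 0


# Toy samples: (feature value, label).
samples = [(i, 1 if i >= 10 else 0) for i in range(20)]
scores = validation_scores(samples, fit_threshold, n_toys=50, test_fraction=0.5)
mean_precision = sum(p for p, _ in scores) / len(scores)
mean_recall = sum(r for _, r in scores) / len(scores)
print(mean_precision, mean_recall)
```

Histogramming the `scores` pairs in two dimensions would give a distribution of the kind shown in the figure, with the mean pair as the marker.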
7 / 20
Deliverables
Extracted keywords:
Rank | Ad Type 1 | Ad Type 2      | Ad Type 3   | Ad Type 4
1    | debt      | coordinator    | mortgage    | gold
2    | gift      | administrative | home        | 0
3    | profit    | minimum        | procurement | stock
4    | check     | minimum wage   | loan        | fund
5    | balance   | reports        | trustee     | event
Python pipeline delivered to the company for implementation.
Project details: http://kuhanw.zohosites.com/.
8 / 20
9 / 20
Dissertation Project
Particle colliders recreate conditions in the early universe.
Searched for signatures of microscopic gravity at the Large Hadron Collider.
10 / 20
The Large Hadron Collider
27 km ring, most powerful particle accelerator built to date.
- 13 TeV collisions.
ATLAS is one of two major general-purpose experiments.
Black holes produced in collisions would evaporate inside the detector, leaving debris.
[Figure 2.1: Schematic layout of the LHC (Beam 1 clockwise, Beam 2 anticlockwise). Source: 2008 JINST 3 S08001.]
11 / 20
Data Processing
Developed a complete analysis pipeline in C++.
- Processed ∼10 TB of LHC data using distributed computing methods.
[Diagram: Raw Data From Detector → Processed Data with Objects → Analysis Data Structure → Histogram Data for Final Fitting. Size labels in the diagram: ~TB, ~100 GB, ~GB, ~TB, ~MB.]
12 / 20
Technical Analysis
[Figure: x-axis: ~Energy of event. Annotations: Black Hole Signals, Background Prediction.]
Quantify compatibility with a likelihood model.
$$L(n_s \mid \mu, b, \theta) = P(n_s \mid s, \mu, b, \theta) \times \prod_i N_{\mathrm{syst}}(\theta^0, \theta, \sigma_\theta)_i \qquad (1)$$
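To make the structure of Equation (1) concrete, here is a single-bin sketch: a Poisson term for the observed count multiplied by one Gaussian constraint per systematic. The way the nuisance parameters enter the expected count (a fractional shift of the background) is an assumption chosen for illustration, not the parametrization used in the analysis, and all function names are hypothetical.

```python
import math


def poisson_pmf(n, lam):
    """Poisson probability of observing n events given expectation lam > 0."""
    return math.exp(n * math.log(lam) - lam - math.lgamma(n + 1))


def gaussian_pdf(x, mean, sigma):
    """Normal density, used here as the constraint term for one systematic."""
    norm = sigma * math.sqrt(2.0 * math.pi)
    return math.exp(-0.5 * ((x - mean) / sigma) ** 2) / norm


def likelihood(n_obs, mu, s, b, thetas, theta0s, sigmas):
    """Single-bin likelihood: Poisson term times a product of Gaussian constraints.

    ASSUMPTION: each nuisance parameter theta_i shifts the background by a
    fraction theta_i. The slide does not specify this dependence.
    """
    expected = mu * s + b * (1.0 + sum(thetas))
    value = poisson_pmf(n_obs, expected)
    for theta, theta0, sigma in zip(thetas, theta0s, sigmas):
        value *= gaussian_pdf(theta0, theta, sigma)
    return value


# Background-only example (mu = 0): nominal versus shifted nuisance parameter.
nominal = likelihood(10, 0.0, 5.0, 10.0, [0.0], [0.0], [0.1])
shifted = likelihood(10, 0.0, 5.0, 10.0, [0.3], [0.0], [0.1])
print(nominal, shifted)
```

In this example the shifted configuration is penalized twice: the Poisson term drops because the expectation moves away from the observed count, and the Gaussian constraint drops because the parameter moves away from its nominal value.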
13 / 20
Results
Placed leading constraints on models of microscopic gravity physics.
[Figure: 95% CL exclusion contours. Annotations: Models of n extra dimensions, Planck mass of theory, Black Hole Mass, Model Type.]
Public results: JHEP 07 (2015) 032, arXiv:1503.08988 [hep-ex]
14 / 20
ATLAS Detector
15 / 20
Thank you for your time.
16 / 20
Large Extra Spatial Dimensions
The size and number of extra spatial dimensions suppress the observed gravitational strength.
Observed gravity is weaker than intrinsic gravity within the bulk.
17 / 20
Backup
[Figure: Ad Type 1. x-axis: Feature Frequency/Documents (0 to 1); y-axis: Relative Number of Documents [%] (log scale, 10^-4 to 1).]
18 / 20
FeatureRank
Kuhan Wang (Insight Data Science)
October 2, 2015
Abstract
FeatureRank is a software tool for extracting correlations between text
ngram features and user engagement, thereby optimizing the placement
of financial widgets on URL articles.
1 Directory Structure
• /
  - processing.py: Pre-processing to parse relevant information from engagement csv files.
  - crawl.py: A simple web crawler that pulls the title and <p> tag text from URLs.
  - FeatureRank.py: Driver file to execute main functions.
  - feature_extraction_model.py: The core program that contains the machine learning algorithms.
  - post_processing.py: Post-processing to produce evaluation metrics and ngram rankings.
  - web_text_data_set_1_2.json
19 / 20
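As a rough illustration of what a crawler step like the one described for crawl.py might do, the sketch below pulls the title and paragraph text out of an HTML string using only the Python standard library. It is not the contents of crawl.py; the class name, function name, and sample HTML are hypothetical.

```python
from html.parser import HTMLParser


class TitleAndParagraphParser(HTMLParser):
    """Collect the <title> text and the text inside each <p> element."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.paragraphs = []
        self._in_title = False
        self._in_p = False
        self._current = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "p":
            self._in_p = True
            self._current = []

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "p" and self._in_p:
            self._in_p = False
            # Join the text fragments and normalize whitespace.
            text = " ".join("".join(self._current).split())
            if text:
                self.paragraphs.append(text)

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._in_p:
            self._current.append(data)


def extract_title_and_text(html):
    """Return (title, list of paragraph strings) for an HTML document."""
    parser = TitleAndParagraphParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip(), parser.paragraphs


# Hypothetical page content; a real crawler would download this from a URL.
sample_html = (
    "<html><head><title>Gold prices</title></head><body>"
    "<p>Gold rose <b>sharply</b> today.</p><div>nav</div>"
    "<p>Funds followed.</p></body></html>"
)
title, paragraphs = extract_title_and_text(sample_html)
print(title)
print(paragraphs)
```

In practice the HTML string could be fetched with `urllib.request.urlopen(url).read().decode()` before being passed to `extract_title_and_text`; the download step is left out here so the example runs offline.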
[Figure: Distribution of Precision vs Recall for 0.33% Test/Train, Ad Type 1. x-axis: Precision (0.55 to 0.85); y-axis: Recall (0.4 to 0.85); z-axis: Number of MC Toys (0 to 25); marker: ⟨Precision, Recall⟩.]
20 / 20
