APPLICATIONS OF MACHINE
LEARNING
Alex Tellez + Amy Wang + H2O Team
USC, 4/8/2015
AGENDA
1. Introduction to Big Data / ML
2. What is H2O.ai?
3. Use Cases:
a) Beat Bill Belichick
b) Fight Crime in Chicago
c) Whiskey Recommendation Engine
d) Bordeaux Wine Vintage
4. Data Science Competition
1. INTRO TO BIG DATA / ML
BIG DATA IS LIKE TEENAGE SEX:
everyone talks about it,
nobody really knows how to do it,
everyone thinks everyone else is
doing it, so everyone claims
they are doing it…
Dan Ariely, Prof. @ Duke
BIG VS. SMALL DATA
When you try to open the file in Excel, Excel CRASHES
SMALL = Data fits in RAM
BIG = Data does NOT fit in RAM
Basically…
Big Data is data too big to process using conventional methods
(e.g. Excel, Access)
V + V + V
Today, we have access to more data than we know what to do with!
1) Wearables (Fitbit, Apple Watch, etc.)
2) Click streams from web visitors
3) Sensor readings
4) Social media outlets (e.g. Twitter, Facebook, etc.)
Volume - Data volumes are becoming unmanageable
Variety - More data types are being captured
Velocity - Data arrives rapidly and must be processed / stored
THE HOPE OF BIG DATA
1. Data contains information of great business / personal value
Examples:
a) Predicting future stock movements = $$$
b) Netflix movie recommendations = Better experience = $$$
2. IF you can extract those insights from the data, you can make better
decisions
Enter, Machine Learning (ML)…
So how the hell do you do it?
MACHINE LEARNING
The Wikipedia Definition:
…a scientific discipline that explores the construction and study
of algorithms that can learn from data. Such algorithms operate
by building a model…. ZZZzzzzzZZZzzzzzz
My Definition:
The development, analysis, and application of algorithms that enable
machines to make predictions and / or better understand data
2 Types of Learning:
SUPERVISED + UNSUPERVISED
SUPERVISED LEARNING
What is it?
Methods that infer a function from labeled training data. Key task:
Predicting ________ . (Insert your task here)
Examples of supervised learning tasks:
1. Classification Tasks - Benign / Malignant tumor
2. Regression Tasks - Predicting future stock market prices
3. Image Recognition - Highlighting faces in pictures
UNSUPERVISED LEARNING
What is it?
Methods to understand the general structure of input data where no
prediction is needed. NO CURATION NEEDED!
Examples of unsupervised learning tasks:
1. Clustering - Discovering customer segments
2. Topic Extraction - What topics are people tweeting about?
3. Information Retrieval - IBM Watson: Question + Answer
4. Anomaly Detection - Detecting irregular heart-beats
2. WHAT IS H2O?
What is H2O? (water, duh!)
It is ALSO an open-source, parallel processing engine for machine
learning.
What makes H2O different?
Cutting-edge algorithms + parallel architecture + ease-of-use
=
Happy Data Scientists / Analysts
TEAM @ H2O.AI
16,000 commits
H2O World Conference 2014
COMMUNITY REACH
120 meetups in 2014
11,000 installations
2,000 corporations
First Friday Hack-A-Thons
TRY IT!
Don’t take my word for it… www.h2o.ai
Simple Instructions
1. cd to the download location
2. Unzip the H2O file
3. java -jar h2o.jar
4. Point your browser to: localhost:54321
Interfaces: web GUI or R (minimal R sketch below)
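If you prefer R to the web GUI, here is a minimal sketch of connecting to that same local instance with the h2o R package (a sketch under the assumption that the package is already installed, e.g. from CRAN; these are not the talk's exact commands):

library(h2o)
h2o.init()             # starts, or attaches to, H2O running on localhost:54321
hex <- as.h2o(iris)    # push an ordinary R data.frame into H2O as an H2OFrame
summary(hex)           # the summary is computed inside the H2O cluster, not in R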
3. USE CASES (LOTS OF ’EM)
BEAT BILL BELICHICK
TB + BB
Bill Belichick + Tom Brady =
15 years together
3 Super Bowls
PASS OR RUN?
On any given offensive play…
Coach Bill can either call a PASS or a RUN
What determines this?
Game situation
Opposing team
Time remaining
Yards to go (until 1st down)
Personnel
…etc, etc. Basically, LOTS of stuff.
BUT WHAT IF??
Question:
Can we try to predict whether the next play will be PASS or RUN
using historical data?
Approach:
1. Download every offensive play from the Belichick-Brady era since 2000
2. Extract known features to build model inputs
3. Use various Machine Learning approaches to model PASS / RUN
Disclaimer: I’m not a Seahawks fan!
DATA COLLECTION
Data:
13 years of data (2002 - 2013 seasons)
194 games total
14,547 total offensive plays (excludes punts, kickoffs, returns)
Response Variable: PASS / RUN
Model Inputs:
Quarter, Minutes, Seconds, Opposing Team, Down, Distance,
Line of Scrimmage, NE Score, Opposing Team Score, Season,
Formation, Game Status (is NE losing / winning / tied) - see the model sketch below
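The talk tries "various Machine Learning approaches" without naming one here, so purely as an illustrative sketch: a gradient boosted classifier via h2o.gbm, with a hypothetical file name and hypothetical column names (PlayType standing in for the PASS / RUN response):

library(h2o)
h2o.init()
plays <- h2o.importFile("ne_offense_2002_2013.csv")   # hypothetical file name
plays$PlayType <- as.factor(plays$PlayType)           # response: PASS / RUN
predictors <- c("Quarter", "Minutes", "Seconds", "OpposingTeam", "Down",
                "Distance", "LineOfScrimmage", "NEScore", "OppScore",
                "Season", "Formation", "GameStatus")   # assumed column names
splits <- h2o.splitFrame(plays, ratios = 0.8, seed = 1234)
fit <- h2o.gbm(x = predictors, y = "PlayType",
               training_frame = splits[[1]], validation_frame = splits[[2]])
h2o.confusionMatrix(fit, valid = TRUE)   # how often do we call PASS vs. RUN correctly?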
FIGHTING CRIME IN CHICAGO
Spark + H2O
OPEN CRIME DATA
Crime Dataset: Crimes from 2001 - Present Day
~ 4.6 million crimes
THE WINDY CITY
Harvest Chicago Weather data since 2001
SOCIOECONOMIC FACTORS
Crimes segmented into Community Area IDs
Percent of households below poverty, unemployed, etc.
SPARK + H2O
(Pipeline diagram: Crimes + Census + Weather data >> data munging >>
Spark SQL join >> Deep Learning >> evaluate models)
GOAL:
For a given crime, predict if an arrest is more / less likely to be made!
JOIN DATASETS
Using Spark, we join 3 datasets (crime data + weather data + census data)
together to make one mega dataset! (rough sketch of an equivalent join below)
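The talk performs this join with Spark SQL inside Sparkling Water. As a rough sketch of the same idea in plain R (the data.frames crime, weather, and census, and the key columns Date and Community_Area, are assumptions), before handing the result to H2O:

crime_weather <- merge(crime, weather, by = "Date")           # join crimes to same-day weather
full <- merge(crime_weather, census, by = "Community_Area")   # join to community-area census data

library(h2o)
h2o.init()
full_hex <- as.h2o(full)    # the "mega dataset" as an H2OFrame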
DATA VISUALIZATION
(Plots: arrest rate by season of the crime, by temperature during the crime,
and by the community the crime is committed in)
SPLIT DATA INTO TEST / TRAIN SETS
(Chart: arrest rate in the training set vs. the test set)
Train the model on this segment: 80% of the data
Validate the model on this segment: the remaining 20%
~40% of crimes lead to an arrest (split sketch below)
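Continuing the sketch above, the 80 / 20 split is one call in H2O's R API (the Arrest column name and its "true" label are assumptions):

splits <- h2o.splitFrame(full_hex, ratios = 0.8, seed = 42)   # 80% train / 20% test
train <- splits[[1]]
test  <- splits[[2]]
h2o.mean(train$Arrest == "true")   # sanity check: roughly 40% of crimes lead to an arrest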
DEEP LEARNING
Problem:
For a given crime, is an arrest more / less likely?
Deep Learning:
A multi-layer feed-forward neural network that starts with an input layer
(crime + weather data) followed by multiple layers of non-linear
transformations (model sketch below)
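A minimal sketch of that model with h2o.deeplearning; the hidden layer sizes, epochs, and the Arrest response column are assumptions, not the talk's exact settings:

dl <- h2o.deeplearning(x = setdiff(names(train), "Arrest"),   # crime + weather + census inputs
                       y = "Arrest",                          # binary response: was an arrest made?
                       training_frame = train,
                       validation_frame = test,
                       hidden = c(200, 200),                  # two hidden layers of non-linear units
                       epochs = 10)
h2o.auc(dl, valid = TRUE)   # how well the model separates arrest vs. no arrest on the 20% holdout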
HOW’D WE DO?
(Results screenshot) ~ 10 mins. Nice!
SINGLE-MALT SCOTCH
Single-Malt Scotch
A whiskey made at one particular distillery from a mash that only uses
malted grain (barley)
Solid Standards:
Must be aged at least 3 years in oak casks
Many famous distilleries are located in the northern regions of Scotland
OF COURSE, THERE’S A DATASET FOR THAT!
THE Single Malt Dataset
85 distilleries from Northern Scotland
12 descriptor features:
E.g. Sweetness, Smoky, Tobacco, Honey, Spicy, Malty, etc.
Each descriptor rated 0 (weak) to 4 (strong)
Problem:
Can we build a whiskey recommendation engine based on whiskeys I
have tried (and liked!) already?
DIMENSIONALITY
REDUCTION + K-MEANS
First, let’s reduce the 12 features to a lower-dimensional space using a
linear transformation (Principal Components Analysis)
7 principal components explain ~ 85% of the variance in the dataset
Then let’s use a clustering algorithm on the new PCA’d dataset to group
similar whiskeys
11 clusters are appropriate
Pipe out the cluster assignments and start buying whiskey! (sketch below)
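A sketch of that two-step pipeline in H2O's R API (the file name and descriptor column names are assumptions; the 7 components and 11 clusters come from the slide):

whiskey <- h2o.importFile("single_malt_descriptors.csv")   # hypothetical file name
flavors <- c("Sweetness", "Smoky", "Tobacco", "Honey",
             "Spicy", "Malty")                             # ... plus the other 6 descriptors
pca <- h2o.prcomp(training_frame = whiskey, x = flavors,
                  k = 7, transform = "STANDARDIZE")        # 7 PCs explain ~ 85% of the variance
scores <- h2o.predict(pca, whiskey)                        # project each distillery onto the PCs
km <- h2o.kmeans(training_frame = scores, k = 11)          # 11 clusters of similar whiskeys
clusters <- h2o.predict(km, scores)                        # cluster assignment per distillery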
MODEL RESULTS
I ENJOY:
OTHER WHISKEYS THAT CLUSTER WITH THESE:
OTHER POPULAR BRANDS
APPARENTLY, LOTS OF PEOPLE LIKE:
OTHER WHISKEYS THAT CLUSTER WITH THESE:
AUTOENCODER + H2O
(Diagram: information flows from the input layer (x1 … x4) through a smaller
hidden layer of learned features to an output layer that reconstructs x1 … x4)
Illustration: Dogs, Dogs and Dogs (sketch below)
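In H2O, an autoencoder is the same deep learning call with autoencoder = TRUE and no response column; the bottleneck layer size and epochs below are illustrative assumptions:

ae <- h2o.deeplearning(x = names(data_hex),       # reconstruct the inputs themselves (no y)
                       training_frame = data_hex, # data_hex: any H2OFrame of numeric features
                       autoencoder = TRUE,
                       hidden = c(2),             # narrow hidden layer: compress, then reconstruct
                       activation = "Tanh",
                       epochs = 100)
recon_mse <- h2o.anomaly(ae, data_hex)            # per-row reconstruction error (MSE)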
ANOMALY DETECTION OF VINTAGE YEAR BORDEAUX WINE
BORDEAUX WINE
Largest wine-growing region in France
700+ million bottles of wine produced per year!
Some years are better than others: Great ($$$) vs. Typical ($)
Last Great years: 2010, 2009, 2005, 2000
GREAT VS. TYPICAL VINTAGE?
Question:
Can we study weather patterns in Bordeaux leading up to harvest to identify
‘anomalous’ weather years that correlate with a Great ($$$) vs. Typical ($) vintage?
The Bordeaux Dataset (1952 - 2014 Yearly)
Amount of Winter Rain (Oct > Apr of harvest year)
Average Summer Temp (Apr > Sept of harvest year)
Rain during Harvest (Aug > Sept)
Years since last Great Vintage
AUTOENCODER + ANOMALY
DETECTION
Goal:
‘en primeur of en primeur’ - Can we use weather patterns to identify
anomalous years >> indicating great vintage quality?
ML Workflow (sketch below):
1) Train an autoencoder to learn the ‘typical’ vintage weather pattern
2) Append the ‘great’ vintage years’ weather data to the original dataset
3) IF a great vintage year’s weather data does NOT match the learned
weather pattern, the autoencoder will produce a high reconstruction
error (MSE)
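A sketch of that workflow in R; the file name, the column names, and the Vintage == "Typical" filter are assumptions, and the 0.10 cutoff is the one quoted on the results slide:

bordeaux <- h2o.importFile("bordeaux_weather_1952_2014.csv")     # hypothetical file name
typical  <- bordeaux[bordeaux$Vintage == "Typical", ]            # 1) train only on 'typical' years
ae <- h2o.deeplearning(x = c("WinterRain", "SummerTemp",
                             "HarvestRain", "YearsSinceGreat"),  # assumed feature names
                       training_frame = typical,
                       autoencoder = TRUE, hidden = c(2), epochs = 200)
mse <- h2o.anomaly(ae, bordeaux)                                 # 2-3) score every year, 1952 - 2014
bordeaux <- h2o.cbind(bordeaux, mse)                             # adds a Reconstruction.MSE column
great_candidates <- bordeaux[bordeaux$Reconstruction.MSE > 0.10, "Year"]   # flag anomalous vintages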
RESULTS (MSE > 0.10)
(Chart: Mean Square Error by vintage year)
Vintages with MSE > 0.10: 1961, 1982, 1989, 1990, 2000, 2005, 2009, 2010
2014 BORDEAUX??
(Chart: Mean Square Error for the 2013 and 2014 vintages) 2014 ?
4. DATA SCIENCE
COMPETITION
Apply / Learn More @: apps.h2o.ai
Check out our YouTube Channel for last year’s talks @ H2O World
