Applications of Machine Learning at UCSB

APPLICATIONS OF MACHINE
LEARNING
AlexTellez + Amy Wang + H2OTeam
UC Santa Barbara, 4/6/15

AGENDA
1. Introduction to Big Data / ML
2. What is H2O.ai?
3. Use Cases:
4. Data Science Competition
a) Beat Bill Belichick
b) Fight Crime in Chicago
c) Ham/Spam Text Messages
d) Cycling Article Search

1. INTROTO BIG DATA / ML
BIG DATA IS LIKE TEENAGE SEX:
everyone talks about it,
nobody really knows how to do it,
everyone thinks everyone else is
doing it, so everyone claims
they are doing it…
Dan Ariely, Prof. @ Duke

BIGVS. SMALL DATA
When you try to open
file in excel, excel
CRASHES
SMALL = Data fits in RAM
BIG = Data does NOT fit in RAM
Basically…
Big Data is data too big
to process using conventional
methods
(e.g. excel, access)

V +V +V
Today, we have access to more data than we know what to do with!
1) Wearables (ﬁtbit, iWatch, etc)
2) Click streams from web visitors
3. Sensor readings
4. Social Media Outlets (e.g. twitter, facebook, etc)
Volume - Data volumes are becoming unmanageable
Variety - More data types being captured
Velocity - Data arrives rapidly and must
be processed / stored

THE HOPE OF BIG DATA
1. Data contains information of great business / personal value
Examples:
a) Predicting future stock movements = $$$
b) Netﬂix movie recommendations = Better experience = $$$
2. IF you can extract those insights from the data, you can make better
decisions
Enter, Machine Learning (ML)…
So how the hell do you do it?

MACHINE LEARNING
The Wikipedia Definition:
…a scientific discipline that explores the construction and study
of algorithms that can learn from data. Such algorithms operate
by building a model…. ZZZzzzzzZZZzzzzzz
My Definition:
The development, analysis, and application of algorithms that enable
machines to: make predictions and / or better understand data
2 Types of Learning:
SUPERVISED + UNSUPERVISED

SUPERVISED LEARNING
What is it?
Examples of supervised learning tasks:
1. ClassificationTasks - Benign / Malignant tumor
2. RegressionTasks - Predicting future stock market prices
3. Image Recognition - Highlighting faces in pictures
Methods that infer a function from labeled training data. Key task:
Predicting ________ . (Insert your task here)

UNSUPERVISED LEARNING
What is it?
Examples of unsupervised learning tasks:
1. Clustering - Discovering customer segments
2.Topic Extraction - What topics are people tweeting about?
3. Information Retrieval - IBM Watson: Question + Answer
Methods to understand the general structure of input data where
no predictions is needed.
4.Anomaly Detection - Detecting irregular heart-beats
NO CURATION NEEDED!

2.WHAT IS H2O?
What is H2O? (water, duh!)
It is ALSO an open-source, parallel processing engine for machine
learning.
What makes H2O different?
Cutting-edge algorithms + parallel architecture + ease-of-use
=
Happy Data Scientists / Analysts

TEAM @ H2O.AI
16,000 commits
H2O World Conference 2014

COMMUNITY REACH
120 meetups in 2014
11,000 installations
2,000 corporations
First Friday Hack-A-Thons

TRY IT!
Don’t take my word for it…www.h2o.ai
Simple Instructions
1. CD to Download Location
2. unzip h2o ﬁle
3. java -jar h2o.jar
4. Point browser to: localhost:54321
GUI
R

3. USE CASES (LOTS OF EM)
BEAT BILL BELICHICK

TB + BB
Bill Belichick Tom Brady
+ =
15 years together
3 Super Bowls

PASS OR RUN?
On any given offensive play…
Coach Bill can either call a PASS or a RUN
What determines this?
Game situation
Opposing team
Time remaining, etc, etc
Yards to go (until 1st down)
Basically, LOTS of stuff.
Personnel

BUT WHAT IF??
Question:
Can we try to predict whether the next play will be PASS or RUN
using historical data?
Approach:
Download every offensive play from Belichick-Brady era since 2000
Use various Machine Learning approaches to model PASS / RUN
Disclaimer: I’m not a Seahawks fan!
Extract known features to build model inputs

DATA COLLECTION
Data:
13 years of data (2002 -2013 season)
194 games total
14,547 total offensive plays (excludes punts, kickoffs, returns)
Response Variable: PASS / RUN
Model Inputs:
Quarter, Minutes, Seconds, OpposingTeam, Down, Distance,
Line of Scrimmage, NE-Score, OpposingTeam Score, Season,
Formation, Game Status (is NE losing / winning / tied)

FIGHTING CRIME IN CHICAGO
Spark + H2O

OPEN CITY, OPEN DATA
“…my kind of town” - F. Sinatra
~4.6 Million rows of crimes from 2001, updated weekly*
External data source considerations???
Weather Data ?U.S. Census
Data ?
Crime Data

ML WORKFLOW
1. Collect datasets (Crime + Weather + Census)
2. Do some feature extraction (e.g. dates, times)
3. Join Crime data Weather Data Census Data
4. Build deep learning model to predict
arrest / no arrest made
GOAL:
For a given crime,
predict if an arrest is
more / less likely to be made!

SPARK SQL + H2O RDD
3 table join using Spark SQL
Convert joined table to H2O RDD

HOW’D WE DO?
nice!
~ 10 mins

NEW:TEXT CLASSIFICATION
Text Processing in Spark + H2O Deep Learning!

HAM / SPAMTEXTS
Problem:
No one likes to be spammed. Can we look at text messages and
come up with a ham (real text) / spam classiﬁer using Spark feature
processing + h2o deep learning?
ML Workﬂow:
1.Tokenize words in text messages (1,024 texts)
2.Transform each text using Spark’s implementation of TF-IDF
3. ConvertTF-IDF Spark RDD H2O RDD
4. Run Deep Learning onTrain /Test Data

FEATURE EXTRACTION
Original Text:
“Ok…But they said i’ve got wisdom teeth hidden inside n mayb need 2
remove.”
Post Data Cleaning & Tokenization:
( but, they, said, got, wisdom, teeth, hidden, inside,
maybe, need, remove)
lower case
ignore stopwordsstrip punctuation
remove numbers

FEATURETRANSFORMATION
Post Data Cleaning & Tokenization:
Term Frequency - Inverse Document Frequency (TF-IDF)
1.TF - How often does “wisdom” occur in above text?
2. IDF - Normalization which calc’s frequency of “wisdom” across all
other text messages.
tf-idf(t, d) = tf(t, d) x idf(t) WHERE idf(t) = log(N / n)

SO…WHAT JUST HAPPENED?
0 , 0 , 0 , 0 , 1, 1, 0 , 0 , 1, 0 , 0 …, 0[ ]
wisdom teeth remove
Bag-O-Words
0 , 0 , 0 , 0 , 3.5, 2.9, 0 , 0 , 0.85, 0 , 0 …, 0
wisdom teeth remove
TF-IDF

DO IT LIVE!
Let’s ﬁre up H2O and run a model to predict ham / spam!

DEEP AUTOENCODERS + K-
MEANS EXAMPLE
Help cyclists with their health related questions!

CYCLING + __________
Problem:
New and Experienced Cyclists have questions about cycling + ______
(given topic). Let’s build a question + answer system to help!
ML Workﬂow:
1) Scrape thousands of article titles from internet about cycling /
cycling tips / cycling health, etc from various sources.
2) Build Bag-of-Words Dataset on article titles corpus
3) Reduce # of dimensions via deep autoencoder
4) Extract ‘last layer’ of deep features and cluster using k-means
5) Inspect Results!

BAG-OF-WORDS
Build dataset of cycling-related articles from various sources:
The Basics of Exercise Nutrition
0 , 0 , 0 , 0 , 1, 1, 0 , 0 , 1, 0 , 0 …, 0
basics exercise nutrition
lower case
remove ‘stopwords’
remove punctuation
Article Title
[ ]

DIMENSIONALITY
REDUCTION
Use deep autoencoder to reduce # features (~2,700 words!)
2,700Words
500hiddenfeatures
250H.F.
125H.F.
50
125H.F.
250H.F.
500hiddenfeatures
2,700Words
Decoder
Encoder
The Basics of
Exercise Nutrition

K-MEANS CLUSTERING
For each article: Extract ‘last’ layer of autoencoder (50 deep features)
The Basics of
Exercise Nutrition 50 ‘deep features’
The Basics of
Exercise Nutrition
-‐0.09330833 0.167881429 -‐0.234307408 0.247723639 -‐0.067700267 -‐0.094107866
DF1 DF2 DF3 DF4 DF5 DF6
K-Means Clustering
Inputs: Extracted 50 deep features for each cycling-related article
K = 50 clusters after grid-search of values

RESULT: CYCLING + A.I.
Now we inspect the clusters!
Test Article Title:
Fluid & Carbohydrate Ingestion Improve Performance During 1Hour of
Intense Exercise
Result:
Clustered w/ 17 other titles (out of ~5,700)
Top 5 similar titles within cluster:
Caffeine ingestion does not alter performance during a 100-km cycling time-trial performance
Immuno-endocrine response to cycling following ingestion of caffeine and carbohydrate
Metabolism and performance following carbohydrate ingestion late in exercise
Increases in cycling performance in response to caffeine ingestion are repeatable
Fluid ingestion does not inﬂuence intense 1-h exercise performance in a mild environment

HOWTO GET FASTER?
Test Article Title:
Muscle Coordination is Key to Power Output & Mechanical Efficiency of
Limb Movements
Result:
Clustered w/ 29 other titles (out of ~5,700)
Top 5 similar titles within cluster:
Muscle fibre type efficiency and mechanical optima affect freely chosen pedal rate during cycling.
Standard mechanical energy analyses do not correlate with muscle work in cycling.
The influence of body position on leg kinematics and muscle recruitment during cycling.
Influence of repeated sprint training on pulmonary O2 uptake and muscle deoxygenation kinetics in humans
Influence of pedaling rate on muscle mechanical energy in low power recumbent pedaling
using forward dynamic simulations

4. DATA SCIENCE
COMPETITION
Apply / Learn More @: apps.h2o.ai
Checkout ourYouTube Channel for last year’s talks @ H2O World

Applications of Machine Learning at UCSB

Recommended

Recommended

More Related Content

Similar to Applications of Machine Learning at UCSB

Similar to Applications of Machine Learning at UCSB (20)

More from Sri Ambati

More from Sri Ambati (20)

Recently uploaded

Recently uploaded (20)

Applications of Machine Learning at UCSB