APPLICATIONS OF MACHINE LEARNING
Alex Tellez + Amy Wang + H2O Team
UC Santa Barbara, 4/6/15
AGENDA
1. Introduction to Big Data / ML
2. What is H2O.ai?
3. Use Cases:
a) Beat Bill Belichick
b) Fight Crime in Chicago
c) Ham/Spam Text Messages
d) Cycling Article Search
4. Data Science Competition
1. INTRO TO BIG DATA / ML
BIG DATA IS LIKE TEENAGE SEX:
everyone talks about it,
nobody really knows how to do it,
everyone thinks everyone else is
doing it, so everyone claims
they are doing it…
Dan Ariely, Prof. @ Duke
BIG VS. SMALL DATA
When you try to open the file in Excel, Excel CRASHES
SMALL = Data fits in RAM
BIG = Data does NOT fit in RAM
Basically…
Big Data is data too big to process using conventional methods
(e.g. Excel, Access)
V + V + V
Today, we have access to more data than we know what to do with!
1. Wearables (Fitbit, iWatch, etc.)
2. Click streams from web visitors
3. Sensor readings
4. Social media outlets (e.g. Twitter, Facebook, etc.)
Volume - Data volumes are becoming unmanageable
Variety - More data types are being captured
Velocity - Data arrives rapidly and must be processed / stored
THE HOPE OF BIG DATA
1. Data contains information of great business / personal value
Examples:
a) Predicting future stock movements = $$$
b) Netflix movie recommendations = Better experience = $$$
2. IF you can extract those insights from the data, you can make better decisions
So how the hell do you do it?
Enter Machine Learning (ML)…
MACHINE LEARNING
The Wikipedia Definition:
…a scientific discipline that explores the construction and study
of algorithms that can learn from data. Such algorithms operate
by building a model…. ZZZzzzzzZZZzzzzzz
My Definition:
The development, analysis, and application of algorithms that enable
machines to: make predictions and / or better understand data
2 Types of Learning:
SUPERVISED + UNSUPERVISED
SUPERVISED LEARNING
What is it?
Methods that infer a function from labeled training data. Key task:
Predicting ________ . (Insert your task here)
Examples of supervised learning tasks:
1. Classification Tasks - Benign / Malignant tumor
2. Regression Tasks - Predicting future stock market prices
3. Image Recognition - Highlighting faces in pictures
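A minimal sketch of "inferring a function from labeled training data", using a 1-nearest-neighbor classifier in plain Python. The tumor measurements and labels are invented for illustration and do not come from the talk; `predict` is a hypothetical helper name.

```python
import math

# Labeled training data: (features, label). Values are invented for illustration.
# Features: (tumor size in cm, patient age in years).
TRAINING_DATA = [
    ((1.0, 35.0), "benign"),
    ((1.2, 42.0), "benign"),
    ((4.5, 60.0), "malignant"),
    ((5.1, 58.0), "malignant"),
]


def predict(features, training_data=TRAINING_DATA):
    """Return the label of the closest training example (1-nearest neighbor)."""
    _, nearest_label = min(
        training_data,
        key=lambda example: math.dist(example[0], features),
    )
    return nearest_label


print(predict((1.1, 40.0)))
print(predict((4.8, 61.0)))
```

The labels are what make this supervised: the function is learned from examples whose answers are already known.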
UNSUPERVISED LEARNING
What is it?
Methods to understand the general structure of input data where no prediction is needed.
NO CURATION NEEDED!
Examples of unsupervised learning tasks:
1. Clustering - Discovering customer segments
2. Topic Extraction - What topics are people tweeting about?
3. Information Retrieval - IBM Watson: Question + Answer
4. Anomaly Detection - Detecting irregular heartbeats
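As a rough illustration of an unsupervised task, the sketch below flags anomalies with a simple z-score rule. The heartbeat intervals are invented, and the z-score rule and the threshold of 2.0 are assumptions chosen for the example; the talk does not name a method.

```python
import statistics

# Seconds between consecutive heartbeats; values are invented for illustration.
BEAT_INTERVALS = [0.80, 0.82, 0.79, 0.81, 1.60, 0.80]


def find_anomalies(values, threshold=2.0):
    """Return indices whose z-score magnitude exceeds the threshold."""
    mean = statistics.fmean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return []  # all values identical, nothing stands out
    return [
        index
        for index, value in enumerate(values)
        if abs(value - mean) / spread > threshold
    ]


print(find_anomalies(BEAT_INTERVALS))
```

No labels are supplied anywhere: the unusual interval is found purely from the structure of the data.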
2. WHAT IS H2O?
What is H2O? (water, duh!)
It is ALSO an open-source, parallel processing engine for machine
learning.
What makes H2O different?
Cutting-edge algorithms + parallel architecture + ease-of-use = Happy Data Scientists / Analysts
TEAM @ H2O.AI
16,000 commits
H2O World Conference 2014
COMMUNITY REACH
120 meetups in 2014
11,000 installations
2,000 corporations
First Friday Hack-A-Thons
TRY IT!
Don’t take my word for it… www.h2o.ai
Simple Instructions
1. cd to the download location
2. Unzip the h2o file
3. java -jar h2o.jar
4. Point browser to: localhost:54321
Interfaces: GUI, R
3. USE CASES (LOTS OF EM)
BEAT BILL BELICHICK
TB + BB
Bill Belichick + Tom Brady = 15 years together, 3 Super Bowls
PASS OR RUN?
On any given offensive play…
Coach Bill can either call a PASS or a RUN
What determines this?
Game situation
Opposing team
Personnel
Yards to go (until 1st down)
Time remaining, etc., etc.
Basically, LOTS of stuff.
BUT WHAT IF??
Question:
Can we try to predict whether the next play will be PASS or RUN
using historical data?
Approach:
Download every offensive play from the Belichick-Brady era since 2000
Extract known features to build model inputs
Use various Machine Learning approaches to model PASS / RUN
Disclaimer: I’m not a Seahawks fan!
DATA COLLECTION
Data:
13 years of data (2002-2013 seasons)
194 games total
14,547 total offensive plays (excludes punts, kickoffs, returns)
Response Variable: PASS / RUN
Model Inputs:
Quarter, Minutes, Seconds, Opposing Team, Down, Distance,
Line of Scrimmage, NE Score, Opposing Team Score, Season,
Formation, Game Status (is NE losing / winning / tied)
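Some of the model inputs above can be derived from others. The sketch below shows one way Game Status might be computed from the two scores, and how Minutes and Seconds could be folded into a single clock value. The function names, dictionary keys, and play values are hypothetical; the talk does not show its feature code.

```python
def game_status(ne_score, opponent_score):
    """Derive the Game Status input: is New England losing, winning, or tied?"""
    if ne_score > opponent_score:
        return "winning"
    if ne_score < opponent_score:
        return "losing"
    return "tied"


def seconds_remaining_in_quarter(minutes, seconds):
    """Combine the Minutes and Seconds inputs into a single number."""
    return minutes * 60 + seconds


# One hypothetical play; the values are invented for illustration.
play = {
    "quarter": 4, "minutes": 2, "seconds": 5, "down": 3, "distance": 7,
    "ne_score": 17, "opponent_score": 21,
}
play["game_status"] = game_status(play["ne_score"], play["opponent_score"])
play["clock"] = seconds_remaining_in_quarter(play["minutes"], play["seconds"])
print(play["game_status"], play["clock"])
```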
FIGHTING CRIME IN CHICAGO
Spark + H2O
OPEN CITY, OPEN DATA
“…my kind of town” - F. Sinatra
~4.6 million rows of crimes since 2001, updated weekly*
Crime Data
External data source considerations???
Weather Data? U.S. Census Data?
ML WORKFLOW
1. Collect datasets (Crime + Weather + Census)
2. Do some feature extraction (e.g. dates, times)
3. Join Crime Data + Weather Data + Census Data
4. Build deep learning model to predict arrest / no arrest made
GOAL:
For a given crime, predict if an arrest is more / less likely to be made!
SPARK SQL + H2O RDD
3 table join using Spark SQL
Convert joined table to H2O RDD
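The talk performs the three-table join in Spark SQL. As a self-contained stand-in, the sketch below runs the same kind of join with Python's built-in `sqlite3` module. The table names, column names, and rows are all hypothetical, and the conversion to an H2O RDD is not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE crime (crime_id INTEGER, crime_date TEXT, area INTEGER, arrest INTEGER);
    CREATE TABLE weather (weather_date TEXT, mean_temp REAL);
    CREATE TABLE census (area INTEGER, per_capita_income INTEGER);

    INSERT INTO crime VALUES (1, '2015-01-01', 10, 1), (2, '2015-01-02', 20, 0);
    INSERT INTO weather VALUES ('2015-01-01', -3.5), ('2015-01-02', 1.0);
    INSERT INTO census VALUES (10, 25000), (20, 41000);
    """
)

# Three-table join: each crime row gains that day's weather and its area's census data.
rows = conn.execute(
    """
    SELECT c.crime_id, c.arrest, w.mean_temp, s.per_capita_income
    FROM crime AS c
    JOIN weather AS w ON w.weather_date = c.crime_date
    JOIN census AS s ON s.area = c.area
    ORDER BY c.crime_id
    """
).fetchall()

for row in rows:
    print(row)
```

Each joined row pairs the response (arrest / no arrest) with candidate predictors from the other two tables.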
HOW’D WE DO?
nice!
~ 10 mins
NEW: TEXT CLASSIFICATION
Text Processing in Spark + H2O Deep Learning!
HAM / SPAM TEXTS
Problem:
No one likes to be spammed. Can we look at text messages and
come up with a ham (real text) / spam classifier using Spark feature
processing + H2O deep learning?
ML Workflow:
1. Tokenize words in text messages (1,024 texts)
2. Transform each text using Spark’s implementation of TF-IDF
3. Convert TF-IDF Spark RDD to H2O RDD
4. Run Deep Learning on Train / Test Data
FEATURE EXTRACTION
Original Text:
“Ok…But they said i’ve got wisdom teeth hidden inside n mayb need 2
remove.”
Post Data Cleaning & Tokenization:
(but, they, said, got, wisdom, teeth, hidden, inside, maybe, need, remove)
Cleaning steps:
lower case
strip punctuation
remove numbers
ignore stopwords
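The cleaning steps can be sketched in a few lines of plain Python. This is not the Spark code from the talk: the stopword list is a tiny hypothetical one, and dropping single-letter tokens is an added assumption. The slide's output also normalizes "mayb" to "maybe"; spelling correction is left out here.

```python
import re

# Tiny hypothetical stopword list; a real pipeline would use a much larger one.
STOPWORDS = {"ok", "ve"}


def tokenize(text):
    """Lower-case, drop punctuation and digits, then filter stopwords."""
    letters_only = re.sub(r"[^a-z]+", " ", text.lower())
    return [
        token
        for token in letters_only.split()
        if len(token) > 1 and token not in STOPWORDS
    ]


message = "Ok...But they said i've got wisdom teeth hidden inside n mayb need 2 remove."
print(tokenize(message))
```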
FEATURE TRANSFORMATION
Post Data Cleaning & Tokenization:
(but, they, said, got, wisdom, teeth, hidden, inside, maybe, need, remove)
Term Frequency - Inverse Document Frequency (TF-IDF)
1. TF - How often does “wisdom” occur in the above text?
2. IDF - Normalization based on how many of the text messages contain “wisdom”.
tf-idf(t, d) = tf(t, d) x idf(t) WHERE idf(t) = log(N / n)
N = total number of text messages, n = number of text messages containing term t
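A small worked version of the formula, assuming raw counts for tf and the plain log(N / n) form of idf given on the slide. The second and third messages are invented for illustration. Spark's own IDF implementation documents a smoothed variant, log((N + 1) / (n + 1)), so its numbers will differ from this sketch.

```python
import math

# Three tokenized messages; the second and third are invented for illustration.
DOCUMENTS = [
    ["wisdom", "teeth", "remove"],
    ["free", "prize", "call"],
    ["call", "me", "teeth"],
]


def tf(term, document):
    """Raw count of the term in one document."""
    return document.count(term)


def idf(term, documents):
    """log(N / n): N documents in total, n of them containing the term."""
    containing = sum(1 for document in documents if term in document)
    if containing == 0:
        return 0.0  # assumption: unseen terms get zero weight
    return math.log(len(documents) / containing)


def tf_idf(term, document, documents):
    """Weight of a term in one document: tf(t, d) * idf(t)."""
    return tf(term, document) * idf(term, documents)


print(tf_idf("wisdom", DOCUMENTS[0], DOCUMENTS))  # rare term, higher weight
print(tf_idf("teeth", DOCUMENTS[0], DOCUMENTS))   # more common term, lower weight
```

"wisdom" appears in one of three messages and "teeth" in two, so "wisdom" receives the larger weight even though both occur once in the first message.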
SO…WHAT JUST HAPPENED?
(but, they, said, got, wisdom, teeth, hidden, inside, maybe, need, remove)
Bag-O-Words: [ 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, …, 0 ]
(nonzero entries: wisdom, teeth, remove)
TF-IDF: [ 0, 0, 0, 0, 3.5, 2.9, 0, 0, 0.85, 0, 0, …, 0 ]
(nonzero entries: wisdom, teeth, remove)
DO IT LIVE!
Let’s fire up H2O and run a model to predict ham / spam!
DEEP AUTOENCODERS + K-MEANS EXAMPLE
Help cyclists with their health-related questions!
CYCLING + __________
Problem:
New and Experienced Cyclists have questions about cycling + ______
(given topic). Let’s build a question + answer system to help!
ML Workflow:
1) Scrape thousands of article titles from the internet about cycling /
cycling tips / cycling health, etc. from various sources.
2) Build Bag-of-Words Dataset on article titles corpus
3) Reduce # of dimensions via deep autoencoder
4) Extract ‘last layer’ of deep features and cluster using k-means
5) Inspect Results!
BAG-OF-WORDS
Build dataset of cycling-related articles from various sources:
Article Title: The Basics of Exercise Nutrition
Cleaning: lower case, remove ‘stopwords’, remove punctuation
Tokens: basics, exercise, nutrition
Bag-of-words vector: [ 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, …, 0 ]
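A minimal sketch of building such a vector, assuming a 0/1 encoding as on the slide (a count-based encoding would also qualify as bag-of-words). The two-title corpus and the helper names are hypothetical; the real vocabulary in the talk has about 2,700 words.

```python
# Hypothetical mini-corpus of cleaned article titles (already lower-cased,
# with stopwords and punctuation removed).
TITLES = [
    ["basics", "exercise", "nutrition"],
    ["cycling", "nutrition", "tips"],
]


def build_vocabulary(tokenized_titles):
    """Map every distinct word to a fixed column index (sorted for stability)."""
    words = sorted({word for title in tokenized_titles for word in title})
    return {word: index for index, word in enumerate(words)}


def bag_of_words(tokens, vocabulary):
    """Return a 0/1 vector with a 1 in each column whose word appears."""
    vector = [0] * len(vocabulary)
    for token in tokens:
        if token in vocabulary:
            vector[vocabulary[token]] = 1
    return vector


vocabulary = build_vocabulary(TITLES)
print(vocabulary)
print(bag_of_words(TITLES[0], vocabulary))
```

Every title becomes a vector of the same length, one column per vocabulary word, which is why the feature count grows with the vocabulary.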
DIMENSIONALITY REDUCTION
Use deep autoencoder to reduce # features (~2,700 words!)
Input: The Basics of Exercise Nutrition
Encoder: 2,700 words -> 500 hidden features -> 250 H.F. -> 125 H.F. -> 50
Decoder: 50 -> 125 H.F. -> 250 H.F. -> 500 hidden features -> 2,700 words
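To get a feel for the size of this network, the sketch below counts its trainable parameters and the compression ratio at the 50-unit bottleneck. It assumes fully connected layers with one bias per unit, which the slide does not state, and it does not train anything.

```python
# Layer widths from the slide: input, encoder, bottleneck, decoder, output.
LAYER_SIZES = [2700, 500, 250, 125, 50, 125, 250, 500, 2700]


def count_parameters(layer_sizes):
    """Weights plus biases for fully connected layers of the given widths."""
    weights = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
    biases = sum(layer_sizes[1:])
    return weights + biases


bottleneck = min(LAYER_SIZES)
compression = LAYER_SIZES[0] / bottleneck

print(count_parameters(LAYER_SIZES))
print(bottleneck, compression)
```

The 50-unit middle layer is the compressed representation: each ~2,700-dimensional title vector is squeezed into 50 numbers before the decoder tries to reconstruct it.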
K-MEANS CLUSTERING
For each article: Extract ‘last’ layer of autoencoder (50 deep features)
The Basics of Exercise Nutrition -> 50 ‘deep features’ (first six shown):
DF1 = -0.09330833
DF2 = 0.167881429
DF3 = -0.234307408
DF4 = 0.247723639
DF5 = -0.067700267
DF6 = -0.094107866
K-Means Clustering
Inputs: Extracted 50 deep features for each cycling-related article
K = 50 clusters after grid-search of values
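A compact sketch of k-means (Lloyd's algorithm) in plain Python. It uses toy 2-D points rather than 50 deep features, takes the first k points as initial centroids for reproducibility (an assumption; production implementations usually use smarter initialization), and does not include the grid search over K.

```python
import math


def k_means(points, k, iterations=10):
    """Lloyd's algorithm; the first k points serve as initial centroids."""
    centroids = [list(point) for point in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(point, centroids[c]))
            for point in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(axis) / len(members) for axis in zip(*members)]
    return labels, centroids


# Toy 2-D "deep features"; real inputs would have 50 dimensions per article.
POINTS = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9), (0.1, 0.2), (4.9, 5.0)]
labels, centroids = k_means(POINTS, k=2)
print(labels)
```

Articles that receive the same label are the ones treated as "clustered together" when the results are inspected.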
RESULT: CYCLING + A.I.
Now we inspect the clusters!
Test Article Title:
Fluid & Carbohydrate Ingestion Improve Performance During 1 Hour of
Intense Exercise
Result:
Clustered w/ 17 other titles (out of ~5,700)
Top 5 similar titles within cluster:
Caffeine ingestion does not alter performance during a 100-km cycling time-trial performance
Immuno-endocrine response to cycling following ingestion of caffeine and carbohydrate
Metabolism and performance following carbohydrate ingestion late in exercise
Increases in cycling performance in response to caffeine ingestion are repeatable
Fluid ingestion does not influence intense 1-h exercise performance in a mild environment
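The slides do not say how the "top 5 similar titles" within a cluster were ranked. One plausible approach, shown here purely as an assumption, is to rank cluster members by cosine similarity between their feature vectors and the test article's vector. The titles and 3-D vectors below are placeholders.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_similar(query, candidates, n=5):
    """Return the n candidate titles whose vectors are closest to the query."""
    ranked = sorted(
        candidates,
        key=lambda title: cosine_similarity(query, candidates[title]),
        reverse=True,
    )
    return ranked[:n]


# Hypothetical 3-D feature vectors for titles in one cluster.
CLUSTER = {
    "title A": [0.9, 0.1, 0.0],
    "title B": [0.0, 1.0, 0.2],
    "title C": [0.8, 0.3, 0.1],
}
print(top_similar([1.0, 0.2, 0.0], CLUSTER, n=2))
```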
HOW TO GET FASTER?
Test Article Title:
Muscle Coordination is Key to Power Output & Mechanical Efficiency of
Limb Movements
Result:
Clustered w/ 29 other titles (out of ~5,700)
Top 5 similar titles within cluster:
Muscle fibre type efficiency and mechanical optima affect freely chosen pedal rate during cycling.
Standard mechanical energy analyses do not correlate with muscle work in cycling.
The influence of body position on leg kinematics and muscle recruitment during cycling.
Influence of repeated sprint training on pulmonary O2 uptake and muscle deoxygenation kinetics in humans
Influence of pedaling rate on muscle mechanical energy in low power recumbent pedaling
using forward dynamic simulations
4. DATA SCIENCE
COMPETITION
Apply / Learn More @: apps.h2o.ai
Check out our YouTube channel for last year’s talks @ H2O World
