SlideShare a Scribd company logo
APPLICATIONS OF MACHINE
LEARNING
AlexTellez + Amy Wang + H2OTeam
UC Santa Barbara, 4/6/15
AGENDA
1. Introduction to Big Data / ML
2. What is H2O.ai?
3. Use Cases:
4. Data Science Competition
a) Beat Bill Belichick
b) Fight Crime in Chicago
c) Ham/Spam Text Messages
d) Cycling Article Search
1. INTROTO BIG DATA / ML
BIG DATA IS LIKE TEENAGE SEX:
everyone talks about it,
nobody really knows how to do it,
everyone thinks everyone else is
doing it, so everyone claims
they are doing it…
Dan Ariely, Prof. @ Duke
BIGVS. SMALL DATA
When you try to open
file in excel, excel
CRASHES
SMALL = Data fits in RAM
BIG = Data does NOT fit in RAM
Basically…
Big Data is data too big
to process using conventional
methods
(e.g. excel, access)
V +V +V
Today, we have access to more data than we know what to do with!
1) Wearables (fitbit, iWatch, etc)
2) Click streams from web visitors
3. Sensor readings
4. Social Media Outlets (e.g. twitter, facebook, etc)
Volume - Data volumes are becoming unmanageable
Variety - More data types being captured
Velocity - Data arrives rapidly and must
be processed / stored
THE HOPE OF BIG DATA
1. Data contains information of great business / personal value
Examples:
a) Predicting future stock movements = $$$
b) Netflix movie recommendations = Better experience = $$$
2. IF you can extract those insights from the data, you can make better
decisions
Enter, Machine Learning (ML)…
So how the hell do you do it?
MACHINE LEARNING
The Wikipedia Definition:
…a scientific discipline that explores the construction and study
of algorithms that can learn from data. Such algorithms operate
by building a model…. ZZZzzzzzZZZzzzzzz
My Definition:
The development, analysis, and application of algorithms that enable
machines to: make predictions and / or better understand data
2 Types of Learning:
SUPERVISED + UNSUPERVISED
SUPERVISED LEARNING
What is it?
Examples of supervised learning tasks:
1. ClassificationTasks - Benign / Malignant tumor
2. RegressionTasks - Predicting future stock market prices
3. Image Recognition - Highlighting faces in pictures
Methods that infer a function from labeled training data. Key task:
Predicting ________ . (Insert your task here)
UNSUPERVISED LEARNING
What is it?
Examples of unsupervised learning tasks:
1. Clustering - Discovering customer segments
2.Topic Extraction - What topics are people tweeting about?
3. Information Retrieval - IBM Watson: Question + Answer
Methods to understand the general structure of input data where
no predictions is needed.
4.Anomaly Detection - Detecting irregular heart-beats
NO CURATION NEEDED!
2.WHAT IS H2O?
What is H2O? (water, duh!)
It is ALSO an open-source, parallel processing engine for machine
learning.
What makes H2O different?
Cutting-edge algorithms + parallel architecture + ease-of-use
=
Happy Data Scientists / Analysts
TEAM @ H2O.AI
16,000 commits
H2O World Conference 2014
COMMUNITY REACH
120 meetups in 2014
11,000 installations
2,000 corporations
First Friday Hack-A-Thons
TRY IT!
Don’t take my word for it…www.h2o.ai
Simple Instructions
1. CD to Download Location
2. unzip h2o file
3. java -jar h2o.jar
4. Point browser to: localhost:54321
GUI
R
3. USE CASES (LOTS OF EM)
BEAT BILL BELICHICK
TB + BB
Bill Belichick Tom Brady
+ =
15 years together
3 Super Bowls
PASS OR RUN?
On any given offensive play…
Coach Bill can either call a PASS or a RUN
What determines this?
Game situation
Opposing team
Time remaining, etc, etc
Yards to go (until 1st down)
Basically, LOTS of stuff.
Personnel
BUT WHAT IF??
Question:
Can we try to predict whether the next play will be PASS or RUN
using historical data?
Approach:
Download every offensive play from Belichick-Brady era since 2000
Use various Machine Learning approaches to model PASS / RUN
Disclaimer: I’m not a Seahawks fan!
Extract known features to build model inputs
DATA COLLECTION
Data:
13 years of data (2002 -2013 season)
194 games total
14,547 total offensive plays (excludes punts, kickoffs, returns)
Response Variable: PASS / RUN
Model Inputs:
Quarter, Minutes, Seconds, OpposingTeam, Down, Distance,
Line of Scrimmage, NE-Score, OpposingTeam Score, Season,
Formation, Game Status (is NE losing / winning / tied)
FIGHTING CRIME IN CHICAGO
Spark + H2O
OPEN CITY, OPEN DATA
“…my kind of town” - F. Sinatra
~4.6 Million rows of crimes from 2001, updated weekly*
External data source considerations???
Weather Data ?U.S. Census
Data ?
Crime Data
ML WORKFLOW
1. Collect datasets (Crime + Weather + Census)
2. Do some feature extraction (e.g. dates, times)
3. Join Crime data Weather Data Census Data
4. Build deep learning model to predict
arrest / no arrest made
GOAL:
For a given crime,
predict if an arrest is
more / less likely to be made!
SPARK SQL + H2O RDD
3 table join using Spark SQL
Convert joined table to H2O RDD
HOW’D WE DO?
nice!
~ 10 mins
NEW:TEXT CLASSIFICATION
Text Processing in Spark + H2O Deep Learning!
HAM / SPAMTEXTS
Problem:
No one likes to be spammed. Can we look at text messages and
come up with a ham (real text) / spam classifier using Spark feature
processing + h2o deep learning?
ML Workflow:
1.Tokenize words in text messages (1,024 texts)
2.Transform each text using Spark’s implementation of TF-IDF
3. ConvertTF-IDF Spark RDD H2O RDD
4. Run Deep Learning onTrain /Test Data
FEATURE EXTRACTION
Original Text:
“Ok…But they said i’ve got wisdom teeth hidden inside n mayb need 2
remove.”
Post Data Cleaning & Tokenization:
( but, they, said, got, wisdom, teeth, hidden, inside,
maybe, need, remove)
lower case
ignore stopwordsstrip punctuation
remove numbers
FEATURETRANSFORMATION
Post Data Cleaning & Tokenization:
( but, they, said, got, wisdom, teeth, hidden, inside,
maybe, need, remove)
Term Frequency - Inverse Document Frequency (TF-IDF)
1.TF - How often does “wisdom” occur in above text?
2. IDF - Normalization which calc’s frequency of “wisdom” across all
other text messages.
tf-idf(t, d) = tf(t, d) x idf(t) WHERE idf(t) = log(N / n)
SO…WHAT JUST HAPPENED?
0 , 0 , 0 , 0 , 1, 1, 0 , 0 , 1, 0 , 0 …, 0[ ]
( but, they, said, got, wisdom, teeth, hidden, inside,
maybe, need, remove)
wisdom teeth remove
Bag-O-Words
0 , 0 , 0 , 0 , 3.5, 2.9, 0 , 0 , 0.85, 0 , 0 …, 0
wisdom teeth remove
TF-IDF
DO IT LIVE!
Let’s fire up H2O and run a model to predict ham / spam!
DEEP AUTOENCODERS + K-
MEANS EXAMPLE
Help cyclists with their health related questions!
CYCLING + __________
Problem:
New and Experienced Cyclists have questions about cycling + ______
(given topic). Let’s build a question + answer system to help!
ML Workflow:
1) Scrape thousands of article titles from internet about cycling /
cycling tips / cycling health, etc from various sources.
2) Build Bag-of-Words Dataset on article titles corpus
3) Reduce # of dimensions via deep autoencoder
4) Extract ‘last layer’ of deep features and cluster using k-means
5) Inspect Results!
BAG-OF-WORDS
Build dataset of cycling-related articles from various sources:
The Basics of Exercise Nutrition
0 , 0 , 0 , 0 , 1, 1, 0 , 0 , 1, 0 , 0 …, 0
basics exercise nutrition
lower case
remove ‘stopwords’
remove punctuation
Article Title
[ ]
DIMENSIONALITY
REDUCTION
Use deep autoencoder to reduce # features (~2,700 words!)
2,700Words
500hiddenfeatures
250H.F.
125H.F.
50
125H.F.
250H.F.
500hiddenfeatures
2,700Words
Decoder
Encoder
The Basics of
Exercise Nutrition
K-MEANS CLUSTERING
For each article: Extract ‘last’ layer of autoencoder (50 deep features)
The Basics of
Exercise Nutrition 50 ‘deep features’
The Basics of
Exercise Nutrition
-­‐0.09330833 0.167881429 -­‐0.234307408 0.247723639 -­‐0.067700267 -­‐0.094107866
DF1 DF2 DF3 DF4 DF5 DF6
K-Means Clustering
Inputs: Extracted 50 deep features for each cycling-related article
K = 50 clusters after grid-search of values
RESULT: CYCLING + A.I.
Now we inspect the clusters!
Test Article Title:
Fluid & Carbohydrate Ingestion Improve Performance During 1Hour of
Intense Exercise
Result:
Clustered w/ 17 other titles (out of ~5,700)
Top 5 similar titles within cluster:
Caffeine ingestion does not alter performance during a 100-km cycling time-trial performance
Immuno-endocrine response to cycling following ingestion of caffeine and carbohydrate
Metabolism and performance following carbohydrate ingestion late in exercise
Increases in cycling performance in response to caffeine ingestion are repeatable
Fluid ingestion does not influence intense 1-h exercise performance in a mild environment
HOWTO GET FASTER?
Test Article Title:
Muscle Coordination is Key to Power Output & Mechanical Efficiency of
Limb Movements
Result:
Clustered w/ 29 other titles (out of ~5,700)
Top 5 similar titles within cluster:
Muscle fibre type efficiency and mechanical optima affect freely chosen pedal rate during cycling.
Standard mechanical energy analyses do not correlate with muscle work in cycling.
The influence of body position on leg kinematics and muscle recruitment during cycling.
Influence of repeated sprint training on pulmonary O2 uptake and muscle deoxygenation kinetics in humans
Influence of pedaling rate on muscle mechanical energy in low power recumbent pedaling
using forward dynamic simulations
4. DATA SCIENCE
COMPETITION
Apply / Learn More @: apps.h2o.ai
Checkout ourYouTube Channel for last year’s talks @ H2O World

More Related Content

Similar to Applications of Machine Learning at UCSB

SQL is Dead; Long Live SQL: Lightweight Query Services for Long Tail Science
SQL is Dead; Long Live SQL: Lightweight Query Services for Long Tail ScienceSQL is Dead; Long Live SQL: Lightweight Query Services for Long Tail Science
SQL is Dead; Long Live SQL: Lightweight Query Services for Long Tail Science
University of Washington
 
Machine learning 101
Machine learning 101Machine learning 101
Machine learning 101
AmmarChalifah
 
Unafraid of Change: Optimizing ETL, ML, and AI in Fast-Paced Environments wit...
Unafraid of Change: Optimizing ETL, ML, and AI in Fast-Paced Environments wit...Unafraid of Change: Optimizing ETL, ML, and AI in Fast-Paced Environments wit...
Unafraid of Change: Optimizing ETL, ML, and AI in Fast-Paced Environments wit...
Databricks
 
Information Extraction from Text, presented @ Deloitte
Information Extraction from Text, presented @ DeloitteInformation Extraction from Text, presented @ Deloitte
Information Extraction from Text, presented @ Deloitte
Deep Kayal
 
Graph Database Use Cases - StampedeCon 2015
Graph Database Use Cases - StampedeCon 2015Graph Database Use Cases - StampedeCon 2015
Graph Database Use Cases - StampedeCon 2015
StampedeCon
 
Graph database Use Cases
Graph database Use CasesGraph database Use Cases
Graph database Use Cases
Max De Marzi
 
Bayesian reasoning
Bayesian reasoningBayesian reasoning
Bayesian reasoning
Marta Fajlhauer
 
Big data and APIs for PHP developers - SXSW 2011
Big data and APIs for PHP developers - SXSW 2011Big data and APIs for PHP developers - SXSW 2011
Big data and APIs for PHP developers - SXSW 2011
Eli White
 
Data Science Workshop - day 1
Data Science Workshop - day 1Data Science Workshop - day 1
Data Science Workshop - day 1
Aseel Addawood
 
Phpconf2008 Sphinx En
Phpconf2008 Sphinx EnPhpconf2008 Sphinx En
Phpconf2008 Sphinx En
Murugan Krishnamoorthy
 
Data science in action
Data science in actionData science in action
Data science in action
Longhow Lam
 
Discover deep insights with Salesforce Einstein Analytics and Discovery
Discover deep insights with Salesforce Einstein Analytics and DiscoveryDiscover deep insights with Salesforce Einstein Analytics and Discovery
Discover deep insights with Salesforce Einstein Analytics and Discovery
New Delhi Salesforce Developer Group
 
Data Science Demystified
Data Science DemystifiedData Science Demystified
Data Science Demystified
Emily Robinson
 
A Hacking Toolset for Big Tabular Files (3)
A Hacking Toolset for Big Tabular Files (3)A Hacking Toolset for Big Tabular Files (3)
A Hacking Toolset for Big Tabular Files (3)
Toshiyuki Shimono
 
Sem tech 2011 v8
Sem tech 2011 v8Sem tech 2011 v8
Sem tech 2011 v8
dallemang
 
M|18 How InfoArmor Harvests Data from the Underground Economy
M|18 How InfoArmor Harvests Data from the Underground EconomyM|18 How InfoArmor Harvests Data from the Underground Economy
M|18 How InfoArmor Harvests Data from the Underground Economy
MariaDB plc
 
Cs437 lecture 1-6
Cs437 lecture 1-6Cs437 lecture 1-6
Cs437 lecture 1-6
Aneeb_Khawar
 
Daniel Krasner - High Performance Text Processing with Rosetta
Daniel Krasner - High Performance Text Processing with Rosetta Daniel Krasner - High Performance Text Processing with Rosetta
Daniel Krasner - High Performance Text Processing with Rosetta
PyData
 
1645 track 2 pafka
1645 track 2 pafka1645 track 2 pafka
1645 track 2 pafka
Rising Media, Inc.
 

Similar to Applications of Machine Learning at UCSB (20)

SQL is Dead; Long Live SQL: Lightweight Query Services for Long Tail Science
SQL is Dead; Long Live SQL: Lightweight Query Services for Long Tail ScienceSQL is Dead; Long Live SQL: Lightweight Query Services for Long Tail Science
SQL is Dead; Long Live SQL: Lightweight Query Services for Long Tail Science
 
Machine learning 101
Machine learning 101Machine learning 101
Machine learning 101
 
Unafraid of Change: Optimizing ETL, ML, and AI in Fast-Paced Environments wit...
Unafraid of Change: Optimizing ETL, ML, and AI in Fast-Paced Environments wit...Unafraid of Change: Optimizing ETL, ML, and AI in Fast-Paced Environments wit...
Unafraid of Change: Optimizing ETL, ML, and AI in Fast-Paced Environments wit...
 
Information Extraction from Text, presented @ Deloitte
Information Extraction from Text, presented @ DeloitteInformation Extraction from Text, presented @ Deloitte
Information Extraction from Text, presented @ Deloitte
 
Graph Database Use Cases - StampedeCon 2015
Graph Database Use Cases - StampedeCon 2015Graph Database Use Cases - StampedeCon 2015
Graph Database Use Cases - StampedeCon 2015
 
Graph database Use Cases
Graph database Use CasesGraph database Use Cases
Graph database Use Cases
 
Bayesian reasoning
Bayesian reasoningBayesian reasoning
Bayesian reasoning
 
Big data and APIs for PHP developers - SXSW 2011
Big data and APIs for PHP developers - SXSW 2011Big data and APIs for PHP developers - SXSW 2011
Big data and APIs for PHP developers - SXSW 2011
 
Data Science Workshop - day 1
Data Science Workshop - day 1Data Science Workshop - day 1
Data Science Workshop - day 1
 
Unit 1
Unit 1Unit 1
Unit 1
 
Phpconf2008 Sphinx En
Phpconf2008 Sphinx EnPhpconf2008 Sphinx En
Phpconf2008 Sphinx En
 
Data science in action
Data science in actionData science in action
Data science in action
 
Discover deep insights with Salesforce Einstein Analytics and Discovery
Discover deep insights with Salesforce Einstein Analytics and DiscoveryDiscover deep insights with Salesforce Einstein Analytics and Discovery
Discover deep insights with Salesforce Einstein Analytics and Discovery
 
Data Science Demystified
Data Science DemystifiedData Science Demystified
Data Science Demystified
 
A Hacking Toolset for Big Tabular Files (3)
A Hacking Toolset for Big Tabular Files (3)A Hacking Toolset for Big Tabular Files (3)
A Hacking Toolset for Big Tabular Files (3)
 
Sem tech 2011 v8
Sem tech 2011 v8Sem tech 2011 v8
Sem tech 2011 v8
 
M|18 How InfoArmor Harvests Data from the Underground Economy
M|18 How InfoArmor Harvests Data from the Underground EconomyM|18 How InfoArmor Harvests Data from the Underground Economy
M|18 How InfoArmor Harvests Data from the Underground Economy
 
Cs437 lecture 1-6
Cs437 lecture 1-6Cs437 lecture 1-6
Cs437 lecture 1-6
 
Daniel Krasner - High Performance Text Processing with Rosetta
Daniel Krasner - High Performance Text Processing with Rosetta Daniel Krasner - High Performance Text Processing with Rosetta
Daniel Krasner - High Performance Text Processing with Rosetta
 
1645 track 2 pafka
1645 track 2 pafka1645 track 2 pafka
1645 track 2 pafka
 

More from Sri Ambati

GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Sri Ambati
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
Sri Ambati
 
Generative AI Masterclass - Model Risk Management.pptx
Generative AI Masterclass - Model Risk Management.pptxGenerative AI Masterclass - Model Risk Management.pptx
Generative AI Masterclass - Model Risk Management.pptx
Sri Ambati
 
AI and the Future of Software Development: A Sneak Peek
AI and the Future of Software Development: A Sneak Peek AI and the Future of Software Development: A Sneak Peek
AI and the Future of Software Development: A Sneak Peek
Sri Ambati
 
LLMOps: Match report from the top of the 5th
LLMOps: Match report from the top of the 5thLLMOps: Match report from the top of the 5th
LLMOps: Match report from the top of the 5th
Sri Ambati
 
Building, Evaluating, and Optimizing your RAG App for Production
Building, Evaluating, and Optimizing your RAG App for ProductionBuilding, Evaluating, and Optimizing your RAG App for Production
Building, Evaluating, and Optimizing your RAG App for Production
Sri Ambati
 
Building LLM Solutions using Open Source and Closed Source Solutions in Coher...
Building LLM Solutions using Open Source and Closed Source Solutions in Coher...Building LLM Solutions using Open Source and Closed Source Solutions in Coher...
Building LLM Solutions using Open Source and Closed Source Solutions in Coher...
Sri Ambati
 
Risk Management for LLMs
Risk Management for LLMsRisk Management for LLMs
Risk Management for LLMs
Sri Ambati
 
Open-Source AI: Community is the Way
Open-Source AI: Community is the WayOpen-Source AI: Community is the Way
Open-Source AI: Community is the Way
Sri Ambati
 
Building Custom GenAI Apps at H2O
Building Custom GenAI Apps at H2OBuilding Custom GenAI Apps at H2O
Building Custom GenAI Apps at H2O
Sri Ambati
 
Applied Gen AI for the Finance Vertical
Applied Gen AI for the Finance Vertical Applied Gen AI for the Finance Vertical
Applied Gen AI for the Finance Vertical
Sri Ambati
 
Cutting Edge Tricks from LLM Papers
Cutting Edge Tricks from LLM PapersCutting Edge Tricks from LLM Papers
Cutting Edge Tricks from LLM Papers
Sri Ambati
 
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...
Sri Ambati
 
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...
Sri Ambati
 
KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...
KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...
KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...
Sri Ambati
 
LLM Interpretability
LLM Interpretability LLM Interpretability
LLM Interpretability
Sri Ambati
 
Never Reply to an Email Again
Never Reply to an Email AgainNever Reply to an Email Again
Never Reply to an Email Again
Sri Ambati
 
Introducción al Aprendizaje Automatico con H2O-3 (1)
Introducción al Aprendizaje Automatico con H2O-3 (1)Introducción al Aprendizaje Automatico con H2O-3 (1)
Introducción al Aprendizaje Automatico con H2O-3 (1)
Sri Ambati
 
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...
Sri Ambati
 
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...
Sri Ambati
 

More from Sri Ambati (20)

GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
Generative AI Masterclass - Model Risk Management.pptx
Generative AI Masterclass - Model Risk Management.pptxGenerative AI Masterclass - Model Risk Management.pptx
Generative AI Masterclass - Model Risk Management.pptx
 
AI and the Future of Software Development: A Sneak Peek
AI and the Future of Software Development: A Sneak Peek AI and the Future of Software Development: A Sneak Peek
AI and the Future of Software Development: A Sneak Peek
 
LLMOps: Match report from the top of the 5th
LLMOps: Match report from the top of the 5thLLMOps: Match report from the top of the 5th
LLMOps: Match report from the top of the 5th
 
Building, Evaluating, and Optimizing your RAG App for Production
Building, Evaluating, and Optimizing your RAG App for ProductionBuilding, Evaluating, and Optimizing your RAG App for Production
Building, Evaluating, and Optimizing your RAG App for Production
 
Building LLM Solutions using Open Source and Closed Source Solutions in Coher...
Building LLM Solutions using Open Source and Closed Source Solutions in Coher...Building LLM Solutions using Open Source and Closed Source Solutions in Coher...
Building LLM Solutions using Open Source and Closed Source Solutions in Coher...
 
Risk Management for LLMs
Risk Management for LLMsRisk Management for LLMs
Risk Management for LLMs
 
Open-Source AI: Community is the Way
Open-Source AI: Community is the WayOpen-Source AI: Community is the Way
Open-Source AI: Community is the Way
 
Building Custom GenAI Apps at H2O
Building Custom GenAI Apps at H2OBuilding Custom GenAI Apps at H2O
Building Custom GenAI Apps at H2O
 
Applied Gen AI for the Finance Vertical
Applied Gen AI for the Finance Vertical Applied Gen AI for the Finance Vertical
Applied Gen AI for the Finance Vertical
 
Cutting Edge Tricks from LLM Papers
Cutting Edge Tricks from LLM PapersCutting Edge Tricks from LLM Papers
Cutting Edge Tricks from LLM Papers
 
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...
 
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...
 
KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...
KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...
KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...
 
LLM Interpretability
LLM Interpretability LLM Interpretability
LLM Interpretability
 
Never Reply to an Email Again
Never Reply to an Email AgainNever Reply to an Email Again
Never Reply to an Email Again
 
Introducción al Aprendizaje Automatico con H2O-3 (1)
Introducción al Aprendizaje Automatico con H2O-3 (1)Introducción al Aprendizaje Automatico con H2O-3 (1)
Introducción al Aprendizaje Automatico con H2O-3 (1)
 
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...
 
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...
 

Recently uploaded

First Steps with Globus Compute Multi-User Endpoints
First Steps with Globus Compute Multi-User EndpointsFirst Steps with Globus Compute Multi-User Endpoints
First Steps with Globus Compute Multi-User Endpoints
Globus
 
Graphic Design Crash Course for beginners
Graphic Design Crash Course for beginnersGraphic Design Crash Course for beginners
Graphic Design Crash Course for beginners
e20449
 
Accelerate Enterprise Software Engineering with Platformless
Accelerate Enterprise Software Engineering with PlatformlessAccelerate Enterprise Software Engineering with Platformless
Accelerate Enterprise Software Engineering with Platformless
WSO2
 
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Globus
 
GlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote sessionGlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote session
Globus
 
Prosigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology SolutionsProsigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology Solutions
Prosigns
 
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Globus
 
Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604
Fermin Galan
 
A Comprehensive Look at Generative AI in Retail App Testing.pdf
A Comprehensive Look at Generative AI in Retail App Testing.pdfA Comprehensive Look at Generative AI in Retail App Testing.pdf
A Comprehensive Look at Generative AI in Retail App Testing.pdf
kalichargn70th171
 
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
Mind IT Systems
 
May Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdfMay Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdf
Adele Miller
 
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
Anthony Dahanne
 
Into the Box 2024 - Keynote Day 2 Slides.pdf
Into the Box 2024 - Keynote Day 2 Slides.pdfInto the Box 2024 - Keynote Day 2 Slides.pdf
Into the Box 2024 - Keynote Day 2 Slides.pdf
Ortus Solutions, Corp
 
Enhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdf
Enhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdfEnhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdf
Enhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdf
Jay Das
 
Enhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdfEnhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdf
Globus
 
Cyaniclab : Software Development Agency Portfolio.pdf
Cyaniclab : Software Development Agency Portfolio.pdfCyaniclab : Software Development Agency Portfolio.pdf
Cyaniclab : Software Development Agency Portfolio.pdf
Cyanic lab
 
Large Language Models and the End of Programming
Large Language Models and the End of ProgrammingLarge Language Models and the End of Programming
Large Language Models and the End of Programming
Matt Welsh
 
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptxTop Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
rickgrimesss22
 
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Globus
 
Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...
Globus
 

Recently uploaded (20)

First Steps with Globus Compute Multi-User Endpoints
First Steps with Globus Compute Multi-User EndpointsFirst Steps with Globus Compute Multi-User Endpoints
First Steps with Globus Compute Multi-User Endpoints
 
Graphic Design Crash Course for beginners
Graphic Design Crash Course for beginnersGraphic Design Crash Course for beginners
Graphic Design Crash Course for beginners
 
Accelerate Enterprise Software Engineering with Platformless
Accelerate Enterprise Software Engineering with PlatformlessAccelerate Enterprise Software Engineering with Platformless
Accelerate Enterprise Software Engineering with Platformless
 
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
 
GlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote sessionGlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote session
 
Prosigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology SolutionsProsigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology Solutions
 
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
 
Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604
 
A Comprehensive Look at Generative AI in Retail App Testing.pdf
A Comprehensive Look at Generative AI in Retail App Testing.pdfA Comprehensive Look at Generative AI in Retail App Testing.pdf
A Comprehensive Look at Generative AI in Retail App Testing.pdf
 
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
 
May Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdfMay Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdf
 
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
 
Into the Box 2024 - Keynote Day 2 Slides.pdf
Into the Box 2024 - Keynote Day 2 Slides.pdfInto the Box 2024 - Keynote Day 2 Slides.pdf
Into the Box 2024 - Keynote Day 2 Slides.pdf
 
Enhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdf
Enhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdfEnhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdf
Enhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdf
 
Enhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdfEnhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdf
 
Cyaniclab : Software Development Agency Portfolio.pdf
Cyaniclab : Software Development Agency Portfolio.pdfCyaniclab : Software Development Agency Portfolio.pdf
Cyaniclab : Software Development Agency Portfolio.pdf
 
Large Language Models and the End of Programming
Large Language Models and the End of ProgrammingLarge Language Models and the End of Programming
Large Language Models and the End of Programming
 
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptxTop Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
 
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
 
Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...
 

Applications of Machine Learning at UCSB

  • 1. APPLICATIONS OF MACHINE LEARNING AlexTellez + Amy Wang + H2OTeam UC Santa Barbara, 4/6/15
  • 2. AGENDA 1. Introduction to Big Data / ML 2. What is H2O.ai? 3. Use Cases: 4. Data Science Competition a) Beat Bill Belichick b) Fight Crime in Chicago c) Ham/Spam Text Messages d) Cycling Article Search
  • 3. 1. INTROTO BIG DATA / ML BIG DATA IS LIKE TEENAGE SEX: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it… Dan Ariely, Prof. @ Duke
  • 4. BIGVS. SMALL DATA When you try to open file in excel, excel CRASHES SMALL = Data fits in RAM BIG = Data does NOT fit in RAM Basically… Big Data is data too big to process using conventional methods (e.g. excel, access)
  • 5. V +V +V Today, we have access to more data than we know what to do with! 1) Wearables (fitbit, iWatch, etc) 2) Click streams from web visitors 3. Sensor readings 4. Social Media Outlets (e.g. twitter, facebook, etc) Volume - Data volumes are becoming unmanageable Variety - More data types being captured Velocity - Data arrives rapidly and must be processed / stored
  • 6. THE HOPE OF BIG DATA 1. Data contains information of great business / personal value Examples: a) Predicting future stock movements = $$$ b) Netflix movie recommendations = Better experience = $$$ 2. IF you can extract those insights from the data, you can make better decisions Enter, Machine Learning (ML)… So how the hell do you do it?
  • 7. MACHINE LEARNING The Wikipedia Definition: …a scientific discipline that explores the construction and study of algorithms that can learn from data. Such algorithms operate by building a model…. ZZZzzzzzZZZzzzzzz My Definition: The development, analysis, and application of algorithms that enable machines to: make predictions and / or better understand data 2 Types of Learning: SUPERVISED + UNSUPERVISED
  • 8. SUPERVISED LEARNING What is it? Examples of supervised learning tasks: 1. ClassificationTasks - Benign / Malignant tumor 2. RegressionTasks - Predicting future stock market prices 3. Image Recognition - Highlighting faces in pictures Methods that infer a function from labeled training data. Key task: Predicting ________ . (Insert your task here)
  • 9. UNSUPERVISED LEARNING What is it? Examples of unsupervised learning tasks: 1. Clustering - Discovering customer segments 2.Topic Extraction - What topics are people tweeting about? 3. Information Retrieval - IBM Watson: Question + Answer Methods to understand the general structure of input data where no predictions is needed. 4.Anomaly Detection - Detecting irregular heart-beats NO CURATION NEEDED!
  • 10. 2.WHAT IS H2O? What is H2O? (water, duh!) It is ALSO an open-source, parallel processing engine for machine learning. What makes H2O different? Cutting-edge algorithms + parallel architecture + ease-of-use = Happy Data Scientists / Analysts
  • 11. TEAM @ H2O.AI 16,000 commits H2O World Conference 2014
  • 12. COMMUNITY REACH 120 meetups in 2014 11,000 installations 2,000 corporations First Friday Hack-A-Thons
  • 13. TRY IT! Don’t take my word for it…www.h2o.ai Simple Instructions 1. CD to Download Location 2. unzip h2o file 3. java -jar h2o.jar 4. Point browser to: localhost:54321 GUI R
  • 14. 3. USE CASES (LOTS OF EM) BEAT BILL BELICHICK
  • 15. TB + BB Bill Belichick Tom Brady + = 15 years together 3 Super Bowls
  • 16. PASS OR RUN? On any given offensive play… Coach Bill can either call a PASS or a RUN What determines this? Game situation Opposing team Time remaining, etc, etc Yards to go (until 1st down) Basically, LOTS of stuff. Personnel
  • 17. BUT WHAT IF?? Question: Can we try to predict whether the next play will be PASS or RUN using historical data? Approach: Download every offensive play from Belichick-Brady era since 2000 Use various Machine Learning approaches to model PASS / RUN Disclaimer: I’m not a Seahawks fan! Extract known features to build model inputs
  • 18. DATA COLLECTION Data: 13 years of data (2002 -2013 season) 194 games total 14,547 total offensive plays (excludes punts, kickoffs, returns) Response Variable: PASS / RUN Model Inputs: Quarter, Minutes, Seconds, OpposingTeam, Down, Distance, Line of Scrimmage, NE-Score, OpposingTeam Score, Season, Formation, Game Status (is NE losing / winning / tied)
  • 19. FIGHTING CRIME IN CHICAGO Spark + H2O
  • 20. OPEN CITY, OPEN DATA “…my kind of town” - F. Sinatra ~4.6 Million rows of crimes from 2001, updated weekly* External data source considerations??? Weather Data ?U.S. Census Data ? Crime Data
  • 21. ML WORKFLOW 1. Collect datasets (Crime + Weather + Census) 2. Do some feature extraction (e.g. dates, times) 3. Join Crime data Weather Data Census Data 4. Build deep learning model to predict arrest / no arrest made GOAL: For a given crime, predict if an arrest is more / less likely to be made!
  • 22. SPARK SQL + H2O RDD 3 table join using Spark SQL Convert joined table to H2O RDD
  • 24. NEW:TEXT CLASSIFICATION Text Processing in Spark + H2O Deep Learning!
  • 25. HAM / SPAMTEXTS Problem: No one likes to be spammed. Can we look at text messages and come up with a ham (real text) / spam classifier using Spark feature processing + h2o deep learning? ML Workflow: 1.Tokenize words in text messages (1,024 texts) 2.Transform each text using Spark’s implementation of TF-IDF 3. ConvertTF-IDF Spark RDD H2O RDD 4. Run Deep Learning onTrain /Test Data
  • 26. FEATURE EXTRACTION Original Text: “Ok…But they said i’ve got wisdom teeth hidden inside n mayb need 2 remove.” Post Data Cleaning & Tokenization: ( but, they, said, got, wisdom, teeth, hidden, inside, maybe, need, remove) lower case ignore stopwordsstrip punctuation remove numbers
  • 27. FEATURETRANSFORMATION Post Data Cleaning & Tokenization: ( but, they, said, got, wisdom, teeth, hidden, inside, maybe, need, remove) Term Frequency - Inverse Document Frequency (TF-IDF) 1.TF - How often does “wisdom” occur in above text? 2. IDF - Normalization which calc’s frequency of “wisdom” across all other text messages. tf-idf(t, d) = tf(t, d) x idf(t) WHERE idf(t) = log(N / n)
  • 28. SO…WHAT JUST HAPPENED? 0 , 0 , 0 , 0 , 1, 1, 0 , 0 , 1, 0 , 0 …, 0[ ] ( but, they, said, got, wisdom, teeth, hidden, inside, maybe, need, remove) wisdom teeth remove Bag-O-Words 0 , 0 , 0 , 0 , 3.5, 2.9, 0 , 0 , 0.85, 0 , 0 …, 0 wisdom teeth remove TF-IDF
  • 29. DO IT LIVE! Let’s fire up H2O and run a model to predict ham / spam!
  • 30. DEEP AUTOENCODERS + K- MEANS EXAMPLE Help cyclists with their health related questions!
  • 31. CYCLING + __________ Problem: New and Experienced Cyclists have questions about cycling + ______ (given topic). Let’s build a question + answer system to help! ML Workflow: 1) Scrape thousands of article titles from internet about cycling / cycling tips / cycling health, etc from various sources. 2) Build Bag-of-Words Dataset on article titles corpus 3) Reduce # of dimensions via deep autoencoder 4) Extract ‘last layer’ of deep features and cluster using k-means 5) Inspect Results!
  • 32. BAG-OF-WORDS Build dataset of cycling-related articles from various sources: The Basics of Exercise Nutrition 0 , 0 , 0 , 0 , 1, 1, 0 , 0 , 1, 0 , 0 …, 0 basics exercise nutrition lower case remove ‘stopwords’ remove punctuation Article Title [ ]
  • 33. DIMENSIONALITY REDUCTION Use deep autoencoder to reduce # features (~2,700 words!) 2,700Words 500hiddenfeatures 250H.F. 125H.F. 50 125H.F. 250H.F. 500hiddenfeatures 2,700Words Decoder Encoder The Basics of Exercise Nutrition
  • 34. K-MEANS CLUSTERING For each article: Extract ‘last’ layer of autoencoder (50 deep features) The Basics of Exercise Nutrition 50 ‘deep features’ The Basics of Exercise Nutrition -­‐0.09330833 0.167881429 -­‐0.234307408 0.247723639 -­‐0.067700267 -­‐0.094107866 DF1 DF2 DF3 DF4 DF5 DF6 K-Means Clustering Inputs: Extracted 50 deep features for each cycling-related article K = 50 clusters after grid-search of values
  • 35. RESULT: CYCLING + A.I. Now we inspect the clusters! Test Article Title: Fluid & Carbohydrate Ingestion Improve Performance During 1Hour of Intense Exercise Result: Clustered w/ 17 other titles (out of ~5,700) Top 5 similar titles within cluster: Caffeine ingestion does not alter performance during a 100-km cycling time-trial performance Immuno-endocrine response to cycling following ingestion of caffeine and carbohydrate Metabolism and performance following carbohydrate ingestion late in exercise Increases in cycling performance in response to caffeine ingestion are repeatable Fluid ingestion does not influence intense 1-h exercise performance in a mild environment
  • 36. HOWTO GET FASTER? Test Article Title: Muscle Coordination is Key to Power Output & Mechanical Efficiency of Limb Movements Result: Clustered w/ 29 other titles (out of ~5,700) Top 5 similar titles within cluster: Muscle fibre type efficiency and mechanical optima affect freely chosen pedal rate during cycling. Standard mechanical energy analyses do not correlate with muscle work in cycling. The influence of body position on leg kinematics and muscle recruitment during cycling. Influence of repeated sprint training on pulmonary O2 uptake and muscle deoxygenation kinetics in humans Influence of pedaling rate on muscle mechanical energy in low power recumbent pedaling using forward dynamic simulations
  • 37. 4. DATA SCIENCE COMPETITION Apply / Learn More @: apps.h2o.ai Checkout ourYouTube Channel for last year’s talks @ H2O World