The ML pipeline circa 2013
Data
ML
Algorithm
My curve is
better than
your curve
Write a
paper
Retail
Movie Distribution
Music
Advertising
Networking
Search
Taxis
Dating
Legal Advice
Human Resources
Coupons
Campaigning
Real Estate
Wearables
CRM
Disruptive companies
differentiated by
INTELLIGENT
APPLICATIONS
using
Machine Learning
Dato’s mission is to
accelerate the creation of
intelligent applications
by making
sophisticated machine learning
as easy as
“Hello world!”
•  Released 3 products
•  More than 10,000 downloads
GraphLab Create, Dato Distributed, Dato Predictive Services
Since last year…
Our
customers…
Demo:
Intelligent application
(Gift for Julia)
Systems
Elastic, scalable
People
Data scientist
Challenge today: Path from inspiration to production
Production
Prototyping
Inspiration
Scale
Sophisticated ML Production
Sophisticated ML is
impractical
• Hard to match algo to app
• Algos trapped in paper
Scaling is costly
• Rewrite algo from scratch
• Expensive infrastructure
Deployment: more costly
infrastructure & time
• Build custom services & API
• Model quality deteriorates
Deploy Service
Slow & expensive process
Sophisticated ML is
impractical
ML development today
Inspiration for Intelligent Application
Data
Top down solution
would be easiest
Read data
Extract text
Create features
Choose model
Tune parameter
Forced to go
bottoms up
Try again
And again
but not possible:
Application is
innovative
→
no black box
solution available
Fine approach if it’s 2013 & I’m obsessed with
“my curve is better than your curve”
(i.e., yet another solution for same old problem)
or not primarily focused on
accelerating creation of
intelligent applications
Inspiration for Intelligent Application
Data
If in 5 years all applications intelligent, ML needs:
Start from relevant, high-level,
sophisticated ML building blocks
Don’t waste time on boring stuff, like parameter search
or
worry about specialized ML knowledge, like SGD
Quickly write code:
combine, blend,
understand, adapt,
improve, optimize
Read data
Extract text
Create features
Choose model
Tune parameter
Forced to go
bottoms up
Try again
And again
ML done
differently,
Let’s see
how…
Demo:
Building an intelligent application with
GraphLab Create
(Restaurant recommender)
High-level ML toolkits
get started with 4 lines of code,
then modify, blend, add yours…
Recommender
Image
search
Sentiment
analysis
Data
matching
Auto
tagging
Churn
predictor
Object
detector
Product
sentiment
Click
prediction
Fraud detection
User
segmentation
Data
completion
Anomaly
detection
Document
clustering
Forecasting
Search
ranking
Summarization …
import graphlab as gl
data = gl.SFrame.read_csv('my_data.csv')
model = gl.recommender.create(data,
                              user_id='user',
                              item_id='movie',
                              target='rating')
recommendations = model.recommend(k=5)
Sophisticated machine learning made easy
Create Intelligence Accelerants
High-level
ML toolkits
AutoML
tune params, model
selection,…
→
so you can focus on
creative parts
Reusable
features
transferable feature
engineering
→
accuracy with less data &
less effort
Makes
ML hard
Understand
& scale
complex
models
Feature
engineering
Need for
lots of
labeled data
Very hard!
Usually: Simple models &
lots of feature engineering
Krishna’s talk tomorrow @9:10am:
auto feature engineering
Next: Transfer learning can
provide complex models with
less work & less data
Modeling challenge Data challenge
Representation challenge
Example:
Deep learning in computer vision
(or the deep devil is in the deep details)
Image features
•  Features = local detectors
o  Combined to make prediction
o  (in reality, features are more low-level)
Face!
Eye
Eye
Nose
Mouth
Many hand-created features exist…
Computer vision features: SIFT, Spin image, HoG, RIFT, Textons, GLOH
Slide credit: Honglak Lee
Standard image classification approach
Input
Computer vision features: SIFT, Spin image, HoG, RIFT, Textons, GLOH
Slide credit: Honglak Lee
Extract features Use simple classifier
e.g., logistic regression, SVMs
Car?
Many hand-created features exist…
Computer vision features: SIFT, Spin image, HoG, RIFT, Textons, GLOH
Slide credit: Honglak Lee
… but very painful to design
Deep neural networks
implicitly learn features
Each layer learns features, at different levels of abstraction
Y LeCun
MA Ranzato
Deep Learning = Learning Hierarchical Representations
It's deep if it has more than one stage of non-linear feature
transformation
Trainable
Classifier
Low-Level
Feature
Mid-Level
Feature
High-Level
Feature
Feature visualization of convolutional net trained on ImageNet from [Zeiler & Fergus 2013]
Color & edge
detectors
Geometric
detectors
Car-specific
detectors
Deep learning has yielded exciting accuracy, e.g.,
Krizhevsky et al. won 2012
ImageNet competition impressively
Huge
gain
Challenges of deep learning
Deep learning workflow
Lots of
labeled data
Training set
Validation
set
80%
20%
Learn deep
neural net
model
Validate
Many tricks needed to work well…
Different types of layers, connections,… needed for high accuracy
Krizhevsky et al. ‘12
GraphLab Create adds deep features
Deep learning
+
Transfer learning
Change image classification approach?
Input
Computer vision features: SIFT, Spin image, HoG, RIFT, Textons, GLOH
Slide credit: Honglak Lee
Extract features Use simple classifier
e.g., logistic regression, SVMs
Car?
Can we learn features
from data,
even when
we don’t have
data or time?
Transfer learning:
Use data from one domain to help learn on another
Lots of data:
Learn
neural net
Great accuracy
on cat vs. dog
Some data: Neural net as
feature extractor
+
Simple classifier
Great accuracy
on 101
categories
Old idea, explored for deep learning by Donahue et al. ’14
What’s learned in a neural net
Neural net trained for Task 1: cat vs. dog
Very specific to Task 1
Should be ignored for other tasks
More generic
Can be used as feature extractor
vs.
Transfer learning in more detail…
Neural net trained for Task 1: cat vs. dog
Very specific to Task 1
Should be ignored for other tasks
More generic
Can be used as feature extractor
Keep weights fixed!
For Task 2, predicting 101 categories, learn only end part
Use simple classifier
e.g., logistic regression, SVMs
Class?
Transfer learning with deep features
Training set
Validation
set
80%
20%
Learn
simple
model
Some
labeled data
Extract
features
with neural
net trained
on different
task
Validate
Deploy in
production
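The workflow above (extract features with a net trained on a different task, keep those weights fixed, learn only a simple model on an 80%/20% split) can be sketched in plain Python. This is an illustrative sketch only, not GraphLab Create's implementation: the fixed random tanh layer is an assumed stand-in for real pretrained layers, and every name here (`make_frozen_extractor`, `train_simple_classifier`, `run_demo`) and the toy task are hypothetical.

```python
import math
import random


def make_frozen_extractor(n_inputs, n_features, seed=0):
    """Stand-in for the generic early layers of a net trained on Task 1.

    ASSUMPTION: fixed random weights play the role of pretrained weights.
    """
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, 1.0) for _ in range(n_inputs)]
               for _ in range(n_features)]

    def extract(x):
        # One non-linear layer; the weights are never updated (kept fixed).
        return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in weights]

    return extract


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_simple_classifier(features, labels, epochs=100, lr=0.5):
    """Learn only the end part: logistic regression on the extracted features."""
    w, b = [0.0] * len(features[0]), 0.0
    for _ in range(epochs):
        for f, y in zip(features, labels):
            error = sigmoid(sum(wi * fi for wi, fi in zip(w, f)) + b) - y
            w = [wi - lr * error * fi for wi, fi in zip(w, f)]
            b -= lr * error
    return w, b


def predict(w, b, f):
    return 1 if sum(wi * fi for wi, fi in zip(w, f)) + b > 0 else 0


def run_demo(n_points=200, seed=1):
    """Toy Task 2 (hypothetical): is x0 + x1 positive? Returns validation accuracy."""
    rng = random.Random(seed)
    points = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(n_points)]
    labels = [1 if p[0] + p[1] > 0 else 0 for p in points]

    extract = make_frozen_extractor(n_inputs=2, n_features=8)
    features = [extract(p) for p in points]

    split = int(0.8 * n_points)  # 80% training set, 20% validation set
    w, b = train_simple_classifier(features[:split], labels[:split])
    hits = sum(predict(w, b, f) == y
               for f, y in zip(features[split:], labels[split:]))
    return hits / (n_points - split)


print("validation accuracy:", run_demo())
```

Only `w` and `b` are trained; the extractor is built once and never touched again, which is why little labeled data and little compute are needed for the second task.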
Deep learning tutorial tomorrow, 4pm!
Demo:
The power of deep features, a.k.a., transfer learning
(Shoes, please)
How general are deep
features?
Talk by founder, Jason Gates, tomorrow 9:40am
GraphLab Create includes
easy-to-use deep learning on multiple GPUs
Deep learning tutorial tomorrow, 4pm!
graphlab.deeplearning.create(data, target='label')
Deep learning in
1 line of code
You can also
open the box
and add your
own layers
Average Pooling Layer Rectified Linear Layer
Convolution Layer Sigmoid Layer
Dropout Layer SoftMax Layer
Flatten Layer SoftPlus Layer
Full Connection Layer Sum Pooling Layer
Max Pooling Layer Tanh Layer
[Figure: Digit recognition benchmark. Test error (0.60% to 0.85%) versus hours (0 to 15).]
H2O.ai:
10 machines/80 cores
GraphLab Create
4 min on 4 GPUs
GraphLab Create
for intelligent applications
High-level ML toolkits
(4 lines of code gets you started)
deep learning, recommender,
product reviews, data matching,
sentiment, image search, churn,
click prediction, customer
segmentation, fraud detection,…
Auto Feature Engineering
(automate, achieve high accuracy)
. deep & reusable features
. data transformation pipelines
. kernels & hashing, encodings
AutoML
(automate to focus on creativity)
. parameter search
. model selection
. algorithm selection
. distributed
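As a rough illustration of what an automated parameter search does, the sketch below tries each candidate value on a training set and keeps the one with the lowest validation error. It is a minimal example under stated assumptions, not how GraphLab Create's AutoML works: the one-feature ridge model, the function names, and the toy data are all hypothetical.

```python
def fit_ridge_1d(xs, ys, penalty):
    """Closed-form one-feature ridge regression without an intercept."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + penalty)


def mean_squared_error(weight, xs, ys):
    return sum((weight * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def grid_search(train, valid, penalties):
    """Try every candidate penalty; keep the one with the lowest validation error."""
    best = None
    for penalty in penalties:
        weight = fit_ridge_1d(train[0], train[1], penalty)
        error = mean_squared_error(weight, valid[0], valid[1])
        if best is None or error < best[1]:
            best = (penalty, error, weight)
    return best


# Noise-free toy data (hypothetical): y = 2x, so the unpenalized fit is exact.
train = ([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
valid = ([5.0, 6.0], [10.0, 12.0])
best_penalty, best_error, best_weight = grid_search(
    train, valid, [0.0, 0.1, 1.0, 10.0])
print(best_penalty, best_error, best_weight)
```

The same loop generalizes to model or algorithm selection by iterating over candidate models instead of penalty values, and each candidate can be evaluated independently, which is what makes the search easy to distribute.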
Tables, graphs,
text, images
Scalable viz for
TBs of data
Including
Matplotlib
at scale
Anthony Goldbloom
Founder & CEO
Debora Donato
Sr. Director of Personalization
& Principal Data Scientist
Native Advertising – The opportunity of making ads valuable
For the users
For the publishers
Bad advertising does not work for anybody
The data:
•  400k raw html pages containing:
o  text, images, links, and well, everything web pages have
The task:
•  predict which pages are organic and which are
sponsored advertising
When:
•  starts August 1!
The Prize
•  Fame!!!
•  Knowledge!!!
•  $10,000
A lot of effort in Kaggle
competitions involves running
many experiments…
…can get slow ☹
SFrame ❤️ all ML tools SGraph
Sophisticated machine learning made scalable
Data Structures to Create Intelligence
Data frames
user movie rating
When you choose a
data frame,
have your application in mind
SFrame is
optimized for ML
ML has specific
data access patterns,
we make them fast, really fast
(Columnar transformations,
creating new features, iterations,…)
… Same
code
user movie rating
SFrame: Scalable data frame optimized for ML
Never run out of memory
Sharded, compressed, out-of-core, columnar
Arbitrary lambda transformations, joins,… from Python
Talk tomorrow with details: Yucheng @11am
Large data on one machine?
Limited RAM → Must use disk
(out-of-core computation)
Opportunity for Out-of-Core ML
Capacity   Throughput
0.1 TB     1 GB/s (fast, but significantly limits data size)
1 TB       0.5 GB/s
10 TB      0.1 GB/s
Opportunity for big data on 1 machine
For sequential reads only!
Random access very slow
Out-of-core ML
opportunity is huge
Usual design → Lots of
random access → Slow
Design to maximize
sequential access for
ML algo patterns
GraphChi early example
SFrame data frame for ML
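The idea of designing for sequential access can be sketched with a single pass over a file on disk that keeps only a few running numbers in memory. This is a simplified illustration, not SFrame's actual storage engine: the CSV layout, the helper names, and the toy ratings are assumptions made for the example.

```python
import csv
import os
import tempfile


def streaming_mean_and_variance(path, column):
    """One sequential pass over a CSV file using constant memory (Welford's method)."""
    count, mean, m2 = 0, 0.0, 0.0
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle):  # rows arrive in file order, one at a time
            value = float(row[column])
            count += 1
            delta = value - mean
            mean += delta / count
            m2 += delta * (value - mean)
    variance = m2 / count if count else 0.0
    return count, mean, variance


def write_toy_ratings(path, ratings):
    """Write a small user/movie/rating table (hypothetical data)."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["user", "movie", "rating"])
        for i, rating in enumerate(ratings):
            writer.writerow(["u%d" % i, "m%d" % (i % 3), rating])


with tempfile.TemporaryDirectory() as folder:
    toy_path = os.path.join(folder, "ratings.csv")
    write_toy_ratings(toy_path, [1, 2, 3, 4, 5])
    stats = streaming_mean_and_variance(toy_path, "rating")
print(stats)
```

Because the file is never indexed at random, memory use stays constant however large the file grows; the data size is then bounded by disk capacity rather than RAM.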
Demo: 10TBs of data on one
machine!
SFrame ❤️ all ML
scikit-learn is awesome, but...
[Figure: Airline Delay Dataset, SGDLinearClassifier. Runtime (s), 0 to 4000, versus millions of rows, 0 to 400. scikit-learn + Numpy runs out of RAM.]
Numpy in memory only
Demo: 10TBs of data on one machine
redux
Numpy Automatically Backed by SFrames →
Scale many Python packages (scikit-learn, scipy,…)
import graphlab.numpy
Scalable numpy activation successful
[Figure: Airline Delay Dataset, SGDLinearClassifier. Runtime (s), 0 to 4000, versus millions of rows, 0 to 400. Curves: scikit-learn + Numpy (out of RAM) and GraphLab Create + scikit-learn + Numpy.]
Caveats apply
- Scales most memory-bound sklearn algorithms
- Sequential access highly preferred for performance
ML is not just about tables
ML pipelines combine multiple data types
Raw
Wikipedia
XML
Hyperlinks PageRank Top 20 Pages
Title PR
Text
Table
Title Body
Topic Model
(LDA) Word Topics
Word Topic
Term-Doc
Graph
SGraph
Graph processing
& analytics
Out-of-core &
scalable
Neighborhoods, paths, graph
algos, community detection,
label propagation, ML on
graphs, viz, …
Backed by
SFrame
Performance of SGraph
[Figure: bar chart of runtime in seconds (axis 0 to 2250) for GraphLab Create, GraphX, Giraph, and Spark; bar labels as extracted: 70 sec, 251 sec, 200 sec, 2,128 sec]
Connected components in Twitter graph
Source(s): Gonzalez et al. (OSDI 2014)
Twitter: 41 million Nodes, 1.4 billion Edges
SGraph
16 machines
1 machine
Pagerank on Common Crawl Graph
3.5 billion Nodes and 128 billion Edges
[Figure: minutes per iteration (axis 0 to 10) on 1 machine with 16 CPUs and 1 SSD]
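For readers unfamiliar with the benchmark workload, PageRank can be written as repeated sequential passes over an edge list, which is the access pattern the out-of-core design favors. The sketch below is a textbook power iteration on tiny in-memory graphs, not SGraph's implementation; the function name, the damping value of 0.85, and the example graphs are assumptions.

```python
def pagerank(edges, num_nodes, damping=0.85, iterations=20):
    """Power-iteration PageRank; each iteration is one sequential pass over edges."""
    out_degree = [0] * num_nodes
    for src, _ in edges:
        out_degree[src] += 1

    ranks = [1.0 / num_nodes] * num_nodes
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread over all nodes.
        dangling = sum(r for r, d in zip(ranks, out_degree) if d == 0)
        base = (1.0 - damping) / num_nodes + damping * dangling / num_nodes
        new_ranks = [base] * num_nodes
        for src, dst in edges:  # sequential scan, no random edge lookups
            new_ranks[dst] += damping * ranks[src] / out_degree[src]
        ranks = new_ranks
    return ranks


# Hypothetical toy graphs: a 3-node cycle and a star pointing at node 0.
cycle = pagerank([(0, 1), (1, 2), (2, 0)], 3)
star = pagerank([(1, 0), (2, 0)], 3)
print(cycle)
print(star)
```

Only the rank vectors need random access; the much larger edge list is read front to back each iteration, so it can live on disk.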
We ❤️ open source
SFrame & SGraph
Optimized
out-of-core
computation for ML
High Performance
1 machine can handle:
TBs of data
100s of billions of edges
Optimized for ML
. Columnar transformation
. Create features
. Iterators
. Filter, join, group-by, aggregate
. User-defined functions
. Easily extended through SDK
Tables, graphs,
text, images
Open-source
❤️
BSD
license
(August)
Distributed
machine
learning
Your big data
infrastructure
(cloud, hadoop, spark,..)
Sophisticated machine learning made distributed
Create Intelligence on Huge Data
Pagerank on Common Crawl Graph
3.5 billion Nodes and 128 billion Edges
[Figure: minutes per iteration (axis 0 to 10), 1 machine (16 CPUs) versus 16 machines (256 CPUs)]
45 secs/iteration
3B edges/sec
Criteo Terabyte Click Prediction
4.4 Billion Rows
13 Features
½ TB of data
[Figure: runtime (axis 0 to 4000) versus number of machines (0 to 16); labeled points: 225s and 3630s]
Same code, distributed ML
Single-machine ML code:
import graphlab as gl
data = gl.SFrame.read_csv('s3://…')
model = gl.classifier.create(data,
                             target='click')
c = gl.deploy.ec2_cluster.load('s3://…')
gl.set_distributed_execution_environment(c)
c = gl.deploy.hadoop_cluster.load('hdfs://…')
c = gl.deploy.spark_cluster.load('hdfs://…')
Dato machine learning platform
Inspiration
Scale
Sophisticated ML
Optimized for ML performance,
for any data size, on any infrastructure
AutoML
GraphLab Create
ML Toolkits
Canvas
Reusable Features
Job Mgmt
Distributed Engine
Distributed ML
Dato Distributed
SGraph
Create Engine
SFrame
GraphLab Create
Machine Learning
In Production
Machine Learning in Production
Deployment
Easily serve live predictions
Deployment Engineers
Deploying ML models
Data Scientists
Exciting new deep
learning model.
How long is this
going to take?!
REST API!
I will be done today.
It’s
accurate!
Dato Predictive
Services
Choosing between deployed models
Machine Learning in Production
Evaluation
Monitoring
Deployment
Management
Easily serve live predictions
Measuring quality of deployed models
Tracking model operations
Talk tomorrow with details: Alice & Rajat @1:45pm
Dato machine learning platform
Inspiration
Scale
Production
Deploy Service
Optimized for ML performance,
for any data size, on any infrastructure
AutoML
GraphLab Create
ML Toolkits
Canvas
Reusable Features REST Client Model Mgmt
Dato Predictive Services
Robust, Elastic
Direct
Job Mgmt
Distributed Engine
Distributed ML
Dato Distributed
SGraph
Create Engine
SFrame
GraphLab Create
Sophisticated ML
Creation of intelligent applications, faster & cheaper
My curve is
better than
your curve
INTELLIGENT
APPLICATIONS
are
disrupting markets
Phase transition of
machine learning
Accelerate this process
> pip install graphlab-create
jobs@dato.com
@guestrin

 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
 
Switzerland Constitution 2002.pdf.........
Switzerland Constitution 2002.pdf.........Switzerland Constitution 2002.pdf.........
Switzerland Constitution 2002.pdf.........
 
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
 
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
 
SR-101-01012024-EN.docx Federal Constitution of the Swiss Confederation
SR-101-01012024-EN.docx  Federal Constitution  of the Swiss ConfederationSR-101-01012024-EN.docx  Federal Constitution  of the Swiss Confederation
SR-101-01012024-EN.docx Federal Constitution of the Swiss Confederation
 
Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - Almora
 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
 
Dubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls DubaiDubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls Dubai
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
 
Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
 
Jual Cytotec Asli Obat Aborsi No. 1 Paling Manjur
Jual Cytotec Asli Obat Aborsi No. 1 Paling ManjurJual Cytotec Asli Obat Aborsi No. 1 Paling Manjur
Jual Cytotec Asli Obat Aborsi No. 1 Paling Manjur
 
Harnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptxHarnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptx
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
 

Dato Keynote

  • 1.
  • 2. The ML pipeline circa 2013: Data → ML Algorithm → “My curve is better than your curve” → Write a paper
  • 3.
  • 4. Retail, Movie Distribution, Music, Advertising, Networking, Search, Taxis, Dating, Legal Advice, Human Resources, Coupons, Campaigning, Real Estate, Wearables, CRM: disruptive companies differentiated by INTELLIGENT APPLICATIONS using Machine Learning
  • 5. Dato’s mission is to accelerate the creation of intelligent applications by making sophisticated machine learning as easy as “Hello world!”
  • 6. Since last year… Released 3 products: GraphLab Create, Dato Distributed, Dato Predictive Services. More than 10,000 downloads.
  • 9. Challenge today: Path from inspiration to production. People: Data scientist. Systems: Elastic, scalable. Stages: Inspiration, Prototyping, Sophisticated ML, Scale, Production (Deploy Service). Sophisticated ML is impractical: hard to match algo to app; algos trapped in paper. Scaling is costly: rewrite algo from scratch; expensive infrastructure. Deployment, more costly infrastructure & time: build custom services & API; model quality deteriorates. Slow & expensive process.
  • 11. ML development today. Inspiration for Intelligent Application + Data. Top down solution would be easiest, but not possible: Application is innovative → no black box solution available. Forced to go bottoms up: Read data, Extract text, Create features, Choose model, Tune parameter; Try again, And again. Fine approach if it’s 2013 & I’m obsessed with “my curve is better than your curve” (i.e., yet another solution for same old problem) or not primarily focused on accelerating creation of intelligent applications
  • 12. Inspiration for Intelligent Application + Data. If in 5 years all applications are intelligent, ML needs: Start from relevant, high-level, sophisticated ML building blocks. Don’t waste time on boring stuff, like parameter search, or worry about specialized ML knowledge, like SGD. Quickly write code: combine, blend, understand, adapt, improve, optimize. Forced to go bottoms up: Read data, Extract text, Create features, Choose model, Tune parameter; Try again, And again. ML done differently. Let’s see how…
  • 13. Demo: Building an intelligent application with GraphLab Create (Restaurant recommender)
  • 14. High-level ML toolkits: get started with 4 lines of code, then modify, blend, add yours… Recommender, Image search, Sentiment analysis, Data matching, Auto tagging, Churn predictor, Object detector, Product sentiment, Click prediction, Fraud detection, User segmentation, Data completion, Anomaly detection, Document clustering, Forecasting, Search ranking, Summarization, …
        import graphlab as gl
        data = gl.SFrame.read_csv('my_data.csv')
        model = gl.recommender.create(data, user_id='user', item_id='movie', target='rating')
        recommendations = model.recommend(k=5)
  • 15. Sophisticated machine learning made easy. Create Intelligence Accelerants: High-level ML toolkits; AutoML (tune params, model selection,… → so you can focus on creative parts); Reusable features (transferrable feature engineering → accuracy with less data & less effort)
  • 16. Makes ML hard: Understand & scale complex models (modeling challenge); Feature engineering (representation challenge); Need for lots of labeled data (data challenge). Very hard! Usually: Simple models & lots of feature engineering. Krishna’s talk tomorrow @9:10am: auto feature engineering. Next: Transfer learning can provide complex models with less work & less data
  • 17. Example: Deep learning in computer vision (or the deep devil is in the deep details)
  • 18. Image features: Features = local detectors, combined to make prediction (in reality, features are more low-level). Example: Eye, Eye, Nose, Mouth → Face!
  • 19. Many hand-created features exist… Computer vision features: SIFT, Spin image, HoG, RIFT, Textons, GLOH (Slide credit: Honglak Lee)
  • 20. Standard image classification approach: Input → Extract features (computer vision features: SIFT, Spin image, HoG, RIFT, Textons, GLOH; slide credit: Honglak Lee) → Use simple classifier, e.g., logistic regression, SVMs → Car?
  • 21. Many hand-created features exist… Computer vision features: SIFT, Spin image, HoG, RIFT, Textons, GLOH (Slide credit: Honglak Lee) … but very painful to design
  • 22. Deep neural networks implicitly learn features. Each layer learns features, at different levels of abstraction: Color & edge detectors, Geometric detectors, Car-specific detectors. (Embedded slide by Y LeCun and MA Ranzato: Deep Learning = Learning Hierarchical Representations. It's deep if it has more than one stage of non-linear feature transformation: Low-Level Feature → Mid-Level Feature → High-Level Feature → Trainable Classifier. Feature visualization of convolutional net trained on ImageNet from [Zeiler & Fergus 2013].)
  • 23. Deep learning has yielded exciting accuracy, e.g., Krizhevsky et al. won 2012 ImageNet competition impressively Huge gain
  • 24. Challenges of deep learning
  • 25. Deep learning workflow: Lots of labeled data → Training set (80%) / Validation set (20%) → Learn deep neural net model → Validate
  • 26. Many tricks needed to work well… Different types of layers, connections,… needed for high accuracy Krizhevsky et al. ‘12
  • 27. GraphLab Create adds deep features Deep learning + Transfer learning
  • 28. Change image classification approach? Input → Extract features (computer vision features: SIFT, Spin image, HoG, RIFT, Textons, GLOH; slide credit: Honglak Lee) → Use simple classifier, e.g., logistic regression, SVMs → Car? Can we learn features from data, even when we don’t have data or time?
  • 29. Transfer learning: Use data from one domain to help learn on another. Lots of data: Learn neural net → Great accuracy on cat vs. dog. Some data: Neural net as feature extractor + Simple classifier → Great accuracy on 101 categories. Old idea, explored for deep learning by Donahue et al. ’14
  • 30. What’s learned in a neural net. Neural net trained for Task 1 (cat vs. dog): one part is very specific to Task 1 and should be ignored for other tasks; the rest is more generic and can be used as feature extractor.
  • 31. Transfer learning in more detail… Neural net trained for Task 1 (cat vs. dog): the part very specific to Task 1 should be ignored for other tasks; the more generic part can be used as feature extractor. Keep weights fixed! For Task 2, predicting 101 categories, learn only end part: Use simple classifier, e.g., logistic regression, SVMs → Class?
  • 32. Transfer learning with deep features: Some labeled data → Extract features with neural net trained on different task → Training set (80%) / Validation set (20%) → Learn simple model → Validate → Deploy in production. Deep learning tutorial tomorrow, 4pm!
  • 33. Demo: The power of deep features, a.k.a., transfer learning (Shoes, please)
  • 34. How general are deep features? Talk by founder, Jason Gates, tomorrow 9:40am
  • 35. GraphLab Create includes easy to use, deep learning on multi-GPUs. Deep learning in 1 line of code: graphlab.deeplearning.create(data, target='label'). You can also open the box and add your own layers: Average Pooling Layer, Rectified Linear Layer, Convolution Layer, Sigmoid Layer, Dropout Layer, SoftMax Layer, Flatten Layer, SoftPlus Layer, Full Connection Layer, Sum Pooling Layer, Max Pooling Layer, Tanh Layer. Deep learning tutorial tomorrow, 4pm!
  • 36. [Chart: Digit recognition benchmark; test error (axis 0.60% to 0.85%) vs. hours (axis 0 to 15). Series: H2O.ai, 10 machines/80 cores; GraphLab Create, 4 min on 4 GPUs]
  • 37. GraphLab Create for intelligent applications. High-level ML toolkits (4 lines of code gets you started): deep learning, recommender, product reviews, data matching, sentiment, image search, churn, click prediction, customer segmentation, fraud detection,… Auto Feature Engineering (automate, achieve high accuracy): deep & reusable features; data transformation pipelines; kernels & hashing, encodings. AutoML (automate to focus on creativity): parameter search; model selection; algorithm selection; distributed. Tables, graphs, text, images. Scalable viz for TBs of data, including Matplotlib at scale
  • 38. Anthony Goldbloom, Founder & CEO; Debora Donato, Sr. Director of Personalization & Principal Data Scientist
  • 39. Native Advertising – The opportunity of making ads valuable: For the users; For the publishers
  • 40. Bad advertising does not work for anybody
  • 41. The data: 400k raw html pages containing text, images, links, and well, everything web pages have. The task: predict which pages are organic and which are sponsored advertising. When: starts August 1! The Prize: Fame!!! Knowledge!!! $10,000
  • 42. A lot of effort in Kaggle competitions involves running many experiments… …can get slow :(
  • 43. SFrame ❤️ all ML tools SGraph Sophisticated machine learning made scalable Data Structures to Create Intelligence
  • 44. Data frames (user, movie, rating): When you choose a data frame, have your application in mind. SFrame is optimized for ML: ML has specific data access patterns, we make them fast, really fast (Columnar transformations, creating new features, iterations,…)
  • 45. SFrame: Scalable data frame optimized for ML (user, movie, rating, …). Large data on one machine? Limited RAM → Must use disk (out-of-core computation). Same code. Never run out of memory. Sharded, compressed, out-of-core, columnar. Arbitrary lambda transformations, joins,… from Python. Talk tomorrow with details: Yucheng @11am
  • 46. Opportunity for Out-of-Core ML. Capacity and throughput tiers: 1 TB at 0.5 GB/s; 10 TB at 0.1 GB/s; 0.1 TB at 1 GB/s. Fast, but significantly limits data size. Opportunity for big data on 1 machine. For sequential reads only! Random access very slow. Out-of-core ML opportunity is huge. Usual design → Lots of random access → Slow. Design to maximize sequential access for ML algo patterns. GraphChi: early example. SFrame: data frame for ML
  • 47. Demo: 10TBs of data on one machine!
  • 49. scikit-learn is awesome, but... Numpy in memory only. [Chart: Airline Delay Dataset, SGDLinearClassifier; runtime (s), axis 0 to 4000, vs. millions of rows, axis 0 to 400. Series: scikit-learn + Numpy, annotated “Out of RAM”]
  • 50. Demo: 10TBs of data on one machine redux
  • 51. Numpy Automatically Backed by SFrames → Scale many Python packages (scikit-learn, scipy,…). import graphlab.numpy → “Scalable numpy activation successful”. [Chart: Airline Delay Dataset, SGDLinearClassifier; runtime (s), axis 0 to 4000, vs. millions of rows, axis 0 to 400. Series: scikit-learn + Numpy; Graphlab Create + scikit-learn + Numpy. Annotation: “Out of RAM”] Caveats apply: scales most memory-bound sklearn algorithms; sequential access highly preferred for performance
  • 52. ML is not just about tables
  • 53. ML pipelines combine multiple data types: Raw Wikipedia (XML) → Hyperlinks → PageRank → Top 20 Pages (Title, PR); Raw Wikipedia (XML) → Text Table (Title, Body) → Term-Doc Graph → Topic Model (LDA) → Word Topics (Word, Topic)
  • 54. SGraph Graph processing & analytics Out-of-core & scalable Neighborhoods, paths, graph algos, community detection, label propagation, ML on graphs, viz, … Backed by SFrame
  • 55. Performance of SGraph. [Chart: Connected components in Twitter graph (41 million Nodes, 1.4 billion Edges); runtime axis 0 to 2250 sec; bar values as extracted: 70 sec, 251 sec, 200 sec, 2,128 sec; systems: GraphLab Create (SGraph), GraphX, Giraph, Spark; annotations: 16 machines, 1 machine. Source(s): Gonzalez et al. (OSDI 2014)]
  • 56. Pagerank on Common Crawl Graph: 3.5 billion Nodes and 128 billion Edges. [Chart: minutes per iteration, axis 0 to 10, for 1 machine (16 CPUs, 1 SSD)]
  • 57. We ❤️ open source
  • 58. SFrame & SGraph: Optimized out-of-core computation for ML. High Performance: 1 machine can handle TBs of data, 100s Billions of edges. Optimized for ML: Columnar transformation; Create features; Iterators; Filter, join, group-by, aggregate; User-defined functions; Easily extended through SDK. Tables, graphs, text, images. Open-source ❤️ BSD license (August)
  • 59. Distributed machine learning Your big data infrastructure (cloud, hadoop, spark,..) Sophisticated machine learning made distributed Create Intelligence on Huge Data
  • 60. Pagerank on Common Crawl Graph: 3.5 billion Nodes and 128 billion Edges. [Chart: minutes per iteration, axis 0 to 10; 1 machine (16 CPUs) vs. 16 machines (256 CPUs); annotation: 45 secs/iteration, 3B edges/sec]
  • 61. Criteo Terabyte Click Prediction: 4.4 Billion Rows, 13 Features, ½ TB of data. [Chart: runtime, axis 0 to 4000, vs. #machines, axis 0 to 16; labeled values: 225s, 3630s]
  • 62. Same code, distributed ML. Single machine ML code:
        import graphlab as gl
        data = gl.SFrame.read_csv('s3://…')
        model = gl.classifier.create(data, target='click')
    Cluster setup shown on the slide (one load call per infrastructure: EC2, Hadoop, or Spark):
        c = gl.deploy.ec2_cluster.load('s3://…')
        c = gl.deploy.hadoop_cluster.load('hdfs://…')
        c = gl.deploy.spark_cluster.load('hdfs://…')
        gl.set_distributed_execution_environment(c)
  • 63. Dato machine learning platform. Inspiration, Scale, Sophisticated ML. Optimized for ML performance, for any data size, on any infrastructure. GraphLab Create: AutoML, ML Toolkits, Canvas, Reusable Features; Create Engine: SFrame, SGraph. Dato Distributed: Job Mgmt, Distributed Engine, Distributed ML. Machine Learning In Production
  • 64. Machine Learning in Production Deployment Easily serve live predictions
  • 65. Deployment Engineers Deploying ML models Data Scientists Exciting new deep learning model. How long is this going to take?! REST API! I will be done today. It’s accurate! Dato Predictive Services
  • 66. Machine Learning in Production. Deployment: Easily serve live predictions. Evaluation: Measuring quality of deployed models. Monitoring: Tracking model operations. Management: Choosing between deployed models. Talk tomorrow with details: Alice & Rajat @1:45pm
  • 67. Dato machine learning platform. Inspiration, Scale, Sophisticated ML. Optimized for ML performance, for any data size, on any infrastructure. GraphLab Create: AutoML, ML Toolkits, Canvas, Reusable Features; Create Engine: SFrame, SGraph. Dato Distributed: Job Mgmt, Distributed Engine, Distributed ML. Production: Evaluation, Monitoring, Deployment, Management
  • 68. Dato machine learning platform. Inspiration, Sophisticated ML, Scale, Production (Deploy Service). Optimized for ML performance, for any data size, on any infrastructure. GraphLab Create: AutoML, ML Toolkits, Canvas, Reusable Features; Create Engine: SFrame, SGraph. Dato Distributed: Job Mgmt, Distributed Engine, Distributed ML. Dato Predictive Services: REST Client, Model Mgmt, Robust, Elastic, Direct. Creation of intelligent applications, faster & cheaper
  • 69. “My curve is better than your curve” → INTELLIGENT APPLICATIONS are disrupting markets: phase transition of machine learning. Accelerate this process: > pip install graphlab-create. jobs@dato.com, @guestrin
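
The transfer-learning recipe on slides 29 to 32 (keep the pretrained weights fixed, extract features, split 80%/20%, learn only a simple classifier) can be sketched in plain Python. This is only an illustration under stated assumptions, not GraphLab Create's implementation: the "pretrained network" is a hypothetical stand-in (a fixed random projection with a ReLU), the Task 2 data is synthetic, and all function names are made up for this sketch.

```python
import math
import random

rng = random.Random(0)

# Hypothetical stand-in for a network already trained on Task 1:
# a fixed random projection followed by a ReLU. These weights stay frozen.
N_INPUTS, N_FEATURES = 4, 8
FROZEN_WEIGHTS = [[rng.gauss(0, 1) for _ in range(N_INPUTS)]
                  for _ in range(N_FEATURES)]

def extract_features(x):
    """Turn a raw input into features using only the frozen weights."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)))
            for row in FROZEN_WEIGHTS]

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict(w, b, f):
    """Probability of class 1 for one feature vector."""
    return sigmoid(sum(wi * fi for wi, fi in zip(w, f)) + b)

def train_logistic(features, labels, epochs=50, lr=0.05):
    """Fit only the final simple classifier with plain SGD."""
    w, b = [0.0] * len(features[0]), 0.0
    for _ in range(epochs):
        for f, y in zip(features, labels):
            err = predict(w, b, f) - y
            w = [wi - lr * err * fi for wi, fi in zip(w, f)]
            b -= lr * err
    return w, b

def accuracy(w, b, features, labels):
    """Fraction of examples whose thresholded prediction matches the label."""
    hits = sum((predict(w, b, f) >= 0.5) == (y == 1)
               for f, y in zip(features, labels))
    return hits / len(labels)

# Synthetic Task 2 data (made up for illustration): label is 1 when x0 + x1 > 0.
inputs = [[rng.gauss(0, 1) for _ in range(N_INPUTS)] for _ in range(100)]
labels = [1 if x[0] + x[1] > 0 else 0 for x in inputs]

# Extract features once, then split 80% / 20% as on the workflow slides.
features = [extract_features(x) for x in inputs]
split = len(features) * 8 // 10
train_f, valid_f = features[:split], features[split:]
train_y, valid_y = labels[:split], labels[split:]

weights_before = [row[:] for row in FROZEN_WEIGHTS]
w, b = train_logistic(train_f, train_y)
validation_accuracy = accuracy(w, b, valid_f, valid_y)
print("validation accuracy:", validation_accuracy)
```

Only `w` and `b` are updated during training; `FROZEN_WEIGHTS` is read but never written, which is the "keep weights fixed" step from slide 31.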

Editor's Notes

  1. and if you talked to me in 2013, this is how I thought machine learning worked... But, I didn't get into machine learning to write papers, I got into it because, as a kid, I read a lot of scifi
  2. and I wanted to build intelligent robots, applications that really demonstrate intelligence. I'm excited that today, these fantasies are coming to reality...
  3. we are seeing industry after industry being disrupted by companies that build intelligent applications: Amazon, Netflix, Pandora, AdSense, Uber. and these intelligent applications use machine learning at their core. so, revisiting that childhood dream, I can say...
  4. by making sophisticated ML easy for my people, the developers and data scientists... Since last year, a lot has happened...
  5. And, it is with great enthusiasm, that I can share that we, Dato, are the emerging machine learning company with most paying customers...
  6. And, the vision we share is that building intelligent applications is the key differentiator that they can provide for their users
  7. My sister Julia is a successful fashion designer... Let's see what it takes today to build such intelligent applications
  8. start with inspiration
  9. to understand why building sophisticated ML applications is impractical at a huge scale, let's look at the ML journey
  10. but, I predict that in 5 years, every disruptive application will be differentiated by machine learning. for this to come true...
  11. MNIST is, I think, just 60K images at 28x28, 10 classes, 4x GRID K520 GPUs on EC2. http://h2o.ai/blog/2015/02/deep-learning-performance/
  12. Native Advertising is paid content that matches a publication’s editorial standards while meeting the audience’s expectations. Demian Farnworth - http://www.copyblogger.com/examples-of-native-ads
  13. When trying to get to performance and scalability on a single machine, the most important thing for any programmer to understand is the storage hierarchy.
  14. r3.8xlarge If you were to try to represent this in memory, it is a minimum of a TB of memory or so, excluding overheads.
  15. r3.8xlarge If you were to try to represent this in memory, it is a minimum of a TB of memory or so, excluding overheads.
  16. Simplify the process of moving models to production. Show that it will support serving scikit-learn, R, MLlib... Manage multiple models in production. Connects to user services (high-availability and low latency)
  17. stepping back...
  18. we are lucky enough to be at a time in the development of technology when we are witnessing a phase transition. actually, you are making this machine learning phase transition happen: from the 2013 perspective, when I was only focused on papers and “my curve is better than your curve”, to a time when disruptive applications are differentiated by machine learning. we hope we can help accelerate this transition
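
The storage-hierarchy note and slides 45 to 46 argue for designs that read data sequentially from disk instead of holding it all in RAM. A minimal sketch of that access pattern, assuming a small made-up CSV of ratings and using only the Python standard library (this is not how SFrame is implemented; `stream_column` and `streaming_mean` are hypothetical names for this illustration):

```python
import csv
import os
import tempfile

def stream_column(path, column):
    """Yield one column's values, reading the file front to back."""
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle):
            yield float(row[column])

def streaming_mean(values):
    """Running mean in constant memory: one sequential pass, no random access."""
    count, mean = 0, 0.0
    for value in values:
        count += 1
        mean += (value - mean) / count
    return mean

# Tiny made-up ratings file standing in for data too large for RAM.
fd, path = tempfile.mkstemp(suffix=".csv")
with os.fdopen(fd, "w", newline="") as handle:
    writer = csv.writer(handle)
    writer.writerow(["user", "movie", "rating"])
    writer.writerows([["a", "m1", 1], ["a", "m2", 2],
                      ["b", "m1", 3], ["b", "m3", 4]])

mean_rating = streaming_mean(stream_column(path, "rating"))
os.remove(path)
print(mean_rating)
```

Because the generator hands over one row at a time, memory use does not grow with file size; the same pattern extends to any statistic or model update that can be computed in a single forward pass.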