Dato Confidential
GraphLab Create Training
UvA School of Business
Danny Bickson, Co-Founder and VP EMEA
bickson@dato.com
Dato: We Intelligent Applications
Business must be intelligent
Machine learning applications
• Recommenders
• Fraud detection
• Ad targeting
• Financial models
• Personalized medicine
• Churn prediction
• Smart UX (video & text)
• Personal assistants
• IoT
• Social networks
• Log analysis
Last decade: Data management
Last 5 years: Traditional analytics
Now: Intelligent apps
?
Example Intelligent Applications
- images
- text
- graphs
- tabular data
Creating a model
exploration
data
modeling
pipeline
Creating a model pipeline
Ingest Transform Model Deploy
Unstructured Data
The Dato Machine Learning Platform
Dato Predictive Services
Predictive Engine
REST Client
Model Mgmt
Machine Learning Toolkits
Canvas
Free for academic usage
SDK
SGraph
SFrame
Engine – SFrame on GitHub
GraphLab Create
GraphLab Create Benefits
Why use GraphLab Create?
• Efficient storage
GraphLab SFrame compressed column store:
• 20x smaller than pandas
• 2x smaller than gzip
Size on disk (the lower the better!)
No need for huge RAM!
[Chart: effective delay vs. RAM; annotations: x2, x5]
Data size limited by disk size
My data is larger than my machine RAM
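The point that data size is limited by disk rather than RAM can be illustrated with a minimal sketch in plain Python. This is not how GraphLab Create or SFrame is implemented; it only shows the general out-of-core pattern of streaming a file in fixed-size chunks and keeping running aggregates in memory. The function name `streaming_mean`, the one-number-per-line file layout, and the chunk size are assumptions made for illustration.

```python
import itertools
import os
import tempfile


def streaming_mean(path, chunk_rows=1000):
    """Mean of a one-number-per-line file, holding at most chunk_rows values in RAM."""
    total = 0.0
    count = 0
    with open(path) as f:
        while True:
            # Read the next chunk of lines; the rest of the file stays on disk.
            chunk = list(itertools.islice(f, chunk_rows))
            if not chunk:
                break
            values = [float(line) for line in chunk]
            total += sum(values)
            count += len(values)
    return total / count if count else float("nan")


def demo():
    # Hypothetical demo data: the integers 1..10000 written to a temporary file.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
        for i in range(1, 10001):
            tmp.write(f"{i}\n")
        path = tmp.name
    try:
        return streaming_mean(path, chunk_rows=256)
    finally:
        os.remove(path)
```

Memory use here depends on `chunk_rows`, not on the file size, which is the trade-off the slide describes: more passes over disk in exchange for a small RAM footprint.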
Comparison to sklearn
Try it here: http://blog.dato.com/how-fast-are-out-of-core-algorithms
Summary of differences vs. sklearn
• Better multicore support
• Out-of-core implementation (working from disk)
• Automatic feature expansion
• Automatic parameter selection
• Support for model serving
• Additional algorithms
Some of our Customers
Dato on Coursera
40,000 students in 4 months
https://www.coursera.org/learn/ml-foundations
Specialization content:
● Machine Learning Foundations
● Regression
● Classification
● Clustering & Retrieval
● Recommendation Systems & Dimensionality Reduction
● Capstone: An Intelligent Application with Deep Learning
Remco Frijling
Create an intelligent world!
Data Engineering
• Fast & scalable
• Rich data types
• Built for ML
Sophisticated ML
• App-oriented ML
• Scalable ML
• Extensibility
Deployment
• Batch & always-on
• RESTful interface
• Elastic & robust
bickson@dato.com
Appendix: Performance
Confidential – Dato internal use only. ©2015 Dato, Inc.
Performance Highlights
Dato’s Platform outperforms other frameworks on most tasks: Data munging, machine learning essentials, & graph analytics tasks.
● Data Munging - SFrame, the columnar and out-of-core abstraction, enables tabular queries on a single node that are faster than or comparable to queries on 5-node clusters for systems like Spark & Redshift.
● Machine Learning - Unparalleled speed & accuracy for tasks including classification, recommendation, and deep learning on images compared to systems like MLlib, H2O, and scikit-learn.
● Graph Analysis - Orders of magnitude faster than comparable frameworks like GraphX & Giraph for common graph analytics tasks. Tasks complete in reasonable times (mins) even on the world’s largest publicly available webgraph. The only other known system to complete these tasks is one that runs on non-commodity, custom hardware.
Machine Learning – Deep Learning
Digit recognition benchmark
[Chart: test error (0.60% to 0.85%) vs. hours (0 to 12); annotations: 4 min on 4 GPUs; 10 machines/80 cores]
Graph Analytics - 1
Connected components in Twitter graph
Twitter: 41 million nodes, 1.4 billion edges
[Bar chart: runtime in seconds, axis 0 to 2250; bar labels in slide order: GraphLab Create, GraphX, Giraph, Spark; bar values in slide order: 70 sec, 251 sec, 200 sec, 2,128 sec; legend: SGraph, 16 machines, 1 machine]
Source(s): Gonzalez et al. (OSDI 2014)
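For readers unfamiliar with the benchmark task, connected components assigns every node a label shared by all nodes reachable from it. A minimal single-machine sketch using union-find follows; it is not the SGraph, GraphX, or Giraph implementation, and the function name `connected_components` and the toy edge list are assumptions made for illustration.

```python
def connected_components(num_nodes, edges):
    """Label each node 0..num_nodes-1 with the smallest node id in its component."""
    parent = list(range(num_nodes))

    def find(x):
        # Path halving keeps the trees shallow.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            # Always keep the smaller id as the root, so the root is the minimum id.
            parent[max(ru, rv)] = min(ru, rv)

    return [find(x) for x in range(num_nodes)]


# Toy graph: {0, 1, 2} and {3, 4} are connected, node 5 is isolated.
labels = connected_components(6, [(0, 1), (1, 2), (3, 4)])
num_components = len(set(labels))
```

On this toy graph `labels` is `[0, 0, 0, 3, 3, 5]`, giving three components.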
PageRank on Common Crawl Graph
3.5 billion nodes and 128 billion edges
[Bar chart: minutes per iteration, axis 0 to 10; labels in slide order: 1 machine, 16 machines; 256 CPUs, 16 CPUs; 16 machines, 300 machines]
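The chart reports time per iteration because PageRank is computed by repeating the same update until the ranks settle. A minimal sketch of that iteration on a tiny in-memory graph is shown below. It is a textbook power iteration, not Dato's implementation; the function name `pagerank`, the damping factor of 0.85, and the fixed iteration count are assumptions made for illustration.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """out_links maps each node to the list of nodes it links to."""
    nodes = set(out_links)
    for targets in out_links.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}

    for _ in range(iterations):
        # Every node receives the same baseline share.
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        dangling = 0.0
        for v in nodes:
            targets = out_links.get(v, [])
            if targets:
                # Split this node's damped rank evenly across its out-links.
                share = damping * rank[v] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Nodes without out-links spread their rank over all nodes.
                dangling += damping * rank[v]
        for v in nodes:
            new_rank[v] += dangling / n
        rank = new_rank
    return rank


# Toy graph: a three-node cycle, so every node should end up with equal rank.
cycle_ranks = pagerank({"a": ["b"], "b": ["c"], "c": ["a"]})
```

Each pass touches every edge once, which is why the cost per iteration on a graph with 128 billion edges is the figure worth benchmarking.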
Criteo Terabyte Click Prediction
4.4 billion rows
13 features
½ TB of data
[Chart: runtime (axis 0 to 4000) vs. number of machines (0 to 16); annotated points: 3630s, 225s]
Machine Learning – Logistic Reg. Accuracy
Dataset Source(s): LIBLINEAR binary classification datasets.
Data Munging
SELECT pageURL, pageRank FROM rankings WHERE pageRank > X
5 Nodes
1 Node
Source(s): https://amplab.cs.berkeley.edu/benchmark/, Armbrust et al. (SIGMOD 2015)
Dataset: Extracted from 775M visits to 90M documents in the Common Crawl corpus
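The scan query on this slide can be tried on a toy table with Python's built-in `sqlite3` module. This sketch only shows what the query does; it says nothing about the benchmarked systems. The function name `pages_above`, the sample rows, the column types, and the added `ORDER BY` clause (used only to make the output order deterministic) are assumptions made for illustration, and the slide's `X` becomes the `threshold` parameter.

```python
import sqlite3


def pages_above(rows, threshold):
    """Run the slide's scan query on an in-memory table of (pageURL, pageRank) rows."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE rankings (pageURL TEXT, pageRank INTEGER)")
        conn.executemany("INSERT INTO rankings VALUES (?, ?)", rows)
        cur = conn.execute(
            "SELECT pageURL, pageRank FROM rankings WHERE pageRank > ? "
            "ORDER BY pageRank DESC",
            (threshold,),
        )
        return cur.fetchall()
    finally:
        conn.close()


# Hypothetical sample data, not taken from the benchmark dataset.
sample = [("a.example", 5), ("b.example", 120), ("c.example", 40)]
result = pages_above(sample, 30)
```

With a threshold of 30, `result` holds the two rows for `b.example` and `c.example`.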
Appendix: Pricing & Deployment Scenarios
• Subscription license which includes support and upgrades
• Licensed by user for Create & by machine for production use
• Training & technical services also available
• Discounts available for 10 or more users
Deployment Scenarios
Key: GraphLab Create, Dato Predictive Services, Dato Distributed
“Getting Started”
GraphLab Create – installed on each team member machine
• Working with data, training new models, doing ad-hoc analysis
GraphLab Create – installed on central team server
• Trains production models periodically (e.g., nightly)
• Generates predictions and records to data store
“Real-time Predictions”
GraphLab Create – installed on each team member machine
• Installed on team member laptops
• Working with data, ad-hoc analysis, training new models
• Deploys new models to Predictive Services deployment
GraphLab Create – installed on central team server
• Trains production models periodically (e.g., nightly)
• Deploys models to Dato Predictive Services
Dato Predictive Services – installed on central team cluster
• Hosting & serving deployed models
• REST API for application integration
“Scaling Up”
GraphLab Create – installed on each team member machine
• Working with data, training new models, doing ad-hoc analysis
• Deploys models to Predictive Services
• Submits jobs to Distributed
Dato Distributed – installed on central team cluster
• Trains models in parallel on larger datasets periodically (e.g., nightly)
• Deploys newly trained models to Dato Predictive Services
Dato Predictive Services – installed on central team cluster
• Hosting deployed models
• REST API for application integration

Danny Bickson - Python based predictive analytics with GraphLab Create
