Dato Confidential
GraphLab Create Benchmarks
April 21, 2016
Guy Rapaport, Data Scientist, Dato EMEA
guy@dato.com
Dato: We Intelligent Applications
Some of our Customers
Business must be intelligent
Machine learning applications
• Recommenders
• Fraud detection
• Ad targeting
• Financial models
• Personalized medicine
• Churn prediction
• Smart UX (video & text)
• Personal assistants
• IoT
• Social networks
• Log analysis
Last decade: Data management
Last 5 years: Traditional analytics
Now: Intelligent apps
?
Example Intelligent Applications
- images
- text
- graphs
- tabular data
Creating a model pipeline
exploration
data
modeling
Creating a model pipeline
Ingest Transform Model Deploy
Unstructured Data
GraphLab Create in a Line
“A general-purpose machine learning Python library that
scales on large datasets.”
• General purpose: classification, graph analytics…
• Python API on top, C++ open-source engine below.
• Scales vertically: more CPUs, RAM and faster disks.
• Large datasets: disk bound, not RAM bound.
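To make "disk bound, not RAM bound" concrete, here is a minimal sketch of the general idea: read the data as a stream and keep only a running summary in memory, so memory use stays constant as the input grows. This is a generic illustration, not how GraphLab Create is implemented. The function name streaming_mean and the column names are made up for the example.

```python
import csv
import io


def streaming_mean(lines, column):
    """Mean of one CSV column, reading one row at a time.

    Only a running count and total are kept in memory, so the input
    can be far larger than RAM as long as it can be read sequentially.
    """
    count = 0
    total = 0.0
    for row in csv.DictReader(lines):
        total += float(row[column])
        count += 1
    return total / count if count else float("nan")


# In practice `lines` would be an open file on disk; a small in-memory
# buffer stands in for it here so the example is self-contained.
sample = io.StringIO("clicks,price\n1,2.5\n0,3.5\n1,6.0\n")
mean_price = streaming_mean(sample, "price")
```

With this access pattern, throughput is limited by how fast the disk can deliver rows, which is one reason faster disks matter when scaling vertically.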
What will we cover today?
1. Instantiate a machine in the Amazon EC2 cloud
• r3.8xlarge instance
• 32 cores, 244 GB of RAM, 2 SSDs of 320 GB each
2. Run PageRank on a large graph
• CommonCrawl 2012 dataset – the internet as a graph
• 3.5 billion nodes, 128 billion links
3. Run Gradient Boosted Trees on a large dataset
• Criteo 1TB Click Logs Dataset
• 4.3 billion rows, 39 features (13 numerical, 26 categorical)
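For readers new to PageRank, the sketch below shows what the algorithm computes on a toy graph, using plain power iteration in pure Python. It is an illustration only: it is not the GraphLab Create implementation and not the benchmark code. The function name, the damping factor of 0.85, the iteration count and the four-page toy graph are all assumptions chosen for the example.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of (source, target) links."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    # Start from a uniform distribution over all pages.
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank held by pages with no outgoing links is spread over all pages.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        base = (1.0 - damping + damping * dangling) / len(nodes)
        new_rank = {n: base for n in nodes}
        # Each page splits its damped rank evenly among the pages it links to.
        for src in nodes:
            if out_links[src]:
                share = damping * rank[src] / len(out_links[src])
                for dst in out_links[src]:
                    new_rank[dst] += share
        rank = new_rank
    return rank


# Hypothetical toy web: page "c" is linked to by every other page.
toy_web = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a"), ("d", "c")]
ranks = pagerank(toy_web)
best_page = max(ranks, key=ranks.get)
```

Each iteration touches every link once, so on a graph with 128 billion links the cost of a single pass is dominated by how quickly the edges can be streamed, which is what the benchmark exercises.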
What will you be able to do afterwards?
Instantiate an EC2 instance, grab our benchmark
notebooks, and try it yourself!
Everything is publicly available on GitHub:
https://github.com/guy4261/glc_pagerank_benchmark
Screen Primer
Command                               Action
sudo apt-get install -y screen        Install screen
screen -S my_session                  Start a session named my_session
PS1='\u@\h(${STY}:${WINDOW}):\w$'     Change your screen prompt (helpful)
# CTRL+A, then D                      Key combination to detach
screen -ls                            List all open screens
screen -r my_session                  Reattach to your screen
exit                                  Exit the session and terminate the screen
Confidential – Dato internal use only. ©2015 Dato, Inc.
Questions?
“For the purpose of learning the Answer to the
Ultimate Question of Life, The Universe, and Everything,
the supercomputer Deep Thought was specially built.
It takes Deep Thought 7½ million years to compute and check the
answer, which turns out to be 42. Deep Thought points out that
the answer seems meaningless because
the beings who instructed it
never actually knew what the Question was.”
- Douglas Adams, “The Hitchhiker’s Guide to the Galaxy”
Our Machine Learning Specialization on Coursera
https://www.coursera.org/learn/ml-foundations
Thanks!
Install using pip: $ pip install -U graphlab-create
Dato Launcher Download:
https://dato.com/download/
The benchmarks on GitHub:
https://github.com/guy4261/glc_pagerank_benchmark
Coursera Course:
https://www.coursera.org/learn/ml-foundations
Reach out: guy@dato.com

Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets

Editor's Notes

  • #2  The team, the history of the product
  • #3 The company began 7 years ago at Carnegie Mellon University as an open-source project. It is now a company with 50+ employees and a recently opened EMEA office here in Israel. Customers.
  • #4 Yes, we are selling (100+ paying customers, brand names). Intelligent apps are predictive.
  • #5 From analytics (queries over known data) to predictive (discovering the unknown). Supported data types.
  • #6 Creating a model pipeline.
  • #7 Steps in the model pipeline creation.
  • #8 From inspiration to production.
  • #9 (End of corporate slides.) GLC in a line.
  • #10 We're going to see it all today.
  • #11 Exact counts: 3,443,082,324 vertices and 128,736,914,167 edges in CommonCrawl 2012; 4,373,472,329 rows in Criteo. Afterwards, run the benchmarks.
  • #12 Switch to a screen share of creating the EC2 instance, followed by the benchmarks. Questions.
  • #13 PS1 courtesy of http://unix.stackexchange.com/a/20991
  • #14 Check our Coursera course 
  • #15 Thanks 