
Scaling Up Machine Learning: How to Benchmark GraphLab Create on Huge Datasets


Presented by Guy Rapaport



  1. Dato Confidential. GraphLab Create Benchmarks, April 21, 2016. Guy Rapaport, Data Scientist, Dato EMEA
  2. Dato: We Build Intelligent Applications
  3. Some of Our Customers
  4. Business Must Be Intelligent
     Machine learning applications: recommenders, fraud detection, ad targeting, financial models, personalized medicine, churn prediction, smart UX (video & text), personal assistants, IoT, social networks, log analysis.
     Last decade: data management. Last 5 years: traditional analytics. Now: intelligent apps.
  5. Example Intelligent Applications: images, text, graphs, tabular data
  6. Creating a model pipeline: data, exploration, modeling
  7. Creating a model pipeline: Unstructured Data → Ingest → Transform → Model → Deploy
  8. (image-only slide)
  9. GraphLab Create in a Line
     "A general-purpose machine learning Python library that scales on large datasets."
     • General purpose: classification, graph analytics, and more.
     • Python API on top, C++ open-source engine below.
     • Scales vertically: more CPUs, more RAM, and faster disks.
     • Large datasets: disk bound, not RAM bound.
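The "disk bound, not RAM bound" point is the out-of-core idea: keep only a bounded chunk of the data in memory at a time. A minimal sketch in plain Python of that idea (illustration only, not GraphLab's actual SFrame implementation; `chunked_sum` and its parameters are invented for this example):

```python
# Out-of-core sketch (illustration only, not GraphLab's SFrame):
# stream a large file in fixed-size chunks so peak memory is bounded
# by the chunk size, not by the total dataset size.

def chunked_sum(path, chunk_rows=100_000):
    """Sum the first column of a CSV-like file without loading it all."""
    total = 0.0
    with open(path) as f:
        chunk = []
        for line in f:
            chunk.append(float(line.split(",")[0]))
            if len(chunk) >= chunk_rows:
                total += sum(chunk)
                chunk = []  # only one chunk resident at a time
        total += sum(chunk)  # leftover rows in the final partial chunk
    return total
```

The same pattern generalizes to any aggregation or transform that can be computed chunk by chunk, which is why such a library can process datasets far larger than RAM on a single machine.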
  10. What will we cover today?
      1. Instantiate a machine in the Amazon EC2 cloud
         • r3.8xlarge instance
         • 32 cores, 244 GB of RAM, 2 SSDs of 320 GB each
      2. Run PageRank on a large graph
         • CommonCrawl 2012 dataset: the internet as a graph
         • 3.5 billion nodes, 128 billion links
      3. Run Gradient Boosted Trees on a large dataset
         • Criteo 1TB Click Logs dataset
         • 4.3 billion rows, 39 features (13 numerical, 26 categorical)
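As a reminder of what the PageRank benchmark actually computes, here is a power-iteration sketch on a toy graph (illustration of the algorithm only; the deck benchmarks GraphLab Create's C++ implementation on CommonCrawl, and `pagerank` below is written for this example):

```python
# Power-iteration PageRank on a toy graph (illustration of the algorithm
# benchmarked on CommonCrawl; not GraphLab Create's implementation).

def pagerank(edges, damping=0.85, iters=50):
    """edges: list of (src, dst) pairs; returns {node: rank}, summing to 1."""
    nodes = {n for edge in edges for n in edge}
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        # every node receives a baseline (1 - d) / N ...
        new = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for src in nodes:
            if out_links[src]:
                # ... plus a damped share of each in-neighbor's rank
                share = damping * rank[src] / len(out_links[src])
                for dst in out_links[src]:
                    new[dst] += share
            else:
                # dangling node: spread its rank uniformly over all nodes
                for n in nodes:
                    new[n] += damping * rank[src] / len(nodes)
        rank = new
    return rank
```

At CommonCrawl scale (3.5 billion nodes, 128 billion links) the per-iteration work is dominated by streaming the edge list, which is exactly the disk-bound workload the benchmark measures.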
  11. What will you be able to do afterwards?
      Instantiate an EC2 instance, grab our benchmark notebooks, and try it yourself!
      Everything is publicly available on GitHub.
  12. Screen Primer
      Command                                   Action
      sudo apt-get install -y screen            Install screen
      screen -S my_session                      Start a session named my_session
      PS1='\u@\h(${STY}:${WINDOW}):\w\$ '       Change your screen prompt (helpful)
      CTRL+A, then D                            Key combination to detach
      screen -ls                                List all open screens
      screen -r my_session                      Reattach to your screen
      exit                                      Exit the session and terminate the screen
  13. Questions?
      "For the purpose of learning the Answer to the Ultimate Question of Life, the Universe, and Everything, the supercomputer Deep Thought was specially built. It takes Deep Thought 7½ million years to compute and check the answer, which turns out to be 42. Deep Thought points out that the answer seems meaningless because the beings who instructed it never actually knew what the Question was."
      - Douglas Adams, "The Hitchhiker's Guide to the Galaxy"
      (Confidential – Dato internal use only. ©2015 Dato, Inc.)
  14. Our Machine Learning Specialization on Coursera
  15. Thanks!
      Install using pip: $ pip install -U graphlab-create
      Dato Launcher download:
      The benchmarks on GitHub:
      Coursera course:
      Reach out: