
The Challenges of Bringing Machine Learning to the Masses



Why is it hard to build ML software, and why is it like designing a database? Jointly created with Sethu Raman (Dato/GraphLab). Talk at the NIPS 2014 workshop on Software Engineering for Machine Learning (

Published in: Engineering


  1. 1. The Challenges of Bringing Machine Learning to the Masses Alice Zheng and Sethu Raman GraphLab Inc. NIPS workshop on Software Engineering for Machine Learning December 13, 2014
  2. 2. Self introduction ML Research “Accessible ML”
  3. 3. The need for accessible ML • So much potential in ML • Everyone trying to make sense of their data • ML is transforming lives and industries: personalized medicine, internet search, social networks, advertising, etc. • But success is unattainable to most
  4. 4. Building a predictive app • Was using 217 business rules, hoping the world doesn’t change • Have an inspiring idea to reinvent their business • Key pains: • Hiring talent: shortfall in data-savvy workers needed to make sense out of big data by 2018 [McKinsey 2011]: 35% • Noisy space of tools: “Data scientists use a variety of tools, across different programming languages… require a lot of context-switching… affects productivity and impedes reproducibility.” (Ben Lorica, Data Analysis: Just one component of the Data Science workflow)
  5. 5. Building a predictive app • Feature engineering • Model definition • Training • Evaluation • Data • Deployment • Monitoring
  6. 6. Pure ML is not enough • Building a predictive application involves much more than just building ML models • System engineering: data storage, computation infrastructure, networking… • Data Science: problem definition, data cleaning, feature engineering • Software development: turn prototype model into bullet-proof production code • Operations engineering: deploy and monitor app • …
  7. 7. Pain points • What are the right features? • What model should I use? • How do I train it? • How do I set the tuning parameters? • Do I even have the right data? • Ok, I have a working prototype, now what?
  8. 8. Pain points • An increase in data size or a tighter latency requirement forces a complete rewrite of code and a new toolset • GB: R/scikit-learn/Matlab • TB to PB: Hadoop/Mahout/Spark • Many forms of data and data structures • Images, text, speech, logs • Dense lists, sparse dictionaries, time series • Tables, graphs, matrices, tensors
  9. 9. The need for an ML platform • Minimize tool/code switching, maximize performance (speed/accuracy/scale) • Graceful transition from small to large dataset sizes • Flexible, interoperable data types • Minimize complexity • System-agnostic • Simple API • Auto-tune parameters
  10. 10. The parallel to databases • What’s an example of a mega-successful platform for data operations? • Databases! • SQL, Oracle, NoSQL, … • What lessons can we bring in from the database world?
  11. 11. Database engine components Storage engine Query execution Query optimizer Storage
  12. 12. Database engine components Storage engine Query execution Query optimizer Storage Complex but self-contained, has clean API, only changes when there’s new hardware.
  13. 13. Database engine components Storage engine Query execution Query optimizer Storage Complex bag of tricks, no formalism, constantly changing to adapt to data, query, disk characteristics.
  14. 14. ML engine components Feature engineering Model definition Training evaluation Data Bags of tricks, expert knowledge, experience, lots of trial and error
  15. 15. Advances in databases • Reasonable abstraction: relational DB • Hardware speedups • Pragmatic software implementation → Successful platform • Take-away lesson: fast computation engine + “good enough” execution plan
  16. 16. To advance ML platforms • ML will be end-user friendly when the platform is clever enough to handle less-than-optimal directions from the user • What needs to happen? • The complexity needs to be automated and wrapped away with neat interfaces between components • Fast components, “good enough” directions
  17. 17. GraphLab • Started as a research project at CMU in 2009 • Now a Seattle-based startup
  18. 18. The GraphLab Create™ Solution • Flexible, interoperable data types • SArray+SFrame+SGraph inter-translatable • dense list, sparse array, image, text, tables, graphs • Graceful transition between data sizes • SFrame: memory to disk to distributed • One environment, many substrates • Python front-end • Localhost, cluster, Hadoop, EC2 • End-to-end • Data ingestion + feature engineering + model building + deployment in a single environment
  19. 19. GraphLab Create ML Toolkits • Business Task: Recommender, Target, Social Match, … (for developers) • Machine Learning Task: Regression, Classification, Data Matching, … (for savvy developers and data scientists) • Algorithms & SDK: SVM, Matrix Factorization, LDA, … (for ML experts)
  20. 20. Demos
  21. 21. GLC SDK example • Task: fill in missing values in an array using the previous value • Existing solution: • E.g., use Pandas, a Python library providing in-memory dataframes • Problem: • Given, say, 25M rows and 50 cols, takes forever to even load the data
  22. 22. GLC SDK solution
      > cat fill.cpp
      #include <flexible_type/flexible_type.hpp>
      #include <unity/lib/toolkit_function_macros.hpp>
      #include <unity/lib/gl_sarray.hpp>
      using namespace graphlab;

      // Replace each missing value with the most recent non-missing value.
      gl_sarray fill(gl_sarray sa) {
        // Output writer: same element type as the input, one segment.
        gl_sarray_writer writer(sa.dtype(), 1);
        flexible_type last_value = sa[0];
        for (const auto &elem: sa.range_iterator()) {
          if (elem != FLEX_UNDEFINED) last_value = elem;
          writer.write(last_value, 0);  // write to segment 0
        }
        return writer.close();
      }

      BEGIN_FUNCTION_REGISTRATION
      REGISTER_FUNCTION(fill, "sa");
      END_FUNCTION_REGISTRATION
  23. 23. GLC SDK solution
      > cat Makefile
      all: fill.cpp
      	g++ -std=c++11 $^ -l graphlab -l ~/graphlab-dev/deps/shared-fPIC -o $@ -O3
      > python
      >>> import graphlab as gl
      >>> gl.ext_import('', 'example')
      >>> sa = gl.SArray([1, 2, 3, None, 6])
      >>> print gl.extensions.example.fill.fill(sa)
      [1, 2, 3, 3, 6]
  24. 24. Join the revolution! • Research methods to make the following efficient and automatic: • Feature engineering • Model selection • Model debugging • Problem formulation (??) • Develop novel algorithms on top of our SDK • Backed by scalable, flexible typed data structures • Automatic Python wrappers • Make them available to many other people • We’re hiring!
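
For readers who want to check the fill task from the GLC SDK example (slides 21 to 23) without the SDK, the following is a minimal pure-Python sketch of the same single-pass logic. It is not part of the original slides: the function name `fill_forward` is hypothetical, and `None` stands in for a missing value. It processes one element at a time, so it never needs the whole input in memory, which is the property the SDK version relies on.

```python
def fill_forward(values):
    """Yield each value, replacing a missing one (None) with the most
    recent non-missing value seen so far."""
    last_value = None  # stays None until the first non-missing element
    for elem in values:
        if elem is not None:
            last_value = elem
        yield last_value


if __name__ == "__main__":
    # Same input as the Python session on slide 23.
    print(list(fill_forward([1, 2, 3, None, 6])))  # prints [1, 2, 3, 3, 6]
```

Because `fill_forward` is a generator, it accepts any iterable, including a file read row by row. Leading missing values stay missing, since there is no earlier value to copy.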