Cloudera User Group - From the Lab to the Factory

This is the presentation that Josh Wills, Cloudera's senior director of data science, delivered at the Cloudera User Group (CUG) meetings in Chicago on 12/3/13 and in NYC on 12/5/13.

Published in: Technology, Education

  1. From the Lab to the Factory: Building a Production Machine Learning Infrastructure. Josh Wills, Senior Director of Data Science, Cloudera
  2. One Other Thing About Me
  3. Data Science: Another Definition
  4. Data Scientists Build Data Products.
  5. A Shift In Perspective
     Analytics in the Lab
       • Question-driven
       • Interactive
       • Ad-hoc, post-hoc
       • Fixed data
       • Focus on speed and flexibility
       • Output is embedded into a report or in-database scoring engine
     Analytics in the Factory
       • Metric-driven
       • Automated
       • Systematic
       • Fluid data
       • Focus on transparency and reliability
       • Output is a production system that makes customer-facing decisions
  6. All* Products Become Data Products
  7. Identifying the Bottlenecks
  8. Oryx: Model Building and Serving
       • Algorithms
           • ALS Recommenders
           • K-Means
           • Parallel RDF
       • Batch model building via MapReduce*
       • Server for real-time scoring and updates
       • PMML 4.1 Models
  9. Oryx Design
  10. Generational Thinking
  11. The Limits of Our Models
  12. Space Exploration
  13. Data Science Needs DevOps
  14. Introducing Gertrude
        • Multivariate Testing
            • Define and explore a space of parameters
        • Overlapping Experiments
            • Tang et al. (2010)
            • Runs multiple independent experiments on every request
  15. Simple Conditional Logic
        • Declare experiment flags in compiled code
            • Settings that can vary per request
        • Create a config file that contains simple rules for calculating flag values and rules for experiment diversion
  16. Separate Data Push from Code Push
        • Validate config files and push updates to servers
            • ZooKeeper via Curator
            • File-based
        • Servers pick up new configs, load them, and update experiment space and flag value calculations
  17. The Experiments Dashboard
  18. A Few Links I Love
        • http://research.google.com/pubs/pub36500.html
            • The original paper on the overlapping experiments infrastructure at Google
        • http://www.exp-platform.com/
            • Collection of all of Microsoft’s papers and presentations on their experimentation platform
        • http://www.deaneckles.com/blog/596_lossy-betterthan-lossless-in-online-bootstrapping/
            • Dean Eckles on his paper about bootstrapped confidence intervals with multiple dependencies
  19. Thank you! Josh Wills, Director of Data Science, Cloudera @josh_wills
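
The Gertrude slides describe flags whose values can vary per request, config rules that compute those values, and diversion of every request into multiple independent experiments. The sketch below illustrates that general idea only; it is not Gertrude's API or config format. Every name in it (`Experiment`, `bucket_for`, `flag_values`, `NUM_BUCKETS`, the layer names, the flag names) is hypothetical, and the hash-per-layer bucketing is one common way to get independent diversion, assumed here for illustration.

```python
import hashlib
from dataclasses import dataclass

# Hypothetical number of diversion buckets per layer (an assumption for this sketch).
NUM_BUCKETS = 1000


@dataclass(frozen=True)
class Experiment:
    """Hypothetical experiment: owns a bucket range in one layer and overrides some flags."""
    name: str
    layer: str
    buckets: range
    overrides: dict


def bucket_for(layer: str, diversion_id: str) -> int:
    """Deterministically map a request's diversion id to a bucket.

    Salting the hash with the layer name makes bucket assignment in one
    layer effectively independent of assignment in any other layer.
    """
    digest = hashlib.sha256(f"{layer}:{diversion_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_BUCKETS


def flag_values(defaults: dict, experiments: list, diversion_id: str):
    """Return (flag values, active experiment names) for one request."""
    values = dict(defaults)  # copy so the shared defaults are never mutated
    active = []
    claimed_layers = set()
    for exp in experiments:
        if exp.layer in claimed_layers:
            continue  # at most one active experiment per layer
        if bucket_for(exp.layer, diversion_id) in exp.buckets:
            values.update(exp.overrides)
            active.append(exp.name)
            claimed_layers.add(exp.layer)
    return values, active


if __name__ == "__main__":
    defaults = {"num_results": 10, "ranker": "baseline"}
    experiments = [
        Experiment("more_results", "ui", range(0, 100), {"num_results": 20}),
        Experiment("new_ranker", "ranking", range(0, 100), {"ranker": "als"}),
    ]
    print(flag_values(defaults, experiments, "user-42"))
```

Because the experiment list and defaults are plain data, they could be loaded from a validated config file and swapped at runtime without redeploying code, which is the separation of data push from code push that slide 16 describes.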