Production Machine Learning Infrastructure

Slides from Josh Wills' talk on building machine learning infrastructure at Data Day Texas 2014.

  1. From The Lab to the Factory: Building A Production Machine Learning Infrastructure. Josh Wills, Senior Director of Data Science, Cloudera
  2. What is a Data Scientist?
  3. One Definition…
  4. …versus Another
  5. The Two Kinds of Data Scientists
       • The Lab
           • Statisticians who got really good at programming
           • Neuroscientists, geneticists, etc.
       • The Factory
           • Software engineers who were in the wrong place at the wrong time
  6. Data Science In The Lab
  7. Data Science as Statistics
  8. Investigative Analytics
  9. Tools for Investigative Analytics
  10. Inputs and Outputs
  11. On Actionable Insights
  12. Data Science In The Factory
  13. Building Data Products
  14. A Shift In Perspective
       • Analytics in the Lab
           • Question-driven
           • Interactive
           • Ad-hoc, post-hoc
           • Fixed data
           • Focus on speed and flexibility
           • Output is embedded into a report or in-database scoring engine
       • Analytics in the Factory
           • Metric-driven
           • Automated
           • Systematic
           • Fluid data
           • Focus on transparency and reliability
           • Output is a production system that makes customer-facing decisions
  15. Data Science as Decision Engineering
  16. All* Products Become Data Products
  17. Sounds Great. So Who Is Doing This?
  18. From The Lab To The Factory
  19. The Art of Machine Learning
  20. A New Kind of Statistics
  21. DevOps for Data Science
  22. The Model: Information Retrieval
  23. From the Lab to the Factory: First Steps
  24. Step 1: Choose a Good Problem
  25. Step 2: DTSTCPWTM
  26. Step 3: Log Everything
  27. Step 4: Hire (More) Data Scientists
  28. Things We’re Working On
  29. The Data Science Workflow
  30. Identifying the Bottlenecks
  31. Myrrix
  32. Oryx: Simple and Scalable ML
  33. Generational Thinking
  34. Working on the Gaps
  35. Space Exploration
  36. The Limits of Our Models
  37. Gertrude: Experimenting with ML
       • Multivariate Testing
           • Define and explore a space of parameters
       • Overlapping Experiments
           • Tang et al. (2010)
           • Runs multiple independent experiments on every request
  38. Simple Conditional Logic
       • Declare experiment flags in compiled code
           • Settings that can vary per request
       • Create a config file that contains simple rules for calculating flag values and rules for experiment diversion
  39. Separate Data Push from Code Push
       • Validate config files and push updates to servers
           • Zookeeper via Curator
           • File-based
       • Servers pick up new configs, load them, and update experiment space and flag value calculations
  40. Computational Hypothesis Testing
  41. The Experiments Dashboard
  42. A Few Links I Love
       • http://research.google.com/pubs/pub36500.html
           • The original paper on the overlapping experiments infrastructure at Google
       • http://www.exp-platform.com/
           • Collection of all of Microsoft’s papers and presentations on their experimentation platform
       • http://www.deaneckles.com/blog/596_lossy-betterthan-lossless-in-online-bootstrapping/
           • Dean Eckles on his paper about bootstrapped confidence intervals with multiple dependencies
  43. One More Thing
  44. A Day In The Life of a Data Scientist
  45. On Functional Programming
  46. On Lineage
  47. Thank you! Josh Wills, Director of Data Science, Cloudera. @josh_wills
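
Slides 37 and 38 describe running multiple independent experiments on every request, with per-request flag values computed from simple rules. The slides do not show Gertrude's actual API, so the following is only a minimal Python sketch of the general idea of layered, hash-based diversion in the spirit of Tang et al. (2010). Every name in it (`Experiment`, `bucket_for`, `flags_for_request`, `NUM_BUCKETS`, the layer and flag names) is hypothetical, and the rules are plain Python objects here rather than the config file the slides mention.

```python
import hashlib
from dataclasses import dataclass, field

NUM_BUCKETS = 1000  # assumption: number of diversion buckets in each layer


@dataclass
class Experiment:
    """A hypothetical experiment definition (not Gertrude's real format)."""
    name: str
    layer: str        # experiments in different layers may overlap on a request
    buckets: range    # the buckets of the layer that this experiment owns
    overrides: dict = field(default_factory=dict)  # flag name -> value


def bucket_for(diversion_id: str, layer: str) -> int:
    """Hash the ID salted with the layer name, so each layer diverts independently."""
    digest = hashlib.sha256(f"{layer}:{diversion_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_BUCKETS


def flags_for_request(diversion_id, defaults, experiments):
    """Return (flag values, names of active experiments) for one request."""
    flags = dict(defaults)  # start from the default value of every flag
    active = []
    for exp in experiments:
        if bucket_for(diversion_id, exp.layer) in exp.buckets:
            flags.update(exp.overrides)
            active.append(exp.name)
    return flags, active


# Two experiments in separate layers, each taking half of its layer's traffic.
DEFAULTS = {"ranker": "baseline", "num_results": 10}
EXPERIMENTS = [
    Experiment("new_ranker", "ranking", range(0, 500), {"ranker": "candidate"}),
    Experiment("more_results", "ui", range(0, 500), {"num_results": 20}),
]

flags, active = flags_for_request("user-42", DEFAULTS, EXPERIMENTS)
print(flags, active)
```

Because the hash is salted with the layer name, a given ID lands in unrelated buckets in the "ranking" and "ui" layers, so one request can be in both experiments, one, or neither. Logging the `active` list alongside each request is one way to attribute outcomes to experiments later.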