From The Lab To The Factory: Building A Production Machine Learning Infrastructure

Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1kMUPAe.

Josh Wills discusses using Hadoop technologies to build real-time data analysis models with a focus on strategies for data integration, large-scale machine learning, and experimentation. Filmed at qconsf.com.

Josh Wills is the director of data science at Cloudera. Wills is one of the main contributors to Cloudera’s most recent open source project, Crunch, a Java library that aims to make writing, testing, and running MapReduce pipelines easy, efficient, and even fun. Prior to joining Cloudera, Wills was a software engineer at Google. He holds an M.S.E. in operations research and a B.S. in mathematics.

Published in: Technology, Education

  1. From The Lab to the Factory: Building A Production Machine Learning Infrastructure. Josh Wills, Senior Director of Data Science, Cloudera
  2. Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations/machine-learning-infrastructure
     InfoQ.com: News & Community Site
     • 750,000 unique visitors/month
     • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese)
     • Post content from our QCon conferences
     • News 15-20 / week
     • Articles 3-4 / week
     • Presentations (videos) 12-15 / week
     • Interviews 2-3 / week
     • Books 1 / month
  3. Presented at QCon San Francisco, www.qconsf.com
     Purpose of QCon
     - to empower software development by facilitating the spread of knowledge and innovation
     Strategy
     - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams
     - speakers and topics driving the evolution and innovation
     - connecting and catalyzing the influencers and innovators
     Highlights
     - attended by more than 12,000 delegates since 2007
     - held in 9 cities worldwide
  4. About Me
  5. What Do Data Scientists Do?
  6. What I Think I Do
  7. What Other People Think I Do
  8. What I Actually Do
  9. Data Science In the Lab
  10. Data Science as Statistics
  11. Investigative Analytics
  12. Tools for Investigative Analytics
  13. Inputs and Outputs
  14. On Actionable Insights
  15. Data Science in the Factory
  16. Building Data Products
  17. A Shift In Perspective
      Analytics in the Lab
      • Question-driven
      • Interactive
      • Ad-hoc, post-hoc
      • Fixed data
      • Focus on speed and flexibility
      • Output is embedded into a report or in-database scoring engine
      Analytics in the Factory
      • Metric-driven
      • Automated
      • Systematic
      • Fluid data
      • Focus on transparency and reliability
      • Output is a production system that makes customer-facing decisions
  18. Data Science as Decision Engineering
  19. All* Products Become Data Products
  20. From the Lab to the Factory: First Steps
  21. Step 1: Choose a Good Problem
  22. Step 2: DTSTCPWTM
  23. Step 3: Log Everything
  24. Step 4: Hire (More) Data Scientists
  25. Workflow Optimization
  26. The Data Science Workflow
  27. Identifying the Bottlenecks
  28. Myrrix
  29. Introducing Oryx
  30. Generational Thinking
  31. Oryx ALS Recommender Demo
  32. Rolling to Production
  33. The Limits of Our Models
  34. Space Exploration
  35. Data Science Needs DevOps
  36. Introducing Gertrude
      • Multivariate Testing
        • Define and explore a space of parameters
      • Overlapping Experiments
        • Tang et al. (2010)
        • Runs multiple independent experiments on every request
  37. Simple Conditional Logic
      • Declare experiment flags in compiled code
        • Settings that can vary per request
      • Create a config file that contains simple rules for calculating flag values and rules for experiment diversion
  38. Separate Data Push from Code Push
      • Validate config files and push updates to servers
        • Zookeeper via Curator
        • File-based
      • Servers pick up new configs, load them, and update experiment space and flag value calculations
  39. The Experiments Dashboard
  40. A Few Links I Love
      • http://research.google.com/pubs/pub36500.html
        • The original paper on the overlapping experiments infrastructure at Google
      • http://www.exp-platform.com/
        • Collection of all of Microsoft’s papers and presentations on their experimentation platform
      • http://www.deaneckles.com/blog/596_lossy-betterthan-lossless-in-online-bootstrapping/
        • Dean Eckles on his paper about bootstrapped confidence intervals with multiple dependencies
  41. Thank you! Josh Wills, Director of Data Science, Cloudera @josh_wills
  42. Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations/machine-learning-infrastructure
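
Slides 36 to 38 describe flags declared in code, config rules that set flag values and divert requests into experiments, and config pushes that are validated separately from code pushes. The Python sketch below illustrates that general pattern only. It is not Gertrude's API (Gertrude is a separate library and its interfaces are not shown in these slides). Every name here (`Experiment`, `ExperimentSpace`, `DEFAULT_FLAGS`, the flag names, the layer names) is hypothetical, and the choice of SHA-256 hashing into 1000 buckets per layer is an assumption made for illustration.

```python
import hashlib
from dataclasses import dataclass, field

NUM_BUCKETS = 1000  # assumed number of diversion buckets per layer

# Flags are declared in code with defaults (hypothetical flag names).
DEFAULT_FLAGS = {"ranker_weight": 1.0, "num_results": 10}


@dataclass
class Experiment:
    name: str
    layer: str         # experiments in one layer must not share buckets
    bucket_start: int  # inclusive
    bucket_end: int    # exclusive
    overrides: dict = field(default_factory=dict)


def bucket_for(layer: str, diversion_id: str) -> int:
    """Hash the id salted with the layer name, so layers divert independently."""
    digest = hashlib.sha256(f"{layer}:{diversion_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_BUCKETS


def validate(config) -> None:
    """Reject configs that touch undeclared flags or overlap within a layer."""
    by_layer = {}
    for exp in config:
        unknown = exp.overrides.keys() - DEFAULT_FLAGS.keys()
        if unknown:
            raise ValueError(f"{exp.name}: undeclared flags {sorted(unknown)}")
        if not 0 <= exp.bucket_start < exp.bucket_end <= NUM_BUCKETS:
            raise ValueError(f"{exp.name}: bad bucket range")
        by_layer.setdefault(exp.layer, []).append(exp)
    for exps in by_layer.values():
        exps.sort(key=lambda e: e.bucket_start)
        for prev, nxt in zip(exps, exps[1:]):
            if nxt.bucket_start < prev.bucket_end:
                raise ValueError(f"{prev.name} and {nxt.name} overlap")


class ExperimentSpace:
    """Holds the live config; load() swaps it in only if validation passes."""

    def __init__(self, config):
        self.load(config)

    def load(self, config) -> None:
        validate(config)              # a bad push leaves the old config active
        self._config = list(config)

    def flags_for(self, diversion_id: str):
        """Return (flag values, active experiment names) for one request."""
        flags = dict(DEFAULT_FLAGS)
        active = []
        for exp in self._config:
            bucket = bucket_for(exp.layer, diversion_id)
            if exp.bucket_start <= bucket < exp.bucket_end:
                flags.update(exp.overrides)
                active.append(exp.name)
        return flags, active


# Two layers: every request lands in one "ranking" experiment and in the
# single "ui" experiment, so two experiments run on each request.
EXAMPLE_CONFIG = [
    Experiment("ranker_control", "ranking", 0, 500),
    Experiment("ranker_boost", "ranking", 500, 1000, {"ranker_weight": 1.5}),
    Experiment("more_results", "ui", 0, 1000, {"num_results": 20}),
]
```

Calling `ExperimentSpace(EXAMPLE_CONFIG).flags_for("user-42")` returns the flag values and the experiments that the request was diverted into. Because the config is plain data, a new experiment can be rolled out by calling `load()` with an updated list, without redeploying the code that reads the flags.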
