Cloudera User Group - From the Lab to the Factory

This is the presentation that Cloudera's senior director of data science, Josh Wills, delivered at the Cloudera User Group (CUG) Chicago meeting on 12/3/13 and NYC meeting on 12/5/13.

Transcript

  • 1. From The Lab to the Factory: Building A Production Machine Learning Infrastructure. Josh Wills, Senior Director of Data Science, Cloudera
  • 2. One Other Thing About Me
  • 3. Data Science: Another Definition
  • 4. Data Scientists Build Data Products.
  • 5. A Shift In Perspective
      • Analytics in the Lab
          • Question-driven
          • Interactive
          • Ad-hoc, post-hoc
          • Fixed data
          • Focus on speed and flexibility
          • Output is embedded into a report or in-database scoring engine
      • Analytics in the Factory
          • Metric-driven
          • Automated
          • Systematic
          • Fluid data
          • Focus on transparency and reliability
          • Output is a production system that makes customer-facing decisions
  • 6. All* Products Become Data Products
  • 7. Identifying the Bottlenecks
  • 8. Oryx: Model Building and Serving
      • Algorithms
          • ALS Recommenders
          • K-Means
          • Parallel RDF
      • Batch model building via MapReduce*
      • Server for real-time scoring and updates
      • PMML 4.1 Models
  • 9. Oryx Design
  • 10. Generational Thinking
  • 11. The Limits of Our Models
  • 12. Space Exploration
  • 13. Data Science Needs DevOps
  • 14. Introducing Gertrude
      • Multivariate Testing
          • Define and explore a space of parameters
      • Overlapping Experiments
          • Tang et al. (2010)
          • Runs multiple independent experiments on every request
  • 15. Simple Conditional Logic
      • Declare experiment flags in compiled code
          • Settings that can vary per request
      • Create a config file that contains simple rules for calculating flag values and rules for experiment diversion
  • 16. Separate Data Push from Code Push
      • Validate config files and push updates to servers
          • Zookeeper via Curator
          • File-based
      • Servers pick up new configs, load them, and update experiment space and flag value calculations
  • 17. The Experiments Dashboard
  • 18. A Few Links I Love
      • http://research.google.com/pubs/pub36500.html
          • The original paper on the overlapping experiments infrastructure at Google
      • http://www.exp-platform.com/
          • Collection of all of Microsoft’s papers and presentations on their experimentation platform
      • http://www.deaneckles.com/blog/596_lossy-betterthan-lossless-in-online-bootstrapping/
          • Dean Eckles on his paper about bootstrapped confidence intervals with multiple dependencies
  • 19. Thank you! Josh Wills, Director of Data Science, Cloudera @josh_wills
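
The ideas on slides 14 to 16 (flags that vary per request, rules for experiment diversion, overlapping experiments, and config pushed separately from code) can be sketched roughly as follows. This is a minimal illustration in Python and is not Gertrude's actual API or config format. Every name here (`CONFIG_JSON`, `validate`, `bucket`, `ExperimentState`, the flag names, and the JSON layout) is a hypothetical stand-in. The sketch assumes diversion by hashing a request id together with a layer name, which is one common way to make layers independent.

```python
import hashlib
import json

# Hypothetical config: default flag values plus layers of experiments.
# Each layer diverts traffic on its own, so one request can be in one
# experiment per layer at the same time (overlapping experiments).
CONFIG_JSON = """
{
  "flags": {"num_results": 10, "ranker": "baseline"},
  "layers": [
    {"name": "ui", "buckets": 100, "experiments": [
      {"id": "more_results", "start": 0, "end": 10,
       "overrides": {"num_results": 20}}]},
    {"name": "ranking", "buckets": 100, "experiments": [
      {"id": "als_ranker", "start": 0, "end": 50,
       "overrides": {"ranker": "als"}}]}
  ]
}
"""


def validate(config):
    """Check a parsed config before it is allowed to go live."""
    declared = set(config["flags"])
    owner = {}  # flag name -> the only layer allowed to override it
    for layer in config["layers"]:
        taken = set()  # buckets already claimed within this layer
        for exp in layer["experiments"]:
            if not 0 <= exp["start"] < exp["end"] <= layer["buckets"]:
                raise ValueError(f"{exp['id']}: bad bucket range")
            span = set(range(exp["start"], exp["end"]))
            if taken & span:
                raise ValueError(f"{exp['id']}: buckets overlap in layer")
            taken |= span
            for flag in exp["overrides"]:
                if flag not in declared:
                    raise ValueError(f"{exp['id']}: undeclared flag {flag}")
                if owner.setdefault(flag, layer["name"]) != layer["name"]:
                    raise ValueError(f"{flag}: overridden in two layers")
    return config


def bucket(request_id, layer_name, buckets):
    """Hash the request id with the layer name so layers divert independently."""
    digest = hashlib.sha256(f"{layer_name}:{request_id}".encode()).hexdigest()
    return int(digest, 16) % buckets


class ExperimentState:
    """Holds the live config and computes flag values per request."""

    def __init__(self, config):
        self._config = validate(config)

    def reload(self, raw_json):
        # Data push: validate first, then swap. If validation raises,
        # the previous config stays in place.
        self._config = validate(json.loads(raw_json))

    def flags_for(self, request_id):
        config = self._config
        values = dict(config["flags"])  # start from the defaults
        active = []
        for layer in config["layers"]:
            b = bucket(request_id, layer["name"], layer["buckets"])
            for exp in layer["experiments"]:
                if exp["start"] <= b < exp["end"]:
                    values.update(exp["overrides"])
                    active.append(exp["id"])
        return values, active
```

A server built this way would call `flags_for` once per request and read settings from the returned dictionary, while an operator could call `reload` with new JSON (delivered through a file or a coordination service) without redeploying code. Restricting each flag to a single layer is what lets the layers run side by side without conflicting overrides.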