Production Machine Learning Infrastructure


Slides from Josh Wills' talk on building machine learning infrastructure at Data Day Texas 2014.

Published in: Technology, Education

  • A popular definition. Also, an example of how correlation != causation.
  • A vastly superior definition. ;-) See also:
  • How I hate this definition.
  • Question-driven; interactive; ad-hoc, post-hoc; fixed data.
  • Tools focus on speed and flexibility.
  • The source of data is the data warehouse, the ultimate source of truth in the enterprise. The outputs are reports, charts, maybe a dashboard or two.
  • The output that most people seem to want is insights, specifically “actionable insights.” An actionable insight is one that allows us to make a clear decision: a useful correlation between a short-term behavior and a long-term outcome. They are pretty rare. You can basically build an entire business on a handful of actionable insights.
  • Data scientists love Venn diagrams. Harlan Harris recently created this one to explain data products, and he commented on his definition in this blog post: products combine software, domain expertise, and statistical modeling in order to solve a problem. We can compare data products to the combination of any two of these three aspects: One-off analyses done by an analyst or a statistician to help inform a decision are good, but turning them into repeatable and scalable processes in software is better. BI and stats tools are general purpose; they aren’t optimized for solving a specific problem in your business. Rules engines allow you to create maintainable software in the face of frequent policy changes, but they can be made smarter and more robust by bringing modeling and analysis to bear on the decisions they encode.
  • Curt Monash makes a distinction between investigative analytics (which he defines here: ) and operational analytics that I like, and I expanded it into my own set of differences that I want to walk through here. Investigative analytics is what we think of when we think of traditional BI: there’s an analyst or an executive who is searching for previously unknown patterns in a data set, either by looking at a series of visualizations mediated by database queries, or by applying some statistical models to a prepared data set to tease out some deeper explanations. This is where the vast majority of the BI market is focused right now. Operational analytics, on the other hand, is a nascent market, and I don’t believe the existing BI tools have done a good job of supporting companies that want to start leveraging their modeling and analytical prowess in order to make better decisions in real-time. I’d like to shift some of the conversation and the focus in the market from the lab to the factory.
  • Every customer interaction results in hundreds of decisions, both by us and by the customer. As interactions with customers move primarily to the digital realm, we have the opportunity to use data and modeling to optimize the very large number of small transactions we engage in with our customers. The number of decisions embedded in this page that would be amenable to statistical modeling and designed experiments is simply enormous: not just the price, but the wording, the images, the use of a timer, the selection of which upsell opportunity is right for the current customer, etc., etc.
  • * Slightly longer: All products of any consequence will become data products.
  • Basically nobody. Most models that get deployed to production get there in one of two ways: (1) In-database scoring, like for a marketing campaign. This isn’t really “production”: there’s not usually an SLA here or an ops person involved beyond the DBA. (2) Taking an existing model definition in SAS or R and converting it (often by hand) into C or Java code for use in a production server. This becomes THE MODEL, which is THE MODEL for the next six months to a year. Because this process is tedious and awful, we don’t do it very often, and it’s not a very glamorous software engineering assignment. Of course, there are a handful of companies that have been building and deploying models continuously for a while now, but that’s usually because their business depends on it (Google, FB, Twitter, LinkedIn, Amazon, etc.).
  • Machine learning is not an engineering discipline. Not even close. There are aspects of it that are familiar to software engineers, like pipeline building, but lots of things are lacking.
  • I suspect that we teach advanced statistics in a way that tends to scare off computer scientists by relying too heavily on parametric models that involve lots of integrals and multivariate calculus, instead of focusing on the non-parametric models that are primarily computational. I would like to create a course that taught advanced statistics (including bootstrapping) without requiring any calculus.
  • Data science needs devops. If we can’t deploy new code quickly, deploying new models and running experiments quickly isn’t going to happen.
  • Search is, for me, very much a data product. Daniel Tunkelang, one of the best data scientists in the world, is the head of search quality at LinkedIn. Ranking results is an information retrieval problem. Information retrieval is the model of what I would like to see happen with machine learning: IR made the leap from academic research area to a true engineering discipline that can be tackled by any reasonably clever engineer with Lucene/Solr/ElasticSearch.
  • A good problem is one that allows you to get fast feedback and take advantage of that feedback to improve your solution.
  • Do The Simplest Thing That Could Possibly Work. Don’t start with the super-advanced machine learning model until you know that the problem you’re solving is important enough to justify the work involved. A good rule of thumb: choose something that seems laughably simple. You’ll often be surprised at how effective it is, and it will be great material for me to use at other presentations.
  • Log files are the bread-and-butter of data science. They are the river Nile: they give life to data science teams. Three reasons: Raw and unfiltered: they reflect the reality of an event (usually an action that was taken by a user or a process) as it happened at the time, not mediated by anything else. Real-time: Apache Flume can pick log files up and transport them to our Hadoop cluster in a matter of minutes; I don’t need to wait a day for an ETL process to copy operational data into the EDW system before I can start answering questions. One of the most important places to log things is where decisions get made: either user decisions that we wish to understand better, or the decision points in our own internal workflows and processes that drive meaningful outcomes. In many businesses, these decision points involve business rules, either directly embedded in a business rules engine, or in code that is acting much like a business rules engine. The logs will be the primary input to our machine learning models, because they reflect what information was available to the system at the time a decision was made. This is one of the more obvious aspects of doing production machine learning, but it also seems to trip up most people at the get-go: a model that is trained on data that isn’t available to the system at the time a decision is made is at best a useful curiosity and at worst actively harmful.
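The point about logging at decision points can be sketched in a few lines of Python. This is a minimal illustration, not anything shown in the talk: the function name `log_decision`, the record fields, and the JSON-lines layout are all assumptions. The idea is only that the record captures exactly the inputs the system had when it decided, so a model trained on these logs sees nothing that was unavailable at decision time.

```python
import io
import json
import time
from typing import Any, Dict, Optional, TextIO


def log_decision(
    sink: TextIO,
    decision_point: str,
    features: Dict[str, Any],
    decision: Any,
    now: Optional[float] = None,
) -> Dict[str, Any]:
    """Append one JSON line describing a decision and its inputs.

    `features` should hold only what the system knew when it decided.
    All field names here are hypothetical, chosen for illustration.
    """
    record = {
        "ts": time.time() if now is None else now,
        "decision_point": decision_point,
        "features": features,
        "decision": decision,
    }
    sink.write(json.dumps(record, sort_keys=True) + "\n")
    return record


# Usage: an in-memory sink stands in for a real log file.
sink = io.StringIO()
log_decision(sink, "upsell_offer", {"cart_total": 42.0, "visits": 3}, "offer_b", now=1.0)
```

In a real deployment the sink would be a file that a log collector picks up; the in-memory buffer just keeps the sketch self-contained.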
  • If you have meaningful problems to work on and an environment that lets your people iterate on them quickly and try new ideas, you won’t need to try to hire data scientists. They’ll be beating down your door.
  • Most tools are focused on collapsing the interface between feature extraction and model fitting. We’d like to focus on collapsing the interface between model building and model serving.
  • Feature creation and model fitting. Lots of folks are focused on this space, because it’s so visible; it’s what data scientists spend most of their time doing, so finding ways to help them do it faster is an obviously good thing to do. But I think that there are other bottlenecks that are less obvious, because they are so narrow we don’t even bother to enter them in the first place, and I think that one of those bottlenecks is between building a model and putting it into production. And there are lots of reasons for this, primarily because it’s hard. Companies like Google/FB/LI/etc.
  • What attracted me to Myrrix wasn’t just the algorithms (because algorithms are commodities) but that they were thinking about these problems in the right way.
  • Oryx builds models and serves models, and that’s it. No visualization, no data munging, none of that stuff; there are plenty of great tools to choose from to help data scientists solve those problems.
  • The idea that feedback will be coming to the system in real-time is built into the computation and serving layers.
  • There are inevitably rules, tuning parameters, and additional logic that need to get deployed around any model that rolls into production. Because we can’t be completely sure how all of those parameters and settings will interact with each other and with our customers, we end up running lots of experiments to understand how changes impact user behavior, especially in cases where we can’t necessarily re-create the conditions that would make backtesting of the changes possible (examples of this.)
  • There is an inevitable gap between the lab environment and the factory, even after we ensure that everyone is operating on the same data sources by logging everything. The gap is that what the model fits is not the same thing as what the business is trying to optimize. (A couple of examples of this.)
  • Gertrude Cox studied math and statistics at Iowa State University, earning the first master’s degree in statistics ever granted by the university. When they asked her why she decided to study math, she said, “Because it was easy.” #badass
  • Really simple if-then logic. Easy enough for a data scientist (or even a product manager) to understand.
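As a rough illustration of what “really simple if-then logic” for experiment flags might look like, here is a Python sketch. It is not Gertrude’s API or config format: the rule layout, the names `bucket` and `flag_value`, and the flag `timer_seconds` are all hypothetical. It assumes a request is diverted into an experiment by hashing a stable identifier into one of 100 buckets, and that a flag takes a rule’s value when the bucket falls under the rule’s threshold.

```python
import hashlib
from typing import Any, Dict, List

# Hypothetical rule set, standing in for a config file pushed to servers.
RULES: List[Dict[str, Any]] = [
    {"flag": "timer_seconds", "experiment": "timer_test", "below": 10, "value": 30},
]
DEFAULTS: Dict[str, Any] = {"timer_seconds": 60}


def bucket(unit_id: str, experiment: str, buckets: int = 100) -> int:
    """Map an id to a stable bucket, salted by the experiment name."""
    digest = hashlib.sha256(f"{experiment}:{unit_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets


def flag_value(
    flag: str,
    unit_id: str,
    rules: List[Dict[str, Any]] = RULES,
    defaults: Dict[str, Any] = DEFAULTS,
) -> Any:
    """Return the first matching rule's value, else the flag's default."""
    for rule in rules:
        if rule["flag"] == flag and bucket(unit_id, rule["experiment"]) < rule["below"]:
            return rule["value"]
    return defaults[flag]


# Usage: the same id always receives the same value for a given rule set.
value = flag_value("timer_seconds", "user-123")
```

Salting the hash with the experiment name means one experiment’s assignment does not determine another’s, which is one simple way to let several experiments run on the same request.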
  • This is the part of the talk where the ops people freak out a little bit.
  • Another technique every data scientist should know:
  • Automate metric collection and confidence interval calculation. Make it stupid easy to not just run experiments, but evaluate their performance.
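One way to automate the confidence interval calculation is a percentile bootstrap, sketched below in Python using only the standard library. This is an illustrative assumption, not the method used in the talk’s dashboard: the function name `bootstrap_diff_ci` and its parameters are made up here, and the sketch treats observations as independent, so it does not handle the dependent-data case.

```python
import random
from statistics import mean
from typing import Sequence, Tuple


def bootstrap_diff_ci(
    control: Sequence[float],
    treatment: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for mean(treatment) - mean(control)."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    diffs = []
    for _ in range(n_resamples):
        # Resample each arm with replacement, at its original size.
        c = rng.choices(control, k=len(control))
        t = rng.choices(treatment, k=len(treatment))
        diffs.append(mean(t) - mean(c))
    diffs.sort()
    lo_idx = max(0, int((alpha / 2) * n_resamples))
    hi_idx = max(0, min(n_resamples - 1, int((1 - alpha / 2) * n_resamples) - 1))
    return diffs[lo_idx], diffs[hi_idx]


# Usage with made-up metric values for two experiment arms.
control = [0.0] * 50 + [1.0] * 50
treatment = [x + 5.0 for x in control]
lo, hi = bootstrap_diff_ci(control, treatment)
```

If the resulting interval excludes zero, the experiment arm differs from control at roughly the chosen level; wiring this into a dashboard is what makes evaluating experiments cheap.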
  • Most of what data scientists do (whether they’re in the lab or the factory) involves cleaning and transforming datasets. But for as much as we talk about this, we know relatively little about the process of what data scientists do and what techniques are most effective on different data sets. And this seems unfortunate to me.
  • I’ve been spending a lot of time with the Twitter guys, and it’s starting to get to me. Seriously, monads are pretty useful. In particular, the Writer Monad:
  • Playing around with lineage tracking for data transformations in R: by building logging into our data analysis tools, we can start to analyze the process of analysis. It’s a little meta, I know.
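The Writer monad and lineage ideas in the last two notes can be combined in a small sketch. The talk’s lineage experiments were in R; this Python version is only an assumed illustration, and the names `Writer`, `logged`, `drop_missing`, and `double` are hypothetical. Each transformation returns its result together with a log entry, and `bind` threads the accumulated log through the pipeline, so the final value carries a record of how it was produced.

```python
from dataclasses import dataclass
from typing import Any, Callable, Tuple


@dataclass(frozen=True)
class Writer:
    """A value paired with the log of steps that produced it."""
    value: Any
    log: Tuple[str, ...] = ()

    def bind(self, step: Callable[[Any], "Writer"]) -> "Writer":
        # Run the step on the value, then append its log to ours.
        result = step(self.value)
        return Writer(result.value, self.log + result.log)


def logged(description: str, fn: Callable[[Any], Any]) -> Callable[[Any], Writer]:
    """Wrap a plain function so it also emits a one-line log entry."""
    def step(value: Any) -> Writer:
        return Writer(fn(value), (description,))
    return step


drop_missing = logged("drop missing values", lambda rows: [r for r in rows if r is not None])
double = logged("multiply by 2", lambda rows: [r * 2 for r in rows])

result = Writer([1, None, 3]).bind(drop_missing).bind(double)
# result.value is [2, 6]; result.log lists the two steps in order.
```

Because the log travels with the data rather than living in a side channel, the sequence of cleaning steps itself becomes a dataset that can be analyzed later.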

    1. From The Lab to the Factory: Building A Production Machine Learning Infrastructure. Josh Wills, Senior Director of Data Science, Cloudera
    2. What is a Data Scientist?
    3. One Definition…
    4. …versus Another
    5. The Two Kinds of Data Scientists
         • The Lab: Statisticians who got really good at programming; Neuroscientists, geneticists, etc.
         • The Factory: Software engineers who were in the wrong place at the wrong time
    6. Data Science In The Lab
    7. Data Science as Statistics
    8. Investigative Analytics
    9. Tools for Investigative Analytics
    10. Inputs and Outputs
    11. On Actionable Insights
    12. Data Science In The Factory
    13. Building Data Products
    14. A Shift In Perspective
         • Analytics in the Lab: Question-driven; Interactive; Ad-hoc, post-hoc; Fixed data; Focus on speed and flexibility; Output is embedded into a report or in-database scoring engine
         • Analytics in the Factory: Metric-driven; Automated; Systematic; Fluid data; Focus on transparency and reliability; Output is a production system that makes customer-facing decisions
    15. Data Science as Decision Engineering
    16. All* Products Become Data Products
    17. Sounds Great. So Who Is Doing This?
    18. From The Lab To The Factory
    19. The Art of Machine Learning
    20. A New Kind of Statistics
    21. DevOps for Data Science
    22. The Model: Information Retrieval
    23. From the Lab to the Factory: First Steps
    24. Step 1: Choose a Good Problem
    25. Step 2: DTSTCPWTM
    26. Step 3: Log Everything
    27. Step 4: Hire (More) Data Scientists
    28. Things We’re Working On
    29. The Data Science Workflow
    30. Identifying the Bottlenecks
    31. Myrrix
    32. Oryx: Simple and Scalable ML
    33. Generational Thinking
    34. Working on the Gaps
    35. Space Exploration
    36. The Limits of Our Models
    37. Gertrude: Experimenting with ML
         • Multivariate Testing: Define and explore a space of parameters
         • Overlapping Experiments: Tang et al. (2010); Runs multiple independent experiments on every request
    38. Simple Conditional Logic
         • Declare experiment flags in compiled code: Settings that can vary per request
         • Create a config file that contains simple rules for calculating flag values and rules for experiment diversion
    39. Separate Data Push from Code Push
         • Validate config files and push updates to servers: Zookeeper via Curator; File-based
         • Servers pick up new configs, load them, and update experiment space and flag value calculations
    40. Computational Hypothesis Testing
    41. The Experiments Dashboard
    42. A Few Links I Love
         • Collection of all of Microsoft’s papers and presentations on their experimentation platform
         • The original paper on the overlapping experiments infrastructure at Google
         • Dean Eckles on his paper about bootstrapped confidence intervals with multiple dependencies
    43. One More Thing
    44. A Day In The Life of a Data Scientist
    45. On Functional Programming
    46. On Lineage
    47. Thank you! Josh Wills, Director of Data Science, Cloudera @josh_wills