A popular definition. Also, an example of how correlation != causation.
A vastly superior definition. ;-) See also: http://www.quora.com/Data-Science/What-is-the-difference-between-a-data-scientist-and-a-statistician/answer/Josh-Wills
How I hate this definition.
Question-driven. Interactive. Ad-hoc, post-hoc. Fixed data.
Tools focus on speed and flexibility.
The source of data is the data warehouse, the ultimate source of truth in the enterprise. The outputs are reports, charts, maybe a dashboard or two.
The output that most people seem to want is insights: specifically, “actionable insights.” An actionable insight is one that allows us to make a clear decision: a useful correlation between a short-term behavior and a long-term outcome. They are pretty rare. You can basically build an entire business on a handful of actionable insights.
Data scientists love Venn diagrams. Harlan Harris recently created this one to explain data products, and he commented on his definition in this blog post: http://datacommunitydc.org/blog/2013/09/the-data-products-venn-diagram/
Data products combine software, domain expertise, and statistical modeling in order to solve a problem. We can compare data products to the combination of any two of these three aspects:
One-off analyses done by an analyst or a statistician to help inform a decision are good, but building them into software as repeatable, scalable processes is better.
BI and stats tools are general purpose; they aren’t optimized for solving a specific problem in your business.
Rules engines allow you to create maintainable software in the face of frequent policy changes, but they can be made smarter and more robust by bringing modeling and analysis to bear on the decisions they encode.
Curt Monash makes a distinction between investigative analytics (which he defines here: http://www.dbms2.com/2011/03/03/investigative-analytics/ ) and operational analytics that I like, and I expanded it into my own set of differences that I want to walk through here.
Investigative analytics is what we think of when we think of traditional BI: there’s an analyst or an executive who is searching for previously unknown patterns in a data set, either by looking at a series of visualizations mediated by database queries, or by applying some statistical models to a prepared data set to tease out some deeper explanations. This is where the vast majority of the BI market is focused right now.
Operational analytics, on the other hand, is a nascent market, and I don’t believe the existing BI tools have done a good job of supporting companies that want to start leveraging their modeling and analytical prowess in order to make better decisions in real-time. I’d like to shift some of the conversation and the focus in the market from the lab to the factory.
Every customer interaction results in hundreds of decisions, both by us and by the customer.
As interactions with customers move primarily to the digital realm, we have the opportunity to use data and modeling to optimize the very large number of small transactions we engage in with our customers.
The number of decisions embedded in this page that would be amenable to statistical modeling and designed experiments is simply enormous: not just the price, but the wording, the images, the use of a timer, the selection of which upsell opportunity is right for the current customer, etc., etc.
* Slightly longer: All products of any consequence will become data products.
Basically nobody. Most models that get deployed to production get there in one of two ways:
In-database scoring, like for a marketing campaign. This isn’t really “production”: there’s not usually an SLA here or an ops person involved beyond the DBA.
Taking an existing model definition in SAS or R and converting it (often by hand) into C or Java code for use in a production server. This becomes THE MODEL, which is THE MODEL for the next six months to a year. Because this process is tedious and awful, we don’t do it very often, and it’s not a very glamorous software engineering assignment.
Of course, there are a handful of companies that have been building and deploying models continuously for a while now, but that’s usually because their business depends on it (Google, FB, Twitter, LinkedIn, Amazon, etc.).
Machine learning is not an engineering discipline. Not even close. There are aspects of it that are familiar to software engineers, like pipeline building, but lots of things are lacking.
I suspect that we teach advanced statistics in a way that tends to scare off computer scientists by relying too heavily on parametric models that involve lots of integrals and multivariate calculus, instead of focusing on the non-parametric models that are primarily computational. I would like to create a course that taught advanced statistics (including bootstrapping) without requiring any calculus.
Data science needs devops. If we can’t deploy new code quickly, deploying new models and running experiments quickly isn’t going to happen.
Search is, for me, very much a data product. Daniel Tunkelang, one of the best data scientists in the world, is the head of search quality at LinkedIn.
Ranking results is an information retrieval problem.
Information retrieval is the model of what I would like to see happen with machine learning: IR made the leap from academic research area to a true engineering discipline that can be tackled by any reasonably clever engineer with Lucene/Solr/ElasticSearch.
A good problem is one that allows you to get fast feedback and take advantage of that feedback to improve your solution. http://uxmag.com/articles/you-are-solving-the-wrong-problem
Do The Simplest Thing That Could Possibly Work. Don’t start with the super-advanced machine learning model until you know that the problem you’re solving is important enough to justify the work involved.
A good rule of thumb: choose something that seems laughably simple. You’ll often be surprised at how effective it is, and it will be great material for me to use at other presentations.
Log files are the bread-and-butter of data science. They are the river Nile: they give life to data science teams. Three reasons:
Raw and unfiltered: they reflect the reality of an event (usually an action that was taken by a user or a process) as it happened at the time, not mediated by anything else.
Real-time: Apache Flume can pick log files up and transport them to our Hadoop cluster in a matter of minutes; I don’t need to wait a day for an ETL process to copy operational data into the EDW system before I can start answering questions.
One of the most important places to log things is where decisions get made: either user decisions that we wish to understand better, or the decision points in our own internal workflows and processes that drive meaningful outcomes. In many businesses, these decision points involve business rules, either directly embedded in a business rules engine, or in code that is acting much like a business rules engine.
The logs will be the primary input to our machine learning models, because they reflect what information was available to the system at the time a decision was made. This is one of the more obvious aspects of doing production machine learning, but it also seems to trip up most people at the get-go: a model that is trained on data that isn’t available to the system at the time a decision is made is at best a useful curiosity and at worst actively harmful.
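A minimal sketch of logging at a decision point, so that the training data contains exactly what the system knew when it decided. Everything here is hypothetical: the `log_decision` helper, the record fields, and the `choose_upsell` rule are illustrations, not part of any tool mentioned in this talk.

```python
import io
import json
import time
import uuid


def log_decision(sink, decision_point, features, decision):
    """Append one JSON line recording a decision and the inputs it was based on."""
    record = {
        "event_id": str(uuid.uuid4()),     # lets later outcome events be joined back
        "timestamp": time.time(),          # when the decision was made
        "decision_point": decision_point,  # which rule or workflow step decided
        "features": features,              # only what was known at decision time
        "decision": decision,
    }
    sink.write(json.dumps(record, sort_keys=True) + "\n")
    return record


def choose_upsell(features):
    """Stand-in business rule (hypothetical): offer an upsell on large carts."""
    return "premium" if features["cart_total"] >= 100 else "none"


# Usage: an in-memory sink stands in for a real log file.
sink = io.StringIO()
features = {"cart_total": 120.0, "returning_customer": True}
decision = choose_upsell(features)
log_decision(sink, "upsell_offer", features, decision)
```

Because the features are written at the moment of the decision, a model trained on these lines cannot accidentally depend on data that only shows up later.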
If you have meaningful problems to work on and an environment that lets your people iterate on them quickly and try new ideas, you won’t need to try to hire data scientists. They’ll be beating down your door.
Most tools are focused on collapsing the interface between feature extraction and model fitting. We’d like to focus on collapsing the interface between model building and model serving.
Feature creation and model fitting. Lots of folks are focused on this space, because it’s so visible; it’s what data scientists spend most of their time doing, so finding ways to help them do it faster is an obviously good thing to do.
But I think that there are other bottlenecks that are less obvious, because they are so narrow we don’t even bother to enter them in the first place, and I think that one of those bottlenecks is between building a model and putting it into production. And there are lots of reasons for this, primarily because it’s hard. Companies like Google/FB/LI/etc.
What attracted me to Myrrix wasn’t just the algorithms (because algorithms are commodities) but that they were thinking about these problems in the right way.
Oryx builds models and serves models. That’s it. No visualization, no data munging, none of that stuff; there are plenty of great tools to choose from to help data scientists solve those problems. http://github.com/cloudera/oryx
The idea that feedback will be coming to the system in real-time is built into the computation and serving layers.
There are inevitably rules, and tuning parameters, and additional logic that needs to get deployed around any model that rolls into production. And because we can’t be completely sure how all of those parameters and settings will interact with each other, and with our customers, we end up running lots of experiments to understand how changes impact user behavior, especially in cases where we can’t necessarily re-create the conditions that would make backtesting of the changes possible (examples of this.)
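One common way to run lots of experiments at once is to assign each user to an arm by hashing, so the assignment is stable across requests and independent across experiments. The sketch below is a generic illustration of that idea; the function names, the experiment names, and the bucket count are assumptions, and this is not a description of how any particular framework does it.

```python
import hashlib


def bucket(unit_id, experiment, num_buckets=1000):
    """Map a unit (e.g. a user id) to a stable bucket for one experiment."""
    key = f"{experiment}:{unit_id}".encode("utf-8")
    # Salting the hash with the experiment name keeps assignments in
    # different experiments effectively independent of each other.
    return int(hashlib.sha256(key).hexdigest(), 16) % num_buckets


def assign_arm(unit_id, experiment, treatment_fraction=0.1, num_buckets=1000):
    """Return 'treatment' for roughly treatment_fraction of units, else 'control'."""
    cutoff = int(treatment_fraction * num_buckets)
    if bucket(unit_id, experiment, num_buckets) < cutoff:
        return "treatment"
    return "control"


# Usage: the same user always lands in the same arm of a given experiment.
arm = assign_arm("user-42", "ranker_threshold_v2")
```

No assignment table has to be stored: any server can recompute the arm from the user id and the experiment name.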
There is an inevitable gap between the lab environment and the factory, even after we ensure that everyone is operating on the same data sources by logging everything. The gap is that what the model fits is not the same thing as what the business is trying to optimize. (A couple of examples of this.)
Gertrude Cox studied math and statistics at Iowa State University, earning the first master’s degree in statistics ever granted by the university. When they asked her why she decided to study math, she said, “Because it was easy.” #badass
Really simple if-then logic. Easy enough for a data scientist (or even a product manager) to understand.
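To make "really simple if-then logic" concrete, here is a sketch of a flag whose value is computed from a short list of rules, first match wins. The config layout, the flag name `RANKER_THRESHOLD`, and the context keys are all made up for illustration; they are not the actual config format of any tool.

```python
def evaluate_flag(flag, context):
    """Return the value of the first rule whose conditions all match the context."""
    for rule in flag["rules"]:
        conditions = rule["when"]
        if all(context.get(key) == expected for key, expected in conditions.items()):
            return rule["value"]
    return flag["default"]  # no rule matched


# Hypothetical flag definition, as it might be loaded from a config file.
RANKER_THRESHOLD = {
    "default": 0.5,
    "rules": [
        {"when": {"country": "US", "platform": "mobile"}, "value": 0.7},
        {"when": {"country": "US"}, "value": 0.6},
    ],
}

# Usage: a US desktop request matches the second rule.
threshold = evaluate_flag(RANKER_THRESHOLD, {"country": "US", "platform": "desktop"})
```

Since the rules are plain data, changing a threshold is a config edit rather than a code change, and anyone who can read an if-then statement can review it.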
This is the part of the talk where the ops people freak out a little bit.
Another technique every data scientist should know: http://en.wikipedia.org/wiki/Bootstrapping_(statistics)
Automate metric collection and confidence interval calculation. Make it stupid easy to not just run experiments, but evaluate their performance.
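As a sketch of what automated confidence interval calculation can look like, here is a plain percentile bootstrap for a metric. It assumes the observations are independent, which real experiment data often is not (that is what the Eckles link at the end is about), and the function name and defaults are my own choices for illustration.

```python
import random
import statistics


def bootstrap_ci(values, stat=statistics.mean, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for stat(values)."""
    rng = random.Random(seed)  # fixed seed so reruns give the same interval
    n = len(values)
    # Resample the data with replacement and recompute the statistic each time.
    estimates = sorted(
        stat([values[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    lo_index = int((alpha / 2) * n_resamples)
    hi_index = int((1 - alpha / 2) * n_resamples) - 1
    return estimates[lo_index], estimates[hi_index]


# Usage: a 95% interval for the mean of some per-user metric values.
metric_values = [float(v) for v in range(1, 101)]
low, high = bootstrap_ci(metric_values)
```

No integrals required: the whole method is resampling, recomputing, and sorting, and swapping in `statistics.median` or any other function of the sample works the same way.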
Most of what data scientists do (whether they’re in the lab or the factory) involves cleaning and transforming datasets. But for as much as we talk about this, we know relatively little about the process of what data scientists do and what techniques are most effective on different data sets. And this seems unfortunate to me.
I’ve been spending a lot of time with the Twitter guys, and it’s starting to get to me. Seriously, monads are pretty useful. In particular, the Writer Monad: http://learnyouahaskell.com/for-a-few-monads-more
Playing around with lineage tracking for data transformations in R: https://github.com/jwills/lineage
By building logging into our data analysis tools, we can start to analyze the process of analysis. It’s a little meta, I know.
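The idea of carrying a log alongside the data can be sketched in a few lines. This is loosely in the spirit of the Writer monad, simplified, and written in Python rather than R; the `Traced` class is hypothetical and is not the API of the lineage package linked above.

```python
class Traced:
    """A value paired with a log of the steps that produced it."""

    def __init__(self, value, log=()):
        self.value = value
        self.log = tuple(log)

    def apply(self, description, func):
        """Apply func to the value and append description to the log."""
        return Traced(func(self.value), self.log + (description,))


# Usage: every transformation records itself, so the cleaned data
# carries its own lineage.
rows = Traced([3, None, 5, 12, None, 7], log=("load raw rows",))
cleaned = (
    rows
    .apply("drop missing values", lambda xs: [x for x in xs if x is not None])
    .apply("cap values at 10", lambda xs: [min(x, 10) for x in xs])
)
# cleaned.value is [3, 5, 10, 7]
# cleaned.log is ("load raw rows", "drop missing values", "cap values at 10")
```

Collect those logs across many analyses and they become a dataset about how analysis actually gets done.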
From The Lab to the Factory
Building A Production Machine Learning Infrastructure
Josh Wills, Senior Director of Data Science
The Two Kinds of Data Scientists
Statisticians who got
really good at
Software engineers who
were in the wrong place
at the wrong time
A Shift In Perspective
Analytics in the Lab
• Ad-hoc, post-hoc
• Fixed data
• Focus on speed and
• Output is embedded into a
report or in-database
Analytics in the Factory
Focus on transparency and
Output is a production
system that makes
Gertrude: Experimenting with ML
Define and explore a
space of parameters
Tang et al. (2010)
experiments on every
Simple Conditional Logic
flags in compiled code
Settings that can vary
Create a config file that
contains simple rules
for calculating flag
values and rules for
Separate Data Push from Code Push
Validate config files and
push updates to servers
Zookeeper via Curator
Servers pick up new
configs, load them, and
space and flag value
A Few Links I Love
Collection of all of Microsoft’s papers and presentations on
their experimentation platform
The original paper on the overlapping experiments
infrastructure at Google
Dean Eckles on his paper about bootstrapped confidence
intervals with multiple dependencies