2. Operationalizing Analytics To Scale
Many companies have invested time and money into building sophisticated data pipelines that can move massive amounts of data in (near) real time. However, for the analyst or data scientist who builds models offline, integrating their analyses into these pipelines for operational purposes can pose a challenge.
In this workshop, we will discuss some key technologies and workflows companies can leverage to build end-to-end solutions for automating analytical, statistical, and machine-learning work: from collection and storage to analysis and real-time predictions.
Abstract
9. ● Introduction
● What Are We Talking About Exactly?
● The Problem at Hand
● Operationalizing Analytics
● Operationalizing Predictive Analytics
● Questions
Agenda
12. Introduction
● I work on the Internal Data team at Looker.
● Before Looker, I worked in consulting and research.
● Looker is a business intelligence tool.
14. What are we talking about?
● What do I mean when I say “operationalizing”?
● Why is this important?
16. ● Analysts are providing basic reports for the entire business.
● Analysts and Data Scientists are building offline models.
The Problem at Hand
19. The Problem With Offline Models
● Offline analyses aren’t associated with particularly quick turnaround times.
● Offline analyses aren’t particularly collaborative.
● Offline analyses aren’t particularly portable.
20. A Potential Set-up (Straw Man)
[Diagram: Data Sources → (http) → Data Stores → (query) → Analysis → Consumption]
24. ● These metrics are vanilla.
● These metrics are critical.
● The business would probably be better served if Data Scientists and Analysts were spending their time answering questions that require deep technical knowledge.
Operationalizing Analytics - The Simple Case
28. ● Build or buy a workhorse ETL tool.
● Move toward an Operational Data Store (ODS), reducing the need for postprocessing and data “mashups.”
● Emphasize self-service wherever possible.
● Analytics should slot into the existing infrastructure with minimal friction.
Operationalizing Analytics - A How To
35. ● An XML-based model-storage format.
● Created and maintained by the Data Mining Group.
● Most commonly used statistical/machine learning models are supported.
A Model Standard - PMML
40. JPMML
● JPMML is an open-source API for evaluating PMML files.
● In essence, we equip the JPMML application with our PMML file, serve it up with new data, and it provides us with predictions.
● Openscoring.io distributes various JPMML APIs and UDFs: for example, a RESTful API, Heroku, Hive, Pig, Cascading, and PostgreSQL.
● All we have to do is write some code that fetches new values, serves them up to the JPMML API, captures the predictions, then pushes them back to a database.
45. Heroku: git push heroku master
REST: curl -X PUT --data-binary @BayesLeadScore.pmml -H "Content-type: text/xml" http://ec2_endpoint/openscoring/model/BayesLeadScore
Deploy Model - PUT /model/${id}
46. cURLing or navigating to
http://heroku_endpoint/openscoring/model/BayesLeadScore
or
http://ec2_endpoint/openscoring/model/BayesLeadScore
will display our PMML model.
View Model - GET /model/${id}
47. Test Model - POST /model/${id}
newLead.json:
{
  "id": "001",
  "arguments": {
    "country": "US",
    "budget": 7.8
  }
}
Send request to JPMML API:
curl -X POST --data-binary @newLead.json -H "Content-type: application/json" http://ec2_endpoint/openscoring/model/BayesLeadScore
53. What About Truly Big Data?
● For the rare few of us who need to make real-time predictions against millions of rows per second, there’s a popular Apache suite to handle this.
*image borrowed from OryxProject
What I mean is to automate (as much as possible) the creation, dissemination, and application of analyses so that they can be used in high-volume or fast-paced, tactical decision-making. This carries with it a hard requirement for data pipelines and workflows to scale.
This seems obvious. Many people require data to inform their choices. Reducing friction in the production and dissemination of analyses benefits both consumers and producers of analytics: consumers can respond to changes more quickly, and producers free up time to do more or to focus on depth of analysis.
Their time is better spent doing more in-depth analyses. They are an information bottleneck.
The process by which people create predictive analyses is not as efficient as it could be. Typically, when new batches of data come in, analysts retrieve these data, rebuild their model and make new predictions, then they disseminate this information somehow. These people need to start thinking like engineers.
There’s not much to do about this. However, if we do some work up front, we might be able to automate more of this process.
While analysts can share R or Python scripts, it’s not immediately obvious from the code alone what is going on. To collaborate, one must first reproduce others’ analyses before contributing.
That is, they are not easily ported from R to Python to Matlab, etc.
A somewhat standard analytics pipeline that companies may have is akin to this:
Data from mobile and web applications, APIs, and public data sources is collected.
Data is stored in relational and/or nonrelational data stores.
Data is queried, transformed, and analyzed.
Decision-makers consume the data and analyses, often as a report or dashboard, which then feeds back into the pipeline as product changes, etc.
There are potential efficiency gains:
Getting data out of a store into analysts’ hands.
Presentation of reports and analyses to decision-makers.
Feedback into product development, engineering, sales, marketing, etc.
Predictive analyses are strictly offline.
This class of metric is not particularly sexy. However, they are metrics people need in order to do their day-to-day operations: “Are we on track to meet our sales targets?” “How many users saw a particular marketing campaign yesterday?” “Is supply low in a certain region?”
For most businesses, these metrics address the majority of questions people need answered—they are the metrics that keep the business humming.
Typically, we encounter situations where a few developers and analysts supply an entire organization with data and analytics. This tends to create a bottleneck where one doesn't need to exist.
[stated this problem earlier. no need to dwell on it too much.]
On the buying front, there is a litany of choices, and a lot depends on which data sources and destinations are at play. However, some solutions we commonly encounter are fivetran, alooma, bigsynx, informatica cloud, and datavirtuality.
For those who prefer to build over buy, we’re talking about custom jobs written in a scripting language (shell, Ruby, Python) with some sort of dependency/workflow management tool, like Luigi or Airflow.
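To make that concrete, here is a minimal sketch of what such a custom job might look like with Luigi; the task names, file paths, and event payload are hypothetical placeholders, not a prescription.

# Minimal Luigi sketch of a custom extract-then-load job.
# Paths and the event source below are hypothetical placeholders.
import json
import luigi


class ExtractEvents(luigi.Task):
    date = luigi.DateParameter()

    def output(self):
        return luigi.LocalTarget("data/raw/events_{}.json".format(self.date))

    def run(self):
        # In practice, pull from an API or a source database here.
        events = [{"user_id": 1, "action": "login"}]
        with self.output().open("w") as f:
            json.dump(events, f)


class LoadEvents(luigi.Task):
    date = luigi.DateParameter()

    def requires(self):
        return ExtractEvents(date=self.date)

    def output(self):
        return luigi.LocalTarget("data/loaded/events_{}.marker".format(self.date))

    def run(self):
        with self.input().open("r") as f:
            events = json.load(f)
        # Transform and load into the warehouse here (e.g., COPY into Redshift),
        # then write a marker file so Luigi knows the task completed.
        with self.output().open("w") as f:
            f.write("loaded {} events".format(len(events)))


if __name__ == "__main__":
    luigi.run()

Running python etl.py LoadEvents --date 2015-06-01 --local-scheduler would execute the extract task and then the load, and Luigi skips any task whose output already exists.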
As companies grow, scaling ETL processes may pose a few problems if they’ve built their own tool. Admittedly, moving unstructured data is relatively easy. Even as the size and complexity of data increases, transformations don’t really come into play. We can just stuff more key-value pairs into our JSON objects and stash them in a NoSQL database. For companies making use of relational data stores, however, ever-changing schemas will undoubtedly make the ETL process more difficult.
There are clear tradeoffs here: an unstructured data store may come with lower accessibility in favor of simplicity in the data movement phase. Conversely, relational databases are rigid and require work to get large amounts of complex data into the correct format. Typically, however, SQL is the querying language with which most analysts are familiar, so accessibility becomes the upside.
This will reduce the need for continual data pulls and postprocessing in some desktop tool, both of which are slow and work intensive. By storing everything in an ODS, disparate data can be joined in the database. This is likely faster than the alternative; it’s also more conducive to automation.
Redshift is a favorite at Looker. It scales quickly and cost-effectively, relative to other MPP RDBMS offerings. Additionally, we’ve seen great promise with both Spark and Impala. Spark has a leg up on a lot of its competitors: it’s new; it does most of its heavy lifting in-memory rather than reading from and writing to disk; it scales very well; it’s feature-rich, with a built-in machine-learning library; and it has easy-to-use SQL, Scala, and Python interfaces.
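As a rough sketch of what Spark’s Python/SQL interface looks like when joining disparate data directly in the store (the table locations and column names here are hypothetical):

# Rough PySpark sketch: join disparate sources in the store via Spark SQL.
# The parquet paths and column names are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ods-example").getOrCreate()

orders = spark.read.parquet("s3://warehouse/orders")
accounts = spark.read.parquet("s3://warehouse/accounts")
orders.createOrReplaceTempView("orders")
accounts.createOrReplaceTempView("accounts")

daily_revenue = spark.sql("""
    SELECT a.region, o.order_date, SUM(o.amount) AS revenue
    FROM orders o
    JOIN accounts a ON o.account_id = a.account_id
    GROUP BY a.region, o.order_date
""")
daily_revenue.show()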
Most reporting and simpler analyses can be automated to a degree. Done correctly, they can be accessed by business teams and even tinkered with in a self-service manner (segmenting and filtering so that the analyst doesn’t have to respond to requests for what is effectively the same report with a minor tweak). All that's left is to teach a man to fish (admittedly, this is the most difficult component, based on my experience).
When set up correctly, a business intelligence or querying tool can scale quite well to support a large organization and automate much of the day-to-day operational analysis. Ideally, such a tool would slot into the existing data infrastructure, exploiting an ODS or connecting to multiple data sources without subsequently moving the data again for further processing.
Imagine, now, that we’re in a world where most analyses are largely self-service or automated. Analysts and data scientists are, instead, focusing on predictive analyses. How do we take these seemingly inherently offline processes and integrate them into existing data pipelines and applications?
There are more and more tools that automate statistical and machine-learning processes; some are more black-box than others. BigML is perhaps the most popular. Additionally, there are prediction.io, AlchemyAPI, indico, RapidMiner, Yhat, Azure ML, AWS Machine Learning, etc.
Some of these tools integrate with existing data-science workflows better than others. Let’s suppose, however, that we have an aversion to another canned analysis tool and we prefer a more customizable solution.
Building an operational machine-learning platform from the ground up is no trivial task. This likely requires significant resources from engineering and analytics departments. A reasonable starting place would be to write some Python that trains and tests models, and handles model selection. This is doable with some great existing libraries, such as NumPy and scikit-learn. The daunting task is writing an API that can fetch or accept new data from various sources, score the data using the model created earlier, and finally pipe predictions back into a database for consumption or into an external application.
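A minimal sketch of that first piece, training candidate models and selecting one with scikit-learn, might look like the following; the data here is synthetic and purely illustrative.

# Minimal scikit-learn sketch: train candidate models, select by cross-validation,
# and evaluate on held-out data. The data below is synthetic and illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))           # stand-in for features pulled from the ODS
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # stand-in for the label

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

search = GridSearchCV(LogisticRegression(max_iter=1000),
                      param_grid={"C": [0.1, 1.0, 10.0]},
                      cv=5)
search.fit(X_train, y_train)
print("best C:", search.best_params_["C"])
print("held-out accuracy:", search.score(X_test, y_test))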
scikit-learn does have the notion of model persistence, which relies on Python “pickles.” This gets us close.
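For example, a fitted model can be persisted to disk and reloaded by a later scoring job without retraining; scikit-learn’s documentation points to joblib for this.

# Persist a fitted model to disk and reload it later without retraining.
# The model, training data, and file name are purely illustrative.
from joblib import dump, load
from sklearn.linear_model import LogisticRegression

model = LogisticRegression().fit([[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1])
dump(model, "lead_score_model.joblib")

# A later scoring job simply reloads the persisted model.
restored = load("lead_score_model.joblib")
print(restored.predict([[2.5]]))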
I’d assert that there’s a suite of tools that makes up the mean between these two extremes.
This mean probably provides a few base features:
1. It standardizes machine-learning models, irrespective of the language in which they were written, making them portable and a bit more collaborative.
2. Models are serialized, marshalled, or persisted, so they do not need to be re-trained for subsequent prediction batches.
3. It provides a basic API for the ingestion and application of our serialized models.
Everything else—which models are used, how models are chosen, which and how many data sources flow into the API, and how predictions are handled—would be left to the user to handle.
Everything one must know in order to describe and translate a model is captured in a well-structured format: a data dictionary, how to handle missing values, model coefficients, conditional probabilities, etc.
It’s actively maintained and updated by a community comprised of academics and industry professionals.
e.g., regression, svm, association rules, naive bayes, clustering, decision trees, random forest, neural networks, and ensembles.
A large number of pre-existing tools produce and/or consume PMML files. This means adopting PMML as a model standard would likely not disrupt the analytics workflow.
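As one hedged example, a scikit-learn model can be exported to PMML with the sklearn2pmml package (which wraps the JPMML converters and needs a Java runtime); R users have the comparable pmml package. The naive Bayes model, training data, and file name below are illustrative.

# Hedged sketch: fit a scikit-learn pipeline and export it as a PMML document
# that JPMML/Openscoring can evaluate. Assumes the sklearn2pmml package.
from sklearn.naive_bayes import GaussianNB
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline

pipeline = PMMLPipeline([("classifier", GaussianNB())])
pipeline.fit([[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1])

sklearn2pmml(pipeline, "BayesLeadScore.pmml")  # writes the XML model file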
JPMML is the best mean between extremes that I’ve seen to date.
basic example in R
basic example in Spark’s MLlib
what the PMML looks like
Our predicted value just needs to be pushed into a database or back into its source API, such that it makes its way to our ODS and is ultimately presented to end users.
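A hedged sketch of that glue step, reusing the REST endpoint and payload shape from the earlier slides (the host, lead records, and database write are illustrative placeholders):

# Hedged sketch of the glue code: fetch new rows, score them via the
# Openscoring/JPMML REST endpoint shown earlier, and collect the predictions.
import requests

new_leads = [
    {"id": "001", "arguments": {"country": "US", "budget": 7.8}},
    {"id": "002", "arguments": {"country": "DE", "budget": 3.2}},
]

scored = []
for lead in new_leads:
    resp = requests.post(
        "http://ec2_endpoint/openscoring/model/BayesLeadScore",
        json=lead,
        timeout=10,
    )
    resp.raise_for_status()
    scored.append(resp.json())  # the response body carries the predicted fields

# From here, INSERT/UPSERT `scored` into the ODS (or push it to the source API)
# so the predictions flow back to end users.
print(scored)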
While the JPMML API doesn’t scale horizontally on its own, it’s feasible to set up a parallel environment and route incoming data accordingly.
On moderate hardware, JPMML can score thousands of records in a second. The simplest solution to scale this out, for the vast majority of use cases, would be to throw more powerful hardware at the problem. For batch jobs that contain tens of millions of records or more, using an Openscoring solution for Hive, Pig, or Cascading would likely be a better choice.
In the rare occurrence that we work at a company that needs to make predictions on millions of incoming records per second, we may find that there are other tools which are better suited to meet our needs.
A framework for such a task was proposed by Nathan Marz, known as “the lambda architecture.”
There are a number of components involved:
1. input distribution, for real-time or microbatch event distribution to both speed and batch layers;
2. stream processing, to transform or make predictions against incoming data;
3. batch processing, to transform or train against historical data;
4. serving layer, for reconciliation of speed and batch results and to serve ad hoc querying.
A popular setup relies on Zookeeper for cluster management, Kafka or Flume for event/message handling, and Spark Streaming or Storm for real-time analysis. Cloudera has bundled Zookeeper, Kafka, and Spark Streaming into a single framework called Oryx.
Also, Spark Streaming can make use of MLlib for microbatch machine-learning tasks. Storm has a comparable set-up relying on Trident, which abstracts much of Storm’s low-level programming into declarative, Pig Latin-like statements, with Trident-ML adding machine-learning capabilities.
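As a rough illustration of the microbatch idea, here is a hedged sketch of scoring a pre-trained MLlib model on a Spark Streaming DStream; the socket source, port, and model path are placeholders, and a production set-up would read from Kafka instead.

# Rough sketch: score incoming records in microbatches with Spark Streaming
# and a pre-trained MLlib model. Source, port, and model path are placeholders.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.mllib.classification import LogisticRegressionModel
from pyspark.mllib.linalg import Vectors

sc = SparkContext(appName="streaming-scoring")
ssc = StreamingContext(sc, 5)  # 5-second microbatches

# A model previously trained offline and saved with MLlib's save().
model = LogisticRegressionModel.load(sc, "hdfs:///models/lead_score")

# One comma-separated feature vector per line; Kafka would replace this in practice.
lines = ssc.socketTextStream("localhost", 9999)
features = lines.map(lambda line: Vectors.dense([float(x) for x in line.split(",")]))
predictions = features.map(lambda v: model.predict(v))
predictions.pprint()

ssc.start()
ssc.awaitTermination()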