2. Operationalizing Analytics To Scale
Many companies have invested time and money into building sophisticated data pipelines that can move massive amounts of data in (near) real time. However, for the analyst or data scientist who builds models offline, integrating their analyses into these pipelines for operational purposes can pose a challenge.
In this workshop, we will discuss some key technologies and workflows companies can leverage to build end-to-end solutions for automating analytical, statistical, and machine-learning work: from collection and storage to analysis and real-time predictions.
Abstract
9. ● Introduction
● What Are We Talking About Exactly?
● The Problem at Hand
● Operationalizing Analytics
● Operationalizing Predictive Analytics
● Questions
Agenda
12. Introduction
● I work on the Internal Data team at Looker.
● Before Looker, I worked in consulting and research.
● Looker is a business intelligence tool.
14. What are we talking about?
● What do I mean when I say “operationalizing”?
● Why is this important?
16. ● Analysts are providing basic reports for the entire business.
● Analysts and Data Scientists are building offline models.
The Problem at Hand
19. The Problem With Offline Models
● Offline analyses aren’t associated with particularly quick turnaround times.
● Offline analyses aren’t particularly collaborative.
● Offline analyses aren’t particularly portable.
20. A Potential Set-up (Straw Man)
[Diagram: Data Sources → (http) → Data Stores → (query) → Analysis → Consumption]
24. ● These metrics are vanilla.
● These metrics are critical.
● The business would probably be better served if Data Scientists and Analysts were spending their time answering questions that require deep technical knowledge.
Operationalizing Analytics - The Simple Case
28. ● Build or buy a workhorse ETL tool.
● Move toward an Operational Data Store (ODS), reducing the need for postprocessing and data “mashups.”
● Emphasize self-service wherever possible.
● Analytics should slot into the existing infrastructure with minimal friction.
Operationalizing Analytics - A How To
35. ● An XML-based model-storage format.
● Created and maintained by the Data Mining Group.
● Most commonly used statistical/machine learning models are supported.
A Model Standard - PMML
40. JPMML
● JPMML is an open-source API for evaluating PMML files.
● In essence, we equip the JPMML application with our PMML file, serve it up with new data, and it provides us with predictions.
● Openscoring.io distributes various JPMML APIs and UDFs: for example, a RESTful API, Heroku, Hive, Pig, Cascading, and PostgreSQL.
● All we have to do is write some code that fetches new values, serves them up to the JPMML API, captures the predictions, then pushes them back to a database.
45. Heroku: git push heroku master
REST: curl -X PUT --data-binary @BayesLeadScore.pmml -H "Content-type: text/xml" http://ec2_endpoint/openscoring/model/BayesLeadScore
Deploy Model - PUT /model/${id}
46. cURLing or navigating to
http://heroku_endpoint/openscoring/model/BayesLeadScore
or
http://ec2_endpoint/openscoring/model/BayesLeadScore
will display our PMML model.
View Model - GET /model/${id}
47. Test Model - POST /model/${id}
newLead.json:
{
  "id": "001",
  "arguments": {
    "country": "US",
    "budget": 7.8
  }
}
Send request to JPMML API:
curl -X POST --data-binary @newLead.json -H "Content-type: application/json" http://ec2_endpoint/openscoring/model/BayesLeadScore
53. What About Truly Big Data?
● For the rare few of us who need to make real-time predictions against millions of rows per second, there’s a popular Apache suite to handle this.
*image borrowed from OryxProject
What I mean is to automate (as much as possible) the creation, dissemination, and application of analyses so that they can be used in high-volume or fast-paced, tactical decision-making. This carries with it a hard requirement for data pipelines and workflows to scale.
This seems obvious. Many people require data to inform their choices. Reducing friction in the production and dissemination of analyses benefits both consumers and producers of analytics: consumers can respond to changes more quickly, and producers free up time to do more or to focus on depth of analysis.
Their time is better spent doing more in-depth analyses. They are an information bottleneck.
The process by which people create predictive analyses is not as efficient as it could be. Typically, when new batches of data come in, analysts retrieve these data, rebuild their model and make new predictions, then they disseminate this information somehow. These people need to start thinking like engineers.
There’s not much to do about this. However, if we do some work up front, we might be able to automate more of this process.
While analysts can share R or Python scripts, it’s not immediately obvious from the code alone what is going on. To collaborate, one must first reproduce others’ analyses before contributing.
That is, they are not easily ported from R to Python to Matlab, etc.
A somewhat standard analytics pipeline that companies may have is akin to this:
Data from mobile and web applications, APIs, and public data sources is collected.
Data is stored in relational and/or nonrelational data stores.
Data is queried, transformed, and analyzed.
Decision-makers consume the data and analyses, often as a report or dashboard, which then feeds back into the pipeline as product changes, etc.
There are potential efficiency gains:
Getting data out of a store into analysts’ hands.
Presentation of reports and analyses to decision-makers.
Feedback into product development, engineering, sales, marketing, etc.
Predictive analyses are strictly offline.
This class of metric is not particularly sexy. However, they are metrics people need in order to do their day-to-day operations: “Are we on track to meet our sales targets?” “How many users saw a particular marketing campaign yesterday?” “Is supply low in a certain region?”
For most businesses, these metrics address the majority of questions people need answered—they are the metrics that keep the business humming.
Typically, we encounter situations where a few developers and analysts supply an entire organization with data and analytics. This tends to create a bottleneck where one doesn't need to exist.
[stated this problem earlier. no need to dwell on it too much.]
On the buying front, there is a litany of choices, and a lot depends on which data sources and destinations are at play. However, some solutions we commonly encounter are fivetran, alooma, bigsynx, informatica cloud, and datavirtuality.
For those who prefer to build over buy, we’re talking about custom jobs written in a scripting language (shell, Ruby, Python) with some sort of dependency/workflow management tool, like Luigi or Airflow.
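To make that concrete, here is a minimal sketch of what such a custom job might look like with Luigi; the task names, file paths, and event payload are hypothetical placeholders, not a prescription.

# Minimal Luigi sketch of a custom extract-then-load job.
# Paths and the event source below are hypothetical placeholders.
import json
import luigi


class ExtractEvents(luigi.Task):
    date = luigi.DateParameter()

    def output(self):
        return luigi.LocalTarget("data/raw/events_{}.json".format(self.date))

    def run(self):
        # In practice, pull from an API or a source database here.
        events = [{"user_id": 1, "action": "login"}]
        with self.output().open("w") as f:
            json.dump(events, f)


class LoadEvents(luigi.Task):
    date = luigi.DateParameter()

    def requires(self):
        return ExtractEvents(date=self.date)

    def output(self):
        return luigi.LocalTarget("data/loaded/events_{}.marker".format(self.date))

    def run(self):
        with self.input().open("r") as f:
            events = json.load(f)
        # Transform and load into the warehouse here (e.g., COPY into Redshift),
        # then write a marker file so Luigi knows the task completed.
        with self.output().open("w") as f:
            f.write("loaded {} events".format(len(events)))


if __name__ == "__main__":
    luigi.run()

Running python etl.py LoadEvents --date 2015-06-01 --local-scheduler would execute the extract task and then the load, and Luigi skips any task whose output already exists.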
As companies grow, scaling ETL processes may pose a few problems if they’ve built their own tool. Admittedly, moving unstructured data is relatively easy. Even as the size and complexity of data increases, transformations don’t really come into play. We can just stuff more key-value pairs into our JSON objects and stash them in a NoSQL database. For companies making use of relational data stores, however, ever-changing schemas will undoubtedly make the ETL process more difficult.
There are clear tradeoffs here: an unstructured data store may come with lower accessibility in favor of simplicity in the data movement phase. Conversely, relational databases are rigid and require work to get large amounts of complex data into the correct format. Typically, however, SQL is the querying language with which most analysts are familiar, so accessibility becomes the upside.
This will reduce the need for continual data pulls and postprocessing in some desktop tool, both of which are slow and work intensive. By storing everything in an ODS, disparate data can be joined in the database. This is likely faster than the alternative; it’s also more conducive to automation.
Redshift is a favorite at Looker. It scales quickly and cost-effectively, relative to other MPP RDBMS offerings. Additionally, we’ve seen great promise with both Spark and Impala. Spark has a leg up on a lot of its competitors: it’s new; it does most of its heavy lifting in-memory rather than reading from and writing to disk; it scales very well; it’s feature-rich, with a built-in machine-learning library; and it has easy-to-use SQL, Scala, and Python interfaces.
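As a rough sketch of what Spark’s Python/SQL interface looks like when joining disparate data directly in the store (the table locations and column names here are hypothetical):

# Rough PySpark sketch: join disparate sources in the store via Spark SQL.
# The parquet paths and column names are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ods-example").getOrCreate()

orders = spark.read.parquet("s3://warehouse/orders")
accounts = spark.read.parquet("s3://warehouse/accounts")
orders.createOrReplaceTempView("orders")
accounts.createOrReplaceTempView("accounts")

daily_revenue = spark.sql("""
    SELECT a.region, o.order_date, SUM(o.amount) AS revenue
    FROM orders o
    JOIN accounts a ON o.account_id = a.account_id
    GROUP BY a.region, o.order_date
""")
daily_revenue.show()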
Most reporting and simpler analyses can be automated to a degree. Done correctly, they can be accessed by business teams and even tinkered with in a self-service manner (segmenting and filtering so that the analyst doesn’t have to respond to requests for what is effectively the same report with a minor tweak). All that's left is to teach a man to fish (admittedly, this is the most difficult component, based on my experience).
When set up correctly, a business intelligence or querying tool can scale quite well to support a large organization and automate much of the day-to-day operational analysis. Ideally, such a tool would slot into the existing data infrastructure, exploiting an ODS or connecting to multiple data sources without subsequently moving the data again for further processing.
Imagine, now, that we’re in a world where most analyses are largely self-service or automated. Analysts and data scientists are, instead, focusing on predictive analyses. How do we take these seemingly inherently offline processes and integrate them into existing data pipelines and applications?
There are more and more tools that automate statistical and machine-learning processes; some are more black-box than others. BigML is perhaps the most popular. Additionally, there are prediction.io, AlchemyAPI, indico, RapidMiner, Yhat, Azure ML, AWS Machine Learning, etc.
Some of these tools integrate with existing data-science workflows better than others. Let’s suppose, however, that we have an aversion to another canned analysis tool and we prefer a more customizable solution.
Building an operational machine-learning platform from the ground up is no trivial task. This likely requires significant resources from engineering and analytics departments. A reasonable starting place would be to write some Python that trains and tests models, and handles model selection. This is doable with some great existing libraries, such as NumPy and scikit-learn. The daunting task is writing an API that can fetch or accept new data from various sources, score the data using the model created earlier, and finally pipe predictions back into a database for consumption or into an external application.
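A minimal sketch of that first piece, training candidate models and selecting one with scikit-learn, might look like the following; the data here is synthetic and purely illustrative.

# Minimal scikit-learn sketch: train candidate models, select by cross-validation,
# and evaluate on held-out data. The data below is synthetic and illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))           # stand-in for features pulled from the ODS
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # stand-in for the label

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

search = GridSearchCV(LogisticRegression(max_iter=1000),
                      param_grid={"C": [0.1, 1.0, 10.0]},
                      cv=5)
search.fit(X_train, y_train)
print("best C:", search.best_params_["C"])
print("held-out accuracy:", search.score(X_test, y_test))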
scikit-learn does have the notion of model persistence, which relies on Python “pickles.” This gets us close.
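For example, a fitted model can be persisted to disk and reloaded by a later scoring job without retraining; scikit-learn’s documentation points to joblib for this.

# Persist a fitted model to disk and reload it later without retraining.
# The model, training data, and file name are purely illustrative.
from joblib import dump, load
from sklearn.linear_model import LogisticRegression

model = LogisticRegression().fit([[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1])
dump(model, "lead_score_model.joblib")

# A later scoring job simply reloads the persisted model.
restored = load("lead_score_model.joblib")
print(restored.predict([[2.5]]))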
I’d assert that there’s a suite of tools that makes up the mean between these two extremes.
This mean probably provides a few base features:
1. It standardizes machine-learning models, irrespective of the language in which they were written, making them portable and a bit more collaborative.
2. Models are serialized, marshalled, or persisted, so they do not need to be re-trained for subsequent prediction batches.
3. It provides a basic API for the ingestion and application of our serialized models.
Everything else—which models are used, how models are chosen, which and how many data sources flow into the API, and how predictions are handled—would be left to the user to handle.
Everything one must know in order to describe and translate a model is captured in a well-structured format: a data dictionary, how to handle missing values, model coefficients, conditional probabilities, etc.
It’s actively maintained and updated by a community comprised of academics and industry professionals.
e.g., regression, svm, association rules, naive bayes, clustering, decision trees, random forest, neural networks, and ensembles.
A large number of pre-existing tools produce and/or consume PMML files. This means adopting PMML as a model standard would likely not disrupt the analytics workflow.
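As one hedged example, a scikit-learn model can be exported to PMML with the sklearn2pmml package (which wraps the JPMML converters and needs a Java runtime); R users have the comparable pmml package. The naive Bayes model, training data, and file name below are illustrative.

# Hedged sketch: fit a scikit-learn pipeline and export it as a PMML document
# that JPMML/Openscoring can evaluate. Assumes the sklearn2pmml package.
from sklearn.naive_bayes import GaussianNB
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline

pipeline = PMMLPipeline([("classifier", GaussianNB())])
pipeline.fit([[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1])

sklearn2pmml(pipeline, "BayesLeadScore.pmml")  # writes the XML model file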
JPMML is the best mean between extremes that I’ve seen to date.
basic example in R
basic example in Spark’s MLlib
what the PMML looks like
Our predicted value just needs to be pushed into a database or back into its source API, such that it makes its way to our ODS and is ultimately presented to end users.
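A hedged sketch of that glue step, reusing the REST endpoint and payload shape from the earlier slides (the host, lead records, and database write are illustrative placeholders):

# Hedged sketch of the glue code: fetch new rows, score them via the
# Openscoring/JPMML REST endpoint shown earlier, and collect the predictions.
import requests

new_leads = [
    {"id": "001", "arguments": {"country": "US", "budget": 7.8}},
    {"id": "002", "arguments": {"country": "DE", "budget": 3.2}},
]

scored = []
for lead in new_leads:
    resp = requests.post(
        "http://ec2_endpoint/openscoring/model/BayesLeadScore",
        json=lead,
        timeout=10,
    )
    resp.raise_for_status()
    scored.append(resp.json())  # the response body carries the predicted fields

# From here, INSERT/UPSERT `scored` into the ODS (or push it to the source API)
# so the predictions flow back to end users.
print(scored)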
While the JPMML API doesn’t scale horizontally on its own, it’s feasible to set up a parallel environment and route incoming data accordingly.
On moderate hardware, JPMML can score thousands of records in a second. The simplest solution to scale this out, for the vast majority of use cases, would be to throw more powerful hardware at the problem. For batch jobs that contain tens of millions of records or more, using an Openscoring solution for Hive, Pig, or Cascading would likely be a better choice.
In the rare occurrence that we work at a company that needs to make predictions on millions of incoming records per second, we may find that there are other tools which are better suited to meet our needs.
A framework for such a task was proposed by Nathan Marz, known as “the lambda architecture.”
There are a number of components involved:
1. input distribution, for real-time or microbatch event distribution to both speed and batch layers;
2. stream processing, to transform or make predictions against incoming data;
3. batch processing, to transform or train against historical data;
4. serving layer, for reconciliation of speed and batch results and to serve ad hoc querying.
A popular setup relies on Zookeeper for cluster management, Kafka or Flume for event/message handling, and Spark Streaming or Storm for real-time analysis. Cloudera has bundled Zookeeper, Kafka, and Spark Streaming into a single framework called Oryx.
Also, Spark Streaming can make use of MLlib for microbatch machine-learning tasks. Storm has a comparable set-up relying on Trident, which abstracts much of Storm’s low-level programming into declarative, Pig Latin-like statements, with Trident-ML adding machine-learning capabilities.
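As a rough illustration of the microbatch idea, here is a hedged sketch of scoring a pre-trained MLlib model on a Spark Streaming DStream; the socket source, port, and model path are placeholders, and a production set-up would read from Kafka instead.

# Rough sketch: score incoming records in microbatches with Spark Streaming
# and a pre-trained MLlib model. Source, port, and model path are placeholders.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.mllib.classification import LogisticRegressionModel
from pyspark.mllib.linalg import Vectors

sc = SparkContext(appName="streaming-scoring")
ssc = StreamingContext(sc, 5)  # 5-second microbatches

# A model previously trained offline and saved with MLlib's save().
model = LogisticRegressionModel.load(sc, "hdfs:///models/lead_score")

# One comma-separated feature vector per line; Kafka would replace this in practice.
lines = ssc.socketTextStream("localhost", 9999)
features = lines.map(lambda line: Vectors.dense([float(x) for x in line.split(",")]))
predictions = features.map(lambda v: model.predict(v))
predictions.pprint()

ssc.start()
ssc.awaitTermination()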