Real-time Big Data Analytics: From Deployment to Production

Revolution Confidential

R eal-Time P redic tive
A nalytic s with B ig Data
from deployment to produc tion

David M S mith @ revodavid
V P Marketing and C ommunity
R evolution Analytic s

Today we’ll dis c us s : Revolution Confidential

 What is “real-time predictive analytics with
big data?”
 Case study: UpStream Software
 Real-time big-data stack
 Five phases of deployment to production
 Just what do we mean by “Big Data” and
“Real-Time”, anyway?
 Recommendations
 Q&A

2

R eal-Time P redic tive A nalytic s with B ig Data

 Real Time
 Milliseconds? Seconds? Hours? Days?
 In production, continuously updated
 Big Data
 Size, flow rate, diversity
 Conflict with ‘real time’
 Predictive Analytics
 Rear view: description, aggregation, tabulation
 Prediction, inference, statistical modeling
3


C as e S tudy

www.upstreamsoftware.com

“Given that our data sets are already
in the terabytes and are growing
rapidly, we depend on Revolution R
Enterprise’s scalability and fast
performance — we saw about a 4x From: “How Big Data is Changing Retail
performance improvement on 50 Marketing Analytics”
million records. It works brilliantly.” http://bit.ly/upstream-webinar

John Wallace, CEO Upstream Software

UpS tream S oftware Revolution Confidential

UpStream’s Big Data Analytics engine
analyzes and optimizes marketing mix for
UpStream’s retailer clients
 Customer analytics and segmentation
 Revenue/ action attribution among channels and
marketing programs
 Customized Next Best Action per individual
prospect or customer
 Event-Triggered Marketing: responses to
consumer actions
5

B ig Data Revolution Confidential

 Demographics: consumer, product, market
 Actions: web clicks, email clicks, mobile app usage,
call center logs, social, search …
 Outcomes: impressions, touches, orders (retail, online,
mobile)

6

P redic tive A nalytic s Revolution Confidential

• Time-to-event models
UPSTREAM DATA • Exploratory data analysis CUSTOM VARIABLES
FORMAT (PMML)
• GAM survival models

• ETL • Scoring for inference
• N marketing channels • Scoring for prediction
• Behavioral variables
• 5 billion scores per day
• Promotional data per customer
• Overlay data

R eal Time Revolution Confidential

http://www.infoworld.com/d/big-data/williams-sonoma-uses-big-data-zero-in-customers-206641
November 8, 2012 8

P redic tive vs Des c riptive A nalytic s Revolution Confidential

 Retail sales are influenced by stores near the
home

 Newer customers are more sensitive to marketing

 Google keywords perform worse than you think

 Display advertising performs better than you think

 Online sales benefit more from seasonal events
than retail stores
9

R eal-time Revolution Confidential

B ig Data
P redic tive
A nalytic s
S tac k

10

Data L ayer Revolution Confidential

 Structured data
 RDBMS, NoSQL, Hbase, Impala
 Unstructured data
 Hadoop MapReduce
 Streaming data
 Web, social, sensors, operational systems
 Some descriptive analytics done here
11

A nalytic s layer Revolution Confidential

 Predictive analytics technology
 Development environment: build models
 Production environment:
 Deploy real-time scoring
 Engine for dynamic analytics
 Local data mart
 Static, periodically updated from data layer
 Improves performance

12

Integration L ayer Revolution Confidential

 Connective tissue between end-user
applications and analytics engine
 Engines for flow control, real-time scoring
 Rules engine / CEP engine
 API for Dynamic Analytics
 Brokers communication between app developers
and data scientists

13

Dec is ion L ayer Revolution Confidential

 End-user interface to analytics system
 Customers
 Operations
 Business analysts
 C-suite
 Familiar & simple interfaces
 Supports a range of end-user technologies
 Level of UI complexity dictated by need
14

R eal-Time Deployment Revolution Confidential

Five phases of deploying real-time predictive
analytics with big data to production:

1. Data Distillation
2. Model Development
3. Validation and Deployment
4. Real-time Scoring
5. Model refresh

16

P has e 1: Data Dis tillation Revolution Confidential

The data in the Data Layer isn’t yet ready for
predictive modeling. We need to:
 Extract features from unstructured text
 Topic modeling, sentiment analysis
 Combine disparate data sources
 Filter for populations of interest
 Select relevant features and outcomes for
modeling
 Export to the data mart for predictive modeling

17

E xample: Data Dis tillation in Hadoop Revolution Confidential

Log Files

Event Streams HDFS Load Map-Reduce Structured
Data
rmr

Scraped Text

Unstructured Analytics
Data Data Mart

Revolution R Enterprise for Hadoop: bit.ly/r-hadoop
18

A utomating the E xtrac tion P roc es s Revolution Confidential

 This process needs to be repeatable and
maintainable
 Create a re-usable R script:
 ASCII: rxDataStep
 SQL: rxImport, ROracle, RMySQL
 NoSQL: Rcassandra, rmongodb
 Hadoop: rmr, rhbase, Rhive
 Appliances: nza, teradataR, PL/R
 Feeds: rjson, XML, RCurl, twitteR, python
 R script runs from Analytics Layer
 Processing: in Data Layer
 Output: structured file for Data Mart
19

Data Revolution Confidential

Dis tillation
P roc es s

XD
F

20

P has e 2: Model development Revolution Confidential

 Goal: create a predictive model that is
 Powerful
 Robust
 Comprehensible
 Implementable
 Key requirements for Data Scientists:
 Flexibility
 Productivity
 Speed
 Reproducibility
21

T he Model Development C yc le Revolution Confidential

Download the White Paper
R is Hot
Variable bit.ly/r-is-hot
Transformation

Feature
Selection Model Predictive
Data Mart Sampling Estimation
Aggregation Model

Model
Model
Comparison /
Refinement
Benchmarking

22

Tec hnology c ons iderations Revolution Confidential

 Moving Big Data is slow
 So just move it once, to data mart near the development
and production environments
 Static data needed for modeling cycle
 But have a plan to refresh and update
 Data layer not optimized for predictive analytics
 But good for descriptive (data distillation)
 Revolution R Enterprise optimizations for the
analytics layer
 R; XDF file format; Parallel External Memory Algorithms

23

P redic tive Modeling A lgorithms Revolution Confidential

Algorithm Example Applications Big Data
ETL, data distillation, record/variable selection,
Data Step
variable transformation
✓✓

Descriptive Statistics Exploratory Data Analysis, Data Validation ✓✓
Tables & Cubes Reporting, contingency analysis ✓✓
Correlation / Covariance Factor Analysis, Value at Risk ✓✓
Linear regression Forecasting, Net present value estimation ✓✓
Logistic Regression Response modeling, offer selection ✓✓
Generalized Linear
Models
Capital reserve estimation, climate modeling ✓✓

K-means clustering Customer Segmentation ✓✓
Decision Trees Dynamic pricing, classification, variable importance ✓✓
Model Prediction Real-time Scoring (decisions, offers, actions) ✓✓
R CRAN packages Everything else
Parallel & distributed Simulations, By-Group analysis, ensemble models,
computing with R custom applications
✓

24

High performanc e with dis tributed c lus ters

Data Scientists / Modelers Grid computing cluster

Revolution R Enterprise
Platform LSF
Microsoft HPC Server
Microsoft Azure
Ad-hoc grids (SNOW)
Shared SMP server
Hadoop / HDFS

25

P has e 3: Model Validation and Deployment Revolution Confidential

 Scoring rules map parameters to outputs
 Parameters: information known at real time
 prospect ID, product, webpage, …
 Scores: values inferred in real time
 prices, recommendations, actions, …

 Validation: backtest with historical data
 Deployment: code running in real time

26

Validation for produc tion Revolution Confidential

Real Outcomes
Historical
Data

Scores
Scoring
Rules

 Refresh data mart
 Rebuild model, withholding a validation set
 Random sampling
 Create accuracy measure
 ROC, positive rate, false positive rate
 Measure and monitor
27

Deployment with R c ode Revolution Confidential

Outcomes
Historical

Scores
R Model R Model

Data
Data

New
Object Object

 Scoring rules captured as “R model objects”
 Move R code directly from development
environment to production server
 No recoding: Lowest cost, greatest speed &
reliability
 Most flexibility (not limited in model choice)
 Easy to validate with custom accuracy measures
28

Managing and S c aling Deployed R C ode Revolution Confidential

 Revolution R Enterprise includes RevoDeployR
 Code management & tracking
 Security & resource allocation
 Scalability for real-time demands
 Web Services API
 Scoring and dynamic analytics
 Data visualization
 custom reporting
 Ad-hoc analytics
 Interactive data apps

29

C ode C onvers ion for S c oring Revolution Confidential

 Business rules (e.g. IBM ILOG)
 Create directly with R code from model object
 Recode in other languages
 SQL
 C++, etc.
 Low latency
 Slow and costly
deployment
 Potential for errors

30

S c oring via P MML Revolution Confidential

 Some predictive models can be expressed in
the PMML standard (www.dmg.org)
 Not all, but growing all the time

Scores
R Model PMML

Data
PMML
Object pmml

 Easily generate PMML with R “pmml” package
 PMML Scoring Engine: Zementis Adapa
 Databases and appliances may support PMML
 But supported standards vary

31

P has e 4: R eal-Time S c oring Revolution Confidential

 Scoring triggered by Decision Layer, brokered by
Integration Layer
 R code
 Revolution R Enterprise Server
 In-appliance (e.g IBM PureData System for Analytics)
 SQL
 In-database
 Rules
 Compiled code
 Bespoke engines
 May be using hardware from data layer
 But not the actual data
32

R eal-Time Revolution Confidential

S c oring:
review
iLog

SQL

Near Real
Time 33

P has e 5: Model R efres h Revolution Confidential

 Deployed scoring rules no longer connected
to Data Layer or Data Mart
 Enables real-time performance
 Need to refresh using recent data

 Model refresh process
 Use data extract script
 Re-run model script
 Statistical review OR automated validation
34

S c heduled B atc h Updates Revolution Confidential

 Periodic model refresh
 Weekly
 Daily
Revolution R Enterprise  Hourly
Production Server Cluster
 Automated
Scheduler validate/deploy process
Excel
RevoDeployR Server

Web Services API

35

R e-developing the model Revolution Confidential

 Even then, the underlying structure of the
data may change:
 Important variables become non-significant
 Non-significant variables become important!
 New data sources become available
 Model accuracy drift is a warning sign
 Return to Phase 2 or Phase 1

36

Kilobytes / second

Megabytes / second
How big is
‘B ig data’?
Megabytes

Gigabytes  Terabytes

Petabytes  Exabytes

38

Seconds or less

Milliseconds
How fas t is
‘R eal time’?
Daily  Hourly

Hours  Minutes

Daily  Hourly

Weeks  Days

39

R ec ommendations Revolution Confidential

 Architect your Predictive Analytics Stack
 Best of breed vs single-vendor stack
 Diversify data sources and decision apps
 R = Predictive Analytics
Revolution R Enterprise provides:
 Analytics Layer: big data, performance
 Integration Layer: scalability, reliability
 Get Help to Get Started
 Consider an SI partner for architecture, best
practices and implementation advice
40

R es ourc es Revolution Confidential

 Revolution R Enterprise : R for Big Data
 www.revolutionanalytics.com/products
 Revolution Analytics Consulting Services
 www.revolutionanalytics.com/services
 Contact us: bit.ly/hey-revo
 Rhadoop : Connecting R and Hadoop
 bit.ly/r-hadoop

 Contact David Smith
 david@revolutionanalytics.com
 @revodavid
 blog.revolutionanalytics.com

41

T hank you. Revolution Confidential

The leading commercial provider of software and support for the popular
open source R statistics language.

www.revolutionanalytics.com 650.646.9545 Twitter: @RevolutionR

42

Real-time Big Data Analytics: From Deployment to Production

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (20)

Similar to Real-time Big Data Analytics: From Deployment to Production

Similar to Real-time Big Data Analytics: From Deployment to Production (20)

More from Revolution Analytics

More from Revolution Analytics (20)

Real-time Big Data Analytics: From Deployment to Production