David Smith
                           Revolution Analytics
                                   @revodavid




Real-Time Big Data Analytics
From Deployment to Production


                                            1
2
Buzzword
 Bingo!


           REAL TIME

           BIG DATA

   PREDICTIVE ANALYTICS
                          3
Photo: Sarah&Boston (flickr: pocheco) Creative Commons BY-SA 2.0   4
Predictive Analytics Model

Factors
   User ID
   Browser
   Time/Date / Location
   Any known information
   Previous purchases
   Friend data

Scoring Rules
   Decision Tree
   Logistic Regression
   Neural Network
   Predictive Model
   K-means clustering
   Ensemble Model

Scores
   Product of most interest
   Offer of most likely sale
   Most relevant Selection
   Prediction or link
   Forecast sale value
   Optimal Bid
             ”IO VAPOURA” by Jaya Prime flickr.com/photos/sanjayaprime/4924462993 CC-BY 2.0   5
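
The flow on this slide (factors in, scoring rule applied, score out) can be sketched with a logistic regression rule. This is a minimal illustration only: the factor names, the intercept, and the coefficients are hypothetical and are not taken from the talk.

```python
import math

# Hypothetical model parameters, for illustration only. A real model would be
# estimated from historical data during model development.
INTERCEPT = -2.0
COEFFICIENTS = {
    "previous_purchases": 0.4,
    "is_mobile_browser": -0.3,
    "friends_who_bought": 0.25,
}


def score(factors):
    """Return the predicted probability of a sale for one visitor."""
    linear = INTERCEPT
    for name, value in factors.items():
        # Factors the model has no coefficient for contribute nothing.
        linear += COEFFICIENTS.get(name, 0.0) * value
    # Logistic link: map the linear predictor to a probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-linear))


visitor = {"previous_purchases": 3, "is_mobile_browser": 1, "friends_who_bought": 2}
print(round(score(visitor), 3))
```

The same shape applies to the other scoring rules listed on the slide: only the body of `score` changes, while the factors-in, score-out interface stays the same.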
Real-time Deployment
1. Data distillation
2. Model development and
   validation
3. Model deployment
4. Real-time model scoring
5. Model refresh
                 "CLOCK" by Heiko Klingele flickr.com/photos/divdax/3458668053/ CC-BY 2.0   6
1. Data Distillation in Hadoop

Unstructured Data: Log Files, Sensor Streams, Language Text
   -> HDFS Load
   -> Map-Reduce (rmr)
   -> Structured Data
   -> Analytics Data Mart
                                                      7
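
The distillation step above turns raw, unstructured records into structured rows. The slide names rmr (an R package) for the Map-Reduce stage; the sketch below only imitates the map, group, and reduce phases in plain Python on an in-memory list. The log format and field names are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical raw log lines in the form "<timestamp> <user_id> <action>".
LOG_LINES = [
    "t001 u1 view",
    "t002 u2 view",
    "t003 u1 purchase",
    "malformed line",
    "t004 u1 view",
]


def map_line(line):
    """Map step: emit ((user_id, action), 1) for each well-formed line."""
    parts = line.split()
    if len(parts) != 3:
        return []  # drop lines that do not parse
    _, user_id, action = parts
    return [((user_id, action), 1)]


def reduce_counts(key, values):
    """Reduce step: sum the counts emitted for one key."""
    return key, sum(values)


def distill(lines):
    """Run map, group by key, reduce, then pivot to one row per user."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in map_line(line):
            grouped[key].append(value)
    rows = defaultdict(dict)
    for key, values in grouped.items():
        (user_id, action), total = reduce_counts(key, values)
        rows[user_id][action] = total
    return dict(rows)


print(distill(LOG_LINES))
```

The output is one structured row per user, which is the kind of table an analytics data mart would hold for model development.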
2. The Model Development Cycle

Structured Data -> model development cycle -> Predictive Model

Stages of the cycle:
   Feature Selection / Sampling / Aggregation
   Variable Transformation
   Model Estimation
   Model Refinement
   Model Comparison / Benchmarking

R White Paper: bit.ly/r-is-hot



                                                                                          8
3: Deployment Options
                                 Factors
 Unknown factors
   SQL / Rules Engine
   Code (C++, Java, R, Hadoop)
   PMML Engine
 Factors known in advance
   Batch Lookup Tables           Scores


                                           9
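
The two branches on the deployment slide can be contrasted in a short sketch. When every combination of factors is known in advance, scores can be computed in batch and served from a lookup table; when factors are unknown until request time, the model code has to run for each request. The segments, hours, and scoring rule below are hypothetical stand-ins, not the speaker's model.

```python
from itertools import product

SEGMENTS = ("new", "returning")  # hypothetical customer segments
HOURS = range(24)                # hour of day, 0 to 23


def model(segment, hour):
    """Hypothetical scoring rule standing in for a deployed model."""
    base = {"new": 0.10, "returning": 0.30}[segment]
    evening_lift = 0.05 if 18 <= hour <= 22 else 0.0
    return base + evening_lift


# Batch step: score every known factor combination ahead of time.
LOOKUP = {(s, h): model(s, h) for s, h in product(SEGMENTS, HOURS)}


def score_from_table(segment, hour):
    """Real-time step when factors are known in advance: a dictionary read."""
    return LOOKUP[(segment, hour)]


def score_live(segment, hour):
    """Real-time step when factors are not enumerable: run the model."""
    return model(segment, hour)


print(len(LOOKUP), score_from_table("returning", 20), score_live("new", 9))
```

The lookup table trades storage and staleness for latency: it only works while the factor space is small enough to enumerate, and it has to be rebuilt whenever the model is refreshed.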
Why did I buy that blender?
 Just browsing in the mall
 TV ad / magazine ad
 Coupon in the mail
 “Just moved” promo email
 Webstore recommendation
 Browsing catalog

                              10
UpStream: Attribution Modeling




                                 11
4. Model Scoring

     • ETL
     • Marketing channel data
     • Behavioral variables
     • Promotional data
     • Overlay data

UPSTREAM DATA FORMAT

     • Exploratory data analysis
     • Time-to-event models
     • GAM survival models

CUSTOM VARIABLES (PMML)

     • Scoring for inference
     • Scoring for prediction
     • 5 billion scores per day per retailer
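
To put the figure of 5 billion scores per day per retailer in real-time terms, the arithmetic below converts it to an average rate. It assumes scoring is spread evenly over 24 hours, which is a simplification; peak load is likely higher than this average.

```python
SCORES_PER_DAY = 5_000_000_000   # figure quoted on the slide, per retailer
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400 seconds

# Average throughput if scoring were spread evenly across the day.
average_rate = SCORES_PER_DAY / SECONDS_PER_DAY

# Time available per score if a single worker handled that whole rate.
budget_microseconds = 1_000_000 / average_rate

print(f"{average_rate:,.0f} scores per second on average")
print(f"{budget_microseconds:.2f} microseconds per score on one worker")
```

That works out to roughly 57,870 scores per second, or about 17 microseconds per score on a single worker, which suggests why scoring at this volume is usually spread across many workers or served from precomputed results.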
5. Model refresh

Factors
Scores
Actual Outcomes
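
One way to act on the loop of factors, scores, and actual outcomes is to compare recent scores with what actually happened and flag the model for refresh when accuracy degrades. The sketch below uses the Brier score (mean squared difference between predicted probability and the 0/1 outcome); the choice of metric, the baseline, and the tolerance are assumptions for illustration, not something the slide specifies.

```python
def brier_score(scores, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    if not scores or len(scores) != len(outcomes):
        raise ValueError("scores and outcomes must be non-empty and equal in length")
    return sum((s - o) ** 2 for s, o in zip(scores, outcomes)) / len(scores)


def needs_refresh(scores, outcomes, baseline, tolerance=0.05):
    """Flag a refresh when recent error exceeds the baseline by the tolerance.

    `baseline` is the Brier score measured when the model was validated;
    `tolerance` is a hypothetical threshold chosen for this example.
    """
    return brier_score(scores, outcomes) > baseline + tolerance


recent_scores = [0.9, 0.2, 0.7, 0.1]   # hypothetical model outputs
recent_outcomes = [1, 0, 0, 0]         # hypothetical actual outcomes
print(needs_refresh(recent_scores, recent_outcomes, baseline=0.10))
```

A lower Brier score is better, so the check fires only when recent performance is clearly worse than it was at validation time.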
Big Data                 Real Time
Kilobytes/Sec            Seconds
Megabytes/Sec            Milliseconds
Gigabytes, Terabytes     Minutes
Petabytes                Minutes
Exabytes                 Hours

                             14
PREDICTIVE
ANALYTICS
 BIG DATA

REAL TIME
             15
Real-Time Big Data Predictive Analytics:                                            David Smith
From Deployment to Production                                                             @revodavid




             The leading enterprise provider of software and services for Open Source R



                          Booth 618 / Office Hours Weds 1:30PM

    www.revolutionanalytics.com             +1 650 646 9545               Twitter: @RevolutionR




                                                                                                  16

Editor's Notes

  • #4 Get out your buzzword bingo cards!
  • #5 Data as the “new oil”: a valuable commodity. Big Data is crude oil: messy, hard to get at, and full of contaminants.
  • #6 Start off with stuff we know in real time.
  • #9 Model development process. Not just about the computational speed. Also about productivity of developer.
  • #12 Demographics: consumer, product, market. Actions: web clicks, email clicks, mobile app usage, call center logs, social, search … Outcomes: impressions, touches, orders (retail, online, mobile). Strategic allocation.
  • #13 Outcome is “buying” instead of “dying”
  • #17 From Revolution Analytics. We help companies deploy predictive models created in R to real-time production systems.