Real-time Big Data Analytics: From Deployment to Production


Published on

As the Big Data market has evolved, the focus has shifted from data operations (storage, access and processing of data) to data science (understanding, analyzing and forecasting from data). And as new models are developed, organizations need a process for deploying analytics from research into the production environment. In this talk, we'll describe the five stages of real-time analytics deployment:

Data distillation
Model development
Model validation and deployment
Model refresh
Real-time model scoring

We'll review the technologies supporting each stage, and how Revolution Analytics software works with the entire analytics stack to bring Big Data analytics to real-time production environments.

Real-time Big Data Analytics: From Deployment to Production

  1. 1. Revolution Confidential R eal-Time P redic tive A nalytic s with B ig Data from deployment to produc tionDavid M S mith @ revodavidV P Marketing and C ommunityR evolution Analytic s
  2. 2. Today we’ll dis c us s : Revolution Confidential What is “real-time predictive analytics with big data?”  Case study: UpStream Software Real-time big-data stack Five phases of deployment to production Just what do we mean by “Big Data” and “Real-Time”, anyway? Recommendations Q&A 2
  3. 3. R eal-Time P redic tive A nalytic s with B ig Data Revolution Confidential Real Time  Milliseconds? Seconds? Hours? Days?  In production, continuously updated Big Data  Size, flow rate, diversity  Conflict with ‘real time’ Predictive Analytics  Rear view: description, aggregation, tabulation  Prediction, inference, statistical modeling 3
  4. 4. Revolution Confidential C as e S tudy“Given that our data sets are already in the terabytes and are growing rapidly, we depend on Revolution R Enterprise’s scalability and fast performance — we saw about a 4x From: “How Big Data is Changing Retail performance improvement on 50 Marketing Analytics” million records. It works brilliantly.” Wallace, CEO Upstream Software
  5. 5. UpS tream S oftware Revolution ConfidentialUpStream’s Big Data Analytics engineanalyzes and optimizes marketing mix forUpStream’s retailer clients  Customer analytics and segmentation  Revenue/ action attribution among channels and marketing programs  Customized Next Best Action per individual prospect or customer  Event-Triggered Marketing: responses to consumer actions 5
  6. 6. B ig Data Revolution Confidential Demographics: consumer, product, market Actions: web clicks, email clicks, mobile app usage, call center logs, social, search … Outcomes: impressions, touches, orders (retail, online, mobile) 6
  7. 7. P redic tive A nalytic s Revolution Confidential • Time-to-event models UPSTREAM DATA • Exploratory data analysis CUSTOM VARIABLES FORMAT (PMML) • GAM survival models • ETL • Scoring for inference • N marketing channels • Scoring for prediction • Behavioral variables • 5 billion scores per day • Promotional data per customer • Overlay data
  8. 8. R eal Time Revolution Confidential November 8, 2012 8
  9. 9. P redic tive vs Des c riptive A nalytic s Revolution Confidential Retail sales are influenced by stores near the home Newer customers are more sensitive to marketing Google keywords perform worse than you think Display advertising performs better than you think Online sales benefit more from seasonal events than retail stores 9
  10. 10. R eal-time Revolution ConfidentialB ig DataP redic tiveA nalytic sS tac k 10
  11. 11. Data L ayer Revolution Confidential Structured data  RDBMS, NoSQL, Hbase, Impala Unstructured data  Hadoop MapReduce Streaming data  Web, social, sensors, operational systems Some descriptive analytics done here 11
  12. 12. A nalytic s layer Revolution Confidential Predictive analytics technology  Development environment: build models  Production environment:  Deploy real-time scoring  Engine for dynamic analytics Local data mart  Static, periodically updated from data layer  Improves performance 12
  13. 13. Integration L ayer Revolution Confidential Connective tissue between end-user applications and analytics engine Engines for flow control, real-time scoring  Rules engine / CEP engine API for Dynamic Analytics  Brokers communication between app developers and data scientists 13
  14. 14. Dec is ion L ayer Revolution Confidential End-user interface to analytics system  Customers  Operations  Business analysts  C-suite Familiar & simple interfaces  Supports a range of end-user technologies  Level of UI complexity dictated by need 14
  15. 15. R eal-Time Deployment Revolution ConfidentialFive phases of deploying real-time predictiveanalytics with big data to production:1. Data Distillation2. Model Development3. Validation and Deployment4. Real-time Scoring5. Model refresh 16
  16. 16. P has e 1: Data Dis tillation Revolution ConfidentialThe data in the Data Layer isn’t yet ready forpredictive modeling. We need to:  Extract features from unstructured text  Topic modeling, sentiment analysis  Combine disparate data sources  Filter for populations of interest  Select relevant features and outcomes for modeling  Export to the data mart for predictive modeling 17
  17. 17. E xample: Data Dis tillation in Hadoop Revolution Confidential Log Files Event Streams HDFS Load Map-Reduce Structured Data rmr Scraped Text Unstructured Analytics Data Data Mart Revolution R Enterprise for Hadoop: 18
  18. 18. A utomating the E xtrac tion P roc es s Revolution Confidential This process needs to be repeatable and maintainable Create a re-usable R script:  ASCII: rxDataStep  SQL: rxImport, ROracle, RMySQL  NoSQL: Rcassandra, rmongodb  Hadoop: rmr, rhbase, Rhive  Appliances: nza, teradataR, PL/R  Feeds: rjson, XML, RCurl, twitteR, python R script runs from Analytics Layer  Processing: in Data Layer  Output: structured file for Data Mart 19
  19. 19. Data Revolution ConfidentialDis tillationP roc es s XD F 20
  20. 20. P has e 2: Model development Revolution Confidential Goal: create a predictive model that is  Powerful  Robust  Comprehensible  Implementable Key requirements for Data Scientists:  Flexibility  Productivity  Speed  Reproducibility 21
  21. 21. T he Model Development C yc le Revolution Confidential Download the White Paper R is Hot Variable Transformation Feature Selection Model PredictiveData Mart Sampling Estimation Aggregation Model Model Model Comparison / Refinement Benchmarking 22
  22. 22. Tec hnology c ons iderations Revolution Confidential Moving Big Data is slow  So just move it once, to data mart near the development and production environments Static data needed for modeling cycle  But have a plan to refresh and update Data layer not optimized for predictive analytics  But good for descriptive (data distillation) Revolution R Enterprise optimizations for the analytics layer  R; XDF file format; Parallel External Memory Algorithms 23
  23. 23. P redic tive Modeling A lgorithms Revolution Confidential Algorithm Example Applications Big Data ETL, data distillation, record/variable selection, Data Step variable transformation ✓✓ Descriptive Statistics Exploratory Data Analysis, Data Validation ✓✓ Tables & Cubes Reporting, contingency analysis ✓✓Correlation / Covariance Factor Analysis, Value at Risk ✓✓ Linear regression Forecasting, Net present value estimation ✓✓ Logistic Regression Response modeling, offer selection ✓✓ Generalized Linear Models Capital reserve estimation, climate modeling ✓✓ K-means clustering Customer Segmentation ✓✓ Decision Trees Dynamic pricing, classification, variable importance ✓✓ Model Prediction Real-time Scoring (decisions, offers, actions) ✓✓ R CRAN packages Everything else Parallel & distributed Simulations, By-Group analysis, ensemble models, computing with R custom applications ✓ 24
  24. 24. High performanc e with dis tributed c lus ters Revolution Confidential Data Scientists / Modelers Grid computing cluster Revolution R Enterprise Platform LSF Microsoft HPC Server Microsoft Azure Ad-hoc grids (SNOW) Shared SMP server Hadoop / HDFS 25
  25. 25. P has e 3: Model Validation and Deployment Revolution Confidential Scoring rules map parameters to outputs  Parameters: information known at real time  prospect ID, product, webpage, …  Scores: values inferred in real time  prices, recommendations, actions, … Validation: backtest with historical data Deployment: code running in real time 26
  26. 26. Validation for produc tion Revolution Confidential Real Outcomes Historical Data Scores Scoring Rules Refresh data mart Rebuild model, withholding a validation set  Random sampling Create accuracy measure  ROC, positive rate, false positive rate Measure and monitor 27
  27. 27. Deployment with R c ode Revolution Confidential OutcomesHistorical Scores R Model R Model Data Data New Object Object Scoring rules captured as “R model objects” Move R code directly from development environment to production server  No recoding: Lowest cost, greatest speed & reliability  Most flexibility (not limited in model choice)  Easy to validate with custom accuracy measures 28
  28. 28. Managing and S c aling Deployed R C ode Revolution Confidential Revolution R Enterprise includes RevoDeployR  Code management & tracking  Security & resource allocation  Scalability for real-time demands  Web Services API Scoring and dynamic analytics  Data visualization  custom reporting  Ad-hoc analytics  Interactive data apps 29
  29. 29. C ode C onvers ion for S c oring Revolution Confidential Business rules (e.g. IBM ILOG)  Create directly with R code from model object Recode in other languages  SQL  C++, etc. Low latency Slow and costly deployment Potential for errors 30
  30. 30. S c oring via P MML Revolution Confidential Some predictive models can be expressed in the PMML standard (  Not all, but growing all the time ScoresR Model PMML Data PMML Object pmml Easily generate PMML with R “pmml” package PMML Scoring Engine: Zementis Adapa Databases and appliances may support PMML  But supported standards vary 31
  31. 31. P has e 4: R eal-Time S c oring Revolution Confidential Scoring triggered by Decision Layer, brokered by Integration Layer  R code  Revolution R Enterprise Server  In-appliance (e.g IBM PureData System for Analytics)  SQL  In-database  Rules  Compiled code  Bespoke engines May be using hardware from data layer  But not the actual data 32
  32. 32. R eal-Time Revolution ConfidentialS c oring:review iLog SQL Near Real Time 33
  33. 33. P has e 5: Model R efres h Revolution Confidential Deployed scoring rules no longer connected to Data Layer or Data Mart  Enables real-time performance  Need to refresh using recent data Model refresh process  Use data extract script  Re-run model script  Statistical review OR automated validation 34
  34. 34. S c heduled B atc h Updates Revolution Confidential  Periodic model refresh  Weekly  Daily Revolution R Enterprise  Hourly Production Server Cluster  Automated Scheduler validate/deploy process Excel RevoDeployR Server Web Services API 35
  35. 35. R e-developing the model Revolution Confidential Even then, the underlying structure of the data may change:  Important variables become non-significant  Non-significant variables become important!  New data sources become available Model accuracy drift is a warning sign Return to Phase 2 or Phase 1 36
  36. 36. Revolution Confidential Kilobytes / second Megabytes / secondHow big is‘B ig data’? MegabytesGigabytes  Terabytes Petabytes  Exabytes 38
  37. 37. Revolution Confidential Seconds or less MillisecondsHow fas t is‘R eal time’? Daily  Hourly Hours  Minutes Daily  Hourly Weeks  Days 39
  38. 38. R ec ommendations Revolution Confidential Architect your Predictive Analytics Stack  Best of breed vs single-vendor stack  Diversify data sources and decision apps R = Predictive Analytics Revolution R Enterprise provides:  Analytics Layer: big data, performance  Integration Layer: scalability, reliability Get Help to Get Started  Consider an SI partner for architecture, best practices and implementation advice 40
  39. 39. R es ourc es Revolution Confidential Revolution R Enterprise : R for Big Data  Revolution Analytics Consulting Services   Contact us: Rhadoop : Connecting R and Hadoop  Contact David Smith   @revodavid  41
  40. 40. T hank you. Revolution Confidential The leading commercial provider of software and support for the popular open source R statistics language. 650.646.9545 Twitter: @RevolutionR 42