How to Apply Machine Learning with R, H2O, Apache Spark MLlib or PMML to Real Time Streaming Analytics

"Big Data" is currently a big hype. Large amounts of historical data are stored in Hadoop or other platforms. Business Intelligence tools and statistical computing are used to draw new knowledge and to find patterns from this data, for example for promotions, cross-selling or fraud detection. The key challenge is how these findings can be integrated from historical data into new transactions in real time to make customers happy, increase revenue or prevent fraud.

"Fast Data" via stream processing is the solution to embed patterns - which were obtained from analyzing historical data - into future transactions in real-time. This session uses several real world success stories to explain the concepts behind stream processing and its relation to Hadoop and other big data platforms. The session discusses how patterns and statistical models of R, Spark MLlib and other technologies can be integrated into real-time processing using open source frameworks (such as Apache Storm, Spark or Flink) or products (such as IBM InfoSphere Streams or TIBCO StreamBase). A live demo shows the complete development lifecycle combining analytics, machine learning and stream processing.

Published in: Technology

  1. 1. HOW TO APPLY BIG DATA ANALYTICS AND MACHINE LEARNING TO REAL TIME PROCESSING Kai Wähner, kwaehner@tibco.com, @KaiWaehner, www.kai-waehner.de, LinkedIn / Xing: Please connect!
  2. 2. Digital Transformation - Physical and Digital Worlds are Merging © Copyright 2000-2016 TIBCO Software Inc.
  3. 3. Apply Big Data Analytics to Real Time Processing
  4. 4. Analyse and Act on Critical Business Moments
  5. 5. Key Take-Aways • Insights are hidden in Historical Data on Big Data Platforms • Machine Learning and Big Data Analytics find these Insights by building Analytics Models • Event Processing uses these Models (without Rebuilding) to take Action in Real Time
  6. 6. Agenda 1) Machine Learning and Big Data Analytics 2) Analysis of Historical Data 3) Real Time Processing 4) Live Demo
  7. 7. Agenda 1) Machine Learning and Big Data Analytics 2) Analysis of Historical Data 3) Real Time Processing 4) Live Demo
  8. 8. Machine Learning: Machine learning is a method of data analysis that automates analytical model building. Using algorithms that iteratively learn from data, machine learning allows computers to find hidden insights without being explicitly programmed where to look. http://www.sas.com
  9. 9. 10 Examples of Machine Learning • Spam Detection • Credit Card Fraud Detection • Digit Recognition • Speech Understanding • Face Detection • Shape Detection • Product Recommendation • Medical Diagnosis • Stock Trading • Customer Segmentation http://machinelearningmastery.com/practical-machine-learning-problems/
  10. 10. 10 Examples of Machine Learning • Spam Detection: Given email in an inbox, identify those email messages that are spam and those that are not. Having a model of this problem would allow a program to leave non-spam emails in the inbox and move spam emails to a spam folder. We should all be familiar with this example. • Credit Card Fraud Detection: Given credit card transactions for a customer in a month, identify those transactions that were made by the customer and those that were not. A program with a model of this decision could refund those transactions that were fraudulent. • Digit Recognition: Given zip codes handwritten on envelopes, identify the digit for each handwritten character. A model of this problem would allow a computer program to read and understand handwritten zip codes and sort envelopes by geographic region. • Speech Understanding: Given an utterance from a user, identify the specific request made by the user. A model of this problem would allow a program to understand and make an attempt to fulfil that request. The iPhone with Siri has this capability. • Face Detection: Given a digital photo album of many hundreds of digital photographs, identify those photos that include a given person. A model of this decision process would allow a program to organize photos by person. Some cameras and software like iPhoto have this capability. http://machinelearningmastery.com/practical-machine-learning-problems/
  11. 11. 10 Examples of Machine Learning • Product Recommendation: Given a purchase history for a customer and a large inventory of products, identify those products in which that customer will be interested and likely to purchase. A model of this decision process would allow a program to make recommendations to a customer and motivate product purchases. Amazon has this capability. Also think of Facebook and GooglePlus, which recommend users to connect with you after you sign up. • Medical Diagnosis: Given the symptoms exhibited in a patient and a database of anonymized patient records, predict whether the patient is likely to have an illness. A model of this decision problem could be used by a program to provide decision support to medical professionals. • Stock Trading: Given the current and past price movements for a stock, determine whether the stock should be bought, held or sold. A model of this decision problem could provide decision support to financial analysts. • Customer Segmentation: Given the pattern of behaviour by a user during a trial period and the past behaviours of all users, identify those users that will convert to the paid version of the product and those that will not. A model of this decision problem would allow a program to trigger customer interventions to persuade the customer to convert early or better engage in the trial. • Shape Detection: Given a user hand drawing a shape on a touch screen and a database of known shapes, determine which shape the user was trying to draw. A model of this decision would allow a program to show the platonic version of the shape the user drew to make crisp diagrams. The Instaviz iPhone app does this. http://machinelearningmastery.com/practical-machine-learning-problems/
  12. 12. Types of Machine Learning Problems (not a complete list) • Classification: Data is labelled, meaning it is assigned a class, for example spam / non-spam or fraud / non-fraud. • Regression: Data is labelled with a real value (think floating point) rather than a class label. Examples that are easy to understand are time series data like the price of a stock over time. • Clustering: Data is not labelled, but can be divided into groups based on similarity and other measures of natural structure in the data. An example would be organising pictures by faces without names. • Rule Extraction: Data is used as the basis for the extraction of propositional rules (antecedent/consequent, aka if-then). An example is the discovery of the relationship between the purchase of beer and diapers. http://machinelearningmastery.com/practical-machine-learning-problems/
  13. 13. Closed Loop for Big Data Analytics • MODEL: Develop model; deploy into Stream Processing flow • ACT: Automatically monitor real-time transactions; automatically trigger action • ANALYZE: Analyze data via Data Discovery; uncover patterns, trends, correlations
  14. 14. Analytics Maturity Model: A good Big Data Analytics platform can provide value to the organization across the full spectrum of use cases. [Figure: value to the organization, from immediate to long-term competitive advantage, plotted against analytics maturity (Measure, Diagnose, Predict, Optimize, Operationalize, Automate) for Self-service Dashboards, Predictive and Prescriptive Analytics, and Event Processing]
  15. 15. Analytics Maturity Model: A good Big Data Analytics platform can provide value to the organization across the full spectrum of use cases. [Figure: value to the organization, from immediate to long-term competitive advantage, plotted against analytics maturity (Measure, Diagnose, Predict, Optimize, Operationalize, Automate) for Self-service Dashboards, Predictive and Prescriptive Analytics, and Event Processing]
  16. 16. Analytics Maturity Model: A good Big Data Analytics platform can provide value to the organization across the full spectrum of use cases. [Figure: value to the organization, from immediate to long-term competitive advantage, plotted against analytics maturity (Measure, Diagnose, Predict, Optimize, Operationalize, Automate) for Self-service Dashboards, Predictive and Prescriptive Analytics, and Event Processing]
  17. 17. Agenda 1) Machine Learning and Big Data Analytics 2) Analysis of Historical Data 3) Real Time Processing 4) Live Demo
  18. 18. Analytical Pipeline
  19. 19. Analytics Maturity Model: A good Big Data Analytics platform can provide value to the organization across the full spectrum of use cases. [Figure: value to the organization, from immediate to long-term competitive advantage, plotted against analytics maturity (Measure, Diagnose, Predict, Optimize, Operationalize, Automate) for Self-service Dashboards, Predictive and Prescriptive Analytics, and Event Processing]
  20. 20. What is Predictive Analytics?
  21. 21. Analytical Pipeline
  22. 22. Data Acquisition
  23. 23. Analytical Pipeline
  24. 24. Data Munging / Wrangling / Mash-up
  25. 25. Data Munging - Transformations

      Transactions (input):

         cust_id  dept  sku    dollar  gift   date
      1  104      C     12003   2.40   FALSE  2016-10-17
      2  105      A     12005  62.85   FALSE  2016-10-17
      3  102      C     12007  69.23   TRUE   2016-10-17
      4  104      B     12004   9.33   FALSE  2016-10-18
      5  105      C     12010  14.16   TRUE   2016-10-18
      6  101      B     12003  90.43   FALSE  2016-10-19
      7  103      C     12005  90.97   FALSE  2016-10-19
      n  …        …     …      …       …      …

      Per-customer summary (output):

         cust_id  A      B      C      total   # orders  first_date  last_date
      1  100      21.76  23.67   0.00   45.43  2         2016-10-19  2016-10-20
      2  101       0.01  74.65   0.00   74.66  3         2016-10-19  2016-10-20
      3  102       0.00  60.92  50.29  111.21  6         2016-10-17  2016-10-20
      4  103       0.00   0.00  52.30   52.30  2         2016-10-19  2016-10-20
      5  104      31.34   9.33   2.40   43.06  4         2016-10-17  2016-10-20
      6  105      62.85   0.00  56.00  118.85  3         2016-10-17  2016-10-20
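The transformation on this slide, from one row per transaction to one row per customer, can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the function name `summarize_by_customer` and the dictionary layout are invented here, and only the seven rows visible on the slide are used, so the totals differ from the slide's summary table (which covers the full data set).

```python
# Rows copied from the slide: (cust_id, dept, sku, dollar, gift, date)
transactions = [
    (104, "C", 12003, 2.40, False, "2016-10-17"),
    (105, "A", 12005, 62.85, False, "2016-10-17"),
    (102, "C", 12007, 69.23, True, "2016-10-17"),
    (104, "B", 12004, 9.33, False, "2016-10-18"),
    (105, "C", 12010, 14.16, True, "2016-10-18"),
    (101, "B", 12003, 90.43, False, "2016-10-19"),
    (103, "C", 12005, 90.97, False, "2016-10-19"),
]


def summarize_by_customer(rows):
    """Pivot transaction rows into one summary record per customer.

    Hypothetical helper: spend per department (A, B, C), total spend,
    number of orders, and first / last purchase date.
    """
    summary = {}
    for cust_id, dept, _sku, dollar, _gift, date in rows:
        rec = summary.setdefault(
            cust_id,
            {"A": 0.0, "B": 0.0, "C": 0.0, "total": 0.0,
             "orders": 0, "first_date": date, "last_date": date},
        )
        rec[dept] += dollar
        rec["total"] += dollar
        rec["orders"] += 1
        # ISO-formatted dates compare correctly as plain strings.
        rec["first_date"] = min(rec["first_date"], date)
        rec["last_date"] = max(rec["last_date"], date)
    return summary


if __name__ == "__main__":
    for cust_id, rec in sorted(summarize_by_customer(transactions).items()):
        print(cust_id, rec)
```

In practice this step is usually done with a data-frame library or SQL; the loop above only makes the group-and-pivot logic explicit.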
  26. 26. Analytical Pipeline
  27. 27. Exploratory Data Analysis
  28. 28. Exploratory Data Analysis: Exploratory Data Analysis (EDA) is an approach/philosophy for data analysis that employs a variety of techniques (mostly graphical) to 1. maximize insight into a data set 2. uncover underlying structure 3. extract important variables 4. detect outliers and anomalies 5. test underlying assumptions 6. develop parsimonious models 7. determine optimal factor settings
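As a small illustration of item 4 (detect outliers and anomalies), the sketch below applies the common interquartile-range rule, often called Tukey's fences. It is not taken from the slides; the function name `iqr_outliers`, the multiplier `k=1.5` and the sample values are assumptions chosen for illustration, and `statistics.quantiles` requires Python 3.8 or later.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR below Q1 or above Q3."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


if __name__ == "__main__":
    # Made-up sensor readings with one obviously suspicious value.
    readings = [10, 12, 11, 13, 12, 11, 10, 95]
    print(iqr_outliers(readings))
```

A numeric rule like this complements, rather than replaces, the graphical techniques that EDA mostly relies on.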
  29. 29. Exploratory Data Analysis: “The greatest value of a picture is when it forces us to notice what we never expected to see” John W. Tukey, 1977
  30. 30. Visual Analytics - Interactive Brush-Linked
  31. 31. Analytics Maturity Model: A good Big Data Analytics platform can provide value to the organization across the full spectrum of use cases. [Figure: value to the organization, from immediate to long-term competitive advantage, plotted against analytics maturity (Measure, Diagnose, Predict, Optimize, Operationalize, Automate) for Self-service Dashboards, Predictive and Prescriptive Analytics, and Event Processing]
  32. 32. What is Predictive Analytics?
  33. 33. Analytical Pipeline
  34. 34. Which picture represents a model? A model is a simplification of the truth that helps you with decision making.
  35. 35. Model Building • Supervised Models – known, labeled responses: Regression (for example Linear Regression); Categorical (for example Random Forest) • Unsupervised Models – no labeled responses: Clustering (for example k-means clustering)
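To make the supervised regression case concrete, here is a minimal sketch of fitting a straight line by ordinary least squares in plain Python. The function name `fit_line` and the sample numbers are assumptions made for illustration; a data scientist would normally use R, Spark MLlib or a similar library instead.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


if __name__ == "__main__":
    # Invented data: labeled responses (ys) for known inputs (xs).
    xs = [1.0, 2.0, 3.0, 4.0]
    ys = [3.0, 5.0, 7.0, 9.0]
    slope, intercept = fit_line(xs, ys)
    print(f"y = {slope:.2f} * x + {intercept:.2f}")
```

The fitted slope and intercept are the "model": two numbers that can later be applied to new inputs without refitting.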
  36. 36. Model Building
  37. 37. Model Building: Employees who write longer emails earn higher salaries!
  38. 38. Model Improvement
  39. 39. Model Improvement [Figure labels: Managers, Staff]
  40. 40. Analytical Pipeline
  41. 41. Model Validation: How is the IQ of a kid related to the IQ of his / her mum?
  42. 42. What tools do Data Scientists use?
  43. 43. Data Scientists work with many Tools • SQL • Excel • Python • R Source: O’Reilly 2015 Data Science Salary Survey http://duu86o6n09pv.cloudfront.net/reports/2015-data-science-salary-survey.pdf
  44. 44. Alternatives for Data Scientists (not a complete list) [Figure: tools arranged by Open Source vs. Closed Source and Tooling vs. Source Code, including R]
  45. 45. R Language: R is widely regarded as the most popular programming language used by data scientists for modeling, and its popularity is still growing. It is developing very rapidly and has a very active community.
  46. 46. R with Revolution Analytics (now Microsoft): Open Source GPL License (including its restrictions) http://www.revolutionanalytics.com/webinars/introducing-revolution-r-open-enhanced-open-source-r-distribution-revolution-analytics
  47. 47. TERR - TIBCO’s Enterprise Runtime for R • TIBCO has rewritten R as a Commercial Compute Engine • Latest statistics scripting engine: S → S-PLUS® → R → TERR • Runs R code including CRAN packages • Engine internals rebuilt from scratch at low-level • Redesigned data objects, memory management • High performance + Big Data • TERR is licensed from TIBCO • TERR installs (free) with Spotfire Analyst / Desktop + other TIBCO products • Spotfire Server can manage all TERR / R scripts, artifacts for reuse • Standalone Developer Edition • Supported by TIBCO • No GPL license issues
  48. 48. Which R to use? http://www.forbes.com/sites/danwoods/2016/01/27/microsofts-revolution-analytics-acquisition-is-the-wrong-way-to-embrace-r/
  49. 49. Apache Spark: General data-processing framework. However, the focus is especially on analytics (at least these days). http://fortune.com/2016/09/09/cloudera-spark-mapreduce/
  50. 50. Spark MLlib: MLlib is Spark’s machine learning (ML) library. Its goal is to make practical machine learning scalable and easy. It consists of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, as well as lower-level optimization primitives and higher-level pipeline APIs. You can even combine the MLlib module with the R language.
  51. 51. Why is Spark used for Analytics?
  52. 52. Apache Spark – Focus on Analytics http://aptuz.com/blog/is-apache-spark-going-to-replace-hadoop/ http://fortune.com/2016/09/09/cloudera-spark-mapreduce/ http://www.ebaytechblog.com/2016/05/28/using-spark-to-ignite-data-analytics/ http://www.forbes.com/sites/paulmiller/2016/06/15/ibm-backs-apache-spark-for-big-data-analytics/ “[IBM’s initiatives] include: • deepening the integration between Apache Spark and existing IBM products like the Watson Health Cloud; • open sourcing IBM’s existing SystemML machine learning technology;
  53. 53. H2O: An Extensible Open Source Platform for Analytics • Best of Breed Open Source Technology • Easy-to-use WebUI and Familiar Interfaces • Data Agnostic Support for all Common Database and File Types • Massively Scalable Big Data Analysis • Real-time Data Scoring (Nanofast Scoring Engine) http://www.h2o.ai/
  54. 54. TIBCO Spotfire for Visual Data Discovery: Let the business user leverage historical data to find insights!
  55. 55. TIBCO Spotfire with R / TERR Integration: Let the business user leverage Analytic Models (created by the Data Scientist)! Example: Customer Churn with Random Forest Algorithm • Behind the ‘refresh model’ button lives a ‘random forest algorithm’ • It requires no a priori assumptions at all, it just always works • The business user doesn’t need to know what random forest is to be empowered by it • Select variables for the model
  56. 56. SaaS Machine Learning • Managed SaaS service for building ML models and generating predictions • Integrated into the corresponding cloud ecosystem • Easy to use, but limited feature set and potential latency issues if combined with external data or applications http://docs.aws.amazon.com/machine-learning/latest/dg/tutorial.html
  57. 57. PMML (Predictive Model Markup Language) • XML-based de facto standard to represent predictive analytic models • Developed by the Data Mining Group (DMG) • Easily share models between PMML compliant applications (e.g. between model creation and deployment for operations) http://www.ibm.com/developerworks/library/ba-ind-PMML1/
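To show what sharing a model via PMML can look like, the sketch below parses a tiny hand-written PMML regression model with Python's standard XML parser and scores one record. The field names, coefficients and helper functions (`load_linear_model`, `score`) are invented for illustration, and the code handles only a linear `RegressionTable` with `NumericPredictor` elements; a real deployment would rely on a PMML-compliant scoring engine.

```python
import xml.etree.ElementTree as ET

# Hypothetical PMML document; names and numbers are illustrative only.
PMML_DOC = """<PMML version="4.2" xmlns="http://www.dmg.org/PMML-4_2">
  <Header description="Illustrative linear model"/>
  <DataDictionary numberOfFields="3">
    <DataField name="temperature" optype="continuous" dataType="double"/>
    <DataField name="vibration" optype="continuous" dataType="double"/>
    <DataField name="risk" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression">
    <MiningSchema>
      <MiningField name="temperature"/>
      <MiningField name="vibration"/>
      <MiningField name="risk" usageType="target"/>
    </MiningSchema>
    <RegressionTable intercept="0.5">
      <NumericPredictor name="temperature" coefficient="0.01"/>
      <NumericPredictor name="vibration" coefficient="2.0"/>
    </RegressionTable>
  </RegressionModel>
</PMML>"""

NS = {"pmml": "http://www.dmg.org/PMML-4_2"}


def load_linear_model(pmml_text):
    """Extract intercept and coefficients from a linear RegressionTable."""
    root = ET.fromstring(pmml_text)
    table = root.find(".//pmml:RegressionTable", NS)
    intercept = float(table.get("intercept"))
    coefficients = {
        p.get("name"): float(p.get("coefficient"))
        for p in table.findall("pmml:NumericPredictor", NS)
    }
    return intercept, coefficients


def score(intercept, coefficients, record):
    """Apply the linear model to one record (a dict of field values)."""
    return intercept + sum(c * record[name] for name, c in coefficients.items())


if __name__ == "__main__":
    intercept, coefficients = load_linear_model(PMML_DOC)
    print(score(intercept, coefficients, {"temperature": 80.0, "vibration": 0.3}))
```

The point of the format is visible here: the scoring side needs only the XML file, not the tool or the data that produced the model.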
  58. 58. Agenda 1) Machine Learning and Big Data Analytics 2) Analysis of Historical Data 3) Real Time Processing 4) Live Demo
  59. 59. Analytics Maturity Model: A good Big Data Analytics platform can provide value to the organization across the full spectrum of use cases. [Figure: value to the organization, from immediate to long-term competitive advantage, plotted against analytics maturity (Measure, Diagnose, Predict, Optimize, Operationalize, Automate) for Self-service Dashboards, Predictive and Prescriptive Analytics, and Event Processing]
  60. 60. Streaming Analytics [Figure: event stream with events 1 to 9 along a time axis] Event Streams • Continuous Queries • Sliding Windows • Filter • Aggregation • Correlation • …
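The idea of a continuous query over a sliding window can be sketched without any streaming framework. The class below keeps a running average of the values seen in the last few seconds and updates it on every incoming event. The class name `SlidingWindowAverage`, the time unit and the sample events are assumptions for illustration; products and frameworks such as StreamBase, Storm, Spark or Flink provide their own windowing operators.

```python
from collections import deque


class SlidingWindowAverage:
    """Continuous query: average of the values within the last window_seconds."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()  # (timestamp, value) pairs, oldest first
        self.total = 0.0

    def update(self, timestamp, value):
        """Add one event and return the current windowed average."""
        self.events.append((timestamp, value))
        self.total += value
        # Evict events that have slid out of the window.
        while self.events[0][0] <= timestamp - self.window_seconds:
            _, old_value = self.events.popleft()
            self.total -= old_value
        return self.total / len(self.events)


if __name__ == "__main__":
    window = SlidingWindowAverage(window_seconds=3)
    for ts, temperature in [(1, 10.0), (2, 20.0), (3, 30.0), (4, 40.0), (9, 50.0)]:
        print(ts, window.update(ts, temperature))
```

Unlike a database query, the result is recomputed incrementally as each event arrives, which is what makes the query "continuous".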
  61. 61. Operational Intelligence in Action • Actions by Operations: Human decisions in real time informed by up to date information. The Challenge: Empower operations staff to see and seize key business moments • Machine-to-Machine Automation: Automated action based on models of history combined with live context and business rules. The Challenge: Create, understand, and deploy algorithms & rules that automate key business reactions
  62. 62. What is Prescriptive Analytics?
  63. 63. Alternatives for Stream Processing (not a complete list) [Figure: offerings arranged by Open Source vs. Closed Source and Product vs. Framework, including Microsoft Azure Stream Analytics]
  64. 64. What Streaming Alternative do you need? • Visual IDE (Dev, Test, Debug) • Simulation (Feed Testing, Test Generation) • Live UI (monitoring, proactive interaction) • Maturity (24/7 support, consulting) • Integration (out-of-the-box: ESB, MDM, etc.) • Library (Java, .NET, Python) • Query Language (often similar to SQL) • Scalability (horizontal and vertical, fail over) • Connectivity (technologies, markets, products) • Operators (Filter, Sort, Aggregate) [Figure: Streaming Concepts, Streaming Frameworks and Streaming Products ordered by Time to Market, from Slow to Fast]
  65. 65. Comparison of Stream Processing Frameworks and Products. Slide Deck from JavaOne 2016: http://www.kai-waehner.de/blog/2016/10/25/comparison-of-stream-processing-frameworks-and-products/
  66. 66. StreamBase: The Power of Visual Programming 1) Get ideas into market in days or weeks, not months or years 2) Unlock the power of IT and data scientists working together
  67. 67. Live Datamart • Dynamic aggregation • Live visualization • Ad-hoc continuous query • Alerts • Action
  68. 68. How to apply analytic models to real time processing without rebuilding them?
  69. 69. Streaming Analytics to operationalize insights and patterns in real time without rebuilding the models. Stream Processing integrates models from: H2O, Open Source R, TERR, Spark MLlib, MATLAB, SAS, PMML. Real Time Closed Loop: Understand – Anticipate – Act
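"Without rebuilding" means the model is trained once on historical data, exported, and then only applied to each incoming event. The sketch below illustrates that split for a logistic failure model. The JSON layout, the coefficient values and the function names (`load_model`, `failure_probability`, `score_stream`) are assumptions for illustration, not the interface of StreamBase or any of the tools named on the slide.

```python
import json
import math

# Hypothetical export produced by the offline training step.
MODEL_JSON = '{"intercept": -4.0, "weights": {"temperature": 0.03, "vibration": 2.5}}'


def load_model(text):
    """Load the exported model parameters; no training happens here."""
    return json.loads(text)


def failure_probability(model, event):
    """Logistic model: p = 1 / (1 + exp(-(intercept + sum(w_i * x_i))))."""
    z = model["intercept"] + sum(
        weight * event[name] for name, weight in model["weights"].items()
    )
    return 1.0 / (1.0 + math.exp(-z))


def score_stream(model, events, threshold=0.5):
    """Yield (event, probability) for each event at or above the threshold."""
    for event in events:
        probability = failure_probability(model, event)
        if probability >= threshold:
            yield event, probability


if __name__ == "__main__":
    model = load_model(MODEL_JSON)
    stream = [
        {"temperature": 60.0, "vibration": 0.1},
        {"temperature": 100.0, "vibration": 0.8},
    ]
    for event, probability in score_stream(model, stream):
        print("ALERT", event, round(probability, 3))
```

When the data scientist retrains the model, only the exported parameters change; the streaming logic stays the same.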
  70. 70. TIBCO StreamBase + R / TERR
  71. 71. TIBCO StreamBase + H2O
  72. 72. TIBCO StreamBase + PMML
  73. 73. Real World Application - Customer Churn
  74. 74. Agenda 1) Machine Learning and Big Data Analytics 2) Analysis of Historical Data 3) Real Time Processing 4) Live Demo
  75. 75. “An outage on one well can cost $10M per hour. We have 20-100 outages per year.” - Drilling operations VP, major oil company
  76. 76. Insight to Action – Closing the Loop: BIG DATA AT REST, FAST DATA IN MOTION
  77. 77. Live Surveillance of Equipment. Data Monitoring • Motor temperature • Motor vibration • Current • Intake pressure • Intake temperature • Flow. Pump Components • Electrical power cable • Pump • Intake • Protector • ESP motor • Pump monitoring unit
  78. 78. Predictive Analytics (Fault Management). Inputs: Voltage, Temperature, Vibration, Device history. Temporal analytic: “If vibration spike is followed by temp spike then voltage spike [within 4 hours] then flag high severity alert.”
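The temporal analytic quoted on this slide can be sketched as a small pattern matcher over a time-ordered list of spike events. This is one possible reading, stated as an assumption: the whole sequence (vibration, then temperature, then voltage) must complete within 4 hours of the vibration spike. The function name `detect_spike_sequences` and the event format are invented for illustration; a stream processing engine would express this with its own pattern operators.

```python
def detect_spike_sequences(spikes, window_hours=4.0):
    """Find vibration -> temperature -> voltage sequences within the window.

    spikes: iterable of (time_in_hours, kind), sorted by time, where kind is
    "vibration", "temperature" or "voltage".
    Returns a list of (vibration_time, temperature_time, voltage_time).
    """
    alerts = []
    open_vibrations = []  # vibration spikes waiting for a temperature spike
    open_pairs = []       # (vibration, temperature) pairs waiting for voltage
    for t, kind in spikes:
        # Drop partial matches whose vibration spike is older than the window.
        open_vibrations = [v for v in open_vibrations if t - v <= window_hours]
        open_pairs = [(v, m) for v, m in open_pairs if t - v <= window_hours]
        if kind == "vibration":
            open_vibrations.append(t)
        elif kind == "temperature":
            open_pairs.extend((v, t) for v in open_vibrations)
            open_vibrations = []
        elif kind == "voltage" and open_pairs:
            vibration_time, temperature_time = open_pairs[0]
            alerts.append((vibration_time, temperature_time, t))
            open_pairs = []  # one high severity alert per voltage spike
    return alerts


if __name__ == "__main__":
    history = [(0.0, "vibration"), (1.0, "temperature"), (3.5, "voltage")]
    for alert in detect_spike_sequences(history):
        print("HIGH SEVERITY ALERT", alert)
```

The order of events matters as much as their presence: the same three spikes in a different order, or spread over more than 4 hours, produce no alert.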
  79. 79. Predictive Maintenance [Figure: architecture combining Operational Analytics (Operations, Live UI, live monitoring, continuous query processing, alerts, manual action and escalation), Streaming Analytics (stream processing, rules, aggregate, correlate, analytics, action) on sensor data, transactions, message bus, machine data and social data, Historical Analysis (Spark, Big Data storage, cleansed data, history, data discovery, analytics, BI, data sheets, data scientists), and an Enterprise Service Bus / integration bus (ERP, MDM, DB, WMS, SOA, API, event server)] • Machine Data (Sensors, Weather Data, …) • ERP System (Transaction History, Production Volume) • Find Insights (Sensor Behaviour, Hardware Issues, …) • Take Action (Stop Machine, Send Mechanic, …)
  80. 80. Complete Big Data Architecture [Figure: architecture combining Operational Analytics (Operations, Live UI, live monitoring, continuous query processing, alerts, manual action and escalation), Streaming Analytics (stream processing, rules, aggregate, correlate, analytics, action) on sensor data, transactions, message bus, machine data and social data, Historical Analysis (Spark, Big Data storage, cleansed data, history, data discovery, analytics, BI, data sheets, data scientists), and an Enterprise Service Bus / integration bus (ERP, MDM, DB, WMS, SOA, API, event server)]
  81. 81. Leading Indicators Pump Failure
  82. 82. Create a Model 1) Find Leading Indicators 2) Backtest Rules / Models 3) Push Rules / Models to StreamBase
  83. 83. Real Time Analytics. Techniques: Trend Analysis, Combination of Rules, CUSUM Analysis, Statistical Analysis, Statistical Process Control, Machine Learning • Location Change – Variable moves up or down • Slope Change – Variable changes trend • Variance Change – Variable becomes more/less volatile • Process Threshold – Shewhart control chart • Failure Model – y (0/1) = f (X, b) + e; f = logistic regression, trees, svm, nnet, ...
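Of the techniques listed here, CUSUM analysis is easy to sketch. The function below implements the standard two-sided tabular CUSUM to detect a location change, that is, a variable that drifts up or down from its target. The function name `cusum_alarms`, the parameter names and the sample readings are assumptions for illustration, and resetting the sums after each alarm is one common convention rather than the only one.

```python
def cusum_alarms(values, target, slack, threshold):
    """Return the indices at which a two-sided CUSUM signals a shift.

    upper accumulates deviations above target + slack,
    lower accumulates deviations below target - slack;
    an alarm is raised when either sum exceeds threshold.
    """
    upper = lower = 0.0
    alarms = []
    for index, x in enumerate(values):
        upper = max(0.0, upper + (x - target - slack))
        lower = max(0.0, lower + (target - x - slack))
        if upper > threshold or lower > threshold:
            alarms.append(index)
            upper = lower = 0.0  # restart monitoring after signalling
    return alarms


if __name__ == "__main__":
    # Made-up motor temperature readings that shift upwards halfway through.
    readings = [10, 10, 10, 10, 12, 12, 12, 12]
    print(cusum_alarms(readings, target=10, slack=0.5, threshold=4))
```

Because the sums accumulate small deviations, CUSUM tends to flag a sustained shift earlier than a simple threshold on individual readings would.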
  84. 84. Put model into Action: Upon event trigger, populate Spotfire RCA template; email responsible engineer
  85. 85. StreamBase – from Big Data to Fast Data 1. Rules / models pushed from Spotfire 2. Data streams into StreamBase 3. Data evaluated in real-time 4. Spotfire RCA on trigger. Live view on streaming data; other notifications available
  86. 86. TIBCO StreamBase – TERR Adapter
  87. 87. Live View of the Situation + Proactive Actions
  88. 88. Compare Live Data with Historical Data to make Human Decision: Responsible engineer clicks URL to launch Spotfire Root Cause Analysis; diagnose issue
  89. 89. Live Demo: TIBCO Spotfire + StreamBase + TERR + Live Datamart
  90. 90. Key Take-Aways • Insights are hidden in Historical Data on Big Data Platforms • Machine Learning and Big Data Analytics find these Insights by building Analytics Models • Event Processing uses these Models (without Rebuilding) to take Action in Real Time
  91. 91. Questions? Please contact me! Kai Wähner, kwaehner@tibco.com, @KaiWaehner, www.kai-waehner.de, LinkedIn / Xing: Please connect!