Big Data Analytics on Teradata with Revolution R Enterprise (Bill Jacobs)

Revolution Analytics brings big data analytics to the Teradata Database. Presentation from Teradata Partners, October 2013, giving an overview of Revolution R Enterprise for Teradata, by Bill Jacobs, Director, Product Marketing, Revolution Analytics.

Published in: Technology
  • Enterprise readiness; performance architecture; Big Data analytics; data source integration; development tools; deployment tools.
  • Enterprise readiness: build assurance (continuous testing, custom validation); implementation tools (validation utility); technical support, documentation, training. Performance architecture: fast math libraries; better memory management; multi-core processing; distributed computing architecture. Big Data analytics: descriptive statistics; cross tabulation; statistical tests; correlation, covariance and SSCP matrices; linear regression; logistic regression; generalized linear models; decision trees; K-means clustering. Data source integration: ODBC; Teradata (high speed); text files (delimited and fixed format); SAS; SPSS; Hadoop (HDFS and HBase). Development tools: visual debugger; script editor; R snippets; object browser; solution explorer; customizable workspace; version control plug-in. Deployment tools: R objects as JSON, XML; supports Java, JavaScript, .NET; RESTful web services API; security (LDAP, SSO); built-in load balancing; asynchronous scheduling; management console; accelerators (Jaspersoft, QlikView).
  • A Revolution R Enterprise ScaleR analytic is given a data source as input. The analytic loops over the data, reading a block at a time. Blocks of data are read by a separate worker thread (Thread 0). Worker threads (Threads 1..n) process the data block from the previous iteration of the data loop and update intermediate results objects in memory. When all of the data has been processed, a master results object is created from the intermediate results objects.
  • RRE environments on workstations and servers submit execution requests to Teradata (TD) just as they do for other platforms. Steps in the execution include: (1) In response to an execution request from a user or a tool, the user’s RRE instance packages the R script and environment metadata into a request. (2) The request is transferred to Teradata via the ODBC interface and contains the R code and environment details. (3) Teradata’s Gateway Process receives the ODBC request and enqueues it for one of the Parsing Engines (PE), typically the least busy of them across the machine. (4) The PE invokes the RRE engine, which is deployed inside an Extended Stored Procedure (XSP). (5) The RRE instance inside the XSP is essentially a master process. It decomposes the run request into one or more sets of executions. (6) The XSP delivers the execution requests to PEMAs running as table operators. ScaleR PEMAs run as table operators on chunks of data provided by the Teradata database (part of the new native Table Operator functionality). PEMAs run in VAMPs as table operators and can be run iteratively. TD chunks the data via AMPs. (7) The PEMAs produce intermediate result objects, which are returned to the XSP.
  • Type ahead: the IDE recognizes an R function as you type the first few characters and shows the completed function and its parameters. Code snippets: templates for common R constructs, e.g. a for loop or an xy plot. These are written in XML, and users can add their own. Solution window: the RPE organizes R scripts and data files in folders by Solution. This facilitates but does not implement versioning. The list of installed packages and the list of loaded packages are available for inspection. Clicking on a package shows its components in the object window. The top right Object Browser window shows all of the objects available in the R environment. The bottom right object window shows the details of particular objects. Debugging tools: when running in debugging mode, the RPE supports breakpoints and stepping in and out of code, and shows the contents of variables on mouse-over. Users may step through all code available in the active Solution.
  • DeployR examples at: http://50.57.191.94/revolution/docs/examples/ (User: testuser, Password: secret)
  • Alteryx: Alteryx has always been about enabling business users to create powerful analytic applications through simple drag-and-drop functions. As our support for R has developed and we have added more functionality, it has been fervently adopted by our customers. As more and more customers like WalMart use R for predictive analytics, they are starting to come up against the limitations of open source R. To address the scalability issues, we started talking to Revolution Analytics. Revolution: Delivering Enterprise-Scale Predictive Analytics to Line of Business Analysts. We make R enterprise ready in a number of ways. The end result is a powerful, scalable way to run predictive analytics that leverages the open source community’s innovation and broad pool of experts. Furthermore, we are Enabling a Broader Audience to Harness the Power of R. R is the most widely adopted statistical language, with over 2M users. R is the standard statistical platform at universities around the world. R has a vibrant ecosystem that is constantly improving, innovating and offering new ways to use R. Alteryx enables experts to adopt these innovations, make them available to analysts by incorporating them back into the workflow, and then make those applications available to business users through the Alteryx Gallery.
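The ScaleR processing loop described in the notes above (one thread reads blocks while worker threads fold blocks into intermediate results, which are then merged into a master result) can be sketched in Python as follows. This is only an illustration of the pattern, not ScaleR code: the function names, the block size, the queue-based hand-off, and the choice of mean and variance as the statistic are all assumptions made for the example.

```python
import queue
import threading


def read_blocks(data, block_size, out_q, n_workers):
    """Reader thread ("Thread 0"): hand the data over one block at a time."""
    for start in range(0, len(data), block_size):
        out_q.put(data[start:start + block_size])
    for _ in range(n_workers):
        out_q.put(None)  # one stop marker per worker


def process_blocks(in_q, partials, lock):
    """Worker thread: fold each block into an intermediate result."""
    n, total, total_sq = 0, 0.0, 0.0
    while True:
        block = in_q.get()
        if block is None:
            break
        n += len(block)
        total += sum(block)
        total_sq += sum(x * x for x in block)
    with lock:
        partials.append((n, total, total_sq))


def chunked_mean_variance(data, block_size=4, n_workers=2):
    """Hypothetical driver: returns (n, mean, sample variance)."""
    q = queue.Queue(maxsize=2 * n_workers)  # bounded: only a few blocks in memory
    partials, lock = [], threading.Lock()
    threads = [
        threading.Thread(target=process_blocks, args=(q, partials, lock))
        for _ in range(n_workers)
    ]
    threads.append(
        threading.Thread(target=read_blocks, args=(data, block_size, q, n_workers))
    )
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Master result: combine the intermediate results from all workers.
    n = sum(p[0] for p in partials)
    total = sum(p[1] for p in partials)
    total_sq = sum(p[2] for p in partials)
    mean = total / n
    variance = (total_sq - n * mean * mean) / (n - 1)
    return n, mean, variance


if __name__ == "__main__":
    print(chunked_mean_variance(list(range(1, 11)), block_size=3))
```

The bounded queue is the design point worth noting: the reader can run ahead of the workers by only a few blocks, so memory use depends on the block size rather than on the size of the data set.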

    1. 1. 1877 Big Data Analytics on Teradata: An Introduction to Revolution R Enterprise. Bill Jacobs, Dir., Product Marketing, Revolution Analytics
    2. 2. Demystifying R • What is R? • Why is it so popular? • Is it only open source?
    3. 3. 3
    4. 4. Our view: Big Data meets Big Math = New Business Outcomes. THE PERFECT STORM: + Computing Power + Data + Pace of Business + Customer Expectations + Data Science + Computer Science + Management Science. Better Business Decisions. New Business Outcomes. Confidential to Revolution Analytics and shared with Siemens under the NDA dated 27/9/2013
    5. 5. Big Analytics Delivers Value from Big Data. The three Vs of Big Data: Volume, Variety, Velocity. Big Analytics: maximizing Value, accommodating data Volatility, while assuring Veracity of insights
    6. 6. R Open Source: Language, Community, Collaboration - Robert Gentleman & Ross Ihaka, 1993 - Version 1.0 released 2000 - 2.5 Million Global Users - Over 4,800 add-on “Packages”. Why R? R in Universities = New Talent; Emerging Modeling/Visualization; Lower Cost Alternative; Open Source = Flexible & Innovative; Access to Free Packages
    7. 7. R is Exploding in Popularity & Functionality. [Four charts: Internet Discussion (mean monthly traffic on email discussion list); Package Growth (number of R packages listed on CRAN); Web Site Popularity (number of links to main web site); Scholarly Activity (Google Scholar hits, ’05-’09 CAGR). Tools shown: R, SAS, SPSS, S-Plus, Stata.] Source: http://r4stats.com/popularity
    8. 8. R is exploding in popularity & functionality. R Usage Growth (Rexer Data Miner Survey, 2007-2013): 70% of data miners report using R; 24% use R as primary tool. Source: www.rexeranalytics.com. “I’ve been astonished by the rate at which R has been adopted. Four years ago, everyone in my economics department [at the University of Chicago] was using Stata; now, as far as I can tell, R is the standard tool, and students learn it first.” (Deputy Editor for New Products at Forbes.) “A key benefit of R is that it provides near-instant availability of new and experimental methods created by its user base, without waiting for the development/release cycle of commercial software. SAS recognizes the value of R to our customer base…” (Product Marketing Manager, SAS Institute, Inc.)
    9. 9. R Is The Most Commonly Used Primary Analytics Tool. 70% of data miners report using R; 24% use R as primary tool. Source: www.rexeranalytics.com
    10. 10. Example of advanced visualization with R: Facebook Network Graphic
    11. 11. R Community, collaboration and breadth: CRAN task views (subset of 4800+ packages). Source: http://www.maths.lancs.ac.uk/~rowlings/R/TaskViews/
    12. 12. Key Big Data Challenge: The Analytics Talent Pool
    13. 13. The Analytics Talent Pool With R: 2 Million R Users
    14. 14. R is open source and drives analytic innovation but… has some limitations for Enterprises (columns: open source R; Revolution R Enterprise; result)
        Big Data: In-memory bound; Hybrid memory & disk scalability; Operates on bigger volumes & factors
        Speed of Analysis: Single threaded; Parallel threading; Shrinks analysis time
        Enterprise Readiness: Community support; Commercial support; Delivers full service production support
        Analytic Breadth & Depth: 5000+ innovative analytic packages; Leverage open source packages plus Big Data ready packages; Supercharges R
        Commercial Viability: Risk of deployment of open source; Commercial license; Eliminate risk with open source
    15. 15. Our History & Our Future. Product: Revolution R Enterprise V1 through V6.1; V6.2 through V9; V10 through V11. Milestones: Company Founding; NA Offices (NYC, Dallas); Relocate HQ to Palo Alto; 250 Customers; 500 Customers; 1000 Customers. Timeline: 2007, 2013, 2015, 2017. Chapter 1: Capture Mindshare. Chapter 2: Mobilize with Market Focus. Chapter 3: Scalable Growth. Company Confidential – Do not distribute
    16. 16. 200+ Customer Stories: Finance & Insurance; Academic & Gov’t; Healthcare & Life Sciences; Digital Media & Retail; Manufacturing & High Tech. Revolution Confidential
    17. 17. Revolution Analytics - Overview. We are the only provider of a commercial analytics platform based on the open source R statistical computing language. Power: distributed, high performance analytical algorithms. Productivity: easier to build and deploy analytic applications. Enterprise Readiness: stable, scalable multi-platform with world-wide support. Professional services enablement. World Wide Support Teams: • Standard and Premium Programs • Technical Account Managers • Customer Success Managers. Professional Services: • Architecture planning • Systems Integration • Advanced analytic applications • Full life cycle projects
    18. 18. Customers Revolutionize their Business. Power: 4X performance; 50M records scored daily. “…we saw about a 4x performance improvement on 50 million records. It works brilliantly.” - CEO, John Wallace, DataSong. Scalability: TBs of data from 200+ data sources; tens of thousands of attributes; hundreds of millions of scores daily. “We’ve been able to scale our solution to a problem that’s so big that most companies could not address it…..” - SVP Analytics, Kevin Lyons, eXelate. Performance: 2X data; 2X attributes; no impact on performance. “We need a high-performance analytics …we can now identify opportunities for our clients that would otherwise be lost.” - Chief Analytics Officer, Leon Zemel, [x+1]
    19. 19. Revolution R Enterprise • What is Revolution R Enterprise? • How does Revolution R Enterprise work with Teradata Database?
    20. 20. Revolution R Enterprise is… the only big data big analytics platform based on open source R, the de facto statistical computing language for modern analytics • High Performance, Scalable Analytics • Portable Across Enterprise Platforms • Easier to Build & Deploy Analytics
    21. 21. How is RRE Used? Discovering Patterns with Big Data: • Customer segmentation • Market basket analysis • Social networking analysis • Fraud detection • Marketing attribution • Sentiment analysis • …and much more. Building Models Efficiently: • Credit risk • Customer churn • Propensity to buy • Market risk • Operational risk • …and much more. Flexibly Deploying Models to Consumers: • Customer lifetime value • Pricing optimization • Recommendation engines • …and much more
    22. 22. Introducing Revolution R Enterprise (RRE): The Big Data Big Analytics Platform • Big Data Big Analytics Ready – Enterprise readiness – High performance analytics – Multi-platform architecture – Data source integration – Development tools – Deployment tools. Components shown: DevelopR, ConnectR, DeployR, ScaleR, DistributedR
    23. 23. The Platform Step by Step: R Capabilities. R+CRAN, RevoR: • Open source R interpreter • UPDATED R 3.0.2 • Freely-available R algorithms • Algorithms callable by RevoR • Embeddable in R scripts • 100% Compatible with existing R scripts, functions and packages • Performance enhanced R interpreter • Based on open source R • Adds high-performance math. Available On: • Platform™ LSF™ Linux® • Microsoft® HPC Clusters • Microsoft Azure Burst • Windows® & Linux Servers • Windows & Linux Workstations • Teradata® Database • IBM® Netezza® • IBM BigInsights™ • Cloudera Hadoop® • Hortonworks Hadoop • Intel® Hadoop
    24. 24. Big Data Speed @ Scale with Revolution R Enterprise (RRE). First, we enhance and accelerate the Open Source R interpreter. Layers shown: In-Hadoop Execution; In-Database Execution; Parallelized User Code; Parallelized Algorithms; Multi-Core Processing; Multi-Threaded Execution; Memory Management; Fast Math Libraries
    25. 25. Open Source R Performance: Multi-threaded Math. Customers report 5-50x performance improvements compared to Open Source R, without changing any code.
        Computation (4-core laptop)          Open Source R   Revolution R   Speedup
        Linear Algebra [1]
          Matrix Multiply                    176 sec         9.3 sec        18x
          Cholesky Factorization             25.5 sec        1.3 sec        19x
          Linear Discriminant Analysis       189 sec         74 sec         3x
        General R Benchmarks [2]
          R Benchmarks (Matrix Functions)    22 sec          3.5 sec        5x
          R Benchmarks (Program Control)     5.6 sec         5.4 sec        Not appreciable
        [1] http://www.revolutionanalytics.com/why-revolution-r/benchmarks.php
        [2] http://r.research.att.com/benchmarks/
    26. 26. The Platform Step by Step: Parallelization & Data Sourcing. ConnectR: • High-speed & direct connectors. Available for: • High-performance XDF • SAS, SPSS, delimited & fixed format text data files • Hadoop HDFS (text & XDF) • Teradata Database & Aster • EDWs and ADWs • ODBC. ScaleR: • Ready-to-Use high-performance big data big analytics • Fully-parallelized analytics • Data prep & data distillation • Descriptive statistics & statistical tests • Correlation & covariance matrices • Predictive Models – linear, logistic, GLM • Machine learning • Monte Carlo simulation • NEW Tools for distributing customized algorithms across nodes. DistributedR: • Distributed computing framework • Delivers portability across platforms. Available on: • Windows Servers • Red Hat and NEW SuSE Linux Servers • IBM Platform LSF Linux • Microsoft HPC Clusters • Microsoft Azure Burst • NEW Teradata Database • NEW Cloudera Hadoop • NEW Hortonworks Hadoop
    27. 27. Big Data Speed @ Scale with Revolution R Enterprise (RRE). Second, we built a platform for hosting R with Big Data on a variety of massively parallel platforms. Layers shown: In-Hadoop Execution; In-Database Execution; Parallelized User Code; Parallelized Algorithms; Multi-Core Processing; Multi-Threaded Execution; Memory Management; Fast Math Libraries
    28. 28. Revolution R Enterprise: Powering Next Generation Analytics. COMBINE INTERMEDIATE RESULTS
    29. 29. Speed comparison*: Logistic Regression
                          SAS HPA          Revolution R
        Rows of data      1 billion        1 billion
        Parameters        “just a few”     7 (Double)
        Time              80 seconds       44 seconds (45%)
        Data location     In memory        On disk
        Nodes             32               5 (1/6th)
        Cores             384              20 (5%)
        RAM               1,536 GB         80 GB (5%)
        Revolution R is faster on the same amount of data, despite using approximately a 20th as many cores, a 20th as much RAM, a 6th as many nodes, and not pre-loading data into RAM. Revolution R Enterprise Delivers Performance at 2% of the Cost. *As published by SAS in HPC Wire, April 21, 2011
    30. 30. Analytics Layer: High Performance Big Data Analytics with ScaleR. R Data Step; Descriptive Statistics; Statistical Tests; Sampling; Predictive Modeling; Data Visualization; Machine Learning; Simulation
    31. 31. ScaleR: Fast Parallel External Memory Algorithms. Data Prep, Distillation & Descriptive Analytics. R Data Step: • Data import – Delimited, Fixed, SAS, SPSS, ODBC • Variable creation & transformation • Recode variables • Factor variables • Missing value handling • Sort • Merge • Split • Aggregate by category (means, sums) • Use any of the functionality of the R language to transform and clean data row by row! Descriptive Statistics: • Min / Max • Mean • Median (approx.) • Quantiles (approx.) • Standard Deviation • Variance • Correlation • Covariance • Sum of Squares (cross product matrix for set variables) • Pairwise Cross tabs • Risk Ratio & Odds Ratio • Cross-Tabulation of Data (standard tables & long form) • Marginal Summaries of Cross Tabulations. Statistical Tests: • Chi Square Test • t-Test • F-Test • Plus 100’s of other tests available in R! Sampling: • Subsample (observations & variables) • Random Sampling • High quality, fast, parallel random number generators
    32. 32. ScaleR: Fast Parallel External Memory Algorithms. Statistical Modeling. Predictive Models: • Covariance, Correlation, Sums of Squares (cross product matrix for set variables) matrices • Multiple Linear Regression • Generalized Linear Models (GLM): all exponential family distributions (binomial, Gaussian, inverse Gaussian, Poisson, Tweedie); standard link functions including cauchit, identity, log, logit, probit; user defined distributions & link functions • Logistic Regression • Classification & Regression Trees • Decision Forests • Predictions/scoring for models • Residuals for all models. Data Visualization: • Histogram • Line Plot • Lorenz Curve • ROC Curves (actual data and predicted values) • Plus numerous tools in R and ScaleR to generate big data visualizations. Machine Learning: Cluster Analysis: • K-Means. Classification: • Decision Trees • Decision Forests. Simulation: • High quality, fast, parallel random number generators • Use the rich functionality of R for simulations
    33. 33. The Power of Revolution R Enterprise. Axes: Performance & Scalability; Value. ScaleR: Moves computation to data; Leverage CRAN; Labor saving power. DistributedR: Maximizes computation; Powerful divide & conquer; Effective memory utilization. RevoR: 3-50X faster. Open Source: Leverage latest innovation
    34. 34. Why Teradata And Revolution R Enterprise? • Teradata User Demand • Data Movement Penalty Growing • New Analytics Requiring MPP Approach • R Popularity • Open Source Limitations • Arrival of Teradata v14.10
    35. 35. Revolution Analytics coupled with the Teradata Unified Data Architecture accelerates big data analytics using the widely-accepted R language. Available Today (Teradata Version 14.0 + Revolution R Enterprise 6.2, High-Speed TPT Connector): • Scalable R analytics on servers connected to Teradata • High speed, parallel data transfer, 5x faster than RODBC • Integrated parallel analytics solution. Upcoming Capabilities, 4Q13 (Teradata Version 14.10 + Revolution R Enterprise V7): • Parallel R in-database for big data analytics on Teradata • R programmers can immediately build parallel R models completely in R • Revolution parallel in-database algorithms exclusively available on Teradata
    36. 36. Introducing Revolution R Enterprise Version 7 on Teradata Database • New Teradata Table Operators • New Parallelized Algorithms • In-Database Execution of Parallelized Algorithms • Executes R Scripts From R Workstations or Servers • Provides Orders of Magnitude Performance Gains • Supports Multiple Platforms in UDA • Available Late 2013
    37. 37. Revolution Analytics in the UDA. UNIFIED DATA ARCHITECTURE With Revolution R Enterprise. [Diagram label: RODBC.] Seamless use of R analytics across the Teradata UDA
    38. 38. Transparent Parallelization of Analytical, Predictive Modeling and Machine Learning in Teradata: HOW DOES IT WORK?
    39. 39. Understanding R’s Compute Workload. Computational Workload Breakdown: R Script (compute burden from script or command): < 1%. Algorithms (compute burden from algorithmic computations): 99.xxx%
    40. 40. ScaleR PEMAs: High Performance Analytical Algorithms • User’s Script Calls ScaleR PEMA – No Unique Code or Setup for Parallelism – ScaleR Algorithms are “just another R package” – Using PEMAs is Transparent, Automatic, Fast and Scales Linearly • PEMAs Transparently Parallelize Algorithm Execution – Parallelized Versions of Statistics, Predictive Modeling and Machine Learning Algorithms – PEMAs Transparently Distribute Computations Across AMPs – Results are Consolidated Into A Single Result Set – Provides Write Once Deploy Anywhere (WODA) Portability
    41. 41. Transparent Distributed Computing with RRE ScaleR. In Revolution R Enterprise: • Script Calls ScaleR PEMA • Algorithm Executes • Algorithm Returns to Script • Script Continues Execution. Transparent to the Script: • Algorithm Starts A Master Process • Master Identifies Environment (Threading? Cores? Chips? Distributed Nodes?) • Master Initializes Algorithm • Prepares Instructions for Nodes • Master Executes Table Operators In Each VAMP (VAMPs process each data segment; Table Operator runs in each VAMP; Table Operator returns Intermediate Result Object (IRO) to master process) • Master Process Combines IROs • Returns Consolidated Answer to Script
    42. 42. ScaleR PEMAs on Teradata: Transparent Distribution of R Analytics • For Each Call to a ScaleR Algorithm: – One Request – Many Subtasks – One Answer. Diagram labels: Desktops & Servers (Revolution R Enterprise); Corporate Applications (Revolution R Enterprise); ODBC; Teradata Database + Revolution R Enterprise; Extended Stored Procedure; Table Operators; AMPs
    43. 43. Revolution R Enterprise Ecosystem: Power of Integration. Categories shown: SI / Service; Deployment / Consumption; MSP / DSP; Advanced Analytics; ETL; Corios; Data / Infrastructure
    44. 44. The Platform Step by Step: Tools & Deployment. DevelopR, DeployR: • Freely-available R algorithms • Callable by RevoR • Embeddable in R scripts • Web services software development kit • Integrates R into application infrastructures. Available on: • Can be called by RevoR • Can be run single-node using RevoR • Analyze large data using RDataStep package • Run on multiple nodes using rxEXEC package. DevelopR, DeployR Capabilities: • Invokes R Scripts from web services calls • RESTful interface for easy integration • Works with leading desktop & BI tools
    45. 45. DevelopR Integrated Development Environment: Script with type ahead and code snippets; Sophisticated debugging with breakpoints, variable values etc.; Solutions window for organizing code and data; Objects loaded in the R Environment; Packages installed and loaded; Object details. http://www.revolutionanalytics.com/demos/revolution-productivity-environment/demo.htm
    46. 46. DeployR. Diagram labels: R / Statistical Modeling Expert; Deployment Expert; Data Analysis; Business Intelligence; Mobile Web Apps; Cloud / SaaS. • Seamless: Bring the power of R to any web enabled application • Simple: Leverage common APIs including JS, Java, .NET • Scalable: Robustly scale user and compute workloads • Secure: Manage enterprise security with LDAP & SSO
    47. 47. Create Custom, On-Demand Analytical Apps. Some Examples: On-demand sales forecasting; Leveraging the power of R from Microsoft tools; Real-time social media sentiment analysis
    48. 48. Alteryx and Revolution Analytics: Making Predictive Analytics More Accessible and Scalable. Empowering Analysts with Easy-to-Use Predictive Tools combined with the Leading R Platform; Delivering Enterprise-Scale Predictive Analytics to Line of Business Analysts; Enabling a Broader Audience to Harness the Universe of R
    49. 49. Summary. • R is Hot. – Most Broadly Used Analytical Language – Its Popularity Addresses Critical Talent Gap – Vast Functionality Via CRAN – R Needs a Platform For Big Data Big Analytics • Revolution Provides Enterprise-Capable Platforms for R. – High Performance – Scalable via Transparent Distributed Execution – Portable: Write Once Deploy Anywhere (WODA) – Commercial Support & Services Cut Project Risks • Teradata + Revolution Provide a Robust Solution – Teradata provides stable, high-performance big data environment – Revolution provides speed, scale, portability and stability for the enterprise
    50. 50. Next steps? The leading commercial provider of software and support for the popular open source R statistics language. www.revolutionanalytics.com, 650.646.9545, Twitter: @RevolutionR
    51. 51. Thank You.
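The "one request, many subtasks, one answer" flow on slides 41 and 42 can be illustrated with a small Python sketch. It is not Revolution or Teradata code: `table_operator`, `combine`, and `fit_line` are hypothetical names, threads stand in for the per-VAMP table operators, and a straight-line least-squares fit is assumed as the algorithm because its intermediate result objects (a handful of sums) combine by simple addition.

```python
from concurrent.futures import ThreadPoolExecutor


def table_operator(rows):
    """Stand-in for one per-partition subtask: summarise local (x, y) rows
    into an intermediate result object (IRO)."""
    n = len(rows)
    sx = sum(x for x, _ in rows)
    sy = sum(y for _, y in rows)
    sxx = sum(x * x for x, _ in rows)
    sxy = sum(x * y for x, y in rows)
    return (n, sx, sy, sxx, sxy)


def combine(iros):
    """Master step: add the IROs element by element."""
    return tuple(sum(values) for values in zip(*iros))


def fit_line(partitions):
    """One request in, one answer out: returns (intercept, slope).
    Assumes at least one partition and at least two distinct x values."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        iros = list(pool.map(table_operator, partitions))
    n, sx, sy, sxx, sxy = combine(iros)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope


if __name__ == "__main__":
    # Three partitions of points that lie on y = 1 + 2x.
    parts = [[(0, 1), (1, 3)], [(2, 5)], [(3, 7), (4, 9)]]
    print(fit_line(parts))  # prints (1.0, 2.0)
```

Only the small tuples of sums travel back to the master, so the data itself never leaves its partition. That is the property the slides describe as moving the computation to the data.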
