The Modern Data Architecture for Predictive Analytics with Hortonworks and Revolution Analytics
 

Hortonworks and Revolution Analytics have teamed up to bring the predictive analytics power of R to Hortonworks Data Platform.

Hadoop, a disruptive data processing framework, has had a large impact on today's data ecosystems. Enabling business users to carry their existing skills over to Hadoop is necessary to encourage adoption and to let businesses get value from their Hadoop investment quickly. R, a prolific and rapidly growing data analysis language, now has a place in the Hadoop ecosystem.

This presentation covers:
- Trends and business drivers for Hadoop
- How Hortonworks and Revolution Analytics play a role in the modern data architecture
- How you can run R natively in Hortonworks Data Platform to simply move your R-powered analytics to Hadoop

Presentation replay at:
http://www.revolutionanalytics.com/news-events/free-webinars/2013/modern-data-architecture-revolution-hortonworks/

Usage Rights

© All Rights Reserved

  • Remember that CRAN is a new term to IT professionals and to anyone who hasn’t learned much about R. Spend some time on it. The acronym stands for Comprehensive R Archive Network: a single repository of R algorithms, test data, and evaluations, used by nearly all R programmers.

The Modern Data Architecture for Predictive Analytics with Hortonworks and Revolution Analytics: Presentation Transcript

  • © Hortonworks Inc. 2013. Modern Data Architecture for Predictive Analytics
    - David Smith, VP Marketing and Community, Revolution Analytics
    - John Kreisa, VP Strategic Marketing, Hortonworks
  • Your Presenters
    - David Smith (@revodavid): VP Marketing and Community at Revolution Analytics; data scientist, blogger, and co-author of An Introduction to R
    - John Kreisa (@marked_man): VP Strategic Marketing at Hortonworks; over 20 years in data management as a developer and a marketer; avid camper
  • Today’s Topics
    - Introduction
    - Drivers for the Modern Data Architecture (MDA)
    - Apache Hadoop in the MDA
    - R’s role in the MDA
    - Q&A
  • Poll #1: What stage are you at in looking into Hadoop?
    - Research
    - Evaluation
    - Trial
    - Haven’t started research
  • Existing Data Architecture (diagram)
    - Applications: Business Analytics, Custom Applications, Packaged Applications
    - Data system: repositories (RDBMS, EDW, MPP)
    - Sources: Existing Sources (CRM, ERP, Clickstream, Logs)
    - Operational tools: manage and monitor
    - Dev and data tools: build and test
  • Existing Data Architecture (same diagram, with data-growth figures; source: IDC)
    - 2.8 ZB in 2012
    - 40 ZB by 2020
    - 85% from new data types
    - 15x machine data by 2020
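
To put the growth figures above in perspective, the short sketch below works out the compound annual growth rate they imply. It uses only the two volumes quoted on the slide (2.8 ZB in 2012 and 40 ZB by 2020); the function name is illustrative, and the assumption of smooth year-over-year compounding is ours, not IDC's.

```python
def implied_annual_growth(start, end, years):
    """Compound annual growth rate implied by growing from start to end."""
    return (end / start) ** (1.0 / years) - 1.0

# Figures quoted on the slide (attributed there to IDC), in zettabytes.
ZB_2012 = 2.8
ZB_2020 = 40.0

# Assumption: growth compounds evenly over the 8 years from 2012 to 2020.
rate = implied_annual_growth(ZB_2012, ZB_2020, 2020 - 2012)
print(f"Implied growth: {rate:.1%} per year")
```

Under that assumption the two figures correspond to data volume growing by roughly 39% per year.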
  • Modern Data Architecture Enabled (diagram)
    - Applications: Business Analytics, Custom Applications, Packaged Applications
    - Data system: repositories (RDBMS, EDW, MPP)
    - Sources: Existing Sources (CRM, ERP, Clickstream, Logs); Emerging Sources (Sensor, Sentiment, Geo, Unstructured)
    - Operational tools: manage and monitor
    - Dev and data tools: build and test
  • Hadoop Powers Modern Data Architecture
    - Apache Hadoop is an open source project governed by the Apache Software Foundation (ASF) that allows you to gain insight from massive amounts of structured and unstructured data quickly and without significant investment.
    - Hadoop clusters provide scale-out storage and distributed data processing on commodity hardware.
    - Diagram: a Hadoop cluster of compute-and-storage nodes
  • Drivers for Hadoop Adoption
    - Driving efficiency (Modern Data Architecture): Hadoop has a central role in next-generation data architectures while integrating with existing data systems.
    - Driving opportunity (Business Applications): use Hadoop to extract insights that enable new customer value and competitive edge.
    - Big data sets, existing: traditional, server log, clickstream
    - Big data sets, emerging: sentiment/social, machine/sensor, geo-locations
  • Opportunity in types of data
    1. Sentiment: understand how your customers feel about your brand and products, right now
    2. Clickstream: capture and analyze website visitors’ data trails and optimize your website
    3. Sensor/Machine: discover patterns in data streaming automatically from remote sensors and machines
    4. Geographic: analyze location-based data to manage operations where they occur
    5. Server Logs: research logs to diagnose process failures and prevent security breaches
    6. Unstructured (text, video, pictures, etc.): understand patterns in files across millions of web pages, emails, and documents
  • Efficiency in the Modern Data Architecture
    - Diagram: Applications (Business Analytics, Custom Applications, Packaged Applications); data system repositories (RDBMS, EDW, MPP); Existing Sources (CRM, ERP, Clickstream, Logs); Emerging Sources (Sensor, Sentiment, Geo, Unstructured)
    - Drive efficiency via the modern data architecture
    - Store data once and access it in many ways
    - Often referred to as a data lake or data repository
    - Infrastructure platform driven
    - IT-oriented, TCO based
  • Engineered for Interoperability (diagram)
    - Layers: applications, data system, sources, infrastructure, operational tools, dev and data tools
    - Components shown: RDBMS, EDW, MPP, HANA, BusinessObjects BI
    - Sources: Existing Sources (CRM, ERP, Clickstream, Logs); Emerging Sources (Sensor, Sentiment, Geo, Unstructured)
  • Requirements for Hadoop Adoption (Requirements for Hadoop’s Role in the Modern Data Architecture)
    - Integrated: interoperable with existing data center investments
    - Skills: leverage your existing skills in development, operations, and analytics
    - Key Services: platform, operational, and data services essential for the enterprise
  • Revolution R Enterprise Architecture (diagram)
    - Applications: Business Analytics, Custom Applications, Packaged Applications
    - Data system: repositories (RDBMS, EDW, MPP)
    - Sources: Existing Sources (CRM, ERP, Clickstream, Logs); Emerging Sources (Sensor, Sentiment, Geo, Unstructured)
    - Operational tools: manage and monitor
    - Dev and data tools: build and test
    - Legend: marked components = Revolution R Enterprise
  • Today’s Topics
    - Introduction
    - Drivers for the Modern Data Architecture (MDA)
    - Apache Hadoop’s role in the MDA
    - R’s role in the MDA
    - Q&A
  • Poll #2: Which of the following best describes your use of R and Hadoop?
    - We have R + Hadoop in production
    - We are testing R + Hadoop
    - We have started to investigate but nothing is implemented
    - No current plans
  • Revolution Confidential. What is the Open Source R Project?
    - The R Language: an object-oriented language for stats, math, and data science, with comprehensive data visualization and statistical modeling capabilities
    - The R Community: 2M+ users with the skill to tackle big data statistical and numerical analysis and machine learning projects; new graduates with data skills learn R
    - The R Ecosystem: 5000+ freely available algorithms in CRAN; specialized methods for finance, economics, genomics, linguistics, and every data-driven domain
  • R is open source and drives analytic innovation but has some limitations for enterprises
    - Bigger data sizes: memory bound (limitation); big data (enterprise requirement)
    - Speed of analysis: single threaded (limitation); scale out, parallel processing, high speed (enterprise requirement)
    - Production support: community support (limitation); commercial production support (enterprise requirement)
    - Innovation and scale: innovative, 5000+ packages, exponential growth; combines with open source R packages where needed
  • Revolution R Enterprise
    - Revolution R Enterprise is the only commercial big data analytics platform based on the open source R statistical computing language.
    - Enterprise-Ready; Cross-Platform; Big Data Analytics; High Performance Analytics; Easier Build & Deploy
  • Modern Data Architecture: Extract and Analyze (diagram)
    - Uses: ad-hoc data distillation; exploratory data analysis and data visualization; model development
    - Source data: sensor logs, clickstream, flat files, unstructured, sentiment, customer, inventory
    - Data sources: databases, JMS queues, files, CSV
    - Load: Sqoop, Flume, WebHDFS, NFS, Sqoop/Hive, HTTP, stream, custom
    - Platform: Ambari, MapReduce, YARN, HDFS, REST
    - Data refinement: Hive, Pig
    - Structure: HCatalog (metadata services)
    - Interactive: Hive Server2
    - Analytical: rHadoop
    - Consumers: query, visualization, reporting, and analytical tools and apps
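
The diagram above lists WebHDFS and REST among the ways data moves in and out of the cluster. As a rough illustration, the sketch below builds WebHDFS request URLs of the documented form `http://<host>:<port>/webhdfs/v1/<path>?op=...`. The host name, file paths, and user name are hypothetical, and port 50070 is only an assumed NameNode HTTP port that varies by Hadoop version and configuration. The sketch constructs URLs only and does not contact a cluster.

```python
from urllib.parse import quote, urlencode

def webhdfs_url(host, path, op, port=50070, params=None):
    """Build a WebHDFS REST URL for an HDFS path and operation."""
    query = {"op": op}
    query.update(params or {})
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{urlencode(query)}"

# Hypothetical NameNode host and HDFS paths, for illustration only.
list_url = webhdfs_url("namenode.example.com", "/data/clickstream", "LISTSTATUS")
open_url = webhdfs_url(
    "namenode.example.com",
    "/data/clickstream/part-00000",
    "OPEN",
    params={"user.name": "analyst"},  # assumed user for simple (non-Kerberos) auth
)
print(list_url)
print(open_url)
```

An analytical tool could issue an HTTP GET against such URLs to list a directory (`LISTSTATUS`) or read a file (`OPEN`) without a native Hadoop client.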
  • The Data Scientist’s Big Data Toolkit
    - Statistical Tests
    - Machine Learning
    - Simulation
    - Descriptive Statistics
    - Data Visualization
    - R Data Step
    - Predictive Models
    - Sampling
  • Parallel External-Memory Algorithms (diagram: multiple CPUs within one SMP server)
  • Parallel External-Memory Algorithms (diagram: multiple Hadoop nodes within one Hadoop cluster)
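
The slides do not describe how Revolution's parallel external-memory algorithms are implemented. The sketch below shows only the general idea behind the term, under our own assumptions: each chunk of data is reduced independently to a small summary, and the summaries are then combined, so the full data set never has to fit in memory and the chunks can be handled by separate CPUs or Hadoop nodes. The example fits a straight line by least squares; all names and data are illustrative.

```python
from functools import reduce

def chunk_stats(chunk):
    """Reduce one chunk of (x, y) pairs to its sufficient statistics."""
    n = len(chunk)
    sx = sum(x for x, _ in chunk)
    sy = sum(y for _, y in chunk)
    sxx = sum(x * x for x, _ in chunk)
    sxy = sum(x * y for x, y in chunk)
    return (n, sx, sy, sxx, sxy)

def combine(a, b):
    """Merge two partial summaries; order does not matter."""
    return tuple(p + q for p, q in zip(a, b))

def fit_line(stats):
    """Least-squares intercept and slope from the combined summary."""
    n, sx, sy, sxx, sxy = stats
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope

# Illustrative data split into chunks, as if stored on different nodes.
chunks = [
    [(0.0, 1.0), (1.0, 3.0)],
    [(2.0, 5.0), (3.0, 7.0)],
    [(4.0, 9.0)],
]

# Each chunk_stats call is independent, so it could run on its own worker.
partials = [chunk_stats(c) for c in chunks]
intercept, slope = fit_line(reduce(combine, partials))
print(intercept, slope)
```

Because `combine` is a simple element-wise sum, the same code gives the same fit whether the data arrives as one chunk or many.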
  • Modern Data Architecture with RRE7: In-Hadoop Predictive Analytics (diagram)
    - Uses: production data distillation (e.g. semantic analysis); production model processing and re-estimation; production model scoring
    - Source data: sensor logs, clickstream, flat files, unstructured, sentiment, customer, inventory
    - Data sources: databases, JMS queues, files, CSV
    - Load: Sqoop, Flume, WebHDFS, NFS, Sqoop/Hive, HTTP, stream, custom
    - Platform: Ambari, MapReduce, YARN, HDFS, REST
    - Data refinement: Hive, Pig, distilled data files
    - Structure: HCatalog (metadata services)
    - Interactive: Hive Server2
    - Analytical: Revolution R Enterprise
    - Consumers: query, visualization, reporting, and analytical tools and apps
  • Hadoop As An R Engine
    - Use Revolution R Enterprise PEMAs in Hadoop
    - No need to change existing R code
    - Simple R programming
    - No need to “Think In MapReduce”
    - Eliminate data movement to slash latencies
    - Use Hadoop nodes as parallel R computation engines
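
For readers unfamiliar with what “thinking in MapReduce” involves, the sketch below imitates the programming model in plain Python: the analyst must recast even a simple per-group average as a mapper that emits key-value pairs, a shuffle that groups them by key, and a reducer that aggregates each group. This is a toy, in-process imitation with hypothetical data. It is not the Hadoop API and not Revolution R Enterprise code.

```python
from collections import defaultdict

def mapper(record):
    """Emit (key, value) pairs: here, page -> (seconds on page, 1 visit)."""
    page, seconds = record
    yield page, (seconds, 1)

def reducer(key, values):
    """Aggregate all values for one key into a mean time on page."""
    total = sum(seconds for seconds, _ in values)
    count = sum(visits for _, visits in values)
    return key, total / count

def run_job(records):
    """Toy driver: map every record, group by key (shuffle), then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())

# Hypothetical clickstream records: (page, seconds spent on the page).
records = [("home", 10.0), ("cart", 30.0), ("home", 20.0)]
result = run_job(records)
print(result)
```

The slide's claim is that an analyst calls a modeling function instead and leaves this decomposition to the platform.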
  • Requirements for Hadoop Adoption (Requirements for Hadoop’s Role in the Modern Data Architecture)
    - Integrated: interoperable with existing data center investments
    - Skills: leverage your existing skills in development, operations, and analytics
    - Key Services: platform, operational, and data services essential for the enterprise
  • Poll #3: Which of the following would you most like to accomplish with R + Hadoop?
    - Build a model to be put into production in Hadoop
    - Build a model to be put into production elsewhere
    - Create new data from Hadoop to supplement an existing analytics process
    - Something else
  • Next Steps
    - More about Revolution Analytics and Hadoop: http://www.revolutionanalytics.com/products/r-for-hadoop.php
    - Get started on Hadoop with Hortonworks Sandbox: http://hortonworks.com/sandbox
    - Follow us: @hortonworks @RevolutionR