Big Data Analysis Starts with R


Published in: Technology, Education


  1. Revolution Analytics: The Big Data Analytics Revolution Starts with R. December 20, 2011
  2. In Today’s Webinar: About Revolution Analytics; Getting Value with Advanced Analytics; Implementing the Advanced Analytics Stack; Resources and Further Reading
  3. Most advanced statistical analysis software available. Half the cost of commercial alternatives. The professor who invented analytic software for the experts now wants to take it to the masses. 2M+ Users. Power, Productivity, Enterprise Readiness. 4,000+ Applications: Statistics, Predictive Analytics, Data Mining, Visualization. Industries: Finance, Life Sciences, Manufacturing, Retail, Telecom, Social Media, Government
  4. What is R? Data analysis software; a programming language; an environment; an open-source software project; a community
  5. What’s the Difference Between R and Revolution R Enterprise? Revolution R is 100% R and More®. Components: Multi-Threaded Math Libraries; Web-Based GUI; Web Services API; Big Data Analysis; Parallel Tools; Technical Support; IDE / Developer GUI; 4,000+ Community Packages; Build Assurance; R Engine Language Libraries
  6. Let’s Talk about Big Data
  7. Extracting Value with Advanced Analytics. Missing the potential value of the data that is being collected; need more than counts and averages. Advanced Analytics with Big Data: Predict the Future; Understand Risk and Uncertainty; Embrace Complexity; Identify the Unusual; Think Big
  8. R: A Unique Platform for Extracting Value from Data
     • Data Exploration and Visualization: R is superior at exploring data to find unexpected trends and relationships, finding the best predictive models, and identifying critical “outliers”, such as clusters of customers who are particularly profitable (or unprofitable!).
     • Data Science: Google, LinkedIn and Facebook rely on R and the skills of data scientists who are accustomed to hacking together large data sets from disparate sources, visualizing and exploring data to identify novel modeling techniques, and combining the results of several modeling strategies to optimize predictive power.
     • Modeling Innovation: Other commercial programs push users through a pre-programmed procedure and discourage modeling innovation. R was created as a 4GL with the needs of modern data scientists in mind, with an interactive language that promotes data exploration, data visualization, and flexible data modeling.
     • Talent: R is creating a massive amount of talent because it is now the dominant tool of choice at universities.
  9. Making It Work: Use Cases for Big Data Analytics Deployment
  10. The Advanced Analytics Stack. Layers: Deployment / Consumption; Advanced Analytics; ETL; Data / Infrastructure. Reference: “Open Analytics Stack” White Paper
  11. Best Practices for Implementing an Advanced Analytics Stack for Big Data: Limit sampling; Reduce data movement and replication; Bring the analytics as close as possible to the data; Optimize computation speed with parallel algorithms
  12. Big Data Computations. Computations are data intensive, so to be effective they must rely on data parallelism: data is distributed across compute nodes, and the same task is run in parallel on each of the data partitions. Examples of distributed computing frameworks that support data parallelism: traditional file-based analytics using on-premise clusters; Hadoop and MapReduce; in-database analytics using parallel hardware architectures
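The data-parallel pattern on this slide can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the `delays` vector, the four partitions, and the helper `summarise_partition` are hypothetical, and `lapply` runs the partitions one after another on a single machine rather than on real compute nodes.

```r
# Hypothetical illustration: the data and the partition count are made up.
set.seed(42)
delays <- rnorm(1000, mean = 10, sd = 5)   # stand-in for a large data set

# 1. Distribute: split the data into partitions, one per (simulated) compute node.
n_partitions <- 4
partition_id <- rep(seq_len(n_partitions), length.out = length(delays))
partitions <- split(delays, partition_id)

# 2. Same task on every partition: each returns a small intermediate summary.
summarise_partition <- function(x) c(sum = sum(x), n = length(x))
partials <- lapply(partitions, summarise_partition)

# 3. Combine: add up the per-partition sums and counts, then form the mean.
totals <- Reduce(`+`, partials)
parallel_mean <- unname(totals["sum"] / totals["n"])
```

Because each partition returns only a sum and a count, the combined result should match `mean(delays)` up to floating-point rounding, while only two numbers per partition ever need to move between nodes.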
  13. Revolution R Enterprise: Big Data Statistics in R. Data: US airline departure and arrival, 1987-2008. File: AirlineData87to08.xdf. Rows: 123.5 million. Variables: 29. Size on disk: 13.2 GB. Example: arrDelayLm2 <- rxLinMod(ArrDelay ~ DayOfWeek:F(CRSDepTime), cube = TRUE)
  14. RevoScaleR: Distributed Computing. Diagram: a Master Node (RevoScaleR) connected to four Compute Nodes (RevoScaleR), each with its own Data Partition.
     • Portions of the data source are made available to each compute node.
     • RevoScaleR on the master node assigns a task to each compute node.
     • Each compute node independently processes its data and returns its intermediate results to the master node.
     • The master node aggregates all of the intermediate results from each compute node and produces the final result.
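The master/compute-node flow described above can be illustrated with a chunked least-squares fit in base R. This is a sketch of the general pattern, not RevoScaleR's actual implementation: the `flights` data frame, its column names, and the helper `node_task` are hypothetical, and the "nodes" are simulated with `lapply` on one machine.

```r
# Hypothetical data: column names and sizes are illustrative only.
set.seed(1)
n <- 2000
flights <- data.frame(dep_hour = runif(n, 0, 24),
                      distance = runif(n, 100, 3000))
flights$arr_delay <- 5 + 0.8 * flights$dep_hour +
  0.002 * flights$distance + rnorm(n)

# Each simulated compute node holds one partition of the rows.
chunks <- split(flights, rep(1:4, length.out = n))

# Task run on every compute node: return small intermediate results only
# (the cross-products X'X and X'y), never the raw rows.
node_task <- function(chunk) {
  X <- model.matrix(~ dep_hour + distance, data = chunk)
  y <- chunk$arr_delay
  list(xtx = crossprod(X), xty = crossprod(X, y))
}
intermediate <- lapply(chunks, node_task)

# Master node: sum the intermediate results and solve the normal equations.
xtx <- Reduce(`+`, lapply(intermediate, `[[`, "xtx"))
xty <- Reduce(`+`, lapply(intermediate, `[[`, "xty"))
coef_chunked <- drop(solve(xtx, xty))
```

The cross-products add up across partitions, so the coefficients should agree with `coef(lm(arr_delay ~ dep_hour + distance, data = flights))` up to numerical tolerance. Solving the normal equations directly keeps the sketch short; a production routine would likely use a more numerically stable decomposition.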
  15. R and Hadoop. Capabilities delivered as individual R packages: rhdfs (R and HDFS); rhbase (R and HBase); rmr (R and MapReduce). Downloads available from GitHub. Diagram: the R Client uses rmr to reach the Job Tracker and Task Nodes (Map or Reduce), rhdfs to reach HDFS, and rhbase to reach HBase via Thrift
  16. Revolution Analytics with Netezza Appliance
  17. Deployment with Revolution R Enterprise. End User: Desktop Applications (e.g. Excel); Business Intelligence (e.g. QlikView); Interactive Web Applications. Application Developer: client libraries (JavaScript, Java, .NET) over HTTP/HTTPS with JSON/XML. RevoDeployR Web Services. Admin: Session Management; Data/Script Management; Authentication; Administration. R Programmer: R
  18. Three final thoughts. Now enterprise-ready, R offers the innovation and flexibility needed to meet analytics challenges in a changing world. R-enabled advanced analytics are key to unlocking value in big data. Revolution Analytics optimizes R to take advantage of multiple data management paradigms and emerging best practices.
  19. Resources: Slides / Replay; “Open Analytics Stack” White Paper; McKinsey Report on Big Data; Conway, Data Science Intelligence; “Big Analytics” White Paper by Norman H. Nie; Revolution R Enterprise; Questions
  20. Thank you. The leading commercial provider of software and support for the popular open source R statistics language. 650.330.0553 Twitter: @RevolutionR