Revolution Analytics


Published on

Published in: Technology
  • Be the first to comment

  • Be the first to like this

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide

Revolution Analytics

  1. 1. Solution Spotlight Presents<br />
  2. 2. Integrating R and Hadoop<br />Part of Revolution Analytics’ <br />Big Analytics Strategy<br />Contact us at<br />2<br />
  3. 3. Outline<br />Introduction to Revolution Analytics<br />Opportunity and Challenges of Big Analytics<br />Revolution Analytics’ Support of Integration between R and Hadoop<br />Contact Info<br />
  4. 4. Open Source Analytics for the Enterprise<br /><ul><li>Most advanced statistical analysis software available</li></ul>The professor who invented analytic software for the experts now wants to take it to the masses<br /><ul><li>Half the cost of commercial alternatives
  5. 5. 2M+ Users
  6. 6. 2,500+ Applications</li></ul>Finance<br />Statistics<br />Life Sciences<br />Predictive Analytics<br />Manufacturing<br />Retail<br />Data Mining<br />Telecom<br />Social Media<br />Visualization<br />Government<br />
  7. 7. Revolution has garnered tremendous attention from media and analysts<br />
  8. 8. Big Analytics, Big Advantages<br />Big Analytics could be<br />Simple algorithms running on “Big Data”<br />Compute-intensive algorithms running on either “Big Data” or small data sets<br />Advanced Analytic routines for data visualization or statistical analysis<br />
  9. 9. Extracting Value with Big Analytics<br />Big Analytics’ Advantages<br />Predict the Future<br />Understand Risk and Uncertainty<br />Embrace Complexity<br />Identify the Unusual<br />Think Big<br />7<br />
  10. 10. Big Analytics Challenges <br />Computations are data intensive (i.e. require large amounts of data)<br />To be effective, must rely on data parallelism<br />Data is distributed across compute nodes<br />Same task is run in parallel on each of the data partitions<br />Examples of distributed computing frameworks that support data parallelism<br />Traditional file based analytics using on-premise clusters<br />Hadoop and MapReduce<br />In-Database Analytics using parallel hardware architectures<br />8<br />
  11. 11. Key Objectives for Big Analytics Deployments<br />Best performance is achieved when these Big Analytics challenges are overcome:<br />Avoid sampling / aggregation; <br />Reduce data movement and replication; <br />Bring the analytics as close as possible to the data and; <br />Optimize computation speed. <br />Revolution Analytics’ support for R and Hadoop helps overcome these challenges<br />
  12. 12. Revolution Analytics’ RevoConnectRsfor Hadoop<br />RevoHDFS provides connectivity from R to HDFS and RevoHBase<br />Allows an R programmer to manipulate Hadoop data stores directly from HDFS and HBASE<br />RevoHStream allows MapReduce jobs to be developed in R and executed as Hadoop Streaming jobs <br />Gives R programmers the ability to write MapReduce jobs in R using Hadoop Streaming<br />
  13. 13. R/Hadoop – Revolution Analytics<br />HDFS<br />HBASE<br /><ul><li>Connectors to HDFS and HBASE for interacting with data stores directly in R
  14. 14. Hadoop Streaming package for executing MapReduce jobs from R.</li></ul>R<br />Map Reduce<br />Task Tracker<br />Task Node<br />R Client<br />Job Tracker<br />
  15. 15. RevoHDFS<br />R package for working with HDFS<br />Connect and Browse HDFS<br />Read/Write/Delete/Copy/Rename files<br />Examples:<br />Read an HDFS text file into a data frame<br />Serialize a data frame to HDFS<br />Stream lines from HDFS text file that can be used with biglm or bigglm<br />12<br />
  16. 16. RevoHBase<br />R Package for working with HBASE<br />Connect and Browse HBASE<br />Get Rows/Columns of an HBASE table<br />Write data to HBASE table<br />Create/Delete HBASE table<br />Examples<br />Create a data frame in R from a collection of Rows/Columns from HBASE<br />Update an HBASE table with values from a data frame<br />13<br />
  17. 17. RevoHStream<br />RevoHStream – R package capable of performing the following types of Analysis using Hadoop Streaming<br />Simulations - Monte Carlo and other Stochastic analysis<br />R ‘apply’ family of operations (tapply, lapply…)<br />Binning, quantiles, summaries and crosstabs for input to displays (ggplot, lattice).<br />Data transformations<br />Data Mining<br />14<br />
  18. 18. Example MapReduce AlgorithmLogistic Regresion<br />## create test set as follows<br />## rhwrite(lapply (1:100, function(i) {eps = rnorm(1, sd =10) ; keyval(i, list(x = c(i,i+eps), y = 2 * (eps > 0) - 1))}), "/tmp/logreg")<br />## run as:<br />## rhLogisticRegression("/tmp/logreg", 10, 2, 0.05)<br />## max likelihood solution diverges for separable dataset, (-inf, inf) such as the above<br />rhLogisticRegression = function(input, iterations, dims, alpha){<br /> plane = rep(0, dims)<br />g = function(z) 1/(1 + exp(-z))<br /> for (i in 1:iterations) {<br /> gradient = rhread(revoMapReduce(input,<br /> map = function(k, v) keyval (1, v$y * v$x * g(-v$y * (plane %*% v$x))),<br /> reduce = function(k, vv) keyval(k, apply(,vv),2,sum)),<br /> combine = T))<br /> plane = plane + alpha * gradient[[1]]$val }<br />plane }<br />15<br />
  19. 19. Get more information about Revolution Analytics’ Big Analytics Solutions, including R connectors for Hadoop<br />1 855-GET-REVO<br />16<br /><br />
  20. 20.<br />