Revolution Analytics

The Big Data Analytics Revolution
Starts with R

December 20, 2011

In Today’s Webinar:
 About Revolution Analytics
 Getting Value with Advanced Analytics
 Implementing The Advanced Analytics Stack
 Resources and Further Reading
Most advanced statistical analysis software available
Half the cost of commercial alternatives
The professor who invented analytic software for the experts now wants to take it to the masses

2M+ Users
4,000+ Applications

 Power
 Productivity
 Enterprise Readiness

 Statistics
 Predictive Analytics
 Data Mining
 Visualization

 Finance
 Life Sciences
 Manufacturing
 Retail
 Telecom
 Social Media
 Government
What is R?

 Data analysis software
 An open-source software project
 A programming language
 A community
 An environment

What’s the Difference Between R and
Revolution R Enterprise?

Revolution R is 100% R and More®

 Multi-Threaded Math Libraries
 Web-Based GUI
 Web Services API
 Big Data Analysis
 Parallel Tools
 Technical Support
 IDE / Developer GUI
 4,000+ Community Packages
 R Engine Language Libraries
 Build Assurance

Let’s Talk about Big Data

Extracting Value with Advanced Analytics
 Missing the potential value of the data that is
 being collected
 Need more than counts and averages
 Advanced Analytics with Big Data
    Predict the Future
    Understand Risk and Uncertainty
    Embrace Complexity
    Identify the Unusual
    Think Big

R: A Unique Platform for Extracting Value from
Data

  Data Exploration and Visualization
    • R is superior at exploring data to find unexpected trends and
      relationships, finding the best predictive models, and identifying
      critical “outliers”, such as clusters of customers who are particularly
      profitable (or unprofitable!).

  Data Science
    • Google, LinkedIn and Facebook rely on R and the skills of data
      scientists who are accustomed to hacking together large data sets
      from disparate sources, visualizing and exploring data to identify
      novel modeling techniques, and combining the results of several
      modeling strategies to optimize predictive power.

  Modeling Innovation
    • Other commercial programs push users through a pre-programmed procedure
      and discourage modeling innovation. R was created as a 4GL with the
      needs of modern data scientists in mind, with an interactive language that
      promotes data exploration, data visualization, and flexible data modeling.

  Talent
    • R is creating a massive amount of talent because it is now the dominant
      tool of choice at universities.

Making It Work
Use Cases for Big Data Analytics Deployment


The Advanced Analytics Stack
       Deployment / Consumption




       Advanced Analytics




       ETL




       Data / Infrastructure




                “Open Analytics Stack” White Paper: bit.ly/lC43Kw
Best Practices for Implementing an Advanced
Analytics Stack for Big Data

  Limit sampling
  Reduce data movement and replication
  Bring the analytics as close as possible to
  the data
  Optimize computation speed – parallel
  algorithms




Big Data Computations
 Computations are data intensive
 To be effective, must rely on data parallelism
   Data is distributed across compute nodes
   Same task is run in parallel on each of the data partitions
 Examples of distributed computing frameworks that
 support data parallelism
   Traditional file based analytics using on-premise clusters
   Hadoop and MapReduce
   In-Database Analytics using parallel hardware
   architectures
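
The data-parallel pattern described above can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the `delays` vector is simulated, the four "compute nodes" are just list elements on one machine, and `partial_stats` is a hypothetical helper name, not part of any Revolution Analytics product.

```r
# Hypothetical data: 1,000 simulated arrival delays (minutes).
set.seed(42)
delays <- rnorm(1000, mean = 8, sd = 30)

# Distribute the data across 4 "compute nodes" (here: list elements).
n_nodes    <- 4
partitions <- split(delays, rep_len(seq_len(n_nodes), length(delays)))

# The same task runs on every partition and returns a small summary.
partial_stats <- function(x) c(n = length(x), sum = sum(x))
partials <- lapply(partitions, partial_stats)

# Combine the per-partition summaries into the global mean.
totals      <- Reduce(`+`, partials)
global_mean <- unname(totals["sum"] / totals["n"])
```

Because each partition only returns a count and a sum, the combined result matches `mean(delays)` without any single step needing the whole data set. On a real cluster the `lapply` call would be replaced by whatever mechanism the chosen framework provides for running a task on each node.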



Revolution R Enterprise: Big Data Statistics in R
                              www.revolutionanalytics.com/bigdata



Every US airline
departure and arrival,
1987-2008


File: AirlineData87to08.xdf
Rows: 123.5 million
Variables: 29
Size on disk: 13.2 GB




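# Arrival delay by day of week and scheduled departure hour; F() treats CRSDepTime as a factor
arrDelayLm2 <- rxLinMod(ArrDelay ~ DayOfWeek:F(CRSDepTime), data = "AirlineData87to08.xdf", cube = TRUE)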




RevoScaleR – Distributed Computing

[Diagram: four data partitions, each attached to a compute node running RevoScaleR; all compute nodes connect to one master node running RevoScaleR]

  • Portions of the data source are made available to each compute node
  • RevoScaleR on the master node assigns a task to each compute node
  • Each compute node independently processes its data and returns its
    intermediate results to the master node
  • The master node aggregates all of the intermediate results from each
    compute node and produces the final result
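
The master/compute-node flow above can be imitated in base R with a linear regression. This sketch is not RevoScaleR's internal implementation. It assumes simulated data, treats list elements as "nodes", and uses a hypothetical helper `node_task`. Each node returns only the small cross-product matrices for its chunk, and the master adds them up and solves the normal equations.

```r
# Hypothetical data standing in for a large data source.
set.seed(1)
n  <- 2000
df <- data.frame(x = runif(n))
df$y <- 3 + 2 * df$x + rnorm(n, sd = 0.1)

# Portions of the data are made available to each compute node.
chunks <- split(df, rep(1:4, each = n / 4))

# Task assigned to every node: return small intermediate results only.
node_task <- function(chunk) {
  X <- cbind(1, chunk$x)                       # design matrix with intercept
  list(xtx = crossprod(X), xty = crossprod(X, chunk$y))
}
intermediate <- lapply(chunks, node_task)

# Master node: add up the intermediate results and solve the normal equations.
xtx  <- Reduce(`+`, lapply(intermediate, `[[`, "xtx"))
xty  <- Reduce(`+`, lapply(intermediate, `[[`, "xty"))
beta <- drop(solve(xtx, xty))
```

The intermediate results are 2 x 2 and 2 x 1 matrices no matter how many rows each chunk holds, so only a few numbers travel back to the master, and `beta` agrees with `coef(lm(y ~ x, data = df))` fitted on the full data.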




R and Hadoop

[Diagram: R Client, rhdfs, rhbase, rmr, Thrift, HDFS, HBASE, Job Tracker, and a Task Node running R as a Map or Reduce Task]

Capabilities delivered as individual R packages
   rhdfs - R and HDFS
   rhbase - R and HBASE
   rmr - R and MapReduce

Downloads available from GitHub
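
For readers new to MapReduce, the map and reduce phases that R functions take part in can be imitated in base R. The sketch below does not call rmr, rhdfs, or Hadoop. The `flights` vector is hypothetical input, and the "shuffle" step is simulated with `split`.

```r
# Hypothetical input records: day of week for each flight.
flights <- c("Mon", "Tue", "Mon", "Fri", "Tue", "Mon")

# Map phase: emit one (key, value) pair per record.
mapped <- lapply(flights, function(day) list(key = day, value = 1L))

# Shuffle: group the emitted values by key.
keys    <- vapply(mapped, function(kv) kv$key, character(1))
values  <- vapply(mapped, function(kv) kv$value, integer(1))
grouped <- split(values, keys)

# Reduce phase: collapse each key's values to a single result.
counts <- vapply(grouped, sum, integer(1))
```

In a real Hadoop job the map and reduce functions would run as separate tasks on the task nodes, with the framework handling the grouping between them.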




Revolution Analytics with Netezza Appliance




Deployment with Revolution R Enterprise
End User                Desktop Applications (e.g. Excel)
                        Business Intelligence (e.g. QlikView)
                        Interactive Web Applications

Application Developer   Client libraries (JavaScript, Java, .NET)

                        HTTP/HTTPS – JSON/XML

                        RevoDeployR Web Services

Admin                   Session Management
                        Authentication
                        Data/Script Management
                        Administration

R Programmer            R

Three final thoughts
 Now enterprise-ready, R offers innovation
 and flexibility needed to meet analytics
 challenges in a changing world
 R-enabled advanced analytics are key to
 unlocking value in big data
 Revolution Analytics optimizes R to take
 advantage of multiple data management
 paradigms and emerging best practices


Resources
  Slides / Replay: bit.ly/r-big-data

  “Open Analytics Stack” White Paper: bit.ly/lC43Kw

  McKinsey Report on Big Data: bit.ly/jWyrFM

  Conway, Data Science Intelligence: bit.ly/myMwak

  “Big Analytics” White Paper by Norman H. Nie: bit.ly/biganalytics

  Revolution R Enterprise: bit.ly/Enterprise-R

  Questions: david.champagne@revolutionanalytics.com




Thank you.




            The leading commercial provider of software and support for the popular
                             open source R statistics language.




  www.revolutionanalytics.com            650.330.0553                  Twitter: @RevolutionR



