The Elephant In The Room
 Big Data Analytics In the Cloud
    Bill Peer, Principal, Infosys Labs
                 UP 2012
      Cloud Computing Conference
San Francisco, California – December 12, 2012
What’s on the agenda?
•   Definitions
•   Big Data and Analytic Technologies
•   Architecture Stuff
•   Summary




Appendix : References

Definitions – Big Data
• Big Data –       data processing scenarios wherein the volume,
                   variety, and/or velocity of the data is such that
                   conventional RDBMS and/or Data Warehouse
                   technologies alone do not suffice for the need
  Bill’s Stake In The Ground:
          • Volume - Greater than 100 GB
          • Variety    - Structured and Unstructured (forms, video, blogs, photos, …)
          • Velocity - 10 GB per hour




Definitions – Analytics
• Analytics –   discovery of meaningful patterns in data

  Bill’s Two Uses:
        • Decision Support (to help make a choice)
                    -Business Intelligence
                    -Operational Intelligence
        • Value Creation (to add worth)
                    -Algorithm Discovery
                    -Analytics as a Service




Past, Present, Future




(images sourced from Wikimedia Commons)




Big Data and Analytic Technologies




Big Data and Analytic Technology : 3 to Know


                                Hadoop
• Based on Google Paper published in 2004 (MapReduce)
• Can be segmented into 2 key capabilities: MapReduce and HDFS
• Designed to work in a distributed, fault-prone environment
• MapReduce – Processing orchestration framework (great if a problem can be easily divided)
• HDFS – Hadoop Distributed File System (reliable independent of the persistence mechanism, by way of multi-node replication)
• Job Based!
     - Pig Latin – Language to explore data
     - HiveQL – SQL-like calls
     - Mahout – Machine learning collection
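The "easily divided" point can be sketched with the classic word-count example. This is a minimal single-process illustration in plain Python, not Hadoop's API; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are made up for this sketch, and a real cluster would run the map and reduce steps on many nodes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group every emitted value under its key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    # Each document is mapped independently of the others, which is the
    # property that lets a cluster spread the map work across nodes.
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    return reduce_phase(shuffle_phase(mapped))


if __name__ == "__main__":
    splits = ["big data in the cloud", "the elephant in the room"]
    print(word_count(splits))
```

Because no map call depends on another, the problem divides cleanly; problems whose steps need each other's intermediate results fit this model less well.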


Big Data and Analytic Technology : 3 to Know
                                DRILL
• Based on Google Paper published in 2010 (Dremel)
• Provides analysis of large-scale datasets
• Designed to work in a distributed environment
• Query Languages – Google BigQuery
• Low-Latency Distributed Execution – Columnar-centric storage
• Apache Incubator Phase
     “[Dremel] is capable of running aggregation queries over trillion-row tables in seconds. The system scales to thousands of CPUs and petabytes of data, and has thousands of users at Google.” (src: Google Dremel Paper)
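Why columnar-centric storage helps aggregation queries can be sketched in a few lines. This is a toy in-memory illustration, not Drill's or Dremel's actual storage format; the helper names (`to_columns`, `sum_by_group`) and the sample fields are hypothetical, and it assumes every row carries the same fields.

```python
def to_columns(rows):
    """Pivot a list of row dicts into one list per field (column layout)."""
    columns = {}
    for row in rows:
        for field, value in row.items():
            columns.setdefault(field, []).append(value)
    return columns


def sum_by_group(columns, group_field, value_field):
    """Sum value_field per group_field, reading only those two columns."""
    totals = {}
    for group, value in zip(columns[group_field], columns[value_field]):
        totals[group] = totals.get(group, 0) + value
    return totals


if __name__ == "__main__":
    rows = [
        {"country": "US", "bytes": 10, "url": "/a"},
        {"country": "IN", "bytes": 5, "url": "/b"},
        {"country": "US", "bytes": 7, "url": "/c"},
    ]
    columns = to_columns(rows)
    # The "url" column is never read by this query.
    print(sum_by_group(columns, "country", "bytes"))
```

In a row layout the same query would have to pass over every field of every record; with columns stored separately, fields the query does not mention are simply skipped.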

Big Data and Analytic Technology : 3 to Know
                                      Storm
• Event Streaming platform used by Twitter
• Allows for continuous real-time data spelunking
• Designed to work in a distributed environment leveraging clusters
• Resident Queries – Requests for event patterns of interest are continuously watched for
• Topology Centric – You create graphs of computation
• Event Streaming is Different – Storm can be used effectively to build a Complex Event Processing (CEP) capability by an enterprise. As with other CEP-type frameworks, it requires a shift to an uncommon perspective to be effective.
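The shift in perspective is that the query stays resident while the data flows past it. The sketch below mimics that shape with plain Python generators; it does not use Storm's API, and the stage names (`event_spout`, `error_filter_bolt`, `burst_detector_bolt`), the event format, and the window and threshold values are all assumptions made for illustration.

```python
from collections import deque


def event_spout(events):
    """Spout: the source of the stream; yields events one at a time."""
    for event in events:
        yield event


def error_filter_bolt(stream):
    """Bolt: emit 1 for each error event and 0 for anything else."""
    for event in stream:
        yield 1 if event["level"] == "error" else 0


def burst_detector_bolt(stream, window=3, threshold=2):
    """Bolt holding the resident query: alert whenever the most recent
    `window` events contain at least `threshold` errors."""
    recent = deque(maxlen=window)
    for position, flag in enumerate(stream):
        recent.append(flag)
        if sum(recent) >= threshold:
            yield {"alert": "error burst", "at_event": position}


def run_topology(events):
    # Wire spout -> bolt -> bolt; each stage pulls lazily from the one before.
    return list(burst_detector_bolt(error_filter_bolt(event_spout(events))))


if __name__ == "__main__":
    sample = [{"level": lvl} for lvl in ["info", "error", "error", "info", "info"]]
    print(run_topology(sample))
```

Here the graph of computation is a simple chain; a real topology can fan out and merge, and runs continuously over an unbounded stream rather than a finite list.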

A Cloud Centric Big Data NRT Architecture
[Figure: architecture diagram with CEP and Interactive Query components, spanning three zones labeled Not Cloud, In Cloud, and Not Cloud]
*Architecture graphic is a modified version of WSO2’s BAM picture

Big, Big Data Analytic Architecture Consideration
• Data Transfer Speed
   • Where is your data? Is it where you will be processing?
       • 1 TB of data takes roughly:
           • 300 hours over a 10 Mbps network
           • 30 hours over a 100 Mbps network
           • 3 hours over a 1 Gbps network
           • 20 minutes over a 10 Gbps network
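The arithmetic behind these figures can be checked with a short calculation. The sketch below assumes a decimal terabyte (10^12 bytes) and treats link efficiency as a parameter; neither assumption is stated on the slide, and `transfer_hours` is a name chosen for this example. At full line rate the results come out somewhat lower than the slide's round numbers (about 222 hours at 10 Mbps), which would be consistent with the slide allowing for protocol overhead and contention, though the slide does not say so.

```python
def transfer_hours(size_bytes, link_bits_per_second, efficiency=1.0):
    """Hours needed to move size_bytes over a link.

    efficiency is the usable fraction of the nominal line rate (assumed).
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    seconds = (size_bytes * 8) / (link_bits_per_second * efficiency)
    return seconds / 3600


TERABYTE = 10**12  # decimal terabyte; an assumption, the slide does not specify

if __name__ == "__main__":
    links = [("10 Mbps", 10e6), ("100 Mbps", 100e6),
             ("1 Gbps", 1e9), ("10 Gbps", 10e9)]
    for label, rate in links:
        ideal = transfer_hours(TERABYTE, rate)
        print(f"{label}: {ideal:.2f} hours at full line rate")
```

Passing a lower `efficiency` shows how quickly real-world throughput stretches these times, which is the slide's point about keeping data close to where it is processed.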




Framework For Selecting Approach




Summary
•   “approaches for near-real time Business Intelligence and Analytics”
•   “Info. on technologies ranging from Hadoop to Dremel to Event Streaming”
•   “applicability and limitations of these when in the Cloud”
•   “high-level architectures that must be considered will be shared”
•   “entertained, energized, and enlightened”
•   “realistic frame of reference to bring back to their organization”
•   “Journey to the Clouds”
•   “Dumbo can really fly”



Feedback Forms
Please extract from your wallet
One of the feedback forms to the right

Add any commentary you have in the
White space, and hand to the
Presenter after the session


Thank you for attending!
See you in the Clouds!




References
• Big Data Spectrum, Infosys
         http://www.infosys.com/cloud/resource-center/Pages/big-data-spectrum.aspx
• Dremel: Interactive Analysis of Web-Scale Datasets, Melnik et al., Google
         http://research.google.com/pubs/pub36632.html
• DrillProposal, Apache
         http://wiki.apache.org/incubator/DrillProposal
• Storm Rationale
         https://github.com/nathanmarz/storm/wiki/Rationale
• WSO2 BAM, wso2
         http://wso2.com/products/business-activity-monitor/



