Building Big Data Applications Services for Private Clouds

Richard McDougall
Chief Architect, Storage and Application Services
VMware, Inc
@richardmcdougll




                                                    © 2009 VMware Inc. All rights reserved
Infrastructure, Apps and now Data…




Diagram: Build, Run, Manage across Private and Public clouds

•  Simplify Infrastructure With Cloud
•  Simplify App Platform Through PaaS
•  Simplify Data




 2
Trend 1/3: New Data Growing at 60% Y/Y

Exabytes of information stored: 20 zettabytes by 2015, 1 yottabyte by 2030

Yes, you are part of the yotta generation…

Sources of new data (chart labels):
audio, digital TV, digital photos, camera phones, RFID, medical imaging, sensors,
satellite images, logs, scanners, Twitter, CAD/CAM, appliances, machine data, digital movies



                                                          Source: The Information Explosion, 2009


3
Data Growth in the Enterprise




4
Trend 2/3: Big Data – Driven by Real-World Benefit




5
Enterprise : Early Adopter Industries and Use Cases




6
Early Adopters: Enterprise Segmentation

Verticals                  Targets                       Use Cases

•  Financial Services      •  Existing Hadoop Users      •  Business trend analytics
•  Retail                  •  Business Analysts          •  Revenue analytics
•  Telco                   •  Data Scientists            •  CDR, call pattern analytics
•  Manufacturing           •  LOB managers               •  Sensor data analytics
•  Government              •  IT/Ops                     •  Log, machine data analytics
                                                         •  Fraud detection
                                                         •  Homeland security
                                                         •  Predictive analytics




     7
Early Adopters: Non-enterprise Segmentation

Verticals                  Targets                       Use Cases

•  Online Advertising      •  End users/Exec users       •  Behavioral analytics
•  eCommerce               •  Business Analysts          •  Audience segmentation
•  Mobile                  •  PM, LOB managers           •  Revenue optimization
•  Social Media            •  Marketing/Sales            •  User activity monetization
•  Gaming                  •  Data Engineers             •  Inventory, price management
                           •  Data Scientists            •  Recommendations
                           •  IT/Operations              •  Predictive analytics




     8
Why now? More transactions (Social/Mobile/Local)

                               Size of data       Communications         Transactions

SoMoLo                         500 TB data/day    30B messages/month     35 check-ins/sec, 13k API calls/sec
Big "traditional" companies    1 TB data/day      3.7B calls/month       10k card transactions/sec
    9
Trend 3/3: Value from Data Exceeds Hardware Cost

!  Value from the intelligence of data analytics now outstrips the cost
  of hardware
  •  Hadoop enables the use of 10x lower cost hardware
  •  Hardware cost halving every 18mo
Chart (value vs. cost): Big Iron at $40k/CPU, Commodity Cluster at $1k/CPU
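
A back-of-the-envelope sketch of the cost figures on this slide. The per-CPU prices and the 18-month halving period come from the slide; the 36-month horizon and the function name are assumptions chosen only for illustration.

```python
def projected_cost(cost_today, months, halving_period=18):
    """Cost after `months` if it halves every `halving_period` months."""
    return cost_today * 0.5 ** (months / halving_period)

# Per-CPU prices quoted on the slide.
big_iron = 40_000
commodity = 1_000

# How many commodity CPUs one big-iron CPU budget buys today.
ratio = big_iron / commodity

# Assumed horizon: three years (two halving periods).
cost_in_3_years = projected_cost(commodity, 36)
```

Under these assumptions the hardware term keeps shrinking, which is the slide's point: the value side of the chart, not the cost side, drives the decision.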




 10
The Old Big Data Stack

Diagram components:
  •  Files, SQL Databases
  •  Master Data Management (Oracle, SAP)
  •  Extract, Transform, Load (Informatica)
  •  Column-Oriented Relational Database (Oracle, Teradata, DB2)
  •  Statistics (SAS, SPSS)
  •  Data Visualization (Crystal, Business Objects)
  •  Business Intelligence




11
The Old Big Data Stack

!  Unable to handle large data volumes & diversity of data
!  Iterative, brute-force and slow process
!  Lack of ad-hoc data navigation across events and time
!  Cumbersome ETL to “process” and DBAs to “prepare”
!  Focused on structured data that is warehoused
!  Web analytics solutions force real-time events into rigid schemas in DBs

Diagram: the same stack as on the previous slide (Files, SQL Databases, Master Data
Management, ETL, Column-Oriented Relational Database, Statistics, Data Visualization,
Business Intelligence)


 12
The Journey To Big Data Analytics




1.  BI As A Service (Technology Focus)
     •  All Data, Faster Answers, Elastic & Scalable
     •  Layers: Analytic Engines, Cloud Infrastructure
     •  Goal: encourage experimentation with existing data
2.  Agile Analytics (People & Productivity Focus)
     •  Data Science, Collaboration, Self-Service
     •  Layers: Agile Process & Tools, Analytic Productivity Platform
     •  Goal: discover meaningful insights that impact the business
3.  Predictive Enterprise (Application Focus)
     •  Real Time Decisions, New Applications, Data Monetization
     •  Layers: Big Data Enabled Apps
     •  Goal: operationalize those insights as quickly as possible


13
Customer profiles

1.  Business analysts, LOB managers, execs
     •  Need: out-of-the-box analytics
     •  Designed for: self-service for end-user leveraging app
       developers
2.  Data engineers/analysts
     •  Need: out-of-the-box + some customization
     •  Designed for: admin + operations
3.  Data scientists
     •  Need: power capabilities + heavy customization
     •  Designed for: data scientists
4.  IT, Operations
     •  Need: out-of-the-box + some customization
     •  Designed for: IT/admin, ops


14
What is Data Science and Data Engineering?




Data Science & Data Engineering sit at the intersection of:
  •  Distributed, parallelization algorithm & programming skills
  •  Math and statistical knowledge
  •  Business domain and problem understanding
  •  Vertical or horizontal use case and analytics experience




15
What is Driving Big Data?




Chart categories: Structured, Semi-structured, Largely Unstructured




      Source: IBM and Oxford Survey: Getting Closer to Customers Tops Big Data Agenda, October 17, 2012




 16
Today’s Big Data System:


Diagram components:
  •  Real Time Streams
  •  Real-Time Processing (S4, Storm)
  •  ETL
  •  Real Time Structured Database
  •  Big SQL
  •  Data Parallel Batch Processing
  •  Analytics
  •  Unstructured Data (HDFS)



17
The Unified Analytics Cloud Platform



Layer                    Examples

Analytics Tools          MADlib, Karmasphere, Datameer, Tableau
Developer Frameworks     Hadoop R, Python, Spring, Cloud Foundry (PaaS)
Database/DataStore       Cassandra, HBase, HDFS, HAWQ, Impala
Data Platform            Data Director, EMC Chorus (Data PaaS)
Cloud Infrastructure     vSphere (Private, Public)




18
The New Big Data System

Diagram components:
  •  Real Time Streams
  •  Automated Models
  •  Real-Time Stream Processing
  •  Business Intelligence
  •  Data Visualization (Excel, Tableau)
  •  ETL
  •  Common Query; Federated Query (SQL aggregation)
  •  Real Time Structured Database
  •  Structured Data Engine
  •  Unstructured and Batch Processing (Hadoop, Hive)
  •  Structured and Unstructured Data (HDFS, S3)
  •  Cloud Infrastructure: Compute, Storage, Networking

 19
An Example – Automated Performance Management

Diagram components:
  •  Input: 10M performance stats/min
  •  Trigger Models
  •  Batch Baseline Calculation
  •  Stats Database
  •  Cloud Infrastructure: Compute, Storage, Networking

20
Big (Data) problems: becoming the standardized stack

                                 Google          Facebook    Yahoo           LinkedIn     Cloudera     Twitter

Metadata                         Dremel          Hive        Hive                         Hive
Schedule & pipeline workloads    Evenflow        Databee     Oozie           Azkaban      Oozie
Dataflow/queries                 -/Sawzall       /Hive       Pig/Hive        Pig          Pig/Hive     Cascading
More-structured data store       Bigtable        HBase       HBase           Voldemort    HBase        Cassandra
DB data collection/integration   MySQL gateway                               Sqoop        Sqoop
Event data collection                            Scribe      Data Highway    Kafka?       Flume        Scribe
Streaming data processing        -               -           -               -            -            -
Batch data processing            MapReduce       Hadoop      Hadoop          Hadoop       Hadoop       Hadoop
File storage                     GFS             Hadoop      Hadoop          Hadoop       Hadoop       Hadoop
Coordination                     Chubby          ZooKeeper   ZooKeeper       ZooKeeper    ZooKeeper    ZooKeeper


  21
New Technologies

Diagram components, with example technologies:
  •  Real Time Streams: Twitter, Sensor Data, Mobile Events, Machine Logs
  •  Automated Models: Machine Learning
  •  Business Intelligence: CETAS
  •  Real-Time Stream Processing: S4, Storm
  •  Data Visualization: Excel, Tableau, …
  •  ETL
  •  Common Query; Query Virtualization (SQL aggregation): …
  •  Real Time Structured Database: Spark, Shark, GemFire, HBase?
  •  Structured data engines: Aster, Greenplum, etc.
  •  Unstructured and Batch Processing: Map-Reduce (Hadoop, Hive)
  •  Storage: HDFS, Ceph, MapR, Colossus
  •  Cloud Infrastructure: Compute, Storage, Networking

 22
Agenda

!  Frameworks
     •    Batch processing: Hadoop, Spark
     •    Graph processing: Pregel, Apache Giraph
     •    Real-time processing: Storm, S4, D-Streams
     •    Interactive processing: Hive, Impala, Shark
!  New requirements
     •  Better network architectures, abstractions and end-to-end resource
        management
     •  Whither disk-locality and the flexibility to move data to compute
        instead
     •  Cluster/Datacenter-wide storage abstractions and services
     •  The silo-less datacenter (multiple frameworks sharing a single
          physical cluster and sharing sticky data)



23
Big Data Processing Patterns (batch, real-time or interactive)


Hadoop, Hive, Impala, Storm, S4, D-Streams, Shark
  •  Funnel (large input, small output, e.g., link/ad click statistics)
  •  Reverse funnel (small input, large output, e.g., logfile loading)
  •  Data transform (input and output sizes similar, e.g., data conversion/translation)

Spark
  •  Iterative, e.g., machine learning tasks

Pregel, Giraph
  •  Graph-based analyses to reason about relationships, e.g., PageRank,
     Ravi's social approach to VI management

   24
Batch processing frameworks (1/2)
!  Apache Hadoop MapReduce (Yahoo!)




      •  Parallel data-processing paradigm (made popular by Google). Uses a
        distributed file system (HDFS) for persistence. Uses commodity h/w
      •  Model of operation: Mapper (read from HDFS + compute in parallel) ->
        Reducer (process map outputs in parallel) -> write to HDFS
      •  Key components: Namenode, Datanode, TaskTracker, JobTracker
      •  Apache Zookeeper sometimes used for coordination
      •  Weakness: Not well-suited for iterative (or graph) computations
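
The model of operation above can be sketched in a few lines of plain Python. This is a single-process illustration of the Mapper -> shuffle -> Reducer flow, not Hadoop's API: the function names are hypothetical, input lines stand in for HDFS blocks, and the phases run sequentially rather than in parallel.

```python
from collections import defaultdict
from itertools import chain

def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group intermediate values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    """Reduce phase: combine all values for one key into a single result."""
    return key, sum(values)

def run_job(lines):
    # In Hadoop the mappers read HDFS blocks in parallel; here they run in turn.
    mapped = chain.from_iterable(mapper(line) for line in lines)
    grouped = shuffle(mapped)
    # In Hadoop the reducer output is written back to HDFS; here it is a dict.
    return dict(reducer(key, values) for key, values in grouped.items())

counts = run_job(["big data big clusters", "data moves to compute"])
```

Each iteration of an iterative algorithm would need a full `run_job` pass, with a write and a re-read of the data in between, which is the weakness noted in the last bullet.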
 25
Batch processing frameworks (2/2)
!  Spark (UC Berkeley)




     •  Support for iterative computations and interactive data-mining by
       caching data in cluster RAM. Uses commodity machines
     •  Core abstraction: Resilient Distributed Datasets (RDDs) used as
       variables in Spark programs. RDDs include lineage data for easy
       recovery/reconstruction
     •  Up to ~20X speedup over Hadoop. Used by Quantifind, Conviva, …

        Image courtesy Zaharia et al.: http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
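
A toy illustration of the lineage idea behind RDDs, in plain Python. `ToyRDD` and its methods are hypothetical names for this sketch and are not Spark's API; the point is only that a dataset which remembers how it was derived can be rebuilt after its cached copy is lost.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD: it records how it was derived."""

    def __init__(self, source=None, parent=None, transform=None, name="source"):
        self._source = source        # raw input, only set on the root dataset
        self._parent = parent        # the dataset this one was derived from
        self._transform = transform  # function applied to the parent's rows
        self.name = name
        self._cache = None           # in-memory copy, may be lost

    def map(self, fn):
        return ToyRDD(parent=self, name="map",
                      transform=lambda rows: [fn(r) for r in rows])

    def filter(self, pred):
        return ToyRDD(parent=self, name="filter",
                      transform=lambda rows: [r for r in rows if pred(r)])

    def lineage(self):
        """Names of the steps needed to rebuild this dataset, root first."""
        steps = [] if self._parent is None else self._parent.lineage()
        return steps + [self.name]

    def collect(self):
        """Return the rows, recomputing from lineage if the cache is empty."""
        if self._cache is None:
            if self._parent is None:
                self._cache = list(self._source)
            else:
                self._cache = self._transform(self._parent.collect())
        return self._cache

    def lose_cache(self):
        """Simulate losing the in-memory copy, e.g. after a node failure."""
        self._cache = None

base = ToyRDD(source=range(10))
evens_squared = base.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
first = evens_squared.collect()
evens_squared.lose_cache()
rebuilt = evens_squared.collect()  # recomputed from lineage, no replica needed
```

Repeated calls to `collect()` reuse the cached rows, which is where the benefit for iterative and interactive workloads comes from.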

26
Graph processing frameworks

!  Pregel (Google)/Apache Giraph
Diagram: supersteps across VM1 and VM2, each with a Compute phase and a Communicate phase, separated by a Barrier




      •  Multiple instances of vertex-programs: user-defined functions running
           at/on each vertex
      •  Bulk Synchronous Parallel (BSP) processing, e.g., used for PageRank
      •  Stateful in-memory computations. Fault-tolerance via checkpoints
      •  Runs on commodity hardware (racks with high intra-rack bandwidth)
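
A minimal sketch of the BSP pattern applied to PageRank, in single-process Python. It is not Pregel's or Giraph's API: the function name, the three-vertex graph, the damping factor and the superstep count are all assumptions for illustration, and checkpointing is omitted.

```python
def pagerank_bsp(graph, supersteps=20, damping=0.85):
    """Run PageRank as a sequence of supersteps.

    `graph` maps each vertex to the list of vertices it links to.
    """
    n = len(graph)
    rank = {v: 1.0 / n for v in graph}  # per-vertex state kept in memory
    for _ in range(supersteps):
        # Compute + communicate: each vertex sends its rank share to neighbours.
        inbox = {v: [] for v in graph}
        for vertex, neighbours in graph.items():
            if neighbours:
                share = rank[vertex] / len(neighbours)
                for dst in neighbours:
                    inbox[dst].append(share)
        # Barrier: every message has been delivered before any vertex updates.
        rank = {v: (1 - damping) / n + damping * sum(inbox[v]) for v in graph}
    return rank

# Hypothetical three-vertex graph with no dangling vertices.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank_bsp(graph)
```

The body of the inner loop is the vertex program; the dictionary rebuild at the end of each pass plays the role of the barrier.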

27
Real-time processing frameworks (stream-processing) 1/2

!  S4 (Yahoo!), Storm (Twitter)
       •  Record-at-a-time processing. Checkpointing for fault-tolerance (S4)




Image courtesy Zaharia et al.: https://www.usenix.org/sites/default/files/conference/protected-files/zaharia_hotcloud12_slides.pdf


  28
Real-time processing frameworks (stream-processing) 2/2

!  Discretized Streams/D-Streams (UC Berkeley)
       •  Treat a streaming computation as a series of batch computations on
           small time intervals. D-Stream = chain of RDDs
       •  Fault-tolerance without replication or upstream backup (buffering)




Image courtesy Zaharia et al.: https://www.usenix.org/sites/default/files/conference/protected-files/zaharia_hotcloud12_slides.pdf
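
A sketch of the discretization idea in plain Python, assuming a hypothetical list of (timestamp, value) events and a one-second interval. The function names are illustrative and not part of any D-Streams implementation; intervals with no events are simply skipped here.

```python
from collections import Counter

def discretize(events, interval):
    """Group (timestamp, value) events into consecutive batches by interval."""
    batches = {}
    for ts, value in events:
        batches.setdefault(int(ts // interval), []).append(value)
    return [batches[key] for key in sorted(batches)]

def process_batch(batch):
    """A deterministic batch computation: count each event type."""
    return Counter(batch)

events = [(0.2, "click"), (0.7, "view"), (1.1, "click"),
          (1.9, "click"), (2.5, "view")]
batches = discretize(events, interval=1.0)
per_interval = [process_batch(batch) for batch in batches]

# Recovery: a lost result is recomputed from its input batch,
# with no replica of the operator state kept anywhere.
recomputed = process_batch(batches[1])
```

Because each interval's result depends only on that interval's input and a deterministic function, recomputation gives the same answer as the original run.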


  29
Interactive processing frameworks 1/4

!  Apache Hive (Facebook)
       •  Open-source data warehouse built on top of Hadoop. HiveQL
           queries compiled into MapReduce jobs. Expensive Where clauses =
           Table scans = high latency




Image courtesy Cubrid: http://www.cubrid.org/blog/dev-platform/platforms-for-big-data/


  30
Interactive processing frameworks 2/4

!  HAWQ (Pivotal)




31
Interactive processing frameworks 3/4

!  Impala (Cloudera)
       •  Inspired by Dremel (Google). Key concepts: columnar-data storage
           (Trevni), aggregation trees for distributed query evaluation
       •  Takes advantage of Hive tables. Uses memory as a cache for tables
       •  Does not use MapReduce to answer queries (unlike Hive).
       •  3X - 90X faster than Hive




Image courtesy Cloudera: http://blog.cloudera.com/blog/2012/10/cloudera-impala-real-time-queries-in-apache-hadoop-for-real/
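
A toy sketch of an aggregation tree for a distributed AVG query, in plain Python. The four-leaf layout, the data values and the function names are assumptions for illustration, not Impala's internals: leaves scan their local rows, and inner nodes merge partial results without ever seeing raw rows.

```python
def leaf_partial(rows):
    """Leaf node: scan local rows and return a partial (sum, count)."""
    return (sum(rows), len(rows))

def merge_partials(partials):
    """Inner or root node: combine child partials into one (sum, count)."""
    return (sum(p[0] for p in partials), sum(p[1] for p in partials))

# Hypothetical data: values held by four leaf nodes.
leaves = [[10, 20], [30], [40, 50, 60], [70]]
leaf_results = [leaf_partial(rows) for rows in leaves]

# Two intermediate nodes each merge two leaves; the root merges those.
mid_results = [merge_partials(leaf_results[:2]),
               merge_partials(leaf_results[2:])]
total, count = merge_partials(mid_results)
average = total / count
```

Carrying (sum, count) pairs instead of per-leaf averages is what keeps the final result exact when leaves hold different numbers of rows.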


  32
Interactive processing frameworks 4/4

!  Shark (UC Berkeley)
       •  Key concepts: columnar-data storage (in-memory), Directed Acyclic
           Graphs of Tasks for distributed query optimization and evaluation,
           dynamic mid-query replanning
       •  Uses Spark RDDs to store data and query processing results
       •  SQL-interface (HiveQL compatible)
       •  Up to 100X faster than Hadoop and Hive




Image courtesy Xin et al.: http://shark.cs.berkeley.edu/presentations/2012-11-26-shark-tech-report.pdf


  33
Unifying the Big Data Platform using Virtualization

!  Goals
 •  Make it fast and easy to provision new data Clusters on Demand
 •  Allow Mixing of Workloads
 •  Leverage virtual machines to provide isolation (esp. for Multi-tenant)
 •  Optimize data performance based on virtual topologies
 •  Make the system reliable based on virtual topologies
!  Leveraging Virtualization
 •  Elastic scale
 •  Use high-availability to protect key services, e.g., Hadoop’s NameNode/JobTracker
 •  Resource controls and sharing: re-use underutilized memory, CPU
 •  Prioritize Workloads: limit or guarantee resource usage in a mixed
     environment
Diagram: Cloud Infrastructure (Private, Public)

34
A Unified Analytics Cloud Significantly Simplifies

Diagram: separate SQL, NoSQL, Hadoop and Decision Support clusters alongside
Big SQL, NoSQL and Hadoop on one Unified Analytics Infrastructure (Private, Public)

!  Simplify
  •  Single Hardware Infrastructure
  •  Faster/Easier provisioning
!  Optimize
  •  Shared Resources = higher utilization
  •  Elastic resources = faster on-demand access

 35
Simplify Heterogeneous Data Management via Data PaaS

Diagram components:
  •  Data store types: File-system, Large-Scale NoSQL, In-Memory, Big SQL
  •  Data PaaS – Common Data Management Layer: Provisioning, Multi-tenancy,
     Import/Export, Management, Data Discovery
  •  Platform layers: Analytics Tools, Developer, Databases, Data Platform,
     Cloud Infrastructure



36
Technology: Databases and Data Stores for Big Data

Store types, ordered from unstructured to structured:

File-system
  •  Types of data: log files, machine-generated data, documents, device data, etc.
  •  Technologies: NAS, HDFS, Blob, S3, MapR, etc.
  •  Values: store any data, easy to scale out, can optimize for cost
Large-Scale NoSQL
  •  Types of data: loosely typed device data, records, events, statistics,
     complex relations/graphs
  •  Technologies: Cassandra, HBase, Voldemort
  •  Values: easy to scale out, flexible and dynamic schemas
In-Memory
  •  Types of data: structured, partitionable data
  •  Technologies: GemFire, Redis, Membase, Spark
  •  Values: high throughput, low latency
Big SQL
  •  Types of data: structured data
  •  Technologies: HAWQ, Impala, Aster, …
  •  Values: high performance for repetitive queries, ease of query language

37
The Unified Analytics Cloud Platform



Layer                    Examples

Analytics Tools          MADlib, Karmasphere, Datameer, Tableau
Developer Frameworks     Hadoop R, Python, Spring, Cloud Foundry (PaaS)
Database/DataStore       Cassandra, HBase, HDFS, Greenplum, Voldemort
Data Platform            Data Director, EMC Chorus (Data PaaS)
Cloud Infrastructure     vSphere (Private, Public)




38
Summary

!  Revolution in Big Data is under way
 •  Data centric applications are now critical
!  Hadoop on Virtualization
 •  Proven performance
 •  Cloud/Virtualization values apparent for Hadoop use
!  Simplify through a Unified Analytics Cloud
 •  One Platform for today’s and future big-data systems
 •  Better Utilization
 •  Faster deployment, elastic resources
 •  Secure, Isolated, Multi-tenant capability for Analytics




39
References

!  Twitter
  •  @richardmcdougll
!  My CTO Blog
  •  http://communities.vmware.com/community/vmtn/cto/cloud

!  Hadoop on vSphere
  •  Talk @ Hadoop World
  •  Performance Paper – http://www.vmware.com/files/.../VMW-Hadoop-Performance-vSphere5.pdf
!  Spring Hadoop
  •  http://blog.springsource.org/2012/02/29/introducing-spring-hadoop




40

Building Big Data Applications

  • 1.
    Building Big DataApplications Services for Private Clouds Richard McDougall Chief Architect, Storage and Application Services VMware, Inc @richardmcdougll © 2009 VMware Inc. All rights reserved
  • 2.
    Infrastructure, Apps andnow Data… Build Run Private Public Manage Simplify Infrastructure Simplify App Platform Simplify Data With Cloud Through PaaS 2
  • 3.
    Trend 1/3: NewData Growing at 60% Y/Y Exabytes of information stored 20 Zetta by 2015 1 Yotta by 2030 Yes, you are part of the yotta audio( generation… digital(tv( digital(photos( camera(phones,(rfid( medical(imaging,(sensors( satellite(images,(logs,(scanners,(twi7er( cad/cam,(appliances,(machine(data,(digital(movies( Source: The Information Explosion, 2009 3
  • 4.
    Data Growth inthe Enterprise 4
  • 5.
    Trend 2/3: BigData – Driven by Real-World Benefit 5
  • 6.
    Enterprise : EarlyAdopter Industries and Use Cases 6
  • 7.
    Early Adopters: EnterpriseSegmentation Verticals! Targets! Use Cases! •  Financial Services" •  Existing Hadoop Users" •  Business Trend Analytics" •  Retail" •  Business Analysts" •  Revenue analytics" •  Telco" •  Data Scientists" •  CDR, call pattern analytics" •  Manufacturing" •  LOB managers" •  Sensor data analytics" •  Government" •  IT/Ops" •  Log, machine data analytics" •  Fraud detection" •  Homeland security" •  Predictive analytics" 7
  • 8.
    Early Adopters: Non-enterpriseSegmentation Verticals! Targets! Use Cases! •  Online Advertising" •  End users/Exec users" •  Behavioral Analytics" •  eCommerce" •  Business Analysts" •  Audience segmentation" •  Mobile" •  PM, LOB managers" •  Revenue Optimization" •  Social Media" •  Marketing/Sales" •  User activity monetization" •  Gaming" •  Data Engineers" •  Inventory, price •  Data Scientists" management" •  IT/Operations" •  Recommendations" •  Predictive analytics" 8
  • 9.
    Why now? moretransactions (Social/Mobile/Local) SoMoLo 30B 500 TB messages/ 35 check-ins/ 13k API calls/ data/day month sec sec Big “traditional” companies 1TB data/ day 10k card 3.7B calls/ transactions/sec month Size of data communications transactions 9
  • 10.
    Trend 3/3: Valuefrom Data Exceeds Hardware Cost !  Value from the intelligence of data analytics now outstrips the cost of hardware •  Hadoop enables the use of 10x lower cost hardware •  Hardware cost halving every 18mo Value Big Iron: $40k/CPU Commodity Cluster: $1k/CPU Cost 10
  • 11.
    The Old BigData Stack Business Intelligence Extract, Transform, Data Statistics Load (SAS, SPSS) Visualization (Informatica) (Crystal, Bus O) Files SQL Databases E T L Column Oriented Relational Database (Oracle, Teradata, DB2) Master Data Management (Oracle, SAP) 11
  • 12.
    The Old BigData Stack !  Unable to handle large data volumes & diversity of data !  Iterative, brute-force and slow process Business !  Lack of ad-hoc data navigation across events and Intelligence time !  Cumbersome ETL to “process” and DBAs to “prepare” !  Focused on structured data that is warehoused !  Web analytics solutions force real-time events into Data rigid schemas in DBs Extract, Transform, Load Statistics (SAS, SPSS) Visualization (Crystal, Bus (Informatica) O) Files SQL Databases E Column Oriented T Relational Database L (Oracle, Teradata, DB2) Master Data Management (Oracle, SAP) 12
  • 13.
    The Journey ToBig Data Analytics 1 2 3 All Data Data Science Real Time Decisions Faster Answers Collaboration New Applications Elastic & Scalable Self-Service Data Monetization Big Data Enabled Apps Agile Process & Tools Analytics Engines Analytic Engines Analytic Productivity Platform Cloud Infrastructure BI As A Service Agile Analytics Predictive Enterprise Technology Focus People & Productivity Focus Application Focus Goal: encourage Goal: discover meaningful Goal: operationalize experimentation insights that those insights with existing data impact the business as quickly as possible 13
  • 14.
    Customer profiles 1.  Businessanalysts, LOB managers, execs •  Need: out-of-the-box analytics •  Designed for: self-service for end-user leveraging app developers 2.  Data engineers/analysts •  Need: out-of-the-box + some customization •  Designed for: admin + operations 3.  Data scientists •  Need: power capabilities + heavy customization •  Designed for: data scientists 4.  IT, Operations •  Need: out-of-the-box + some customization •  Designed for: IT/admin, ops 14
  • 15.
    What is DataScience and Data Engineering? Distributed, Math and Statistical Parallelization Algorithm Knowledge & programming Skills Data Science & Data Engineering Business Domain Vertical or Horizontal and Problem Use case and Analytics Understanding Experience 15
  • 16.
    What is DrivingBig Data? Structured Largely Unstructured Semi-structured Source: IBM and Oxford Survey: Getting Closer to Customers Tops Big Data Agenda, October 17, 2012 16
  • 17.
    Today’s Big DataSystem: Real Time Streams Real-Time Processing (s4, storm) Analytics ETL Data Real Time Parallel Structured Big SQL Batch Database Processing Unstructured Data (HDFS) 17
  • 18.
    The Unified AnalyticsCloud Platform Madlib Analytics Tools Karmasphere Data Meer Tableau Hadoop R Developer Spring PaaS Python Frameworks Cloudfoundry Cassandra hBase HDFS Database/DataStore HawQ Impala Data-Director Data Platform Data PaaS EMC Chorus vSphere Cloud Infrastructure Private Public 18
  • 19.
    The New Big Data System
    Diagram components, top to bottom:
    •  Business Intelligence; Automated Models; Data Visualization (Excel, Tableau)
    •  Real Time Streams -> Real-Time Stream Processing
    •  ETL
    •  Real Time Structured Database; Common Query; Structured Data Engine; Unstructured and Batch Processing (Hadoop, Hive)
    •  Federated Query (SQL aggregation)
    •  Structured and Unstructured Data (HDFS, S3)
    •  Cloud Infrastructure: Compute, Storage, Networking
  • 20.
    An Example – Automated Performance Management
    Diagram components:
    •  10M Performance Stats/min
    •  Trigger Models
    •  Batch Baseline Calculation
    •  Stats Database
    •  Cloud Infrastructure: Compute, Storage, Networking
  • 21.
    Big (Data) problems: becoming the standardized stack
    Columns, in order: Google, Facebook, Yahoo, LinkedIn, Cloudera, Twitter
    •  Metadata: Dremel, Hive, Hive, Hive
    •  Schedule & pipeline workloads: Evenflow, Databee, Oozie, Azkaban, Oozie
    •  Dataflow/queries: -/Sawzall, /Hive, Pig/Hive, Pig, Pig/Hive, Cascading
    •  More-structured data store: Bigtable, HBase, HBase, Voldemort, HBase, Cassandra
    •  DB data collection/integration: MySQL gateway, Sqoop, Sqoop
    •  Event data collection: Scribe, Data Highway, Kafka?, Flume, Scribe
    •  Streaming data processing: -, -, -, -, -, -
    •  Batch data processing: Map/Reduce, Hadoop, Hadoop, Hadoop, Hadoop, Hadoop
    •  File storage: GFS, Hadoop, Hadoop, Hadoop, Hadoop, Hadoop
    •  Coordination: Chubby, ZooKeeper, ZooKeeper, ZooKeeper, ZooKeeper, ZooKeeper
  • 22.
    New Technologies
    Diagram components with example technologies:
    •  Business Intelligence; Machine Learning; CETAS
    •  Real Time Streams: Twitter, Sensor Data, Mobile Events, Machine Logs
    •  Automated Models
    •  Real-Time Stream Processing: S4, Storm, …
    •  Data Visualization (Excel, Tableau)
    •  ETL
    •  Real Time Structured Database: Gemfire, HBase?, etc.
    •  Common Query: Aster, Greenplum
    •  Unstructured and Batch Processing (Hadoop, Hive): Map-Reduce, SPARK, SHARK
    •  Query Virtualization (SQL aggregation)
    •  Storage: HDFS, Ceph, MAPR, Colossus
    •  Cloud Infrastructure: Compute, Storage, Networking
  • 23.
    Agenda
    Frameworks
    •  Batch processing: Hadoop, Spark
    •  Graph processing: Pregel, Apache Giraph
    •  Real-time processing: Storm, S4, D-Streams
    •  Interactive processing: Hive, Impala, Shark
    New requirements
    •  Better network architectures, abstractions and end-to-end resource management
    •  Whither disk-locality and the flexibility to move data to compute instead
    •  Cluster/Datacenter-wide storage abstractions and services
    •  The silo-less datacenter (multiple frameworks sharing a single physical cluster and sharing sticky data)
  • 24.
    Big Data Processing Patterns
    Batch, real-time or interactive (Hadoop, Hive, Impala; Storm, S4, D-Streams, Shark)
    •  Funnel: large input, small output, e.g., link/ad click-statistics
    •  Reverse Funnel: small input, large output, e.g., logfile loading
    •  Data transform: input and output sizes similar, e.g., data conversion/translation
    Iterative (Spark)
    •  E.g., machine learning tasks
    Graph-based (Pregel, Giraph)
    •  Analyses to reason about relationships, e.g., PageRank, Ravi's social approach to VI management
  • 25.
    Batch processing frameworks (1/2)
    Apache Hadoop MapReduce (Yahoo!)
    •  Parallel data-processing paradigm (made popular by Google). Uses a distributed file system (HDFS) for persistence. Uses commodity hardware
    •  Model of operation: Mapper (read from HDFS + compute in parallel) -> Reducer (process map outputs in parallel) -> write to HDFS
    •  Key components: NameNode, DataNode, TaskTracker, JobTracker
    •  Apache ZooKeeper sometimes used for coordination
    •  Weakness: Not well-suited for iterative (or graph) computations
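    The Mapper -> Reducer model on this slide can be sketched in a few lines of plain Python. This is a single-process illustration only, not the Hadoop API: the function names (mapper, shuffle, reducer, run_job) and the in-memory "splits" standing in for HDFS blocks are assumptions made for the example.

```python
from collections import defaultdict
from itertools import chain


def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group intermediate values by key, the step a framework runs between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key, values):
    """Reduce phase: collapse all values seen for one key into a single result."""
    return key, sum(values)


def run_job(input_splits):
    """Run map over every split, shuffle by key, then reduce each key group."""
    mapped = chain.from_iterable(
        mapper(line) for split in input_splits for line in split
    )
    grouped = shuffle(mapped)
    return dict(reducer(key, values) for key, values in grouped.items())


# Two hypothetical input splits, standing in for blocks read from HDFS.
splits = [["big data big"], ["data cloud"]]
result = run_job(splits)
print(result)  # {'big': 2, 'data': 2, 'cloud': 1}
```

    In a real cluster each split would be mapped on the node that holds it and the reduce outputs written back to HDFS; here everything runs in one process so the data flow is easy to follow.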
  • 26.
    Batch processing frameworks (2/2)
    Spark (UC Berkeley)
    •  Support for iterative computations and interactive data-mining by caching data in cluster RAM. Uses commodity machines
    •  Core abstraction: Resilient Distributed Datasets (RDDs) used as variables in Spark programs. RDDs include lineage data for easy recovery/reconstruction
    •  Up to ~20X speedup over Hadoop. Used by Quantifind, Conviva, …
    Image courtesy Zaharia et al.: http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
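    To make the lineage idea concrete, here is a toy sketch in plain Python. It is not Spark's API: the ToyRDD class and its lose_partition method are hypothetical, and it caches every dataset on first use, whereas caching in Spark is opt-in. The point it illustrates is that a dataset which remembers its parent and its transformation can be rebuilt after its in-memory copy is lost.

```python
class ToyRDD:
    """A dataset that records how it was derived (its lineage)."""

    def __init__(self, source=None, parent=None, fn=None):
        self.source = source      # raw data, only set on the root dataset
        self.parent = parent      # dataset this one was derived from
        self.fn = fn              # transformation applied to the parent's rows
        self.cache = None         # in-memory copy of the computed rows
        self.computations = 0     # how many times the rows were (re)built

    def map(self, f):
        return ToyRDD(parent=self, fn=lambda rows: [f(r) for r in rows])

    def filter(self, pred):
        return ToyRDD(parent=self, fn=lambda rows: [r for r in rows if pred(r)])

    def collect(self):
        """Return the rows, rebuilding them from lineage if no cached copy exists."""
        if self.cache is not None:
            return self.cache
        self.computations += 1
        if self.parent is None:
            rows = list(self.source)
        else:
            rows = self.fn(self.parent.collect())
        self.cache = rows
        return rows

    def lose_partition(self):
        """Simulate a node failure that drops the cached rows."""
        self.cache = None


base = ToyRDD(source=range(1, 6))
odd_squares = base.map(lambda x: x * x).filter(lambda x: x % 2 == 1)

first = odd_squares.collect()    # computed once, then held in memory
odd_squares.lose_partition()     # cached copy is gone
second = odd_squares.collect()   # rebuilt from lineage, no replica needed
print(first, second, odd_squares.computations)  # [1, 9, 25] [1, 9, 25] 2
```

    Repeated collect() calls without a failure hit the cache, which is the property that helps iterative workloads; recovery costs one recomputation of the lost dataset rather than a stored replica.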
  • 27.
    Graph processing frameworks
    Pregel (Google) / Apache Giraph
    (Diagram labels: Compute, Communicate, Barrier; VM1, VM2)
    •  Multiple instances of vertex-programs: user-defined functions running at/on each vertex
    •  Bulk Synchronous Parallel (BSP) processing, e.g., used for PageRank
    •  Stateful in-memory computations. Fault-tolerance via checkpoints
    •  Runs on commodity hardware (racks with high intra-rack bandwidth)
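    The compute / communicate / barrier cycle can be sketched as a small single-process PageRank in Python. This is an illustration of the BSP pattern only, not the Pregel or Giraph API; the function name pagerank_bsp, the damping value, the superstep count and the three-vertex graph are all assumptions made for the example, and the graph is assumed to have no vertices without outgoing edges.

```python
def pagerank_bsp(graph, supersteps=20, damping=0.85):
    """PageRank as a sequence of supersteps over a dict of vertex -> out-neighbours."""
    n = len(graph)
    rank = {v: 1.0 / n for v in graph}
    inbox = {v: [] for v in graph}

    for step in range(supersteps):
        outbox = {v: [] for v in graph}
        for v, neighbours in graph.items():
            # Compute: each vertex program sees only its own state and inbox.
            if step > 0:
                rank[v] = (1 - damping) / n + damping * sum(inbox[v])
            # Communicate: send an equal share of the rank along each out-edge.
            for dst in neighbours:
                outbox[dst].append(rank[v] / len(neighbours))
        # Barrier: messages become visible only in the next superstep.
        inbox = outbox
    return rank


# Hypothetical three-vertex graph; every vertex has at least one out-edge.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank_bsp(graph)
print({v: round(r, 3) for v, r in ranks.items()})
```

    Because messages are swapped only at the barrier, the vertex programs inside one superstep are independent of each other, which is what lets a framework spread them across machines.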
  • 28.
    Real-time processing frameworks (stream-processing) 1/2
    S4 (Yahoo!), Storm (Twitter)
    •  Record-at-a-time processing. Checkpointing for fault-tolerance (S4)
    Image courtesy Zaharia et al.: https://www.usenix.org/sites/default/files/conference/protected-files/zaharia_hotcloud12_slides.pdf
  • 29.
    Real-time processing frameworks (stream-processing) 2/2
    Discretized Streams/D-Streams (UC Berkeley)
    •  Treat a streaming computation as a series of batch computations on small time intervals. D-Stream = chain of RDDs
    •  Fault-tolerance without replication or upstream backup (buffering)
    Image courtesy Zaharia et al.: https://www.usenix.org/sites/default/files/conference/protected-files/zaharia_hotcloud12_slides.pdf
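    A minimal Python sketch of the "series of small batch computations" idea follows. It is not the D-Streams implementation; the helper names discretize and running_counts, the one-second interval and the sample events are assumptions for illustration. Each interval becomes an ordinary batch, and each state is derived from the previous state plus that batch.

```python
from collections import Counter


def discretize(events, interval):
    """Split (timestamp, key) events into one batch per time interval, in order."""
    batches = {}
    for timestamp, key in events:
        batches.setdefault(int(timestamp // interval), []).append(key)
    if not batches:
        return []
    # Keep empty intervals so every time step has a batch, even with no input.
    return [batches.get(i, []) for i in range(min(batches), max(batches) + 1)]


def running_counts(batches):
    """Run one small batch job per interval; each state derives from the last."""
    state = Counter()
    history = []
    for batch in batches:
        state = state + Counter(batch)  # new dataset built from the previous one
        history.append(dict(state))
    return history


# Hypothetical event stream: (seconds since start, key).
events = [(0.2, "a"), (0.7, "b"), (1.1, "a"), (3.5, "a")]
batches = discretize(events, interval=1.0)
history = running_counts(batches)
print(batches)      # [['a', 'b'], ['a'], [], ['a']]
print(history[-1])  # {'a': 3, 'b': 1}
```

    Since every state is a deterministic function of the previous state and one input batch, a lost state can be recomputed from its inputs, which is the intuition behind recovery without replication.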
  • 30.
    Interactive processing frameworks (1/4)
    Apache Hive (Facebook)
    •  Open-source data warehouse built on top of Hadoop. HiveQL queries compiled into MapReduce jobs. Expensive WHERE clauses = table scans = high latency
    Image courtesy Cubrid: http://www.cubrid.org/blog/dev-platform/platforms-for-big-data/
  • 31.
    Interactive processing frameworks (2/4)
    Pivotal HawQ
  • 32.
    Interactive processing frameworks (3/4)
    Impala (Cloudera)
    •  Inspired by Dremel (Google). Key concepts: columnar-data storage (Trevni), aggregation trees for distributed query evaluation
    •  Takes advantage of Hive tables. Uses memory as a cache for tables
    •  Does not use MapReduce to answer queries (unlike Hive)
    •  3X - 90X faster than Hive
    Image courtesy Cloudera: http://blog.cloudera.com/blog/2012/10/cloudera-impala-real-time-queries-in-apache-hadoop-for-real/
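    The two key concepts named on this slide, columnar storage and aggregation trees, can be illustrated with a toy Python example. It is not how Impala or Trevni are implemented; the function names, the fan-in of 2, the chunk size and the sample rows are assumptions. The sketch shows that a query over one column reads only that column, and that per-node partial results are merged level by level rather than at a single point.

```python
def to_columns(rows):
    """Pivot row-oriented records into one list per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def chunk(values, size):
    """Split one column into fixed-size chunks, one per hypothetical worker node."""
    return [values[i:i + size] for i in range(0, len(values), size)]


def aggregation_tree(partials, fan_in=2):
    """Merge partial sums level by level until a single root value remains."""
    level = list(partials)
    while len(level) > 1:
        level = [sum(level[i:i + fan_in]) for i in range(0, len(level), fan_in)]
    return level[0] if level else 0


rows = [
    {"user": "u1", "region": "eu", "bytes": 120},
    {"user": "u2", "region": "us", "bytes": 300},
    {"user": "u3", "region": "eu", "bytes": 80},
    {"user": "u4", "region": "ap", "bytes": 50},
    {"user": "u5", "region": "us", "bytes": 200},
]

columns = to_columns(rows)
# SELECT SUM(bytes): only the "bytes" column is touched, never "user" or "region".
partials = [sum(part) for part in chunk(columns["bytes"], size=2)]
total = aggregation_tree(partials)
print(partials, total)  # [420, 130, 200] 750
```

    With a row layout the same query would have to read every field of every record, which is one reason a columnar layout suits interactive aggregate queries.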
  • 33.
    Interactive processing frameworks (4/4)
    Shark (UC Berkeley)
    •  Key concepts: columnar-data storage (in-memory), Directed Acyclic Graphs of Tasks for distributed query optimization and evaluation, dynamic mid-query replanning
    •  Uses Spark RDDs to store data and query processing results
    •  SQL-interface (HiveQL compatible)
    •  100X faster than Hadoop, 100X faster than Hive
    Image courtesy Xin et al.: http://shark.cs.berkeley.edu/presentations/2012-11-26-shark-tech-report.pdf
  • 34.
    Unifying the Big Data Platform using Virtualization
    Goals
    •  Make it fast and easy to provision new data clusters on demand
    •  Allow mixing of workloads
    •  Leverage virtual machines to provide isolation (esp. for multi-tenant)
    •  Optimize data performance based on virtual topologies
    •  Make the system reliable based on virtual topologies
    Leveraging Virtualization
    •  Elastic scale
    •  Use high-availability to protect key services, e.g., Hadoop’s NameNode/JobTracker
    •  Resource controls and sharing: re-use underutilized memory, CPU
    •  Prioritize workloads: limit or guarantee resource usage in a mixed environment
    (Diagram label: Cloud Infrastructure, Private and Public)
  • 35.
    A Unified Analytics Cloud Significantly Simplifies
    Simplify
    •  Single Hardware Infrastructure
    •  Faster/Easier provisioning
    Optimize
    •  Shared Resources = higher utilization
    •  Elastic resources = faster on-demand access
    (Diagram: separate SQL Cluster, NoSQL Cluster, Hadoop Cluster and Decision Support Cluster consolidated onto one Unified Analytics Infrastructure running Big SQL, NoSQL and Hadoop, Private or Public)
  • 36.
    Simplify Heterogeneous Data Management via Data PaaS
    •  Analytics Tools
    •  Developer
    •  Databases: Large-Scale File-system, NoSQL, In-Memory, Big SQL
    •  Data Platform: Data PaaS – Common Data Management Layer (Provisioning, Multi-tenancy, Import/Export, Data Discovery)
    •  Cloud Infrastructure: Cloud Infrastructure Management
  • 37.
    Technology: Databases and Data Stores for Big Data
    Store categories, from unstructured to structured: Large-Scale File-system, NoSQL, In-Memory, Big SQL
    Types of data (in the same order)
    •  Log files, machine generated data, documents, etc…
    •  Loosely typed device data, records, events, statistics, device data
    •  Structured, complex relations/graphs
    •  Structured, partitionable data
    Technologies
    •  Large-Scale File-system: NAS, HDFS, Blob, S3, MAPR, etc.
    •  NoSQL: Cassandra, HBase, Voldemort
    •  In-Memory: Gemfire, Redis, Membase, SPARK
    •  Big SQL: HawQ, Impala, Aster, …
    Values
    •  Large-Scale File-system: Store any data, easy to scale-out, can optimize for cost
    •  NoSQL: Easy to scale-out, flexible and dynamic schemas
    •  In-Memory: High throughput, low latency
    •  Big SQL: High performance for repetitive queries. Ease of query language
  • 38.
    The Unified Analytics Cloud Platform
    •  Analytics Tools: Madlib, Karmasphere, Datameer, Tableau, Hadoop, R
    •  Developer Frameworks / PaaS: Spring, Python, Cloudfoundry
    •  Database/DataStore: Cassandra, HBase, HDFS, Greenplum, Voldemort
    •  Data Platform / Data PaaS: Data-Director, EMC Chorus
    •  Cloud Infrastructure: vSphere (Private, Public)
  • 39.
    Summary
    Revolution in Big Data is under way
    •  Data centric applications are now critical
    Hadoop on Virtualization
    •  Proven performance
    •  Cloud/Virtualization values apparent for Hadoop use
    Simplify through a Unified Analytics Cloud
    •  One Platform for today’s and future big-data systems
    •  Better Utilization
    •  Faster deployment, elastic resources
    •  Secure, Isolated, Multi-tenant capability for Analytics
  • 40.
    References
    Twitter
    •  @richardmcdougll
    My CTO Blog
    •  http://communities.vmware.com/community/vmtn/cto/cloud
    Hadoop on vSphere
    •  Talk @ Hadoop World
    •  Performance Paper – http://www.vmware.com/files/.../VMW-Hadoop-Performance-vSphere5.pdf
    Spring Hadoop
    •  http://blog.springsource.org/2012/02/29/introducing-spring-hadoop