Cloud Computing and Management
                            Mostafa Ead




October 17, 2011       CS854: Cloud Computing and Management   1
Objective

   Harness the enormous power of current clusters of
   commodity machines and react quickly to the data tsunami

October 17, 2011   CS854: Cloud Computing and Management   2
Outline
  Parallel Processing
  Google MapReduce
  DryadLINQ
     Dryad
  General Comments
  Takeaways
  Discussion




October 17, 2011   CS854: Cloud Computing and Management   3
Parallel Processing
  Why Parallel Processing?
    Execution time reduction
    Cheap clusters of commodity hardware
    Data-Driven world
    Exploit multi-cores in your workstation or many
     machines in your cluster.




October 17, 2011    CS854: Cloud Computing and Management   4
Parallel Processing
  Tasks of a parallel-program developer:
    1. Identifying concurrent portions
    2. Mapping the concurrent portions to multiple
       processes running in parallel
            Static mapping hinders scalability
    3. Distribute input, intermediate, or output data, or any
       combination of them
    4. Manage accesses to shared data
    5. Handle failures
            In commodity clusters, failure is the norm rather than the
             exception.

October 17, 2011          CS854: Cloud Computing and Management              5
Parallel Processing
  A nightmare for the parallel-program developer
  Which of these tasks can be automated?




October 17, 2011    CS854: Cloud Computing and Management   6
Parallel Processing
  Tasks of a parallel-program developer:
    1. Identifying concurrent portions → No
    2. Mapping the concurrent portions to multiple
       processes running in parallel
    3. Distribute input, intermediate, or output data, or any
       combination of them
    4. Manage accesses to shared data
    5. Handle failures




October 17, 2011    CS854: Cloud Computing and Management    7
Parallel Processing
  Tasks of a parallel-program developer:
    1. Identifying concurrent portions → No
    2. Mapping the concurrent portions to multiple
       processes running in parallel → Yes, iff dynamic
       mapping
    3. Distribute input, intermediate, or output data, or any
       combination of them
    4. Manage accesses to shared data
    5. Handle failures



October 17, 2011    CS854: Cloud Computing and Management    8
Parallel Processing
  Tasks of a parallel-program developer:
    1. Identifying concurrent portions → No
    2. Mapping the concurrent portions to multiple
       processes running in parallel → Yes, iff dynamic
       mapping
    3. Distribute input, intermediate, or output data, or any
       combination of them → Yes: GFS and HDFS
    4. Manage accesses to shared data
    5. Handle failures



October 17, 2011    CS854: Cloud Computing and Management    9
Parallel Processing
  Tasks of a parallel-program developer:
    1. Identifying concurrent portions → No
    2. Mapping the concurrent portions to multiple
       processes running in parallel → Yes, iff dynamic
       mapping
    3. Distribute input, intermediate, or output data, or any
       combination of them → Yes: GFS and HDFS
    4. Manage accesses to shared data → Yes, iff read-only
       access
    5. Handle failures

October 17, 2011    CS854: Cloud Computing and Management    10
Parallel Processing
  Tasks of a parallel-program developer:
    1. Identifying concurrent portions → No
    2. Mapping the concurrent portions to multiple
       processes running in parallel → Yes, iff dynamic
       mapping
    3. Distribute input, intermediate, or output data, or any
       combination of them → Yes: GFS and HDFS
    4. Manage accesses to shared data → Yes, iff read-only
       access
    5. Handle failures → Yes, with restrictions

October 17, 2011    CS854: Cloud Computing and Management    11
MapReduce: Simplified Data
               Processing on Large Clusters
                     Jeffrey Dean and Sanjay Ghemawat
                                Google, Inc.
                                 OSDI-2004
                   Citation Count: 3487 (Google Scholar)




October 17, 2011       CS854: Cloud Computing and Management   12
MapReduce: Programming model
  The developer specifies one map function and one reduce
   function.
  Map:
        Input: one key-value pair
        Output: set of intermediate key-value pairs
  Reduce:
     Input: intermediate key-value pairs grouped by the key
     Output: one or multiple output key-value pairs



October 17, 2011    CS854: Cloud Computing and Management      13
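A minimal word-count sketch of this model, written against Hadoop's Java MapReduce API (a hedged illustration: the class names WordCountMap/WordCountReduce and the tokenization details are not taken from the deck):

  import java.io.IOException;
  import java.util.StringTokenizer;

  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;

  // Map: (offset, line) -> intermediate <word, 1> pairs.
  public class WordCountMap extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(line.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, ONE);        // emit <word, 1>
      }
    }
  }

  // Reduce: (word, [1, 1, ...]) -> (word, total count).
  class WordCountReduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable c : counts) {
        sum += c.get();
      }
      context.write(word, new IntWritable(sum));
    }
  }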
MapReduce: Wordcount Example
           map1:  Key: offset in input file
                  Value: “Mostafa is presenting MapReduce”
                  Output: Mostafa: 1, is: 1, presenting: 1, MapReduce: 1

           map2:  Key: offset in input file
                  Value: “Mostafa is presenting DryadLINQ”
                  Output: Mostafa: 1, is: 1, presenting: 1, DryadLINQ: 1

           (map1 and map2 execute in parallel)

October 17, 2011          CS854: Cloud Computing and Management                   14
MapReduce: Wordcount Example
             map1 output:  Mostafa: 1, is: 1, presenting: 1, MapReduce: 1
             map2 output:  Mostafa: 1, is: 1, presenting: 1, DryadLINQ: 1

                           -- Shuffle & Sort -->

             DryadLINQ: 1
             is: 1, is: 1
             MapReduce: 1
             Mostafa: 1, Mostafa: 1
             presenting: 1, presenting: 1

October 17, 2011            CS854: Cloud Computing and Management                   15
MapReduce: Wordcount Example
          Input (grouped by key):              Output (5 reduce calls):
          DryadLINQ: 1                         DryadLINQ: 1
          is: 1, is: 1                         is: 2
          MapReduce: 1                         MapReduce: 1
          Mostafa: 1, Mostafa: 1               Mostafa: 2
          presenting: 1, presenting: 1         presenting: 2

October 17, 2011         CS854: Cloud Computing and Management                   16
Execution Overview




October 17, 2011     CS854: Cloud Computing and Management   17
MapReduce: Fault Tolerance
  States of worker tasks: idle, in-progress or completed
  Worker Failure:
    Master pings workers periodically.
     In-progress tasks are reset to idle
     Completed map tasks are also reset to idle. Why?
     All in-progress (not yet completed) reduce tasks are
      notified of the re-execution of the map tasks




October 17, 2011   CS854: Cloud Computing and Management    18
MapReduce: Fault Tolerance
  Master Failure:
     Current (in 2004) implementation aborts the whole MR
       job
  It is not a free lunch:
     The developer should write deterministic map/reduce
       functions.




October 17, 2011   CS854: Cloud Computing and Management     19
MapReduce: Backup Tasks
  Problem:
     Straggler Tasks
              Bad disk performance (30 MB/s vs 1 MB/s)
              Contention on the machine’s resources by other competing
               tasks.
        Total elapsed time of the job will be increased.
  Solution:
      The master schedules backup executions of some of the
       in-progress tasks.
      The first copy of a task to finish is used; the others are
       ignored.

October 17, 2011         CS854: Cloud Computing and Management            20
MapReduce: Refinements
 1.        Combiner Function:
               The reduce function should be commutative & associative
                   <“the”, 1>, <“the”, 1>, <“the”, 1>, <“the”, 1> → <“the”, 4>
                   <“the”, 2>, <“the”, 2> → <“the”, 4>
               Decreases the intermediate data size sent over the
                network from mappers to reducers.
               The same reduce-function code can be executed as
                the combiner on the map side.




October 17, 2011            CS854: Cloud Computing and Management                21
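In Hadoop MR this refinement is a one-line job setting. A hedged fragment (it assumes the word-count classes sketched earlier and Hadoop's Job API); reusing the reducer as the combiner is valid precisely because summation is commutative and associative:

  // Hedged fragment: assumes an org.apache.hadoop.mapreduce.Job named job and the
  // WordCountMap/WordCountReduce classes sketched earlier.
  static void configureWordCount(org.apache.hadoop.mapreduce.Job job) {
    job.setMapperClass(WordCountMap.class);
    job.setCombinerClass(WordCountReduce.class);  // partial sums on the map side, before the shuffle
    job.setReducerClass(WordCountReduce.class);   // final sums on the reduce side
  }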
MapReduce: Refinements
 2. Counters:
    Count number of uppercase words, or count number
      of German documents
     Useful also for sanity checks:
                   Sort: the number of input key-value pairs should equal
                    the number of output key-value pairs.
     Local copy of the counter at each mapper/reducer
     Periodic propagation to the master for aggregation




October 17, 2011           CS854: Cloud Computing and Management              22
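Hadoop MR exposes the same facility through its counter API. A hedged fragment meant to sit inside a Mapper.map method; the group and counter names are illustrative, not from the deck:

  // Inside WordCountMap.map(): count all-uppercase words while emitting pairs.
  // "WordCount" / "UPPERCASE_WORDS" are made-up names; the framework aggregates
  // per-task counters at the master, as described on the slide above.
  String token = tokens.nextToken();
  if (!token.isEmpty() && token.equals(token.toUpperCase())) {
    context.getCounter("WordCount", "UPPERCASE_WORDS").increment(1);
  }
  word.set(token);
  context.write(word, ONE);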
MapReduce: Performance
  Cluster Configuration:
     1800 machines
     Machine: 2x 2 GHz Xeon processors, 4 GB RAM, 2x 160 GB
      local storage, and 1 Gbps Ethernet
     Network Topology: two-level tree-shaped switched
      network




October 17, 2011   CS854: Cloud Computing and Management    23
MapReduce: Performance
  Grep:
      Input: 10^10 100-byte records (~1 TB)
     3-char pattern occurs in 92,337 records.
     M = 15000 and R = 1




October 17, 2011   CS854: Cloud Computing and Management   24
MapReduce: Performance
  Grep:
      Input: 10^10 100-byte records (~1 TB)
      3-char pattern occurs in 92,337 records.
      M = 15000 and R = 1
      Low initial input rate: overhead of propagating the program
       to all workers and of data-locality optimizations




October 17, 2011   CS854: Cloud Computing and Management                        25
MapReduce: Performance
  Grep:
      Input: 10^10 100-byte records (~1 TB)
      3-char pattern occurs in 92,337 records.
      M = 15000 and R = 1
      Input rate peaks at over 30 GB/s with 1764 workers




October 17, 2011   CS854: Cloud Computing and Management       26
MapReduce: Performance
  Sort:
      Input: 10^10 100-byte records (~1 TB)
      Map function extracts the sortKey from each record and
       emits sortKey-record pairs
      Identity Reduce function
      The actual sort occurs at each reducer and is handled by
       the library.
      M = 15000 and R = 4000




October 17, 2011   CS854: Cloud Computing and Management      27
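A hedged sketch of the map side of this benchmark (the 10-byte-key record layout and the I/O types are assumptions in the spirit of TeraSort, not code from the paper; the reduce function is the identity and the framework performs the actual sort):

  import java.io.IOException;
  import org.apache.hadoop.io.BytesWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.mapreduce.Mapper;

  // Sketch: extract the sortKey from each 100-byte record and emit <sortKey, record>.
  public class SortKeyMap extends Mapper<LongWritable, BytesWritable, BytesWritable, BytesWritable> {
    @Override
    protected void map(LongWritable offset, BytesWritable record, Context context)
        throws IOException, InterruptedException {
      byte[] key = new byte[10];                         // assume the first 10 bytes are the sort key
      System.arraycopy(record.getBytes(), 0, key, 0, 10);
      context.write(new BytesWritable(key), record);     // sorting of keys happens in the framework
    }
  }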
MapReduce: Performance
  Sort:
     Peak input rate is 13 GB/s, lower than Grep




October 17, 2011         CS854: Cloud Computing and Management   28
MapReduce: Performance
  Sort:
     Peak input rate is 13 GB/s, lower than Grep
     Shuffle starts as soon as the first map task finishes.




October 17, 2011         CS854: Cloud Computing and Management   29
MapReduce: Performance
  Sort:
     Peak input rate is 13 GB/s, lower than Grep
     Shuffle starts as soon as the first map task finishes.
     Two humps in the shuffle rate: all workers are assigned
      reduce tasks




October 17, 2011         CS854: Cloud Computing and Management   30
MapReduce: Performance
  Sort:
     Peak input rate is 13 GB/s, lower than Grep
     Shuffle starts as soon as the first map task finishes.
     Two humps in the shuffle rate: all workers are assigned
      reduce tasks
     Note that input rate > shuffle rate > output rate


October 17, 2011         CS854: Cloud Computing and Management   31
MapReduce: Performance
  Sort:
     No Backup tasks




     After 960 s, 5 straggler reduce tasks remain; they finish
      300 s later, a 44% increase in elapsed time

October 17, 2011   CS854: Cloud Computing and Management                         32
MapReduce: Performance
  Sort:
     200 tasks killed
     5% increase in elapsed time




October 17, 2011   CS854: Cloud Computing and Management   33
Hadoop MapReduce
  Hadoop [3] MapReduce is the open-source
   implementation of Google’s MapReduce
  Terminology mappings:

                       Google MR                  Hadoop MR
                       Scheduling System          JobTracker
                       Worker                     TaskTracker
                       GFS                        HDFS




October 17, 2011          CS854: Cloud Computing and Management          34
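A minimal driver that ties the earlier word-count sketches together on Hadoop (a hedged sketch: the paths are placeholders that live in HDFS, and the JobTracker schedules the resulting map/reduce tasks on TaskTrackers):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  // Hedged driver sketch reusing the earlier WordCountMap/WordCountReduce classes.
  public class WordCountDriver {
    public static void main(String[] args) throws Exception {
      Job job = Job.getInstance(new Configuration(), "wordcount");
      job.setJarByClass(WordCountDriver.class);
      job.setMapperClass(WordCountMap.class);
      job.setCombinerClass(WordCountReduce.class);
      job.setReducerClass(WordCountReduce.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);
      FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
      FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }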
DryadLINQ: A System for General Purpose
           Distributed Data-Parallel Computing Using a
                       High-Level Language
   Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Ulfar
        Erlingsson, Pradeep Kumar Gunda and Job Currey
                                 OSDI-2008




October 17, 2011   CS854: Cloud Computing and Management         35
DryadLINQ




                   Dryad                                           LINQ




October 17, 2011           CS854: Cloud Computing and Management          36
DryadLINQ




                   Dryad [4]                                       LINQ



   A general-purpose distributed execution
   engine for coarse-grain data-parallel
   applications




October 17, 2011           CS854: Cloud Computing and Management          37
DryadLINQ




                   Dryad                                           LINQ [5]



                                                 LINQ: Language INtegrated Query
                                                 A set of extensions to the .NET Framework
                                                 that add language-integrated query and
                                                 set operations.




October 17, 2011           CS854: Cloud Computing and Management                        38
Dryad: Distributed Data-Parallel Programs from
                Sequential Building Blocks
      Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell and
                           Dennis Fetterly
                               EuroSys-2007




October 17, 2011   CS854: Cloud Computing and Management        39
Dryad: System Overview
  One Dryad job is a Directed Acyclic Graph (DAG)
  Each vertex is a sequential program, or a program that
   exploits the multiple cores on a chip.
  Each edge is a data communication channel




October 17, 2011   CS854: Cloud Computing and Management   40
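Purely as a toy illustration of the slide above (this is not Dryad's C++ API; all names are made up), a job graph of vertices and channels might be modeled like this:

  import java.util.ArrayList;
  import java.util.List;

  // Toy model: a job is a DAG whose vertices run sequential programs and whose
  // edges are data channels (file, TCP pipe, or shared-memory FIFO).
  class Vertex {
    final String program;                        // the sequential program this vertex runs
    Vertex(String program) { this.program = program; }
  }

  enum ChannelType { FILE, TCP_PIPE, SHARED_MEMORY }

  class Edge {
    final Vertex from, to;
    final ChannelType channel;
    Edge(Vertex from, Vertex to, ChannelType channel) {
      this.from = from; this.to = to; this.channel = channel;
    }
  }

  class JobGraph {
    final List<Vertex> vertices = new ArrayList<>();
    final List<Edge> edges = new ArrayList<>();  // must remain acyclic
  }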
Dryad: System Overview
  Tasks of the Dryad Job Manager:
     Instantiation of the DAG
     Vertex assignment
     Fault tolerance
     Job execution progress




October 17, 2011      CS854: Cloud Computing and Management   41
Dryad: DAG Description Language




October 17, 2011   CS854: Cloud Computing and Management   42
Dryad: Writing a Program
  Use the DAG description language to describe the
   concurrency between the tasks of the job.
  Identify channel types between communicating
   vertices:
        Using shared memory requires awareness of each
         node’s resources.




October 17, 2011        CS854: Cloud Computing and Management   43
DryadLINQ: Objective
  DryadLINQ compiles LINQ programs into distributed
     computations that run on Dryad
        Instead of using the DAG description language
        It automatically specifies the channel types in the DAG
  Targets a wide variety of developers
     Declarative and imperative programming paradigms
  The illusion of writing programs that will be executed
     sequentially.
        Independence property of the data


October 17, 2011      CS854: Cloud Computing and Management   44
DryadLINQ: System Overview




October 17, 2011   CS854: Cloud Computing and Management   45
DryadLINQ: Programming
  DryadTable <T>
    Supports underlying DFS, collections of NTFS files and
     sets of database tables
    Schema for data items
    Partitioning schemes
  HashPartition<T, K> and RangePartition<T, K>




October 17, 2011   CS854: Cloud Computing and Management      46
DryadLINQ: Programming
  If a computation cannot be expressed using any of the
     LINQ operators:
        Apply: windowed computations
        Fork:
              Sharing scans, or eliminating common sub-expressions




October 17, 2011        CS854: Cloud Computing and Management         47
DryadLINQ: example




October 17, 2011     CS854: Cloud Computing and Management   48
DryadLINQ: example




October 17, 2011     CS854: Cloud Computing and Management   49
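The two example slides above are query listings shown as images in the original deck. As a rough analogy only (Java streams rather than C#/LINQ, and not the deck's actual example), the declarative style that DryadLINQ compiles into a Dryad DAG looks like this:

  import java.util.Arrays;
  import java.util.List;
  import java.util.Map;
  import java.util.stream.Collectors;

  // Analogy only: a LINQ-style declarative query expressed with Java streams.
  public class QueryAnalogy {
    public static Map<String, Long> wordHistogram(List<String> lines) {
      return lines.stream()
          .flatMap(line -> Arrays.stream(line.split("\\s+")))              // tokenize
          .collect(Collectors.groupingBy(w -> w, Collectors.counting()));  // group and count
    }
  }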
DryadLINQ: EPG
  Every LINQ operator is represented by one vertex
  Each vertex is replicated at runtime to represent
   one Dryad stage
  Vertex and Edge annotations




October 17, 2011   CS854: Cloud Computing and Management   50
DryadLINQ: Optimizations
  Static optimizations
     Pipelining
     Removing Redundancy
     I/O Reduction
  Dynamic optimizations:
     Modifications to the DAG at runtime.




October 17, 2011   CS854: Cloud Computing and Management   51
DryadLINQ: Dynamic Optimizations
  Dynamic Aggregation (Combiners):
     Node, Rack then Cluster levels
     Aggregation topology is computed at runtime




  The number of replicas of a vertex depends on the
     number of independent partitions of the input data
        The job skeleton remains the same

October 17, 2011     CS854: Cloud Computing and Management   52
DryadLINQ: OrderBy optimization


     (Figure: OrderBy execution plan as generated statically)




October 17, 2011   CS854: Cloud Computing and Management   53
DryadLINQ: OrderBy optimization


     (Figure: the static OrderBy plan vs. the plan rewritten at runtime)




October 17, 2011    CS854: Cloud Computing and Management   54
DryadLINQ: OrderBy optimization


     (Figure: static plan vs. runtime plan; dynamic repartitioning
      chooses partition sizes suitable for an in-memory sort)




October 17, 2011    CS854: Cloud Computing and Management               55
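The three OrderBy slides above show figures. The strategy they depict (sample the data, pick range boundaries at runtime, range-partition, then sort each partition in memory) can be sketched as a toy single-machine illustration; this is not DryadLINQ code, and the sampling policy is an assumption:

  import java.util.ArrayList;
  import java.util.Collections;
  import java.util.List;

  // Toy sketch of distributed OrderBy: sample -> choose boundaries -> range-partition -> local sort.
  // Assumes a non-empty input list.
  public class OrderBySketch {
    static List<List<Integer>> orderBy(List<Integer> data, int partitions) {
      // 1. Sample the data and derive partition boundaries "at runtime".
      List<Integer> sample = new ArrayList<>(data.subList(0, Math.min(1000, data.size())));
      Collections.sort(sample);
      List<Integer> boundaries = new ArrayList<>();
      for (int i = 1; i < partitions; i++) {
        boundaries.add(sample.get(i * sample.size() / partitions));
      }
      // 2. Range-partition so that each partition fits in memory.
      List<List<Integer>> parts = new ArrayList<>();
      for (int i = 0; i < partitions; i++) parts.add(new ArrayList<>());
      for (int x : data) {
        int p = 0;
        while (p < boundaries.size() && x >= boundaries.get(p)) p++;
        parts.get(p).add(x);
      }
      // 3. Sort each partition independently; concatenating them yields a global order.
      for (List<Integer> part : parts) Collections.sort(part);
      return parts;
    }
  }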
DryadLINQ: Performance
  Cluster Configurations:
     240 machines
     Machine: 2x dual-core AMD Opteron 2.6 GHz, 16 GB
      RAM, 4x 750 GB SATA
     Network Topology: two-level tree-shaped switched
      network.




October 17, 2011   CS854: Cloud Computing and Management   56
DryadLINQ: TeraSort
  Data is partitioned on a
   key other than the
   sortKey.
  Each machine stores
   3.87 GB of data
  At n = 240, about one
   terabyte of data is sorted




October 17, 2011     CS854: Cloud Computing and Management   57
DryadLINQ: TeraSort
  Data size grows with the
   number of nodes, so the
   elapsed time should stay
   roughly constant.
  At n = 1, no sampling or
   re-partitioning is performed
   and there is no network
   communication
  2 ≤ n ≤ 20, machines are
   connected to the same
   switch.

October 17, 2011     CS854: Cloud Computing and Management   58
DryadLINQ: SkyServer
  Compares locations and colors of stars in a large
     astronomical table
    Join two tables: 11.8 GB and 41.8 GB.
    Input tables are manually range-partitioned into 40
     partitions using the join key.
    Number of machines n is varied between 1 and 40.
    Output of joining two partitions is stored locally.




October 17, 2011      CS854: Cloud Computing and Management   59
DryadLINQ: SkyServer
  DryadLINQ is 1.3 times
   slower than Dryad
  DryadLINQ is written in
   a higher-level language
  Overhead of
   communication between
   the .NET DryadLINQ layer
   and the Dryad layer



October 17, 2011      CS854: Cloud Computing and Management   60
General Comments
  Stragglers and interaction with databases
     mapred.map.tasks.speculative.execution property in
      Hadoop MR
  Fault tolerance and the blocking property
  Missing scalability evaluation of Google MR.




October 17, 2011    CS854: Cloud Computing and Management   61
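For the stragglers-and-databases point above, the deck's own knob can be set per job. A hedged fragment (it assumes a Hadoop Job object named job; the property name is the one quoted on the slide):

  // Disable map-side speculative execution so a slow map task is not re-executed --
  // useful when map tasks have external side effects (e.g., writes to a database)
  // that must not be duplicated.
  job.getConfiguration().setBoolean("mapred.map.tasks.speculative.execution", false);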
Takeaways
  Parallel processing becomes an easier task
  Write deterministic functions
  Independence property of data




October 17, 2011   CS854: Cloud Computing and Management   62
Big-Data Analytics Software Stack

             Sawzall          Pig, Hive, Cascading,        DryadLINQ
                              Abacus, Jaql

             MapReduce        Hadoop MR                    Dryad

             GFS              HDFS                         COSMOS




October 17, 2011      CS854: Cloud Computing and Management               63
Discussion
October 17, 2011    CS854: Cloud Computing and Management   64
References
 [1] Principles of Parallel Algorithm Design
 [2] DEAN, J., AND GHEMAWAT, S. MapReduce: Simplified data processing on
     large clusters. In Proceedings of the 6th Symposium on Operating Systems
     Design and Implementation (OSDI), 2004
 [3] Hadoop MapReduce Project
      http://hadoop.apache.org/mapreduce/
 [4] ISARD, M., BUDIU, M., YU, Y., BIRRELL, A., AND FETTERLY, D. Dryad:
     Distributed data-parallel programs from sequential building blocks. In
     Proceedings of European Conference on Computer Systems (EuroSys), 2007.
 [5] The LINQ project.
      http://msdn.microsoft.com/netframework/future/linq/.




October 17, 2011     CS854: Cloud Computing and Management                      65
