High Level Parallel Processing Models for
             Data Analysis
                 Mingliang Sun
Motivation

●   Ever-increasing amount of data

●   High cost of traditional approaches

●   Limitation of the bare MapReduce
    approach
Example
A. Pavlo et al., “A Comparison of Approaches to Large-Scale Data Analysis,” Proceedings of the 35th SIGMOD International Conference on Management of Data, New York, NY, USA, 2009.


●   Pros of Parallel DW:
    ○   superior runtime performance
●   Cons of Parallel DW:
    ○   time-consuming up-front setup
    ○   sophisticated configuration and tuning
New Model – Pig Latin
●   Comes from Yahoo
●   Pig Latin, a high-level data analysis scripting
    language
●   Features of Pig, and motivation for them
●   Language features, data model, and the motivation behind them
●   Implementation of Pig
●   A novel debugging approach brought by the system
●   A few real usage scenarios
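To make the "sequence of steps" programming style concrete, here is a minimal, illustrative Pig Latin sketch; the input file, schema, and filter condition are hypothetical, not taken from the paper:

    -- Load tab-separated visit records (hypothetical file and schema)
    visits  = LOAD 'visits.log' USING PigStorage('\t')
              AS (user:chararray, url:chararray, time:long);
    -- Keep only visits to .edu pages
    edu     = FILTER visits BY url MATCHES '.*\\.edu.*';
    -- Group per user and count the remaining visits
    grouped = GROUP edu BY user;
    counts  = FOREACH grouped GENERATE group AS user, COUNT(edu) AS n;
    -- Write the result back out (to HDFS by default)
    STORE counts INTO 'edu_visit_counts';

Each named step is a single relational-algebra-style transformation, which is the style the Pig paper contrasts with a single declarative SQL block.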
New Model - SCOPE
●   Developed by Microsoft
●   SCOPE, a declarative and extensible scripting
    language
●   Underlying parallel data processing and storage
    system
●   Language features and data model
●   System design and architecture
●   TPC-H benchmark
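A rough SCOPE sketch modeled on the paper's query-count example; the extractor name, file paths, and threshold are illustrative, and exact syntax details may differ slightly from the paper:

    // Extract query strings from a log file using a (user-supplied) C# extractor
    e  = EXTRACT query
         FROM "search.log"
         USING LogExtractor;
    // Count occurrences per query, keep the frequent ones, and write the result
    s1 = SELECT query, COUNT(*) AS count FROM e GROUP BY query;
    s2 = SELECT query, count FROM s1 WHERE count > 1000;
    OUTPUT s2 TO "qcount.result";

The SQL resemblance is intentional; non-relational steps, such as the extractor here, are expressed in C#.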
New Model - Hive
●   Comes from Facebook
●   HiveQL, a high-level data analysis scripting language
●   Language features, data model, and type system
●   Data storage in HDFS (Hadoop Distributed File System)
●   System architecture and components
●   Usage statistics at Facebook
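An illustrative HiveQL sketch (table, columns, and partition are hypothetical) showing the SQL-like style and one of Hive's complex types:

    -- Hypothetical partitioned table; the MAP column shows a complex type
    CREATE TABLE page_views (
      user_id    BIGINT,
      url        STRING,
      properties MAP<STRING, STRING>
    )
    PARTITIONED BY (dt STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE;

    -- Page views per user within a single day's partition
    SELECT user_id, COUNT(1) AS views
    FROM page_views
    WHERE dt = '2009-06-01'
    GROUP BY user_id;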
Comparison
●   Programming Style
    ○   RDB/DW: SQL/MDX, a single block of declarative constraints that collectively define the result
    ○   Pig Latin: "a sequence of steps where each step specifies only a single, high-level relational-algebra style data transformation"
    ○   SCOPE: "a sequence of data processing commands"; "has a strong resemblance to SQL -- an intentional design choice"
    ○   Hive: "HiveQL comprises of a subset of SQL and some extensions"; "working towards making HiveQL subsume SQL syntax"
●   Extensibility
    ○   RDB/DW: vendor/product-specific UDFs (User Defined Functions)
    ○   Pig Latin: currently supports Java UDFs, with future support for arbitrary languages (see the UDF sketch after this table)
    ○   SCOPE: supports C#
    ○   Hive: supports UDFs in arbitrary programming languages; data types can also be customized
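As a sketch of the UDF-based extensibility above, a minimal Java eval function for Pig might look like the following; the class name and behavior are illustrative, modeled on the standard EvalFunc pattern:

    import java.io.IOException;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;

    // Illustrative Pig UDF: upper-cases its single string argument.
    public class Upper extends EvalFunc<String> {
        @Override
        public String exec(Tuple input) throws IOException {
            if (input == null || input.size() == 0 || input.get(0) == null) {
                return null;
            }
            return ((String) input.get(0)).toUpperCase();
        }
    }

Such a function would be registered from a Pig Latin script (e.g. REGISTER myudfs.jar;) and then called like a built-in inside a FOREACH ... GENERATE statement.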
Comparison (cont'd)
●   Nested Data Model
    ○   RDB/DW: no, unless one is willing to violate 1NF
    ○   Pig Latin: yes, supports complex data types (tuple, bag, and map); see the sketch after this table
    ○   SCOPE: not directly mentioned or demonstrated in the paper
    ○   Hive: yes, supports complex data types (map, list, and struct)
●   Data Ownership
    ○   RDB/DW: yes
    ○   Pig Latin: no
    ○   SCOPE: no
    ○   Hive: yes or no
●   Data Storage
    ○   RDB/DW: internal data structures
    ○   Pig Latin: HDFS (Hadoop Distributed File System) files
    ○   SCOPE: Cosmos files
    ○   Hive: HDFS files
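To illustrate the nested-data row above, a hypothetical Pig Latin schema in which one field is a bag of tuples and another is a map (names and files are made up):

    -- 'pages' is a bag of (url, time) tuples; 'attrs' is an untyped map
    sessions = LOAD 'sessions.dat'
               AS (user:chararray,
                   pages:bag{t:tuple(url:chararray, time:long)},
                   attrs:map[]);
    -- FLATTEN un-nests the bag, producing one row per (user, url, time)
    flat = FOREACH sessions GENERATE user, FLATTEN(pages);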
Comparison (cont'd)
●   Data Schema
    ○   RDB/DW: predefined and stored in the system
    ○   Pig Latin: defined on the fly
    ○   SCOPE: defined on the fly
    ○   Hive: defined on the fly and/or stored in the system (metadata)
●   Interoperability
    ○   RDB/DW: poor (must operate on system-owned, internal data)
    ○   Pig Latin: good (operates on external data)
    ○   SCOPE: good (operates on external data)
    ○   Hive: good (operates on both internal and external data)
●   Optimization
    ○   RDB/DW: SQL execution plan
    ○   Pig Latin: basic optimization; not directly discussed in the paper
    ○   SCOPE: compile-time optimization (better execution plans) and run-time optimization (reduced traffic/workload via rack-awareness, partial aggregation, and grouping heuristics)
    ○   Hive: "currently has a naive rule-based optimizer with a small number of simple rules"; "plan to build a cost-based optimizer and adaptive optimization"
Conclusions
●   The ideas behind these 3 papers are very
    similar
    ○   Addressing the same problem: limitation of the bare
        MapReduce model
    ○   Similar approach: high-level data processing scripts
        compiled into optimized, low-level parallel processing tasks
        supported by the underlying parallel processing system
●   Yet there are interesting differences
    ○   data schema, data ownership, and extensibility
    ○   Underlying system
