High Level Parallel Processing Models for
             Data Analysis
                 Mingliang Sun
Motivation

●   Ever-increasing amount of data

●   High cost of traditional approaches

●   Limitation of the bare MapReduce
    approach
Example
A. Pavlo et al., “A Comparison of Approaches to Large-Scale Data Analysis,” Proceedings of the 35th SIGMOD International Conference on Management of Data, New York, NY, USA, 2009.


●   Pros of Parallel DW:
    ○   superior runtime performance
●   Cons of Parallel DW:
    ○   time-consuming up-front setup
    ○   sophisticated configuration and tuning
New Model – Pig Latin
●   Comes from Yahoo
●   Pig Latin, a high-level data analysis scripting
    language
●   Features of Pig, and motivation for them
●   Language features, data model, and the motivation behind them
●   Implementation of Pig
●   A novel debugging approach brought by the system
●   A few real usage scenarios
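To make the "sequence of steps" programming style concrete, here is a minimal, illustrative Pig Latin sketch; the input file, schema, and filter condition are hypothetical, not taken from the paper:

    -- Load tab-separated visit records (hypothetical file and schema)
    visits  = LOAD 'visits.log' USING PigStorage('\t')
              AS (user:chararray, url:chararray, time:long);
    -- Keep only visits to .edu pages
    edu     = FILTER visits BY url MATCHES '.*\\.edu.*';
    -- Group per user and count the remaining visits
    grouped = GROUP edu BY user;
    counts  = FOREACH grouped GENERATE group AS user, COUNT(edu) AS n;
    -- Write the result back out (to HDFS by default)
    STORE counts INTO 'edu_visit_counts';

Each named step is a single relational-algebra-style transformation, which is the style the Pig paper contrasts with a single declarative SQL block.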
New Model - SCOPE
●   Developed by Microsoft
●   SCOPE, a declarative and extensible scripting
    language
●   Underlying parallel data processing and storage
    system
●   Language features and data model
●   System design and architecture
●   TPC-H benchmark
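A rough SCOPE sketch modeled on the paper's query-count example; the extractor name, file paths, and threshold are illustrative, and exact syntax details may differ slightly from the paper:

    // Extract query strings from a log file using a (user-supplied) C# extractor
    e  = EXTRACT query
         FROM "search.log"
         USING LogExtractor;
    // Count occurrences per query, keep the frequent ones, and write the result
    s1 = SELECT query, COUNT(*) AS count FROM e GROUP BY query;
    s2 = SELECT query, count FROM s1 WHERE count > 1000;
    OUTPUT s2 TO "qcount.result";

The SQL resemblance is intentional; non-relational steps, such as the extractor here, are expressed in C#.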
New Model - Hive
●   Comes from Facebook
●   HiveQL, a high-level data analysis scripting language
●   Language features, data model, and type system
●   Data storage in HDFS (Hadoop Distributed File System)
●   System architecture and components
●   Usage statistics at Facebook
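An illustrative HiveQL sketch (table, columns, and partition are hypothetical) showing the SQL-like style and one of Hive's complex types:

    -- Hypothetical partitioned table; the MAP column shows a complex type
    CREATE TABLE page_views (
      user_id    BIGINT,
      url        STRING,
      properties MAP<STRING, STRING>
    )
    PARTITIONED BY (dt STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE;

    -- Page views per user within a single day's partition
    SELECT user_id, COUNT(1) AS views
    FROM page_views
    WHERE dt = '2009-06-01'
    GROUP BY user_id;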
Comparison
●   Programming Style
    ○   RDB/DW: SQL/MDX, a single block of declarative constraints that collectively define the result
    ○   Pig Latin: "a sequence of steps where each step specifies only a single, high-level relational-algebra style data transformation"
    ○   SCOPE: "a sequence of data processing commands"; "has a strong resemblance to SQL -- an intentional design choice"
    ○   Hive: "HiveQL comprises of a subset of SQL and some extensions"; "working towards making HiveQL subsume SQL syntax"
●   Extensibility
    ○   RDB/DW: vendor/product-specific UDFs (User Defined Functions)
    ○   Pig Latin: currently supports Java UDFs, with future support for arbitrary languages (see the UDF sketch after this table)
    ○   SCOPE: supports C#
    ○   Hive: supports UDFs in arbitrary programming languages; data types can also be customized
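As a sketch of the UDF-based extensibility above, a minimal Java eval function for Pig might look like the following; the class name and behavior are illustrative, modeled on the standard EvalFunc pattern:

    import java.io.IOException;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;

    // Illustrative Pig UDF: upper-cases its single string argument.
    public class Upper extends EvalFunc<String> {
        @Override
        public String exec(Tuple input) throws IOException {
            if (input == null || input.size() == 0 || input.get(0) == null) {
                return null;
            }
            return ((String) input.get(0)).toUpperCase();
        }
    }

Such a function would be registered from a Pig Latin script (e.g. REGISTER myudfs.jar;) and then called like a built-in inside a FOREACH ... GENERATE statement.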
Comparison (cont'd)
●   Nested Data Model
    ○   RDB/DW: no, unless one is willing to violate 1NF
    ○   Pig Latin: yes, supports complex data types (tuple, bag, and map); see the sketch after this table
    ○   SCOPE: not directly mentioned or demonstrated in the paper
    ○   Hive: yes, supports complex data types (map, list, and struct)
●   Data Ownership
    ○   RDB/DW: yes
    ○   Pig Latin: no
    ○   SCOPE: no
    ○   Hive: yes or no
●   Data Storage
    ○   RDB/DW: internal data structures
    ○   Pig Latin: HDFS (Hadoop Distributed File System) files
    ○   SCOPE: Cosmos files
    ○   Hive: HDFS files
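To illustrate the nested-data row above, a hypothetical Pig Latin schema in which one field is a bag of tuples and another is a map (names and files are made up):

    -- 'pages' is a bag of (url, time) tuples; 'attrs' is an untyped map
    sessions = LOAD 'sessions.dat'
               AS (user:chararray,
                   pages:bag{t:tuple(url:chararray, time:long)},
                   attrs:map[]);
    -- FLATTEN un-nests the bag, producing one row per (user, url, time)
    flat = FOREACH sessions GENERATE user, FLATTEN(pages);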
Comparison (cont'd)
●   Data Schema
    ○   RDB/DW: predefined and stored in the system
    ○   Pig Latin: defined on the fly
    ○   SCOPE: defined on the fly
    ○   Hive: defined on the fly and/or stored in the system (metadata)
●   Interoperability
    ○   RDB/DW: poor (must operate on system-owned, internal data)
    ○   Pig Latin: good (operates on external data)
    ○   SCOPE: good (operates on external data)
    ○   Hive: good (operates on both internal and external data)
●   Optimization
    ○   RDB/DW: SQL execution plan
    ○   Pig Latin: basic optimization; not directly discussed in the paper
    ○   SCOPE: compile-time optimization (better execution plans) and run-time optimization (reduced traffic/workload via rack-awareness, partial aggregation, and grouping heuristics)
    ○   Hive: "currently has a naive rule-based optimizer with a small number of simple rules"; "plan to build a cost-based optimizer and adaptive optimization"
Conclusions
●   The ideas behind these 3 papers are very
    similar
    ○   Addressing the same problem: limitation of the bare
        MapReduce model
    ○   Similar approach: high-level data processing scripts
        compiled into optimized, low-level parallel processing tasks
        supported by the underlying parallel processing system
●   Yet there are interesting differences
    ○   data schema, data ownership, and extensibility
    ○   Underlying system
