Published in: Technology

  1. High Level Parallel Processing Models for Data Analysis
     Mingliang Sun
  2. Motivation
     ● Ever-increasing amount of data
     ● High cost of traditional approaches
     ● Limitation of the bare MapReduce approach
  3. Example
     A. Pavlo et al., “A Comparison of Approaches to Large-scale Data Analysis,” Proceedings of the 35th SIGMOD International Conference on Management of Data, New York, NY, USA, 2009
     ● Pros of Parallel DW:
       ○ superior runtime performance
     ● Cons of Parallel DW:
       ○ time-consuming up-front set-up
       ○ sophisticated configuration and tuning
  4. New Model - Pig Latin
     ● Comes from Yahoo
     ● Pig Latin, a high-level data analysis scripting language
     ● Features of Pig, and motivation for them
     ● Language features, data model, and motivation for
     ● Implementation of Pig
     ● A novel debugging approach brought by the system
     ● A few real usage scenarios
  5. New Model - SCOPE
     ● Developed by Microsoft
     ● SCOPE, a declarative and extensible scripting language
     ● Underlying parallel data processing and storage system
     ● Language features and data model
     ● System design and architecture
     ● TPC-H benchmark
  6. New Model - Hive
     ● Comes from Facebook
     ● HiveQL, a high-level data analysis scripting language
     ● Language features, data model, and type system
     ● Data storage in HDFS (Hadoop Distributed File System)
     ● System architecture and components
     ● Usage statistics at Facebook
  7. Comparison (RDB/DW, Pig Latin, SCOPE, Hive)
     ● Programming Style
       ○ RDB/DW: SQL/MDX: a single block of declarative constraints that collectively define the result
       ○ Pig Latin: "A sequence of steps where each step specifies only a single, high-level relational-algebra style data transformation"
       ○ SCOPE: "A sequence of data processing commands"; "Has a strong resemblance to SQL -- an intentional design choice"
       ○ Hive: "HiveQL comprises of a subset of SQL and some extensions"; "Working towards making HiveQL subsume SQL syntax"
     ● Extensibility
       ○ RDB/DW: Vendor / product specific UDF (User Defined Function)
       ○ Pig Latin: Currently support JAVA UDF; with future support of arbitrary languages
       ○ SCOPE: Support C#
       ○ Hive: Support UDF of arbitrary programming languages; data types can also be customized
  8. Comparison (Cont) (RDB/DW, Pig Latin, SCOPE, Hive)
     ● Nested Data Model
       ○ RDB/DW: No, unless one is willing to violate 1NF
       ○ Pig Latin: Yes, supports complex data types (set, map, and tuple)
       ○ SCOPE: (Not directly mentioned or demonstrated in paper)
       ○ Hive: Yes, supports complex data (map, list, and struct)
     ● Data Ownership
       ○ RDB/DW: Yes
       ○ Pig Latin: No
       ○ SCOPE: No
       ○ Hive: Yes or No
     ● Data Storage
       ○ RDB/DW: Internal data structure
       ○ Pig Latin: HDFS (Hadoop Distributed File System) files
       ○ SCOPE: Cosmos files
       ○ Hive: HDFS files
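
The "Nested Data Model" row can be made concrete with a small sketch. The record and the `flatten_to_1nf` helper below are hypothetical, and Python dicts, lists, and tuples only stand in for the map, list, and tuple/struct types named on the slide; this is not the actual type system of Pig or Hive. It shows what a relational store has to do to stay in 1NF: spread one nested row across several flat rows.

```python
# Hypothetical nested record: a single row carries a map, a list, and a tuple.
nested_row = {
    "user": "alice",
    "profile": {"age": 30, "country": "US"},     # map
    "queries": ["hadoop", "pig latin", "hive"],  # list
    "last_click": ("example.com", 3),            # tuple: (url, position)
}


def flatten_to_1nf(row):
    """Flatten the nested row into 1NF: one atomic-valued row per query."""
    return [
        {
            "user": row["user"],
            "age": row["profile"]["age"],
            "country": row["profile"]["country"],
            "query": query,
        }
        for query in row["queries"]
    ]


flat_rows = flatten_to_1nf(nested_row)
for flat in flat_rows:
    print(flat)
```

Keeping the row nested avoids the repeated `user`, `age`, and `country` values that the flattened form has to carry on every line.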
  9. Comparison (Cont) (RDB/DW, Pig Latin, SCOPE, Hive)
     ● Data Schema
       ○ RDB/DW: Predefined and stored in system
       ○ Pig Latin: Defined on the fly
       ○ SCOPE: Defined on the fly
       ○ Hive: Defined on the fly and/or stored in system (Metadata)
     ● Interoperability
       ○ RDB/DW: Poor (must operate on system-owned, internal data)
       ○ Pig Latin: Good (operate on external data)
       ○ SCOPE: Good (operate on external data)
       ○ Hive: Good (operate on both internal and external data)
     ● Optimization
       ○ RDB/DW: SQL execution plan optimization
       ○ Pig Latin: Basic; not directly discussed in the paper
       ○ SCOPE: Compile-time: better execution plan. Run-time: reduced traffic / workload (rack-awareness, partial aggregation, grouping heuristics)
       ○ Hive: "Currently has a naive rule-based optimizer with a small number of simple rules"; plan to build a cost-based optimizer and adaptive optimization
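
Partial aggregation, listed under run-time optimization for SCOPE, can be sketched in a few lines. This is a generic combiner-style illustration under assumed names (`shuffle_without_partial_aggregation`, `shuffle_with_partial_aggregation`, `reduce_counts` are all hypothetical), not SCOPE's actual optimizer. Each inner list plays the role of the data held on one machine, and the length of the returned list approximates how many records would cross the network.

```python
from collections import Counter


def shuffle_without_partial_aggregation(partitions):
    """Every (key, 1) pair is sent to the reducers as-is."""
    return [(word, 1) for part in partitions for word in part]


def shuffle_with_partial_aggregation(partitions):
    """Each partition pre-aggregates locally, then sends one pair per key."""
    pairs = []
    for part in partitions:
        local_counts = Counter(part)  # local, per-partition counts
        pairs.extend(local_counts.items())
    return pairs


def reduce_counts(pairs):
    """Final aggregation over whatever pairs arrived."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


partitions = [["a", "b", "a", "a"], ["b", "b", "c"]]
naive = shuffle_without_partial_aggregation(partitions)
combined = shuffle_with_partial_aggregation(partitions)

print(len(naive), len(combined))
print(reduce_counts(naive) == reduce_counts(combined))
```

Both paths give the same totals; the pre-aggregated path simply ships fewer pairs, which is the "reduced traffic" effect named on the slide.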
  10. Conclusions
     ● The ideas behind these 3 papers are very similar
       ○ Addressing the same problem: limitation of the bare MapReduce model
       ○ Similar approach: high-level data processing scripts compiled into optimized, low-level parallel processing tasks supported by the underlying parallel processing system
     ● Yet there are interesting differences
       ○ data schema, data ownership, and extensibility
       ○ Underlying system
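
The shared approach in the conclusions, a high-level script compiled into low-level parallel tasks, can be illustrated with a toy sketch. Everything here is an assumption made for illustration: `compile_group_count` and `run_job` are hypothetical helpers, the "script" is just a group-by-count with an optional filter, and the job runs in a single process rather than on a cluster. None of it reflects how Pig, SCOPE, or Hive actually compile their plans.

```python
from collections import defaultdict


def compile_group_count(group_key, predicate=None):
    """Turn a high-level 'filter, group, count' request into map/reduce tasks."""

    def map_task(record):
        # Emit (key, 1) for every record that passes the optional filter.
        if predicate is None or predicate(record):
            yield record[group_key], 1

    def reduce_task(key, values):
        # Sum the ones emitted for a single key.
        return key, sum(values)

    return map_task, reduce_task


def run_job(records, map_task, reduce_task):
    """Single-process stand-in for the map, shuffle, and reduce phases."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_task(record):
            groups[key].append(value)  # the "shuffle": group values by key
    return dict(reduce_task(key, values) for key, values in groups.items())


visits = [
    {"url": "a.com", "seconds": 12},
    {"url": "b.com", "seconds": 3},
    {"url": "a.com", "seconds": 40},
    {"url": "a.com", "seconds": 1},
]

map_task, reduce_task = compile_group_count(
    "url", predicate=lambda r: r["seconds"] >= 3
)
result = run_job(visits, map_task, reduce_task)
print(result)
```

The caller states only what to group and filter; the map and reduce functions are generated, which is the division of labor the three systems have in common.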