high_level_parallel_processing_model

This document summarizes and compares three high-level parallel processing models: Pig Latin, SCOPE, and Hive. It discusses how each addresses the limitations of traditional approaches to large-scale data analysis by providing a high-level scripting language that is compiled into optimized parallel tasks. While the ideas are similar, the models differ in programming style, extensibility, data model, and optimization strategy. Overall, they make different tradeoffs among flexibility, performance, and usability for large-scale data analysis.

High Level Parallel Processing Models for
Data Analysis
Mingliang Sun

Motivation

● Ever-increasing amount of data

● High cost of traditional approaches

● Limitation of the bare MapReduce
approach

Example
A. Pavlo et al., “A Comparison of Approaches to Large-Scale
Data Analysis,” Proceedings of the 35th SIGMOD International
Conference on Management of Data, New York, NY, USA, 2009

● Pros of Parallel DW:
○ superior runtime performance
● Cons of Parallel DW:
○ time-consuming up-front set-up
○ sophisticated configuration and tuning

New Model – Pig Latin
● Comes from Yahoo
● Pig Latin, a high-level data analysis scripting
language
● Features of Pig, and motivation for them
● Language features, data model, and motivation for them
● Implementation of Pig
● A novel debugging approach brought by the system
● A few real usage scenarios

New Model - SCOPE
● Developed by Microsoft
● SCOPE, a declarative and extensible scripting
language
● Underlying parallel data processing and storage
system
● Language features and data model
● System design and architecture
● TPC-H benchmark

New Model - Hive
● Comes from Facebook
● HiveQL, a high-level data analysis scripting language
● Language features, data model, and type system
● Data storage in HDFS (Hadoop Distributed File System)
● System architecture and components
● Usage statistics at Facebook

Comparison
● Programming style
○ RDB/DW: SQL/MDX: a single block of declarative constraints that collectively define the result
○ Pig Latin: "A sequence of steps where each step specifies only a single, high-level relational-algebra style data transformation"
○ SCOPE: "A sequence of data processing commands"; "Has a strong resemblance to SQL -- an intentional design choice"
○ Hive: "HiveQL comprises of a subset of SQL and some extensions"; "Working towards making HiveQL subsume SQL syntax"
● Extensibility
○ RDB/DW: Vendor / product specific UDF (User Defined Function)
○ Pig Latin: Currently supports Java UDFs, with future support of arbitrary languages
○ SCOPE: Supports C#
○ Hive: Supports UDFs in arbitrary programming languages; data types can also be customized
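
The "sequence of steps" style quoted above can be illustrated with a minimal Python sketch. This is not Pig Latin or SCOPE syntax; the sample data, field layout, and function names are hypothetical and chosen only to show each step performing a single relational-algebra style transformation.

```python
from collections import defaultdict

# Hypothetical sample data: (user, url, seconds spent on the page).
VISITS = [
    ("amy", "cnn.com", 30),
    ("amy", "bbc.com", 5),
    ("bob", "cnn.com", 12),
    ("bob", "cnn.com", 40),
]


def filter_long(visits, min_seconds):
    """Step 1 (FILTER): keep visits longer than min_seconds."""
    return [v for v in visits if v[2] > min_seconds]


def group_by_url(visits):
    """Step 2 (GROUP): collect the seconds of each visit under its url."""
    groups = defaultdict(list)
    for _user, url, seconds in visits:
        groups[url].append(seconds)
    return dict(groups)


def total_per_group(groups):
    """Step 3 (AGGREGATE): produce one total per url."""
    return {url: sum(seconds) for url, seconds in groups.items()}


# The "script" is just the steps applied in order.
long_visits = filter_long(VISITS, 10)
grouped = group_by_url(long_visits)
totals = total_per_group(grouped)
print(totals)
```

A single declarative SQL block would state the same result in one statement and leave the ordering of these steps to the optimizer.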

Comparison (Cont')
● Nested data model
○ RDB/DW: No, unless one is willing to violate 1NF
○ Pig Latin: Yes, supports complex data types (set, map, and tuple)
○ SCOPE: (Not directly mentioned or demonstrated in paper)
○ Hive: Yes, supports complex data (map, list, and struct)
● Data ownership
○ RDB/DW: Yes
○ Pig Latin: No
○ SCOPE: No
○ Hive: Yes or No
● Data storage
○ RDB/DW: Internal data structure
○ Pig Latin: HDFS (Hadoop Distributed File System) files
○ SCOPE: Cosmos files
○ Hive: HDFS files
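
To make the nested data model row concrete, here is a small Python sketch of a record that holds a list and a map inside a struct-like value, next to the flat rows a strictly 1NF table would need. The record layout and the helper `flatten_to_1nf` are assumptions made for illustration, not taken from any of the three papers.

```python
# Hypothetical nested record: a struct-like dict holding a list and a map.
user = {
    "name": "amy",
    "queries": ["lakers", "kings"],       # list field
    "clicks": {"lakers": 3, "kings": 1},  # map field
}


def flatten_to_1nf(record):
    """Expand one nested record into flat (name, query, clicks) rows."""
    return [
        (record["name"], query, record["clicks"].get(query, 0))
        for query in record["queries"]
    ]


rows = flatten_to_1nf(user)
for row in rows:
    print(row)
```

The nested form keeps everything about one user in a single value, while the flat form repeats the user name on every row.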

Comparison (Cont')
● Data schema
○ RDB/DW: Predefined and stored in system
○ Pig Latin: Defined on the fly
○ SCOPE: Defined on the fly
○ Hive: Defined on the fly and/or stored in system (Metadata)
● Interoperability
○ RDB/DW: Poor (must operate on system-owned, internal data)
○ Pig Latin: Good (operate on external data)
○ SCOPE: Good (operate on external data)
○ Hive: Good (operate on both internal and external data)
● Optimization
○ RDB/DW: SQL execution plan
○ Pig Latin: Basic optimization; not directly discussed in the paper
○ SCOPE: Compile-time: better execution plan. Run-time: reduced traffic / workload (rack-awareness, partial aggregation, grouping heuristics)
○ Hive: "Currently has a naive rule-based optimizer with a small number of simple rules"; plan to build a cost-based optimizer and adaptive optimization
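
Partial aggregation, listed above as a run-time optimization, can be sketched generically in Python. This illustrates the general idea only and is not SCOPE's implementation; the partitions and function names are hypothetical. Each partition pre-aggregates its own records so that fewer records have to cross the network before the final aggregation.

```python
from collections import Counter


def shuffle_without_partial(partitions):
    """Every (key, 1) record is sent across the network individually."""
    return [(key, 1) for part in partitions for key in part]


def shuffle_with_partial(partitions):
    """Each partition counts locally and sends one record per distinct key."""
    sent = []
    for part in partitions:
        sent.extend(Counter(part).items())
    return sent


def final_aggregate(records):
    """Sum the (key, count) records that arrive after the shuffle."""
    totals = Counter()
    for key, count in records:
        totals[key] += count
    return dict(totals)


# Hypothetical input: two partitions living on different machines.
partitions = [["a", "a", "b"], ["a", "b", "b", "b"]]

naive = shuffle_without_partial(partitions)
combined = shuffle_with_partial(partitions)

print(len(naive), "records sent without partial aggregation")
print(len(combined), "records sent with partial aggregation")
print(final_aggregate(combined))
```

Both paths give the same totals; only the number of records shipped between machines changes.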

Conclusions
● The ideas behind these 3 papers are very
similar
○ Addressing the same problem: limitation of the bare
MapReduce model
○ Similar approach: high-level data processing scripts
compiled into optimized, low-level parallel processing tasks
supported by the underlying parallel processing system
● Yet there are interesting differences
○ data schema, data ownership, and extensibility
○ Underlying system
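
As a rough picture of the low-level tasks that such scripts are compiled into, the following Python sketch writes a simple count-per-URL job as explicit map, shuffle, and reduce phases. It runs in a single process and uses hypothetical names and data; it assumes nothing about how Pig, SCOPE, or Hive actually generate their plans.

```python
from itertools import groupby
from operator import itemgetter


def map_phase(records, mapper):
    """Apply the mapper to every record and collect its (key, value) pairs."""
    return [pair for record in records for pair in mapper(record)]


def shuffle_phase(pairs):
    """Sort by key and gather all values that share a key."""
    ordered = sorted(pairs, key=itemgetter(0))
    return [
        (key, [value for _key, value in group])
        for key, group in groupby(ordered, key=itemgetter(0))
    ]


def reduce_phase(grouped, reducer):
    """Apply the reducer once per key."""
    return [reducer(key, values) for key, values in grouped]


def count_mapper(record):
    """Emit (url, 1) for each input record."""
    yield (record["url"], 1)


def count_reducer(url, ones):
    """Sum the ones emitted for a url."""
    return (url, sum(ones))


# Hypothetical input records.
records = [{"url": "cnn.com"}, {"url": "bbc.com"}, {"url": "cnn.com"}]

result = reduce_phase(shuffle_phase(map_phase(records, count_mapper)), count_reducer)
print(result)
```

In a high-level language the same job is typically a short grouping statement, and the mapper, shuffle, and reducer are generated by the compiler rather than written by hand.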
