SlideShare a Scribd company logo
1 of 10
High Level Parallel Processing Models for
             Data Analysis
                 Mingliang Sun
Motivation

●   Ever-increasing amount of data

●   High cost of traditional approaches

●   Limitation of the bare MapReduce
    approach
Example
A. Pavlo et al, “A Comparison of Approaches to Large-scale
Data Analysis,” Proceedings of the 35th SIGMOD international
conference on Management of data, New York, NY, USA 2009


●   Pros of Parallel DW:
    ○   superior runtime performance
●   Cons of Parallel DW:
    ○   time consuming up-front set-up
    ○   sophisticated configuration and tuning
New Model – Pig Latin
●   Comes from Yahoo
●   Pig Latin, a high-level data analysis scripting
    language
●   Features of Pig, and motivation for them
●   Language features, data model, and motivation for
●   Implementation of Pig
●   A novel debugging approach brought by the system
●   A few real usage scenarios
New Model - SCOPE
●   Developed by Microsoft
●   SCOPE, a declarative and extensible scripting
    language
●   Underlying parallel data processing and storage
    system
●   Language features and data model
●   System design and architecture
●   TPC-H benchmark
New Model - Hive
●   Comes from Facebook
●   HiveQL, a high-level data analysis scripting language
●   Language features, data model, and type system
●   Data storage in HDFS (Hadoop File System)
●   System architecture and components
●   Usage statistics at Facebook
Comparison
                RDB/DW                Pig Latin             SCOPE                Hive


Programming     SQL/MDX: a            "A sequence of        * "A sequence of     * "HiveQL
Style           single block of       steps where each      data processing      comprises of a
                declarative           step specifies only   commands"            subset of SQL
                constraints that      a single, high-       * "Has a strong      and some
                collectively define   level relational-     resemblance to       extensions"
                the result            algebra style data    SQL -- an            * "Working
                                      transformation"       intentional design   towards making
                                                            choice"              HiveQL subsume
                                                                                 SQL syntax"

Extensibility   Vendor / product      * Currently           Support C#           * Support UDF of
                specific UDF          support JAVA                               arbitrary
                (User Defined         UDF                                        programming
                Function)             * With future                              languages
                                      support of                                 * Data types can
                                      arbitrary                                  also be
                                      languages                                  customized
Comparison (Cont')
                 RDB/DW               Pig Latin            SCOPE             Hive


Nested Data      No, unless one is    Yes,supports         (Not directly     Yes, supports
Model            willing to violate   complex data         mentioned or      complex data
                 1NF                  types (set, map,     demonstrated in   (map, list, and
                                      and tuple)           paper)            struct)




Data Ownership   Yes                  No                   No                Yes or No




Data Storage     Internal data        HDFS (Hadoop         Cosmos files      HDFS files
                 structure            File System) files
Comparison (Cont')
                  RDB/DW             Pig Latin            SCOPE                Hive


Data Schema       Predefined and     Defined on the fly   Defined on the fly   Defined on the fly
                  stored in system                                             and/or stored in
                                                                               system
                                                                               (Metadata)


Inteoperability   Poor (must         Good (Operate on     Good (operate on     Good (operate on
                  operate on         external data)       external data)       both internal and
                  system-owned,                                                external data)
                  internal data)

Optimization      SQL execution      * basic              * Complie-time:      * "Currently has a
                  plan               optimization         better execution     naive rule-based
                                     * Not directly       plan                 optimizer with a
                                     discussed in the     * Run-time:          small number of
                                     paper                reduced traffic /    simple rules"
                                                          workload (Rack-      * Plan to build a
                                                          awareness, partial   cost-based
                                                          aggregation,         optimizer and
                                                          grouping             adaptive
                                                          heuristics)          optimization"
Conclusions
●   The ideas behind these 3 papers are very
    similar
    ○   Addressing the same problem: limitation of the bare
        MapReduce model
    ○   Similar approach: high-level data processing scripts
        compiled into optimized, low-level parallel processing tasks
        supported by the underlying parallel processing system
●   Yet there are interesting differences
    ○   data schema, data ownership, and extensibility
    ○   Underlying system

More Related Content

What's hot

Large scale computing with mapreduce
Large scale computing with mapreduceLarge scale computing with mapreduce
Large scale computing with mapreducehansen3032
 
Hadoop architecture-tutorial
Hadoop  architecture-tutorialHadoop  architecture-tutorial
Hadoop architecture-tutorialvinayiqbusiness
 
Small Overview of Skype Database Tools
Small Overview of Skype Database ToolsSmall Overview of Skype Database Tools
Small Overview of Skype Database Toolselliando dias
 
Parallel Data Processing with MapReduce: A Survey
Parallel Data Processing with MapReduce: A SurveyParallel Data Processing with MapReduce: A Survey
Parallel Data Processing with MapReduce: A SurveyKyong-Ha Lee
 
PostgreSQL - Object Relational Database
PostgreSQL - Object Relational DatabasePostgreSQL - Object Relational Database
PostgreSQL - Object Relational DatabaseMubashar Iqbal
 
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLab
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLabBeyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLab
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLabVijay Srinivas Agneeswaran, Ph.D
 
Bigtable: A Distributed Storage System for Structured Data
Bigtable: A Distributed Storage System for Structured DataBigtable: A Distributed Storage System for Structured Data
Bigtable: A Distributed Storage System for Structured Dataelliando dias
 
The Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop EcosystemThe Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop EcosystemCloudera, Inc.
 
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal Srivatsan Ramanujam
 
Hadoop Maharajathi,II-M.sc.,Computer Science,Bonsecours college for women
Hadoop Maharajathi,II-M.sc.,Computer Science,Bonsecours college for womenHadoop Maharajathi,II-M.sc.,Computer Science,Bonsecours college for women
Hadoop Maharajathi,II-M.sc.,Computer Science,Bonsecours college for womenmaharajothip1
 

What's hot (20)

Anju
AnjuAnju
Anju
 
Large scale computing with mapreduce
Large scale computing with mapreduceLarge scale computing with mapreduce
Large scale computing with mapreduce
 
Hadoop architecture-tutorial
Hadoop  architecture-tutorialHadoop  architecture-tutorial
Hadoop architecture-tutorial
 
Hadoop ppt2
Hadoop ppt2Hadoop ppt2
Hadoop ppt2
 
4. hbase overview
4. hbase overview4. hbase overview
4. hbase overview
 
Small Overview of Skype Database Tools
Small Overview of Skype Database ToolsSmall Overview of Skype Database Tools
Small Overview of Skype Database Tools
 
Parallel Data Processing with MapReduce: A Survey
Parallel Data Processing with MapReduce: A SurveyParallel Data Processing with MapReduce: A Survey
Parallel Data Processing with MapReduce: A Survey
 
An Introduction to Hadoop
An Introduction to HadoopAn Introduction to Hadoop
An Introduction to Hadoop
 
PostgreSQL - Object Relational Database
PostgreSQL - Object Relational DatabasePostgreSQL - Object Relational Database
PostgreSQL - Object Relational Database
 
Hadoop Technologies
Hadoop TechnologiesHadoop Technologies
Hadoop Technologies
 
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLab
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLabBeyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLab
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLab
 
Spark core
Spark coreSpark core
Spark core
 
1. Apache HIVE
1. Apache HIVE1. Apache HIVE
1. Apache HIVE
 
Bigtable: A Distributed Storage System for Structured Data
Bigtable: A Distributed Storage System for Structured DataBigtable: A Distributed Storage System for Structured Data
Bigtable: A Distributed Storage System for Structured Data
 
The Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop EcosystemThe Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop Ecosystem
 
Gfs vs hdfs
Gfs vs hdfsGfs vs hdfs
Gfs vs hdfs
 
Hadoop
Hadoop Hadoop
Hadoop
 
Google BigTable
Google BigTableGoogle BigTable
Google BigTable
 
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
 
Hadoop Maharajathi,II-M.sc.,Computer Science,Bonsecours college for women
Hadoop Maharajathi,II-M.sc.,Computer Science,Bonsecours college for womenHadoop Maharajathi,II-M.sc.,Computer Science,Bonsecours college for women
Hadoop Maharajathi,II-M.sc.,Computer Science,Bonsecours college for women
 

Viewers also liked

Cs782 presentation group7
Cs782 presentation group7Cs782 presentation group7
Cs782 presentation group7Mingliang Sun
 
Class 9: Consistent Hashing
Class 9: Consistent HashingClass 9: Consistent Hashing
Class 9: Consistent HashingDavid Evans
 
Overview of Zookeeper, Helix and Kafka (Oakjug)
Overview of Zookeeper, Helix and Kafka (Oakjug)Overview of Zookeeper, Helix and Kafka (Oakjug)
Overview of Zookeeper, Helix and Kafka (Oakjug)Chris Richardson
 
Consistent hashing
Consistent hashingConsistent hashing
Consistent hashingJooho Lee
 
Design principles of scalable, distributed systems
Design principles of scalable, distributed systemsDesign principles of scalable, distributed systems
Design principles of scalable, distributed systemsTinniam V Ganesh (TV)
 
Distributed Hash Table and Consistent Hashing
Distributed Hash Table and Consistent HashingDistributed Hash Table and Consistent Hashing
Distributed Hash Table and Consistent HashingCloudFundoo
 
How to Become a Thought Leader in Your Niche
How to Become a Thought Leader in Your NicheHow to Become a Thought Leader in Your Niche
How to Become a Thought Leader in Your NicheLeslie Samuel
 

Viewers also liked (8)

Cs782 presentation group7
Cs782 presentation group7Cs782 presentation group7
Cs782 presentation group7
 
Class 9: Consistent Hashing
Class 9: Consistent HashingClass 9: Consistent Hashing
Class 9: Consistent Hashing
 
Overview of Zookeeper, Helix and Kafka (Oakjug)
Overview of Zookeeper, Helix and Kafka (Oakjug)Overview of Zookeeper, Helix and Kafka (Oakjug)
Overview of Zookeeper, Helix and Kafka (Oakjug)
 
Consistent hashing
Consistent hashingConsistent hashing
Consistent hashing
 
Distributed Hash Table
Distributed Hash TableDistributed Hash Table
Distributed Hash Table
 
Design principles of scalable, distributed systems
Design principles of scalable, distributed systemsDesign principles of scalable, distributed systems
Design principles of scalable, distributed systems
 
Distributed Hash Table and Consistent Hashing
Distributed Hash Table and Consistent HashingDistributed Hash Table and Consistent Hashing
Distributed Hash Table and Consistent Hashing
 
How to Become a Thought Leader in Your Niche
How to Become a Thought Leader in Your NicheHow to Become a Thought Leader in Your Niche
How to Become a Thought Leader in Your Niche
 

Similar to high_level_parallel_processing_model

Microsoft's Hadoop Story
Microsoft's Hadoop StoryMicrosoft's Hadoop Story
Microsoft's Hadoop StoryMichael Rys
 
Big Data: An Overview
Big Data: An OverviewBig Data: An Overview
Big Data: An OverviewC. Scyphers
 
HBaseCon 2012 | Building a Large Search Platform on a Shoestring Budget
HBaseCon 2012 | Building a Large Search Platform on a Shoestring BudgetHBaseCon 2012 | Building a Large Search Platform on a Shoestring Budget
HBaseCon 2012 | Building a Large Search Platform on a Shoestring BudgetCloudera, Inc.
 
Big data & hadoop
Big data & hadoopBig data & hadoop
Big data & hadoopAbhi Goyan
 
Deploying Grid Services Using Hadoop
Deploying Grid Services Using HadoopDeploying Grid Services Using Hadoop
Deploying Grid Services Using HadoopGeorge Ang
 
Large-Scale Data Storage and Processing for Scientists with Hadoop
Large-Scale Data Storage and Processing for Scientists with HadoopLarge-Scale Data Storage and Processing for Scientists with Hadoop
Large-Scale Data Storage and Processing for Scientists with HadoopEvert Lammerts
 
High level languages for Big Data Analytics (Report)
High level languages for Big Data Analytics (Report)High level languages for Big Data Analytics (Report)
High level languages for Big Data Analytics (Report)Jose Luis Lopez Pino
 
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Andrey Vykhodtsev
 
4. hadoop גיא לבנברג
4. hadoop  גיא לבנברג4. hadoop  גיא לבנברג
4. hadoop גיא לבנברגTaldor Group
 
Hadoop Distriubted File System (HDFS) presentation 27- 5-2015
Hadoop Distriubted File System (HDFS) presentation 27- 5-2015Hadoop Distriubted File System (HDFS) presentation 27- 5-2015
Hadoop Distriubted File System (HDFS) presentation 27- 5-2015Abdul Nasir
 
Apache Hadoop 1.1
Apache Hadoop 1.1Apache Hadoop 1.1
Apache Hadoop 1.1Sperasoft
 

Similar to high_level_parallel_processing_model (20)

Microsoft's Hadoop Story
Microsoft's Hadoop StoryMicrosoft's Hadoop Story
Microsoft's Hadoop Story
 
Big Data: An Overview
Big Data: An OverviewBig Data: An Overview
Big Data: An Overview
 
Nosql
NosqlNosql
Nosql
 
Drill njhug -19 feb2013
Drill njhug -19 feb2013Drill njhug -19 feb2013
Drill njhug -19 feb2013
 
HBaseCon 2012 | Building a Large Search Platform on a Shoestring Budget
HBaseCon 2012 | Building a Large Search Platform on a Shoestring BudgetHBaseCon 2012 | Building a Large Search Platform on a Shoestring Budget
HBaseCon 2012 | Building a Large Search Platform on a Shoestring Budget
 
Big data & hadoop
Big data & hadoopBig data & hadoop
Big data & hadoop
 
hadoop
hadoophadoop
hadoop
 
Hadoop programming
Hadoop programmingHadoop programming
Hadoop programming
 
Deploying Grid Services Using Hadoop
Deploying Grid Services Using HadoopDeploying Grid Services Using Hadoop
Deploying Grid Services Using Hadoop
 
Big data Analytics Hadoop
Big data Analytics HadoopBig data Analytics Hadoop
Big data Analytics Hadoop
 
Large-Scale Data Storage and Processing for Scientists with Hadoop
Large-Scale Data Storage and Processing for Scientists with HadoopLarge-Scale Data Storage and Processing for Scientists with Hadoop
Large-Scale Data Storage and Processing for Scientists with Hadoop
 
getFamiliarWithHadoop
getFamiliarWithHadoopgetFamiliarWithHadoop
getFamiliarWithHadoop
 
High level languages for Big Data Analytics (Report)
High level languages for Big Data Analytics (Report)High level languages for Big Data Analytics (Report)
High level languages for Big Data Analytics (Report)
 
Apache Spark
Apache SparkApache Spark
Apache Spark
 
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
 
Big data ppt
Big data pptBig data ppt
Big data ppt
 
Hadoop seminar
Hadoop seminarHadoop seminar
Hadoop seminar
 
4. hadoop גיא לבנברג
4. hadoop  גיא לבנברג4. hadoop  גיא לבנברג
4. hadoop גיא לבנברג
 
Hadoop Distriubted File System (HDFS) presentation 27- 5-2015
Hadoop Distriubted File System (HDFS) presentation 27- 5-2015Hadoop Distriubted File System (HDFS) presentation 27- 5-2015
Hadoop Distriubted File System (HDFS) presentation 27- 5-2015
 
Apache Hadoop 1.1
Apache Hadoop 1.1Apache Hadoop 1.1
Apache Hadoop 1.1
 

Recently uploaded

TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Bhuvaneswari Subramani
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Zilliz
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusZilliz
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxRemote DBA Services
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 

Recently uploaded (20)

TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 

high_level_parallel_processing_model

  • 1. High Level Parallel Processing Models for Data Analysis Mingliang Sun
  • 2. Motivation ● Ever-increasing amount of data ● High cost of traditional approaches ● Limitation of the bare MapReduce approach
  • 3. Example A. Pavlo et al, “A Comparison of Approaches to Large-scale Data Analysis,” Proceedings of the 35th SIGMOD international conference on Management of data, New York, NY, USA 2009 ● Pros of Parallel DW: ○ superior runtime performance ● Cons of Parallel DW: ○ time consuming up-front set-up ○ sophisticated configuration and tuning
  • 4. New Model – Pig Latin ● Comes from Yahoo ● Pig Latin, a high-level data analysis scripting language ● Features of Pig, and motivation for them ● Language features, data model, and motivation for ● Implementation of Pig ● A novel debugging approach brought by the system ● A few real usage scenarios
  • 5. New Model - SCOPE ● Developed by Microsoft ● SCOPE, a declarative and extensible scripting language ● Underlying parallel data processing and storage system ● Language features and data model ● System design and architecture ● TPC-H benchmark
  • 6. New Model - Hive ● Comes from Facebook ● HiveQL, a high-level data analysis scripting language ● Language features, data model, and type system ● Data storage in HDFS (Hadoop File System) ● System architecture and components ● Usage statistics at Facebook
  • 7. Comparison RDB/DW Pig Latin SCOPE Hive Programming SQL/MDX: a "A sequence of * "A sequence of * "HiveQL Style single block of steps where each data processing comprises of a declarative step specifies only commands" subset of SQL constraints that a single, high- * "Has a strong and some collectively define level relational- resemblance to extensions" the result algebra style data SQL -- an * "Working transformation" intentional design towards making choice" HiveQL subsume SQL syntax" Extensibility Vendor / product * Currently Support C# * Support UDF of specific UDF support JAVA arbitrary (User Defined UDF programming Function) * With future languages support of * Data types can arbitrary also be languages customized
  • 8. Comparison (Cont') RDB/DW Pig Latin SCOPE Hive Nested Data No, unless one is Yes,supports (Not directly Yes, supports Model willing to violate complex data mentioned or complex data 1NF types (set, map, demonstrated in (map, list, and and tuple) paper) struct) Data Ownership Yes No No Yes or No Data Storage Internal data HDFS (Hadoop Cosmos files HDFS files structure File System) files
  • 9. Comparison (Cont') RDB/DW Pig Latin SCOPE Hive Data Schema Predefined and Defined on the fly Defined on the fly Defined on the fly stored in system and/or stored in system (Metadata) Inteoperability Poor (must Good (Operate on Good (operate on Good (operate on operate on external data) external data) both internal and system-owned, external data) internal data) Optimization SQL execution * basic * Complie-time: * "Currently has a plan optimization better execution naive rule-based * Not directly plan optimizer with a discussed in the * Run-time: small number of paper reduced traffic / simple rules" workload (Rack- * Plan to build a awareness, partial cost-based aggregation, optimizer and grouping adaptive heuristics) optimization"
  • 10. Conclusions ● The ideas behind these 3 papers are very similar ○ Addressing the same problem: limitation of the bare MapReduce model ○ Similar approach: high-level data processing scripts compiled into optimized, low-level parallel processing tasks supported by the underlying parallel processing system ● Yet there are interesting differences ○ data schema, data ownership, and extensibility ○ Underlying system