EEDC 34330
Execution Environments for Distributed Computing
Master in Computer Architecture, Networks and Systems - CANS

Apache Pig

Homework number: 3
Group number: EEDC-3
Group members:
Javier Álvarez – javicid@gmail.com
Francesc Lordan – francesc.lordan@gmail.com
Roger Rafanell – rogerrafanell@gmail.com

Outline

1.- Introduction

2.- Pig Latin
    2.1.- Data model
    2.2.- Relational commands

3.- Implementation

4.- Conclusions




Part 1: Introduction

Why Apache Pig?

Today’s Internet companies need to process huge data sets:

   – Parallel databases can be prohibitively expensive at this scale.

   – Programmers tend to find declarative languages such as SQL very
     unnatural.

   – Other approaches such as map-reduce are low-level and rigid.




What is Apache Pig?

A platform for analyzing large data sets that:

   – Is based on Pig Latin, a language that lies between declarative (SQL)
     and procedural (C++) programming languages.

   – Enables the construction of programs whose structure is easy to
     parallelize.




Which features does it have?
• Dataflow Language
   – Data processing is expressed step by step.

• Quick Start & Interoperability
   – Pig can work over any kind of input and produce any kind of output.

• Nested Data Model
   – Pig works with complex types such as tuples, bags and maps.

• User Defined Functions (UDFs)
   – Potentially in any programming language (only Java for the moment).

• Only parallel
   – Pig Latin only offers directives that can be parallelized directly.

• Debugging environment
   – Programs can be debugged while they are being written.

Part 2: Pig Latin

Section 2.1: Data model

Data Model
A rich data model built from four data types:

• Atom: a simple atomic value such as a string or a number.
      'Alice'

• Tuple: a sequence of fields, each of which can be of any data type.
      ('Alice', 'Apple')
      ('Alice', ('Barça', 'football'))

• Bag: a collection of tuples with possible duplicates.
      { ('Alice', 'Apple'),
        ('Alice', ('Barça', 'football')) }

• Map: a collection of data items, each with an associated key (always an atom).
      [ 'Fan of' → { ('Apple'),
                     ('Barça', 'football') } ]
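
The four types can be mimicked with ordinary Python values. This is only a sketch for building intuition, not how Pig stores data: using a Counter for the bag and a dict for the map are assumptions made here.

```python
from collections import Counter

# Atom: a simple value.
atom = "Alice"

# Tuple: an ordered sequence of fields; a field may itself be a tuple.
flat_tuple = ("Alice", "Apple")
nested_tuple = ("Alice", ("Barça", "football"))

# Bag: an unordered collection of tuples in which duplicates are allowed.
# A Counter keyed by tuple records how many copies of each tuple it holds.
bag = Counter([flat_tuple, nested_tuple, flat_tuple])

# Map: atom keys associated with data items of any type (here, a bag).
fan_of = {"Fan of": Counter([("Apple",), ("Barça", "football")])}

copies_of_flat_tuple = bag[flat_tuple]
tuples_in_bag = sum(bag.values())
print(copies_of_flat_tuple, tuples_in_bag)
```

The duplicate tuple is kept, which is what separates a bag from a set.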

Section 2.2: Relational commands

Relational commands
visits = LOAD 'visits.txt' AS (user, url, time);
pages = LOAD 'pages.txt' AS (url, rank);

visits: ('Amy', 'cnn.com', '8am')
        ('Amy', 'nytimes.com', '9am')
        ('Bob', 'elmundotoday.com', '11am')

pages: ('cnn.com', '0.8')
       ('nytimes.com', '0.6')
       ('elmundotoday.com', '0.2')

Relational commands
visits = LOAD 'visits.txt' AS (user, url, time);
pages = LOAD 'pages.txt' AS (url, rank);
vp = JOIN visits BY url, pages BY url;

vp: ('Amy', 'cnn.com', '8am', 'cnn.com', '0.8')
    ('Amy', 'nytimes.com', '9am', 'nytimes.com', '0.6')
    ('Bob', 'elmundotoday.com', '11am', 'elmundotoday.com', '0.2')
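
The effect of the JOIN can be reproduced with a small hash join in Python. This is a sketch of the result, not of how Pig executes the join; the function name `join` and the positional keys are assumptions, and ranks are written as numbers for convenience.

```python
from collections import defaultdict

visits = [
    ("Amy", "cnn.com", "8am"),
    ("Amy", "nytimes.com", "9am"),
    ("Bob", "elmundotoday.com", "11am"),
]
pages = [("cnn.com", 0.8), ("nytimes.com", 0.6), ("elmundotoday.com", 0.2)]

def join(left, left_key, right, right_key):
    """Concatenate every left tuple with every right tuple sharing its key."""
    index = defaultdict(list)
    for r in right:
        index[r[right_key]].append(r)
    return [l + r for l in left for r in index.get(l[left_key], [])]

# vp = JOIN visits BY url, pages BY url;
vp = join(visits, 1, pages, 0)
for row in vp:
    print(row)
```

Each output tuple carries the fields of both inputs, so the url appears twice, as in the slide.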




Relational commands
visits = LOAD 'visits.txt' AS (user, url, time);
pages = LOAD 'pages.txt' AS (url, rank);
vp = JOIN visits BY url, pages BY url;
users = GROUP vp BY user;

users: ('Amy', { ('Amy', 'cnn.com', '8am', 'cnn.com', '0.8'),
                 ('Amy', 'nytimes.com', '9am', 'nytimes.com', '0.6') })

       ('Bob', { ('Bob', 'elmundotoday.com', '11am', 'elmundotoday.com', '0.2') })
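
GROUP produces one tuple per key, holding the key and a bag of all tuples with that key. A Python sketch of that shape follows; `group_by` is a hypothetical helper, and a list stands in for the bag.

```python
vp = [
    ("Amy", "cnn.com", "8am", "cnn.com", 0.8),
    ("Amy", "nytimes.com", "9am", "nytimes.com", 0.6),
    ("Bob", "elmundotoday.com", "11am", "elmundotoday.com", 0.2),
]

def group_by(rows, key):
    """Return (key value, bag of full tuples) pairs, one per distinct key."""
    groups = {}
    for row in rows:
        groups.setdefault(row[key], []).append(row)
    return list(groups.items())

# users = GROUP vp BY user;
users = group_by(vp, 0)
for group, bag in users:
    print(group, len(bag))
```

Unlike SQL's GROUP BY, no aggregation happens here: the grouped tuples are kept whole inside the nested bag.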


Relational commands
visits = LOAD 'visits.txt' AS (user, url, time);
pages = LOAD 'pages.txt' AS (url, rank);
vp = JOIN visits BY url, pages BY url;
users = GROUP vp BY user;
useravg = FOREACH users GENERATE group, AVG(vp.rank) AS avgpr;

useravg: ('Amy', '0.7')
         ('Bob', '0.2')
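
FOREACH ... GENERATE applies an expression to every tuple; here the expression averages the rank field of each nested bag. A Python sketch, assuming ranks are numbers and that rank sits at position 4 of the joined tuples:

```python
from statistics import mean

users = [
    ("Amy", [("Amy", "cnn.com", "8am", "cnn.com", 0.8),
             ("Amy", "nytimes.com", "9am", "nytimes.com", 0.6)]),
    ("Bob", [("Bob", "elmundotoday.com", "11am", "elmundotoday.com", 0.2)]),
]

RANK = 4  # position of the rank field inside each joined tuple

# useravg = FOREACH users GENERATE group, AVG(vp.rank) AS avgpr;
useravg = [(group, mean(row[RANK] for row in bag)) for group, bag in users]
for group, avgpr in useravg:
    print(group, round(avgpr, 2))
```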




Relational commands
visits = LOAD 'visits.txt' AS (user, url, time);
pages = LOAD 'pages.txt' AS (url, rank);
vp = JOIN visits BY url, pages BY url;
users = GROUP vp BY user;
useravg = FOREACH users GENERATE group, AVG(vp.rank) AS avgpr;
answer = FILTER useravg BY avgpr > 0.5;

answer: ('Amy', '0.7')
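
FILTER keeps the tuples that satisfy a condition. In the Python sketch below the threshold is treated as a number, which is an assumption about the intended semantics; the last two lines show why a textual comparison would be risky.

```python
useravg = [("Amy", 0.7), ("Bob", 0.2)]

# answer = FILTER useravg BY avgpr > 0.5 (numeric comparison)
answer = [row for row in useravg if row[1] > 0.5]
print(answer)

# Comparing the same kind of values as strings is lexicographic and can
# disagree with the numeric order.
numeric = 10.0 > 9.0      # compares magnitudes
textual = "10.0" > "9.0"  # compares character by character: "1" < "9"
print(numeric, textual)
```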




Relational commands
Other relational operators:

    – STORE: exports data into a file.
          STORE var1_name INTO 'output.txt';

    – COGROUP: groups together tuples from different data sets.
          COGROUP var1_name BY field_id, var2_name BY field_id;

    – UNION: computes the union of two data sets.
    – CROSS: computes the cross product.
    – ORDER: sorts a data set by one or more fields.
    – DISTINCT: removes duplicate tuples from a data set.
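
The operators listed above can be approximated in a few lines of Python. This is a sketch of their results only; the helper `cogroup`, the use of lists as bags, and the decision to leave one url out of `pages` (to show an empty bag) are assumptions made here.

```python
from itertools import product

visits = [
    ("Amy", "cnn.com", "8am"),
    ("Amy", "nytimes.com", "9am"),
    ("Bob", "elmundotoday.com", "11am"),
]
# elmundotoday.com is left out on purpose to show an empty bag below.
pages = [("cnn.com", 0.8), ("nytimes.com", 0.6)]

def cogroup(left, left_key, right, right_key):
    """For each key, return (key, bag from left, bag from right)."""
    keys = sorted({r[left_key] for r in left} | {r[right_key] for r in right})
    return [
        (k,
         [r for r in left if r[left_key] == k],
         [r for r in right if r[right_key] == k])
        for k in keys
    ]

cogrouped = cogroup(visits, 1, pages, 0)           # COGROUP ... BY url
union = visits + visits[:1]                        # UNION keeps duplicates
cross = [v + p for v, p in product(visits, pages)] # CROSS: every pairing
ordered = sorted(pages, key=lambda r: r[1])        # ORDER pages BY rank
distinct = sorted(set(union))                      # DISTINCT drops duplicates

print(cogrouped[1])
```

COGROUP keeps the tuples of each input in separate bags, so a key with no match on one side still appears, with an empty bag for that side.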





Part 3: Implementation

Implementation: Highlights

• Works on top of the Hadoop ecosystem:
   – The current implementation uses Hadoop as its execution platform.

• On-the-fly compilation:
   – Pig translates the Pig Latin commands into Map and Reduce methods.

• Lazy-style language:
   – Pig tries to postpone data materialization (writes to disk) as long as
     possible.




Implementation: Building the logical plan

• Query parsing:
   – The Pig interpreter parses each command and verifies that the input files
     and bags it references are valid.

• On-the-fly compilation:
   – Pig compiles the logical plan for a bag into a physical plan (Map-Reduce
     statements) only when the command can no longer be delayed and must be
     executed.

• Lazy characteristics:
   – No processing is carried out while the logical plan is being built.
   – Processing is triggered only when the user invokes the STORE command
     on a bag.
   – Lazy execution permits in-memory pipelining and other interesting
     optimizations.
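
The lazy behaviour described above can be illustrated with a toy plan builder in Python. It is a sketch of the idea, not Pig's implementation: the class `Plan`, its methods and the `rows_read` counter are all hypothetical names introduced for this example.

```python
class Plan:
    """One node of a logical plan. Creating nodes processes no data."""

    rows_read = 0  # how many input rows have been read so far

    def __init__(self, name, parent=None, source=None, transform=None):
        self.name = name
        self.parent = parent
        self.source = source
        self.transform = transform

    def filter(self, predicate):
        return Plan("FILTER", parent=self,
                    transform=lambda rows: (r for r in rows if predicate(r)))

    def foreach(self, generate):
        return Plan("FOREACH", parent=self,
                    transform=lambda rows: (generate(r) for r in rows))

    def describe(self):
        """List the operators of the plan from input to output."""
        names, node = [], self
        while node is not None:
            names.append(node.name)
            node = node.parent
        return " -> ".join(reversed(names))

    def _rows(self):
        # Generators chain the operators, so rows flow through the whole
        # plan one at a time without intermediate collections.
        if self.parent is None:
            for row in self.source:
                Plan.rows_read += 1
                yield row
        else:
            yield from self.transform(self.parent._rows())

    def store(self):
        """Only this call triggers execution of the plan."""
        return list(self._rows())


visits = Plan("LOAD", source=[
    ("Amy", "cnn.com", "8am"),
    ("Amy", "nytimes.com", "9am"),
    ("Bob", "elmundotoday.com", "11am"),
])
amy_urls = visits.filter(lambda r: r[0] == "Amy").foreach(lambda r: (r[1],))

rows_read_before_store = Plan.rows_read  # still zero: nothing has run
result = amy_urls.store()
print(amy_urls.describe(), rows_read_before_store, Plan.rows_read, result)
```

Because the filter and the projection are chained generators, each row passes through both steps before the next row is read, which is a small-scale version of the in-memory pipelining mentioned above.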



Implementation: Map-Reduce plan compilation
• (CO)GROUP:
   – Each command is compiled into a distinct map-reduce job with its own
     map and reduce functions.
   – Parallelism is achieved because the output of multiple map instances is
     repartitioned in parallel to multiple reduce instances.

• LOAD:
   – Parallelism is obtained because Pig operates on files residing in the
     Hadoop distributed file system.

• FILTER/FOREACH:
   – Parallelism is automatic, because a map-reduce job runs several map
     and reduce instances in parallel.

• ORDER (compiled into two map-reduce jobs):
   – First: determines the quantiles of the sort key.
   – Second: partitions the data according to those quantiles and sorts locally
     in the reduce phase, which results in a globally sorted file.
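
The compilation of GROUP and ORDER can be sketched with a single-process stand-in for map-reduce. Everything here is an assumption made for illustration: the helper names `map_reduce` and `order`, the CRC32 partitioner, and computing quantiles from all keys rather than from a sample.

```python
import bisect
import zlib
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn, partitioner, reducers):
    """Run map, shuffle by key into partitions, then reduce each partition."""
    partitions = [defaultdict(list) for _ in range(reducers)]
    for record in records:
        for key, value in map_fn(record):
            partitions[partitioner(key)][key].append(value)
    # Each reducer sees its keys in sorted order, independently of the others.
    return [[reduce_fn(k, vs) for k, vs in sorted(p.items())]
            for p in partitions]

vp = [
    ("Amy", "cnn.com", "8am", "cnn.com", 0.8),
    ("Amy", "nytimes.com", "9am", "nytimes.com", 0.6),
    ("Bob", "elmundotoday.com", "11am", "elmundotoday.com", 0.2),
]

# GROUP vp BY user as one job: map emits (user, tuple), reduce builds the bag.
grouped = map_reduce(
    vp,
    map_fn=lambda r: [(r[0], r)],
    reduce_fn=lambda user, rows: (user, rows),
    partitioner=lambda key: zlib.crc32(key.encode()) % 2,
    reducers=2,
)

def order(records, key_fn, reducers):
    """ORDER as two steps: find quantiles, then range-partition and sort."""
    # Job 1: cut points that split the sort keys into `reducers` ranges.
    keys = sorted(key_fn(r) for r in records)
    cuts = [keys[len(keys) * i // reducers] for i in range(1, reducers)]
    # Job 2: send each record to the reducer owning its range; every
    # reducer sorts locally, so concatenating the outputs is globally sorted.
    parts = map_reduce(
        records,
        map_fn=lambda r: [(key_fn(r), r)],
        reduce_fn=lambda key, rows: rows,
        partitioner=lambda key: bisect.bisect_right(cuts, key),
        reducers=reducers,
    )
    return [row for part in parts for rows in part for row in rows]

pages = [("cnn.com", 0.8), ("nytimes.com", 0.6), ("elmundotoday.com", 0.2)]
by_rank = order(pages, key_fn=lambda r: r[1], reducers=2)
print(by_rank)
```

Range partitioning is what makes the second job produce a global order: every key in one reducer's range is smaller than every key in the next reducer's range.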

Part 4: Conclusions

Conclusions
• Advantages:
     –   Step-by-step syntax.
     –   Flexible: UDFs, not locked to a fixed schema (allows schema changes over time).
     –   Exposes a set of widely used functions: FOREACH, FILTER, ORDER, GROUP, …
     –   Takes advantage of native Hadoop properties such as parallelism, load balancing and fault tolerance.
     –   Debugging environment.
     –   Open source (IMPORTANT!!)

• Disadvantages:
     –   UDFs can be a source of performance loss (control is left to the user).
     –   Overhead of compiling Pig Latin into map-reduce jobs.

• Usage scenarios:
     –   Temporal analysis: analyzing search logs mainly involves studying how the search query distribution
         changes over time.
     –   Session analysis: web user sessions, i.e., sequences of page views and clicks made by users, are
         analyzed to calculate metrics such as:
            –   how long is the average user session?
            –   how many links does a user click on before leaving a website?
     –   Others, ...



Q&A




JOIN vs COGROUP




FLATTENED





  • 1. EEDC 34330: Execution Environments for Distributed Computing. Master in Computer Architecture, Networks and Systems - CANS. Apache Pig. Homework number: 3. Group number: EEDC-3. Group members: Javier Álvarez – javicid@gmail.com, Francesc Lordan – francesc.lordan@gmail.com, Roger Rafanell – rogerrafanell@gmail.com
  • 2. Outline
      1.- Introduction
      2.- Pig Latin
          2.1.- Data model
          2.2.- Relational commands
      3.- Implementation
      4.- Conclusions
  • 3. Part 1: Introduction
  • 4. Why Apache Pig? Today’s Internet companies need to process huge data sets:
      – Parallel databases can be prohibitively expensive at this scale.
      – Programmers tend to find declarative languages such as SQL very unnatural.
      – Other approaches, such as map-reduce, are low-level and rigid.
  • 5. What is Apache Pig? A platform for analyzing large data sets that:
      – Is based on Pig Latin, a language that lies between declarative (SQL) and procedural (C++) programming languages.
      – At the same time, enables the construction of programs with an easily parallelizable structure.
  • 6. Which features does it have?
      Dataflow Language
        – Data processing is expressed step by step.
      Quick Start & Interoperability
        – Pig can work over any kind of input and produce any kind of output.
      Nested Data Model
        – Pig works with complex types like tuples, bags, ...
      User Defined Functions (UDFs)
        – Potentially in any programming language (only Java for the moment).
      Only parallel
        – Pig Latin restricts the programmer to directives that can be parallelized directly.
      Debugging environment
        – Debugging while the program is being written.
  • 7. Part 2: Pig Latin
  • 8. Section 2.1: Data model
  • 9. Data Model. A very rich data model consisting of four data types:
      Atom: a simple atomic value such as a string or a number.
          'Alice'
      Tuple: a sequence of fields of any data type.
          ('Alice', 'Apple')
          ('Alice', ('Barça', 'football'))
      Bag: a collection of tuples with possible duplicates.
          { ('Alice', 'Apple'), ('Alice', ('Barça', 'football')) }
      Map: a collection of data items, each with an associated key (always an atom).
          'Fan of' → { ('Apple'), ('Barça', 'football') }
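The nesting of the four data types can be sketched with plain Python values. The mapping used here (tuple to Python tuple, bag to list, map to dict) and the helper `count_duplicates` are assumptions made for illustration, not Pig's internal representation.

```python
# Atom: a simple atomic value.
atom = "Alice"

# Tuple: a sequence of fields; a field may itself be a tuple.
flat_tuple = ("Alice", "Apple")
nested_tuple = ("Alice", ("Barça", "football"))

# Bag: a collection of tuples in which duplicates are allowed,
# so a list is used here rather than a set.
bag = [flat_tuple, nested_tuple, flat_tuple]

# Map: data items looked up by an atom key; here the item is a bag.
fan_map = {"Fan of": [("Apple",), ("Barça", "football")]}


def count_duplicates(some_bag):
    """Number of tuples in the bag beyond the first copy of each."""
    return len(some_bag) - len(set(some_bag))


print(count_duplicates(bag))
```

A list keeps the repeated `flat_tuple`, which is the property that distinguishes a bag from a set.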
  • 10. Section 2.2: Relational commands
  • 11. Relational commands
      visits = LOAD 'visits.txt' AS (user, url, time);
      pages = LOAD 'pages.txt' AS (url, rank);

      visits: ('Amy', 'cnn.com', '8am')
              ('Amy', 'nytimes.com', '9am')
              ('Bob', 'elmundotoday.com', '11am')
      pages:  ('cnn.com', '0.8')
              ('nytimes.com', '0.6')
              ('elmundotoday.com', '0.2')
  • 12. Relational commands
      visits = LOAD 'visits.txt' AS (user, url, time);
      pages = LOAD 'pages.txt' AS (url, rank);
      vp = JOIN visits BY url, pages BY url;

      vp: ('Amy', 'cnn.com', '8am', 'cnn.com', '0.8')
          ('Amy', 'nytimes.com', '9am', 'nytimes.com', '0.6')
          ('Bob', 'elmundotoday.com', '11am', 'elmundotoday.com', '0.2')
  • 13. Relational commands
      visits = LOAD 'visits.txt' AS (user, url, time);
      pages = LOAD 'pages.txt' AS (url, rank);
      vp = JOIN visits BY url, pages BY url;
      users = GROUP vp BY user;

      users: ('Amy', { ('Amy', 'cnn.com', '8am', 'cnn.com', '0.8'),
                       ('Amy', 'nytimes.com', '9am', 'nytimes.com', '0.6') })
             ('Bob', { ('Bob', 'elmundotoday.com', '11am', 'elmundotoday.com', '0.2') })
  • 14. Relational commands
      visits = LOAD 'visits.txt' AS (user, url, time);
      pages = LOAD 'pages.txt' AS (url, rank);
      vp = JOIN visits BY url, pages BY url;
      users = GROUP vp BY user;
      useravg = FOREACH users GENERATE group, AVG(vp.rank) AS avgpr;

      useravg: ('Amy', '0.7')
               ('Bob', '0.2')
  • 15. Relational commands
      visits = LOAD 'visits.txt' AS (user, url, time);
      pages = LOAD 'pages.txt' AS (url, rank);
      vp = JOIN visits BY url, pages BY url;
      users = GROUP vp BY user;
      useravg = FOREACH users GENERATE group, AVG(vp.rank) AS avgpr;
      answer = FILTER useravg BY avgpr > 0.5;

      answer: ('Amy', '0.7')
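To make the data flow of the five steps concrete, here is a minimal sketch that replays them on the sample rows in plain Python. It models what each command produces and is not how Pig executes them; the function names are hypothetical, and the ranks are stored as numbers (an assumption) so that the average and the comparison are numeric.

```python
from collections import defaultdict

# Sample rows from the slides; ranks are numeric here (assumption).
visits = [
    ("Amy", "cnn.com", "8am"),
    ("Amy", "nytimes.com", "9am"),
    ("Bob", "elmundotoday.com", "11am"),
]
pages = [
    ("cnn.com", 0.8),
    ("nytimes.com", 0.6),
    ("elmundotoday.com", 0.2),
]


def join_by_url(visit_rows, page_rows):
    """JOIN visits BY url, pages BY url: concatenate matching tuples."""
    by_url = defaultdict(list)
    for page in page_rows:
        by_url[page[0]].append(page)
    return [v + p for v in visit_rows for p in by_url.get(v[1], [])]


def group_by_user(rows):
    """GROUP vp BY user: one (user, bag of rows) tuple per user."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[0]].append(row)
    return list(groups.items())


def average_rank(groups):
    """FOREACH users GENERATE group, AVG(vp.rank): rank is field 4."""
    return [(user, sum(r[4] for r in rows) / len(rows)) for user, rows in groups]


def filter_above(rows, threshold):
    """FILTER useravg BY avgpr > threshold."""
    return [row for row in rows if row[1] > threshold]


vp = join_by_url(visits, pages)
users = group_by_user(vp)
useravg = average_rank(users)
answer = filter_above(useravg, 0.5)
print(answer)
```

Each intermediate variable corresponds to one relation in the script, so any step can be inspected on its own.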
  • 16. Relational commands. Other relational operators:
      – STORE: exports data into a file.
          STORE var1_name INTO 'output.txt';
      – COGROUP: groups together tuples from different data sets.
          COGROUP var1_name BY field_id, var2_name BY field_id;
      – UNION: computes the union of two variables.
      – CROSS: computes the cross product.
      – ORDER: sorts a data set by one or more fields.
      – DISTINCT: removes duplicate tuples from a data set.
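The semantics of COGROUP, UNION, CROSS and DISTINCT can be sketched in plain Python as follows. These are hypothetical helper functions that model the results only, and the sample rows are made up for the illustration; Pig itself gives no ordering guarantee for these operators, so the sorting below is there only to make the output deterministic.

```python
from itertools import product


def cogroup(left, right, left_key, right_key):
    """One tuple per key: (key, bag of left tuples, bag of right tuples)."""
    keys = sorted({t[left_key] for t in left} | {t[right_key] for t in right})
    return [
        (
            k,
            [t for t in left if t[left_key] == k],
            [t for t in right if t[right_key] == k],
        )
        for k in keys
    ]


def union(a, b):
    """UNION keeps duplicates, since the operands are bags."""
    return a + b


def cross(a, b):
    """CROSS pairs every tuple of a with every tuple of b."""
    return [x + y for x, y in product(a, b)]


def distinct(a):
    """DISTINCT drops repeated tuples (sorted for a stable result)."""
    return sorted(set(a))


visit_rows = [("Amy", "cnn.com"), ("Bob", "cnn.com"), ("Amy", "elmundotoday.com")]
page_rows = [("cnn.com", 0.8), ("nytimes.com", 0.6)]

for row in cogroup(visit_rows, page_rows, 1, 0):
    print(row)
```

Unlike JOIN, COGROUP keeps keys that appear in only one input: the other side is simply an empty bag.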
  • 17. Part 3: Implementation
  • 18. Implementation: Highlights
      Works on top of the Hadoop ecosystem:
        – The current implementation uses Hadoop as its execution platform.
      On-the-fly compilation:
        – Pig translates the Pig Latin commands into Map and Reduce methods.
      Lazy-style language:
        – Pig tries to postpone data materialization (writes to disk) as long as possible.
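As an illustration of what translating a command into Map and Reduce methods can look like, the sketch below expresses a GROUP as a map step, a repartition (shuffle) step and a reduce step. It is a single-process model with hypothetical function names, not the code Pig generates for Hadoop.

```python
from collections import defaultdict


def map_phase(records, key_index):
    """Map: emit a (key, record) pair for every input record."""
    return [(rec[key_index], rec) for rec in records]


def shuffle(pairs):
    """Repartition: bring all values that share a key together."""
    partitions = defaultdict(list)
    for key, value in pairs:
        partitions[key].append(value)
    return partitions


def reduce_phase(partitions):
    """Reduce: emit one (key, bag) tuple per key."""
    return [(key, bag) for key, bag in sorted(partitions.items())]


def group_by(records, key_index):
    """GROUP expressed as map, shuffle and reduce."""
    return reduce_phase(shuffle(map_phase(records, key_index)))


visit_rows = [
    ("Amy", "cnn.com", "8am"),
    ("Amy", "nytimes.com", "9am"),
    ("Bob", "elmundotoday.com", "11am"),
]
print(group_by(visit_rows, 0))
```

On a cluster the map and reduce functions run as many parallel instances, and the shuffle is what moves each key to a single reducer.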
  • 19. Implementation: Building the logical plan
      Query parsing:
        – The Pig interpreter parses the commands and verifies that the input files and bags they reference are valid.
      On-the-fly compilation:
        – Pig compiles the logical plan for a bag into a physical plan (map-reduce statements) only when the command can no longer be delayed and must be executed.
      Lazy characteristics:
        – No processing is carried out while the logical plan is being built.
        – Processing is triggered only when the user invokes the STORE command on a bag.
        – Lazy-style execution permits in-memory pipelining and other interesting optimizations.
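The lazy behaviour described above can be sketched with a small plan-building class: each command only adds a node to the plan, and nothing runs until `store()` is called. The `Relation` class, the `load` helper and the `executed` log are hypothetical names for this illustration, and the sketch skips the compilation to map-reduce entirely.

```python
class Relation:
    """A node of a logical plan: an operator plus the relations it reads."""

    executed = []  # log of operators that actually ran, for illustration

    def __init__(self, op, func, inputs=()):
        self.op = op
        self.func = func
        self.inputs = inputs

    def filter(self, predicate):
        """Add a FILTER node; no rows are touched yet."""
        return Relation(
            "FILTER", lambda rows: [r for r in rows if predicate(r)], (self,)
        )

    def foreach(self, generate):
        """Add a FOREACH node; no rows are touched yet."""
        return Relation(
            "FOREACH", lambda rows: [generate(r) for r in rows], (self,)
        )

    def _run(self):
        # Evaluate the inputs first, then this operator.
        args = [parent._run() for parent in self.inputs]
        Relation.executed.append(self.op)
        return self.func(*args)

    def store(self):
        """STORE is the only call that triggers processing."""
        return self._run()


def load(rows):
    """Add a LOAD node over an in-memory list standing in for a file."""
    return Relation("LOAD", lambda: list(rows))


visits = load([("Amy", "cnn.com"), ("Bob", "elmundotoday.com")])
urls = visits.filter(lambda r: r[0] == "Amy").foreach(lambda r: (r[1],))
ran_before_store = list(Relation.executed)
result = urls.store()
ran_after_store = list(Relation.executed)
print(ran_before_store, ran_after_store, result)
```

Because the whole plan is known before anything runs, an engine is free to pipeline FILTER and FOREACH in memory instead of materializing each intermediate bag.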
  • 20. Implementation: Map-Reduce plan compilation
      (CO)GROUP:
        – Each command is compiled into a distinct map-reduce job with its own map and reduce functions.
        – Parallelism is achieved because the output of multiple map instances is repartitioned in parallel to multiple reduce instances.
      LOAD:
        – Parallelism is obtained because Pig operates over files residing in the Hadoop distributed file system.
      FILTER/FOREACH:
        – Parallelism is automatic because several map and reduce instances of a map-reduce job run in parallel.
      ORDER (compiled into two map-reduce jobs):
        – First: determines the quantiles of the sort key.
        – Second: partitions the data according to the quantiles and sorts locally in the reduce phase, resulting in a globally sorted file.
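The two-job ORDER strategy can be sketched as follows: a first pass samples the sort key to estimate quantile boundaries, and a second pass routes each record to a range partition and sorts every partition locally. This is a single-process model under assumed parameters (`num_partitions`, `sample_size`, `seed`), with hypothetical function names; it is not Pig's actual sampler or partitioner.

```python
import bisect
import random


def sample_quantiles(keys, num_partitions, sample_size=100, seed=0):
    """Job 1: estimate partition boundaries from a sample of the sort key."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    if not sample:
        return []
    step = len(sample) / num_partitions
    return [sample[int(i * step)] for i in range(1, num_partitions)]


def range_partition_sort(records, key, boundaries):
    """Job 2: route records to range partitions, then sort each locally."""
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for rec in records:
        partitions[bisect.bisect_right(boundaries, key(rec))].append(rec)
    # Every key in partition i is <= every key in partition i + 1, so
    # concatenating the locally sorted partitions is globally sorted.
    return [rec for part in partitions for rec in sorted(part, key=key)]


def order_by(records, key, num_partitions=3):
    """ORDER modeled as the two jobs run one after the other."""
    boundaries = sample_quantiles([key(r) for r in records], num_partitions)
    return range_partition_sort(records, key, boundaries)


page_rows = [("cnn.com", 0.8), ("nytimes.com", 0.6), ("elmundotoday.com", 0.2)]
print(order_by(page_rows, key=lambda r: r[1]))
```

Choosing the boundaries from quantiles rather than fixed ranges is what keeps the partitions, and therefore the reducers, roughly balanced.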
  • 21. Part 4: Conclusions
  • 22. Conclusions
      Advantages:
        – Step-by-step syntax.
        – Flexible: UDFs, not locked to a fixed schema (allows schema changes over time).
        – Exposes a set of widely used functions: FOREACH, FILTER, ORDER, GROUP, …
        – Takes advantage of Hadoop's native properties, such as parallelism, load balancing and fault tolerance.
        – Debugging environment.
        – Open Source (IMPORTANT!!)
      Disadvantages:
        – UDFs can be a source of performance loss (control rests with the user).
        – Overhead while compiling Pig Latin into map-reduce jobs.
      Usage Scenarios:
        – Temporal analysis: analyzing search logs mainly involves studying how the search query distribution changes over time.
        – Session analysis: web user sessions, i.e., sequences of page views and clicks made by users, are analyzed to calculate metrics such as:
            – How long is the average user session?
            – How many links does a user click on before leaving a website?
        – Others, ...
  • 23. Q&A
  • 25. FLATTERED