SlideShare a Scribd company logo
EEDC
                          34330
Execution                                   Apache Pig
Environments for
Distributed
Computing
Master in Computer Architecture,
Networks and Systems - CANS



                                           Homework number: 3
                                          Group number: EEDC-3
                                             Group members:
                                       Javier Álvarez – javicid@gmail.com
                                   Francesc Lordan – francesc.lordan@gmail.com
                                     Roger Rafanell – rogerrafanell@gmail.com
Outline

1.- Introduction

2.- Pig Latin
    2.1.- Data model
    2.2.- Relational commands

3.- Implementation

4.- Conclusions




                                2
EEDC
                          34330
Execution
Environments for
Distributed
Computing
Master in Computer Architecture,      Part 1
Networks and Systems - CANS        Introduction
Why Apache Pig?

Today’s Internet companies needs to process hugh data sets:

   – Parallel databases can be prohibitively expensive at this scale.

   – Programmers tend to find declarative languages such as SQL very
     unnatural.

   – Other approaches such map-reduce are low-level and rigid.




                                       4
What is Apache Pig?

A platform for analyzing large data sets that:

   – It is based in Pig Latin which lies between declarative (SQL) and
     procedural (C++) programming languages.

   – At the same time, enables the construction of programs with an easy
     parallelizable structure.




                                      5
Which features does it have?
 Dataflow Language
   – Data processing is expressed step-by-step.

 Quick Start & Interoperability
   – Pig can work over any kind of input and produce any kind of output.

 Nested Data Model
   – Pig works with complex types like tuples, bags, ...

 User Defined Functions (UDFs)
   – Potentially in any programming language (only Java for the moment).

 Only parallel
   – Pig Latin forces to use directives that are parallelizable in a direct way.

 Debugging environment
   – Debugging at programming time.
                                        6
EEDC
                          34330
Execution
Environments for
Distributed
Computing
Master in Computer Architecture,     Part 2
Networks and Systems - CANS        Pig Latin
EEDC
                          34330
Execution
Environments for
Distributed
Computing
Master in Computer Architecture,    Section 2.1
Networks and Systems - CANS        Data model
Data Model
Very rich data model consisting on 4 simple data types:

 Atom: Simple atomic value such as strings or numbers.
                                        ‘Alice’


 Tuple: Sequence of fields of any type of data.
                                   (‘Alice’, ‘Apple’)
                            (‘Alice’, (‘Barça’, ‘football’))


 Bag: collection of tuples with possible duplicates.
                                      (‘Alice’, ‘Apple’)
                               (‘Alice’, (‘Barça’, ‘football’))


 Map: collection of data items with an associated key (always an atom).

                              ‘Fan of’        (‘Apple’)
                                           (‘Barça’, ‘football’)
                                            9
EEDC
                          34330
Execution
Environments for
Distributed
Computing                           Section 2.2
Master in Computer Architecture,
Networks and Systems - CANS        Relational
                                   commands
Relational commands
visits = LOAD ‘visits.txt’ AS (user, url, time)
pages = LOAD `pages.txt` AS (url, rank);




visits:   (‘Amy’, ‘cnn.com’, ‘8am’)
          (‘Amy’, ‘nytimes.com’, ‘9am’)
          (‘Bob’, ‘elmundotoday.com’, ’11am’)

pages: (‘cnn.com’, ‘0.8’)
       (‘nytimes.com’, ‘0.6’)
       (‘elmundotoday’, ‘0.2’)


                                       11
Relational commands
visits = LOAD ‘visits.txt’ AS (user, url, time)
pages = LOAD `pages.txt` AS (url, rank);
vp = JOIN visits BY url, pages BY url




v_p:(‘Amy’, ‘cnn.com’, ‘8am’, ‘cnn.com’, ‘0.8’)
    (‘Amy’, ‘nytimes.com’, ‘9am’, ‘nytimes.com, ‘0.6’)
    (‘Bob’, ‘elmundotoday.com’, ’11am’, ‘elmundotoday.com’, ‘0.2’)




                                      12
Relational commands
visits = LOAD ‘visits.txt’ AS (user, url, time)
pages = LOAD `pages.txt` AS (url, rank);
vp = JOIN visits BY url, pages BY url
users = GROUP vp BY user




user:   (‘Amy’, { (‘Amy’, ‘cnn.com’, ‘8am’, ‘cnn.com’, ‘0.8’),
                  (‘Amy’, ‘nytimes.com’, ‘9am’, ‘nytimes.com, ‘0.6’)})

        (‘Bob’,   {‘elmundotoday.com’, ’11am’, ‘elmundotoday.com’, ‘0.2’)})


                                       13
Relational commands
visits = LOAD ‘visits.txt’ AS (user, url, time)
pages = LOAD `pages.txt` AS (url, rank);
vp = JOIN visits BY url, pages BY url
users = GROUP vp BY user
useravg = FOREACH users GENERATE group, AVG(vp.rank) AS avgpr




user:   (‘Amy’, ‘0.7’)
        (‘Bob’, ‘0.2’)




                             14
Relational commands
visits = LOAD ‘visits.txt’ AS (user, url, time)
pages = LOAD `pages.txt` AS (url, rank);
vp = JOIN visits BY url, pages BY url
users = GROUP vp BY user
useravg = FOREACH users GENERATE group, AVG(vp.rank) AS avgpr
answer = FILTER useravg BY avgpr > ‘0.5’




answer: (‘Amy’, ‘0.7’)




                             15
Relational commands
Other relational operators:

    – STORE : exports data into a file.
          STORE var1_name INTO 'output.txt‘;

    – COGROUP : groups together tuples from diferent datasets.
          COGROUP var1_name BY field_id, var2_name BY field_id

    –   UNION : computes the union of two variables.
    –   CROSS : computes the cross product.
    –   ORDER : sorts a data set by one or more fields.
    –   DISTINCT : removes replicated tuples in a dataset.




                                           16
EEDC
                          34330
Execution
Environments for
Distributed
Computing
Master in Computer Architecture,        Part 3
Networks and Systems - CANS        Implementation
Implementation: Highlights

 Works on top of Hadoop ecosystem:
   – Current implementation uses Hadoop as execution platform.

 On-the-fly compilation:
   – Pig translates the Pig Latin commands to Map and Reduce methods.

 Lazy style language:
   – Pig try to pospone the data materialization (on disk writes) as much as
     possible.




                                     18
Implementation: Building the logical plan

 Query parsing:
   – Pig interpreter parses the commands verifying that the input files and
     bags referenced are valid.

 On-the-fly compilation:
   – Pig compiles the logical plan for that bag into physical plan (Map-Reduce
     statements) when the command cannot be more delayed and must be
     executed.

 Lazy characteristics:
   – No processing are carried out when the logical plan are build up.
   – Processing is triggered only when the user invokes STORE command
     on a bag.
   – Lazy style execution permits in-memory pipelining and other interesting
     optimizations.



                                      19
Implementation: Map-Reduce plan compilation
 CO(GROUP):
   – Each command is compiled in a distinct map-reduce job with its own
     map and reduce functions.
   – Parallelism is achieved since the output of multiple map instances is
     repartitioned in parallel to multiple reduce instances.

 LOAD:
   – Parallelism is obtained since Pig operates over files residing in the
     Hadoop distributed file system.

 FILTER/FOREACH:
   – Automatic parallelism is given since for a map-reduce job several map
     and reduce instances are run in parallel.

 ORDER (compiled in two map-reduce jobs):
   – First: Determine quantiles of the sort key
   – Second: Chops the job according the quantiles and performs a local
     sorting in the reduce phase resulting in a global sorted file.
                                      20
EEDC
                          34330
Execution
Environments for
Distributed
Computing
Master in Computer Architecture,      Part 4
Networks and Systems - CANS        Conclusions
Conclusions
   Advantages:
     –   Step-by-step syntaxis.
     –   Flexible: UDFs, not locked to a fixed schema (allows schema changes over the time).
     –   Exposes a set of widely used functions: FOREACH, FILTER, ORDER, GROUP, …
     –   Takes advantage of Hadoop native properties such: parallelism, load-balancing, fault-tolerance.
     –   Debugging environment.
     –   Open Source (IMPORTANT!!)

   Disadvantages:
     –   UDFs methods could be a source of performance loss (the control relies on user).
     –   Overhead while compiling Pig Latin into map-reduce jobs.

   Usage Scenarios:
     –   Temporal analysis: search logs mainly involves studying how search query distribution changes
         over time.
     –   Session analysis: web user sessions, i.e, sequences of page views and clicks made by users are
         analized to calculate some metrics such:
     –   how long is the average user session?
     –   how many links does a user click on before leaving a website?
     –   Others, ...



                                                    22
Q&A




      23
JOIN vs COGROUP




             24
FLATTERED




            25

More Related Content

What's hot

mesos-devoxx14
mesos-devoxx14mesos-devoxx14
mesos-devoxx14
Samir Bessalah
 
Hadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your Application
Hadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your ApplicationHadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your Application
Hadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your Application
Yahoo Developer Network
 
Linux
LinuxLinux
02unixintro
02unixintro02unixintro
02unixintro
Zhiwen Guo
 
O connor bosc2010
O connor bosc2010O connor bosc2010
O connor bosc2010
BOSC 2010
 
RapidInsight for OpenNMS
RapidInsight for OpenNMSRapidInsight for OpenNMS
RapidInsight for OpenNMS
mberkay
 
Optimizing MapReduce Job performance
Optimizing MapReduce Job performanceOptimizing MapReduce Job performance
Optimizing MapReduce Job performance
DataWorks Summit
 
Profile hadoop apps
Profile hadoop appsProfile hadoop apps
Profile hadoop apps
Basant Verma
 
Asian Spirit 3 Day Dba On Ubl
Asian Spirit 3 Day Dba On UblAsian Spirit 3 Day Dba On Ubl
Asian Spirit 3 Day Dba On Ubl
newrforce
 
MongoDB & Hadoop: Flexible Hourly Batch Processing Model
MongoDB & Hadoop: Flexible Hourly Batch Processing ModelMongoDB & Hadoop: Flexible Hourly Batch Processing Model
MongoDB & Hadoop: Flexible Hourly Batch Processing Model
Takahiro Inoue
 
Installing Apache Hive, internal and external table, import-export
Installing Apache Hive, internal and external table, import-export Installing Apache Hive, internal and external table, import-export
Installing Apache Hive, internal and external table, import-export
Rupak Roy
 
Using R on High Performance Computers
Using R on High Performance ComputersUsing R on High Performance Computers
Using R on High Performance Computers
Dave Hiltbrand
 
03 pig intro
03 pig intro03 pig intro
03 pig intro
Subhas Kumar Ghosh
 
OpenNMS introduction
OpenNMS introductionOpenNMS introduction
OpenNMS introduction
Guider Lee
 
Replication, Durability, and Disaster Recovery
Replication, Durability, and Disaster RecoveryReplication, Durability, and Disaster Recovery
Replication, Durability, and Disaster Recovery
Steven Francia
 
Hive Quick Start Tutorial
Hive Quick Start TutorialHive Quick Start Tutorial
Hive Quick Start Tutorial
Carl Steinbach
 
Linux
LinuxLinux
Linux
Rathan Raj
 
Configuring and manipulating HDFS files
Configuring and manipulating HDFS filesConfiguring and manipulating HDFS files
Configuring and manipulating HDFS files
Rupak Roy
 
Drupal Basics
Drupal BasicsDrupal Basics
Drupal Basics
Ryan Cross
 
02 Hadoop deployment and configuration
02 Hadoop deployment and configuration02 Hadoop deployment and configuration
02 Hadoop deployment and configuration
Subhas Kumar Ghosh
 

What's hot (20)

mesos-devoxx14
mesos-devoxx14mesos-devoxx14
mesos-devoxx14
 
Hadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your Application
Hadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your ApplicationHadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your Application
Hadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your Application
 
Linux
LinuxLinux
Linux
 
02unixintro
02unixintro02unixintro
02unixintro
 
O connor bosc2010
O connor bosc2010O connor bosc2010
O connor bosc2010
 
RapidInsight for OpenNMS
RapidInsight for OpenNMSRapidInsight for OpenNMS
RapidInsight for OpenNMS
 
Optimizing MapReduce Job performance
Optimizing MapReduce Job performanceOptimizing MapReduce Job performance
Optimizing MapReduce Job performance
 
Profile hadoop apps
Profile hadoop appsProfile hadoop apps
Profile hadoop apps
 
Asian Spirit 3 Day Dba On Ubl
Asian Spirit 3 Day Dba On UblAsian Spirit 3 Day Dba On Ubl
Asian Spirit 3 Day Dba On Ubl
 
MongoDB & Hadoop: Flexible Hourly Batch Processing Model
MongoDB & Hadoop: Flexible Hourly Batch Processing ModelMongoDB & Hadoop: Flexible Hourly Batch Processing Model
MongoDB & Hadoop: Flexible Hourly Batch Processing Model
 
Installing Apache Hive, internal and external table, import-export
Installing Apache Hive, internal and external table, import-export Installing Apache Hive, internal and external table, import-export
Installing Apache Hive, internal and external table, import-export
 
Using R on High Performance Computers
Using R on High Performance ComputersUsing R on High Performance Computers
Using R on High Performance Computers
 
03 pig intro
03 pig intro03 pig intro
03 pig intro
 
OpenNMS introduction
OpenNMS introductionOpenNMS introduction
OpenNMS introduction
 
Replication, Durability, and Disaster Recovery
Replication, Durability, and Disaster RecoveryReplication, Durability, and Disaster Recovery
Replication, Durability, and Disaster Recovery
 
Hive Quick Start Tutorial
Hive Quick Start TutorialHive Quick Start Tutorial
Hive Quick Start Tutorial
 
Linux
LinuxLinux
Linux
 
Configuring and manipulating HDFS files
Configuring and manipulating HDFS filesConfiguring and manipulating HDFS files
Configuring and manipulating HDFS files
 
Drupal Basics
Drupal BasicsDrupal Basics
Drupal Basics
 
02 Hadoop deployment and configuration
02 Hadoop deployment and configuration02 Hadoop deployment and configuration
02 Hadoop deployment and configuration
 

Similar to EEDC Apache Pig Language

Eedc.apache.pig last
Eedc.apache.pig lastEedc.apache.pig last
Eedc.apache.pig last
Francesc Lordan Gomis
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutions
Databricks
 
IS-ENES COMP Superscalar tutorial
IS-ENES COMP Superscalar tutorialIS-ENES COMP Superscalar tutorial
IS-ENES COMP Superscalar tutorial
Roger Rafanell Mas
 
The Anatomy Of The Google Architecture Fina Lv1.1
The Anatomy Of The Google Architecture Fina Lv1.1The Anatomy Of The Google Architecture Fina Lv1.1
The Anatomy Of The Google Architecture Fina Lv1.1
Hassy Veldstra
 
Apache Spark: What's under the hood
Apache Spark: What's under the hoodApache Spark: What's under the hood
Apache Spark: What's under the hood
Adarsh Pannu
 
Research computing at ILRI
Research computing at ILRIResearch computing at ILRI
Research computing at ILRI
ILRI
 
Programmable Exascale Supercomputer
Programmable Exascale SupercomputerProgrammable Exascale Supercomputer
Programmable Exascale Supercomputer
Sagar Dolas
 
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
Reynold Xin
 
Yaroslav Nedashkovsky "How to manage hundreds of pipelines for processing da...
Yaroslav Nedashkovsky  "How to manage hundreds of pipelines for processing da...Yaroslav Nedashkovsky  "How to manage hundreds of pipelines for processing da...
Yaroslav Nedashkovsky "How to manage hundreds of pipelines for processing da...
Lviv Startup Club
 
L19CloudMapReduce introduction for cloud computing .ppt
L19CloudMapReduce introduction for cloud computing .pptL19CloudMapReduce introduction for cloud computing .ppt
L19CloudMapReduce introduction for cloud computing .ppt
MaruthiPrasad96
 
MapReduce basics
MapReduce basicsMapReduce basics
MapReduce basics
Harisankar H
 
Exascale Capabl
Exascale CapablExascale Capabl
Exascale Capabl
Sagar Dolas
 
Holistic Aggregate Resource Environment
Holistic Aggregate Resource EnvironmentHolistic Aggregate Resource Environment
Holistic Aggregate Resource Environment
Eric Van Hensbergen
 
Microsoft cosmos
Microsoft cosmosMicrosoft cosmos
Microsoft cosmos
Karthik Murugesan
 
Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)
Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)
Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)
Spark Summit
 
ERS downscale2016
ERS downscale2016ERS downscale2016
ERS downscale2016
Victor de Boer
 
11. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:211. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:2
Fabio Fumarola
 
Apache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and SmarterApache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and Smarter
Databricks
 
Google Cluster Innards
Google Cluster InnardsGoogle Cluster Innards
Google Cluster Innards
Martin Dvorak
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
Benjamin Bengfort
 

Similar to EEDC Apache Pig Language (20)

Eedc.apache.pig last
Eedc.apache.pig lastEedc.apache.pig last
Eedc.apache.pig last
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutions
 
IS-ENES COMP Superscalar tutorial
IS-ENES COMP Superscalar tutorialIS-ENES COMP Superscalar tutorial
IS-ENES COMP Superscalar tutorial
 
The Anatomy Of The Google Architecture Fina Lv1.1
The Anatomy Of The Google Architecture Fina Lv1.1The Anatomy Of The Google Architecture Fina Lv1.1
The Anatomy Of The Google Architecture Fina Lv1.1
 
Apache Spark: What's under the hood
Apache Spark: What's under the hoodApache Spark: What's under the hood
Apache Spark: What's under the hood
 
Research computing at ILRI
Research computing at ILRIResearch computing at ILRI
Research computing at ILRI
 
Programmable Exascale Supercomputer
Programmable Exascale SupercomputerProgrammable Exascale Supercomputer
Programmable Exascale Supercomputer
 
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
 
Yaroslav Nedashkovsky "How to manage hundreds of pipelines for processing da...
Yaroslav Nedashkovsky  "How to manage hundreds of pipelines for processing da...Yaroslav Nedashkovsky  "How to manage hundreds of pipelines for processing da...
Yaroslav Nedashkovsky "How to manage hundreds of pipelines for processing da...
 
L19CloudMapReduce introduction for cloud computing .ppt
L19CloudMapReduce introduction for cloud computing .pptL19CloudMapReduce introduction for cloud computing .ppt
L19CloudMapReduce introduction for cloud computing .ppt
 
MapReduce basics
MapReduce basicsMapReduce basics
MapReduce basics
 
Exascale Capabl
Exascale CapablExascale Capabl
Exascale Capabl
 
Holistic Aggregate Resource Environment
Holistic Aggregate Resource EnvironmentHolistic Aggregate Resource Environment
Holistic Aggregate Resource Environment
 
Microsoft cosmos
Microsoft cosmosMicrosoft cosmos
Microsoft cosmos
 
Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)
Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)
Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)
 
ERS downscale2016
ERS downscale2016ERS downscale2016
ERS downscale2016
 
11. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:211. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:2
 
Apache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and SmarterApache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and Smarter
 
Google Cluster Innards
Google Cluster InnardsGoogle Cluster Innards
Google Cluster Innards
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
 

More from Roger Rafanell Mas

How to build a self-service data platform and what it can do for your business?
How to build a self-service data platform and what it can do for your business?How to build a self-service data platform and what it can do for your business?
How to build a self-service data platform and what it can do for your business?
Roger Rafanell Mas
 
Activate 2019 - Search and relevance at scale for online classifieds
Activate 2019 - Search and relevance at scale for online classifiedsActivate 2019 - Search and relevance at scale for online classifieds
Activate 2019 - Search and relevance at scale for online classifieds
Roger Rafanell Mas
 
Pensamiento lateral
Pensamiento lateralPensamiento lateral
Pensamiento lateral
Roger Rafanell Mas
 
Storm distributed cache workshop
Storm distributed cache workshopStorm distributed cache workshop
Storm distributed cache workshop
Roger Rafanell Mas
 
Profiling & Testing with Spark
Profiling & Testing with SparkProfiling & Testing with Spark
Profiling & Testing with Spark
Roger Rafanell Mas
 
MRI Energy-Efficient Cloud Computing
MRI Energy-Efficient Cloud ComputingMRI Energy-Efficient Cloud Computing
MRI Energy-Efficient Cloud Computing
Roger Rafanell Mas
 
SDS Amazon RDS
SDS Amazon RDSSDS Amazon RDS
SDS Amazon RDS
Roger Rafanell Mas
 
EEDC Programming Models
EEDC Programming ModelsEEDC Programming Models
EEDC Programming Models
Roger Rafanell Mas
 
EEDC Intelligent Placement of Datacenters
EEDC Intelligent Placement of DatacentersEEDC Intelligent Placement of Datacenters
EEDC Intelligent Placement of Datacenters
Roger Rafanell Mas
 
EEDC Everthing as a Service
EEDC Everthing as a ServiceEEDC Everthing as a Service
EEDC Everthing as a Service
Roger Rafanell Mas
 
EEDC Distributed Systems
EEDC Distributed SystemsEEDC Distributed Systems
EEDC Distributed Systems
Roger Rafanell Mas
 
EEDC SOAP vs REST
EEDC SOAP vs RESTEEDC SOAP vs REST
EEDC SOAP vs REST
Roger Rafanell Mas
 

More from Roger Rafanell Mas (12)

How to build a self-service data platform and what it can do for your business?
How to build a self-service data platform and what it can do for your business?How to build a self-service data platform and what it can do for your business?
How to build a self-service data platform and what it can do for your business?
 
Activate 2019 - Search and relevance at scale for online classifieds
Activate 2019 - Search and relevance at scale for online classifiedsActivate 2019 - Search and relevance at scale for online classifieds
Activate 2019 - Search and relevance at scale for online classifieds
 
Pensamiento lateral
Pensamiento lateralPensamiento lateral
Pensamiento lateral
 
Storm distributed cache workshop
Storm distributed cache workshopStorm distributed cache workshop
Storm distributed cache workshop
 
Profiling & Testing with Spark
Profiling & Testing with SparkProfiling & Testing with Spark
Profiling & Testing with Spark
 
MRI Energy-Efficient Cloud Computing
MRI Energy-Efficient Cloud ComputingMRI Energy-Efficient Cloud Computing
MRI Energy-Efficient Cloud Computing
 
SDS Amazon RDS
SDS Amazon RDSSDS Amazon RDS
SDS Amazon RDS
 
EEDC Programming Models
EEDC Programming ModelsEEDC Programming Models
EEDC Programming Models
 
EEDC Intelligent Placement of Datacenters
EEDC Intelligent Placement of DatacentersEEDC Intelligent Placement of Datacenters
EEDC Intelligent Placement of Datacenters
 
EEDC Everthing as a Service
EEDC Everthing as a ServiceEEDC Everthing as a Service
EEDC Everthing as a Service
 
EEDC Distributed Systems
EEDC Distributed SystemsEEDC Distributed Systems
EEDC Distributed Systems
 
EEDC SOAP vs REST
EEDC SOAP vs RESTEEDC SOAP vs REST
EEDC SOAP vs REST
 

Recently uploaded

Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
mikeeftimakis1
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
Matthew Sinclair
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
Octavian Nadolu
 
Data structures and Algorithms in Python.pdf
Data structures and Algorithms in Python.pdfData structures and Algorithms in Python.pdf
Data structures and Algorithms in Python.pdf
TIPNGVN2
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
Aftab Hussain
 
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AIEnchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Vladimir Iglovikov, Ph.D.
 
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
Edge AI and Vision Alliance
 
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
Neo4j
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
Quotidiano Piemontese
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
名前 です男
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
Kari Kakkonen
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Paige Cruz
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
Neo4j
 
20 Comprehensive Checklist of Designing and Developing a Website
20 Comprehensive Checklist of Designing and Developing a Website20 Comprehensive Checklist of Designing and Developing a Website
20 Comprehensive Checklist of Designing and Developing a Website
Pixlogix Infotech
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
innovationoecd
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
Matthew Sinclair
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
Kumud Singh
 

Recently uploaded (20)

Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
 
Data structures and Algorithms in Python.pdf
Data structures and Algorithms in Python.pdfData structures and Algorithms in Python.pdf
Data structures and Algorithms in Python.pdf
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
 
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AIEnchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
 
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
 
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
 
20 Comprehensive Checklist of Designing and Developing a Website
20 Comprehensive Checklist of Designing and Developing a Website20 Comprehensive Checklist of Designing and Developing a Website
20 Comprehensive Checklist of Designing and Developing a Website
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
 

EEDC Apache Pig Language

  • 1. EEDC 34330 Execution Apache Pig Environments for Distributed Computing Master in Computer Architecture, Networks and Systems - CANS Homework number: 3 Group number: EEDC-3 Group members: Javier Álvarez – javicid@gmail.com Francesc Lordan – francesc.lordan@gmail.com Roger Rafanell – rogerrafanell@gmail.com
  • 2. Outline 1.- Introduction 2.- Pig Latin 2.1.- Data model 2.2.- Relational commands 3.- Implementation 4.- Conclusions 2
  • 3. EEDC 34330 Execution Environments for Distributed Computing Master in Computer Architecture, Part 1 Networks and Systems - CANS Introduction
  • 4. Why Apache Pig? Today’s Internet companies needs to process hugh data sets: – Parallel databases can be prohibitively expensive at this scale. – Programmers tend to find declarative languages such as SQL very unnatural. – Other approaches such map-reduce are low-level and rigid. 4
  • 5. What is Apache Pig? A platform for analyzing large data sets that: – It is based in Pig Latin which lies between declarative (SQL) and procedural (C++) programming languages. – At the same time, enables the construction of programs with an easy parallelizable structure. 5
  • 6. Which features does it have?  Dataflow Language – Data processing is expressed step-by-step.  Quick Start & Interoperability – Pig can work over any kind of input and produce any kind of output.  Nested Data Model – Pig works with complex types like tuples, bags, ...  User Defined Functions (UDFs) – Potentially in any programming language (only Java for the moment).  Only parallel – Pig Latin forces to use directives that are parallelizable in a direct way.  Debugging environment – Debugging at programming time. 6
  • 7. EEDC 34330 Execution Environments for Distributed Computing Master in Computer Architecture, Part 2 Networks and Systems - CANS Pig Latin
  • 8. EEDC 34330 Execution Environments for Distributed Computing Master in Computer Architecture, Section 2.1 Networks and Systems - CANS Data model
  • 9. Data Model Very rich data model consisting on 4 simple data types:  Atom: Simple atomic value such as strings or numbers. ‘Alice’  Tuple: Sequence of fields of any type of data. (‘Alice’, ‘Apple’) (‘Alice’, (‘Barça’, ‘football’))  Bag: collection of tuples with possible duplicates. (‘Alice’, ‘Apple’) (‘Alice’, (‘Barça’, ‘football’))  Map: collection of data items with an associated key (always an atom). ‘Fan of’  (‘Apple’) (‘Barça’, ‘football’) 9
  • 10. EEDC 34330 Execution Environments for Distributed Computing Section 2.2 Master in Computer Architecture, Networks and Systems - CANS Relational commands
  • 11. Relational commands visits = LOAD ‘visits.txt’ AS (user, url, time) pages = LOAD `pages.txt` AS (url, rank); visits: (‘Amy’, ‘cnn.com’, ‘8am’) (‘Amy’, ‘nytimes.com’, ‘9am’) (‘Bob’, ‘elmundotoday.com’, ’11am’) pages: (‘cnn.com’, ‘0.8’) (‘nytimes.com’, ‘0.6’) (‘elmundotoday’, ‘0.2’) 11
  • 12. Relational commands visits = LOAD ‘visits.txt’ AS (user, url, time) pages = LOAD `pages.txt` AS (url, rank); vp = JOIN visits BY url, pages BY url v_p:(‘Amy’, ‘cnn.com’, ‘8am’, ‘cnn.com’, ‘0.8’) (‘Amy’, ‘nytimes.com’, ‘9am’, ‘nytimes.com, ‘0.6’) (‘Bob’, ‘elmundotoday.com’, ’11am’, ‘elmundotoday.com’, ‘0.2’) 12
  • 13. Relational commands visits = LOAD ‘visits.txt’ AS (user, url, time) pages = LOAD `pages.txt` AS (url, rank); vp = JOIN visits BY url, pages BY url users = GROUP vp BY user user: (‘Amy’, { (‘Amy’, ‘cnn.com’, ‘8am’, ‘cnn.com’, ‘0.8’), (‘Amy’, ‘nytimes.com’, ‘9am’, ‘nytimes.com, ‘0.6’)}) (‘Bob’, {‘elmundotoday.com’, ’11am’, ‘elmundotoday.com’, ‘0.2’)}) 13
  • 14. Relational commands visits = LOAD ‘visits.txt’ AS (user, url, time) pages = LOAD `pages.txt` AS (url, rank); vp = JOIN visits BY url, pages BY url users = GROUP vp BY user useravg = FOREACH users GENERATE group, AVG(vp.rank) AS avgpr user: (‘Amy’, ‘0.7’) (‘Bob’, ‘0.2’) 14
  • 15. Relational commands visits = LOAD ‘visits.txt’ AS (user, url, time) pages = LOAD `pages.txt` AS (url, rank); vp = JOIN visits BY url, pages BY url users = GROUP vp BY user useravg = FOREACH users GENERATE group, AVG(vp.rank) AS avgpr answer = FILTER useravg BY avgpr > ‘0.5’ answer: (‘Amy’, ‘0.7’) 15
  • 16. Relational commands Other relational operators: – STORE : exports data into a file. STORE var1_name INTO 'output.txt‘; – COGROUP : groups together tuples from diferent datasets. COGROUP var1_name BY field_id, var2_name BY field_id – UNION : computes the union of two variables. – CROSS : computes the cross product. – ORDER : sorts a data set by one or more fields. – DISTINCT : removes replicated tuples in a dataset. 16
  • 17. EEDC 34330 Execution Environments for Distributed Computing Master in Computer Architecture, Part 3 Networks and Systems - CANS Implementation
  • 18. Implementation: Highlights  Works on top of Hadoop ecosystem: – Current implementation uses Hadoop as execution platform.  On-the-fly compilation: – Pig translates the Pig Latin commands to Map and Reduce methods.  Lazy style language: – Pig try to pospone the data materialization (on disk writes) as much as possible. 18
  • 19. Implementation: Building the logical plan  Query parsing: – Pig interpreter parses the commands verifying that the input files and bags referenced are valid.  On-the-fly compilation: – Pig compiles the logical plan for that bag into physical plan (Map-Reduce statements) when the command cannot be more delayed and must be executed.  Lazy characteristics: – No processing are carried out when the logical plan are build up. – Processing is triggered only when the user invokes STORE command on a bag. – Lazy style execution permits in-memory pipelining and other interesting optimizations. 19
  • 20. Implementation: Map-Reduce plan compilation  CO(GROUP): – Each command is compiled in a distinct map-reduce job with its own map and reduce functions. – Parallelism is achieved since the output of multiple map instances is repartitioned in parallel to multiple reduce instances.  LOAD: – Parallelism is obtained since Pig operates over files residing in the Hadoop distributed file system.  FILTER/FOREACH: – Automatic parallelism is given since for a map-reduce job several map and reduce instances are run in parallel.  ORDER (compiled in two map-reduce jobs): – First: Determine quantiles of the sort key – Second: Chops the job according the quantiles and performs a local sorting in the reduce phase resulting in a global sorted file. 20
  • 21. EEDC 34330 Execution Environments for Distributed Computing Master in Computer Architecture, Part 4 Networks and Systems - CANS Conclusions
  • 22. Conclusions  Advantages: – Step-by-step syntaxis. – Flexible: UDFs, not locked to a fixed schema (allows schema changes over the time). – Exposes a set of widely used functions: FOREACH, FILTER, ORDER, GROUP, … – Takes advantage of Hadoop native properties such: parallelism, load-balancing, fault-tolerance. – Debugging environment. – Open Source (IMPORTANT!!)  Disadvantages: – UDFs methods could be a source of performance loss (the control relies on user). – Overhead while compiling Pig Latin into map-reduce jobs.  Usage Scenarios: – Temporal analysis: search logs mainly involves studying how search query distribution changes over time. – Session analysis: web user sessions, i.e, sequences of page views and clicks made by users are analized to calculate some metrics such: – how long is the average user session? – how many links does a user click on before leaving a website? – Others, ... 22
  • 23. Q&A 23
  • 25. FLATTERED 25