EEDC 34330 – Execution Environments for Distributed Computing
Apache Pig
Master in Computer Architecture, Networks and Systems - CANS

Homework number: 3
Group number: EEDC-3
Group members:
    Javier Álvarez – javicid@gmail.com
    Francesc Lordan – francesc.lordan@gmail.com
    Roger Rafanell – rogerrafanell@gmail.com
Outline

1.- Introduction

2.- Pig Latin
    2.1.- Data Model
    2.2.- Programming Model

3.- Implementation

4.- Conclusions




Part 1: Introduction
Why Apache Pig?

Today's Internet companies need to process huge data sets:

   – Parallel databases can be prohibitively expensive at this scale.

   – Programmers tend to find declarative languages such as SQL very
     unnatural.

   – Other approaches, such as map-reduce, are low-level and rigid.
What is Apache Pig?

A platform for analyzing large data sets that:

   – Is based on Pig Latin, a language that lies between declarative (SQL)
     and procedural (C++) programming languages.

   – At the same time, enables the construction of programs with an easily
     parallelizable structure.
Which features does it have?
 Dataflow Language
   – Data processing is expressed step by step.

 Quick Start & Interoperability
   – Pig can work over any kind of input and produce any kind of output.

 Nested Data Model
   – Pig works with complex types such as tuples, bags, ...

 User Defined Functions (UDFs)
   – Potentially in any programming language (only Java for the moment).

 Only parallel
   – Pig Latin only offers directives that can be parallelized in a direct way.

 Debugging environment
   – Debugging at programming time.
Part 2: Pig Latin
Section 2.1: Data Model
Data Model
Very rich data model consisting of four simple data types:

 Atom: a simple atomic value such as a string or a number.
        'Alice'

 Tuple: a sequence of fields, each of any type.
        ('Alice', 'Apple')
        ('Alice', ('Barça', 'football'))

 Bag: a collection of tuples, with possible duplicates.
        { ('Alice', 'Apple'),
          ('Alice', ('Barça', 'football')) }

 Map: a collection of data items, each with an associated key (always an atom).
        [ 'Fan of' → { ('Apple'),
                       ('Barça', 'football') },
          'Age'    → 20 ]
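As a rough analogy, the four types can be mirrored with plain Python values. This is our own illustrative mapping, not Pig's actual runtime representation:

```python
# Illustrative Python analogues of Pig's four data types.
# (Pig's real runtime classes differ; this only mirrors the shapes.)

atom = 'Alice'                                # Atom: a scalar string or number

tup = ('Alice', ('Barça', 'football'))        # Tuple: fields may be nested tuples

bag = [('Alice', 'Apple'),                    # Bag: collection of tuples;
       ('Alice', ('Barça', 'football')),      # duplicates are allowed, hence a
       ('Alice', 'Apple')]                    # list rather than a set

m = {'Fan of': [('Apple',),                   # Map: keys are atoms, values can
                ('Barça', 'football')],       # be any type (here a bag and an
     'Age': 20}                               # atom)

print(len(bag))   # the duplicate tuple is kept: 3 entries
print(m['Age'])   # lookup by atom key: 20
```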
Section 2.2: Programming Model
Programming Model
visits = LOAD 'visits.txt' AS (user, url, time);
pages  = LOAD 'pages.txt' AS (url, rank);


visits: ('Amy', 'cnn.com', '8am')
        ('Amy', 'nytimes.com', '9am')
        ('Bob', 'elmundotoday.com', '11am')

pages:  ('cnn.com', 0.8)
        ('nytimes.com', 0.6)
        ('elmundotoday.com', 0.2)
Programming Model
visits = LOAD 'visits.txt' AS (user, url, time);
pages  = LOAD 'pages.txt' AS (url, rank);
vp     = JOIN visits BY url, pages BY url;


vp: ('Amy', 'cnn.com', '8am', 'cnn.com', 0.8)
    ('Amy', 'nytimes.com', '9am', 'nytimes.com', 0.6)
    ('Bob', 'elmundotoday.com', '11am', 'elmundotoday.com', 0.2)
Programming Model
visits = LOAD 'visits.txt' AS (user, url, time);
pages  = LOAD 'pages.txt' AS (url, rank);
vp     = JOIN visits BY url, pages BY url;
users  = GROUP vp BY user;


users: ('Amy', { ('Amy', 'cnn.com', '8am', 'cnn.com', 0.8),
                 ('Amy', 'nytimes.com', '9am', 'nytimes.com', 0.6) })
       ('Bob', { ('Bob', 'elmundotoday.com', '11am', 'elmundotoday.com', 0.2) })
Programming Model
visits  = LOAD 'visits.txt' AS (user, url, time);
pages   = LOAD 'pages.txt' AS (url, rank);
vp      = JOIN visits BY url, pages BY url;
users   = GROUP vp BY user;
useravg = FOREACH users GENERATE group, AVG(vp.rank) AS avgpr;


useravg: ('Amy', 0.7)
         ('Bob', 0.2)
Programming Model
visits  = LOAD 'visits.txt' AS (user, url, time);
pages   = LOAD 'pages.txt' AS (url, rank);
vp      = JOIN visits BY url, pages BY url;
users   = GROUP vp BY user;
useravg = FOREACH users GENERATE group, AVG(vp.rank) AS avgpr;
answer  = FILTER useravg BY avgpr > 0.5;


answer: ('Amy', 0.7)
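The whole example pipeline can be mimicked in plain Python to check the numbers. This is our own in-memory sketch of the semantics, not how Pig actually executes the script:

```python
from collections import defaultdict

# In-memory stand-ins for visits.txt and pages.txt
visits = [('Amy', 'cnn.com', '8am'),
          ('Amy', 'nytimes.com', '9am'),
          ('Bob', 'elmundotoday.com', '11am')]
pages = [('cnn.com', 0.8), ('nytimes.com', 0.6), ('elmundotoday.com', 0.2)]

# JOIN visits BY url, pages BY url
rank = dict(pages)
vp = [(user, url, time, url, rank[url]) for (user, url, time) in visits]

# GROUP vp BY user
users = defaultdict(list)
for row in vp:
    users[row[0]].append(row)

# FOREACH users GENERATE group, AVG(vp.rank) AS avgpr
useravg = {u: sum(r[4] for r in rows) / len(rows) for u, rows in users.items()}

# FILTER useravg BY avgpr > 0.5
answer = {u: avg for u, avg in useravg.items() if avg > 0.5}

print(useravg)   # Amy averages 0.7, Bob 0.2
print(answer)    # only Amy survives the filter
```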
Programming Model
Other relational operators:

    – STORE: exports data into a file.
      STORE var1_name INTO 'output.txt';

    – COGROUP: groups together tuples from different datasets.
      COGROUP var1_name BY field_id, var2_name BY field_id;

    –   UNION: computes the union of two datasets.
    –   CROSS: computes the cross product of two datasets.
    –   ORDER: sorts a dataset by one or more fields.
    –   DISTINCT: removes duplicated tuples from a dataset.
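Unlike JOIN, COGROUP keeps the matching tuples in a separate bag per input relation. A small Python sketch of that semantics, with our own toy data and a simplified helper (`cogroup` is our name, not a Pig API):

```python
from collections import defaultdict

def cogroup(rel1, key1, rel2, key2):
    """Group two relations by a key, keeping one bag per input,
    like Pig's COGROUP (simplified: bags hold whole input tuples)."""
    out = defaultdict(lambda: ([], []))
    for t in rel1:
        out[t[key1]][0].append(t)     # bag of matching rel1 tuples
    for t in rel2:
        out[t[key2]][1].append(t)     # bag of matching rel2 tuples
    return dict(out)

visits = [('Amy', 'cnn.com'), ('Bob', 'cnn.com')]
pages = [('cnn.com', 0.8)]

grouped = cogroup(visits, 1, pages, 0)
print(grouped['cnn.com'])
# a pair of bags per key, rather than one flattened joined tuple
```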
Part 3: Implementation
Implementation: Highlights

 Works on top of the Hadoop ecosystem:
   – The current implementation uses Hadoop as its execution platform.

 On-the-fly compilation:
   – Pig translates Pig Latin commands into Map and Reduce methods.

 Lazy-style language:
   – Pig tries to postpone data materialization (on-disk writes) as much as
     possible.
Implementation: Building the logical plan

 Query parsing:
   – The Pig interpreter parses each command, verifying that the input files
     and bags referenced are valid.

 On-the-fly compilation:
   – Pig compiles the logical plan for a bag into a physical plan (map-reduce
     jobs) only when the command can no longer be delayed and must be
     executed.

 Lazy characteristics:
   – No processing is carried out while the logical plan is constructed.
   – Processing is triggered only when the user invokes the STORE command
     on a bag.
   – Lazy execution permits in-memory pipelining and other interesting
     optimizations.
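The lazy behaviour can be pictured as building up a plan of deferred steps that only runs when STORE is invoked. A minimal toy sketch of the idea (our own simplification, nothing like Pig's actual planner):

```python
class LazyBag:
    """Toy deferred dataset: operators only record a plan;
    nothing runs until store() materializes the result."""

    def __init__(self, source):
        self.plan = [source]          # first step produces the input rows

    def filter(self, pred):
        self.plan.append(lambda rows: (r for r in rows if pred(r)))
        return self                   # plan grows, no data is touched

    def foreach(self, fn):
        self.plan.append(lambda rows: (fn(r) for r in rows))
        return self

    def store(self):
        rows = self.plan[0]()         # execution is triggered only here
        for step in self.plan[1:]:
            rows = step(rows)         # generators pipeline in memory,
        return list(rows)             # no intermediate materialization

bag = LazyBag(lambda: iter(range(10)))
bag.filter(lambda x: x % 2 == 0).foreach(lambda x: x * 10)

result = bag.store()                  # the whole pipeline runs now
print(result)                         # [0, 20, 40, 60, 80]
```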
Implementation: Map-Reduce plan compilation

   (CO)GROUP:
     – Each command is compiled into a distinct map-reduce job with its own map and reduce functions.
     – Parallelism is achieved since the output of multiple map instances is repartitioned in parallel to
       multiple reduce instances.

   LOAD:
     – Parallelism is obtained since Pig operates over files residing in the Hadoop distributed file system.

   FILTER/FOREACH:
     – Parallelism comes automatically, since a map-reduce job runs several map and reduce instances
       in parallel.

   ORDER (compiled into two map-reduce jobs):
     – First: determines the quantiles of the sort key.
     – Second: partitions the data according to those quantiles and performs a local sort in the reduce
       phase, resulting in a globally sorted file.
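The two-job ORDER strategy boils down to: sample the sort keys to pick quantile boundaries, route each record to a range partition, sort each partition locally, and concatenate. A toy single-process version of that idea (our own simplification; parameter names are made up):

```python
import bisect
import random

def quantile_sort(records, num_parts=4, sample_size=100):
    """Toy version of Pig's two-job ORDER: job 1 estimates quantile
    boundaries from a sample; job 2 range-partitions the data, sorts
    each partition locally, and concatenates into a global order."""
    # "Job 1": sample the sort keys and pick partition boundaries
    sample = sorted(random.sample(records, min(sample_size, len(records))))
    boundaries = [sample[len(sample) * i // num_parts]
                  for i in range(1, num_parts)]

    # "Job 2", map side: route each record to its range partition
    # (this plays the role of the shuffle)
    parts = [[] for _ in range(num_parts)]
    for r in records:
        parts[bisect.bisect_left(boundaries, r)].append(r)

    # "Job 2", reduce side: sort locally within each partition;
    # concatenating contiguous ranges yields a globally sorted result
    out = []
    for p in parts:
        out.extend(sorted(p))
    return out

data = random.sample(range(10000), 1000)
result = quantile_sort(data)
print(result == sorted(data))   # True: globally sorted
```

Sampling keeps the partitions roughly balanced, which is the whole point: each "reducer" does a similar amount of local sorting work.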
Part 4: Conclusions
Conclusions
   Advantages:
     –   Step-by-step syntax.
     –   Flexible: UDFs; not locked to a fixed schema (allows schema changes over time).
     –   Exposes a set of widely used functions: FOREACH, FILTER, ORDER, GROUP, …
     –   Takes advantage of native Hadoop properties such as parallelism, load balancing and fault tolerance.
     –   Debugging environment.
     –   Open source (IMPORTANT!!)

   Disadvantages:
     –   UDF methods can be a source of performance loss (control relies on the user side).
     –   Overhead while compiling Pig Latin into map-reduce jobs.

   Usage scenarios:
     –   Temporal analysis: studying search logs, mainly how the search query distribution changes
         over time.
     –   Session analysis: web user sessions, i.e., the sequences of page views and clicks made by users,
         are analyzed to compute metrics such as:
           – how long is the average user session?
           – how many links does a user click on before leaving a website?
     –   Others, ...
Q&A




FLATTERED



