EEDC 34330 – Execution Environments for Distributed Computing
Apache Pig
Master in Computer Architecture, Networks and Systems - CANS

Homework number: 3
Group number: EEDC-3
Group members:
    Javier Álvarez – javicid@gmail.com
    Francesc Lordan – francesc.lordan@gmail.com
    Roger Rafanell – rogerrafanell@gmail.com
Outline

1.- Introduction

2.- Pig Latin
    2.1.- Data Model
    2.2.- Programming Model

3.- Implementation

4.- Conclusions




Part 1: Introduction
Why Apache Pig?

Today's Internet companies need to process huge data sets:

   – Parallel databases can be prohibitively expensive at this scale.

   – Programmers tend to find declarative languages such as SQL very
     unnatural.

   – Other approaches, such as map-reduce, are low-level and rigid.
What is Apache Pig?

A platform for analyzing large data sets that:

   – Is based on Pig Latin, a language that lies between declarative (SQL)
     and procedural (C++) programming languages.

   – At the same time, enables the construction of programs with an easily
     parallelizable structure.
Which features does it have?
 Dataflow Language
   – Data processing is expressed step by step.

 Quick Start & Interoperability
   – Pig can work over any kind of input and produce any kind of output.

 Nested Data Model
   – Pig works with complex types such as tuples, bags, ...

 User Defined Functions (UDFs)
   – Potentially in any programming language (only Java for the moment).

 Only parallel
   – Pig Latin only offers directives that can be parallelized in a direct way.

 Debugging environment
   – Debugging at programming time.
Part 2: Pig Latin
Section 2.1: Data Model
Data Model
Very rich data model consisting of four simple data types:

 Atom: a simple atomic value such as a string or a number.
        'Alice'

 Tuple: a sequence of fields, each of any type.
        ('Alice', 'Apple')
        ('Alice', ('Barça', 'football'))

 Bag: a collection of tuples, with possible duplicates.
        { ('Alice', 'Apple'),
          ('Alice', ('Barça', 'football')) }

 Map: a collection of data items, each with an associated key (always an atom).
        [ 'Fan of' → { ('Apple'),
                       ('Barça', 'football') },
          'Age'    → 20 ]
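As a rough analogy, the four types can be mirrored with plain Python values. This is our own illustrative mapping, not Pig's actual runtime representation:

```python
# Illustrative Python analogues of Pig's four data types.
# (Pig's real runtime classes differ; this only mirrors the shapes.)

atom = 'Alice'                                # Atom: a scalar string or number

tup = ('Alice', ('Barça', 'football'))        # Tuple: fields may be nested tuples

bag = [('Alice', 'Apple'),                    # Bag: collection of tuples;
       ('Alice', ('Barça', 'football')),      # duplicates are allowed, hence a
       ('Alice', 'Apple')]                    # list rather than a set

m = {'Fan of': [('Apple',),                   # Map: keys are atoms, values can
                ('Barça', 'football')],       # be any type (here a bag and an
     'Age': 20}                               # atom)

print(len(bag))   # the duplicate tuple is kept: 3 entries
print(m['Age'])   # lookup by atom key: 20
```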
Section 2.2: Programming Model
Programming Model
visits = LOAD 'visits.txt' AS (user, url, time);
pages  = LOAD 'pages.txt' AS (url, rank);


visits: ('Amy', 'cnn.com', '8am')
        ('Amy', 'nytimes.com', '9am')
        ('Bob', 'elmundotoday.com', '11am')

pages:  ('cnn.com', 0.8)
        ('nytimes.com', 0.6)
        ('elmundotoday.com', 0.2)
Programming Model
visits = LOAD 'visits.txt' AS (user, url, time);
pages  = LOAD 'pages.txt' AS (url, rank);
vp     = JOIN visits BY url, pages BY url;


vp: ('Amy', 'cnn.com', '8am', 'cnn.com', 0.8)
    ('Amy', 'nytimes.com', '9am', 'nytimes.com', 0.6)
    ('Bob', 'elmundotoday.com', '11am', 'elmundotoday.com', 0.2)
Programming Model
visits = LOAD 'visits.txt' AS (user, url, time);
pages  = LOAD 'pages.txt' AS (url, rank);
vp     = JOIN visits BY url, pages BY url;
users  = GROUP vp BY user;


users: ('Amy', { ('Amy', 'cnn.com', '8am', 'cnn.com', 0.8),
                 ('Amy', 'nytimes.com', '9am', 'nytimes.com', 0.6) })
       ('Bob', { ('Bob', 'elmundotoday.com', '11am', 'elmundotoday.com', 0.2) })
Programming Model
visits  = LOAD 'visits.txt' AS (user, url, time);
pages   = LOAD 'pages.txt' AS (url, rank);
vp      = JOIN visits BY url, pages BY url;
users   = GROUP vp BY user;
useravg = FOREACH users GENERATE group, AVG(vp.rank) AS avgpr;


useravg: ('Amy', 0.7)
         ('Bob', 0.2)
Programming Model
visits  = LOAD 'visits.txt' AS (user, url, time);
pages   = LOAD 'pages.txt' AS (url, rank);
vp      = JOIN visits BY url, pages BY url;
users   = GROUP vp BY user;
useravg = FOREACH users GENERATE group, AVG(vp.rank) AS avgpr;
answer  = FILTER useravg BY avgpr > 0.5;


answer: ('Amy', 0.7)
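The whole example pipeline can be mimicked in plain Python to check the numbers. This is our own in-memory sketch of the semantics, not how Pig actually executes the script:

```python
from collections import defaultdict

# In-memory stand-ins for visits.txt and pages.txt
visits = [('Amy', 'cnn.com', '8am'),
          ('Amy', 'nytimes.com', '9am'),
          ('Bob', 'elmundotoday.com', '11am')]
pages = [('cnn.com', 0.8), ('nytimes.com', 0.6), ('elmundotoday.com', 0.2)]

# JOIN visits BY url, pages BY url
rank = dict(pages)
vp = [(user, url, time, url, rank[url]) for (user, url, time) in visits]

# GROUP vp BY user
users = defaultdict(list)
for row in vp:
    users[row[0]].append(row)

# FOREACH users GENERATE group, AVG(vp.rank) AS avgpr
useravg = {u: sum(r[4] for r in rows) / len(rows) for u, rows in users.items()}

# FILTER useravg BY avgpr > 0.5
answer = {u: avg for u, avg in useravg.items() if avg > 0.5}

print(useravg)   # Amy averages 0.7, Bob 0.2
print(answer)    # only Amy survives the filter
```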
Programming Model
Other relational operators:

    – STORE: exports data into a file.
      STORE var1_name INTO 'output.txt';

    – COGROUP: groups together tuples from different datasets.
      COGROUP var1_name BY field_id, var2_name BY field_id;

    –   UNION: computes the union of two datasets.
    –   CROSS: computes the cross product of two datasets.
    –   ORDER: sorts a dataset by one or more fields.
    –   DISTINCT: removes duplicated tuples from a dataset.
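Unlike JOIN, COGROUP keeps the matching tuples in a separate bag per input relation. A small Python sketch of that semantics, with our own toy data and a simplified helper (`cogroup` is our name, not a Pig API):

```python
from collections import defaultdict

def cogroup(rel1, key1, rel2, key2):
    """Group two relations by a key, keeping one bag per input,
    like Pig's COGROUP (simplified: bags hold whole input tuples)."""
    out = defaultdict(lambda: ([], []))
    for t in rel1:
        out[t[key1]][0].append(t)     # bag of matching rel1 tuples
    for t in rel2:
        out[t[key2]][1].append(t)     # bag of matching rel2 tuples
    return dict(out)

visits = [('Amy', 'cnn.com'), ('Bob', 'cnn.com')]
pages = [('cnn.com', 0.8)]

grouped = cogroup(visits, 1, pages, 0)
print(grouped['cnn.com'])
# a pair of bags per key, rather than one flattened joined tuple
```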
Part 3: Implementation
Implementation: Highlights

 Works on top of the Hadoop ecosystem:
   – The current implementation uses Hadoop as its execution platform.

 On-the-fly compilation:
   – Pig translates Pig Latin commands into Map and Reduce methods.

 Lazy-style language:
   – Pig tries to postpone data materialization (on-disk writes) as much as
     possible.
Implementation: Building the logical plan

 Query parsing:
   – The Pig interpreter parses each command, verifying that the input files
     and bags referenced are valid.

 On-the-fly compilation:
   – Pig compiles the logical plan for a bag into a physical plan (map-reduce
     jobs) only when the command can no longer be delayed and must be
     executed.

 Lazy characteristics:
   – No processing is carried out while the logical plan is constructed.
   – Processing is triggered only when the user invokes the STORE command
     on a bag.
   – Lazy execution permits in-memory pipelining and other interesting
     optimizations.
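The lazy behaviour can be pictured as building up a plan of deferred steps that only runs when STORE is invoked. A minimal toy sketch of the idea (our own simplification, nothing like Pig's actual planner):

```python
class LazyBag:
    """Toy deferred dataset: operators only record a plan;
    nothing runs until store() materializes the result."""

    def __init__(self, source):
        self.plan = [source]          # first step produces the input rows

    def filter(self, pred):
        self.plan.append(lambda rows: (r for r in rows if pred(r)))
        return self                   # plan grows, no data is touched

    def foreach(self, fn):
        self.plan.append(lambda rows: (fn(r) for r in rows))
        return self

    def store(self):
        rows = self.plan[0]()         # execution is triggered only here
        for step in self.plan[1:]:
            rows = step(rows)         # generators pipeline in memory,
        return list(rows)             # no intermediate materialization

bag = LazyBag(lambda: iter(range(10)))
bag.filter(lambda x: x % 2 == 0).foreach(lambda x: x * 10)

result = bag.store()                  # the whole pipeline runs now
print(result)                         # [0, 20, 40, 60, 80]
```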
Implementation: Map-Reduce plan compilation

   (CO)GROUP:
     – Each command is compiled into a distinct map-reduce job with its own map and reduce functions.
     – Parallelism is achieved since the output of multiple map instances is repartitioned in parallel to
       multiple reduce instances.

   LOAD:
     – Parallelism is obtained since Pig operates over files residing in the Hadoop distributed file system.

   FILTER/FOREACH:
     – Parallelism comes automatically, since a map-reduce job runs several map and reduce instances
       in parallel.

   ORDER (compiled into two map-reduce jobs):
     – First: determines the quantiles of the sort key.
     – Second: partitions the data according to those quantiles and performs a local sort in the reduce
       phase, resulting in a globally sorted file.
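The two-job ORDER strategy boils down to: sample the sort keys to pick quantile boundaries, route each record to a range partition, sort each partition locally, and concatenate. A toy single-process version of that idea (our own simplification; parameter names are made up):

```python
import bisect
import random

def quantile_sort(records, num_parts=4, sample_size=100):
    """Toy version of Pig's two-job ORDER: job 1 estimates quantile
    boundaries from a sample; job 2 range-partitions the data, sorts
    each partition locally, and concatenates into a global order."""
    # "Job 1": sample the sort keys and pick partition boundaries
    sample = sorted(random.sample(records, min(sample_size, len(records))))
    boundaries = [sample[len(sample) * i // num_parts]
                  for i in range(1, num_parts)]

    # "Job 2", map side: route each record to its range partition
    # (this plays the role of the shuffle)
    parts = [[] for _ in range(num_parts)]
    for r in records:
        parts[bisect.bisect_left(boundaries, r)].append(r)

    # "Job 2", reduce side: sort locally within each partition;
    # concatenating contiguous ranges yields a globally sorted result
    out = []
    for p in parts:
        out.extend(sorted(p))
    return out

data = random.sample(range(10000), 1000)
result = quantile_sort(data)
print(result == sorted(data))   # True: globally sorted
```

Sampling keeps the partitions roughly balanced, which is the whole point: each "reducer" does a similar amount of local sorting work.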
Part 4: Conclusions
Conclusions
   Advantages:
     –   Step-by-step syntax.
     –   Flexible: UDFs; not locked to a fixed schema (allows schema changes over time).
     –   Exposes a set of widely used functions: FOREACH, FILTER, ORDER, GROUP, …
     –   Takes advantage of native Hadoop properties such as parallelism, load balancing and fault tolerance.
     –   Debugging environment.
     –   Open source (IMPORTANT!!)

   Disadvantages:
     –   UDF methods can be a source of performance loss (control relies on the user side).
     –   Overhead while compiling Pig Latin into map-reduce jobs.

   Usage scenarios:
     –   Temporal analysis: studying search logs, mainly how the search query distribution changes
         over time.
     –   Session analysis: web user sessions, i.e., the sequences of page views and clicks made by users,
         are analyzed to compute metrics such as:
           – how long is the average user session?
           – how many links does a user click on before leaving a website?
     –   Others, ...
Q&A




FLATTERED



