EEDC - Apache Pig

  1. EEDC 34330 – Execution Environments for Distributed Computing
     Apache Pig
     Master in Computer Architecture, Networks and Systems - CANS
     Homework number: 3 – Group number: EEDC-3
     Group members:
     Javier Álvarez – javicid@gmail.com
     Francesc Lordan – francesc.lordan@gmail.com
     Roger Rafanell – rogerrafanell@gmail.com
  2. Outline
     1. Introduction
     2. Pig Latin
        2.1. Data Model
        2.2. Programming Model
     3. Implementation
     4. Conclusions
  3. Part 1: Introduction
  4. Why Apache Pig?
     Today's Internet companies need to process huge data sets:
     – Parallel databases can be prohibitively expensive at this scale.
     – Programmers tend to find declarative languages such as SQL very unnatural.
     – Other approaches, such as map-reduce, are low-level and rigid.
  5. What is Apache Pig?
     A platform for analyzing large data sets that:
     – Is based on Pig Latin, a language which lies between declarative (SQL) and procedural (C++) programming languages.
     – At the same time, enables the construction of programs with an easily parallelizable structure.
  6. Which features does it have?
     – Dataflow language: data processing is expressed step by step.
     – Quick start & interoperability: Pig can work over any kind of input and produce any kind of output.
     – Nested data model: Pig works with complex types like tuples, bags, ...
     – User-defined functions (UDFs): potentially in any programming language (only Java for the moment).
     – Only parallel: Pig Latin forces the use of directives that are parallelizable in a direct way.
     – Debugging environment: debugging at programming time.
  7. Part 2: Pig Latin
  8. Section 2.1: Data Model
  9. Data Model
     Very rich data model consisting of four simple data types:
     – Atom: simple atomic value such as a string or a number.
       'Alice'
     – Tuple: sequence of fields, each of any type of data.
       ('Alice', 'Apple')
       ('Alice', ('Barça', 'football'))
     – Bag: collection of tuples, with possible duplicates.
       { ('Alice', 'Apple'), ('Alice', ('Barça', 'football')) }
     – Map: collection of data items with an associated key (always an atom).
       'Fan of' → { ('Apple'), ('Barça', 'football') }
       'Age' → '20'
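The four data types above can be mirrored with ordinary Python values. This is only an illustrative sketch of the model, not Pig's actual in-memory representation; a plain list stands in for the bag since, like a bag, it allows duplicates:

```python
# Sketch of Pig's four data types using plain Python values (illustration only).

atom = 'Alice'                                   # Atom: a simple scalar value

tup = ('Alice', ('Barça', 'football'))           # Tuple: fields of any type, may nest

bag = [('Alice', 'Apple'),                       # Bag: a collection of tuples,
       ('Alice', ('Barça', 'football'))]         # duplicates are allowed

mp = {'Fan of': [('Apple',), ('Barça', 'football')],  # Map: atom keys mapped
      'Age': '20'}                                    # to values of any type

print(mp['Age'])   # looking up a map entry by its atom key
```

Note how the nested model lets a tuple field itself be a tuple, and a map value be a whole bag, which flat relational models do not allow.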
  10. Section 2.2: Programming Model
  11. Programming Model
      visits = LOAD 'visits.txt' AS (user, url, time);
      pages  = LOAD 'pages.txt' AS (url, rank);
      visits: ('Amy', 'cnn.com', '8am')
              ('Amy', 'nytimes.com', '9am')
              ('Bob', 'elmundotoday.com', '11am')
      pages:  ('cnn.com', '0.8')
              ('nytimes.com', '0.6')
              ('elmundotoday.com', '0.2')
  12. Programming Model
      visits = LOAD 'visits.txt' AS (user, url, time);
      pages  = LOAD 'pages.txt' AS (url, rank);
      vp = JOIN visits BY url, pages BY url;
      vp: ('Amy', 'cnn.com', '8am', 'cnn.com', '0.8')
          ('Amy', 'nytimes.com', '9am', 'nytimes.com', '0.6')
          ('Bob', 'elmundotoday.com', '11am', 'elmundotoday.com', '0.2')
  13. Programming Model
      visits = LOAD 'visits.txt' AS (user, url, time);
      pages  = LOAD 'pages.txt' AS (url, rank);
      vp = JOIN visits BY url, pages BY url;
      users = GROUP vp BY user;
      users: ('Amy', { ('Amy', 'cnn.com', '8am', 'cnn.com', '0.8'),
                       ('Amy', 'nytimes.com', '9am', 'nytimes.com', '0.6') })
             ('Bob', { ('Bob', 'elmundotoday.com', '11am', 'elmundotoday.com', '0.2') })
  14. Programming Model
      visits = LOAD 'visits.txt' AS (user, url, time);
      pages  = LOAD 'pages.txt' AS (url, rank);
      vp = JOIN visits BY url, pages BY url;
      users = GROUP vp BY user;
      useravg = FOREACH users GENERATE group, AVG(vp.rank) AS avgpr;
      useravg: ('Amy', '0.7')
               ('Bob', '0.2')
  15. Programming Model
      visits = LOAD 'visits.txt' AS (user, url, time);
      pages  = LOAD 'pages.txt' AS (url, rank);
      vp = JOIN visits BY url, pages BY url;
      users = GROUP vp BY user;
      useravg = FOREACH users GENERATE group, AVG(vp.rank) AS avgpr;
      answer = FILTER useravg BY avgpr > '0.5';
      answer: ('Amy', '0.7')
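The semantics of the five Pig Latin statements above can be simulated with plain Python on the sample data from the slides. This sketch shows only *what* each step computes; Pig would actually compile these steps into parallel map-reduce jobs:

```python
# Simulating the slides' Pig Latin pipeline: LOAD, JOIN, GROUP, FOREACH, FILTER.
visits = [('Amy', 'cnn.com', '8am'),
          ('Amy', 'nytimes.com', '9am'),
          ('Bob', 'elmundotoday.com', '11am')]
pages = [('cnn.com', 0.8), ('nytimes.com', 0.6), ('elmundotoday.com', 0.2)]

# vp = JOIN visits BY url, pages BY url
vp = [v + p for v in visits for p in pages if v[1] == p[0]]

# users = GROUP vp BY user  (each group key maps to a bag of matching tuples)
users = {}
for row in vp:
    users.setdefault(row[0], []).append(row)

# useravg = FOREACH users GENERATE group, AVG(vp.rank) AS avgpr
useravg = [(user, sum(r[4] for r in rows) / len(rows))
           for user, rows in users.items()]

# answer = FILTER useravg BY avgpr > 0.5
answer = [(user, avg) for user, avg in useravg if avg > 0.5]
print(answer)   # only Amy's average rank exceeds 0.5
```

Each intermediate variable corresponds to one Pig bag, which is why the step-by-step dataflow style translates so directly.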
  16. Programming Model
      Other relational operators:
      – STORE: exports data into a file.
        STORE var1_name INTO 'output.txt';
      – COGROUP: groups together tuples from different datasets.
        COGROUP var1_name BY field_id, var2_name BY field_id;
      – UNION: computes the union of two variables.
      – CROSS: computes the cross product.
      – ORDER: sorts a data set by one or more fields.
      – DISTINCT: removes replicated tuples in a dataset.
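COGROUP differs from JOIN in that it keeps the tuples of each input in separate bags under the shared key instead of producing flat joined tuples. A rough Python sketch of that semantics (the `cogroup` helper is hypothetical, for illustration only):

```python
# Sketch of COGROUP semantics: group two datasets by a key, keeping each
# input's tuples in its own bag. Pig's JOIN is essentially COGROUP followed
# by flattening the per-input bags into combined tuples.
visits = [('Amy', 'cnn.com', '8am'), ('Bob', 'elmundotoday.com', '11am')]
pages = [('cnn.com', 0.8), ('elmundotoday.com', 0.2)]

def cogroup(rel1, key1, rel2, key2):
    keys = {t[key1] for t in rel1} | {t[key2] for t in rel2}
    return [(k,
             [t for t in rel1 if t[key1] == k],   # bag of matching rel1 tuples
             [t for t in rel2 if t[key2] == k])   # bag of matching rel2 tuples
            for k in sorted(keys)]

for group in cogroup(visits, 1, pages, 0):
    print(group)
```

Keeping the bags separate lets a UDF inspect both sides of a group before deciding how, or whether, to combine them.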
  17. Part 3: Implementation
  18. Implementation: Highlights
      – Works on top of the Hadoop ecosystem: the current implementation uses Hadoop as its execution platform.
      – On-the-fly compilation: Pig translates Pig Latin commands into Map and Reduce methods.
      – Lazy-style language: Pig tries to postpone data materialization (on-disk writes) as much as possible.
  19. Implementation: Building the logical plan
      – Query parsing: the Pig interpreter parses the commands, verifying that the input files and bags referenced are valid.
      – On-the-fly compilation: Pig compiles the logical plan for a bag into a physical plan (Map-Reduce statements) only when the command can no longer be delayed and must be executed.
      – Lazy characteristics:
        – No processing is carried out while the logical plan is constructed.
        – Processing is triggered only when the user invokes the STORE command on a bag.
        – Lazy-style execution permits in-memory pipelining and other interesting optimizations.
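The lazy behavior described above can be sketched as a plan of deferred operations that runs only when a STORE-like call triggers it. This is a hypothetical minimal illustration, not Pig's internals:

```python
# Minimal sketch of lazy plan building: each operator only appends a step to
# the logical plan; no data is touched until store() triggers execution.
class LazyBag:
    def __init__(self, data):
        self.data, self.plan = data, []

    def filter(self, pred):
        # Record the step; do no work yet.
        self.plan.append(('FILTER', pred))
        return self

    def store(self):
        # STORE triggers the whole recorded pipeline at once,
        # streaming rows through every step in memory.
        rows = self.data
        for op, fn in self.plan:
            rows = [r for r in rows if fn(r)]
        return rows

bag = LazyBag([1, 2, 3, 4]).filter(lambda x: x > 2)
assert bag.plan            # a plan was recorded, but data is still untouched
print(bag.store())         # execution happens only here
```

Deferring execution this way is what lets Pig pipeline several steps through memory instead of materializing each intermediate bag to disk.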
  20. Implementation: Map-Reduce plan compilation
      – (CO)GROUP: each command is compiled into a distinct map-reduce job with its own map and reduce functions. Parallelism is achieved because the output of multiple map instances is repartitioned in parallel to multiple reduce instances.
      – LOAD: parallelism is obtained because Pig operates over files residing in the Hadoop distributed file system.
      – FILTER/FOREACH: parallelism comes automatically, since several map and reduce instances of a map-reduce job run in parallel.
      – ORDER (compiled into two map-reduce jobs):
        – First: determines quantiles of the sort key.
        – Second: chops the data according to the quantiles and performs a local sort in the reduce phase, resulting in a globally sorted file.
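The two-job ORDER can be sketched in single-process Python: a first pass samples quantiles of the sort key, a second pass routes each row to a range partition (standing in for a reducer) and sorts each partition locally, so concatenating the partitions yields a globally sorted result. The `quantile_sort` helper is illustrative, not Pig code:

```python
# Sketch of Pig's two-job ORDER: job 1 estimates quantile boundaries of the
# sort key; job 2 range-partitions rows by those boundaries and sorts each
# partition locally (each partition plays the role of one reducer).
def quantile_sort(rows, key, n_parts=3):
    # "Job 1": sample the sort key and pick quantile boundaries.
    sample = sorted(key(r) for r in rows)
    bounds = [sample[i * len(sample) // n_parts] for i in range(1, n_parts)]

    # "Job 2": route each row to the range partition its key falls in.
    parts = [[] for _ in range(n_parts)]
    for r in rows:
        idx = sum(key(r) >= b for b in bounds)
        parts[idx].append(r)

    out = []
    for p in parts:                # local sort per "reducer"
        out.extend(sorted(p, key=key))
    return out                     # concatenation of partitions: globally sorted

data = [(x,) for x in [5, 1, 9, 3, 7, 2, 8]]
result = quantile_sort(data, key=lambda r: r[0])
print(result)
```

The quantile step is what balances the work: without it, one reducer could receive almost all rows and become a bottleneck.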
  21. Part 4: Conclusions
  22. Conclusions
      Advantages:
      – Step-by-step syntax.
      – Flexible: UDFs; not locked to a fixed schema (allows schema changes over time).
      – Exposes a set of widely used functions: FOREACH, FILTER, ORDER, GROUP, ...
      – Takes advantage of Hadoop native properties such as parallelism, load balancing, and fault tolerance.
      – Debugging environment.
      – Open source (IMPORTANT!!)
      Disadvantages:
      – UDF methods can be a source of performance loss (control relies on the user side).
      – Overhead while compiling Pig Latin into map-reduce jobs.
      Usage scenarios:
      – Temporal analysis: analyzing search logs, mainly studying how the search query distribution changes over time.
      – Session analysis: web user sessions, i.e., sequences of page views and clicks made by users, are analyzed to calculate metrics such as:
        – How long is the average user session?
        – How many links does a user click on before leaving a website?
      – Others, ...
  23. Q&A
  24. FLATTERED
