EEDC Apache Pig Language

Published in: Technology

  1. EEDC 34330: Execution Environments for Distributed Computing
     Master in Computer Architecture, Networks and Systems - CANS
     Apache Pig
     Homework number: 3
     Group number: EEDC-3
     Group members:
       – Javier Álvarez – javicid@gmail.com
       – Francesc Lordan – francesc.lordan@gmail.com
       – Roger Rafanell – rogerrafanell@gmail.com
  2. Outline
       1.- Introduction
       2.- Pig Latin
         2.1.- Data model
         2.2.- Relational commands
       3.- Implementation
       4.- Conclusions
  3. Part 1: Introduction
  4. Why Apache Pig?
     Today's Internet companies need to process huge data sets:
       – Parallel databases can be prohibitively expensive at this scale.
       – Programmers tend to find declarative languages such as SQL very unnatural.
       – Other approaches such as map-reduce are low-level and rigid.
  5. What is Apache Pig?
     A platform for analyzing large data sets that:
       – Is based on Pig Latin, a language that lies between declarative (SQL) and procedural (C++) programming languages.
       – At the same time, enables the construction of programs with an easily parallelizable structure.
  6. Which features does it have?
     Dataflow language
       – Data processing is expressed step by step.
     Quick start and interoperability
       – Pig can work over any kind of input and produce any kind of output.
     Nested data model
       – Pig works with complex types such as tuples, bags, ...
     User Defined Functions (UDFs)
       – Potentially in any programming language (only Java for the moment).
     Only parallel
       – Pig Latin restricts programs to operations that can be parallelized directly.
     Debugging environment
       – Debugging at programming time.
  7. Part 2: Pig Latin
  8. Section 2.1: Data model
  9. Data Model
     A rich data model consisting of four simple data types:
       – Atom: a simple atomic value such as a string or a number.
           'Alice'
       – Tuple: a sequence of fields, each of any data type.
           ('Alice', 'Apple')
           ('Alice', ('Barça', 'football'))
       – Bag: a collection of tuples, with possible duplicates.
           { ('Alice', 'Apple'), ('Alice', ('Barça', 'football')) }
       – Map: a collection of data items, each with an associated key (always an atom).
           [ 'Fan of' → { ('Apple'), ('Barça', 'football') } ]
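The four data types can be mirrored with ordinary Python values. This is only an illustrative sketch: the mapping (Python `tuple` for a Pig tuple, `list` for a bag, `dict` for a map) and the variable names are assumptions made here, not part of Pig itself.

```python
# Hypothetical Python stand-ins for Pig Latin's four data types.

# Atom: a simple atomic value.
atom = "Alice"

# Tuple: a sequence of fields of any type; fields may themselves be tuples.
tup = ("Alice", ("Barça", "football"))

# Bag: a collection of tuples; duplicates are allowed, so a list (not a set) is used.
bag = [
    ("Alice", "Apple"),
    ("Alice", "Apple"),
    ("Alice", ("Barça", "football")),
]

# Map: atom keys associated with data items of any type (here, a bag).
fan_map = {"Fan of": [("Apple",), ("Barça", "football")]}

nested_sport = tup[1][1]          # reach into the nested tuple
bag_size = len(bag)               # duplicates are counted
favourites = fan_map["Fan of"]    # look up a data item by its key

print(nested_sport, bag_size, favourites)
```

Because every type can nest inside the others, a single record can carry what a flat relational schema would spread over several tables.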
  10. Section 2.2: Relational commands
  11. Relational commands
      visits = LOAD 'visits.txt' AS (user, url, time);
      pages = LOAD 'pages.txt' AS (url, rank);
      visits:
        ('Amy', 'cnn.com', '8am')
        ('Amy', 'nytimes.com', '9am')
        ('Bob', 'elmundotoday.com', '11am')
      pages:
        ('cnn.com', '0.8')
        ('nytimes.com', '0.6')
        ('elmundotoday.com', '0.2')
  12. Relational commands
      visits = LOAD 'visits.txt' AS (user, url, time);
      pages = LOAD 'pages.txt' AS (url, rank);
      vp = JOIN visits BY url, pages BY url;
      vp:
        ('Amy', 'cnn.com', '8am', 'cnn.com', '0.8')
        ('Amy', 'nytimes.com', '9am', 'nytimes.com', '0.6')
        ('Bob', 'elmundotoday.com', '11am', 'elmundotoday.com', '0.2')
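To make the semantics of JOIN concrete, here is a minimal plain-Python sketch of an inner equi-join over the same sample data. The helper `equi_join` and the use of numeric ranks instead of quoted strings are assumptions for illustration; Pig's real implementation runs as map-reduce jobs rather than in-memory loops.

```python
# Sample relations from the slides; fields are addressed by position.
visits = [
    ("Amy", "cnn.com", "8am"),
    ("Amy", "nytimes.com", "9am"),
    ("Bob", "elmundotoday.com", "11am"),
]
pages = [
    ("cnn.com", 0.8),
    ("nytimes.com", 0.6),
    ("elmundotoday.com", 0.2),
]


def equi_join(left, left_key, right, right_key):
    """Inner equi-join: concatenate every pair of tuples whose key fields match."""
    index = {}
    for row in right:
        # Index the right relation by key so each left row is matched in one lookup.
        index.setdefault(row[right_key], []).append(row)
    return [l + r for l in left for r in index.get(l[left_key], [])]


# Emulates: vp = JOIN visits BY url, pages BY url;
vp = equi_join(visits, 1, pages, 0)
for row in vp:
    print(row)
```

Note that the join key appears twice in each output tuple, once from each input, exactly as in the `vp` listing.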
  13. Relational commands
      visits = LOAD 'visits.txt' AS (user, url, time);
      pages = LOAD 'pages.txt' AS (url, rank);
      vp = JOIN visits BY url, pages BY url;
      users = GROUP vp BY user;
      users:
        ('Amy', { ('Amy', 'cnn.com', '8am', 'cnn.com', '0.8'),
                  ('Amy', 'nytimes.com', '9am', 'nytimes.com', '0.6') })
        ('Bob', { ('Bob', 'elmundotoday.com', '11am', 'elmundotoday.com', '0.2') })
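GROUP produces one output tuple per key, holding the key and a bag of all input tuples that share it. The following Python sketch emulates that behaviour; `group_by` is a hypothetical helper, and ranks are written as numbers for convenience.

```python
from collections import defaultdict

# The joined relation, as produced by the JOIN step.
vp = [
    ("Amy", "cnn.com", "8am", "cnn.com", 0.8),
    ("Amy", "nytimes.com", "9am", "nytimes.com", 0.6),
    ("Bob", "elmundotoday.com", "11am", "elmundotoday.com", 0.2),
]


def group_by(relation, key_index):
    """Return (key, bag) pairs, where each bag holds the full tuples sharing that key."""
    groups = defaultdict(list)
    for row in relation:
        groups[row[key_index]].append(row)
    # Keys come out in first-seen order because dicts preserve insertion order.
    return list(groups.items())


# Emulates: users = GROUP vp BY user;
users = group_by(vp, 0)
for key, bag in users:
    print(key, bag)
```

The grouped tuples keep all of their original fields, which is why the user name appears both as the group key and inside each tuple of the bag.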
  14. Relational commands
      visits = LOAD 'visits.txt' AS (user, url, time);
      pages = LOAD 'pages.txt' AS (url, rank);
      vp = JOIN visits BY url, pages BY url;
      users = GROUP vp BY user;
      useravg = FOREACH users GENERATE group, AVG(vp.rank) AS avgpr;
      useravg:
        ('Amy', 0.7)
        ('Bob', 0.2)
  15. Relational commands
      visits = LOAD 'visits.txt' AS (user, url, time);
      pages = LOAD 'pages.txt' AS (url, rank);
      vp = JOIN visits BY url, pages BY url;
      users = GROUP vp BY user;
      useravg = FOREACH users GENERATE group, AVG(vp.rank) AS avgpr;
      answer = FILTER useravg BY avgpr > 0.5;
      answer:
        ('Amy', 0.7)
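The last two steps, FOREACH ... GENERATE with an aggregate and FILTER, can be sketched in plain Python as list comprehensions over the grouped relation. The constant `RANK` (the position of the rank field) and the numeric ranks are assumptions made for this illustration.

```python
# The grouped relation: (user, bag of joined tuples).
users = [
    ("Amy", [
        ("Amy", "cnn.com", "8am", "cnn.com", 0.8),
        ("Amy", "nytimes.com", "9am", "nytimes.com", 0.6),
    ]),
    ("Bob", [
        ("Bob", "elmundotoday.com", "11am", "elmundotoday.com", 0.2),
    ]),
]

RANK = 4  # position of the rank field inside each joined tuple (assumed layout)

# Emulates: useravg = FOREACH users GENERATE group, AVG(vp.rank) AS avgpr;
useravg = [
    (group, sum(row[RANK] for row in bag) / len(bag))
    for group, bag in users
]

# Emulates: answer = FILTER useravg BY avgpr > 0.5;
answer = [(user, avgpr) for user, avgpr in useravg if avgpr > 0.5]

print(useravg)
print(answer)
```

Each step consumes the previous relation and yields a new one, which is the step-by-step dataflow style the slides describe.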
  16. Relational commands
      Other relational operators:
        – STORE: exports data into a file.
            STORE var1_name INTO 'output.txt';
        – COGROUP: groups together tuples from different data sets.
            COGROUP var1_name BY field_id, var2_name BY field_id;
        – UNION: computes the union of two variables.
        – CROSS: computes the cross product.
        – ORDER: sorts a data set by one or more fields.
        – DISTINCT: removes duplicate tuples from a data set.
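As a rough guide to what these operators compute, the sketch below emulates COGROUP, UNION, CROSS, ORDER and DISTINCT in plain Python. The sample data and the helper `cogroup` are hypothetical and chosen so that some keys appear in only one input.

```python
from itertools import product

visits = [("Amy", "cnn.com"), ("Bob", "bbc.com"), ("Amy", "cnn.com")]
more_visits = [("Cid", "bbc.com")]
pages = [("cnn.com", 0.8), ("nytimes.com", 0.6)]


def cogroup(left, left_key, right, right_key):
    """For each key, return (key, bag of left tuples, bag of right tuples)."""
    keys = dict.fromkeys(
        [row[left_key] for row in left] + [row[right_key] for row in right]
    )
    return [
        (
            key,
            [row for row in left if row[left_key] == key],
            [row for row in right if row[right_key] == key],
        )
        for key in keys
    ]


# COGROUP keeps one bag per input; unmatched keys get an empty bag
# instead of being dropped, unlike an inner JOIN.
cogrouped = cogroup(visits, 1, pages, 0)

unioned = visits + more_visits                          # UNION: duplicates are kept
crossed = [v + p for v, p in product(visits, pages)]    # CROSS: every pairing
ordered = sorted(pages, key=lambda row: row[1])         # ORDER pages BY rank
distinct = list(dict.fromkeys(visits))                  # DISTINCT: drop duplicates

print(cogrouped)
print(len(unioned), len(crossed), ordered, distinct)
```

The COGROUP result also shows how it relates to JOIN: flattening the two bags of each output tuple into their cross product yields the joined tuples for that key.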
  17. Part 3: Implementation
  18. Implementation: Highlights
      Works on top of the Hadoop ecosystem:
        – The current implementation uses Hadoop as its execution platform.
      On-the-fly compilation:
        – Pig translates the Pig Latin commands into Map and Reduce methods.
      Lazy-style language:
        – Pig tries to postpone data materialization (writes to disk) as long as possible.
  19. Implementation: Building the logical plan
      Query parsing:
        – The Pig interpreter parses the commands, verifying that the input files and bags they reference are valid.
      On-the-fly compilation:
        – Pig compiles the logical plan for a bag into a physical plan (map-reduce statements) when the command can no longer be delayed and must be executed.
      Lazy characteristics:
        – No processing is carried out while the logical plan is being built.
        – Processing is triggered only when the user invokes the STORE command on a bag.
        – Lazy execution permits in-memory pipelining and other interesting optimizations.
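The lazy behaviour described above can be illustrated with a toy Python class. `LazyBag` and its methods are hypothetical names invented for this sketch: each command only appends a step to a plan, and rows are processed (pipelined through generators) only when `store()` is called. It is not how Pig is implemented internally.

```python
class LazyBag:
    """A bag described by a chain of pending steps; nothing runs until store()."""

    def __init__(self, source, steps=()):
        self.source = source
        self.steps = tuple(steps)

    def filter(self, predicate):
        # Only records the step in the plan; no row is touched here.
        return LazyBag(self.source, self.steps + (("FILTER", predicate),))

    def foreach(self, fn):
        return LazyBag(self.source, self.steps + (("FOREACH", fn),))

    def store(self):
        # Execution is triggered here. filter() and map() are lazy iterators,
        # so each row flows through all steps without intermediate lists.
        rows = iter(self.source)
        for kind, fn in self.steps:
            rows = filter(fn, rows) if kind == "FILTER" else map(fn, rows)
        return list(rows)


calls = []  # records every row the predicate actually inspects


def is_popular(row):
    calls.append(row)
    return row[1] > 0.5


pages = LazyBag([("cnn.com", 0.8), ("nytimes.com", 0.6), ("elmundotoday.com", 0.2)])
popular_urls = pages.filter(is_popular).foreach(lambda row: row[0])

calls_before_store = len(calls)   # the plan exists, but nothing has run yet
result = popular_urls.store()     # now the whole plan executes in one pass

print(calls_before_store, result)
```

Deferring execution in this way is what gives the system the chance to see the whole plan before running it, for example to pipeline FILTER and FOREACH in a single pass.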
  20. Implementation: Map-Reduce plan compilation
      (CO)GROUP:
        – Each command is compiled into a distinct map-reduce job with its own map and reduce functions.
        – Parallelism is achieved because the output of multiple map instances is repartitioned in parallel to multiple reduce instances.
      LOAD:
        – Parallelism is obtained because Pig operates over files residing in the Hadoop distributed file system.
      FILTER/FOREACH:
        – Parallelism is automatic, because a map-reduce job runs several map and reduce instances in parallel.
      ORDER (compiled into two map-reduce jobs):
        – First: determines quantiles of the sort key.
        – Second: partitions the data according to the quantiles and sorts each partition locally in the reduce phase, resulting in a globally sorted file.
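The two-job ORDER strategy can be sketched in a single Python process. The function names, the use of random sampling to estimate the quantiles, and the choice of four reducers are all assumptions for illustration; the slides only state that quantiles are determined first and used for partitioning second.

```python
import bisect
import random


def sample_quantiles(keys, num_reducers, sample_size=100, seed=0):
    """Job 1 (sketch): estimate num_reducers - 1 cut points from a sample of the keys."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    return [sample[len(sample) * i // num_reducers] for i in range(1, num_reducers)]


def range_partition_sort(rows, key, cuts):
    """Job 2 (sketch): send each row to the reducer owning its key range, then sort locally."""
    partitions = [[] for _ in range(len(cuts) + 1)]
    for row in rows:
        # bisect_right finds which key range (reducer) the row belongs to.
        partitions[bisect.bisect_right(cuts, key(row))].append(row)
    # Each "reducer" sorts only its own partition.
    return [sorted(part, key=key) for part in partitions]


rows = [(f"url{i}", (i * 37) % 101) for i in range(50)]
cuts = sample_quantiles([row[1] for row in rows], num_reducers=4)
parts = range_partition_sort(rows, key=lambda row: row[1], cuts=cuts)

# Concatenating the reducer outputs in order gives a globally sorted result,
# because every key in partition i is smaller than every key in partition i + 1.
globally_sorted = [row for part in parts for row in part]
print([len(part) for part in parts])
```

Choosing the cut points from quantiles rather than fixed ranges keeps the partitions roughly balanced, so no single reducer ends up sorting most of the data.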
  21. Part 4: Conclusions
  22. Conclusions
      Advantages:
        – Step-by-step syntax.
        – Flexible: UDFs, not locked to a fixed schema (allows schema changes over time).
        – Exposes a set of widely used functions: FOREACH, FILTER, ORDER, GROUP, ...
        – Takes advantage of Hadoop's native properties, such as parallelism, load balancing and fault tolerance.
        – Debugging environment.
        – Open Source (IMPORTANT!!)
      Disadvantages:
        – UDFs can be a source of performance loss (their efficiency is in the user's hands).
        – Overhead while compiling Pig Latin into map-reduce jobs.
      Usage scenarios:
        – Temporal analysis: analyzing search logs mainly involves studying how the search query distribution changes over time.
        – Session analysis: web user sessions, i.e., sequences of page views and clicks made by users, are analyzed to calculate metrics such as:
            – How long is the average user session?
            – How many links does a user click on before leaving a website?
        – Others, ...
  23. Q&A
  24. JOIN vs COGROUP
  25. FLATTEN
