Introduction to pig


Transcript

  • 1. Introduction to Pig
    Xiafei.qiu@PCA
  • 2. Nested Data Model
    Field, Tuple, Bag, Map
  • 3. Normal Operators
    Arithmetic Operators
    X = FOREACH A GENERATE f1, f2, f1 % f2;
    Boolean Operators
    X = FILTER A BY (f1 == 8) OR (NOT (f2+f3 > f1));
    Cast operators
    X = FOREACH B GENERATE group, (chararray) COUNT(A) AS total;
    Comparison Operators
    X = FILTER A BY (f1 matches '.*apache.*') OR (NOT (f2+f3 > f1));
    Flatten Operator
    Tuple: removes a level of nesting
    Bag: removes a level of nesting; may produce a cross product
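The cross-product remark about flattening bags can be illustrated with a small plain-Java sketch. It does not use Pig's classes; `FlattenSketch`, `flattenOne`, and `flattenTwo` are hypothetical names, and bags are modelled as ordinary lists.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration of FLATTEN on bags; not Pig's implementation.
final class FlattenSketch {

    // Flattening one bag: each bag element yields one output row that
    // repeats the non-bag field next to it.
    static List<List<Object>> flattenOne(Object key, List<Object> bag) {
        List<List<Object>> out = new ArrayList<>();
        for (Object item : bag) {
            out.add(Arrays.asList(key, item));
        }
        return out;
    }

    // Flattening two bags in the same row pairs every element of the first
    // bag with every element of the second, i.e. a cross product.
    static List<List<Object>> flattenTwo(List<Object> bagA, List<Object> bagB) {
        List<List<Object>> out = new ArrayList<>();
        for (Object a : bagA) {
            for (Object b : bagB) {
                out.add(Arrays.asList(a, b));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Object> names = Arrays.<Object>asList("John", "Joe");
        List<Object> gpas = Arrays.<Object>asList(4.0f, 3.8f);
        System.out.println(flattenOne(18, names));
        System.out.println(flattenTwo(names, gpas));
    }
}
```

With bags of sizes m and n, `flattenTwo` returns m × n rows, which is why flattening several bags at once can inflate the output.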
  • 4. Normal Operators
  • 5. Relational Operators
    LOAD a bag of tuples
    A = LOAD 'data' [USING function] [AS schema]; 
    STORE
    STORE alias INTO 'directory' [USING function];
    FOREACH tuple in the bag, produce a new tuple
    A = FOREACH queries GENERATE uid, expandQuery(query);
    FILTER a bag to produce a subset of it
    A = FILTER queries BY uid != 'bot' OR notBot(uid);
  • 6. Relational Operators
    COGROUP/GROUP one or more relations (at most 127)
    alias = GROUP … BY …, … BY …
    {group: int, A: {name: chararray,age: int,gpa: float}}
    (18,{(John,18,4.0F),(Joe,18,3.8F)})
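The shape of the GROUP output above (a key plus a bag of the whole input tuples) can be sketched in plain Java. This is an illustration only, not Pig code: `GroupSketch`, `Student`, and `groupByAge` are hypothetical names, and the third row in `main` is made-up data added to show a second group.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration of what GROUP A BY age produces.
final class GroupSketch {

    // One input tuple of relation A: (name, age, gpa).
    static final class Student {
        final String name;
        final int age;
        final float gpa;

        Student(String name, int age, float gpa) {
            this.name = name;
            this.age = age;
            this.gpa = gpa;
        }

        @Override
        public String toString() {
            return "(" + name + "," + age + "," + gpa + "F)";
        }
    }

    // One entry per distinct key; the value is the bag of complete tuples
    // that share that key (the key field is kept inside each tuple).
    static Map<Integer, List<Student>> groupByAge(List<Student> a) {
        Map<Integer, List<Student>> groups = new LinkedHashMap<>();
        for (Student s : a) {
            groups.computeIfAbsent(s.age, k -> new ArrayList<>()).add(s);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Student> a = Arrays.asList(
                new Student("John", 18, 4.0f),
                new Student("Joe", 18, 3.8f),
                new Student("Mary", 19, 3.9f)); // hypothetical extra row
        System.out.println(groupByAge(a));
    }
}
```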
  • 7. Relational Operators
    JOIN (inner/outer)
    Replicated Joins
    One or more relations are small enough to fit into main memory.
    Skewed Joins
    Computes a histogram of the key space and uses it to allocate reducers for a given key.
    Merge Joins
    Inputs are already sorted on the join key; the join is performed in the map phase.
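The idea behind a replicated join can be sketched in plain Java: index the small relation in memory, then stream the large relation past it without any shuffle. This is a simplified illustration, not Pig's implementation; `ReplicatedJoinSketch`, the row layout `[key, value]`, and the sample data are all assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration of an in-memory (replicated-style) inner join.
final class ReplicatedJoinSketch {

    // Each row is [key, value]; output rows are [key, bigValue, smallValue].
    static List<String[]> join(List<String[]> big, List<String[]> small) {
        // Build phase: the small relation is held entirely in memory.
        Map<String, List<String[]>> index = new HashMap<>();
        for (String[] s : small) {
            index.computeIfAbsent(s[0], k -> new ArrayList<>()).add(s);
        }
        // Probe phase: stream the large relation and look up each key.
        List<String[]> out = new ArrayList<>();
        for (String[] b : big) {
            List<String[]> matches = index.get(b[0]);
            if (matches == null) {
                continue; // inner join: unmatched rows are dropped
            }
            for (String[] s : matches) {
                out.add(new String[] {b[0], b[1], s[1]});
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> queries = Arrays.asList(
                new String[] {"u1", "q1"},
                new String[] {"u2", "q2"},
                new String[] {"u1", "q3"});
        List<String[]> users = Arrays.asList(
                new String[] {"u1", "Ann"},
                new String[] {"u3", "Bob"});
        for (String[] row : join(queries, users)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

The approach only works when the indexed side really fits in memory, which is the condition the slide states.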
  • 8. Relational Operators
  • 9. Relational Operators
    ORDER alias BY field DESC/ASC
    Not a stable sort: the relative order of tuples with equal keys is not guaranteed
    SPLIT alias INTO alias IF …, alias IF …
    CROSS
    cross product
    X = CROSS A, B;
    DISTINCT
    Removes duplicate tuples in a relation.
    X = DISTINCT A;
    LIMIT
    X = LIMIT A 3;
    SAMPLE
    X = SAMPLE alias size; (size is a sampling fraction between 0 and 1)
    IMPORT
    Imports macros defined in a separate .pig file
    DEFINE
    Define a Pig macro.
  • 10. Built In Eval Function
    AVG/MAX/MIN/SUM
    Computed over a single column of a bag; GROUP the relation first
    COUNT/COUNT_STAR
    Number of elements in a bag; COUNT_STAR also counts null values
    CONCAT
    DIFF
    IsEmpty
    SIZE
    TOKENIZE
  • 11. Other Built In Function
    Load/Store Functions
    Math Functions
    String Functions
  • 12. Map-Reduce Plan Compilation
    Compile each GROUP into a distinct Map-Reduce job
    Push commands between LOAD and GROUP to the map side
    Commands between subsequent GROUPs Gi and Gi+1 are pushed into the reduce side of Gi
  • 13. Map-Reduce Plan Compilation
    ORDER is compiled into two map-reduce jobs.
    MR1: sample the key space
    MR2: sort
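The two-job scheme for ORDER can be sketched in plain Java as two phases: sample the keys to choose range boundaries, then route each key to its range, sort each range, and concatenate. This is a single-process illustration under simplifying assumptions (integer keys, non-empty input, a fixed every-n-th sample); `OrderSketch`, `boundaries`, and `sortByRange` are hypothetical names and do not reflect Pig's actual sampler or partitioner.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Plain-Java illustration of sample-then-sort; assumes keys is non-empty.
final class OrderSketch {

    // Phase 1 (stands in for MR1): sample every n-th key and pick
    // partitions - 1 cut points from the sorted sample.
    static int[] boundaries(List<Integer> keys, int partitions, int sampleEvery) {
        List<Integer> sample = new ArrayList<>();
        for (int i = 0; i < keys.size(); i += sampleEvery) {
            sample.add(keys.get(i));
        }
        Collections.sort(sample);
        int[] cuts = new int[partitions - 1];
        for (int p = 1; p < partitions; p++) {
            cuts[p - 1] = sample.get(p * sample.size() / partitions);
        }
        return cuts;
    }

    // Phase 2 (stands in for MR2): send each key to its range partition,
    // sort every partition, and concatenate partitions in range order.
    static List<Integer> sortByRange(List<Integer> keys, int[] cuts) {
        List<List<Integer>> parts = new ArrayList<>();
        for (int p = 0; p <= cuts.length; p++) {
            parts.add(new ArrayList<>());
        }
        for (int k : keys) {
            int p = 0;
            while (p < cuts.length && k >= cuts[p]) {
                p++;
            }
            parts.get(p).add(k);
        }
        List<Integer> out = new ArrayList<>();
        for (List<Integer> part : parts) {
            Collections.sort(part);
            out.addAll(part);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> keys = Arrays.asList(5, 3, 9, 1, 7, 2, 8, 4);
        int[] cuts = boundaries(keys, 3, 2);
        System.out.println(Arrays.toString(cuts));
        System.out.println(sortByRange(keys, cuts));
    }
}
```

Sampling first lets the ranges be chosen so that each sorter receives a comparable share of the keys, while the concatenated output is still globally ordered.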
  • 14. User Defined Function
    Simple Eval Function
    public class UPPER extends EvalFunc<String> {
        public String exec(Tuple input) throws IOException {
            // .......
        }
    }
  • 15. User Defined Function
    Aggregate Functions
    Algebraic Interface
    For functions that can be computed incrementally in a distributed fashion.
    Accumulator Interface
    Designed to decrease memory usage.
  • 16. Accumulator Interface
    public interface Accumulator<T> {
        public void accumulate(Tuple b) throws IOException;
        public T getValue();
        public void cleanup();
    }
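How the three methods cooperate can be sketched with a local stand-in interface. The sketch below does not depend on Pig: its nested `Accumulator` takes a `List<Long>` batch instead of a Pig `Tuple`, and `AccumulatorSketch` and `SumAccumulator` are hypothetical names. The point is that values arrive in batches, so the whole bag never has to be in memory at once.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration of the accumulate / getValue / cleanup cycle.
final class AccumulatorSketch {

    // Local stand-in with the same three-method shape as the slide's
    // interface; the real one receives a Pig Tuple, not a List<Long>.
    interface Accumulator<T> {
        void accumulate(List<Long> batch);
        T getValue();
        void cleanup();
    }

    // Keeps only a running total between batches.
    static final class SumAccumulator implements Accumulator<Long> {
        private long sum = 0;

        @Override
        public void accumulate(List<Long> batch) {
            for (long v : batch) {
                sum += v;
            }
        }

        @Override
        public Long getValue() {
            return sum;
        }

        @Override
        public void cleanup() {
            sum = 0; // reset state before the next key is processed
        }
    }

    public static void main(String[] args) {
        SumAccumulator acc = new SumAccumulator();
        acc.accumulate(Arrays.asList(1L, 2L)); // first batch of one key's values
        acc.accumulate(Arrays.asList(3L));     // next batch of the same key
        System.out.println(acc.getValue());
        acc.cleanup();
    }
}
```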
  • 17. Aggregate Functions
    public interface Algebraic {
        public String getInitial();
        public String getIntermed();
        public String getFinal();
    }
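The three stages named by the interface can be illustrated with an average computed from partial (sum, count) pairs. This plain-Java sketch models each stage as a static method; in Pig the interface methods return the names of classes that implement the stages, which is not reproduced here. `AlgebraicAvgSketch`, `initial`, `intermed`, and `finalStep` are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration of an algebraic (incrementally computable) average.
final class AlgebraicAvgSketch {

    // Initial stage (map side): turn one value into a partial {sum, count}.
    static double[] initial(double value) {
        return new double[] {value, 1};
    }

    // Intermediate stage (combiner): merge partials into one partial.
    // It may run zero or more times, so its output has the same shape
    // as its input.
    static double[] intermed(List<double[]> partials) {
        double sum = 0;
        double count = 0;
        for (double[] p : partials) {
            sum += p[0];
            count += p[1];
        }
        return new double[] {sum, count};
    }

    // Final stage (reduce side): merge what is left and emit the average.
    static double finalStep(List<double[]> partials) {
        double[] total = intermed(partials);
        return total[0] / total[1];
    }

    public static void main(String[] args) {
        // Two combiners each see a slice of the values.
        double[] left = intermed(Arrays.asList(initial(1), initial(2)));
        double[] right = intermed(Arrays.asList(initial(3), initial(6)));
        System.out.println(finalStep(Arrays.asList(left, right)));
    }
}
```

Carrying (sum, count) rather than a running average is what makes the merge order irrelevant, so the work can be split across mappers and combiners.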