Introduction to Pig

  1. Introduction to Pig
     Xiafei.qiu@PCA
  2. Nested Data Model
     - Field, Tuple, Bag, Map
  3. Normal Operators
     - Arithmetic operators
         X = FOREACH A GENERATE f1, f2, f1 % f2;
     - Boolean operators
         X = FILTER A BY (f1 == 8) OR (NOT (f2+f3 > f1));
     - Cast operators
         X = FOREACH B GENERATE group, (chararray) COUNT(A) AS total;
     - Comparison operators
         X = FILTER A BY (f1 matches '.*apache.*') OR (NOT (f2+f3 > f1));
     - Flatten operator
         Tuple: removes a level of nesting.
         Bag: removes a level of nesting, which may produce a cross product (each bag element is paired with the other fields of the tuple).
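To make the cross-product remark about flattening bags concrete, here is a minimal plain-Java sketch. It does not use any Pig API: `FlattenSketch`, `flattenBag`, and `flattenTwoBags` are hypothetical names, and tuples are modelled as lists of strings purely for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only: models FLATTEN on bags with plain lists, not Pig types.
class FlattenSketch {
    // Flattening one bag pairs the remaining field (key) with every bag element.
    static List<List<String>> flattenBag(String key, List<String> bag) {
        List<List<String>> out = new ArrayList<>();
        for (String item : bag) {
            out.add(Arrays.asList(key, item));
        }
        return out;
    }

    // Flattening two bags of the same tuple emits every combination of their
    // elements, i.e. the cross product of the two bags.
    static List<List<String>> flattenTwoBags(String key, List<String> bag1, List<String> bag2) {
        List<List<String>> out = new ArrayList<>();
        for (String x : bag1) {
            for (String y : bag2) {
                out.add(Arrays.asList(key, x, y));
            }
        }
        return out;
    }
}
```

With a bag of 2 elements and a bag of 3 elements, `flattenTwoBags` returns 6 tuples, which is why flattening large bags can blow up the output size.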
  4. Normal Operators
  5. Relational Operators
     - LOAD: loads a bag of tuples
         A = LOAD 'data' [USING function] [AS schema];
     - STORE: writes a relation to storage
         STORE alias INTO 'directory' [USING function];
     - FOREACH: for each tuple in the bag, produces a new tuple
         A = FOREACH queries GENERATE uid, expandQuery(query);
     - FILTER: filters a bag to produce a subset of it
         A = FILTER queries BY uid neq 'bot' OR notBot(uid);
  6. Relational Operators
     - COGROUP/GROUP: groups one or more relations (at most 127)
         alias = GROUP … BY …, … BY …
     - Resulting schema and a sample output tuple:
         {group: int, A: {name: chararray, age: int, gpa: float}}
         (18,{(John,18,4.0F),(Joe,18,3.8F)})
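The shape of a GROUP result (one tuple per key, holding a bag of the matching input tuples) can be sketched in plain Java as follows. This is an illustration under assumptions, not Pig code: `GroupSketch`, `Student`, and `groupByAge` are hypothetical names chosen to mirror the (name, age, gpa) schema on the slide.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only: a map from group key to a list stands in for Pig's
// (group, bag) output tuples.
class GroupSketch {
    // One input tuple with the slide's schema (name, age, gpa).
    static final class Student {
        final String name;
        final int age;
        final float gpa;

        Student(String name, int age, float gpa) {
            this.name = name;
            this.age = age;
            this.gpa = gpa;
        }
    }

    // Groups tuples by age; each map entry corresponds to one output tuple
    // of the form (group, {bag of matching input tuples}).
    static Map<Integer, List<Student>> groupByAge(List<Student> relation) {
        Map<Integer, List<Student>> groups = new LinkedHashMap<>();
        for (Student s : relation) {
            groups.computeIfAbsent(s.age, k -> new ArrayList<>()).add(s);
        }
        return groups;
    }
}
```

Grouping John and Joe (both 18) yields a single entry for key 18 whose bag has two elements, matching the sample tuple on the slide.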
  7. Relational Operators
     - JOIN (inner/outer)
     - Replicated joins: one or more relations are small enough to fit into main memory.
     - Skewed joins: compute a histogram of the key space and use it to allocate reducers for a given key.
     - Merge joins: the inputs are already sorted, so the join is performed in the map phase.
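In Pig Latin a replicated join is requested with `USING 'replicated'` on the JOIN statement. The idea behind it, holding the small relation in memory and streaming the large one past it, can be sketched in plain Java. This is a simplified single-process illustration, not Pig's implementation; `ReplicatedJoinSketch` and `join` are hypothetical names, and each row is assumed to be a two-element array of key and value.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustration only: an in-memory hash join standing in for a map-side
// replicated join.
class ReplicatedJoinSketch {
    // Each row is {key, value}. The small relation is indexed in memory,
    // then the big relation is scanned once and probed against the index.
    static List<String> join(List<String[]> big, List<String[]> small) {
        Map<String, List<String>> index = new HashMap<>();
        for (String[] row : small) {
            index.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row[1]);
        }
        List<String> out = new ArrayList<>();
        for (String[] row : big) {
            List<String> matches = index.get(row[0]);
            if (matches == null) {
                continue; // inner join: unmatched rows are dropped
            }
            for (String m : matches) {
                out.add("(" + row[0] + "," + row[1] + "," + m + ")");
            }
        }
        return out;
    }
}
```

Because the index lives entirely in memory, this approach only works when the small relation really is small, which is the condition stated on the slide.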
  8. Relational Operators
  9. Relational Operators
     - ORDER alias BY field [ASC|DESC]
         Not stable: the relative order of tuples with equal keys is not guaranteed.
     - SPLIT alias INTO alias IF …, alias IF …
     - CROSS: computes the cross product
         X = CROSS A, B;
     - DISTINCT: removes duplicate tuples in a relation
         X = DISTINCT A;
     - LIMIT: limits the number of output tuples
         X = LIMIT A 3;
     - SAMPLE: selects a random sample; size is a fraction between 0 and 1
         SAMPLE alias size;
     - IMPORT: imports another .pig file
     - DEFINE: defines a Pig macro
  10. Built-In Eval Functions
      - AVG/MAX/MIN/SUM: operate on a single column of a bag; group the relation first
      - COUNT/COUNT_STAR: number of elements in a bag; COUNT_STAR also counts nulls
      - CONCAT
      - DIFF
      - IsEmpty
      - SIZE
      - TOKENIZE
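The difference between COUNT and COUNT_STAR comes down to how null values are treated. A minimal plain-Java sketch, assuming a bag is modelled as a list of nullable strings (`CountSketch`, `count`, and `countStar` are hypothetical names, not Pig classes):

```java
import java.util.List;

// Illustration only: a single column of a bag is modelled as a List<String>.
class CountSketch {
    // COUNT-like behaviour: null values are skipped.
    static long count(List<String> column) {
        long n = 0;
        for (String v : column) {
            if (v != null) {
                n++;
            }
        }
        return n;
    }

    // COUNT_STAR-like behaviour: every element is counted, nulls included.
    static long countStar(List<String> column) {
        return column.size();
    }
}
```

For a column holding two values and one null, `count` returns 2 while `countStar` returns 3.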
  11. Other Built-In Functions
      - Load/Store functions
      - Math functions
      - String functions
  12. Map-Reduce Plan Compilation
      - Each GROUP is compiled into a distinct Map-Reduce job.
      - Commands between LOAD and GROUP are pushed to the map side.
      - Commands between subsequent GROUPs Gi and Gi+1 are pushed into the reduce side of Gi.
  13. Map-Reduce Plan Compilation
      - ORDER is compiled into two Map-Reduce jobs.
          MR1: sample the key space
          MR2: sort
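The two-job scheme for ORDER can be sketched as a sample-then-range-partition sort. The plain-Java code below is a single-process illustration under assumptions, not Pig's implementation: `OrderSketch`, `splitPoints`, and `sortWithSplits` are hypothetical names, keys are integers, and the sample is assumed to be non-empty.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustration only: both "jobs" run in one process on in-memory lists.
class OrderSketch {
    // Stands in for MR1: derive partition boundaries from a sample of the keys.
    // Assumes a non-empty sample and partitions >= 1.
    static int[] splitPoints(List<Integer> sample, int partitions) {
        List<Integer> sorted = new ArrayList<>(sample);
        Collections.sort(sorted);
        int[] splits = new int[partitions - 1];
        for (int i = 1; i < partitions; i++) {
            splits[i - 1] = sorted.get(i * sorted.size() / partitions);
        }
        return splits;
    }

    // Stands in for MR2: route each key to its range partition, sort every
    // partition independently, then concatenate the partitions in order.
    static List<Integer> sortWithSplits(List<Integer> keys, int[] splits) {
        List<List<Integer>> parts = new ArrayList<>();
        for (int i = 0; i <= splits.length; i++) {
            parts.add(new ArrayList<>());
        }
        for (int k : keys) {
            int p = 0;
            while (p < splits.length && k >= splits[p]) {
                p++;
            }
            parts.get(p).add(k);
        }
        List<Integer> out = new ArrayList<>();
        for (List<Integer> part : parts) {
            Collections.sort(part);
            out.addAll(part);
        }
        return out;
    }
}
```

Sampling first lets the boundaries follow the actual key distribution, so the partitions sorted in the second step are roughly balanced even when keys are skewed.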
  14. User Defined Function
      - Simple Eval Function
          import java.io.IOException;
          import org.apache.pig.EvalFunc;
          import org.apache.pig.data.Tuple;

          public class UPPER extends EvalFunc<String> {
              public String exec(Tuple input) throws IOException {
                  // .......
              }
          }
  15. User Defined Function
      - Aggregate Functions
      - Algebraic interface: for functions that can be computed incrementally in a distributed fashion.
      - Accumulator interface: designed to decrease memory usage.
  16. Accumulator Interface
          public interface Accumulator<T> {
              public void accumulate(Tuple b) throws IOException;
              public T getValue();
              public void cleanup();
          }
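To show how the three methods cooperate, here is a self-contained sketch of a running sum. It does not depend on Pig: `BatchAccumulator` is a local stand-in that mirrors the interface on the slide, with a `List<Long>` batch replacing Pig's `Tuple`, and `SumAccumulator` is a hypothetical implementation.

```java
import java.io.IOException;
import java.util.List;

// Local stand-in for the slide's Accumulator<T>; the List<Long> parameter is
// an assumption that replaces Pig's Tuple so the sketch compiles on its own.
interface BatchAccumulator<T> {
    void accumulate(List<Long> batch) throws IOException;
    T getValue();
    void cleanup();
}

// Keeps only a running total, so memory use stays constant no matter how
// many batches are fed in.
class SumAccumulator implements BatchAccumulator<Long> {
    private long sum = 0;

    @Override
    public void accumulate(List<Long> batch) {
        for (long v : batch) {
            sum += v;
        }
    }

    @Override
    public Long getValue() {
        return sum;
    }

    @Override
    public void cleanup() {
        sum = 0; // reset state before the next key is processed
    }
}
```

The caller passes the values for one key in several batches, reads the result with `getValue()`, and then calls `cleanup()` before moving to the next key. That is how the interface avoids holding a whole bag in memory at once.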
  17. Aggregate Functions
          public interface Algebraic {
              public String getInitial();
              public String getIntermed();
              public String getFinal();
          }
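In Pig, each of the three methods returns the name of a class that implements one stage of the computation. The staging itself can be sketched in plain Java with COUNT as the example. This is an illustration under assumptions, not Pig code: `AlgebraicCountSketch` and its methods are hypothetical names, and each inner list stands in for the chunk of tuples seen by one map task.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: COUNT split into initial, intermediate, and final stages.
class AlgebraicCountSketch {
    // Initial stage: every input tuple contributes a partial count of 1.
    static long initial(String tuple) {
        return 1L;
    }

    // Intermediate stage: combine partial counts; may be applied repeatedly.
    static long intermediate(List<Long> partialCounts) {
        long total = 0;
        for (long c : partialCounts) {
            total += c;
        }
        return total;
    }

    // Final stage: combine the remaining partial counts into the result.
    static long finalCount(List<Long> partialCounts) {
        return intermediate(partialCounts);
    }

    // Drives the three stages over chunks of input, one chunk per simulated task.
    static long count(List<List<String>> chunks) {
        List<Long> perChunk = new ArrayList<>();
        for (List<String> chunk : chunks) {
            List<Long> ones = new ArrayList<>();
            for (String t : chunk) {
                ones.add(initial(t));
            }
            perChunk.add(intermediate(ones));
        }
        return finalCount(perChunk);
    }
}
```

Because partial counts can be summed in any grouping, most of the work can happen before the data reaches the final stage, which is what makes a function algebraic in this sense.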
