# Introduction to Pig

1. Introduction to Pig<br />Xiafei.qiu@PCA
2. Nested Data Model<br />Field, Tuple, Bag, Map
3. Normal Operators<br />Arithmetic operators<br />X = FOREACH A GENERATE f1, f2, f1 % f2;<br />Boolean operators<br />X = FILTER A BY (f1 == 8) OR (NOT (f2+f3 > f1));<br />Cast operators<br />X = FOREACH B GENERATE group, (chararray) COUNT(A) AS total;<br />Comparison operators<br />X = FILTER A BY (f1 matches '.*apache.*') OR (NOT (f2+f3 > f1));<br />Flatten operator<br />Tuple: removes a level of nesting<br />Bag: removes a level of nesting; may produce a cross product
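
    The note that flattening a bag "may produce a cross product" can be illustrated with a small sketch. This is not Pig's implementation: `FlattenSketch` and `flattenBag` are hypothetical names, and plain Java lists stand in for Pig tuples and bags. Each tuple in the bag is paired with the other fields of the enclosing tuple, so one input tuple yields one output row per bag element.

    ```java
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    // Hypothetical stand-in for FLATTEN applied to a bag; lists model tuples and bags.
    class FlattenSketch {
        // Pairs the non-bag fields (prefix) with every tuple in the bag.
        static List<List<Object>> flattenBag(List<Object> prefix, List<List<Object>> bag) {
            List<List<Object>> out = new ArrayList<>();
            for (List<Object> inner : bag) {
                List<Object> row = new ArrayList<>(prefix);
                row.addAll(inner); // one level of nesting is removed
                out.add(row);
            }
            return out; // an empty bag yields no rows
        }

        public static void main(String[] args) {
            // Models the tuple (a, {(b,c),(d,e)})
            List<List<Object>> bag = Arrays.asList(
                Arrays.<Object>asList("b", "c"),
                Arrays.<Object>asList("d", "e"));
            System.out.println(flattenBag(Arrays.<Object>asList("a"), bag));
        }
    }
    ```

    In this model, (a, {(b,c),(d,e)}) becomes the two rows (a,b,c) and (a,d,e).
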
4. Normal Operators
5. Relational Operators<br />LOAD: load a bag of tuples<br />A = LOAD 'data' [USING function] [AS schema];<br />STORE<br />STORE alias INTO 'directory' [USING function];<br />FOREACH: for each tuple in the bag, produce a new tuple<br />A = FOREACH queries GENERATE uid, expandQuery(query);<br />FILTER: filter a bag to produce a subset of it<br />A = FILTER queries BY uid neq 'bot' OR notBot(uid);
6. Relational Operators<br />COGROUP/GROUP: group one or more relations (up to 127)<br />alias = GROUP … BY …, … BY …<br />{group: int, A: {name: chararray, age: int, gpa: float}}<br />(18,{(John,18,4.0F),(Joe,18,3.8F)})
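
    The shape of the GROUP result above (a group key plus a bag of the original tuples) can be sketched in plain Java. This is only a model, not how Pig executes GROUP: `GroupSketch` and `groupBy` are hypothetical names, lists stand in for tuples and bags, and the `Mary` row is made-up data added so that a second group appears.

    ```java
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    // Hypothetical model of GROUP: key -> bag of the full original tuples.
    class GroupSketch {
        static Map<Object, List<List<Object>>> groupBy(List<List<Object>> relation, int keyIndex) {
            Map<Object, List<List<Object>>> groups = new LinkedHashMap<>();
            for (List<Object> tuple : relation) {
                // Each tuple is kept whole inside the bag for its key.
                groups.computeIfAbsent(tuple.get(keyIndex), k -> new ArrayList<>()).add(tuple);
            }
            return groups;
        }

        public static void main(String[] args) {
            List<List<Object>> a = Arrays.asList(
                Arrays.<Object>asList("John", 18, 4.0f),
                Arrays.<Object>asList("Mary", 19, 3.8f), // made-up row
                Arrays.<Object>asList("Joe", 18, 3.8f));
            System.out.println(groupBy(a, 1)); // group on the age field
        }
    }
    ```

    The entry for key 18 holds both the John and Joe tuples, matching the `(18,{(John,18,4.0F),(Joe,18,3.8F)})` example.
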
7. Relational Operators<br />JOIN (inner/outer)<br />Replicated joins<br />One or more relations are small enough to fit into main memory.<br />Skewed joins<br />Pig computes a histogram of the key space and uses it to allocate reducers for a given key.<br />Merge joins<br />Inputs are sorted; the join is performed in the map phase.
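
    The idea behind a replicated join can be sketched as an in-memory hash join: the small relation is loaded into a hash table, and the large relation is streamed past it. The sketch below is an illustration under simplifying assumptions, not Pig's code: `ReplicatedJoinSketch` is a hypothetical name, the join key is assumed to be field 0 of both inputs, and only the inner join is shown.

    ```java
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Hypothetical model of a replicated (in-memory hash) inner join on field 0.
    class ReplicatedJoinSketch {
        static List<List<Object>> join(List<List<Object>> large, List<List<Object>> small) {
            // Build phase: the small relation must fit in memory.
            Map<Object, List<List<Object>>> table = new HashMap<>();
            for (List<Object> s : small) {
                table.computeIfAbsent(s.get(0), k -> new ArrayList<>()).add(s);
            }
            // Probe phase: stream the large relation, no shuffle or sort needed.
            List<List<Object>> out = new ArrayList<>();
            for (List<Object> l : large) {
                List<List<Object>> matches = table.get(l.get(0));
                if (matches == null) continue; // inner join: unmatched rows are dropped
                for (List<Object> s : matches) {
                    List<Object> row = new ArrayList<>(l);
                    row.addAll(s); // output carries the fields of both inputs
                    out.add(row);
                }
            }
            return out;
        }
    }
    ```

    Because the probe needs only the hash table and one streamed tuple at a time, this style of join can run entirely on the map side.
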
8. Relational Operators
9. Relational Operators<br />ORDER alias BY field DESC/ASC<br />Not a stable sort<br />SPLIT alias INTO alias IF …, alias IF …<br />CROSS<br />Cross product<br />X = CROSS A, B;<br />DISTINCT<br />Removes duplicate tuples in a relation.<br />X = DISTINCT A;<br />LIMIT<br />X = LIMIT A 3;<br />SAMPLE<br />X = SAMPLE alias size;<br />IMPORT<br />Imports another .pig file<br />DEFINE<br />Defines a Pig macro.
10. Built-in Eval Functions<br />AVG/MAX/MIN/SUM<br />Operate on a single column of a bag; GROUP the relation first<br />COUNT/COUNT_STAR<br />Number of elements in a bag; COUNT_STAR also counts nulls<br />CONCAT<br />DIFF<br />IsEmpty<br />SIZE<br />TOKENIZE
11. Other Built-in Functions<br />Load/Store functions<br />Math functions<br />String functions
12. Map-Reduce Plan Compilation<br />Each GROUP is compiled into a distinct Map-Reduce job.<br />Commands between LOAD and GROUP are pushed to the map side.<br />Commands between subsequent GROUPs Gi and Gi+1 are pushed into the reduce side of Gi.
13. Map-Reduce Plan Compilation<br />ORDER is compiled into two Map-Reduce jobs.<br />MR1: sample the key space<br />MR2: sort
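
    The two-job scheme (sample first, then sort) can be sketched as sample-based range partitioning. The code below is an illustration only: `OrderSketch`, its method names, the every-`step`-th-key sampling rule, and the integer keys are all assumptions, not details taken from Pig. The first method plays the role of MR1 and picks partition boundaries from a sample; the second plays the role of MR2, routing each key to a range partition and sorting each partition, so that concatenating the partitions gives a total order.

    ```java
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    // Hypothetical model of ORDER as "sample the key space, then sort by range".
    class OrderSketch {
        // "MR1": sample every step-th key (step >= 1) and pick partitions-1 boundaries.
        static List<Integer> sampleBoundaries(List<Integer> keys, int step, int partitions) {
            List<Integer> sample = new ArrayList<>();
            for (int i = 0; i < keys.size(); i += step) sample.add(keys.get(i));
            Collections.sort(sample);
            List<Integer> bounds = new ArrayList<>();
            if (sample.isEmpty()) return bounds; // nothing to partition
            for (int p = 1; p < partitions; p++) {
                bounds.add(sample.get(p * sample.size() / partitions));
            }
            return bounds;
        }

        // "MR2": route keys to range partitions, sort each one, concatenate in order.
        static List<Integer> sort(List<Integer> keys, List<Integer> bounds) {
            List<List<Integer>> parts = new ArrayList<>();
            for (int p = 0; p <= bounds.size(); p++) parts.add(new ArrayList<>());
            for (int k : keys) {
                int p = 0;
                while (p < bounds.size() && k >= bounds.get(p)) p++;
                parts.get(p).add(k); // partition p holds keys in [bounds[p-1], bounds[p])
            }
            List<Integer> out = new ArrayList<>();
            for (List<Integer> part : parts) {
                Collections.sort(part); // each "reducer" sorts only its own range
                out.addAll(part);
            }
            return out;
        }
    }
    ```

    Sampling matters because the boundaries decide how evenly the keys spread across partitions; the result is correctly sorted whatever boundaries are chosen, but a poor sample leaves some partitions much larger than others.
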
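14. User Defined Function<br />Simple Eval Function<br />import java.io.IOException;<br />import org.apache.pig.EvalFunc;<br />import org.apache.pig.data.Tuple;<br />public class UPPER extends EvalFunc<String> {<br />  public String exec(Tuple input) throws IOException {<br />    // .......<br />  }<br />}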
15. User Defined Function<br />Aggregate Functions<br />Algebraic interface<br />Algebraic functions can be computed incrementally in a distributed fashion.<br />Accumulator interface<br />Designed to decrease memory usage.
16. Accumulator Interface<br />public interface Accumulator<T> {<br />  public void accumulate(Tuple b) throws IOException;<br />  public T getValue();<br />  public void cleanup();<br />}
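
    How the three methods fit together can be sketched with a running maximum. To stay self-contained, the example does not use Pig's classes: `BatchAccumulator` is a hypothetical, simplified stand-in for the interface on the slide, and a batch is a plain `List<Integer>` rather than a Pig `Tuple`. The point it illustrates is that the function sees the bag in pieces and keeps only a small running state between calls.

    ```java
    import java.util.Arrays;
    import java.util.List;

    // Hypothetical stand-in for the slide's interface; a batch replaces the Pig Tuple.
    interface BatchAccumulator<T> {
        void accumulate(List<Integer> batch); // called once per batch of the bag
        T getValue();                         // called after the last batch
        void cleanup();                       // resets state before the next key
    }

    class MaxAccumulator implements BatchAccumulator<Integer> {
        private Integer max = null; // the only state held between batches

        public void accumulate(List<Integer> batch) {
            for (Integer v : batch) {
                if (v != null && (max == null || v > max)) max = v;
            }
        }

        public Integer getValue() { return max; }

        public void cleanup() { max = null; }
    }

    class AccumulatorDemo {
        public static void main(String[] args) {
            MaxAccumulator acc = new MaxAccumulator();
            acc.accumulate(Arrays.asList(3, 8, 1)); // first batch of the bag
            acc.accumulate(Arrays.asList(5, 2));    // second batch
            System.out.println(acc.getValue());     // prints 8
            acc.cleanup();
        }
    }
    ```

    Memory use stays proportional to one batch plus the running state, rather than to the whole bag.
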
17. Aggregate Functions<br />public interface Algebraic {<br />  public String getInitial();<br />  public String getIntermed();<br />  public String getFinal();<br />}
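
    The initial, intermediate, and final stages named by this interface can be illustrated with an average, which cannot be combined directly but can be carried as a (sum, count) pair. This is a sketch of the idea only: `AlgebraicAvgSketch` and its static methods are hypothetical, and a `double[]` of length two stands in for the partial-result tuple that a real implementation would pass between stages.

    ```java
    import java.util.List;

    // Hypothetical model of an algebraic AVG split into three stages.
    class AlgebraicAvgSketch {
        // Initial stage: turn one input value into a partial (sum, count).
        static double[] initial(double value) {
            return new double[] {value, 1};
        }

        // Intermediate stage: merge partials; may run any number of times.
        static double[] intermediate(List<double[]> partials) {
            double sum = 0, count = 0;
            for (double[] p : partials) {
                sum += p[0];
                count += p[1];
            }
            return new double[] {sum, count};
        }

        // Final stage: merge the remaining partials and produce the answer.
        // With no input at all this yields NaN (0.0 / 0.0).
        static double finalStep(List<double[]> partials) {
            double[] total = intermediate(partials);
            return total[0] / total[1];
        }
    }
    ```

    For example, if one task combines the values 1 and 2 into (3, 2) and another combines 3 and 6 into (9, 2), the final stage merges them into (12, 4) and returns 3.0, the same result as averaging all four values at once.
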