
# Introduction to Pig

### Transcript

• 1. Introduction to Pig
Xiafei.qiu@PCA
• 2. Nested Data Model
Field, Tuple, Bag, Map
• 3. Normal Operators
Arithmetic Operators
X = FOREACH A GENERATE f1, f2, f1 % f2;
Boolean Operators
X = FILTER A BY (f1 == 8) OR (NOT (f2+f3 > f1));
Cast operators
X = FOREACH B GENERATE group, (chararray) COUNT(A) AS total;
Comparison Operators
X = FILTER A BY (f1 matches '.*apache.*') OR (NOT (f2+f3 > f1));
Flatten Operator
Tuple: remove a level of nesting
Bag: remove a level of nesting; may produce a cross product
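The cross-product effect of flattening bags can be illustrated outside Pig. The following is a minimal plain-Java sketch, not Pig code: it assumes a tuple is modeled as a `List<Object>` and a bag as a list of tuples, and the names `FlattenSketch` and `flattenBoth` are hypothetical. Flattening two bags in the same statement pairs every tuple of the first with every tuple of the second.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration in plain Java; this is not Pig's implementation.
class FlattenSketch {
    // Models flattening two bags side by side: each output row is the
    // concatenation of one tuple from bagA and one tuple from bagB.
    static List<List<Object>> flattenBoth(List<List<Object>> bagA,
                                          List<List<Object>> bagB) {
        List<List<Object>> rows = new ArrayList<>();
        for (List<Object> a : bagA) {
            for (List<Object> b : bagB) {
                List<Object> row = new ArrayList<>(a); // fields of a, un-nested
                row.addAll(b);                         // followed by fields of b
                rows.add(row);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        List<List<Object>> bagA = List.of(List.of(1), List.of(2));
        List<List<Object>> bagB = List.of(List.of("x"), List.of("y"), List.of("z"));
        // 2 tuples times 3 tuples gives 6 rows.
        System.out.println(flattenBoth(bagA, bagB));
    }
}
```

The output size is the product of the bag sizes, which is why flattening several large bags at once can be expensive.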
• 4. Normal Operators
• 5. Relational Operators
LOAD
A = LOAD 'data' [USING function] [AS schema];
STORE
STORE alias INTO 'directory' [USING function];
FOREACH: for each tuple in the bag, produce a new tuple
A = FOREACH queries GENERATE uid, expandQuery(query);
FILTER: filter a bag to produce a subset of it
A = FILTER queries BY uid != 'bot' OR notBot(uid);
• 6. Relational Operators
COGROUP/GROUP: group one or more relations (at most 127)
alias = GROUP … BY …, … BY …;
{group: int, A: {name: chararray,age: int,gpa: float}}
(18,{(John,18,4.0F),(Joe,18,3.8F)})
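The shape of the grouped output above can be mimicked in plain Java. This is a sketch under assumptions, not Pig code: tuples are modeled as `List<Object>`, and `GroupSketch` and `groupBy` are hypothetical names. Each map entry plays the role of one output tuple of the form (group, {bag of original tuples}).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration in plain Java; this is not Pig's implementation.
class GroupSketch {
    // Collects every tuple that shares the same value at keyIndex into one bag.
    static Map<Object, List<List<Object>>> groupBy(List<List<Object>> relation,
                                                   int keyIndex) {
        Map<Object, List<List<Object>>> groups = new LinkedHashMap<>();
        for (List<Object> tuple : relation) {
            groups.computeIfAbsent(tuple.get(keyIndex), k -> new ArrayList<>())
                  .add(tuple);
        }
        return groups;
    }

    public static void main(String[] args) {
        // John and Joe come from the slide; Mary is an added, made-up row.
        List<List<Object>> a = List.of(
                List.of("John", 18, 4.0f),
                List.of("Mary", 19, 3.9f),
                List.of("Joe", 18, 3.8f));
        System.out.println(groupBy(a, 1));
    }
}
```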
• 7. Relational Operators
JOIN (inner/outer)
Replicated Joins
one or more relations are small enough to fit into main memory.
Skewed Joins
computes a histogram of the key space and uses this data to allocate reducers for a given key.
Merge Joins
Inputs are already sorted on the join key; the join is performed in the map phase
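The idea behind a replicated join can be sketched in plain Java: the small relation is loaded into an in-memory hash table, and the large relation is streamed past it, so no shuffle on the join key is needed. This is an illustration only, assuming tuples are `List<Object>`; `ReplicatedJoinSketch` and `join` are hypothetical names, not Pig's API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration in plain Java; this is not Pig's implementation.
class ReplicatedJoinSketch {
    // Inner join of a large relation with a small one held fully in memory.
    static List<List<Object>> join(List<List<Object>> large, int largeKey,
                                   List<List<Object>> small, int smallKey) {
        // Build phase: index the small relation by its join key.
        Map<Object, List<List<Object>>> index = new HashMap<>();
        for (List<Object> s : small) {
            index.computeIfAbsent(s.get(smallKey), k -> new ArrayList<>()).add(s);
        }
        // Probe phase: stream the large relation and look up matches.
        List<List<Object>> out = new ArrayList<>();
        for (List<Object> l : large) {
            for (List<Object> s : index.getOrDefault(l.get(largeKey), List.of())) {
                List<Object> row = new ArrayList<>(l);
                row.addAll(s);
                out.add(row);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<Object>> large = List.of(List.of(1, "a"), List.of(2, "b"), List.of(3, "c"));
        List<List<Object>> small = List.of(List.of(1, "X"), List.of(3, "Z"));
        System.out.println(join(large, 0, small, 0));
    }
}
```

The sketch only works while the index fits in memory, which is the condition the slide states for replicated joins.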
• 8. Relational Operators
• 9. Relational Operators
ORDER alias BY field DESC/ASC
Unstable: tuples with equal keys may come back in any relative order
SPLIT alias INTO alias IF …, alias IF …
CROSS
cross product
X = CROSS A, B;
DISTINCT
Removes duplicate tuples in a relation.
X = DISTINCT A;
LIMIT
X = LIMIT A 3;
SAMPLE
SAMPLE alias size;
IMPORT
Import other .pig file
DEFINE
Define a Pig macro.
• 10. Built-In Eval Functions
AVG/MAX/MIN/SUM
on a single column of a bag; group it first
COUNT/ COUNT_STAR
number of elements in a bag; COUNT ignores nulls, COUNT_STAR includes them
CONCAT
DIFF
IsEmpty
SIZE
TOKENIZE
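The difference between COUNT and COUNT_STAR can be modeled in plain Java. This is a sketch, not Pig code: it assumes a bag is a list of `List<Object>` tuples and follows the documented rule that COUNT skips a tuple whose first field is null. `CountSketch`, `count`, and `countStar` are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration in plain Java; this is not Pig's implementation.
class CountSketch {
    // COUNT-like: a tuple is skipped when its first field is null.
    static long count(List<List<Object>> bag) {
        long n = 0;
        for (List<Object> tuple : bag) {
            if (!tuple.isEmpty() && tuple.get(0) != null) {
                n++;
            }
        }
        return n;
    }

    // COUNT_STAR-like: every tuple is counted, nulls included.
    static long countStar(List<List<Object>> bag) {
        return bag.size();
    }

    public static void main(String[] args) {
        List<List<Object>> bag = Arrays.asList(
                Arrays.asList((Object) "a"),
                Arrays.asList((Object) null),
                Arrays.asList((Object) "b"));
        System.out.println(count(bag) + " vs " + countStar(bag));
    }
}
```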
• 11. Other Built-In Functions
Math Functions
String Functions
• 12. Map-Reduce Plan Compilation
Each GROUP is compiled into a distinct Map-Reduce job
Commands between LOAD and the first GROUP are pushed into the map side
Commands between subsequent GROUPs Gi and Gi+1 are pushed into the reduce side of Gi
• 13. Map-Reduce Plan Compilation
ORDER is compiled into two map-reduce jobs.
MR1: sample the key space
MR2: sort
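The two-job plan for ORDER can be sketched in plain Java: a first pass samples the keys to choose range boundaries, and a second pass routes each key to its range, sorts each range, and concatenates the results. This is an illustration only. The sampling rule (every `step`-th key, with `step` at least 1) and the way boundaries are picked are assumptions made for this sketch, not Pig's actual strategy, and `OrderSketch`, `pickBoundaries`, and `rangeSort` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration in plain Java; this is not Pig's implementation.
class OrderSketch {
    // "MR1": sample the key space and pick numPartitions - 1 split points.
    static List<Integer> pickBoundaries(List<Integer> keys, int step, int numPartitions) {
        List<Integer> sample = new ArrayList<>();
        for (int i = 0; i < keys.size(); i += step) {
            sample.add(keys.get(i));
        }
        Collections.sort(sample);
        List<Integer> boundaries = new ArrayList<>();
        if (sample.isEmpty()) {
            return boundaries; // nothing to split
        }
        for (int p = 1; p < numPartitions; p++) {
            boundaries.add(sample.get(p * sample.size() / numPartitions));
        }
        return boundaries;
    }

    // "MR2": range-partition the keys, sort each partition, concatenate.
    static List<Integer> rangeSort(List<Integer> keys, List<Integer> boundaries) {
        List<List<Integer>> partitions = new ArrayList<>();
        for (int p = 0; p <= boundaries.size(); p++) {
            partitions.add(new ArrayList<>());
        }
        for (int key : keys) {
            int p = 0;
            while (p < boundaries.size() && key >= boundaries.get(p)) {
                p++;
            }
            partitions.get(p).add(key);
        }
        List<Integer> sorted = new ArrayList<>();
        for (List<Integer> part : partitions) {
            Collections.sort(part); // each "reducer" sorts only its own range
            sorted.addAll(part);
        }
        return sorted;
    }

    public static void main(String[] args) {
        List<Integer> keys = List.of(5, 3, 9, 1, 7, 2, 8);
        List<Integer> boundaries = pickBoundaries(keys, 2, 2);
        System.out.println(rangeSort(keys, boundaries));
    }
}
```

Because every key in one partition is smaller than every key in the next, concatenating the sorted partitions yields a total order without a single global sort.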
• 14. User Defined Function
Simple Eval Function
public class UPPER extends EvalFunc<String> {
    public String exec(Tuple input) throws IOException {
        // .......
    }
}
• 15. User Defined Function
Aggregate Functions
Algebraic Interface
For functions that can be computed incrementally in a distributed fashion.
Accumulator Interface
Designed to decrease memory usage.
• 16. Accumulator Interface
public interface Accumulator<T> {
    public void accumulate(Tuple b) throws IOException;
    public T getValue();
    public void cleanup();
}
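How the three methods cooperate can be shown with a stand-in that runs without Pig on the classpath. This is a sketch under assumptions: `BatchAccumulator` is a hypothetical simplification of the interface above that receives a plain `List<Integer>` per call instead of a Pig `Tuple`, and `MaxAccumulator` is a made-up example function. The point is that the input arrives in batches, so only the running state has to stay in memory.

```java
import java.util.List;

// Hypothetical stand-in for the interface on the slide; a batch is a plain
// List here, whereas Pig passes a Tuple.
interface BatchAccumulator<T> {
    void accumulate(List<Integer> batch);
    T getValue();
    void cleanup();
}

// Keeps only the running maximum between calls, never the whole input.
class MaxAccumulator implements BatchAccumulator<Integer> {
    private Integer max = null;

    public void accumulate(List<Integer> batch) {
        for (Integer v : batch) {
            if (v != null && (max == null || v > max)) {
                max = v;
            }
        }
    }

    public Integer getValue() {
        return max; // null when no value has been seen
    }

    public void cleanup() {
        max = null; // reset state before the next key
    }

    public static void main(String[] args) {
        MaxAccumulator acc = new MaxAccumulator();
        acc.accumulate(List.of(3, 9));
        acc.accumulate(List.of(4));
        System.out.println(acc.getValue());
        acc.cleanup();
    }
}
```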
• 17. Aggregate Functions
public interface Algebraic {
    public String getInitial();
    public String getIntermed();
    public String getFinal();
}
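In Pig, the three methods return the names of the classes that implement each stage. The staged computation itself can be sketched in plain Java using an average as the example. This is an illustration only, not Pig code: a partial result is modeled as a `long[]` holding (sum, count), and `AlgebraicAvgSketch`, `initial`, `intermediate`, and `finalStep` are hypothetical names.

```java
import java.util.List;

// Hypothetical illustration in plain Java; this is not Pig's implementation.
class AlgebraicAvgSketch {
    // Initial stage: turn one input value into a partial (sum, count).
    static long[] initial(long value) {
        return new long[] {value, 1};
    }

    // Intermediate stage: merge partials; may run zero or more times.
    static long[] intermediate(List<long[]> partials) {
        long sum = 0;
        long count = 0;
        for (long[] p : partials) {
            sum += p[0];
            count += p[1];
        }
        return new long[] {sum, count};
    }

    // Final stage: merge the remaining partials and produce the average.
    static Double finalStep(List<long[]> partials) {
        long[] total = intermediate(partials);
        return total[1] == 0 ? null : (double) total[0] / total[1];
    }

    public static void main(String[] args) {
        // Two simulated map tasks, each pre-aggregating its own values.
        long[] m1 = intermediate(List.of(initial(2), initial(4)));
        long[] m2 = intermediate(List.of(initial(9)));
        System.out.println(finalStep(List.of(m1, m2)));
    }
}
```

Carrying (sum, count) rather than a partial average is what makes the merge step correct regardless of how the input is split.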