Introduction to Pig

Introduction to Pig Xiafei.qiu@PCA

Nested Data Model
  Field, Tuple, Bag, Map

Normal Operators
  Arithmetic operators
    X = FOREACH A GENERATE f1, f2, f1 % f2;
  Boolean operators
    X = FILTER A BY (f1 == 8) OR (NOT (f2+f3 > f1));
  Cast operators
    X = FOREACH B GENERATE group, (chararray) COUNT(A) AS total;
  Comparison operators
    X = FILTER A BY (f1 matches '.*apache.*') OR (NOT (f2+f3 > f1));
  Flatten operator
    Tuple: removes a level of nesting
    Bag: removes a level of nesting; may cause a cross product
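The note that flattening a bag "may cause a cross product" can be illustrated with a small, Pig-free Java sketch. The class and method names (FlattenSketch, flattenOne, flattenTwo) are hypothetical, and rows are modeled as plain strings rather than Pig tuples; it only mimics the row-multiplication behaviour under those assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of Pig.
class FlattenSketch {
    // Flatten one bag next to a plain field: one output row per bag element.
    static List<String> flattenOne(String field, List<String> bag) {
        List<String> rows = new ArrayList<>();
        for (String item : bag) {
            rows.add("(" + field + "," + item + ")");
        }
        return rows;
    }

    // Flatten two bags side by side: every pairing appears,
    // so the row count is the product of the two bag sizes.
    static List<String> flattenTwo(List<String> bagA, List<String> bagB) {
        List<String> rows = new ArrayList<>();
        for (String a : bagA) {
            for (String b : bagB) {
                rows.add("(" + a + "," + b + ")");
            }
        }
        return rows;
    }
}
```

With bags of sizes 2 and 3, flattenTwo yields 6 rows, which is why flattening several large bags at once can blow up the output.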

Relational Operators
  LOAD: load a bag of tuples
    A = LOAD 'data' [USING function] [AS schema];
  STORE: write a relation to storage
    STORE alias INTO 'directory' [USING function];
  FOREACH: for each tuple in the bag, produce a new tuple
    A = FOREACH queries GENERATE uid, expandQuery(query);
  FILTER: filter a bag to produce a subset of it
    A = FILTER queries BY uid != 'bot' OR notBot(uid);

Relational Operators
  COGROUP/GROUP: group one or more relations (at most 127 at a time)
    alias = GROUP … BY …, … BY …;
  Schema and a sample tuple of a grouped relation:
    {group: int, A: {name: chararray,age: int,gpa: float}}
    (18,{(John,18,4.0F),(Joe,18,3.8F)})
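The shape of a grouped relation, a key paired with a bag of the original tuples, can be mimicked in plain Java. This is only a rough sketch: GroupSketch and Student are hypothetical stand-ins for a relation with schema (name, age, gpa), and nothing here uses Pig itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not part of Pig.
class GroupSketch {
    // Stand-in for a tuple with schema (name: chararray, age: int, gpa: float).
    static final class Student {
        final String name;
        final int age;
        final float gpa;

        Student(String name, int age, float gpa) {
            this.name = name;
            this.age = age;
            this.gpa = gpa;
        }
    }

    // Collect the tuples that share an age into one bag per key,
    // mirroring the {group, A: {...}} shape of a grouped relation.
    static Map<Integer, List<Student>> groupByAge(List<Student> students) {
        Map<Integer, List<Student>> grouped = new TreeMap<>();
        for (Student s : students) {
            grouped.computeIfAbsent(s.age, k -> new ArrayList<>()).add(s);
        }
        return grouped;
    }
}
```

Grouping John (18, 4.0) and Joe (18, 3.8) yields a single entry keyed by 18 whose bag holds both tuples, matching the sample tuple on the slide.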

Relational Operators
  JOIN (inner/outer)
  Replicated joins
    One or more relations are small enough to fit into main memory.
  Skewed joins
    Compute a histogram of the key space and use it to allocate reducers for a given key.
  Merge joins
    The inputs are already sorted, so the join is performed in the map phase.
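The idea behind a replicated join, holding the small relation in memory and streaming the large one past it, can be sketched as a plain hash join. ReplicatedJoinSketch and its row layout (field 0 is the join key, field 1 a value) are assumptions made for illustration; Pig's actual implementation is not shown here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of Pig.
class ReplicatedJoinSketch {
    // Inner join of a large relation against a small one held in memory.
    // Each row is {key, value}; each output row is {key, largeValue, smallValue}.
    static List<String[]> join(List<String[]> large, List<String[]> small) {
        // Build phase: index the small relation by join key.
        Map<String, List<String[]>> index = new HashMap<>();
        for (String[] row : small) {
            index.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row);
        }
        // Probe phase: stream the large relation; no shuffle or reduce step is needed.
        List<String[]> out = new ArrayList<>();
        for (String[] row : large) {
            List<String[]> matches = index.getOrDefault(row[0], Collections.emptyList());
            for (String[] match : matches) {
                out.add(new String[] {row[0], row[1], match[1]});
            }
        }
        return out;
    }
}
```

Because the probe needs only the in-memory index, each mapper could run it independently on its own slice of the large relation, which is why the small side has to fit into main memory.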

Relational Operators
  ORDER
    ORDER alias BY field DESC/ASC
    Unstable: the relative order of tuples with equal keys is not guaranteed.
  SPLIT
    SPLIT alias INTO alias IF …, alias IF …
  CROSS: cross product
    X = CROSS A, B;
  DISTINCT: removes duplicate tuples in a relation
    X = DISTINCT A;
  LIMIT
    X = LIMIT A 3;
  SAMPLE
    SAMPLE alias size;
  IMPORT: imports another .pig file
  DEFINE: defines a Pig macro

Built-In Eval Functions
  AVG/MAX/MIN/SUM
    Operate on a single column of a bag; group the relation first.
  COUNT/COUNT_STAR
    Number of elements in a bag; COUNT_STAR also counts nulls.
  CONCAT
  DIFF
  IsEmpty
  SIZE
  TOKENIZE
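The difference between COUNT and COUNT_STAR comes down to how nulls are treated. The sketch below models a bag as a list of Object[] tuples and assumes, as the Pig documentation describes, that COUNT skips a tuple whose first field is null. CountSketch and its method names are hypothetical.

```java
import java.util.List;

// Hypothetical illustration only; not part of Pig.
class CountSketch {
    // COUNT-like: a tuple is skipped when its first field is null.
    static long count(List<Object[]> bag) {
        long n = 0;
        for (Object[] tuple : bag) {
            if (tuple != null && tuple.length > 0 && tuple[0] != null) {
                n++;
            }
        }
        return n;
    }

    // COUNT_STAR-like: every tuple in the bag is counted.
    static long countStar(List<Object[]> bag) {
        return bag.size();
    }
}
```

For a bag holding ("a"), (null) and ("b"), count returns 2 while countStar returns 3.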

Other Built-In Functions
  Load/Store functions
  Math functions
  String functions

Map-Reduce Plan Compilation
  Each GROUP is compiled into a distinct Map-Reduce job.
  Commands between LOAD and GROUP are pushed into the map side.
  Commands between subsequent GROUPs Gi and Gi+1 are pushed into the reduce side of Gi.

Map-Reduce Plan Compilation
  ORDER is compiled into two Map-Reduce jobs.
    MR1: sample the key space
    MR2: sort
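The two-job structure of ORDER can be sketched in plain Java: a first pass samples keys to choose split points, and a second pass range-partitions by those points and sorts each partition. This is a simplified sketch under stated assumptions: it samples every step-th key deterministically, whereas a real sampler would likely be randomized, and OrderSketch with its methods is hypothetical rather than Pig's implementation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only; not part of Pig.
class OrderSketch {
    // Job 1 (sketch): sample every step-th key and pick partitions-1 split points.
    static List<Integer> splitPoints(List<Integer> keys, int step, int partitions) {
        List<Integer> sample = new ArrayList<>();
        for (int i = 0; i < keys.size(); i += Math.max(1, step)) {
            sample.add(keys.get(i));
        }
        Collections.sort(sample);
        List<Integer> points = new ArrayList<>();
        if (sample.isEmpty()) {
            return points;
        }
        for (int p = 1; p < partitions; p++) {
            points.add(sample.get(p * sample.size() / partitions));
        }
        return points;
    }

    // Job 2 (sketch): range-partition by the split points, sort each partition,
    // then concatenate the partitions in order to get a total order.
    static List<Integer> sort(List<Integer> keys, List<Integer> points) {
        List<List<Integer>> parts = new ArrayList<>();
        for (int p = 0; p <= points.size(); p++) {
            parts.add(new ArrayList<>());
        }
        for (int key : keys) {
            int p = 0;
            while (p < points.size() && key >= points.get(p)) {
                p++;
            }
            parts.get(p).add(key);
        }
        List<Integer> out = new ArrayList<>();
        for (List<Integer> part : parts) {
            Collections.sort(part);
            out.addAll(part);
        }
        return out;
    }
}
```

Sampling first lets the second job give each reducer a roughly equal key range, so every partition can be sorted independently and the outputs simply concatenated.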

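User Defined Function
  Simple Eval Function
    import java.io.IOException;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;
    public class UPPER extends EvalFunc<String> {
        // Called once per input tuple; returns the result for that tuple.
        public String exec(Tuple input) throws IOException {
            // .......
        }
    }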

User Defined Function
  Aggregate Functions
  Algebraic interface
    For functions that can be computed incrementally in a distributed fashion.
  Accumulator interface
    Designed to decrease memory usage.

Accumulator Interface
    public interface Accumulator<T> {
        public void accumulate(Tuple b) throws IOException;
        public T getValue();
        public void cleanup();
    }
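How the three methods cooperate to keep memory usage low can be shown with a Pig-free sketch. BatchAccumulator is a local, hypothetical stand-in that mirrors the method shape of the interface above but takes a List<Long> batch instead of a Pig Tuple, so the example compiles without Pig on the classpath. MaxAccumulator is likewise hypothetical.

```java
import java.util.List;

// Hypothetical stand-in mirroring the accumulate/getValue/cleanup shape;
// a List<Long> batch replaces Pig's Tuple.
interface BatchAccumulator<T> {
    void accumulate(List<Long> batch);
    T getValue();
    void cleanup();
}

// Hypothetical illustration only; not part of Pig.
class MaxAccumulator implements BatchAccumulator<Long> {
    // Only the running maximum is retained, never the whole bag.
    private Long max = null;

    // Called once per batch of values belonging to the same key.
    public void accumulate(List<Long> batch) {
        for (Long v : batch) {
            if (v != null && (max == null || v > max)) {
                max = v;
            }
        }
    }

    // Called after the last batch for the current key.
    public Long getValue() {
        return max;
    }

    // Resets state before the next key is processed.
    public void cleanup() {
        max = null;
    }
}
```

The caller would feed batches one at a time, read the result with getValue, and then call cleanup, so memory stays proportional to the batch size rather than to the full bag.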

Aggregate Functions
    public interface Algebraic {
        public String getInitial();
        public String getIntermed();
        public String getFinal();
    }
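The initial, intermediate and final stages named by this interface can be illustrated with an average computed from partial (sum, count) pairs. In Pig the three getters return the names of classes implementing each stage; the sketch below collapses those stages into static methods of a hypothetical AlgebraicAvgSketch class so it stays self-contained and Pig-free.

```java
import java.util.List;

// Hypothetical illustration only; not part of Pig.
class AlgebraicAvgSketch {
    // Initial stage: turn one input value into a partial {sum, count} pair.
    static double[] initial(double value) {
        return new double[] {value, 1};
    }

    // Intermediate stage: merge partial pairs into one pair.
    // It can be applied any number of times, for example in a combiner.
    static double[] intermed(List<double[]> partials) {
        double sum = 0;
        double count = 0;
        for (double[] p : partials) {
            sum += p[0];
            count += p[1];
        }
        return new double[] {sum, count};
    }

    // Final stage: merge the remaining partials and produce the average.
    // An empty input yields NaN in this sketch.
    static double finalStep(List<double[]> partials) {
        double[] total = intermed(partials);
        return total[0] / total[1];
    }
}
```

Averaging directly would require all values in one place; carrying (sum, count) pairs instead is what lets the work be split across mappers and combiners before a single final step.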
