    Introduction to Pig: Presentation Transcript

    • Introduction to Pig
      Xiafei.qiu@PCA
    • Nested Data Model
      Field, Tuple, Bag, Map
    • Normal Operators
      Arithmetic Operators
      X = FOREACH A GENERATE f1, f2, f1 % f2;
      Boolean Operators
      X = FILTER A BY (f1 == 8) OR (NOT (f2+f3 > f1));
      Cast operators
      X = FOREACH B GENERATE group, (chararray) COUNT(A) AS total;
      Comparison Operators
      X = FILTER A BY (f1 matches '.*apache.*') OR (NOT (f2+f3 > f1));
      Flatten Operator
      Tuple: remove a level of nesting
      Bag: remove a level of nesting; may produce a cross product
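The effect of flattening a bag can be sketched in plain Java. This is only an illustration under stated assumptions: bags and tuples are modelled as `java.util.List`, and `FlattenSketch`, `flattenBag` and `flattenTwoBags` are hypothetical names, not part of the Pig API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of FLATTEN; bags and tuples are plain lists here.
final class FlattenSketch {
    // Flattening the bag in a tuple (key, bag) emits one row per bag element.
    static List<List<Object>> flattenBag(Object key, List<Object> bag) {
        List<List<Object>> rows = new ArrayList<>();
        for (Object item : bag) {
            rows.add(Arrays.asList(key, item));
        }
        return rows;
    }

    // Flattening two bags in the same tuple pairs every element of the
    // first with every element of the second, i.e. a cross product.
    static List<List<Object>> flattenTwoBags(List<Object> left, List<Object> right) {
        List<List<Object>> rows = new ArrayList<>();
        for (Object l : left) {
            for (Object r : right) {
                rows.add(Arrays.asList(l, r));
            }
        }
        return rows;
    }
}
```

With a bag of 2 elements and a bag of 3 elements, `flattenTwoBags` returns 6 rows, which is why flattening several bags at once can blow up the output size.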
    • Relational Operators
      LOAD a bag of tuples
      A = LOAD 'data' [USING function] [AS schema]; 
      STORE
      STORE alias INTO 'directory' [USING function];
      FOREACH tuple in the bag, produce a new tuple
      A = FOREACH queries GENERATE uid, expandQuery(query);
      FILTER a bag to produce a subset of it
      A = FILTER queries BY uid neq 'bot' OR notBot(uid);
    • Relational Operators
      COGROUP/GROUP one or more relations (at most 127)
      alias = GROUP … BY …, … BY …
      {group: int, A: {name: chararray,age: int,gpa: float}}
      (18,{(John,18,4.0F),(Joe,18,3.8F)})
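The shape of the GROUP output above, a key plus a bag of the whole input tuples, can be mimicked in plain Java. This is a rough sketch, not Pig code: `GroupSketch`, `Student` and `groupByAge` are hypothetical names, and a bag is modelled as a `List`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model of B = GROUP A BY age.
final class GroupSketch {
    // Stand-in for a tuple with schema (name: chararray, age: int, gpa: float).
    static final class Student {
        final String name;
        final int age;
        final float gpa;

        Student(String name, int age, float gpa) {
            this.name = name;
            this.age = age;
            this.gpa = gpa;
        }
    }

    // Each distinct age becomes a group key; the value is the bag of
    // complete input tuples that carry that key.
    static Map<Integer, List<Student>> groupByAge(List<Student> a) {
        Map<Integer, List<Student>> groups = new TreeMap<>();
        for (Student s : a) {
            groups.computeIfAbsent(s.age, k -> new ArrayList<>()).add(s);
        }
        return groups;
    }
}
```

Feeding it the two tuples from the slide yields a single entry with key 18 whose bag holds both John and Joe.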
    • Relational Operators
      JOIN(inner/outer)
      Replicated Joins
      one or more relations are small enough to fit into main memory. 
      Skewed Joins
      computes a histogram of the key space and uses this data to allocate reducers for a given key. 
      Merge Joins
      Inputs are already sorted on the join key; the join is performed in the map phase
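The idea behind a replicated join can be sketched as an in-memory hash join: the small relation is indexed once, and the large relation is streamed past it. The code below is an illustration only, assuming rows are lists of strings joined on their first field; `ReplicatedJoinSketch` and `join` are hypothetical names, and this is not how Pig itself is implemented.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of an inner join where `small` fits in memory.
final class ReplicatedJoinSketch {
    static List<List<String>> join(List<List<String>> big, List<List<String>> small) {
        // Build an in-memory index of the small relation, keyed on field 0.
        Map<String, List<List<String>>> index = new HashMap<>();
        for (List<String> row : small) {
            index.computeIfAbsent(row.get(0), k -> new ArrayList<>()).add(row);
        }
        // Stream the big relation once and probe the index for each row.
        List<List<String>> out = new ArrayList<>();
        for (List<String> row : big) {
            List<List<String>> matches = index.get(row.get(0));
            if (matches == null) {
                continue; // inner join: rows without a match are dropped
            }
            for (List<String> match : matches) {
                List<String> joined = new ArrayList<>(row);
                joined.addAll(match);
                out.add(joined);
            }
        }
        return out;
    }
}
```

Because only the small side is held in memory, the large side never needs to be shuffled or sorted, which is what makes this strategy attractive when one input is tiny.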
    • Relational Operators
      ORDER alias BY field DESC/ASC
      Unstable: tuples with equal keys may come back in a different order from run to run
      SPLIT alias INTO alias IF …, alias IF …
      CROSS
      cross product
      X = CROSS A, B;
      DISTINCT
      Removes duplicate tuples in a relation.
      X = DISTINCT A;
      LIMIT
      X = LIMIT A 3;
      SAMPLE
      SAMPLE alias size; (size is a sampling fraction between 0 and 1)
      IMPORT
      Import macros defined in another .pig file
      DEFINE
      Define a Pig macro.
    • Built In Eval Function
      AVG/MAX/MIN/SUM
      Operate on a single column of a bag, so GROUP the relation first
      COUNT/COUNT_STAR
      Number of elements in a bag; COUNT skips nulls, COUNT_STAR includes them
      CONCAT
      DIFF
      IsEmpty
      SIZE
      TOKENIZE
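The difference between COUNT and COUNT_STAR comes down to null handling. The following Java sketch models it under the assumption that a tuple is an `Object[]` and that COUNT skips a tuple when its first field is null; `CountSketch`, `count` and `countStar` are hypothetical names, not Pig classes.

```java
import java.util.List;

// Hypothetical model of COUNT versus COUNT_STAR over a bag of tuples.
final class CountSketch {
    // COUNT-like: a tuple is skipped when its first field is null.
    static long count(List<Object[]> bag) {
        long n = 0;
        for (Object[] tuple : bag) {
            if (tuple.length > 0 && tuple[0] != null) {
                n++;
            }
        }
        return n;
    }

    // COUNT_STAR-like: every tuple is counted, nulls included.
    static long countStar(List<Object[]> bag) {
        return bag.size();
    }
}
```

For a bag holding ("a"), (null) and ("b"), the first method gives 2 and the second gives 3.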
    • Other Built In Function
      Load/Store Functions
      Math Functions
      String Functions
    • Map-Reduce Plan Compilation
      Compile each GROUP into a distinct Map-Reduce job
      Push commands between LOAD and GROUP to the map side
      Commands between consecutive GROUPs Gi and Gi+1 are pushed into the reduce side of Gi
    • Map-Reduce Plan Compilation
      ORDER is compiled into two map-reduce jobs.
      MR1: sample the key space
      MR2: sort
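The two steps above amount to a sample-based range partitioning followed by a per-partition sort. Below is a single-process Java sketch of that idea, not Pig's actual implementation: `OrderSketch`, `boundaries` and `sort` are hypothetical names, keys are plain integers, and the sample is assumed to be non-empty.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical model of ORDER as "sample the key space, then sort".
final class OrderSketch {
    // Step 1 (MR1): pick partitions - 1 split points from a sample of keys.
    static List<Integer> boundaries(List<Integer> sample, int partitions) {
        List<Integer> sorted = new ArrayList<>(sample);
        Collections.sort(sorted);
        List<Integer> cuts = new ArrayList<>();
        for (int i = 1; i < partitions; i++) {
            cuts.add(sorted.get(i * sorted.size() / partitions));
        }
        return cuts;
    }

    // Step 2 (MR2): route each key to its range partition, sort every
    // partition on its own, then concatenate the partitions in order.
    static List<Integer> sort(List<Integer> keys, List<Integer> cuts) {
        List<List<Integer>> parts = new ArrayList<>();
        for (int i = 0; i <= cuts.size(); i++) {
            parts.add(new ArrayList<>());
        }
        for (Integer k : keys) {
            int p = 0;
            while (p < cuts.size() && k >= cuts.get(p)) {
                p++;
            }
            parts.get(p).add(k);
        }
        List<Integer> out = new ArrayList<>();
        for (List<Integer> part : parts) {
            Collections.sort(part);
            out.addAll(part);
        }
        return out;
    }
}
```

Because the split points come from a sample, the partitions are only approximately balanced, but the concatenated output is fully sorted regardless of how good the sample is.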
    • User Defined Function
      Simple Eval Function
      public class UPPER extends EvalFunc<String> {
          public String exec(Tuple input) throws IOException {
              // .......
          }
      }
    • User Defined Function
      Aggregate Functions
      Algebraic Interface
      they can be computed incrementally in a distributed fashion.
      Accumulator Interface
      designed to decrease memory usage
    • Accumulator Interface
      public interface Accumulator<T> {
          public void accumulate(Tuple b) throws IOException;
          public T getValue();
          public void cleanup();
      }
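To show how the three methods cooperate, here is a small self-contained Java sketch of a counting accumulator. It deliberately avoids the Pig classes: `BatchAccumulator` is a local stand-in that mirrors the interface above but takes a `List` batch instead of a Pig `Tuple`, and `CountAccumulator` is a hypothetical example, not a Pig built-in.

```java
import java.util.List;

// Local stand-in for the interface above; a batch of the bag is a List here.
interface BatchAccumulator<T> {
    void accumulate(List<?> batch);
    T getValue();
    void cleanup();
}

// Hypothetical counter that sees the bag one batch at a time.
final class CountAccumulator implements BatchAccumulator<Long> {
    private long count = 0;

    // Called repeatedly with successive chunks of the same bag, so only
    // the running total is kept in memory, never the whole bag.
    @Override
    public void accumulate(List<?> batch) {
        count += batch.size();
    }

    // Called once all batches for the current key have been seen.
    @Override
    public Long getValue() {
        return count;
    }

    // Resets the state before the next key is processed.
    @Override
    public void cleanup() {
        count = 0;
    }
}
```

The memory saving comes from holding a single `long` rather than materializing every tuple of a large group.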
    • Aggregate Functions
      public interface Algebraic {
          public String getInitial();
          public String getIntermed();
          public String getFinal();
      }
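The three stages named by this interface correspond to an initial, an intermediate and a final step of an incremental computation. The sketch below models a count in that style with plain static methods. It is an assumption-laden illustration: `AlgebraicCountSketch`, `initial`, `intermed` and `finalCount` are hypothetical names, and in Pig the three getters return the class names of functions that play these roles rather than doing the work themselves.

```java
import java.util.List;

// Hypothetical three-stage count in the spirit of an algebraic function.
final class AlgebraicCountSketch {
    // Initial stage: each input record contributes a partial count of 1.
    static long initial(Object record) {
        return 1L;
    }

    // Intermediate stage: merge partial counts into one partial count.
    // It must be safe to apply this any number of times.
    static long intermed(List<Long> partials) {
        long sum = 0;
        for (long p : partials) {
            sum += p;
        }
        return sum;
    }

    // Final stage: merge the remaining partial counts into the result.
    static long finalCount(List<Long> partials) {
        return intermed(partials);
    }
}
```

Counting two records on one worker and one record on another, merging each worker's partials with `intermed`, and then calling `finalCount` on the two merged values gives 3, the same answer as counting everything in one place.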