SDEC2011 Essentials of Pig

Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC.
Copyright for all other & referenced work is retained by their respective owners.

Essentials of Pig
Mastering Hadoop Map-reduce for Data Analysis

Shashank Tiwari
blog: shanky.org | twitter: @tshanky
st@treasuryofideas.com


Session Agenda

• What is Pig and why should you use it?

• Installing & Setting up Pig

• Pig’s Components

• Using Pig with Hadoop MapReduce

• Summary & Conclusion


What is Pig?

• Higher-level abstraction for Hadoop MapReduce

• An infrastructure for data analysis using a scripting language named Pig Latin


Why should you use Pig?

• Hadoop MapReduce:

• Requires you to be a programmer

• Forces you to design all your algorithms in terms of the map and reduce primitives


Installing & Setting Up Pig -- Required Software

• Required Software:

• Java 1.6.x

• Hadoop 0.20.x

• Ant 1.7+ (for builds)

• JUnit 4.5 (for tests)

• Cygwin (on Windows)


Download

• Source: http://pig.apache.org/

• Version:

• 0.8.1 -- current stable


Install & Configure

• Extract: tar zxvf pig-0.8.1.tar.gz

• Move & Create Symbolic Link:

• ln -s pig-0.8.1 pig

• Edit: bin/pig

• export PIG_CLASSPATH=$HADOOP_HOME/conf


Verify Installation

• Verify:

(remember to start Hadoop first.)

• bin/pig -help (command options)

• bin/pig (run the grunt shell)


Running Pig

• Run Mode

• Local Mode -- single machine

• MapReduce Mode -- needs a Hadoop cluster (with HDFS)

• Run via:

• grunt shell

• pig scripts

• embedded programs


Pig IDE

• PigPen, an Eclipse-based IDE

• graphical data flow definition

• can show example data flow


Pig Components

• Pig Latin

• Pig Engine

• execution engine on top of Hadoop

• includes default optimal configurations


A client for your cluster

• Pig runs on the client machine, not on the nodes of the Hadoop cluster

• It connects to the cluster and submits MapReduce jobs to it


Pig Latin

• Data flow language (not declarative like SQL)

• Increases productivity (fewer lines do more)

• Includes standard operations like join, filter, group, sort

• User code and existing binaries can be included

• Supports nested data types

• Does not require metadata


Pig Latin Example

• Will leverage the tutorial that comes with the distribution

• Check the tutorial folder in the distribution


Start Grunt Shell

• cd $PIG_HOME

• bin/pig -x local


Aggregate Data

• grunt> log = LOAD 'tutorial/data/excite-small.log' AS (user, timestamp, query);

• alternate delimiters can be used and de-serializers like PigJsonLoader can be leveraged

• grunt> grouped = GROUP log BY user;

• grunt> counted = FOREACH grouped GENERATE group, COUNT(log);

• grunt> STORE counted INTO 'output';


Group Data


• In Pig, the GROUP operation generates (key, collection) pairs, where each collection is a bag of tuples.

• Every tuple in a bag shares the key of its (key, collection) pair
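
The (key, collection) shape described above can be mimicked outside Pig. The following is a minimal Python sketch, not Pig's implementation; the sample rows and the helper name `group_by` are assumptions made for illustration only.

```python
from collections import defaultdict

# Hypothetical sample rows shaped like (user, time, query); not taken from the tutorial data.
log = [
    ("u1", "970916105432", "pig latin"),
    ("u2", "970916105500", "hadoop"),
    ("u1", "970916105601", "mapreduce"),
]

def group_by(rows, key_index):
    """Mimic GROUP rows BY field: return (key, bag) pairs, where each bag holds full tuples."""
    bags = defaultdict(list)
    for row in rows:
        bags[row[key_index]].append(row)
    # Sorting is only for a deterministic printout; Pig does not promise an order here.
    return sorted(bags.items())

grouped = group_by(log, 0)
for key, bag in grouped:
    # Every tuple in the bag carries the same key as the pair itself.
    print(key, bag)

# Rough equivalent of: counted = FOREACH grouped GENERATE group, COUNT(log);
counted = [(key, len(bag)) for key, bag in grouped]
print(counted)
```

Note that the bag keeps whole tuples, including the key field, which is why COUNT can be applied to it directly.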


Filter Data

• grunt> log = LOAD 'tutorial/data/excite-small.log' AS (user, time, query);

• grunt> grouped = GROUP log BY user;

• grunt> counted = FOREACH grouped GENERATE group, COUNT(log) AS cnt;

• grunt> filtered = FILTER counted BY cnt > 75;

• grunt> STORE filtered INTO 'output1';


Order Data

• grunt> log = LOAD 'tutorial/data/excite-small.log' AS (user, time, query);

• grunt> grouped = GROUP log BY user;

• grunt> counted = FOREACH grouped GENERATE group, COUNT(log) AS cnt;

• grunt> filtered = FILTER counted BY cnt > 50;

• grunt> sorted = ORDER filtered BY cnt;

• grunt> STORE sorted INTO 'output2';
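
To make the count, filter, and order steps concrete, here is a small Python sketch of the same data flow. It only illustrates what each statement computes; the user ids and the threshold of 1 are assumptions chosen to keep the sample tiny (the slides use 50 and 75).

```python
from collections import Counter

# Hypothetical user ids standing in for the first column of the log.
users = ["u1", "u2", "u1", "u3", "u1", "u2"]

# GROUP log BY user, then COUNT(log) AS cnt
counted = list(Counter(users).items())

# FILTER counted BY cnt > 1 (assumed threshold for this small sample)
filtered = [(user, cnt) for user, cnt in counted if cnt > 1]

# ORDER filtered BY cnt (ascending)
sorted_rows = sorted(filtered, key=lambda row: row[1])

print(sorted_rows)
```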


Join Data Example

• Words appearing in Adventures of Huckleberry Finn by Mark Twain

• http://www.gutenberg.org/ebooks/76

• Words appearing in The Adventures of Sherlock Holmes by Sir Arthur Conan
Doyle

• http://www.gutenberg.org/ebooks/1661


Loading & Counting Huckleberry Finn Data

• grunt> A = load 'pg76.txt';

• grunt> B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;

• grunt> C = filter B by word matches '\\w+';

• grunt> D = group C by word;

• grunt> E = foreach D generate COUNT(C), group;

• grunt> store E into 'huckleberry_finn_freq';


Loading & Counting Sherlock Holmes Data

• grunt> A = load 'pg1661.txt';

• grunt> B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;

• grunt> C = filter B by word matches '\\w+';

• grunt> D = group C by word;

• grunt> E = foreach D generate COUNT(C), group;

• grunt> store E into 'sherlock_holmes_freq';
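
The tokenize, filter, group, and count steps used for both books can be sketched in Python as follows. This is an approximation under stated assumptions: the two input lines are made up, and the `tokenize` helper is a rough stand-in for TOKENIZE that splits on whitespace and a few punctuation characters.

```python
import re
from collections import Counter

# Hypothetical two-line input standing in for pg76.txt or pg1661.txt.
lines = ["You don't know about me", "about me, you know"]

def tokenize(line):
    """Rough stand-in for TOKENIZE: split on whitespace and a few punctuation characters."""
    return [tok for tok in re.split(r'[\s",()*]+', line) if tok]

# B: one word per row, like flatten(TOKENIZE(...))
words = [w for line in lines for w in tokenize(line)]

# C: keep a token only if the whole token consists of word characters
clean = [w for w in words if re.fullmatch(r"\w+", w)]

# D and E: group by word and count each group
freq = Counter(clean)

print(sorted(freq.items()))
```

The filter requires the entire token to match, so a token such as "don't" is dropped rather than trimmed. Counting is case-sensitive, as in the Pig statements.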


Join Data

• grunt> hf = LOAD 'huckleberry_finn_freq' AS (freq, word);

• grunt> sh = LOAD 'sherlock_holmes_freq' AS (freq, word);

• grunt> inboth = JOIN hf BY word, sh BY word;

• grunt> STORE inboth INTO 'output3';
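
As an illustration of what the JOIN above produces, here is a minimal Python sketch of an inner join on the word field. The sample frequencies and the helper name `inner_join` are assumptions for illustration; this is not how Pig executes the join.

```python
# Hypothetical (freq, word) rows for each book.
hf = [(12, "river"), (3, "raft"), (7, "the")]
sh = [(9, "the"), (4, "river"), (2, "violin")]

def inner_join(left, right, key_index):
    """Pair every left row with every right row that has the same key."""
    index = {}
    for row in right:
        index.setdefault(row[key_index], []).append(row)
    # Each output row carries all fields from both inputs, left fields first.
    return [l + r for l in left for r in index.get(l[key_index], [])]

inboth = inner_join(hf, sh, 1)
print(inboth)
```

Words that appear in only one book ("raft", "violin") produce no output row.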


Set Difference (A - B, in A but not in B)

• hf = LOAD 'huckleberry_finn_freq' AS (freq, word);

• sh = LOAD 'sherlock_holmes_freq' AS (freq, word);

• grouped = COGROUP hf BY word, sh BY word;

• not_in_hf = FILTER grouped BY COUNT(hf) == 0;

• out = FOREACH not_in_hf GENERATE FLATTEN(sh);

• STORE out INTO 'output4';


Cogroup Data

• Extends the idea of grouping to multiple collections

• Instead of (key, collection) pair, it now emits a key and a set of tuples from each of the multiple collections

• With two sources of input it would be (key, collection1, collection2), where tuples from the first source will be in collection1 and tuples from the second source will be in collection2.
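
The (key, collection1, collection2) shape, and how the set-difference script uses it, can be sketched in Python. The helper name `cogroup` and the sample rows are assumptions for illustration only.

```python
def cogroup(left, right, key_index):
    """Return (key, left_bag, right_bag) for every key seen in either input."""
    keys = sorted({row[key_index] for row in left} | {row[key_index] for row in right})
    return [
        (
            key,
            [row for row in left if row[key_index] == key],
            [row for row in right if row[key_index] == key],
        )
        for key in keys
    ]

# Hypothetical (freq, word) rows for each book.
hf = [(12, "river"), (3, "raft")]
sh = [(4, "river"), (2, "violin")]

grouped = cogroup(hf, sh, 1)

# FILTER grouped BY COUNT(hf) == 0, then GENERATE FLATTEN(sh):
# keep keys whose hf bag is empty and emit the sh tuples for those keys.
out = [row for _, hf_bag, sh_bag in grouped if len(hf_bag) == 0 for row in sh_bag]

print(grouped)
print(out)
```

A key that is missing from one input still appears in the result, with an empty bag on that side. That empty bag is what the set-difference filter tests for.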


Data types Supported

• int, long, float, double, chararray, bytearray

• map, tuple (ordered), bag (unordered)


Data type Declaration

• hf = LOAD 'huckleberry_finn_freq' AS (freq:int, word:chararray);

• explicit data type declaration

• hf = LOAD 'huckleberry_finn_freq' AS (freq, word);

• weighted = FOREACH hf GENERATE freq * 100;

• type inference, freq cast to int


Custom Extensions

• User-defined functions can be called from Pig scripts

• Nested operations can be carried out

• FOREACH grouped { sorted = ORDER hf BY counted;
  GENERATE group, CustomFunction(sorted); }

• Flow can be split: SPLIT A INTO Negative IF $0 < 0, Positive IF $0 > 0;
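
The SPLIT statement above routes each row to every output whose condition it satisfies. A minimal Python sketch of that behavior, using made-up single-field rows:

```python
# Hypothetical single-field rows; $0 in the SPLIT statement is the first field.
values = [(-3,), (0,), (5,), (-1,)]

# SPLIT A INTO Negative IF $0 < 0, Positive IF $0 > 0;
negative = [row for row in values if row[0] < 0]
positive = [row for row in values if row[0] > 0]

print(negative)
print(positive)
```

With these two conditions, a row whose first field is exactly 0 matches neither and so lands in neither output.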


Questions?

• blog: shanky.org | twitter: @tshanky

• st@treasuryofideas.com
