SDEC2011 Essentials of Pig

    SDEC2011 Essentials of Pig Presentation Transcript

    • Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC. Copyright for all other & referenced work is retained by their respective owners.
    • Essentials of Pig: Mastering Hadoop Map-reduce for Data Analysis
      • Shashank Tiwari
      • blog: shanky.org | twitter: @tshanky
      • st@treasuryofideas.com
    • Session Agenda
      • What is Pig and why should you use it?
      • Installing & Setting up Pig
      • Pig’s Components
      • Using Pig with Hadoop MapReduce
      • Summary & Conclusion
    • What is Pig?
      • Higher-level abstraction for Hadoop MapReduce
      • An infrastructure for data analysis using a scripting language named Pig Latin
    • Why should you use Pig?
      • Hadoop MapReduce:
        • Requires you to be a programmer
        • Forces you to design all your algorithms in terms of the map and reduce primitives
    • Installing & Setting Up Pig -- Required Software
      • Required Software:
        • Java 1.6.x
        • Hadoop 0.20.x
        • Ant 1.7+ (for builds)
        • JUnit 4.5 (for tests)
        • Cygwin (on Windows)
    • Download
      • Source: http://pig.apache.org/
      • Version:
        • 0.8.1 -- current stable
    • Install & Configure
      • Extract: tar zxvf pig-0.8.1.tar.gz
      • Move & Create Symbolic Link:
        • ln -s pig-0.8.1 pig
      • Edit: bin/pig
        • export PIG_CLASSPATH=$HADOOP_HOME/conf
    • Verify Installation
      • Verify: (remember to start Hadoop first)
        • bin/pig -help (command options)
        • bin/pig (run the grunt shell)
    • Running Pig
      • Run Mode
        • Local Mode -- single machine
        • MapReduce Mode -- needs a Hadoop cluster (with HDFS)
      • Run via:
        • grunt shell
        • pig scripts
        • embedded programs
    • Pig IDE
      • PigPen, an Eclipse-based IDE
        • graphical data flow definition
        • can show example data flow
    • Pig Components
      • Pig Latin
      • Pig Engine
        • execution engine on top of Hadoop
        • includes default optimal configurations
    • A client for your cluster
      • Pig does not run on a Hadoop cluster
      • It connects to one
    • Pig Latin
      • Data flow language (not declarative like SQL)
      • Increases productivity (fewer lines do more)
      • Includes standard operations like join, filter, group, sort
      • User code and existing binaries can be included
      • Supports nested data types
      • Does not require metadata
    • Pig Latin Example
      • The examples leverage the tutorial that comes with the distribution
      • Check the tutorial folder in the distribution
    • Start Grunt Shell
      • cd $PIG_HOME
      • bin/pig -x local
    • Aggregate Data
      • grunt> log = LOAD 'tutorial/data/excite-small.log' AS (user, timestamp, query);
        • alternate delimiters can be used and de-serializers like PigJsonLoader can be leveraged
      • grunt> grouped = GROUP log BY user;
      • grunt> counted = FOREACH grouped GENERATE group, COUNT(log);
      • grunt> STORE counted INTO 'output';
    • Group Data
      • grunt> grouped = GROUP log BY user;
      • In Pig, the group operation generates a (key, collection) pair, where the collection is itself a collection of tuples.
        • The key of the tuples is the same key as that of the (key, collection) pair
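The (key, collection) shape described on the Group Data slide can be modelled in a few lines of plain Python. This is only a sketch of the semantics, not how Pig executes the operation; `group_by` and the sample rows are made up for illustration.

```python
from collections import defaultdict

def group_by(relation, key_index):
    """Model Pig's GROUP: return (key, bag) pairs, each bag holding whole tuples."""
    bags = defaultdict(list)
    for row in relation:
        bags[row[key_index]].append(row)
    return list(bags.items())

# Made-up rows shaped like (user, timestamp, query).
log = [
    ("u1", "970916105432", "pig latin"),
    ("u2", "970916105500", "hadoop"),
    ("u1", "970916105601", "mapreduce"),
]

grouped = group_by(log, 0)                           # like: GROUP log BY user
counted = [(key, len(bag)) for key, bag in grouped]  # like: GENERATE group, COUNT(log)
```

Every tuple inside a bag still carries the user field, which is why the key appears both as the group key and inside each grouped tuple.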
    • Filter Data
      • grunt> log = LOAD 'tutorial/data/excite-small.log' AS (user, time, query);
      • grunt> grouped = GROUP log BY user;
      • grunt> counted = FOREACH grouped GENERATE group, COUNT(log) AS cnt;
      • grunt> filtered = FILTER counted BY cnt > 75;
      • grunt> STORE filtered INTO 'output1';
    • Order Data
      • grunt> log = LOAD 'tutorial/data/excite-small.log' AS (user, time, query);
      • grunt> grouped = GROUP log BY user;
      • grunt> counted = FOREACH grouped GENERATE group, COUNT(log) AS cnt;
      • grunt> filtered = FILTER counted BY cnt > 50;
      • grunt> sorted = ORDER filtered BY cnt;
      • grunt> STORE sorted INTO 'output2';
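As a rough illustration of what the FILTER and ORDER steps do to the counted relation, the following Python sketch applies the same two operations to a small in-memory list. The function names and the per-user counts are hypothetical.

```python
def filter_by_count(counted, threshold):
    """Model: FILTER counted BY cnt > threshold."""
    return [row for row in counted if row[1] > threshold]

def order_by_count(rows):
    """Model: ORDER rows BY cnt (ascending, which is Pig's default direction)."""
    return sorted(rows, key=lambda row: row[1])

# Made-up (user, cnt) pairs standing in for the output of the COUNT step.
counted = [("u1", 120), ("u2", 40), ("u3", 76), ("u4", 51)]

filtered = filter_by_count(counted, 50)
sorted_rows = order_by_count(filtered)
```

Each step yields a new relation and leaves its input untouched, mirroring the way every Pig Latin statement defines a new alias.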
    • Join Data Example
      • Words appearing in Adventures of Huckleberry Finn by Mark Twain
        • http://www.gutenberg.org/ebooks/76
      • Words appearing in The Adventures of Sherlock Holmes by Sir Arthur Conan Doyle
        • http://www.gutenberg.org/ebooks/1661
    • Loading & Counting Huckleberry Finn Data
      • grunt> A = load 'pg76.txt';
      • grunt> B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
      • grunt> C = filter B by word matches '\\w+';
      • grunt> D = group C by word;
      • grunt> E = foreach D generate COUNT(C), group;
      • grunt> store E into 'huckleberry_finn_freq';
    • Loading & Counting Sherlock Holmes Data
      • grunt> A = load 'pg1661.txt';
      • grunt> B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
      • grunt> C = filter B by word matches '\\w+';
      • grunt> D = group C by word;
      • grunt> E = foreach D generate COUNT(C), group;
      • grunt> store E into 'sherlock_holmes_freq';
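The tokenize, filter, group, and count pipeline used for both books can be approximated in Python as below. This is a sketch under assumptions: the delimiter set is meant to approximate TOKENIZE's defaults (whitespace, double quote, comma, parentheses, asterisk), `word_freq` is a made-up helper, and the sample lines are invented. Pig's `matches` must match the whole string, so `fullmatch` is used.

```python
import re
from collections import Counter

# Assumed approximation of TOKENIZE's default delimiters.
TOKEN_DELIMITERS = re.compile(r'[\s",()*]+')
# Whole-token match, like: word matches '\\w+'
WORD = re.compile(r"\w+", re.ASCII)

def word_freq(lines):
    """Return (freq, word) tuples, the same field order the Pig script stores."""
    counts = Counter()
    for line in lines:
        for token in TOKEN_DELIMITERS.split(line):
            if WORD.fullmatch(token):  # drops empty tokens and tokens with punctuation
                counts[token] += 1
    return [(freq, word) for word, freq in counts.items()]

sample = ["You don't know about me,", "you know"]
freqs = word_freq(sample)
```

Tokens such as `don't` are discarded rather than cleaned, because the filter keeps only tokens made up entirely of word characters. Counting is case-sensitive, as in the Pig script.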
    • Join Data
      • grunt> hf = LOAD 'huckleberry_finn_freq' AS (freq, word);
      • grunt> sh = LOAD 'sherlock_holmes_freq' AS (freq, word);
      • grunt> inboth = JOIN hf BY word, sh BY word;
      • grunt> STORE inboth INTO 'output3';
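To make the shape of the join output concrete, here is a minimal Python model of an inner join on the word field. The helper name and the frequency values are invented for illustration; the real output comes from the stored frequency files.

```python
from collections import defaultdict

def inner_join(left, right, left_key, right_key):
    """Model Pig's inner JOIN: concatenate every pair of tuples whose keys match."""
    index = defaultdict(list)
    for row in right:
        index[row[right_key]].append(row)
    return [l + r for l in left for r in index.get(l[left_key], [])]

# Made-up (freq, word) tuples for the two books.
hf = [(3, "river"), (5, "the"), (1, "raft")]
sh = [(7, "the"), (2, "detective"), (1, "river")]

inboth = inner_join(hf, sh, 1, 1)  # like: JOIN hf BY word, sh BY word
```

Each result tuple carries the fields of both inputs side by side, so the word appears twice: once from hf and once from sh.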
    • Set Difference (A - B, in A but not in B)
      • hf = LOAD 'huckleberry_finn_freq' AS (freq, word);
      • sh = LOAD 'sherlock_holmes_freq' AS (freq, word);
      • grouped = COGROUP hf BY word, sh BY word;
      • not_in_hf = FILTER grouped BY COUNT(hf) == 0;
      • out = FOREACH not_in_hf GENERATE FLATTEN(sh);
      • STORE out INTO 'output4';
      • Here A is sh and B is hf: the result holds words in Sherlock Holmes that do not appear in Huckleberry Finn
    • Cogroup Data
      • Extends the idea of grouping to multiple collections
      • Instead of a (key, collection) pair, it emits a key and a set of tuples from each of the multiple collections
        • With two sources of input it would be (key, collection1, collection2), where tuples from the first source will be in collection1 and tuples from the second source will be in collection2.
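The (key, collection1, collection2) output of COGROUP, and the set difference built on it, can be sketched in Python as follows. The `cogroup` helper and the sample frequencies are hypothetical; the point is only to show that a key missing from one input gets an empty collection on that side.

```python
def cogroup(left, right, left_key, right_key):
    """Model Pig's COGROUP on two inputs: (key, bag_from_left, bag_from_right)."""
    groups = {}
    for row in left:
        groups.setdefault(row[left_key], ([], []))[0].append(row)
    for row in right:
        groups.setdefault(row[right_key], ([], []))[1].append(row)
    return [(key, bag1, bag2) for key, (bag1, bag2) in groups.items()]

# Made-up (freq, word) tuples for the two books.
hf = [(3, "river"), (5, "the"), (1, "raft")]
sh = [(7, "the"), (2, "detective"), (1, "river")]

grouped = cogroup(hf, sh, 1, 1)
# like: FILTER grouped BY COUNT(hf) == 0
not_in_hf = [g for g in grouped if len(g[1]) == 0]
# like: GENERATE FLATTEN(sh)
out = [row for _, _, sh_bag in not_in_hf for row in sh_bag]
```

Filtering on the empty left-hand collection is what turns the cogroup into a set difference: only keys that occur in sh alone survive.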
    • Data Types Supported
      • int, long, double, chararray, bytearray
      • map, tuple (ordered), bag (unordered)
    • Data Type Declaration
      • hf = LOAD 'huckleberry_finn_freq' AS (freq:int, word:chararray);
        • explicit data type declaration
      • hf = LOAD 'huckleberry_finn_freq' AS (freq, word);
      • weighted = FOREACH hf GENERATE freq * 100;
        • type inference, freq cast to int
    • Custom Extensions
      • User defined functions can be called from Pig scripts
      • Nested operations can be carried out
        • FOREACH grouped { sorted = ORDER hf BY counted;
        • GENERATE group, CustomFunction(sorted); }
      • Flow can be split: SPLIT A INTO Negative IF $0 < 0, Positive IF $0 > 0;
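The SPLIT statement on the Custom Extensions slide routes each tuple to every output whose condition it satisfies. A small Python model makes one consequence visible: a tuple that satisfies no condition is dropped. The `split` helper and the sample relation are made up for this sketch.

```python
def split(relation, **conditions):
    """Model Pig's SPLIT: each named output keeps the tuples matching its condition."""
    return {
        name: [row for row in relation if predicate(row)]
        for name, predicate in conditions.items()
    }

# Made-up single-field relation; row[0] plays the role of $0.
A = [(-2,), (0,), (5,), (-7,)]

parts = split(
    A,
    Negative=lambda row: row[0] < 0,  # like: Negative IF $0 < 0
    Positive=lambda row: row[0] > 0,  # like: Positive IF $0 > 0
)
```

With these two conditions the tuple `(0,)` lands in neither output. Overlapping conditions would send one tuple to several outputs.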
    • Questions?
      • blog: shanky.org | twitter: @tshanky
      • st@treasuryofideas.com