Whirlwind tour of Pig
Chris Wilkes
cwilkes@seattlehadoop.org

30 minute talk on Pig given at http://seattlehadoop.org/ on 2010-05-17

Transcript of "Pig Introduction to Pig"

1. Whirlwind tour of Pig (Chris Wilkes, cwilkes@seattlehadoop.org)
2. Agenda: 1 Why Pig? 2 Data types 3 Operators 4 UDFs 5 Using Pig
3. Agenda: 1 Why Pig? 2 Data types 3 Operators 4 UDFs 5 Using Pig
4. Why Pig? Tired of boilerplate
   • Started off writing Mappers/Reducers in Java
   • Fun at first
   • Gets a little tedious
   • Need to do more than one MR step
     • Write own flow control
     • Utility classes to pass parameters / input paths
   • Go back and change a Reducer's input type
     • Did you change it in the Job setup?
   • Processing two different input types in first job
5. Why Pig? Java MapReduce boilerplate example
   • Typical use case: have two different input types
     • log files (timestamps and userids)
     • database table dump (userids and names)
   • Want to combine the two together
   • Relatively simple, but tedious
6. Why Pig? Java MapReduce boilerplate example
   Need to handle two different output types, so you need a single class that can handle both, designated with a "tag":
     Mapper<LongWritable,Text,TaggedKeyWritable,TaggedValueWritable>
     Reducer<TaggedKeyWritable,TaggedValueWritable,Text,PurchaseInfoWritable>
   Inside the Mapper, check in setup() or run() the Path of the input to decide if this is a log file or a database table:
     if (context.getInputSplit().getPath().contains("logfile")) { inputType = "LOGFILE"; }
     else if (...) { inputType = "DATABASE"; }
   Reducer: check the tag and then combine:
     if (key.getTag().equals("LOGFILE")) { LogEntry logEntry = value.getValue(); }
     else if (key.getTag().equals("DATABASE")) { UserInfo userInfo = value.getValue(); }
     context.write(userInfo.getId(), logEntry.getTime() + " " + userInfo.getName());
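   For contrast, the same log/user combine is a few lines of Pig. A minimal sketch, assuming tab-separated inputs; the paths and field names here are illustrative, not from the deck:
     -- assumed inputs: 'logfile' (userid, time) and 'database' (userid, name)
     logs = LOAD 'logfile' AS (userid: int, time: chararray);
     users = LOAD 'database' AS (userid: int, name: chararray);
     joined = JOIN logs BY userid, users BY userid;
     report = FOREACH joined GENERATE logs::userid, logs::time, users::name;
     STORE report INTO 'combined';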
7. Where's your shears?
   "I was working on my thesis and realized I needed a reference. I'd seen a post on comp.arch recently that cited a paper, so I fired up gnus. While I was searching for the post, I came across another post whose MIME encoding screwed up my ancient version of gnus, so I stopped and downloaded the latest version of gnus.
8. Agenda: 1 Why Pig? 2 Data types 3 Operators 4 UDFs 5 Using Pig
9. Data Types
   • From largest to smallest:
     • Bag (relation / group)
     • Tuple
     • Field
   • A bag is a collection of tuples, tuples have fields
10. Data Types: Bag
    $ cat logs
    101 1002 10.09
    101 8912 5.96
    102 1002 10.09
    103 8912 5.96
    103 7122 88.99

    $ cat groupbooks.pig
    logs = LOAD 'logs' AS (userid: int, bookid: long, price: double);
    bookbuys = GROUP logs BY bookid;
    DESCRIBE bookbuys;
    DUMP bookbuys;

    $ pig -x local groupbooks.pig
    bookbuys: {group: long,logs: {userid: int,bookid: long,price: double}}
    (1002L,{(101,1002L,10.09),(102,1002L,10.09)})
    (7122L,{(103,7122L,88.99)})
    (8912L,{(101,8912L,5.96),(103,8912L,5.96)})
    (Slide callouts label each output line as a tuple, the {...} as the inner bag, and the values inside as fields.)
11. Data Types: Tuple and Fields
    $ cat booksexpensive.pig
    logs = LOAD 'logs' AS (userid: int, bookid: long, price: double);
    bookbuys = GROUP logs BY bookid;
    expensive = FOREACH bookbuys {
      inside = FILTER logs BY price > 6.0;  -- "logs" here refers to the inner bag
      GENERATE inside;
    }
    DESCRIBE expensive;
    DUMP expensive;

    $ pig -x local booksexpensive.pig
    expensive: {inside: {userid: int,bookid: long,price: double}}
    ({(101,1002L,10.09),(102,1002L,10.09)})
    ({(103,7122L,88.99)})
    ({})
    Note: can always refer to $0, $1, etc
12. Agenda: 1 Why Pig? 2 Data types 3 Operators 4 UDFs 5 Using Pig
13. Operator: Load
    This will load all files under the logs/2010/05 directory (or the logs/2010/05 file) and put them into clicklogs:
      clicklogs = LOAD 'logs/2010/05';
    Names the fields in the tuple "userid" and "url" instead of having to refer to them as $0 and $1:
      clicklogs = LOAD 'logs/2010/05' as (userid: int, url: chararray);
    Note: no actual loading occurs till a dump/store command is executed.
14. Operator: Load
    By default splits on the tab character (the same as the key/value separator in MapReduce jobs). Can also specify your own delimiter:
      LOAD 'logs' USING PigStorage('~')
    PigStorage implements LoadFunc -- implement this interface to create your own loader, ie "RegExLoader" from the Piggybank.
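   As a quick sketch of a custom delimiter in use (the file name and schema are assumptions for illustration):
     -- 'tildelogs' is a hypothetical file with lines like: 101~1002~10.09
     logs = LOAD 'tildelogs' USING PigStorage('~')
            AS (userid: int, bookid: long, price: double);
     DUMP logs;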
15. Operator: Describe, Dump, and Store
    "Describe" prints out that variable's schema:
      DESCRIBE combotimes;
      combotimes: {group: chararray, enter: {time: chararray,userid: chararray}, exit: {time: chararray,userid: chararray,cost: double}}
    To see output on the screen type "dump varname":
      DUMP namesandaddresses;
    To output to a file / directory use store:
      STORE patienttrials INTO 'trials/2010';
16. Operator: Group
    $ cat starnames
    1 Mintaka
    2 Alnitak
    3 Epsilon Orionis

    $ cat starpositions
    1 R.A. 05h 32m 0.4s, Dec. -00 17' 57"
    2 R.A. 05h 40m 45.5s, Dec. -01 56' 34"
    3 R.A. 05h 36m 12.8s, Dec. -01 12' 07"

    $ cat starsandpositions.pig
    names = LOAD 'starnames' as (id: int, name: chararray);
    positions = LOAD 'starpositions' as (id: int, position: chararray);
    nameandpos = GROUP names BY id, positions BY id;
    DESCRIBE nameandpos;
    DUMP nameandpos;

    nameandpos: {group: int,names: {id: int,name: chararray},positions: {id: int,position: chararray}}
    (1,{(1,Mintaka)},{(1,R.A. 05h 32m 0.4s, Dec. -00 17' 57")})
    (2,{(2,Alnitak)},{(2,R.A. 05h 40m 45.5s, Dec. -01 56' 34")})
    (3,{(3,Epsilon Orionis)},{(3,R.A. 05h 36m 12.8s, Dec. -01 12' 07")})
17. Operator: Join
    Just like GROUP but flatter
    $ cat starsandpositions2.pig
    names = LOAD 'starnames' as (id: int, name: chararray);
    positions = LOAD 'starpositions' as (id: int, position: chararray);
    nameandpos = JOIN names BY id, positions BY id;
    DESCRIBE nameandpos;
    DUMP nameandpos;

    nameandpos: {names::id: int,names::name: chararray,positions::id: int,positions::position: chararray}
    (1,Mintaka,1,R.A. 05h 32m 0.4s, Dec. -00 17' 57")
    (2,Alnitak,2,R.A. 05h 40m 45.5s, Dec. -01 56' 34")
    (3,Epsilon Orionis,3,R.A. 05h 36m 12.8s, Dec. -01 12' 07")
18. Operator: Flatten
    Ugly looking output from before:
      expensive: {inside: {userid: int,bookid: long,price: double}}
      ({(101,1002L,10.09),(102,1002L,10.09)})
      ({(103,7122L,88.99)})
    Use the FLATTEN operator:
      expensive = FOREACH bookbuys {
        inside = FILTER logs BY price > 6.0;
        GENERATE group, FLATTEN(inside);
      }
      expensive: {group: long,inside::userid: int,inside::bookid: long,inside::price: double}
      (1002L,101,1002L,10.09)
      (1002L,102,1002L,10.09)
      (7122L,103,7122L,88.99)
19. Operator: Renaming in Foreach
    All columns with cumbersome names:
      expensive: {group: long,inside::userid: int,inside::bookid: long,inside::price: double}
    Pick and rename:
      expensive = FOREACH bookbuys {
        inside = FILTER logs BY price > 6.0;
        GENERATE group AS userid, FLATTEN(inside.(bookid, price)) AS (bookid, price);
      }
    Kept the type! Now easy to use:
      expensive: {userid: long,bookid: long,price: double}
      (1002L,1002L,10.09)
      (1002L,1002L,10.09)
      (7122L,7122L,88.99)
20. Operator: Split
    When input file mixes types or needs separation
    $ cat enterexittimes
    2010-05-10 12:55:12 user123 enter
    2010-05-10 13:14:23 user456 enter
    2010-05-10 13:16:53 user123 exit 23.79
    2010-05-10 13:17:49 user456 exit 0.50

    inandout = LOAD 'enterexittimes';
    SPLIT inandout INTO enter1 IF $2 == 'enter', exit1 IF $2 == 'exit';

    enter1:
    (2010-05-10 12:55:12,user123,enter)
    (2010-05-10 13:14:23,user456,enter)
    exit1:
    (2010-05-10 13:16:53,user123,exit,23.79)
    (2010-05-10 13:17:49,user456,exit,0.50)
21. Operator: Split
    If each line had the same schema it could be specified on load; in this case we need a FOREACH:
      enter = FOREACH enter1 GENERATE (chararray)$0 AS time:chararray, (chararray)$1 AS userid:chararray;
      exit = FOREACH exit1 GENERATE (chararray)$0 AS time:chararray, (chararray)$1 AS userid:chararray, (double)$3 AS cost:double;
      DESCRIBE enter;
      DESCRIBE exit;

      enter: {time: chararray,userid: chararray}
      exit: {time: chararray,userid: chararray,cost: double}
22. Operator: Sample, Limit
    For testing purposes sample both large inputs:
      names1 = LOAD 'starnames' as (id: int, name: chararray);
      names = SAMPLE names1 0.3;
      positions1 = LOAD 'starpositions' as (id: int, position: chararray);
      positions = SAMPLE positions1 0.3;
    Running returns random rows every time:
      (1,Mintaka,1,R.A. 05h 32m 0.4s, Dec. -00 17' 57")
    Limit only returns the first N results. Use with ORDER BY to return the top results:
      nameandpos1 = JOIN names BY id, positions BY id;
      nameandpos2 = ORDER nameandpos1 BY names::id DESC;
      nameandpos = LIMIT nameandpos2 2;
      (3,Epsilon Orionis,3,R.A. 05h 36m 12.8s, Dec. -01 12' 07")
      (2,Alnitak,2,R.A. 05h 40m 45.5s, Dec. -01 56' 34")
23. Agenda: 1 Why Pig? 2 Data types 3 Operators 4 UDFs 5 Using Pig
24. UDF
    UDF: User Defined Function
    Operates on single values or a group
    Simple example: IsEmpty (a FilterFunc):
      users = JOIN names BY id, addresses BY id;
      D = FOREACH users GENERATE group, FLATTEN((IsEmpty(names::firstName) ? 'none' : names::firstName))
    Working over an aggregate, ie COUNT:
      users = JOIN names BY id, books BY buyerId;
      D = FOREACH users GENERATE group, COUNT(books)
    Working on two values:
      distance1 = CROSS stars and stars;
      distance =
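   The slide's last example is cut off; one possible shape for it, as a sketch only. DistanceBetween, its class name, and geoudfs.jar are hypothetical, and CROSS is given two distinct aliases since Pig cannot cross an alias with itself:
     REGISTER geoudfs.jar;  -- hypothetical jar holding the UDF
     DEFINE DistanceBetween com.example.DistanceBetween();  -- hypothetical class
     stars = LOAD 'starpositions' as (id: int, position: chararray);
     stars2 = LOAD 'starpositions' as (id: int, position: chararray);
     distance1 = CROSS stars, stars2;
     distance = FOREACH distance1 GENERATE stars::id, stars2::id,
                DistanceBetween(stars::position, stars2::position);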
25. Agenda: 1 Why Pig? 2 Data types 3 Operators 4 UDFs 5 Using Pig
26. LOAD and GROUP
    logfiles = LOAD 'logs' AS (userid: int, bookid: long, price: double);
    userinfo = LOAD 'users' AS (userid: int, name: chararray);
    userpurchases = GROUP logfiles BY userid, userinfo BY userid;
    DESCRIBE userpurchases;
    DUMP userpurchases;
27. Inside {} are bags (unordered); inside () are tuples (ordered list of fields)
    report = FOREACH userpurchases GENERATE
      FLATTEN(userinfo.name) AS name,
      group AS userid,
      FLATTEN(SUM(logfiles.price)) AS cost;
    bybigspender = ORDER report BY cost DESC;
    DUMP bybigspender;

    (Bob,103,94.94999999999999)
    (Joe,101,16.05)
    (Cindy,102,10.09)
28. Entering and exiting recorded in same file:
    2010-05-10 12:55:12 user123 enter
    2010-05-10 13:14:23 user456 enter
    2010-05-10 13:16:53 user123 exit 23.79
    2010-05-10 13:17:49 user456 exit 0.50
29. inandout = LOAD 'enterexittimes';
    SPLIT inandout INTO enter1 IF $2 == 'enter', exit1 IF $2 == 'exit';
    enter = FOREACH enter1 GENERATE (chararray)$0 AS time:chararray, (chararray)$1 AS userid:chararray;
    exit = FOREACH exit1 GENERATE (chararray)$0 AS time:chararray, (chararray)$1 AS userid:chararray, (double)$3 AS cost:double;
30. combotimes = GROUP enter BY $1, exit BY $1;
    purchases = FOREACH combotimes GENERATE
      group AS userid,
      FLATTEN(enter.$0) AS entertime,
      FLATTEN(exit.$0) AS exittime,
      FLATTEN(exit.$2);
    DUMP purchases;
31. Schema for inandout, enter1, exit1 unknown.
    enter: {time: chararray,userid: chararray}
    exit: {time: chararray,userid: chararray,cost: double}
    combotimes: {group: chararray, enter: {time: chararray,userid: chararray}, exit: {time: chararray,userid: chararray,cost: double}}
    purchases: {userid: chararray,entertime: chararray, exittime: chararray,cost: double}
32. UDFs
    • User Defined Function
    • For doing an operation on data
    • Already use several builtins:
      • COUNT
      • SUM
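   As a small illustration of those builtins (a sketch reusing the logs schema from slide 10, not code from the deck):
     logs = LOAD 'logs' AS (userid: int, bookid: long, price: double);
     byuser = GROUP logs BY userid;
     totals = FOREACH byuser GENERATE
       group AS userid,
       COUNT(logs) AS purchases,  -- builtin COUNT over the inner bag
       SUM(logs.price) AS spent;  -- builtin SUM over the bag's price column
     DUMP totals;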