Introduction to Pig
30 minute talk on Pig given at http://seattlehadoop.org/ on 2010-05-17

Published in: Technology, Business
  • Transcript

    • 1. Whirlwind tour of Pig Chris Wilkes cwilkes@seattlehadoop.org
    • 2. Agenda 1 Why Pig? 2 Data types 3 Operators 4 UDFs 5 Using Pig
    • 3. Agenda 1 Why Pig? 2 Data types 3 Operators 4 UDFs 5 Using Pig
    • 4. Why Pig? Tired of boilerplate • Started off writing Mappers/Reducers in Java • Fun at first • Gets a little tedious • Need to do more than one MR step • Write own flow control • Utility classes to pass parameters / input paths • Go back and change a Reducer’s input type • Did you change it in the Job setup? • Processing two different input types in first job
    • 5. Why Pig? Java MapReduce boilerplate example • Typical use case: have two different input types • log files (timestamps and userids) • database table dump (userids and names) • Want to combine the two together • Relatively simple, but tedious
    • 6. Why Pig? Java MapReduce boilerplate example
      Need to handle two different input types with a single key/value class that can carry both, designated with a "tag":
        Mapper<LongWritable,Text,TaggedKeyWritable,TaggedValueWritable>
        Reducer<TaggedKeyWritable,TaggedValueWritable,Text,PurchaseInfoWritable>
      Inside the Mapper, check in setup() or run() the Path of the input to decide whether this is a log file or a database table:
        if (context.getInputSplit().getPath().contains("logfile")) { inputType = "LOGFILE"; }
        else { inputType = "DATABASE"; }
      In the Reducer, check the tag and then combine:
        if (key.getTag().equals("LOGFILE")) { LogEntry logEntry = value.getValue(); }
        else if (key.getTag().equals("DATABASE")) { UserInfo userInfo = value.getValue(); }
        context.write(userInfo.getId(), logEntry.getTime() + " " + userInfo.getName());
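      The reduce-side "tagged join" pattern the slide describes can be sketched outside Hadoop in a few lines of Python. The "LOGFILE"/"DATABASE" tags come from the slide; the helper name, record layout, and sample data are made up for illustration. A real job would spread this across Mapper and Reducer classes:

```python
from collections import defaultdict

def tagged_join(log_lines, user_lines):
    """Toy reduce-side join: the map phase tags each record with its
    source, the shuffle groups by userid, the reduce combines types."""
    # "Map" phase: emit (key, (tag, value)) pairs, one tag per input type.
    tagged = []
    for line in log_lines:                    # log files: timestamp \t userid
        ts, userid = line.split("\t")
        tagged.append((userid, ("LOGFILE", ts)))
    for line in user_lines:                   # table dump: userid \t name
        userid, name = line.split("\t")
        tagged.append((userid, ("DATABASE", name)))
    # "Shuffle": group the tagged values by key.
    groups = defaultdict(list)
    for key, value in tagged:
        groups[key].append(value)
    # "Reduce": inspect the tags to tell record types apart, then combine.
    out = []
    for userid, values in sorted(groups.items()):
        times = [v for tag, v in values if tag == "LOGFILE"]
        names = [v for tag, v in values if tag == "DATABASE"]
        for name in names:
            for ts in times:
                out.append((userid, ts + " " + name))
    return out

logs = ["2010-05-17 19:00:00\tuser1", "2010-05-17 19:05:00\tuser2"]
users = ["user1\tAlice", "user2\tBob"]
print(tagged_join(logs, users))
```

      Even this toy version needs tag bookkeeping in three places, which is the tedium the talk argues Pig removes.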
    • 7. Where's your shears? "I was working on my thesis and realized I needed a reference. I'd seen a post on comp.arch recently that cited a paper, so I fired up gnus. While I was searching for the post, I came across another post whose MIME encoding screwed up my ancient version of gnus, so I stopped and downloaded the latest version of gnus.
    • 8. Agenda 1 Why Pig? 2 Data types 3 Operators 4 UDFs 5 Using Pig
    • 9. Data Types • From largest to smallest: • Bag (relation / group) • Tuple • Field • A bag is a collection of tuples, tuples have fields
    • 10. Data Types: Bag
      $ cat logs
      101 1002 10.09
      101 8912 5.96
      102 1002 10.09
      103 8912 5.96
      103 7122 88.99
      $ cat groupbooks.pig
      logs = LOAD 'logs' AS (userid: int, bookid: long, price: double);
      bookbuys = GROUP logs BY bookid;
      DESCRIBE bookbuys;
      DUMP bookbuys;
      $ pig -x local groupbooks.pig
      bookbuys: {group: long,logs: {userid: int,bookid: long,price: double}}
      (1002L,{(101,1002L,10.09),(102,1002L,10.09)})
      (7122L,{(103,7122L,88.99)})
      (8912L,{(101,8912L,5.96),(103,8912L,5.96)})
      (Slide callouts: each output row is a tuple, the {...} is an inner bag, and the individual values are fields.)
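      What GROUP produces can be mimicked in plain Python. This is an illustrative sketch, not how Pig executes it; the five records are the rows from the slide and `group_by` is a made-up helper:

```python
from collections import defaultdict

# The five log records from the slide: (userid, bookid, price).
logs = [(101, 1002, 10.09), (101, 8912, 5.96), (102, 1002, 10.09),
        (103, 8912, 5.96), (103, 7122, 88.99)]

def group_by(records, key_index):
    """GROUP records BY the field at key_index: one output tuple per
    distinct key, carrying the bag of all matching input tuples."""
    bags = defaultdict(list)
    for rec in records:
        bags[rec[key_index]].append(rec)
    return sorted(bags.items())

# One (group, bag) pair per distinct bookid, mirroring DUMP bookbuys.
for bookid, bag in group_by(logs, 1):
    print(bookid, bag)
```

      Note that the key appears once as the group and again inside each tuple of the bag, exactly as in the Pig output above.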
    • 11. Data Types: Tuple and Fields
      $ cat booksexpensive.pig
      logs = LOAD 'logs' AS (userid: int, bookid: long, price: double);
      bookbuys = GROUP logs BY bookid;
      expensive = FOREACH bookbuys {
        inside = FILTER logs BY price > 6.0;
        GENERATE inside;
      }
      DESCRIBE expensive;
      DUMP expensive;
      $ pig -x local booksexpensive.pig
      expensive: {inside: {userid: int,bookid: long,price: double}}
      ({(101,1002L,10.09),(102,1002L,10.09)})
      ({(103,7122L,88.99)})
      ({})
      ("inside" refers to the inner bag. Note: you can always refer to fields as $0, $1, etc.)
    • 12. Agenda 1 Why Pig? 2 Data types 3 Operators 4 UDFs 5 Using Pig
    • 13. Operator: LOAD
      This loads all files under the logs/2010/05 directory (or the logs/2010/05 file) and puts them into clicklogs:
        clicklogs = LOAD 'logs/2010/05';
      This names the fields in the tuple "userid" and "url", instead of having to refer to them as $0 and $1:
        clicklogs = LOAD 'logs/2010/05' AS (userid: int, url: chararray);
      Note: no actual loading occurs until a DUMP or STORE is executed.
    • 14. Operator: LOAD
      By default LOAD splits on the tab character (the same as the key/value separator in MapReduce jobs). You can also specify your own delimiter:
        LOAD 'logs' USING PigStorage('~')
      PigStorage implements LoadFunc; implement this interface to create your own loader, e.g. "RegExLoader" from the Piggybank.
    • 15. Operators: DESCRIBE, DUMP, and STORE
      DESCRIBE prints out a variable's schema:
        DESCRIBE combotimes;
        combotimes: {group: chararray, enter: {time: chararray,userid: chararray}, exit: {time: chararray,userid: chararray,cost: double}}
      To see output on the screen, type "DUMP varname":
        DUMP namesandaddresses;
      To output to a file or directory, use STORE:
        STORE patienttrials INTO 'trials/2010';
    • 16. Operator: GROUP
      $ cat starnames
      1 Mintaka
      2 Alnitak
      3 Epsilon Orionis
      $ cat starpositions
      1 R.A. 05h 32m 0.4s, Dec. -00 17' 57"
      2 R.A. 05h 40m 45.5s, Dec. -01 56' 34"
      3 R.A. 05h 36m 12.8s, Dec. -01 12' 07"
      $ cat starsandpositions.pig
      names = LOAD 'starnames' as (id: int, name: chararray);
      positions = LOAD 'starpositions' as (id: int, position: chararray);
      nameandpos = GROUP names BY id, positions BY id;
      DESCRIBE nameandpos;
      DUMP nameandpos;
      nameandpos: {group: int,names: {id: int,name: chararray}, positions: {id: int,position: chararray}}
      (1,{(1,Mintaka)},{(1,R.A. 05h 32m 0.4s, Dec. -00 17' 57")})
      (2,{(2,Alnitak)},{(2,R.A. 05h 40m 45.5s, Dec. -01 56' 34")})
      (3,{(3,Epsilon Orionis)},{(3,R.A. 05h 36m 12.8s, Dec. -01 12' 07")})
    • 17. Operator: JOIN
      Just like GROUP but flatter.
      $ cat starsandpositions2.pig
      names = LOAD 'starnames' as (id: int, name: chararray);
      positions = LOAD 'starpositions' as (id: int, position: chararray);
      nameandpos = JOIN names BY id, positions BY id;
      DESCRIBE nameandpos;
      DUMP nameandpos;
      nameandpos: {names::id: int,names::name: chararray, positions::id: int,positions::position: chararray}
      (1,Mintaka,1,R.A. 05h 32m 0.4s, Dec. -00 17' 57")
      (2,Alnitak,2,R.A. 05h 40m 45.5s, Dec. -01 56' 34")
      (3,Epsilon Orionis,3,R.A. 05h 36m 12.8s, Dec. -01 12' 07")
    • 18. Operator: FLATTEN
      Ugly-looking output from before:
        expensive: {inside: {userid: int,bookid: long,price: double}}
        ({(101,1002L,10.09),(102,1002L,10.09)})
        ({(103,7122L,88.99)})
      Use the FLATTEN operator:
        expensive = FOREACH bookbuys {
          inside = FILTER logs BY price > 6.0;
          GENERATE group, FLATTEN(inside);
        }
        expensive: {group: long,inside::userid: int,inside::bookid: long,inside::price: double}
        (1002L,101,1002L,10.09)
        (1002L,102,1002L,10.09)
        (7122L,103,7122L,88.99)
    • 19. Operator: Renaming in FOREACH
      All columns have cumbersome names:
        expensive: {group: long,inside::userid: int,inside::bookid: long,inside::price: double}
      Pick and rename:
        expensive = FOREACH bookbuys {
          inside = FILTER logs BY price > 6.0;
          GENERATE group AS userid, FLATTEN(inside.(bookid, price)) AS (bookid, price);
        }
      Kept the type! Now easy to use:
        expensive: {userid: long,bookid: long,price: double}
        (1002L,1002L,10.09)
        (1002L,1002L,10.09)
        (7122L,7122L,88.99)
    • 20. Operator: SPLIT
      When an input file mixes types or needs separation:
      $ cat enterexittimes
      2010-05-10 12:55:12 user123 enter
      2010-05-10 13:14:23 user456 enter
      2010-05-10 13:16:53 user123 exit 23.79
      2010-05-10 13:17:49 user456 exit 0.50
      inandout = LOAD 'enterexittimes';
      SPLIT inandout INTO enter1 IF $2 == 'enter', exit1 IF $2 == 'exit';
      enter1:
        (2010-05-10 12:55:12,user123,enter)
        (2010-05-10 13:14:23,user456,enter)
      exit1:
        (2010-05-10 13:16:53,user123,exit,23.79)
        (2010-05-10 13:17:49,user456,exit,0.50)
    • 21. Operator: SPLIT
      If every line had the same schema it could be specified at load time; in this case we need a FOREACH per branch:
        enter = FOREACH enter1 GENERATE (chararray)$0 AS time:chararray, (chararray)$1 AS userid:chararray;
        exit = FOREACH exit1 GENERATE (chararray)$0 AS time:chararray, (chararray)$1 AS userid:chararray, (double)$3 AS cost:double;
        DESCRIBE enter;
        DESCRIBE exit;
        enter: {time: chararray,userid: chararray}
        exit: {time: chararray,userid: chararray,cost: double}
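      Assuming tab-delimited fields, the SPLIT-then-FOREACH pair behaves like this short Python sketch (the variable names such as `exit_` are illustrative; the four records are the slide's sample data):

```python
# The four enter/exit records from the slide, tab-delimited fields.
lines = [
    "2010-05-10 12:55:12\tuser123\tenter",
    "2010-05-10 13:14:23\tuser456\tenter",
    "2010-05-10 13:16:53\tuser123\texit\t23.79",
    "2010-05-10 13:17:49\tuser456\texit\t0.50",
]

rows = [line.split("\t") for line in lines]
# SPLIT inandout INTO enter1 IF $2 == 'enter', exit1 IF $2 == 'exit';
enter1 = [r for r in rows if r[2] == "enter"]
exit1 = [r for r in rows if r[2] == "exit"]
# The per-branch FOREACH then projects and casts just the fields it needs.
enter = [(r[0], r[1]) for r in enter1]              # (time, userid)
exit_ = [(r[0], r[1], float(r[3])) for r in exit1]  # (time, userid, cost)
print(enter)
print(exit_)
```

      The key point carries over to Pig: until the FOREACH imposes a schema, the split branches are just untyped positional fields.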
    • 22. Operators: SAMPLE, LIMIT
      For testing purposes, sample both large inputs:
        names1 = LOAD 'starnames' as (id: int, name: chararray);
        names = SAMPLE names1 0.3;
        positions1 = LOAD 'starpositions' as (id: int, position: chararray);
        positions = SAMPLE positions1 0.3;
      Running this returns random rows every time:
        (1,Mintaka,1,R.A. 05h 32m 0.4s, Dec. -00 17' 57")
      LIMIT only returns the first N results. Use it with ORDER BY to return the top results:
        nameandpos1 = JOIN names BY id, positions BY id;
        nameandpos2 = ORDER nameandpos1 BY names::id DESC;
        nameandpos = LIMIT nameandpos2 2;
        (3,Epsilon Orionis,3,R.A. 05h 36m 12.8s, Dec. -01 12' 07")
        (2,Alnitak,2,R.A. 05h 40m 45.5s, Dec. -01 56' 34")
    • 23. Agenda 1 Why Pig? 2 Data types 3 Operators 4 UDFs 5 Using Pig
    • 24. UDF
      UDF: User Defined Function. Operates on single values or a group.
      Simple example: IsEmpty (a FilterFunc):
        users = JOIN names BY id, addresses BY id;
        D = FOREACH users GENERATE group, FLATTEN((IsEmpty(names::firstName) ? 'none' : names::firstName))
      Working over an aggregate, e.g. COUNT:
        users = JOIN names BY id, books BY buyerId;
        D = FOREACH users GENERATE group, COUNT(books)
      Working on two values:
        distance1 = CROSS stars, stars;
        distance =
    • 25. Agenda 1 Why Pig? 2 Data types 3 Operators 4 UDFs 5 Using Pig
    • 26. LOAD and GROUP
      logfiles = LOAD 'logs' AS (userid: int, bookid: long, price: double);
      userinfo = LOAD 'users' AS (userid: int, name: chararray);
      userpurchases = GROUP logfiles BY userid, userinfo BY userid;
      DESCRIBE userpurchases;
      DUMP userpurchases;
    • 27. Inside {} are bags (unordered); inside () are tuples (ordered lists of fields).
      report = FOREACH userpurchases GENERATE
        FLATTEN(userinfo.name) AS name,
        group AS userid,
        FLATTEN(SUM(logfiles.price)) AS cost;
      bybigspender = ORDER report BY cost DESC;
      DUMP bybigspender;
      (Bob,103,94.94999999999999)
      (Joe,101,16.05)
      (Cindy,102,10.09)
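      As a sanity check, the same aggregation in plain Python. This is an illustrative sketch, not how Pig executes the script; the id-to-name mapping is read off the slide's output:

```python
from collections import defaultdict

# Log records (userid, bookid, price) and the user table from the demo.
logs = [(101, 1002, 10.09), (101, 8912, 5.96), (102, 1002, 10.09),
        (103, 8912, 5.96), (103, 7122, 88.99)]
names = {101: "Joe", 102: "Cindy", 103: "Bob"}

# SUM(logfiles.price) per userid group.
totals = defaultdict(float)
for userid, _bookid, price in logs:
    totals[userid] += price

# ORDER report BY cost DESC.
report = sorted(((names[u], u, c) for u, c in totals.items()),
                key=lambda row: -row[2])
for row in report:
    print(row)
```

      The 94.94999999999999 in the slide's output is ordinary double-precision rounding from summing 5.96 and 88.99, not a Pig bug; the Python version shows the same effect.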
    • 28. Entering and exiting recorded in same file: 2010-05-10 12:55:12 user123 enter 2010-05-10 13:14:23 user456 enter 2010-05-10 13:16:53 user123 exit 23.79 2010-05-10 13:17:49 user456 exit 0.50
    • 29. inandout = LOAD 'enterexittimes';
      SPLIT inandout INTO enter1 IF $2 == 'enter', exit1 IF $2 == 'exit';
      enter = FOREACH enter1 GENERATE (chararray)$0 AS time:chararray, (chararray)$1 AS userid:chararray;
      exit = FOREACH exit1 GENERATE (chararray)$0 AS time:chararray, (chararray)$1 AS userid:chararray, (double)$3 AS cost:double;
    • 30. combotimes = GROUP enter BY $1, exit BY $1;
      purchases = FOREACH combotimes GENERATE
        group AS userid,
        FLATTEN(enter.$0) AS entertime,
        FLATTEN(exit.$0) AS exittime,
        FLATTEN(exit.$2);
      DUMP purchases;
    • 31. The schemas for inandout, enter1, and exit1 are unknown.
      enter: {time: chararray,userid: chararray}
      exit: {time: chararray,userid: chararray,cost: double}
      combotimes: {group: chararray, enter: {time: chararray,userid: chararray}, exit: {time: chararray,userid: chararray,cost: double}}
      purchases: {userid: chararray,entertime: chararray, exittime: chararray,cost: double}
    • 32. UDFs • User Defined Function • For doing an operation on data • Already use several builtins: • COUNT • SUM
