Practical pig


Practical Pig: Preventing Perilous Pitfalls for Prestige & Profit

Published in: Technology

  1. Practical Pig: Preventing Perilous Programming Pitfalls for Prestige & Profit
     Jameson Lopp, Software Engineer, Bronto Software, Inc
     March 20, 2012
  2. Why Pig?
     ● High level language
     ● Small learning curve
     ● Increases productivity
     ● Insulates you from complexity of MapReduce
       ○ Job configuration tuning
       ○ Mapper / Reducer optimization
       ○ Data re-use
       ○ Job Chains
  3. Simple MapReduce Example
     Input: User profiles, page visits
     Output: the top 5 most visited pages by users aged 18-25
  4. In Native Hadoop Code
  5. In Pig
     users = LOAD 'users' AS (name, age);
     users = FILTER users BY age >= 18 AND age <= 25;
     pages = LOAD 'pages' AS (user, url);
     joined = JOIN users BY name, pages BY user;
     grouped = GROUP joined BY url;
     summed = FOREACH grouped GENERATE group, COUNT(joined) AS clicks;
     sorted = ORDER summed BY clicks DESC;
     top5 = LIMIT sorted 5;
     STORE top5 INTO '/data/top5sites';
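To make the dataflow in the Pig script concrete, here is a minimal plain-Python sketch of the same steps on in-memory data. It is a model of the logic only, not Pig and not how Pig executes it; the sample rows and the `top_pages` helper are hypothetical, and it assumes user names are unique so the join can be reduced to a membership check.

```python
from collections import Counter

# Hypothetical sample data standing in for the 'users' and 'pages' inputs.
users = [("alice", 22), ("bob", 31), ("carol", 19), ("dave", 17)]
pages = [
    ("alice", "/home"), ("alice", "/pricing"), ("bob", "/home"),
    ("carol", "/home"), ("carol", "/docs"), ("dave", "/docs"),
]


def top_pages(users, pages, low=18, high=25, limit=5):
    """Mirror the FILTER, JOIN, GROUP, COUNT, ORDER and LIMIT steps."""
    # FILTER users BY age >= 18 AND age <= 25
    eligible = {name for name, age in users if low <= age <= high}
    # JOIN users BY name, pages BY user, then GROUP BY url and COUNT.
    # Assumes names are unique, so the join is a membership test.
    clicks = Counter(url for user, url in pages if user in eligible)
    # ORDER BY clicks DESC, then LIMIT
    return clicks.most_common(limit)


print(top_pages(users, pages))
```

Each Python line stands in for one relation in the script, which is roughly the appeal of Pig: the script reads as a sequence of named transformations rather than hand-written mapper and reducer classes.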
  6. Comparisons
     Significantly fewer lines of code
     Considerably less development time
     Reasonably close to optimal performance
  7. Under the Hood
     Automagic!
  8. Getting Up and Running
     1) Build from source via repository checkout or download a package from:
     2) Make sure your class paths are set
          export JAVA_HOME=/usr/java/default
          export HBASE_HOME=/usr/lib/hbase
          export PIG_HOME=/usr/lib/pig
          export HADOOP_HOME=/usr/lib/hadoop
          export PATH=$PIG_HOME/bin:$PATH
     3) Run Grunt or execute a Pig Latin script
          $ pig -x local
          ... - Connecting to ...
          grunt>
        OR
          $ pig -x mapreduce wordCount.pig
  9. Pig Latin Basics
     Pig Latin statements allow you to transform relations.
     ● A relation is a bag.
     ● A bag is a collection of tuples.
     ● A tuple is an ordered set of fields.
     ● A field is a piece of data (int / long / float / double / chararray / bytearray)
     Relations are referred to by name. Names are assigned by you as part of the Pig Latin statement.
     Fields are referred to by positional notation or by name if you assign one.
       A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
       X = FOREACH A GENERATE name, $2;
       DUMP X;
       (John,4.0F)
       (Mary,3.8F)
       (Bill,3.9F)
       (Joe,3.8F)
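The relation / bag / tuple / field vocabulary above can be modelled in plain Python as a list of tuples. This is only an analogy, not Pig's actual data model implementation; the ages in the sample rows and the `project` helper are hypothetical. It shows why `$2` picks out gpa: positions are zero-based, so `$0` is name, `$1` is age and `$2` is gpa.

```python
# Hypothetical rows standing in for the 'student' file. The schema is
# (name:chararray, age:int, gpa:float), so gpa sits at position $2.
FIELDS = ("name", "age", "gpa")

A = [
    ("John", 18, 4.0),
    ("Mary", 19, 3.8),
    ("Bill", 20, 3.9),
    ("Joe", 18, 3.8),
]


def project(relation, *refs):
    """Project fields by name (str) or by zero-based position (int)."""
    positions = [FIELDS.index(r) if isinstance(r, str) else r for r in refs]
    return [tuple(row[p] for p in positions) for row in relation]


# X = FOREACH A GENERATE name, $2;
X = project(A, "name", 2)
for row in X:
    print(row)
```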
  10. Pig Crash Course for SQL Users
      SQL:       SELECT * FROM users;
      Pig Latin: users = LOAD '/hdfs/users' USING PigStorage('\t') AS (name:chararray, age:int, weight:int);
      SQL:       SELECT * FROM users WHERE weight < 150;
      Pig Latin: skinnyUsers = FILTER users BY weight < 150;
      SQL:       SELECT name, age FROM users WHERE weight < 150;
      Pig Latin: skinnyUserNames = FOREACH skinnyUsers GENERATE name, age;
  11. Pig Crash Course for SQL Users
      SQL:       SELECT name, SUM(orderAmount) FROM orders GROUP BY name...
      Pig Latin: A = GROUP orders BY name;
                 B = FOREACH A GENERATE $0 AS name, SUM($1.orderAmount) AS orderTotal;
      SQL:       ...HAVING SUM(orderAmount) > 500...
      Pig Latin: C = FILTER B BY orderTotal > 500;
      SQL:       ...ORDER BY name ASC;
      Pig Latin: D = ORDER C BY name ASC;
      SQL:       SELECT DISTINCT name FROM users;
      Pig Latin: names = FOREACH users GENERATE name;
                 uniqueNames = DISTINCT names;
      SQL:       SELECT name, COUNT(DISTINCT age) FROM users GROUP BY name;
      Pig Latin: usersByName = GROUP users BY name;
                 numAgesByName = FOREACH usersByName {
                     ages = DISTINCT users.age;
                     GENERATE FLATTEN(group), COUNT(ages);
                 }
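The GROUP examples are the least SQL-like part of the mapping, because GROUP produces one row per key holding a bag of the original tuples, and aggregation happens in a later FOREACH. A rough plain-Python sketch of that two-step shape follows; the sample rows and the `group_by_first` helper are hypothetical, and this models the semantics only.

```python
from collections import defaultdict

# Hypothetical sample rows: (name, orderAmount) and (name, age).
orders = [("ann", 300), ("ann", 250), ("bob", 100), ("cat", 600)]
users = [("ann", 30), ("ann", 30), ("ann", 31), ("bob", 25)]


def group_by_first(rows):
    """GROUP rows BY their first field: key -> bag of the full tuples."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[0]].append(row)
    return groups


# B = FOREACH A GENERATE $0 AS name, SUM($1.orderAmount) AS orderTotal;
totals = {name: sum(amount for _, amount in bag)
          for name, bag in group_by_first(orders).items()}

# C = FILTER B BY orderTotal > 500;  D = ORDER C BY name ASC;
big_spenders = sorted((n, t) for n, t in totals.items() if t > 500)

# Nested block: ages = DISTINCT users.age; GENERATE group, COUNT(ages);
distinct_ages = {name: len({age for _, age in bag})
                 for name, bag in group_by_first(users).items()}

print(big_spenders)
print(distinct_ages)
```

The HAVING clause becomes an ordinary filter on the aggregated relation, which is why Pig needs no separate keyword for it.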
  12. Real World Pig Script
      "Aggregate yesterday's API web server logs by client and function call."
      logs = LOAD '/hdfs/logs/$date/api.log' USING PigStorage('\t') AS (type, date, ipAddress, sessionId, clientId, apiMethod);
      methods = FILTER logs BY type == 'INFO';
      methods = FOREACH methods GENERATE type, date, clientId, class, method;
      methods = GROUP methods BY (clientId, class, method);
      methodStats = FOREACH methods GENERATE $0.clientId, $0.class, $0.method, COUNT($1) AS methodCount;
      STORE methodStats INTO '/stats/$date/api/apiUsageByClient';
  13. Pig Job Performance
      "Find the most commonly used desktop browser, mobile browser, operating system, email client, and geographic location for every contact."
      ● 150 line Pig Latin script
      ● Runs daily on 6 node computation cluster
      ● Processes ~1B rows of raw tracking data in 40 minutes, doing multiple groups and joins via 16 chained MapReduce jobs with 2100 mappers
      ● Output: ~40M rows of contact attributes
  14. Pig Job Performance
      ● Reads input tracking data from sequence files on HDFS
          logs = LOAD '/rawdata/track/{$dates}/part-*' USING SequenceFileLoader;
          logs = FOREACH logs GENERATE $0, STRSPLIT($1, '\t');
      ● Filters out all tracking actions other than email opens
          rawOpens = FILTER logs BY $1.$2 == 'open' AND $1.$15 IS NOT NULL AND
              ($1.$17 IS NOT NULL OR $1.$18 IS NOT NULL OR $1.$19 IS NOT NULL OR $1.$20 IS NOT NULL);
      ● Strips down each row to required data (memory usage optimization)
          allBrowsers = FOREACH rawOpens GENERATE
              (chararray)$1.$15 AS subscriberId,
              (chararray)$1.$17 AS ipAddress,
              (chararray)$1.$18 AS userAgent,
              (chararray)$1.$19 AS httpReferer,
              (chararray)$1.$20 AS browser,
              (chararray)$1.$21 AS os;
      ● Separates mobile browser data from desktop browser data
          SPLIT allBrowsers INTO
              mobile IF (browser == 'iPhone' OR browser == 'Android'),
              desktop IF (browser != 'iPhone' AND browser != 'Android');
  15. Pig Job Performance
      OMGWTFBBQ
      -- the last column is a concatenated index we will use to diff between daily runs of this script
      storeResults = FOREACH joinedResults {
          GENERATE
              joinedResults::compactResults::subscriberId AS subscriberId,
              joinedResults::subscriberModeIP::ipCountBySubscriber::ipAddress AS ipAddress,
              joinedResults::compactResults::primaryBrowser AS primaryBrowser,
              joinedResults::compactResults::primaryUserAgent AS primaryUserAgent,
              joinedResults::compactResults::primaryHttpReferer AS primaryHttpReferer,
              joinedResults::compactResults::mobileBrowser AS mobileBrowser,
              joinedResults::compactResults::mobileUserAgent AS mobileUserAgent,
              joinedResults::compactResults::mobileHttpReferer AS mobileHttpReferer,
              subscriberModeOS::osCountBySubscriber::os AS os,
              CONCAT(CONCAT(CONCAT(joinedResults::compactResults::subscriberId,
                  (joinedResults::subscriberModeIP::ipCountBySubscriber::ipAddress IS NULL ? '' : joinedResults::subscriberModeIP::ipCountBySubscriber::ipAddress)),
                  CONCAT((joinedResults::compactResults::primaryBrowser IS NULL ? '' : joinedResults::compactResults::primaryBrowser),
                      (joinedResults::compactResults::mobileBrowser IS NULL ? '' : joinedResults::compactResults::mobileBrowser))),
                  (subscriberModeOS::osCountBySubscriber::os IS NULL ? '' : subscriberModeOS::osCountBySubscriber::os)) AS key;
      }
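Stripped of the fully qualified aliases, the key column above is a concatenation of five fields in which null optional fields are replaced by an empty string. The IS NULL guards matter because Pig's CONCAT yields null when either operand is null, so one missing field would otherwise null out the whole key. A minimal plain-Python sketch of the same idea follows; `diff_key` and its parameter names are hypothetical, and it assumes the subscriber id is never null, as the slide's expression does.

```python
def diff_key(subscriber_id, ip=None, primary_browser=None,
             mobile_browser=None, os_name=None):
    """Build the diff key in the slide's field order.

    Optional fields that are None contribute an empty string, mirroring
    the IS NULL guards. subscriber_id is assumed to be non-null.
    """
    optional = [ip, primary_browser, mobile_browser, os_name]
    return subscriber_id + "".join("" if part is None else part
                                   for part in optional)


print(diff_key("42", "10.0.0.1", None, "iPhone", None))
```

Because there is no separator between fields, two different rows could in principle produce the same key; that is a property of the original expression as well, not something this sketch adds.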
  16. Pig Job Performance
  17. User Defined Functions
      Allow you to perform more complex operations upon fields.
      Written in Java, compiled into a jar, loaded into your Pig script at runtime.
      package myudfs;

      import java.io.IOException;
      import org.apache.pig.EvalFunc;
      import org.apache.pig.data.Tuple;
      import org.apache.pig.impl.util.WrappedIOException;

      public class UPPER extends EvalFunc<String> {
          public String exec(Tuple input) throws IOException {
              // Nothing to convert for a missing or empty tuple.
              if (input == null || input.size() == 0)
                  return null;
              try {
                  String str = (String) input.get(0);
                  return str.toUpperCase();
              } catch (Exception e) {
                  throw WrappedIOException.wrap("Caught exception processing input row ", e);
              }
          }
      }
  18. User Defined Functions
      Making use of your UDF in a Pig script:
      REGISTER myudfs.jar;
      students = LOAD 'student_data' AS (name: chararray, age: int, gpa: float);
      upperNames = FOREACH students GENERATE myudfs.UPPER(name);
      DUMP upperNames;
  19. UDF Pitfalls
      UDFs are limited; can only operate on fields, not on groups of fields. A given UDF can only return a single data type (integer / float / chararray / etc).
      To build a jar file that contains all available UDFs, follow these steps:
      ● Check out UDF code: svn co
      ● Add pig.jar to your ClassPath: export CLASSPATH=$CLASSPATH:/path/to/pig.jar
      ● Build the jar file: cd trunk/contrib/piggybank/java; run "ant". This will generate piggybank.jar in the same directory.
      You must build piggybank in order to read UDF documentation: run "ant javadoc" from directory trunk/contrib/piggybank/java. The documentation is generated in directory trunk/contrib/piggybank/java/build/javadoc.
      How to compile a custom UDF isn't obvious. After writing your UDF, you must place your Java code in an appropriate directory inside a checkout of the piggybank code and build the piggybank jar with ant.
  20. Common Pig Pitfalls
      Trying to match pig version with hadoop / hbase versions. There is very little documentation on what is compatible with what.
      A few snippets from the mailing list:
      “Are you using Pig 8 distribution or Pig 8 from svn? You want the latter (soon-to-be-Pig 0.8.1)”
      “Please upgrade your pig version to the latest in the 0.8 branch. The 0.8 release is not compatible with 0.20+ versions of hbase; we bumped up the support in 0.8.1, which is nearing release. Cloudera's latest CDH3 GA might have these patches (it was just released today) but CDH3B4 didn't.”
  21. Common Pig Pitfalls
      Bugs in older versions of pig requiring you to register jars. Indicated by MapReduce job failure due to java.lang.ClassNotFoundException:
      I finally resolved the problem by manually registering jars:
        REGISTER /path/to/pig_0.8/lib/google-collections-1.0.jar;
        REGISTER /path/to/pig_0.8/lib/hbase-0.20.3-1.cloudera.jar;
        REGISTER /path/to/pig_0.8/lib/zookeeper-hbase-1329.jar;
      From the mailing list: “If you are using Hbase 0.91 and Pig 0.8.1, the hbaseStorage code in Pig is supposed to auto-register the hbase, zookeeper, and google-collections jars, so you won't have to do that.”
      No more registering jars, though they do need to be on your classpath.
  22. Obscure Pig Pitfalls
      HBaseLoader bug requiring disabling input splits. Pig versions prior to 0.8.1 will only load a single HBase region unless you disable input splits.
      Fix via:
        SET pig.splitCombination false;
  23. Obscure Pig Pitfalls
      visitors = LOAD 'hbase://tracking' USING HBaseStorage('open:browser open:ip open:os open:createdDate')
          AS (browser:chararray, ipAddress:chararray, os:chararray, createdDate:chararray);
      Resulted in:
        java.lang.RuntimeException: Failed to create DataStorage
          at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(
        Caused by: Call to hadoopMaster failed on
  24. Recommendations
      Join the pig-user mailing list: user@pig.apache.org
      Use (the latest) complete Cloudera distribution to avoid version compatibility issues.
      Learn the quick & dirty rules for optimizing performance.
      Use the “set” command to tune your MapReduce jobs.
      Test & re-test. Walk through your pig script in the Grunt shell and use DUMP / DESCRIBE / EXPLAIN / ILLUSTRATE on your variables / operations. Once you're happy with how the script looks on paper, run it on your cluster and examine for places you can tweak the Map/Reduce job config.
  25. Recommendations
      Variable input requires passing arguments from an external wrapper script; we use groovy scripts to kick start pig jobs.
        def day = new Date()
        def dateString = (2..31).collect{day.minus(it).format("yyyy-MM-dd")}.join(",")
        def pig = "/usr/bin/pig -l /dev/null -param dates=${dateString} /path/to/pig/job.pig".execute()
      Remember to filter out null data or you'll have wonky results when grouping by that field.
      Tell pig to parallelize reducers; tune for your cluster.
        ○ SET default_parallel 30;
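For readers without Groovy at hand, the wrapper idea can be sketched in Python: build the comma-separated list of dates from 2 to 31 days back and pass it to the script as the `dates` parameter. This is an illustrative equivalent, not the authors' wrapper; the `date_param` and `pig_command` helper names are hypothetical, while the paths and flags are copied from the Groovy snippet.

```python
from datetime import date, timedelta


def date_param(today, first=2, last=31):
    """Comma-separated yyyy-MM-dd dates from `first` to `last` days back."""
    days = (today - timedelta(days=n) for n in range(first, last + 1))
    return ",".join(d.strftime("%Y-%m-%d") for d in days)


def pig_command(today, script="/path/to/pig/job.pig"):
    """Argument list matching the pig invocation in the Groovy snippet."""
    return ["/usr/bin/pig", "-l", "/dev/null",
            "-param", "dates=" + date_param(today), script]


print(pig_command(date(2012, 3, 20)))
```

Passing the result of `pig_command(date.today())` to `subprocess.run` would launch the job. Building the command as a list keeps each argument separate, so the script path is always its own argument and cannot run into the `-param` value.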
  26. Recommendations
      Increase acceptable mapper failure rate (tweak for your cluster size)
        SET mapred.reduce.max.attempts 10;
        SET mapred.max.tracker.failures 10;
        SET 20;
  27. That's All, Folks!
  28. Credits
      Example code & charts from "Practical Problem Solving with Hadoop and Pig" by Milind Bhandarkar
      Log aggregation script by Jeff Turner
      "Nerdy Pig" cartoon
      "Pig with Goggles" photo
      "Cinderella" photo
      "Racing Piglets"
      "Flying Pig" cartoon
      "Fault Tolerance" comic by John Muellerleile (@jrecursive)
      "Pug Pig" photo
      "Angry Birds Pig"
      "Oh Bother" cartoon
      "Trojan Pig" cartoon
      "Drunk Man Rides Pig"
      "Redundancy"
      "That's All, Folks" cartoon