Practical pig

Practical Pig: Preventing Perilous Pitfalls for Prestige & Profit



Usage Rights

© All Rights Reserved

    Practical Pig: Presentation Transcript

    • Practical Pig: Preventing Perilous Programming Pitfalls for Prestige & Profit
      Jameson Lopp, Software Engineer, Bronto Software, Inc.
      March 20, 2012
    • Why Pig?
      ● High-level language
      ● Small learning curve
      ● Increases productivity
      ● Insulates you from the complexity of MapReduce
        ○ Job configuration tuning
        ○ Mapper / Reducer optimization
        ○ Data re-use
        ○ Job chains
    • Simple MapReduce Example
      Input: user profiles, page visits
      Output: the top 5 most visited pages by users aged 18-25
    • In Native Hadoop Code
    • In Pig
        users = LOAD 'users' AS (name, age);
        users = FILTER users BY age >= 18 AND age <= 25;
        pages = LOAD 'pages' AS (user, url);
        joined = JOIN users BY name, pages BY user;
        grouped = GROUP joined BY url;
        summed = FOREACH grouped GENERATE group, COUNT(joined) AS clicks;
        sorted = ORDER summed BY clicks DESC;
        top5 = LIMIT sorted 5;
        STORE top5 INTO '/data/top5sites';
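To make the data flow of the script above concrete, here is a minimal in-memory Python sketch of the same filter / join / group / count / order / limit steps. It is only an illustration of what the script computes, not how Pig executes it; the function name `top_pages` and the assumption that user names are unique are mine, not the slides'.

```python
from collections import Counter

def top_pages(users, pages, n=5, min_age=18, max_age=25):
    """Return the n most-visited URLs among users aged min_age..max_age.

    users: iterable of (name, age); pages: iterable of (user, url).
    Assumes each user name appears once in `users` (hypothetical simplification).
    """
    # FILTER users BY age >= 18 AND age <= 25
    eligible = {name for name, age in users if min_age <= age <= max_age}
    # JOIN users BY name, pages BY user; GROUP BY url; COUNT
    clicks = Counter(url for user, url in pages if user in eligible)
    # ORDER BY clicks DESC; LIMIT n
    return clicks.most_common(n)
```

Calling `top_pages(users, pages)` on two small lists returns `(url, clicks)` pairs, which is roughly the shape of the `top5` relation the script stores.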
    • Comparisons
      Significantly fewer lines of code
      Considerably less development time
      Reasonably close to optimal performance
    • Under the Hood Automagic!
    • Getting Up and Running
      1) Build from source via repository checkout or download a package from:
      2) Make sure your class paths are set
           export JAVA_HOME=/usr/java/default
           export HBASE_HOME=/usr/lib/hbase
           export PIG_HOME=/usr/lib/pig
           export HADOOP_HOME=/usr/lib/hadoop
           export PATH=$PIG_HOME/bin:$PATH
      3) Run Grunt or execute a Pig Latin script
           $ pig -x local
           ... - Connecting to ...
           grunt>
         OR
           $ pig -x mapreduce wordCount.pig
    • Pig Latin Basics
      Pig Latin statements allow you to transform relations.
      ● A relation is a bag.
      ● A bag is a collection of tuples.
      ● A tuple is an ordered set of fields.
      ● A field is a piece of data (int / long / float / double / chararray / bytearray)
      Relations are referred to by name. Names are assigned by you as part of the Pig Latin statement.
      Fields are referred to by positional notation or by name if you assign one.
        A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
        X = FOREACH A GENERATE name,$2;
        DUMP X;
        (John,4.0F)
        (Mary,3.8F)
        (Bill,3.9F)
        (Joe,3.8F)
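As a rough mental model of the terms above, a relation can be pictured as a list of tuples, with each field reachable either by position or by its assigned name. The Python sketch below mirrors `FOREACH A GENERATE name,$2`; the helper names `field` and `generate` and the age values are assumptions made for illustration (the slide only shows names and GPAs).

```python
# A relation is a bag; here a bag is modeled as a list of tuples.
# Ages are made-up sample values; names and GPAs follow the slide.
students = [
    ("John", 18, 4.0),
    ("Mary", 19, 3.8),
    ("Bill", 20, 3.9),
    ("Joe", 18, 3.8),
]
FIELD_NAMES = ("name", "age", "gpa")  # the AS (...) schema

def field(row, ref):
    """Resolve a field by position ('$2') or by assigned name ('gpa')."""
    if ref.startswith("$"):
        return row[int(ref[1:])]
    return row[FIELD_NAMES.index(ref)]

def generate(relation, *refs):
    """Project each tuple onto the requested fields, like FOREACH ... GENERATE."""
    return [tuple(field(row, ref) for ref in refs) for row in relation]

x = generate(students, "name", "$2")
```

Here `$2` and `gpa` refer to the same field, so `generate(students, "name", "gpa")` would give the same result as `x`.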
    • Pig Crash Course for SQL Users
      SQL:       SELECT * FROM users;
      Pig Latin: users = LOAD '/hdfs/users' USING PigStorage('\t') AS (name:chararray, age:int, weight:int);
      SQL:       SELECT * FROM users WHERE weight < 150;
      Pig Latin: skinnyUsers = FILTER users BY weight < 150;
      SQL:       SELECT name, age FROM users WHERE weight < 150;
      Pig Latin: skinnyUserNames = FOREACH skinnyUsers GENERATE name, age;
    • Pig Crash Course for SQL Users
      SQL:       SELECT name, SUM(orderAmount) FROM orders GROUP BY name...
      Pig Latin: A = GROUP orders BY name;
                 B = FOREACH A GENERATE $0 AS name, SUM($1.orderAmount) AS orderTotal;
      SQL:       ...HAVING SUM(orderAmount) > 500...
      Pig Latin: C = FILTER B BY orderTotal > 500;
      SQL:       ...ORDER BY name ASC;
      Pig Latin: D = ORDER C BY name ASC;
      SQL:       SELECT DISTINCT name FROM users;
      Pig Latin: names = FOREACH users GENERATE name;
                 uniqueNames = DISTINCT names;
      SQL:       SELECT name, COUNT(DISTINCT age) FROM users GROUP BY name;
      Pig Latin: usersByName = GROUP users BY name;
                 numAgesByName = FOREACH usersByName {
                     ages = DISTINCT users.age;
                     GENERATE FLATTEN(group), COUNT(ages);
                 }
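The GROUP / SUM / HAVING / ORDER chain above (relations A through D) can be read as four small steps over in-memory data. The Python sketch below is one way to picture it; the function name `order_totals` and the sample inputs used to try it are hypothetical.

```python
from collections import defaultdict

def order_totals(orders, threshold=500):
    """orders: iterable of (name, orderAmount) pairs."""
    # A = GROUP orders BY name
    grouped = defaultdict(list)
    for name, amount in orders:
        grouped[name].append(amount)
    # B = FOREACH A GENERATE $0 AS name, SUM($1.orderAmount) AS orderTotal
    totals = [(name, sum(amounts)) for name, amounts in grouped.items()]
    # C = FILTER B BY orderTotal > 500   (the SQL HAVING clause)
    filtered = [(name, total) for name, total in totals if total > threshold]
    # D = ORDER C BY name ASC
    return sorted(filtered, key=lambda row: row[0])
```

Note how HAVING has no special form in Pig Latin: it is just another FILTER applied after the aggregation step.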
    • Real World Pig Script
      "Aggregate yesterday's API web server logs by client and function call."
        logs = LOAD '/hdfs/logs/$date/api.log' USING PigStorage('\t')
               AS (type, date, ipAddress, sessionId, clientId, apiMethod);
        methods = FILTER logs BY type == 'INFO';
        methods = FOREACH methods GENERATE type, date, clientId, class, method;
        methods = GROUP methods BY (clientId, class, method);
        methodStats = FOREACH methods GENERATE $0.clientId, $0.class, $0.method, COUNT($1) AS methodCount;
        STORE methodStats INTO '/stats/$date/api/apiUsageByClient';
    • Pig Job Performance
      "Find the most commonly used desktop browser, mobile browser, operating system, email client, and geographic location for every contact."
      ● 150-line Pig Latin script
      ● Runs daily on a 6-node computation cluster
      ● Processes ~1B rows of raw tracking data in 40 minutes, doing multiple groups and joins via 16 chained MapReduce jobs with 2100 mappers
      ● Output: ~40M rows of contact attributes
    • Pig Job Performance
      ● Reads input tracking data from sequence files on HDFS
          logs = LOAD '/rawdata/track/{$dates}/part-*' USING SequenceFileLoader;
          logs = FOREACH logs GENERATE $0, STRSPLIT($1, '\t');
      ● Filters out all tracking actions other than email opens
          rawOpens = FILTER logs BY $1.$2 == 'open' AND $1.$15 IS NOT NULL
                     AND ($1.$17 IS NOT NULL OR $1.$18 IS NOT NULL OR $1.$19 IS NOT NULL OR $1.$20 IS NOT NULL);
      ● Strips down each row to required data (memory usage optimization)
          allBrowsers = FOREACH rawOpens GENERATE
              (chararray)$1.$15 AS subscriberId,
              (chararray)$1.$17 AS ipAddress,
              (chararray)$1.$18 AS userAgent,
              (chararray)$1.$19 AS httpReferer,
              (chararray)$1.$20 AS browser,
              (chararray)$1.$21 AS os;
      ● Separates mobile browser data from desktop browser data
          SPLIT allBrowsers INTO
              mobile IF (browser == 'iPhone' OR browser == 'Android'),
              desktop IF (browser != 'iPhone' AND browser != 'Android');
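The two SPLIT conditions above look exhaustive, but in Pig a comparison against a null value is never true, so a row whose browser is null lands in neither output. The Python sketch below imitates that behavior on made-up rows; the dict layout, the function name `split_browsers`, and the sample values are assumptions for illustration, not the job's real data.

```python
MOBILE_BROWSERS = {"iPhone", "Android"}

def split_browsers(rows):
    """Imitate SPLIT ... INTO mobile IF ..., desktop IF ... with null-aware tests."""
    mobile, desktop = [], []
    for row in rows:
        browser = row["browser"]
        if browser is None:
            # Neither condition evaluates to true for a null browser,
            # so the row is assigned to neither relation.
            continue
        if browser in MOBILE_BROWSERS:
            mobile.append(row)
        else:
            desktop.append(row)
    return mobile, desktop
```

If rows with a null browser matter for downstream counts, they would need their own branch or an explicit default value before the split.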
    • Pig Job Performance
      OMGWTFBBQ
        -- the last column is a concatenated index we will use to diff between daily runs of this script
        storeResults = FOREACH joinedResults {
            GENERATE
                joinedResults::compactResults::subscriberId AS subscriberId,
                joinedResults::subscriberModeIP::ipCountBySubscriber::ipAddress AS ipAddress,
                joinedResults::compactResults::primaryBrowser AS primaryBrowser,
                joinedResults::compactResults::primaryUserAgent AS primaryUserAgent,
                joinedResults::compactResults::primaryHttpReferer AS primaryHttpReferer,
                joinedResults::compactResults::mobileBrowser AS mobileBrowser,
                joinedResults::compactResults::mobileUserAgent AS mobileUserAgent,
                joinedResults::compactResults::mobileHttpReferer AS mobileHttpReferer,
                subscriberModeOS::osCountBySubscriber::os AS os,
                CONCAT(
                    CONCAT(
                        CONCAT(joinedResults::compactResults::subscriberId,
                               (joinedResults::subscriberModeIP::ipCountBySubscriber::ipAddress IS NULL ? '' : joinedResults::subscriberModeIP::ipCountBySubscriber::ipAddress)),
                        CONCAT((joinedResults::compactResults::primaryBrowser IS NULL ? '' : joinedResults::compactResults::primaryBrowser),
                               (joinedResults::compactResults::mobileBrowser IS NULL ? '' : joinedResults::compactResults::mobileBrowser))),
                    (subscriberModeOS::osCountBySubscriber::os IS NULL ? '' : subscriberModeOS::osCountBySubscriber::os)) AS key;
        }
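Stripped of the long relation prefixes, the nested CONCAT that builds the key column is just "join five fields, treating null as empty". A short Python sketch of that idea follows; the function name `diff_key` and the sample values are hypothetical, and like the script it assumes the subscriber id itself is never null.

```python
def diff_key(subscriber_id, ip_address, primary_browser, mobile_browser, os_name):
    """Concatenate the fields in the same order as the nested CONCAT,
    replacing missing (None) values with an empty string."""
    parts = (subscriber_id, ip_address, primary_browser, mobile_browser, os_name)
    return "".join("" if part is None else part for part in parts)
```

Two daily runs can then be compared by key alone: a contact whose key is unchanged has the same IP, browsers, and OS as the day before.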
    • Pig Job Performance
    • User Defined Functions
      Allow you to perform more complex operations upon fields.
      Written in Java, compiled into a jar, loaded into your Pig script at runtime.
        package myudfs;
        import java.io.IOException;
        import org.apache.pig.EvalFunc;
        import org.apache.pig.data.Tuple;
        import org.apache.pig.impl.util.WrappedIOException;
        public class UPPER extends EvalFunc<String> {
            public String exec(Tuple input) throws IOException {
                // An empty or missing input tuple yields a null result.
                if (input == null || input.size() == 0)
                    return null;
                try {
                    String str = (String) input.get(0);
                    return str.toUpperCase();
                } catch (Exception e) {
                    throw WrappedIOException.wrap("Caught exception processing input row ", e);
                }
            }
        }
    • User Defined Functions
      Making use of your UDF in a Pig script:
        REGISTER myudfs.jar;
        students = LOAD 'student_data' AS (name: chararray, age: int, gpa: float);
        upperNames = FOREACH students GENERATE myudfs.UPPER(name);
        DUMP upperNames;
    • UDF Pitfalls
      UDFs are limited; they can only operate on fields, not on groups of fields. A given UDF can only return a single data type (integer / float / chararray / etc).
      To build a jar file that contains all available UDFs, follow these steps:
      ● Check out the UDF code: svn co
      ● Add pig.jar to your classpath: export CLASSPATH=$CLASSPATH:/path/to/pig.jar
      ● Build the jar file: cd trunk/contrib/piggybank/java; run "ant". This will generate piggybank.jar in the same directory.
      You must build piggybank in order to read the UDF documentation: run "ant javadoc" from directory trunk/contrib/piggybank/java. The documentation is generated in directory trunk/contrib/piggybank/java/build/javadoc.
      How to compile a custom UDF isn’t obvious. After writing your UDF, you must place your Java code in an appropriate directory inside a checkout of the piggybank code and build the piggybank jar with ant.
    • Common Pig Pitfalls
      Trying to match the Pig version with Hadoop / HBase versions. There is very little documentation on what is compatible with what.
      A few snippets from the mailing list:
      “Are you using Pig 8 distribution or Pig 8 from svn? You want the latter (soon-to-be-Pig 0.8.1)”
      “Please upgrade your pig version to the latest in the 0.8 branch. The 0.8 release is not compatible with 0.20+ versions of hbase; we bumped up the support in 0.8.1, which is nearing release. Cloudera's latest CDH3 GA might have these patches (it was just released today) but CDH3B4 didn't.”
    • Common Pig Pitfalls
      Bugs in older versions of Pig requiring you to register jars. Indicated by MapReduce job failure due to java.lang.ClassNotFoundException:
      I finally resolved the problem by manually registering jars:
        REGISTER /path/to/pig_0.8/lib/google-collections-1.0.jar;
        REGISTER /path/to/pig_0.8/lib/hbase-0.20.3-1.cloudera.jar;
        REGISTER /path/to/pig_0.8/lib/zookeeper-hbase-1329.jar;
      From the mailing list: “If you are using Hbase 0.91 and Pig 0.8.1, the hbaseStorage code in Pig is supposed to auto-register the hbase, zookeeper, and google-collections jars, so you won't have to do that.”
      No more registering jars, though they do need to be on your classpath.
    • Obscure Pig Pitfalls
      HBaseLoader bug requiring disabling input splits. Pig versions prior to 0.8.1 will only load a single HBase region unless you disable input splits. Fix via:
        SET pig.splitCombination false;
    • Obscure Pig Pitfalls
        visitors = LOAD 'hbase://tracking' USING HBaseStorage('open:browser open:ip open:os open:createdDate')
                   AS (browser:chararray, ipAddress:chararray, os:chararray, createdDate:chararray);
      Resulted in:
        java.lang.RuntimeException: Failed to create DataStorage
        at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(
        Caused by: Call to hadoopMaster failed on
    • Recommendations
      Join the pig-user mailing list: user@pig.apache.org
      Use (the latest) complete Cloudera distribution to avoid version compatibility issues.
      Learn the quick & dirty rules for optimizing performance.
      Use the “set” command to tune your MapReduce jobs.
      Test & re-test. Walk through your Pig script in the Grunt shell and use DUMP / DESCRIBE / EXPLAIN / ILLUSTRATE on your variables / operations. Once you’re happy with how the script looks on paper, run it on your cluster and examine for places you can tweak the Map/Reduce job config.
    • Recommendations
      Variable input requires passing arguments from an external wrapper script; we use Groovy scripts to kick-start Pig jobs.
        def day = new Date()
        def dateString = (2..31).collect{day.minus(it).format("yyyy-MM-dd")}.join(",")
        def pig = "/usr/bin/pig -l /dev/null -param dates=${dateString} /path/to/pig/job.pig".execute()
      Remember to filter out null data or you'll have wonky results when grouping by that field.
      Tell Pig to parallelize reducers; tune for your cluster.
        ○ SET default_parallel 30;
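The wrapper does not have to be Groovy; any language that can build the `dates` parameter and launch the `pig` binary will do. Below is a minimal Python sketch of the same idea, assuming the same binary path, flags, and day range (2 to 31 days back) as the Groovy snippet above; the function names `build_dates` and `build_pig_command` are hypothetical.

```python
from datetime import date, timedelta

def build_dates(today, first=2, last=31):
    """Comma-joined yyyy-MM-dd dates from `first` to `last` days before `today`."""
    return ",".join(
        (today - timedelta(days=offset)).isoformat()
        for offset in range(first, last + 1)
    )

def build_pig_command(today, script="/path/to/pig/job.pig"):
    """Argument list for launching the job; paths mirror the Groovy example."""
    return [
        "/usr/bin/pig",
        "-l", "/dev/null",
        "-param", f"dates={build_dates(today)}",
        script,
    ]
```

The resulting list can be handed to `subprocess.run(build_pig_command(date.today()), check=True)`. Building the argument list separately from launching it keeps the date logic easy to test without a cluster.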
    • Recommendations
      Increase acceptable mapper failure rate (tweak for your cluster size)
        SET mapred.reduce.max.attempts 10;
        SET mapred.max.tracker.failures 10;
        SET 20;
    • That's All, Folks!
    • Credits
      Example code & charts from "Practical Problem Solving with Hadoop and Pig" by Milind Bhandarkar
      Log aggregation script by Jeff Turner
      "Nerdy Pig" cartoon by
      "Pig with Goggles" photo via
      "Cinderella" photo via
      "Racing Piglets" via
      "Flying Pig" cartoon via
      "Fault Tolerance" comic by John Muellerleile (@jrecursive)
      "Pug Pig" photo via
      "Angry Birds Pig" via
      "Oh Bother" cartoon via
      "Trojan Pig" cartoon
      "Drunk Man Rides Pig" via
      "Redundancy" via
      "That's All, Folks" cartoon via