Alan F. Gates, Yahoo!
Pig, Making Hadoop Easy
Who Am I?
  • Pig committer
  • Hadoop PMC Member
  • An architect in Yahoo! grid team
  • Or, as one coworker put it, “the lipstick on the Pig”
Who are you?
Motivation By Example
Suppose you have user data in one file, website data in another, and you need to find the top 5 most visited pages by users aged 18-25.
  • Load Users
  • Load Pages
  • Filter by age
  • Join on name
  • Group on url
  • Count clicks
  • Order by clicks
  • Take top 5
In Map Reduce
In Pig Latin
    Users = load 'users' as (name, age);
    Fltrd = filter Users by age >= 18 and age <= 25;
    Pages = load 'pages' as (user, url);
    Jnd = join Fltrd by name, Pages by user;
    Grpd = group Jnd by url;
    Smmd = foreach Grpd generate group, COUNT(Jnd) as clicks;
    Srtd = order Smmd by clicks desc;
    Top5 = limit Srtd 5;
    store Top5 into 'top5sites';
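To make it easy to check what the script above computes, here is a minimal in-memory sketch of the same eight steps in plain Java. It is not Pig and not MapReduce. The class, row types, and sample data are hypothetical, and it assumes each user name appears only once in the users data.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Plain-Java, in-memory sketch of the pipeline; all names here are illustrative.
final class TopPagesSketch {

    // Hypothetical row types; the script only names the fields.
    static final class User {
        final String name;
        final int age;
        User(String name, int age) { this.name = name; this.age = age; }
    }

    static final class PageView {
        final String user;
        final String url;
        PageView(String user, String url) { this.user = user; this.url = url; }
    }

    static List<String> topPages(List<User> users, List<PageView> pages, int limit) {
        // Filter by age: keep the names of users aged 18 to 25.
        // Assumption: names are unique, so a set is enough for the join.
        Set<String> fltrd = new HashSet<>();
        for (User u : users) {
            if (u.age >= 18 && u.age <= 25) {
                fltrd.add(u.name);
            }
        }
        // Join on name, group on url, count clicks.
        Map<String, Long> smmd = pages.stream()
                .filter(p -> fltrd.contains(p.user))
                .collect(Collectors.groupingBy(p -> p.url, Collectors.counting()));
        // Order by clicks descending (ties broken by url for stable output), take top N.
        return smmd.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed()
                        .thenComparing(Map.Entry.<String, Long>comparingByKey()))
                .limit(limit)
                .map(e -> e.getKey() + "\t" + e.getValue())
                .collect(Collectors.toList());
    }

    // Made-up sample data, only to exercise the steps.
    static List<String> demo() {
        List<User> users = Arrays.asList(
                new User("amy", 20), new User("bob", 40), new User("cat", 25));
        List<PageView> pages = Arrays.asList(
                new PageView("amy", "a.com"), new PageView("cat", "a.com"),
                new PageView("amy", "b.com"), new PageView("bob", "b.com"),
                new PageView("bob", "c.com"));
        return topPages(users, pages, 5);
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

Everything here fits in memory on one machine. The Pig Latin script expresses the same logic while leaving the distributed execution to Pig.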
Performance
[Chart not reproduced; labeled 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
Why not SQL?
  • Data Collection
  • Data Factory (Pig): Pipelines, Iterative Processing, Research
  • Data Warehouse (Hive): BI Tools, Analysis
Pig Highlights
  • User defined functions (UDFs) can be written for column transformation (TOUPPER), or aggregation (SUM)
  • UDFs can be written to take advantage of the combiner
  • Four join implementations built in: hash, fragment-replicate, merge, skewed
  • Multi-query: Pig will combine certain types of operations together in a single pipeline to reduce the number of times data is scanned
  • Order by provides total ordering across reducers in a balanced way
  • Writing load and store functions is easy once an InputFormat and OutputFormat exist
  • Piggybank, a collection of user-contributed UDFs
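An aggregation can take advantage of the combiner when it can be split into stages that each work on partial results. The sketch below shows that idea for a count, in plain Java. The class and method names are hypothetical and no Pig API is used; it only illustrates why merging partial counts early does not change the answer.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java sketch of a count split into combiner-friendly stages.
// All names are illustrative; this is not Pig's UDF interface.
final class AlgebraicCountSketch {

    // Initial stage (map side): turn one chunk of raw records into a partial count.
    static long initial(List<String> chunk) {
        return chunk.size();
    }

    // Intermediate stage (combiner): merge partial counts into one partial count.
    // Correct whether it runs zero, one, or many times, because addition is associative.
    static long intermediate(List<Long> partials) {
        long sum = 0;
        for (long p : partials) {
            sum += p;
        }
        return sum;
    }

    // Final stage (reducer): merge whatever partial counts arrive into the result.
    static long finish(List<Long> partials) {
        return intermediate(partials);
    }

    // Three made-up map outputs; the first two are combined before the final stage.
    static long demo() {
        long m1 = initial(Arrays.asList("a", "b", "c"));
        long m2 = initial(Arrays.asList("d"));
        long m3 = initial(Arrays.asList("e", "f"));
        long combined = intermediate(Arrays.asList(m1, m2));
        return finish(Arrays.asList(combined, m3));
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Feeding the three partial counts straight to the final stage gives the same total as combining two of them first, which is the property the combiner relies on.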
Who uses Pig for What?
  • 70% of production jobs at Yahoo (10ks per day)
  • Also used by Twitter, LinkedIn, eBay, AOL, …
  • Used to:
      • Process web logs
      • Build user behavior models
      • Process images
      • Build maps of the web
      • Do research on raw data sets
Accessing Pig
  • Submit a script directly
  • Grunt, the pig shell
  • PigServer Java class, a JDBC-like interface
Components
  • User machine: Pig resides on user machine
  • Hadoop Cluster: Job executes on cluster
  • No need to install anything extra on your Hadoop cluster.
How It Works
Pig Latin
    A = LOAD 'myfile' AS (x, y, z);
    B = FILTER A BY x > 0;
    C = GROUP B BY x;
    D = FOREACH C GENERATE group, COUNT(B);
    STORE D INTO 'output';
pig.jar:
  • parses
  • checks
  • optimizes
  • plans execution

Editor's Notes

  • #4 How many have used Pig? How many have looked at it and have a basic understanding of it?
  • #15 Demo script:
    Show group query first, talk about:
      • load and schema (none, declared, from data)
      • data types
      • data sources need not be from HDFS or even from files
      • parallel clause, how parallelism is determined on maps
      • how grouping works in Pig Latin
    So far what I’ve shown you is a simple join/group query. Now let’s look at something less straightforward in SQL.
    Often people want to group data a number of different ways. Look at multiquery script:
      • Note how there’s a branch in the logic now
    Often want to operate on the result of each record in a previous statement. Look at top5 query:
      • Note nested foreach allows you to operate on each record coming out of group by
      • Since result of group by is a bag in each record, can apply operators to that bag
      • Currently support order, distinct, filter, limit
      • Use of flatten at the end
      • Use of positional parameters
    There will always be logic you need to write that you can’t get from Pig Latin. This is where rich support of UDFs comes in. Look at session query:
      • Note registering UDF
      • UDF now called like any other Pig builtin function (in fact Pig builtins implemented as UDFs)
    Look at SessionAnalysis.java:
      • Class name is UDF name
      • Input to UDF is always a Tuple, avoids need to declare expected input, means UDF has to check what it gets
      • Talk about how projection of bags works
      • Talk about how EvalFunc is templatized on return type
    Also easy to write load and store functions to fit your data needs
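
The demo's top5 query is not included in these notes. As a rough illustration of the nested foreach points (each group carries a bag, the bag is ordered and limited, and the result is flattened), here is a plain-Java sketch. It is not Pig, and the class, fields, and sample data are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of "group, then order and limit inside each group, then flatten".
// All names are illustrative; the demo script itself is not reproduced in the notes.
final class NestedForeachSketch {

    // Hypothetical input row: one user's click count for one url.
    static final class Visit {
        final String user;
        final String url;
        final long clicks;
        Visit(String user, String url, long clicks) {
            this.user = user;
            this.url = url;
            this.clicks = clicks;
        }
    }

    static List<String> topPerUser(List<Visit> visits, int n) {
        // Group by user: each key maps to a "bag" holding that user's rows.
        Map<String, List<Visit>> grouped = new TreeMap<>();
        for (Visit v : visits) {
            grouped.computeIfAbsent(v.user, k -> new ArrayList<>()).add(v);
        }
        List<String> out = new ArrayList<>();
        for (List<Visit> bag : grouped.values()) {
            // Nested order: sort the bag by clicks, highest first.
            List<Visit> sorted = new ArrayList<>(bag);
            sorted.sort(Comparator.comparingLong((Visit v) -> v.clicks).reversed());
            // Nested limit, then flatten: emit one output row per surviving bag element.
            for (Visit v : sorted.subList(0, Math.min(n, sorted.size()))) {
                out.add(v.user + "\t" + v.url + "\t" + v.clicks);
            }
        }
        return out;
    }

    // Made-up sample data, only to exercise the steps.
    static List<String> demo() {
        List<Visit> visits = Arrays.asList(
                new Visit("amy", "a.com", 5), new Visit("amy", "b.com", 9),
                new Visit("amy", "c.com", 1), new Visit("bob", "a.com", 2));
        return topPerUser(visits, 2);
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```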