Pig, Making Hadoop Easy

Alan F. Gates
Yahoo!

Who Am I?
- Pig committer
- Hadoop PMC member
- An architect on the Yahoo! grid team
- Or, as one coworker put it, “the lipstick on the Pig”

Motivation By Example
Suppose you have user data in one file, website data in another, and you need to find the top 5 most visited pages by users aged 18 - 25.
- Load Users
- Load Pages
- Filter by age
- Join on name
- Group on url
- Count clicks
- Order by clicks
- Take top 5
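To make the semantics of the steps above concrete, here is a minimal plain-Python sketch of the same dataflow on in-memory data. It is not Pig and not how Pig executes anything; the function name `top_pages`, its parameters, and the sample records are all hypothetical, and it assumes user names are unique so the join can be expressed as a set lookup.

```python
from collections import Counter


def top_pages(users, pages, lo=18, hi=25, n=5):
    """Return the n most visited urls among users aged lo..hi (inclusive).

    users: iterable of (name, age) pairs; names are assumed unique.
    pages: iterable of (user, url) pairs, one per page visit.
    """
    # Filter by age.
    in_range = {name for name, age in users if lo <= age <= hi}
    # Join on name, group on url, and count clicks in one pass.
    clicks = Counter(url for user, url in pages if user in in_range)
    # Order by clicks (descending) and take the top n.
    return clicks.most_common(n)


# Hypothetical sample data, for illustration only.
users = [("alice", 20), ("bob", 30), ("carol", 18), ("dave", 25)]
pages = [
    ("alice", "a.com"), ("bob", "a.com"), ("carol", "a.com"),
    ("dave", "b.com"), ("alice", "b.com"), ("carol", "c.com"),
    ("alice", "a.com"),
]

print(top_pages(users, pages, n=2))  # [('a.com', 3), ('b.com', 2)]
```

This fits in a few lines only because everything is held in memory on one machine; the point of the deck is getting the same brevity when the data lives in Hadoop.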

In Pig Latin
    Users = load 'users' as (name, age);
    Fltrd = filter Users by age >= 18 and age <= 25;
    Pages = load 'pages' as (user, url);
    Jnd = join Fltrd by name, Pages by user;
    Grpd = group Jnd by url;
    Smmd = foreach Grpd generate group, COUNT(Jnd) as clicks;
    Srtd = order Smmd by clicks desc;
    Top5 = limit Srtd 5;
    store Top5 into 'top5sites';

Performance
[Chart; labels: 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]

Why not SQL?
- Data Collection
- Data Factory (Pig): Pipelines, Iterative Processing, Research
- Data Warehouse (Hive): BI Tools, Analysis

Pig Highlights
- User defined functions (UDFs) can be written for column transformation (TOUPPER), or aggregation (SUM)
- UDFs can be written to take advantage of the combiner
- Four join implementations built in: hash, fragment-replicate, merge, skewed
- Multi-query: Pig will combine certain types of operations together in a single pipeline to reduce the number of times data is scanned
- Order by provides total ordering across reducers in a balanced way
- Writing load and store functions is easy once an InputFormat and OutputFormat exist
- Piggybank, a collection of user contributed UDFs
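The combiner point above relies on an aggregation being splittable into stages, so partial results can be computed on each map task before anything is sent to a reducer. The following is a rough plain-Python sketch of that idea for a count. The function names `initial`, `intermediate`, `final`, and `count_with_combiner` are hypothetical, and the lists of lists only stand in for the output of separate map tasks; this is not Pig's UDF API.

```python
def initial(record):
    """Map side: each record contributes a partial count of 1."""
    return 1


def intermediate(partials):
    """Combiner: fold partial counts from one map task into a single number."""
    return sum(partials)


def final(partials):
    """Reducer: fold the per-task partial counts into the final count."""
    return sum(partials)


def count_with_combiner(map_outputs):
    """Count records given one list of records per (simulated) map task."""
    # Each task combines locally, so only one number per task is passed on.
    combined = [
        intermediate([initial(record) for record in task_records])
        for task_records in map_outputs
    ]
    return final(combined)


# Three simulated map tasks holding 2, 1, and 0 records.
print(count_with_combiner([["a", "b"], ["c"], []]))  # 3
```

The split works for a count because partial counts can be added in any grouping. An aggregation without that property, such as a median, cannot be staged this way.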

Who uses Pig for What?
- 70% of production jobs at Yahoo (tens of thousands per day)
- Also used by Twitter, LinkedIn, eBay, AOL, …
- Used to
  - Process web logs
  - Build user behavior models
  - Process images
  - Build maps of the web
  - Do research on raw data sets

Accessing Pig
- Submit a script directly
- Grunt, the pig shell
- PigServer Java class, a JDBC like interface

Components
- User machine: Pig resides on user machine
- Hadoop cluster: Job executes on cluster
- No need to install anything extra on your Hadoop cluster.

How It Works
Pig Latin
    A = LOAD 'myfile' AS (x, y, z);
    B = FILTER A BY x > 0;
    C = GROUP B BY x;
    D = FOREACH C GENERATE group, COUNT(B);
    STORE D INTO 'output';
pig.jar: parses
