TriHUG November Pig Talk by Alan Gates

Published in: Technology, Business

  1. Pig: Dataflow Scripting for Hadoop
     Alan F. Gates, @alanfgates
     © Hortonworks, Inc 2011
  2. Who Am I?
     • Pig committer and PMC Member
     • HCatalog committer and mentor
     • Member of ASF and Incubator PMC
     • Co-founder of Hortonworks
     • Author of Programming Pig from O’Reilly
     Photo credit: Steven Guarnaccia, The Three Little Pigs
  3. Who Are You?
  4. Example
     For all of your registered users, you want to count how many came to your site this month. You want this count both by geography (zip code) and by demographic group (age and gender).
     Dataflow diagram:
     • Load Users and Load Logs feed a Semi-join
     • Semi-join, then Count by zip, then Store results
     • Semi-join, then Count by age, gender, then Store results
  5. In Pig Latin
       -- Load web server logs
       logs = load 'server_logs' using HCatLoader();
       thismonth = filter logs by date >= 20110801 and date < 20110901;
       -- Load users
       users = load 'users' using HCatLoader();
       -- Remove any users that did not visit this month
       grpd = cogroup thismonth by userid, users by userid;
       fltrd = filter grpd by not IsEmpty(thismonth);
       visited = foreach fltrd generate flatten(users);
       -- Count by zip code
       grpbyzip = group visited by zip;
       cntzip = foreach grpbyzip generate group, COUNT(visited);
       store cntzip into 'by_zip' using HCatStorer('date=201108');
       -- Count by demographics
       grpbydemo = group visited by (age, gender);
       cntdemo = foreach grpbydemo generate flatten(group), COUNT(visited);
       store cntdemo into 'by_demo' using HCatStorer('date=201108');
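The same dataflow can be sketched in plain Python to make the semi-join step concrete. This is only an in-memory illustration, not how Pig executes the script: the sample rows and the helper name `visitors_this_month` are hypothetical, and the field names simply follow the script on the slide.

```python
from collections import Counter

# Hypothetical sample rows standing in for the users and server_logs tables.
users = [
    {"userid": 1, "zip": "27601", "age": 34, "gender": "F"},
    {"userid": 2, "zip": "27601", "age": 51, "gender": "M"},
    {"userid": 3, "zip": "27701", "age": 34, "gender": "F"},
]
logs = [
    {"userid": 1, "date": 20110803},
    {"userid": 1, "date": 20110815},  # repeat visit, must not double count
    {"userid": 3, "date": 20110820},
    {"userid": 2, "date": 20110715},  # outside the month, filtered out
]


def visitors_this_month(users, logs, start, end):
    """Semi-join: keep each user once if any log row falls in [start, end)."""
    active_ids = {row["userid"] for row in logs if start <= row["date"] < end}
    return [user for user in users if user["userid"] in active_ids]


visited = visitors_this_month(users, logs, 20110801, 20110901)

# The two aggregations share one scan of the joined data, as in the script.
count_by_zip = Counter(user["zip"] for user in visited)
count_by_demo = Counter((user["age"], user["gender"]) for user in visited)
```

A user with several log rows in the month still counts once, which is the point of the cogroup, filter and flatten sequence on the slide.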
  6. Pig’s Place in the Data World
     Data Collection, Data Factory, Data Warehouse
     • Data Factory: Pig
       – Pipelines
       – Iterative Processing
       – Research
     • Data Warehouse: Hive
       – BI Tools
       – Analysis
  7. Why not MapReduce?
     • Pig provides a number of standard data operators
       – Five different implementations of join (hash, fragment-replicate, merge, merge-sparse, skewed)
       – Order by provides total ordering across reducers in a balanced way
     • Provides optimizations that are hard to do by hand
       – Multi-query: Pig will combine certain types of operations together in a single pipeline to reduce the number of times data is scanned
     • User Defined Functions provide a way to inject your code into the data transformation
       – can be written in Java or Python
       – can do column transformation (TOUPPER) and aggregation (SUM)
       – can be written to take advantage of the combiner
     • Control flow can be done via Python or Java
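To illustrate what "written to take advantage of the combiner" means, here is a minimal plain-Python sketch of an average computed in three stages. It is not Pig's UDF API: the function names `initial`, `intermed` and `final` are only loosely modeled on the stages of an algebraic aggregate, and `average_in_splits` is a hypothetical driver that treats each inner list as one map task's input.

```python
from functools import reduce


def initial(value):
    """Map side: turn one input value into a partial (sum, count)."""
    return (value, 1)


def intermed(left, right):
    """Combiner: merge two partials. Associative, so it can run repeatedly."""
    return (left[0] + right[0], left[1] + right[1])


def final(partial):
    """Reduce side: turn the merged partial into the answer."""
    total, count = partial
    return total / count if count else None


def average_in_splits(splits):
    """Hypothetical driver: combine within each split, then across splits."""
    partials = [
        reduce(intermed, (initial(value) for value in split))
        for split in splits
        if split
    ]
    if not partials:
        return None
    return final(reduce(intermed, partials))
```

An average cannot be combined directly, because the mean of per-split means is wrong when splits differ in size. Carrying (sum, count) pairs instead lets each map task shrink its output to one pair before anything crosses the network.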
  8. Embedding Example: Compute PageRank
     PageRank: a system of linear equations (as many as there are pages on the web, yeah, a lot). It can be approximated iteratively: compute the new page rank based on the page ranks of the previous iteration. Start with some value.
     Ref: http://en.wikipedia.org/wiki/PageRank
     Slide courtesy of Julien Le Dem
  9. Or more visually
     Each page sends a fraction of its PageRank to the pages it links to. That fraction is inversely proportional to the number of links on the page.
     Slide courtesy of Julien Le Dem
  10. Slide courtesy of Julien Le Dem
  11. Let’s zoom in
      pig script: PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
      • Iterate 10 times
      • Pass parameters as a dictionary
      • Just run P, that was declared above
      • The output becomes the new input
      Slide courtesy of Julien Le Dem
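The iteration described above can be sketched in plain Python. This is not the embedded Pig script from the slide, which is not reproduced in the transcript. It only applies the same formula, PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)), in memory. The function name `pagerank`, the adjacency-dictionary input, the starting value of 1.0 and the default d = 0.85 are all assumptions made for illustration.

```python
def pagerank(links, d=0.85, iterations=10):
    """Approximate PageRank iteratively.

    links maps each page to the list of pages it links to.
    d is the damping factor (0.85 is an assumed, commonly used value).
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    ranks = {page: 1.0 for page in pages}  # "Start with some value"

    for _ in range(iterations):
        # Each page sends PR(T)/C(T) to every page it links to.
        incoming = {page: 0.0 for page in pages}
        for source, targets in links.items():
            if targets:
                share = ranks[source] / len(targets)
                for target in targets:
                    incoming[target] += share
        # The output of this iteration becomes the input of the next one.
        ranks = {page: (1 - d) + d * incoming[page] for page in pages}

    return ranks


example = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
```

In the embedded version each pass would run as a Pig job over files rather than a dictionary, but the loop has the same shape: fixed iteration count, parameters passed in, previous output fed back as input.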
  12. Recently Added Features
      • New in 0.9 (released July 2011):
        – Embedding in Python
        – Macros and Imports
      • New in 0.10 (should release in Dec 2011):
        – Boolean data type
        – Hash based aggregation for aggregates with low cardinality keys
        – UDFs to build and apply bloom filters
        – UDFs in JRuby (may slip to next release)
  13. Learn More
      • Read the online documentation: http://pig.apache.org/
      • Programming Pig from O’Reilly Press
      • Join the mailing lists:
        – user@pig.apache.org for user questions
        – dev@pig.apache.org for developer issues
      • Follow me on Twitter, @alanfgates
  14. Questions
