Pig Out to Hadoop

Pig has added some exciting new features in 0.10, including a boolean type, UDFs in JRuby, load and store functions for JSON, bloom filters, and performance improvements. Join Alan Gates, Hortonworks co-founder and long-time contributor to the Apache Pig and HCatalog projects, to discuss these new features, as well as talk about work the project is planning to do in the near future. In particular, we will cover how Pig can take advantage of changes in Hadoop 0.23.

Published in: Education, Technology, Business

Transcript of "Pig Out to Hadoop"

  1. Pig: Recent Work and Next Steps
     Alan F. Gates, @alanfgates
  2. Who Am I?
     •  Pig committer and PMC member
     •  Original member of the engineering team at Yahoo that took Pig from research to production
     •  Author of Programming Pig from O’Reilly
     •  HCatalog committer and mentor
     •  Co-founder of Hortonworks
     •  Tech lead of the team at Hortonworks that works on Pig, Hive, and HCatalog
     •  Member of the Apache Software Foundation and the Incubator PMC
     © 2012 Hortonworks
  3. What is Pig?
  8. What is Pig?
     •  A data flow language
          users   = load 'users';
          grouped = group users by zipcode;
          byzip   = foreach grouped generate zipcode, COUNT(users);
          store byzip into 'count_by_zip';
     •  That translates a script into a series of MapReduce jobs and then executes those jobs
          MapReduce job:
            Input:       ./users
            Map:         project(zipcode, userid)
            Shuffle key: zipcode
            Reduce:      count
            Output:      ./count_by_zip
     •  Lives on the client machine; nothing to install on the cluster
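The translation described on this slide can be imitated in a few lines of plain Python. This is only a sketch of the map, shuffle, and reduce phases, not of how Pig compiles a script. The in-memory `users` list and the function names are hypothetical, and the sketch shuffles on `zipcode` because that is the grouping column of the script.

```python
from collections import defaultdict

# Hypothetical stand-in for the ./users input: (userid, zipcode) pairs.
users = [("u1", "94301"), ("u2", "94301"), ("u3", "10001")]


def map_phase(records):
    # Map: project each record to (shuffle key, value) = (zipcode, userid).
    for userid, zipcode in records:
        yield zipcode, userid


def shuffle(pairs):
    # Shuffle: bring all values that share a key together.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    # Reduce: count the users in each zipcode group.
    return {zipcode: len(userids) for zipcode, userids in groups.items()}


count_by_zip = reduce_phase(shuffle(map_phase(users)))
print(count_by_zip)
```

Each function corresponds to one row of the job description on the slide; in a real job the shuffle is performed by Hadoop between the map and reduce tasks.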
  9. Recent Work
  10. New Features in Pig 0.10
      •  Released April 2012
      •  This release was a collaborative effort, with major features added by Twitter, Yahoo, Hortonworks, and Google Summer of Code students
      •  Not all the new features are covered here; see http://hortonworks.com/blog/new-features-in-apache-pig-0-10/ for a complete list
  11. Ruby UDFs
      •  In Pig 0.8 and 0.9, UDFs could be written in Python and Java. Ruby is now supported as well
      •  Evaluated via JRuby
         power.pig:
           register 'power.rb' using jruby as rf;
           data    = load 'input' as (a:int, b:int);
           powered = foreach data generate rf.power(a, b);
         power.rb:
           require 'pigudf'
           class Power < PigUdf
             outputSchema "a:int"
             def power(mantissa, exponent)
               return nil if mantissa.nil? or exponent.nil?
               mantissa**exponent
             end
           end
      •  Algebraic and Accumulator UDFs can also be written in Ruby (as in Java, but unlike in Python)
  12. PigStorage With Schemas
      •  By default, PigStorage (the default load/store function) does not use a schema
      •  In 0.10, it can store a schema if instructed to
      •  The schema is stored in a side file, .pig_schema
      •  If a schema is available, it is used automatically
           A = load 'studenttab10k' as (name:chararray, age:int, gpa:double);
           store A into 'foo' using PigStorage('\t', '-schema');
           A = load 'foo';
           B = foreach A generate name, age;
  13. Additional UDF Improvements
      •  Automatic generation of simpler UDFs
         –  If you implement an Algebraic UDF, Pig can generate Accumulator and basic UDFs
         –  If you implement an Accumulator UDF, Pig can generate a basic UDF
      •  JSON load and store functions
         –  Require a schema that describes the JSON; they do not infer the schema from the data
         –  Schema is stored in a side file, so there is no need to declare it in the script
      •  Built-in UDFs for Bloom filters
         –  BuildBloom builds a Bloom filter for one or more columns of a given input
         –  The filter can be constructed with a given size (number of hash functions and number of bits) or sized from the desired false positive rate
         –  Bloom takes the file generated by BuildBloom and applies it to an input
              define bb BuildBloom('Hash.JENKINS_HASH', '1000', '0.01');
              A = load 'users';
              B = group A all;
              C = foreach B generate bb(A.name);
              store C into 'mybloom';
              define bloom Bloom('mybloom');
              A = load 'transactions';
              B = filter A by bloom(name);
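To make the BuildBloom / Bloom pairing concrete, here is a minimal Bloom filter in plain Python. It is a sketch of the general data structure, not Pig's implementation: the `BloomFilter` class, the sample data, and the use of salted SHA-256 in place of a Jenkins hash are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Illustrative Bloom filter with a fixed number of bits and hashes."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(item))


# "BuildBloom" step: build the filter from the name column of users.
user_names = ["alice", "bob", "carol"]
bloom = BloomFilter(num_bits=1024, num_hashes=3)
for name in user_names:
    bloom.add(name)

# "Bloom" step: keep only transactions whose name may be in users.
transactions = [("alice", 10), ("dave", 25), ("bob", 7)]
kept = [t for t in transactions if bloom.might_contain(t[0])]
```

A Bloom filter never drops a row whose key was added, but it can let through an occasional row whose key was not, which is why a filter like this is normally followed by an exact join.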
  14. Language Improvements
      •  Boolean is now supported as a first-class data type
           a = load 'foo' as (n:chararray, a:int, g:double, b:boolean);
      •  Default split destination: otherwise
         –  Records that do not match any of the ifs go to this destination
         –  Records can still go to multiple ifs
           split a into b if id < 3, c if id > 5, d otherwise;
      •  Maps, tuples, and bags can now be generated without UDFs:
           B = foreach A generate [key, value], (col1, col2), {col1, col2};
      •  Register a collection of jars at once with globs:
         –  Uses HDFS globbing syntax
           register /home/me/jars/*.jar;
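The two rules for split with otherwise (a record may match several ifs; otherwise receives only records that match none) can be sketched in plain Python. The function name, the dictionary-of-predicates interface, and the sample records are hypothetical; this models the semantics stated on the slide and is not Pig code.

```python
def split_with_otherwise(records, branches):
    """Route each record to every branch whose predicate matches.

    Records that match no branch go to the "otherwise" output.
    """
    outputs = {name: [] for name in branches}
    outputs["otherwise"] = []
    for record in records:
        matched = False
        for name, predicate in branches.items():
            if predicate(record):
                outputs[name].append(record)
                matched = True
        if not matched:
            outputs["otherwise"].append(record)
    return outputs


records = [{"id": 1}, {"id": 4}, {"id": 7}]

# Same conditions as the slide: b if id < 3, c if id > 5, d otherwise.
slide_split = split_with_otherwise(
    records,
    {"b": lambda r: r["id"] < 3, "c": lambda r: r["id"] > 5},
)

# Overlapping conditions: a record can land in more than one output.
overlap_split = split_with_otherwise(
    records,
    {"low": lambda r: r["id"] < 5, "odd": lambda r: r["id"] % 2 == 1},
)
```

In the overlapping case the record with id 1 appears in both `low` and `odd`, and the otherwise output is empty because every record matched at least one condition.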
  15. Performance Improvements
      •  Hash-based aggregation
         –  Up to 50% faster aggregation for sets with a small number of distinct keys
         –  Pig runtime automatically selects the aggregation implementation
      •  Push limit to loader
         –  When a limit can be applied to the load, Pig now stops reading records after reaching the limit
         –  Does not work after group, join, distinct, or order by
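The slide does not say where the hash-based aggregation runs, so the following Python sketch assumes one common form: keeping a hash table of partial counts on the map side, so that far fewer records reach the shuffle when there are few distinct keys. The function names and sample data are hypothetical, and this is an illustration of the general idea rather than Pig's implementation.

```python
from collections import Counter


def map_without_aggregation(zipcodes):
    # Emit one (key, 1) record per input record.
    return [(zipcode, 1) for zipcode in zipcodes]


def map_with_hash_aggregation(zipcodes):
    # Keep partial counts in a hash table and emit one record per distinct key.
    partial = Counter()
    for zipcode in zipcodes:
        partial[zipcode] += 1
    return list(partial.items())


def reduce_counts(pairs):
    # Sum the (possibly partial) counts for each key.
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


# 1,000 input records but only two distinct keys.
zipcodes = ["94301"] * 600 + ["10001"] * 400

plain = map_without_aggregation(zipcodes)
aggregated = map_with_hash_aggregation(zipcodes)

print(len(plain), len(aggregated))
print(reduce_counts(plain) == reduce_counts(aggregated))
```

Both paths give the same totals; the difference is the number of intermediate records, which is why the benefit is largest when the number of distinct keys is small.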
  16. Current Work in Pig – Not Yet Released
      •  Work done on internal data representation and map → reduce transfer to lower memory footprint and enhance performance
      •  Datetime type has been added
      •  Development of CUBE, ROLLUP, and RANK operators: patches posted and being reviewed
      •  Pig running natively on Windows: in the process of posting patches
  17. Pig with Hadoop 2.0
      •  Pig 0.10 is the first release of Pig that works with Hadoop 2.0 (formerly known as Hadoop 0.23)
      •  By default, Pig 0.10 works with Hadoop 1.0
      •  Must be recompiled to work with Hadoop 2.0
         –  All the pieces are included with the released code; you just need to run ant with the right flags set
      •  Does not yet take advantage of new features in Hadoop 2.0
  18. Next Steps
  19. Pig Execution Today
        users   = load 'users';
        grouped = group users by zipcode;
        byzip   = foreach grouped generate zipcode, COUNT(users);
        store byzip into 'count_by_zip';
      MapReduce job:
        Input:       ./users
        Map:         project(zipcode, userid)
        Shuffle key: zipcode
        Reduce:      count
        Output:      ./count_by_zip
      •  All planning done up front
      •  No use made of any statistics or information that we have
      •  Pig (mostly) uses vanilla MapReduce
  20. Re-optimize on the Fly
      [Diagram: a plan of four MR jobs. Legend: MR Job = planned, MR Job = executed]
  24. Re-optimize on the Fly
      [Diagram: a plan of four MR jobs, two of them annotated with "output: 50G" and "output: 1G". Legend: MR Job = planned, MR Job = executed]
      •  Observe the output size from both jobs and notice that one of them is small enough to fit in memory
  25. Re-optimize on the Fly
      [Diagram: a plan of four MR jobs, two of them annotated with "output: 50G" and "output: 1G". Legend: MR Job = planned, MR Job = executed]
      •  Observe the output size from both jobs and notice that one of them is small enough to fit in memory
      •  Can change the join to an FR (fragment-replicate) join, which is map only, and combine it with the last MR job
  26. Re-optimize on the Fly
      [Diagram: the plan reduced to three MR jobs, two of them annotated with "output: 50G" and "output: 1G". Legend: MR Job = planned, MR Job = executed]
      •  Observe the output size from both jobs and notice that one of them is small enough to fit in memory
      •  Can change the join to an FR (fragment-replicate) join, which is map only, and combine it with the last MR job
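The decision described on these slides can be sketched in Python: once the upstream jobs have run, compare their observed output sizes with a memory budget and pick a join strategy. Everything here is an assumption made for illustration, including the function names, the 2 GiB budget, and the in-memory row lists; the slides describe proposed future work, not an existing Pig API.

```python
from collections import defaultdict

GIB = 1024 ** 3


def choose_join_strategy(left_bytes, right_bytes, memory_budget_bytes):
    # If the smaller input fits in memory, a map-only join is possible.
    if min(left_bytes, right_bytes) <= memory_budget_bytes:
        return "fragment-replicate"
    return "shuffle"


def fragment_replicate_join(large_rows, small_rows):
    # Load the small input into a hash table, then stream the large input
    # past it. No shuffle or reduce phase is needed.
    small_index = defaultdict(list)
    for key, value in small_rows:
        small_index[key].append(value)
    for key, value in large_rows:
        for small_value in small_index.get(key, []):
            yield key, value, small_value


# Observed output sizes from the slide: 50G and 1G.
strategy = choose_join_strategy(50 * GIB, 1 * GIB, memory_budget_bytes=2 * GIB)

large = [("94301", "u1"), ("10001", "u2"), ("94301", "u3")]
small = [("94301", "Palo Alto")]
joined = list(fragment_replicate_join(large, small))
```

Because the join no longer needs its own reduce phase, it can run inside the map phase of the following job, which is what lets the plan shrink by one MR job.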
  29. Modify MapReduce
        users   = load 'users';
        grouped = group users by zipcode;
        byzip   = foreach grouped generate zipcode, COUNT(users) as cnt;
        sorted  = order byzip by cnt;
        store sorted into 'count_by_zip';
      [Diagram: Map → Reduce → Map → Reduce]
      •  The second map is useless. Whatever can be done in it can always be done in the preceding reduce. Having it costs an extra write to and read from HDFS.
  30. Modify MapReduce
        users   = load 'users';
        grouped = group users by zipcode;
        byzip   = foreach grouped generate zipcode, COUNT(users) as cnt;
        sorted  = order byzip by cnt;
        store sorted into 'count_by_zip';
      [Diagram: Map → Reduce → Reduce]
  31. Today
      [Diagram: Hive, Pig, and Others each have their own Plan → Optimize → Execute pipeline]
      •  Different in the front end; very similar in the back end
      •  With HCatalog, different apps can share metadata
      •  No ability to share UDFs, operators, or innovations between projects
  32. Data Virtual Machine
      [Diagram: Pig, Hive, and Others each have their own Plan step; a shared Data Virtual Machine handles Optimize and Execute]
  33. Questions & Answers
      •  TRY: download at hortonworks.com
      •  LEARN: Hortonworks University
      •  FOLLOW: twitter: @hortonworks, Facebook: facebook.com/hortonworks
      •  MORE EVENTS: hortonworks.com/events