Pig Out to Hadoop

Pig has added some exciting new features in 0.10, including a boolean type, UDFs in JRuby, load and store functions for JSON, bloom filters, and performance improvements. Join Alan Gates, Hortonworks co-founder and long-time contributor to the Apache Pig and HCatalog projects, to discuss these new features, as well as talk about work the project is planning to do in the near future. In particular, we will cover how Pig can take advantage of changes in Hadoop 0.23.

Transcript

  • 1. Pig: Recent Work and Next Steps
    Alan F. Gates, @alanfgates
  • 2. Who Am I?
    • Pig committer and PMC member
    • Original member of the engineering team at Yahoo that took Pig from research to production
    • Author of Programming Pig from O’Reilly
    • HCatalog committer and mentor
    • Co-founder of Hortonworks
    • Tech lead of the team at Hortonworks that does Pig, Hive, and HCatalog
    • Member of Apache Software Foundation and Incubator PMC
    © 2012 Hortonworks
  • 3. What is Pig?
  • 4. What is Pig?
    • A data flow language
        users = load 'users';
        grouped = group users by zipcode;
        byzip = foreach grouped generate zipcode, COUNT(users);
        store byzip into 'count_by_zip';
  • 5. What is Pig?
    • A data flow language
        users = load 'users';
        grouped = group users by zipcode;
        byzip = foreach grouped generate zipcode, COUNT(users);
        store byzip into 'count_by_zip';
    • That translates a script into a series of MapReduce jobs and then executes those jobs
  • 6. What is Pig?
    • A data flow language
        users = load 'users';
        grouped = group users by zipcode;
        byzip = foreach grouped generate zipcode, COUNT(users);
        store byzip into 'count_by_zip';
    • That translates a script into a series of MapReduce jobs and then executes those jobs
    [Figure: the script shown again]
  • 7. What is Pig?
    • A data flow language
        users = load 'users';
        grouped = group users by zipcode;
        byzip = foreach grouped generate zipcode, COUNT(users);
        store byzip into 'count_by_zip';
    • That translates a script into a series of MapReduce jobs and then executes those jobs
        MapReduce job:
          Input: ./users
          Map: project(zipcode, userid)
          Shuffle key: zipcode
          Reduce: count
          Output: ./count_by_zip
  • 8. What is Pig?
    • A data flow language
        users = load 'users';
        grouped = group users by zipcode;
        byzip = foreach grouped generate zipcode, COUNT(users);
        store byzip into 'count_by_zip';
    • That translates a script into a series of MapReduce jobs and then executes those jobs
        MapReduce job:
          Input: ./users
          Map: project(zipcode, userid)
          Shuffle key: zipcode
          Reduce: count
          Output: ./count_by_zip
    • Lives on the client machine, nothing to install on the cluster
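The translation from script to MapReduce job can be modeled in a few lines of plain Python. This is only a sketch of the map, shuffle, and reduce stages for a group-and-count on zipcode; it is not how Pig itself is implemented, and the sample records and function names are invented for illustration.

```python
from collections import defaultdict

# Hypothetical sample of the 'users' input: (userid, zipcode) records.
users = [("alice", "94086"), ("bob", "94086"), ("carol", "10001")]


def map_phase(records):
    """Map: project each record down to a (zipcode, userid) pair."""
    for userid, zipcode in records:
        yield zipcode, userid


def shuffle(pairs):
    """Shuffle: collect all values that share the same key (the zipcode)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: count the members of each group, like COUNT(users)."""
    return {zipcode: len(members) for zipcode, members in groups.items()}


count_by_zip = reduce_phase(shuffle(map_phase(users)))
print(count_by_zip)
```

Each function stands in for one stage of the single MapReduce job the script compiles to; on a real cluster the shuffle is done by Hadoop rather than by an in-memory dictionary.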
  • 9. Recent Work
  • 10. New Features in Pig 0.10
    • Released April 2012
    • This release was a collaborative effort, with major features added by Twitter, Yahoo, Hortonworks, and Google Summer of Code students
    • Not all the new features are covered here; see http://hortonworks.com/blog/new-features-in-apache-pig-0-10/ for a complete list
  • 11. Ruby UDFs
    • In Pig 0.8 and 0.9, UDFs could be written in Python and Java; Ruby is now supported as well
    • Evaluated via JRuby
        power.pig:
          register 'power.rb' using jruby as rf;
          data = load 'input' as (a:int, b:int);
          powered = foreach data generate rf.power(a, b);

        power.rb:
          require 'pigudf'
          class Power < PigUdf
            outputSchema "a:int"
            def power(mantissa, exponent)
              return nil if mantissa.nil? or exponent.nil?
              mantissa**exponent
            end
          end
    • Can also do Algebraic and Accumulator UDFs in Ruby (like in Java, but unlike in Python)
  • 12. PigStorage With Schemas
    • By default, PigStorage (the default load/store function) does not use a schema
    • In 0.10, it can store a schema if instructed to
    • Schema stored in side file .pig_schema
    • If schema is available it will automatically be used
        A = load 'studenttab10k' as (name:chararray, age:int, gpa:double);
        store A into 'foo' using PigStorage('\t', '-schema');

        A = load 'foo';
        B = foreach A generate name, age;
  • 13. Additional UDF Improvements
    • Automatic generation of simpler UDFs
      – If you implement an Algebraic UDF, Pig can generate Accumulator & basic UDFs
      – If you implement an Accumulator UDF, Pig can generate a basic UDF
    • JSON load and store functions
      – Requires a schema that describes the JSON; does not infer the schema from the data
      – Schema stored in side file, no need to declare in script
    • Built-in UDFs for Bloom filters
      – BuildBloom builds a bloom filter for one or more columns for a given input
      – Can be constructed to be a certain size (# of hash functions and # of bits) or based on the desired false positive rate
      – Bloom takes the file generated by BuildBloom and applies it to an input
        define bb BuildBloom('Hash.JENKINS_HASH', '1000', '0.01');
        A = load 'users';
        B = group A all;
        C = foreach B generate bb(A.name);
        store C into 'mybloom';

        define bloom Bloom('mybloom');
        A = load 'transactions';
        B = filter A by bloom(name);
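To make the build-then-apply pattern behind BuildBloom and Bloom concrete, here is a minimal Bloom filter sketch in plain Python. It is not Pig's implementation: it sizes the filter from an expected element count and a desired false positive rate using the standard formulas, and it substitutes salted SHA-256 for the Jenkins hash named on the slide. The class name, method names, and sample data are assumptions made for illustration.

```python
import hashlib
import math


class BloomFilter:
    """Illustrative Bloom filter sized from an element count and error rate."""

    def __init__(self, num_elements, false_positive_rate):
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        self.num_bits = max(1, math.ceil(
            -num_elements * math.log(false_positive_rate) / (math.log(2) ** 2)))
        self.num_hashes = max(1, round(self.num_bits / num_elements * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item):
        # Salted SHA-256 stands in for the hash family (an assumption, not Jenkins).
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


# "BuildBloom" step: build the filter from the names in the small input.
user_names = ["alice", "bob"]
bloom = BloomFilter(num_elements=1000, false_positive_rate=0.01)
for name in user_names:
    bloom.add(name)

# "Bloom" step: use the filter to discard rows from the large input early.
transactions = [("alice", 10), ("zed", 99), ("bob", 5)]
kept = [row for row in transactions if bloom.might_contain(row[0])]
```

A Bloom filter never drops a row whose key was added, but it may let a few extra rows through, so it is typically applied as a cheap pre-filter ahead of a join rather than as the join itself.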
  • 14. Language Improvements
    • Boolean now supported as a first class data type
        a = load 'foo' as (n:chararray, a:int, g:double, b:boolean);
    • Default split destination - otherwise
      – records which do not match any of the ifs will go to this destination
      – records can still go to multiple ifs
        split a into b if id < 3, c if id > 5, d otherwise;
    • Maps, tuples, and bags can now be generated without UDFs:
        B = foreach A generate [key, value], (col1, col2), {col1, col2};
    • Register a collection of jars at once with globs:
      – Uses HDFS globbing syntax
        register /home/me/jars/*.jar;
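The routing rule for split with otherwise can be sketched in plain Python: a record goes to every branch whose condition it satisfies, and to the otherwise branch only when it satisfies none. This models the semantics described on the slide rather than Pig's implementation; the function name and the sample ids are assumptions for illustration.

```python
def split(records, branches):
    """Route each record to every matching branch, else to 'otherwise'.

    branches maps an output name to a predicate function.
    """
    outputs = {name: [] for name in branches}
    outputs["otherwise"] = []
    for record in records:
        matched = False
        for name, predicate in branches.items():
            if predicate(record):
                outputs[name].append(record)
                matched = True
        if not matched:
            outputs["otherwise"].append(record)
    return outputs


ids = list(range(1, 8))

# Same conditions as the slide: b if id < 3, c if id > 5, d otherwise.
result = split(ids, {"b": lambda i: i < 3, "c": lambda i: i > 5})

# Overlapping conditions: a record can land in more than one branch.
overlap = split(ids, {"low": lambda i: i < 4, "odd": lambda i: i % 2 == 1})
```

With the slide's conditions, ids 3 through 5 match neither test and fall through to otherwise; in the overlapping case, ids 1 and 3 appear in both outputs.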
  • 15. Performance Improvements
    • Hash based aggregation
      – Up to 50% faster aggregation for sets with small number of distinct keys
      – Pig runtime automatically selects aggregation implementation
    • Push limit to loader
      – Now when you have a limit that can be applied to the load, Pig will stop reading records after reaching the limit
      – Does not work after group, join, distinct, or order by
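Both ideas can be illustrated with a loose Python model, which is not Pig's code. The first function keeps one running count per distinct key in a hash table, which is cheap when the number of distinct keys is small. The second part shows a limit pushed into the loader: reading stops as soon as enough records have been produced. The names `partial_counts` and `CountingLoader` are hypothetical.

```python
from collections import Counter
from itertools import islice


def partial_counts(records, key):
    """Hash-based aggregation: one running count per distinct key."""
    counts = Counter()
    for record in records:
        counts[key(record)] += 1
    return counts


class CountingLoader:
    """Hypothetical loader that tracks how many records it has read."""

    def __init__(self, rows):
        self.rows = rows
        self.records_read = 0

    def __iter__(self):
        for row in self.rows:
            self.records_read += 1
            yield row


# Aggregation over many records but only three distinct keys.
zipcodes = ["94086", "10001", "60601"] * 1000
counts = partial_counts(zipcodes, key=lambda z: z)

# Limit pushed to the loader: only the first 10 records are ever read.
loader = CountingLoader(range(1_000_000))
first_ten = list(islice(loader, 10))
```

The limit can be pushed down here only because nothing between the load and the limit needs to see the whole input; a group, join, distinct, or order by in between would require reading everything first.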
  • 16. Current Work in Pig – Not Yet Released
    • Work done on internal data representation and map → reduce transfer to lower memory footprint and enhance performance
    • Datetime type has been added
    • Development of CUBE, ROLLUP, and RANK operators – patches posted and being reviewed
    • Pig running natively on Windows – in the process of posting patches
  • 17. Pig with Hadoop 2.0
    • Pig 0.10 is the first release of Pig that works with Hadoop 2.0 (formerly known as Hadoop 0.23)
    • By default Pig 0.10 works with Hadoop 1.0
    • Must be recompiled to work with Hadoop 2.0
      – All the pieces are included with the released code; you just need to run ant with the right flags set
    • Does not yet take advantage of new features in Hadoop 2.0
  • 18. Next Steps
  • 19. Pig Execution Today
        users = load 'users';
        grouped = group users by zipcode;
        byzip = foreach grouped generate zipcode, COUNT(users);
        store byzip into 'count_by_zip';

        MapReduce job:
          Input: ./users
          Map: project(zipcode, userid)
          Shuffle key: zipcode
          Reduce: count
          Output: ./count_by_zip
    • All planning done up front
    • No use made of any statistics or information that we have
    • Pig (mostly) uses vanilla MapReduce
  • 20. Re-optimize on the Fly
    [Diagram: a plan made up of several MR jobs; the legend distinguishes planned from executed jobs]
  • 24. Re-optimize on the Fly
    [Diagram: MR job plan; two jobs are annotated "output: 50G" and "output: 1G"; legend: planned vs. executed]
    • Observe output size from both jobs, notice that one of them is small enough to fit in memory
  • 25. Re-optimize on the Fly
    [Diagram: MR job plan; two jobs are annotated "output: 50G" and "output: 1G"; legend: planned vs. executed]
    • Observe output size from both jobs, notice that one of them is small enough to fit in memory
    • Can change join to FR (fragment-replicate) join, thus map only, and combine with last MR job
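The decision these slides describe can be sketched as a small Python function: once the upstream jobs have run and their output sizes are known, pick a fragment-replicate join if the smaller input fits in memory, otherwise keep the regular reduce-side join. This is a hypothetical illustration of the proposed behavior, not Pig's planner; `JobOutput`, `choose_join`, and the 2 GB memory budget are all assumptions.

```python
from dataclasses import dataclass

GB = 1024 ** 3


@dataclass
class JobOutput:
    """Observed result of an executed MR job (hypothetical structure)."""
    name: str
    size_bytes: int


def choose_join(left, right, memory_budget_bytes):
    """Pick a join strategy from observed output sizes.

    Returns (strategy, name of the input to replicate or None).
    """
    smaller = min(left, right, key=lambda output: output.size_bytes)
    if smaller.size_bytes <= memory_budget_bytes:
        # The small side can be loaded into memory in every map task,
        # so the join needs no reduce phase.
        return ("fragment-replicate", smaller.name)
    return ("reduce-side", None)


# Sizes from the slide: one job produced 50G, the other 1G.
big = JobOutput("job_a", 50 * GB)
small = JobOutput("job_b", 1 * GB)
decision = choose_join(big, small, memory_budget_bytes=2 * GB)
```

Because a fragment-replicate join is map only, the re-planned join can then be merged into the map phase of the following MR job, which is the saving the slide points to.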
  • 28. Modify MapReduce
        users = load 'users';
        grouped = group users by zipcode;
        byzip = foreach grouped generate zipcode, COUNT(users) as cnt;
        sorted = order byzip by cnt;
        store sorted into 'count_by_zip';
    [Diagram: Map → Reduce → Map → Reduce]
  • 29. Modify MapReduce
        users = load 'users';
        grouped = group users by zipcode;
        byzip = foreach grouped generate zipcode, COUNT(users) as cnt;
        sorted = order byzip by cnt;
        store sorted into 'count_by_zip';
    [Diagram: Map → Reduce → Map → Reduce, with a note on the second Map]
    • This map is useless. Whatever can be done in it can always be done in the preceding reduce. Having it costs an extra write to and read from HDFS.
  • 30. Modify MapReduce
        users = load 'users';
        grouped = group users by zipcode;
        byzip = foreach grouped generate zipcode, COUNT(users) as cnt;
        sorted = order byzip by cnt;
        store sorted into 'count_by_zip';
    [Diagram: Map → Reduce → Reduce]
  • 31. Today
    [Diagram: Hive, Pig, and others each run their own Plan → Optimize → Execute stack]
    • Different in the front end; very similar in the backend
    • With HCatalog different apps can share metadata
    • No ability to share UDFs, operators, or innovations between projects
  • 32. Data Virtual Machine
    [Diagram: Pig, Hive, and others each do their own Plan; a shared Data Virtual Machine handles Optimize and Execute]
  • 33. Questions & Answers
    TRY: download at hortonworks.com
    LEARN: Hortonworks University
    FOLLOW: twitter: @hortonworks, Facebook: facebook.com/hortonworks
    MORE EVENTS: hortonworks.com/events