Introduction to Pig

Short presentation introducing Pig.

Published in: Technology

  1. PIG
       Mike Unwin
       Twitter: @mjunwin
  2. Why are we talking about Pig?
       • Originally developed at Yahoo!, now an Apache project
       • Engine for executing data flows in parallel on Hadoop
       • Includes a language called Pig Latin for expressing data flows
       • Easy to learn and extensible
       • Open source
  3. What is a data flow language?
       • Allows us to describe how data should be loaded, read, processed and stored
       • Can be simple linear flows, e.g. word count
       • Or complex workflows that include joins
  4. Is it like SQL?
       • Pig Latin does look a bit like SQL, e.g. Join, Group By
       • But SQL is declarative
       • In Pig you describe how the data flows
       • In SQL you end up writing an inside-out query, whereas in Pig you describe a pipeline
  5. SQL example
       SELECT CustomerName, TotalOrders, PostCode
       FROM Customers c
       INNER JOIN (
           SELECT CustomerId, count(OrderId) AS TotalOrders
           FROM Orders
           GROUP BY CustomerId
       ) AS t ON t.CustomerId = c.CustomerId
  6. Same Query in Pig
       -- load the orders and count them per customer
       orders = load 'Orders' as (CustomerId, OrderId);
       grouped = group orders by CustomerId;
       total = foreach grouped generate group, COUNT(orders.OrderId);
       -- load the customers and join them to the per-customer totals
       customer = load 'Customers' as (CustomerId, CustomerName);
       result = join total by group, customer by CustomerId;
       dump result;
  7. Installing Pig
       • http://pig.apache.org/docs/r0.11.1/
       • Requires Java
       • Requires Hadoop (Pig does include a built-in version of Hadoop, currently v0.20.2)
       • Requires Cygwin on Windows
  8. What do you get?
       • Pig
       • Grunt Shell
       • Piggy Bank
  9. Basic Pig Operators
       • FOREACH
       • FILTER
       • GROUP BY
       • ORDER BY
       • UNION
       • CROSS
  10. Same Query in Pig
       -- load the orders and count them per customer
       orders = load 'Orders' as (CustomerId, OrderId);
       grouped = group orders by CustomerId;
       total = foreach grouped generate group, COUNT(orders.OrderId);
       -- load the customers and join them to the per-customer totals
       customer = load 'Customers' as (CustomerId, CustomerName);
       result = join total by group, customer by CustomerId;
       dump result;
  11. Debugging
       • Describe
       • Explain
  12. How does Pig become a MapReduce (MR) job?
  13. Advantages of Pig
       • Easy to learn
       • Can achieve a lot with a small amount of code, e.g. the join example
       • Well-written scripts can be easy to read and easy to maintain
       • Has a local mode for testing scripts
       • Has a unit testing framework
  14. Limitations of Pig
       • Unit testing
       • High level: you often need to drop down into custom UDFs
       • If you are proficient in C# or F#, these can sometimes be easier to test, e.g. Streaming unit allows unit testing
       • Still doesn't play nicely in a Windows environment
  15. http://elastastorage.blob.core.windows.net/hdinsight/PigOnHDInsight.pdf
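
As a rough illustration of the simple linear flow mentioned on slide 3 (word count), the Pig Latin sketch below loads, processes and stores data, and then calls the two debugging commands listed on slide 11. The file name 'input.txt', the output directory 'wordcount_output' and the relation names are assumptions made for this example; none of them come from the original slides.

```pig
-- Load each line of a (hypothetical) text file as one chararray field.
lines = LOAD 'input.txt' USING TextLoader() AS (line:chararray);

-- Split each line into words, then flatten so each record holds one word.
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- Collect identical words into one group per distinct word.
grouped = GROUP words BY word;

-- Count the records in each group.
counts = FOREACH grouped GENERATE group AS word, COUNT(words) AS total;

-- Debugging: print the schema of the relation, then the execution plans.
DESCRIBE counts;
EXPLAIN counts;

-- Write the result out (the directory name is an assumption).
STORE counts INTO 'wordcount_output';
```

Saved under a hypothetical name such as wordcount.pig, a script like this could be tried in the local mode mentioned on slide 13 with `pig -x local wordcount.pig` before running it against a cluster.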
