Introduction to Pig

Short presentation introducing Pig.

Published in: Technology

  1. PIG
     Mike Unwin
     Twitter: @mjunwin
  2. Why are we talking about Pig?
     • Originally developed at Yahoo!, now an Apache project
     • Engine for executing data flows in parallel on Hadoop
     • Includes a language called Pig Latin for expressing data flows
     • Easy to learn and extensible
     • Open source
  3. What is a data flow language?
     • Allows us to describe how data should be loaded, read, processed and stored
     • Can be a simple linear flow, e.g. word count
     • Or a complex workflow that includes joins
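     As a rough illustration of a simple linear flow, word count might be written in Pig Latin along these lines. The input path `input.txt`, the output path `wordcount_out` and the relation names are placeholders rather than anything from the slides; `TextLoader`, `TOKENIZE`, `FLATTEN` and `COUNT` are Pig built-ins.

     ```pig
     -- Load each line of the (hypothetical) input file as a single chararray.
     lines   = LOAD 'input.txt' USING TextLoader() AS (line:chararray);
     -- Split every line into words and flatten the resulting bag into rows.
     words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
     -- Collect identical words together.
     grouped = GROUP words BY word;
     -- Count the rows in each group.
     counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS total;
     -- Store the result in a (hypothetical) output directory.
     STORE counts INTO 'wordcount_out';
     ```

     Each statement names an intermediate relation, so the script reads top to bottom in the order the data moves: load, process, store.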
  4. Is it like SQL?
     • Pig Latin does look a bit like SQL, e.g. Join, Group By
     • But SQL is declarative
     • In Pig you describe how the data flows
     • In SQL you end up producing an inside-out query, whereas in Pig you describe a pipeline
  5. SQL example
         SELECT CustomerName, TotalOrders, PostCode
         FROM Customers c
         INNER JOIN (
             SELECT CustomerId, COUNT(OrderId) AS TotalOrders
             FROM Orders
             GROUP BY CustomerId
         ) AS t ON t.CustomerId = c.CustomerId
  6. Same Query in Pig
         orders   = load 'Orders' as (CustomerId, OrderId);
         grouped  = group orders by CustomerId;
         -- count the orders inside each group's bag
         total    = foreach grouped generate group as CustomerId, COUNT(orders.OrderId) as TotalOrders;
         customer = load 'Customers' as (CustomerId, CustomerName);
         result   = join total by CustomerId, customer by CustomerId;
         dump result;
  7. Installing Pig
     • http://pig.apache.org/docs/r0.11.1/
     • Requires Java
     • Requires Hadoop (Pig does ship with a built-in version of Hadoop, currently v0.20.2)
     • Requires Cygwin on Windows
  8. What do you get?
     • Pig
     • Grunt Shell
     • Piggy Bank
  9. Basic Pig Operators
     • FOREACH
     • FILTER
     • GROUP BY
     • ORDER BY
     • UNION
     • CROSS
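     A minimal sketch of how several of these operators might chain together. It assumes the Orders data is comma-delimited and carries an `Amount` field, which is hypothetical and does not appear in the slides; the threshold and relation names are likewise only for illustration.

     ```pig
     -- Assumed schema: Amount is a hypothetical field added for this example.
     orders  = LOAD 'Orders' USING PigStorage(',')
               AS (CustomerId:int, OrderId:int, Amount:double);
     -- FILTER: keep only the rows that satisfy a condition.
     large   = FILTER orders BY Amount > 100.0;
     -- FOREACH: project (or transform) the fields of every row.
     slim    = FOREACH large GENERATE CustomerId, Amount;
     -- GROUP BY: collect the rows for each customer into a bag.
     by_cust = GROUP slim BY CustomerId;
     spend   = FOREACH by_cust GENERATE group AS CustomerId, SUM(slim.Amount) AS Total;
     -- ORDER BY: sort the result, highest spend first.
     ranked  = ORDER spend BY Total DESC;
     DUMP ranked;
     ```

     UNION and CROSS take two or more relations rather than one: UNION concatenates them, and CROSS produces their cartesian product, which can be expensive on large inputs.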
  10. Same Query in Pig
          orders   = load 'Orders' as (CustomerId, OrderId);
          grouped  = group orders by CustomerId;
          -- count the orders inside each group's bag
          total    = foreach grouped generate group as CustomerId, COUNT(orders.OrderId) as TotalOrders;
          customer = load 'Customers' as (CustomerId, CustomerName);
          result   = join total by CustomerId, customer by CustomerId;
          dump result;
  11. Debugging
      • Describe
      • Explain
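      As a hedged sketch, the two commands might be used from the Grunt shell or a script like this. The relations reuse the Orders example from the earlier slides; the `int` types are an assumption added so that the schema has something to show.

      ```pig
      orders  = LOAD 'Orders' AS (CustomerId:int, OrderId:int);
      grouped = GROUP orders BY CustomerId;
      -- DESCRIBE prints the schema of a relation, which helps when working out
      -- what a GROUP or JOIN has done to the field names.
      DESCRIBE grouped;
      -- EXPLAIN prints the logical, physical and MapReduce plans that Pig
      -- would use to compute the relation.
      EXPLAIN grouped;
      ```

      DESCRIBE only inspects the schema and runs no job, so it is cheap to call after each step while building up a script.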
  12. How does Pig become an MR job?
  13. Advantages of Pig
      • Easy to learn
      • Can achieve a lot with a small amount of code, e.g. the Join example
      • Well-written scripts can be easy to read and easy to maintain
      • Has a local mode for testing scripts
      • Has a unit testing framework
  14. Limitations of Pig
      • Unit testing
      • High level: you often need to drop down into custom UDFs
      • If you are proficient in C# or F#, that can sometimes be easier to test, e.g. Streaming unit allows unit testing
      • Still doesn’t play nicely in a Windows environment
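      Dropping down to custom code could look roughly like the sketch below. Everything named here is hypothetical: the jar `myudfs.jar`, the class `com.example.pig.NormalisePostCode`, the executable `scorer.exe` and the `Score` field are placeholders, and the Customers schema is assumed from the SQL example. Only the `REGISTER`, `DEFINE ... SHIP` and `STREAM ... THROUGH` statements themselves are standard Pig Latin.

      ```pig
      -- Make a (hypothetical) jar of Java UDFs available to the script.
      REGISTER myudfs.jar;
      -- Give the (hypothetical) UDF class a short alias.
      DEFINE Normalise com.example.pig.NormalisePostCode();

      customers = LOAD 'Customers'
                  AS (CustomerId:int, CustomerName:chararray, PostCode:chararray);
      -- Call the UDF like a built-in function.
      cleaned   = FOREACH customers GENERATE CustomerId, Normalise(PostCode) AS PostCode;

      -- Streaming: pipe rows through an external (hypothetical) executable,
      -- which could be written in any language and tested on its own.
      DEFINE scorer `scorer.exe` SHIP('scorer.exe');
      scored    = STREAM cleaned THROUGH scorer AS (CustomerId:int, Score:double);
      DUMP scored;
      ```

      With streaming, the external program reads rows on standard input and writes rows on standard output, so its logic can be unit tested outside Pig.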
  15. http://elastastorage.blob.core.windows.net/hdinsight/PigOnHDInsight.pdf
