Apache Hadoop India Summit 2011 talk "Pig - Making Hadoop Easy" by Alan Gates
  • A very common use case we see at Yahoo is that users want to read one data set and group it several different ways. Since scan time often dominates for these large data sets, sharing one scan across several group instances can result in a nearly linear speedup of queries.
  • In this case multiple pipelines are needed in the map and reduce phases. Because execution uses a pull-based model, the split and multiplex operators embed the pipelines within themselves. Records are tagged with the pipeline number in the map stage. Grouping is done by Hadoop using a union of the keys. The multiplex operator on the reducer places incoming records in the correct pipeline.
  • As your website grows, the number of unique users grows beyond what you can keep in memory. A given map only gets input from a given input source. It can therefore annotate tuples from that source with information on which source they came from. The join key is then used to partition the data, but the join key plus the input source id is used to sort it. This allows Pig to buffer one side of the join in memory for each key and then use that as a probe table as records from the other input stream by.
  • Running example: You start a website. You want to know how users are using your website, so you collect a couple of streams of information from your logs: page views and users. When you start you have a fair number of page views, but not many users. In this algorithm the smaller table is copied to every map in its entirety (it doesn’t yet use the Distributed Cache, though it should). The larger file is partitioned as per normal MapReduce.
  • As your website grows even more, some pages become significantly more popular than others. This means that some pages are visited by almost every user, while others are visited by only a few users. First, a sampling pass is done to determine which keys are large enough to need special attention. These are keys that have enough values that we estimate we cannot hold all of them in memory. It’s about holding the values in memory, not the key. Then, at partitioning time, those keys are handled specially. All other keys are treated as in the regular join. The selected keys from input1 are split across multiple reducers. For input2, they are replicated to each of the reducers that received a split. In this way we guarantee that every instance of key k from input1 comes into contact with every instance of k from input2.
  • Now let’s say that for some reason you start keeping both your page view data and user data sorted by user. Note that one way to do this join is to make sure that pages and users are partitioned the same way. But this leads to a big problem. In order to make sure you can join all your data sets, you end up using the same hash function to partition them all. But rarely does one bucketing scheme make sense for all your data. Whatever is big enough for one data set will be too small for others, and vice versa. So Pig’s implementation doesn’t depend on how the data is split. Pig does this by sampling one of the inputs and then building an index from that sample that indicates the key for the first record in every split. The other input is used as the standard input file for Hadoop and is split to the maps as per normal. When a map begins processing this file and encounters the first key, it uses the index to determine where it should open the second, sampled file. It then opens the file at the appropriate point, seeks forward until it finds the key it is looking for, and then begins doing a join on the two data sources.
  • Can’t yet inline the Python functions in Pig Latin script. In 0.9 we’ll add the ability to put them in the same file.
  • Transcript

    • 1. Alan F. Gates
      Pig, Making Hadoop Easy
    • 2. Who Am I?
      Pig committer and PMC Member
      An architect in Yahoo! grid team
      Photo credit: Steven Guarnaccia, The Three Little Pigs
    • 3. Motivation By Example
      You have web server logs of purchases on your site. You want to find the 10 users who bought the most and the cities they live in. You also want to know what percentage of purchases they account for in those cities.
      Load Logs
      Find top 10 users
      Sum purchases by city
      Join by city
      Store top 10 users
      Calculate percentage
      Store results
    • 4. In Pig Latin
      raw = load 'logs' as (name, city, purchase);
      -- Find top 10 users
      usrgrp = group raw by (name, city);
      byusr = foreach usrgrp generate group as k1,
      SUM(raw.purchase) as utotal;
      srtusr = order byusr by utotal desc;
      topusrs = limit srtusr 10;
      store topusrs into 'top_users';
      -- Sum purchases per city
      citygrp = group raw by city;
      bycity = foreach citygrp generate group as k2,
      SUM(raw.purchase) as ctotal;
      -- Join top users back to city
      jnd = join topusrs by, bycity by k2;
      pct = foreach jnd generate,, utotal/ctotal;
      store pct into 'top_users_pct_of_city';
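As a rough cross-check of what the script above is meant to compute, here is a minimal single-machine sketch in plain Python. It is not how Pig executes the script; the function name `top_users_pct` and the sample rows are made up for illustration, and the join is done on city as described on the "Motivation By Example" slide.

```python
from collections import defaultdict

def top_users_pct(rows, n=10):
    """rows: iterable of (name, city, purchase). Returns [(name, city, share)]."""
    user_totals = defaultdict(float)   # (name, city) -> total purchases
    city_totals = defaultdict(float)   # city -> total purchases
    for name, city, purchase in rows:
        user_totals[(name, city)] += purchase
        city_totals[city] += purchase
    # order users by total, descending, and keep the top n
    top = sorted(user_totals.items(), key=lambda kv: kv[1], reverse=True)[:n]
    # join each top user back to the total for their city
    return [(name, city, total / city_totals[city]) for (name, city), total in top]

# hypothetical sample data
rows = [
    ("fred", "Pune", 30.0),
    ("jane", "Pune", 10.0),
    ("fred", "Pune", 20.0),
    ("raj", "Delhi", 5.0),
]
result = top_users_pct(rows, n=2)
```

Each returned tuple carries the user's share of all purchases in their city as a fraction between 0 and 1.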
    • 5. Translates to Four MapReduce Jobs
    • 6. Performance
    • 7. Where Do Pigs Live?
      Data Factory
      Iterative Processing
      Data Warehouse
      BI Tools
      Data Collection
    • 8. Pig Highlights
      Language designed to enable efficient description of data flow
      Standard relational operators built in
      User defined functions (UDFs) can be written for column transformation (TOUPPER), or aggregation (SUM)
      UDFs can be written to take advantage of the combiner
      Four join implementations built in: hash, fragment-replicate, merge, skewed
      Multi-query: Pig will combine certain types of operations together in a single pipeline to reduce the number of times data is scanned
      Order by provides total ordering across reducers in a balanced way
      Writing load and store functions is easy once an InputFormat and OutputFormat exist
      Piggybank, a collection of user contributed UDFs
    • 9. Multi-store script
      A = load 'users' as (name, age, gender, city, state);
      B = filter A by name is not null;
      C1 = group B by (age, gender);
      D1 = foreach C1 generate group, COUNT(B);
      store D1 into 'bydemo';
      C2 = group B by state;
      D2 = foreach C2 generate group, COUNT(B);
      store D2 into 'bystate';
      [Diagram: load users -> filter nulls, then two branches: group by age, gender -> apply UDFs -> store into 'bydemo'; and group by state -> apply UDFs -> store into 'bystate']
    • 10. Multi-Store Map-Reduce Plan
      local rearrange
      local rearrange
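The speaker notes for the multi-store plan describe tagging records with a pipeline number in the map stage, grouping on the union of the keys, and demultiplexing on the reducer. A minimal in-memory sketch of that idea follows, in plain Python rather than Pig's actual operators. The function `multi_group_count` and the sample records are hypothetical.

```python
from collections import defaultdict

def multi_group_count(records, key_funcs):
    """One scan over records; returns one {key: count} dict per grouping pipeline."""
    # "map" side: emit each record once per pipeline, tagged with the pipeline number
    tagged = []
    for rec in records:
        for pipeline, key_func in enumerate(key_funcs):
            tagged.append(((pipeline, key_func(rec)), rec))
    # "shuffle": group on the union of the tagged keys
    shuffled = defaultdict(list)
    for tagged_key, rec in tagged:
        shuffled[tagged_key].append(rec)
    # "reduce" side: route each group to the output of its own pipeline
    outputs = [dict() for _ in key_funcs]
    for (pipeline, key), recs in shuffled.items():
        outputs[pipeline][key] = len(recs)
    return outputs

# hypothetical sample data
users = [
    {"name": "fred", "age": 25, "gender": "m", "state": "CA"},
    {"name": "jane", "age": 25, "gender": "f", "state": "CA"},
    {"name": None, "age": 30, "gender": "m", "state": "NY"},
    {"name": "raj", "age": 25, "gender": "m", "state": "NY"},
]
non_null = [u for u in users if u["name"] is not None]
bydemo, bystate = multi_group_count(
    non_null,
    [lambda u: (u["age"], u["gender"]), lambda u: u["state"]],
)
```

The input list is scanned once even though two groupings are produced, which is the point of sharing the scan.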
    • 11. Hash Join
      Users = load 'users' as (name, age);
      Pages = load 'pages' as (user, url);
      Jnd = join Users by name, Pages by user;
      [Diagram: Map 1 (block n) emits records tagged (1, user) and Map 2 (block m) emits records tagged (2, name). Reducer 1 receives (1, fred), (2, fred), (2, fred); Reducer 2 receives (1, jane), (2, jane), (2, jane).]
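The speaker notes for the hash join describe tagging each tuple with its input source, partitioning on the join key, and sorting on the join key plus the source id so that one side can be buffered while the other streams by. The sketch below simulates those steps in memory. It assumes Users is the buffered side (source 1) and Pages the streamed side (source 2), and `hash_join` and the sample data are illustrative names, not Pig's API.

```python
from itertools import groupby

def hash_join(users, pages, num_reducers=2):
    # map: tag each tuple with its input source (1 = users, 2 = pages)
    tagged = [(name, 1, (name, age)) for name, age in users]
    tagged += [(user, 2, (user, url)) for user, url in pages]
    # partition on the join key only, so matching keys meet in one reducer
    partitions = [[] for _ in range(num_reducers)]
    for item in tagged:
        partitions[hash(item[0]) % num_reducers].append(item)
    joined = []
    for part in partitions:
        # sort on (join key, source id) so source 1 arrives first for each key
        part.sort(key=lambda item: (item[0], item[1]))
        for _, group in groupby(part, key=lambda item: item[0]):
            buffered = []  # source-1 tuples for this key, held in memory
            for _, source, tup in group:
                if source == 1:
                    buffered.append(tup)
                else:
                    # stream source-2 tuples past the buffered probe table
                    joined.extend(left + tup for left in buffered)
    return joined

# hypothetical sample data
users = [("fred", 25), ("jane", 31)]
pages = [("fred", "/a"), ("fred", "/b"), ("jane", "/c"), ("bob", "/d")]
result = sorted(hash_join(users, pages))
```

Only the tuples of one input for the current key are held in memory at a time, which is why the secondary sort on source id matters.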
    • 12. Fragment Replicate Join
      Users = load 'users' as (name, age);
      Pages = load 'pages' as (user, url);
      Jnd = join Pages by user, Users by name using 'replicated';
      Map 1
      block 1
      Map 2
      block 2
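The speaker notes for the fragment replicate join say the smaller table is copied to every map in its entirety while the larger file is partitioned as usual. A minimal sketch of that map-only join follows, with each list in `pages_blocks` standing in for one map task's block. The names `replicated_join` and `pages_blocks` are hypothetical.

```python
def replicated_join(pages_blocks, users):
    """Map-only join: every simulated map task gets all of users plus one block of pages."""
    joined = []
    for block in pages_blocks:  # each block stands in for one map task
        # every map builds its own in-memory copy of the small input
        lookup = {}
        for name, age in users:
            lookup.setdefault(name, []).append((name, age))
        # probe the in-memory table with each record of the large input
        for user, url in block:
            for match in lookup.get(user, []):
                joined.append((user, url) + match)
    return joined

# hypothetical sample data
pages_blocks = [[("fred", "/a")], [("jane", "/b"), ("bob", "/c")]]
users = [("fred", 25), ("jane", 31)]
result = replicated_join(pages_blocks, users)
```

No shuffle or reduce phase is involved, which only works while the replicated input fits in each map's memory.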
    • 13. Skew Join
      Users = load 'users' as (name, age);
      Pages = load 'pages' as (user, url);
      Jnd = join Pages by user, Users by name using 'skewed';
      [Diagram: Map 1 (block n) emits records tagged (1, user) and Map 2 (block m) emits records tagged (2, name). Reducer 1 receives (1, fred, p1), (1, fred, p2), (2, fred); Reducer 2 receives (1, fred, p3), (1, fred, p4), (2, fred).]
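The speaker notes for the skew join describe three steps: find the heavy keys, split those keys from the first input across several reducers, and replicate the matching records of the second input to each of those reducers. The sketch below simulates this in memory under simplifying assumptions: it counts every key instead of sampling, and it spreads heavy keys round-robin over all reducers. The name `skew_join` and the threshold `max_per_reducer` are made up for illustration.

```python
from collections import Counter, defaultdict

def skew_join(pages, users, num_reducers=4, max_per_reducer=2):
    # stand-in for the sampling pass: count records per key to find heavy keys
    counts = Counter(user for user, _ in pages)
    heavy = {k for k, c in counts.items() if c > max_per_reducer}
    reducers = [{"pages": [], "users": []} for _ in range(num_reducers)]
    next_slot = defaultdict(int)
    for user, url in pages:
        if user in heavy:
            # split a heavy key round-robin across the reducers
            r = next_slot[user] % num_reducers
            next_slot[user] += 1
        else:
            r = hash(user) % num_reducers
        reducers[r]["pages"].append((user, url))
    for name, age in users:
        # replicate heavy keys to every reducer that may hold part of the split
        targets = range(num_reducers) if name in heavy else [hash(name) % num_reducers]
        for r in targets:
            reducers[r]["users"].append((name, age))
    joined = []
    for red in reducers:
        for user, url in red["pages"]:
            for name, age in red["users"]:
                if user == name:
                    joined.append((user, url, name, age))
    return joined

# hypothetical sample data: "fred" is the skewed key
pages = [("fred", "/p%d" % i) for i in range(5)] + [("jane", "/home")]
users = [("fred", 25), ("jane", 31)]
result = sorted(skew_join(pages, users))
```

Because each page record lands on exactly one reducer and heavy user records are present on all of them, every matching pair is produced exactly once.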
    • 14. Merge Join
      Users = load 'users' as (name, age);
      Pages = load 'pages' as (user, url);
      Jnd = join Pages by user, Users by name using 'merge';
      Map 1

      Map 2
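The speaker notes for the merge join describe building an index that holds the first key of every split of one input. Each map then uses it to find where to open that input before seeking forward and joining. A minimal sketch follows, using in-memory lists in place of files. It assumes Users is the indexed input and that both inputs are already sorted by the join key. The helpers `build_index` and `merge_join_split` are hypothetical names.

```python
from bisect import bisect_left

def build_index(sorted_users, split_size):
    """First join key of every split of the indexed input, with its offset."""
    return [(sorted_users[i][0], i) for i in range(0, len(sorted_users), split_size)]

def merge_join_split(pages_split, sorted_users, index):
    """Join one sorted split of pages against the sorted, indexed users input."""
    if not pages_split:
        return []
    first_key = pages_split[0][0]
    keys = [k for k, _ in index]
    # start at the last split whose first key is strictly below first_key,
    # since a run of equal keys may begin at the end of that split
    slot = max(bisect_left(keys, first_key) - 1, 0)
    pos = index[slot][1]
    joined = []
    for user, url in pages_split:
        # seek forward until the users cursor reaches this key
        while pos < len(sorted_users) and sorted_users[pos][0] < user:
            pos += 1
        # emit every users record that shares the key
        scan = pos
        while scan < len(sorted_users) and sorted_users[scan][0] == user:
            joined.append((user, url) + sorted_users[scan])
            scan += 1
    return joined

# hypothetical sample data, both sorted by join key
sorted_users = [("amy", 20), ("bob", 22), ("fred", 25), ("jane", 31), ("zed", 40)]
index = build_index(sorted_users, split_size=2)
pages_split = [("fred", "/a"), ("fred", "/b"), ("jane", "/c"), ("kim", "/d")]
result = merge_join_split(pages_split, sorted_users, index)
```

The index lets each map skip most of the second input without requiring the two inputs to share a partitioning scheme.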

    • 15. Who uses Pig for What?
      70% of production grid jobs at Yahoo (tens of thousands per day)
      Also used by Twitter, LinkedIn, eBay, AOL, …
      Used to
      Process web logs
      Build user behavior models
      Process images
      Build maps of the web
      Do research on raw data sets
    • 16. Components
      Accessing Pig:
      • Submit a script directly
      • Grunt, the Pig shell
      • PigServer Java class, a JDBC-like interface
      [Diagram: Pig resides on the user machine; the job executes on the Hadoop cluster]
      No need to install anything extra on your Hadoop cluster.
    • 19. How It Works
      Pig Latin
      A = LOAD 'myfile'
      AS (x, y, z);
      B = FILTER A BY x > 0;
      C = GROUP B BY x;
      D = FOREACH C GENERATE group AS x, COUNT(B);
      STORE D INTO 'output';
      Execution Plan
    • 25. New in 0.8
      UDFs can be Jython
      Improved and expanded statistics
      Performance Improvements
      Automatic merging of small files
      Compression of intermediate results
      PigUnit for unit testing your Pig Latin scripts
      Access to static Java functions as UDFs
      Improved HBase integration
      Custom Partitioners
      B = group A by $0 partition by YourPartitioner parallel 2;
      Greatly expanded string and math built in UDFs
    • 26. What’s Next?
      Preview of Pig 0.9
      Integrate Pig with scripting languages for control flow
      Add macros to Pig Latin
      Revive ILLUSTRATE
      Fix runtime type errors
      Rewrite parser to give more useful error messages
      Programming Pig from O’Reilly Press
    • 27. Learn More
      Online documentation:
      Hadoop, The Definitive Guide 2nd edition has an up to date chapter on Pig, search at your favorite bookstore
      Join the mailing lists: for user questions for developer issues
      Follow me on Twitter, @alanfgates
    • 28. UDFs in Scripting Languages
      Evaluation functions can now be written in scripting languages that compile down to the JVM
      Reference implementation provided in Jython
      JRuby and others could be added with minimal code
      JavaScript implementation in progress
      Jython sold separately
    • 29. Example Python UDF
      def square(num): return ((num)*(num))
      register '' using jython as myfuncs;
      A = load 'input' as (i:int);
      B = foreach A generate myfuncs.square(i);
      dump B;
    • 30. Better statistics
      Statistics printed out at end of job run
      Pig information stored in Hadoop’s job history files so you can mine the information and analyze your Pig usage
      Loader for reading job history files included in Piggybank
      New PigRunner interface that allows users to invoke Pig and get back a statistics object that contains stats information
      Can also pass listener to track Pig jobs as they run
      Done for Oozie so it can show users Pig statistics
    • 31. Sample stats info
      Job Stats (time in seconds):
      JobId  Maps  Reduces  MxMT  MnMT  AMT  MxRT  MnRT  ART  Alias
      job_0  2     1        15    3     9    27    27    27   a,b,c,d,e
      job_1  1     1        3     3     3    12    12    12   g,h
      job_2  1     1        3     3     3    12    12    12   i
      job_3  1     1        3     3     3    12    12    12   i
      Successfully read 10000 records from: "studenttab10k"
      Successfully read 10000 records from: "votertab10k"
      Successfully stored 6 records (150 bytes) in: "outfile"
      Total records written : 6
      Total bytes written : 150
    • 32. Invoke Static Java Functions as UDFs
      Often the UDF you need already exists as a Java function, e.g. Java’s URLDecoder.decode() for decoding URLs
      define UrlDecode InvokeForString('', 'String String');
      A = load 'encoded.txt' as (e:chararray);
      B = foreach A generate UrlDecode(e, 'UTF-8');
      Currently only works with simple types and static functions
    • 33. Improved HBase Integration
      Can now read records as bytes instead of auto converting to strings
      Filters can be pushed down
      Can store data in HBase as well as load from it
      Works with HBase 0.20 but not 0.89 or 0.90. Patch in PIG-1680 addresses this but has not been committed yet.