Apache Pig on Amazon AWS - Swine Not?

A basic introduction to Apache Pig, focused on understanding what it is as well as quickly getting started using it through Amazon's Elastic Map Reduce service. The second part details my experience following along with Amazon's Pig tutorial at: http://aws.amazon.com/articles/2729

  • 1. Apache Pig on Amazon AWS - Swine Not?
  • 2. What is Apache Pig?
    Pig is an execution framework that interprets scripts written in a language called Pig Latin and then runs them on a Hadoop cluster.
  • 3. Pig is a tool that...
    ● creates complex jobs that efficiently process large volumes of data
    ● supports many relational features, making it easy to join, group, and aggregate data
    ● performs ETL tasks quickly, on many servers simultaneously
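    As a rough sketch of those relational features, the following Pig Latin joins, groups, and aggregates two datasets in a few statements. The file names and schemas (users.tsv, purchases.tsv) are hypothetical, for illustration only:

    ```pig
    -- Hypothetical input files and schemas, for illustration only
    users     = LOAD 'users.tsv'     AS (id:int, name:chararray);
    purchases = LOAD 'purchases.tsv' AS (user_id:int, amount:double);

    -- Relational-style join, group, and aggregate in a few lines
    joined  = JOIN users BY id, purchases BY user_id;
    grouped = GROUP joined BY users::name;
    totals  = FOREACH grouped GENERATE group AS name,
                                       SUM(joined.purchases::amount) AS total;
    DUMP totals;
    ```

    Each statement names an intermediate relation; Pig turns the whole pipeline into Hadoop jobs.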
  • 4. What is Pig Latin?
    It is a high-level data transformation language that:
    ● allows you to concentrate on the data transformations you require
    Rather than:
    ● forcing you to be concerned with individual map and reduce functions
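    The classic word-count example shows this difference in levels: instead of hand-writing map and reduce functions, you describe the transformations and Pig compiles them into MapReduce jobs. The input file name here is hypothetical:

    ```pig
    -- Word count expressed as data transformations.
    -- 'input.txt' is a hypothetical file, for illustration only.
    lines  = LOAD 'input.txt' AS (line:chararray);
    words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
    grpd   = GROUP words BY word;
    counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS n;
    DUMP counts;
    ```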
  • 5. Walkthrough - Create a Job Flow
    * Basically following the Amazon Pig tutorial at: http://aws.amazon.com/articles/2729
  • 6. And now we wait...
  • 7. SSH into master instance
    $ ssh -i ~/keys/crocs.pem -l hadoop ec2-54-215-107-197.us-west-1.compute.amazonaws.com
  • 8. Type "pig" to enter the grunt shell
    $ pig
    grunt> _
    It's a freakin' shell!
    grunt> pwd
    hdfs://
  • 9. You can enter the HDFS file system:
    grunt> cd hdfs:///
    grunt> ls
    hdfs://  <dir>
    Even enter an S3 bucket:
    grunt> cd s3://elasticmapreduce/samples/pig-apache/input/
    grunt> ls
    s3://elasticmapreduce/samples/pig-apache/input/access_log_1  <r 1>  8754118
    s3://elasticmapreduce/samples/pig-apache/input/access_log_2  <r 1>  8902171
  • 10. Load Piggybank - an open-source library of user-contributed functions
    grunt> register file:/home/hadoop/lib/pig/piggybank.jar
    DEFINE the EXTRACT alias from piggybank:
    grunt> DEFINE EXTRACT org.apache.pig.piggybank.evaluation.string.EXTRACT;
  • 11. LOAD
    Use TextLoader (a built-in Pig load function) to load each line of the source file:
    grunt> RAW_LOGS = LOAD 's3://elasticmapreduce/samples/pig-apache/input/access_log_1' USING TextLoader AS (line:chararray);
  • 12. ILLUSTRATE
    Shows, step by step, how Pig would transform a small sample of data:
    grunt> illustrate RAW_LOGS;
    Connecting to hadoop file system at: hdfs://
    Connecting to map-reduce job tracker at:
    ---------------------------------------------------------------
    | RAW_LOGS | line:chararray                                   |
    ---------------------------------------------------------------
    |          | - - [21/Jul/2009:02:29:56 -0700] "GET /gallery/main.php?g2_itemId=32050 HTTP/1.1" 200 7119 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)" |
    ---------------------------------------------------------------
  • 13. Now let's:
    ● split each line into fields
    ● store everything in a bag
    grunt> LOGS_BASE = FOREACH RAW_LOGS GENERATE
      FLATTEN(
        EXTRACT(line, '^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)"')
      )
      AS (
        remoteAddr: chararray,
        remoteLogname: chararray,
        user: chararray,
        time: chararray,
        request: chararray,
        status: int,
        bytes_string: chararray,
        referrer: chararray,
        browser: chararray
      );
  • 14. ILLUSTRATE an example of our work
    grunt> illustrate LOGS_BASE;
    ...
    | LOGS_BASE |
    | remoteAddr:chararray    |
    | remoteLogname:chararray | -
    | user:chararray          | -
    | time:chararray          | 20/Jul/2009:20:30:55 -0700
    | request:chararray       | GET /gwidgets/alexa.xml HTTP/1.1
    | status:int              | 200
    | bytes_string:chararray  | 2969
    | referrer:chararray      | -
    | browser:chararray       | Mozilla/5.0 (compatible) Feedfetcher-Google; (+http://www.google.com/feedfetcher.html)
  • 15. Create a bag containing tuples with just the referrer element (limit 10 items):
    grunt> REFERRER_ONLY = FOREACH LOGS_BASE GENERATE referrer;
    grunt> TEMP = LIMIT REFERRER_ONLY 10;
    Output the contents of the bag:
    grunt> DUMP TEMP;
    Pig features used in the script: LIMIT
    File concatenation threshold: 100 optimistic? false
    MR plan size before optimization: 1
    MR plan size after optimization: 1
    Pig script settings are added to the job
    creating jar file Job5394669249002614476.jar
    Setting up single store job
    1 map-reduce job(s) waiting for submission.
    ...
  • 16. More log output before we get our results (cleaned up here)...
    Input(s):
    Successfully read 39344 records (126 bytes) from: "s3://elasticmapreduce/samples/pig-apache/input/access_log_1"
    Output(s):
    Successfully stored 10 records (126 bytes) in: "hdfs://"
    Counters:
    Total records written : 10
    ...
  • 17. Voila! Our exciting results:
    (-)
    (-)
    (-)
    (-)
    (-)
    (-)
    (http://example.org/)
    (http://example.org/)
    (-)
    (-)
    First 10 referrers (the dashes represent no referrer)
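    DUMP only prints to the console; to persist results, a STORE statement can write the relation back to HDFS or S3. A minimal sketch - the output path below is hypothetical, not part of the tutorial:

    ```pig
    -- Write the referrer bag out instead of printing it.
    -- 's3://my-bucket/pig-output/referrers' is a hypothetical path.
    STORE REFERRER_ONLY INTO 's3://my-bucket/pig-output/referrers' USING PigStorage('\t');
    ```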
  • 18. Now let's filter only the referrals from bing.com*
    grunt> FILTERED = FILTER REFERRER_ONLY BY referrer matches '.*bing.*';
    grunt> TEMP = LIMIT FILTERED 9;
    grunt> DUMP TEMP;
    (http://www.bing.com/search?q=login)
    (http://www.bing.com/search?q=value)
    (http://www.bing.com/search?q=value)
    (http://www.bing.com/search?q=value)
    (http://www.bing.com/search?q=value)
    (http://www.bing.com/search?q=views)
    (http://www.bing.com/search?q=views)
    (http://www.bing.com/search?q=search)
    (http://www.bing.com/search?q=philmont)
    * We all use Bing, am I right?
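    A natural next step, not shown in the tutorial, would be to count how often each Bing referrer appears - again using only relational-style operators on the FILTERED relation:

    ```pig
    -- Hypothetical continuation: count occurrences of each referrer.
    GROUPED = GROUP FILTERED BY referrer;
    COUNTS  = FOREACH GROUPED GENERATE group AS referrer, COUNT(FILTERED) AS hits;
    SORTED  = ORDER COUNTS BY hits DESC;
    DUMP SORTED;
    ```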
  • 19. Don't forget to terminate your Job Flow!
    Amazon will charge you even if it's idle!