AWS Hadoop and PIG and overview


A quick overview of Hadoop, AWS and PIG, using the AWS-provided PIG script for parsing log files.

Published in: Education
  1. A SQL-like scripting language for Hadoop. CIS 210 – February 2013, Highline Community College
  2. Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.
  3. At the present time, Pig's infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs, for which large-scale parallel implementations already exist (e.g., the Hadoop subproject). Pig's language layer currently consists of a textual language called Pig Latin, which has the following key properties:
     • Ease of programming. It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks comprised of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.
     • Optimization opportunities. The way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.
     • Extensibility. Users can create their own functions to do special-purpose processing.
  4. Amazon Web Services offers Hadoop and supports PIG as part of the Hadoop infrastructure of “Elastic Map Reduce”.
     Sample Pig Script: s3://elasticmapreduce/samples/pig-apache/do-reports2.pig
     Sample Dataset: s3://elasticmapreduce/samples/pig-apache/input
  5. Pig has two execution modes:
     • Local Mode - To run Pig in local mode, you need access to a single machine; all files are installed and run using your local host and file system. Specify local mode using the -x flag (pig -x local).
     • Mapreduce Mode - To run Pig in mapreduce mode, you need access to a Hadoop cluster and HDFS installation. Mapreduce mode is the default mode; you can, but don't need to, specify it using the -x flag (pig OR pig -x mapreduce).
  6. Interactive Mode: You can run Pig in interactive mode using the Grunt shell. Invoke the Grunt shell using the "pig" command and then enter your Pig Latin statements and Pig commands interactively at the command line.
     Batch Mode: You can run Pig in batch mode using Pig scripts and the "pig" command (in local or hadoop mode).
     Example: The Pig Latin statements in the Pig script (id.pig) extract all user IDs from the /etc/passwd file. First, copy the /etc/passwd file to your local working directory. Next, run the Pig script from the command line (using local or mapreduce mode). The STORE operator will write the results to a file (id.out).
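The slide describes what id.pig does without showing it. As a rough Python analogue of that description (not the Pig script itself), the sketch below takes lines in /etc/passwd format and keeps the first colon-separated field of each. The function name and the sample lines are made up for illustration.

```python
def extract_user_ids(lines):
    """Return the first colon-separated field (the user ID) of each non-empty line."""
    return [line.split(":", 1)[0] for line in lines if line.strip()]


# Hypothetical sample in /etc/passwd format; not read from a real system.
sample = [
    "root:x:0:0:root:/root:/bin/bash",
    "daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin",
]
user_ids = extract_user_ids(sample)
print(user_ids)
```

In Pig the same steps would be a LOAD with a colon delimiter, a FOREACH that keeps the first field, and a STORE.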
  7. There are two types of job flows supported with Pig: interactive and batch.
     In interactive mode, a customer can start a job flow and run Pig scripts interactively, directly on the master node. Typically, this mode is used to do ad hoc data analyses and for application development.
     In batch mode, the Pig script is stored in Amazon S3 and is referenced at the start of the job flow. Typically, batch mode is used for repeatable runs such as report generation.
  8. Set up the Piggybank functions:

         --
         -- set up piggybank functions
         --
         register file:/home/hadoop/lib/pig/piggybank.jar
         DEFINE EXTRACT org.apache.pig.piggybank.evaluation.string.EXTRACT();
         DEFINE FORMAT org.apache.pig.piggybank.evaluation.string.FORMAT();
         DEFINE REPLACE org.apache.pig.piggybank.evaluation.string.REPLACE();
         DEFINE DATE_TIME org.apache.pig.piggybank.evaluation.datetime.DATE_TIME();
         DEFINE FORMAT_DT org.apache.pig.piggybank.evaluation.datetime.FORMAT_DT();
  9. Import the logs and break them into tuples:

         --
         -- import logs and break into tuples
         --
         raw_logs =
           -- load the weblogs into a sequence of one element tuples
           LOAD '$INPUT' USING TextLoader AS (line:chararray);

         logs_base =
           -- for each weblog string convert the weblog string into a
           -- structure with named fields
           FOREACH raw_logs
           GENERATE
             FLATTEN (
               EXTRACT(
                 line,
                 '^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)"'
               )
             )
             AS (
               remoteAddr: chararray, remoteLogname: chararray, user: chararray, time: chararray,
               request: chararray, status: int, bytes_string: chararray, referrer: chararray,
               browser: chararray
             )
           ;
  10. What is a Tuple? In mathematics and computer science, a tuple is an ordered list of elements. In set theory, an (ordered) n-tuple is a sequence (or ordered list) of n elements, where n is a non-negative integer. There is only one 0-tuple, an empty sequence. An n-tuple is defined inductively using the construction of an ordered pair. Tuples are usually written by listing the elements within parentheses "( )" and separated by commas; for example, (2, 7, 4, 1, 7) denotes a 5-tuple. Sometimes other delimiters are used, such as square brackets "[ ]" or angle brackets "⟨ ⟩". Braces "{ }" are almost never used for tuples, as they are the standard notation for sets. Tuples are often used to describe other mathematical objects, such as vectors. In computer science, tuples are directly implemented as product types in most functional programming languages. More commonly, they are implemented as record types, where the components are labeled instead of being identified by position alone. This approach is also used in relational algebra.
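To make the distinction between positional tuples and labeled records concrete, here is a small Python sketch. The field names and values are invented for illustration and only loosely mirror the log fields used later in the deck.

```python
from collections import namedtuple

# A plain tuple: components are identified by position alone.
plain = ("203.0.113.7", 200, 2326)

# A record type: the same components, but each one has a label.
LogEntry = namedtuple("LogEntry", ["remote_addr", "status", "bytes"])
entry = LogEntry(*plain)

# Both views give access to the same element.
print(plain[1], entry.status)

# The single 0-tuple is the empty sequence.
empty = ()
```

Pig's AS (...) clause plays a similar role: it attaches names and types to the positions of a tuple.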
  11. This is a regular expression:

          ^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] "(.+?)" (\S+) (\S+) "([^"]*)" "([^"]*)"

      Regular expressions can be used to parse data out of a file, or to validate data in SQL or other programming languages. We will focus on SQL because PIG is very similar to SQL.
  12. This is a little hard to read because of the wrapping. What you should see is that Pig is loading the line into a tuple with just a single element: the line itself. You now need to split the line into fields. To do this, use the EXTRACT Piggybank function, which applies a regular expression to the input and extracts the matched groups as elements of a tuple. The regular expression is a little tricky because the Apache log defines a couple of fields with quotes.
      Unfortunately, you can't use this as is, because in Pig strings all backslashes must be escaped with a backslash. This makes the regular expression a little bulkier than it would be in other programming languages:

          ^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)"
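As a minimal sketch of what EXTRACT does with this expression, the Python below applies the same pattern to one log line and pairs the captured groups with the field names from the script. It is an analogue, not Piggybank code: the pattern is written as a Python raw string, so the backslashes are not doubled, and the sample log line is invented.

```python
import re

# The slide's expression as a Python raw string (single backslashes).
LOG_PATTERN = re.compile(
    r'^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] "(.+?)" (\S+) (\S+) "([^"]*)" "([^"]*)"'
)

# Field names taken from the AS (...) clause of the sample script.
FIELDS = (
    "remoteAddr", "remoteLogname", "user", "time", "request",
    "status", "bytes_string", "referrer", "browser",
)


def parse_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return dict(zip(FIELDS, match.groups())) if match else None


# Hypothetical log line in Apache combined format.
sample_line = (
    '203.0.113.7 - - [21/Jul/2009:02:48:13 -0700] "GET /index.html HTTP/1.1" '
    '200 2326 "http://www.example.com/start" "Mozilla/5.0"'
)
record = parse_line(sample_line)
print(record)
```

Every captured value is still a string at this point, which is why the script converts the time and byte count in a later step.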
  13. Break each log line into named fields:

          logs_base =
            -- for each weblog string convert the weblog string into a
            -- structure with named fields
            FOREACH raw_logs
            GENERATE
              FLATTEN (
                EXTRACT(
                  line,
                  '^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)"'
                )
              )
              AS (
                remoteAddr: chararray, remoteLogname: chararray, user: chararray, time: chararray,
                request: chararray, status: int, bytes_string: chararray, referrer: chararray,
                browser: chararray
              )
            ;
  14. Convert string values to typed values:

          logs =
            -- convert from string values to typed values such as date_time and integers
            FOREACH logs_base
            GENERATE
              *,
              DATE_TIME(time, 'dd/MMM/yyyy:HH:mm:ss Z', 'UTC') as datetime,
              (int)REPLACE(bytes_string, '-', '0') as bytes
            ;
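The two conversions on this slide can be sketched in Python as follows. This is an analogue under stated assumptions, not the Piggybank implementation: the helper names are invented, and the month abbreviation is parsed with `%b`, which assumes an English (C) locale.

```python
from datetime import datetime, timezone


def to_utc(time_field):
    """Parse an Apache-style timestamp and convert it to UTC."""
    local = datetime.strptime(time_field, "%d/%b/%Y:%H:%M:%S %z")
    return local.astimezone(timezone.utc)


def to_bytes(bytes_string):
    """Apache logs '-' when no bytes were sent; treat that as 0."""
    return int(bytes_string.replace("-", "0"))


when = to_utc("21/Jul/2009:02:48:13 -0700")
print(when.isoformat(), to_bytes("-"), to_bytes("2326"))
```

Normalizing to UTC first is what lets the later report group requests by a consistent hour of day.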
  15. Requests and bytes by hour:

          --
          -- determine total number of requests and bytes served by UTC hour of day
          -- aggregating as a typical day across the total time of the logs
          --
          by_hour_count =
            -- group logs by their hour of day, counting the number of logs in that hour
            -- and the sum of the bytes of rows for that hour
            FOREACH
              (GROUP logs BY FORMAT_DT('HH', datetime))
            GENERATE
              $0,
              COUNT($1) AS num_requests,
              SUM($1.bytes) AS num_bytes
            ;

          STORE by_hour_count INTO '$OUTPUT/total_requests_bytes_per_hour';
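The GROUP ... COUNT ... SUM pattern on this slide can be illustrated with a short Python sketch. It assumes each log entry has already been reduced to an (hour, bytes) pair; the function name and sample data are invented.

```python
from collections import defaultdict


def by_hour_count(logs):
    """Group (hour, bytes) pairs by hour; return {hour: (num_requests, num_bytes)}."""
    totals = defaultdict(lambda: [0, 0])
    for hour, nbytes in logs:
        totals[hour][0] += 1        # COUNT($1)
        totals[hour][1] += nbytes   # SUM($1.bytes)
    return {hour: tuple(pair) for hour, pair in sorted(totals.items())}


# Hypothetical entries: (UTC hour as 'HH', bytes served).
logs = [("09", 2326), ("09", 0), ("14", 512)]
hourly = by_hour_count(logs)
print(hourly)
```

On a cluster, Pig compiles the same grouping into Map-Reduce jobs, so the aggregation runs in parallel instead of in a single loop.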
  16. Top 50 IP blocks:

          --
          -- top 50 X.X.X.* blocks
          --
          by_ip_count =
            -- group weblog entries by the ip address from the remote address field
            -- and count the number of entries for each address as well as
            -- the sum of the bytes
            FOREACH
              (GROUP logs BY FORMAT('%s.*', EXTRACT(remoteAddr, '(\\d+\\.\\d+\\.\\d+)')))
            GENERATE
              $0,
              COUNT($1) AS num_requests,
              SUM($1.bytes) AS num_bytes
            ;

          by_ip_count_sorted =
            -- order ip by the number of requests they make
            LIMIT (ORDER by_ip_count BY num_requests DESC) 50;

          STORE by_ip_count_sorted into '$OUTPUT/top_50_ips';
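A rough Python analogue of the IP-block report is sketched below. It covers only the request count, not the byte sum, and the function names and sample addresses are invented for illustration.

```python
import re
from collections import Counter


def ip_block(remote_addr):
    """Keep the first three octets and append '.*', e.g. '203.0.113.*'."""
    match = re.search(r"(\d+\.\d+\.\d+)", remote_addr)
    return f"{match.group(1)}.*" if match else None


def top_blocks(addresses, n=50):
    """Return the n most frequent blocks as (block, num_requests) pairs."""
    return Counter(ip_block(addr) for addr in addresses).most_common(n)


# Hypothetical remote addresses.
addresses = ["203.0.113.7", "203.0.113.9", "198.51.100.4"]
print(top_blocks(addresses))
```

`most_common(n)` stands in for the ORDER ... DESC followed by LIMIT 50 in the script.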
  17. Top 50 external referrers:

          --
          -- top 50 external referrers
          --
          by_referrer_count =
            -- group by the referrer URL and count the number of requests
            FOREACH
              (GROUP logs BY EXTRACT(referrer, '(http://[a-z0-9.-]+)'))
            GENERATE
              FLATTEN($0),
              COUNT($1) AS num_requests
            ;

          by_referrer_count_filtered =
            -- exclude matches for
            FILTER by_referrer_count BY NOT $0 matches .*;

          by_referrer_count_sorted =
            -- take the top 50 results
            LIMIT (ORDER by_referrer_count_filtered BY num_requests DESC) 50;

          STORE by_referrer_count_sorted INTO '$OUTPUT/top_50_external_referrers';
  18. Top search terms from Bing or Google:

          --
          -- top search terms coming from bing or google
          --
          google_and_bing_urls =
            -- find referrer fields that match either bing or google
            FILTER
              (FOREACH logs GENERATE referrer)
            BY
              referrer matches '.*bing.*'
            OR
              referrer matches '.*google.*'
            ;

          search_terms =
            -- extract from each referrer url the search phrases
            FOREACH google_and_bing_urls
            GENERATE
              FLATTEN(EXTRACT(referrer, '.*[&?]q=([^&]+).*')) as (term:chararray)
            ;

          search_terms_filtered =
            -- reject urls that contained no search terms
            FILTER search_terms BY NOT $0 IS NULL;

          search_terms_count =
            -- for each search phrase count the number of weblogs entries that contained it
            FOREACH
              (GROUP search_terms_filtered BY $0)
            GENERATE
              $0,
              COUNT($1) AS num
            ;

          search_terms_count_sorted =
            -- take the top 50 results
            LIMIT (ORDER search_terms_count BY num DESC) 50;

          STORE search_terms_count_sorted INTO '$OUTPUT/top_50_search_terms_from_bing_google';
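The filter, extract, and count steps on this slide can be sketched in Python as shown below. This is an analogue rather than a translation: the function names and sample referrers are invented, and the search term is left URL-encoded, as it is in the Pig script.

```python
import re
from collections import Counter

# Same expression the script passes to EXTRACT.
TERM_PATTERN = re.compile(r".*[&?]q=([^&]+).*")


def search_term(referrer):
    """Return the q= value for Bing or Google referrers, else None."""
    if "bing" not in referrer and "google" not in referrer:
        return None
    match = TERM_PATTERN.fullmatch(referrer)
    return match.group(1) if match else None


def top_search_terms(referrers, n=50):
    """Count non-null search terms and return the n most frequent."""
    terms = (search_term(r) for r in referrers)
    return Counter(t for t in terms if t is not None).most_common(n)


# Hypothetical referrer values.
referrers = [
    "http://www.google.com/search?hl=en&q=apache+pig&start=0",
    "http://www.bing.com/search?q=hadoop",
    "http://www.google.com/search?q=hadoop",
    "http://www.google.com/",
    "http://www.example.com/?q=ignored",
]
print(top_search_terms(referrers))
```

The `is not None` check plays the part of the `NOT $0 IS NULL` filter: referrers from a search engine that carry no `q=` parameter are dropped before counting.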
  19. Regular expressions appear throughout the script:

          (GROUP logs BY EXTRACT(referrer, '(http://[a-z0-9.-]+)'))
          (GROUP logs BY FORMAT('%s.*', EXTRACT(remoteAddr, '(\\d+\\.\\d+\\.\\d+)')))
          FLATTEN(EXTRACT(referrer, '.*[&?]q=([^&]+).*')) as (term:chararray)

      Learning regular expressions will help you with scripting.