A SQL-like scripting language for Hadoop

   CIS 210 – February 2013
 Highline Community College
Apache Pig is a platform for analyzing large data sets that
consists of a high-level language for expressing data
analysis programs, coupled with infrastructure for
evaluating these programs. The salient property of Pig
programs is that their structure is amenable to substantial
parallelization, which in turn enables them to handle very
large data sets.
At the present time, Pig's infrastructure layer consists of a
compiler that produces sequences of Map-Reduce programs,
for which large-scale parallel implementations already exist
(e.g., the Hadoop subproject). Pig's language layer currently
consists of a textual language called Pig Latin, which has the
following key properties:
    Ease of programming. It is trivial to achieve parallel
      execution of simple, "embarrassingly parallel" data analysis
      tasks. Complex tasks composed of multiple interrelated
      data transformations are explicitly encoded as data flow
      sequences, making them easy to write, understand, and
      maintain.
    Optimization opportunities. The way in which tasks are
      encoded permits the system to optimize their execution
      automatically, allowing the user to focus on semantics
      rather than efficiency.
    Extensibility. Users can create their own functions to do
      special-purpose processing, as the sketch below shows.
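
For example, a user-defined function (UDF) packaged in a jar can be
registered and then called like a built-in. The jar and class names
below are hypothetical, not part of the sample script:

REGISTER myudfs.jar;                         -- jar containing the UDF (hypothetical)
DEFINE TOUPPER com.example.pig.TOUPPER();    -- hypothetical UDF class, given a short alias
lines   = LOAD 'input.txt' AS (line:chararray);
shouted = FOREACH lines GENERATE TOUPPER(line);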
Amazon Web Services offers Hadoop and supports Pig as part
of its Hadoop infrastructure, “Elastic MapReduce”.

Sample Pig Script:
s3://elasticmapreduce/samples/pig-apache/do-reports2.pig

Sample Dataset:
s3://elasticmapreduce/samples/pig-apache/input
Local Mode - To run Pig in local mode, you need access to a
single machine; all files are installed and run using your
local host and file system. Specify local mode using the -x
flag (pig -x local).

Mapreduce Mode - To run Pig in mapreduce mode, you
need access to a Hadoop cluster and HDFS installation.
Mapreduce mode is the default mode; you can, but don't
need to, specify it using the -x flag (pig OR pig -x mapreduce).
Interactive Mode
You can run Pig in interactive mode using the Grunt shell. Invoke the Grunt shell using the "pig" command (as shown
below) and then enter your Pig Latin statements and Pig commands interactively at the command line.

Batch Mode
You can run Pig in batch mode using Pig scripts and the "pig" command (in local or hadoop mode).
Example
The Pig Latin statements in the Pig script (id.pig) extract all user IDs from the /etc/passwd file. First, copy the
/etc/passwd file to your local working directory. Next, run the Pig script from the command line (using local or
mapreduce mode). The STORE operator will write the results to a file (id.out).
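
Here is a minimal sketch of id.pig, following the example in the Pig
documentation (the first colon-delimited field of /etc/passwd is the
user ID):

-- id.pig: extract all user IDs from the passwd file
A = load 'passwd' using PigStorage(':');  -- split each line on ':'
B = foreach A generate $0 as id;          -- keep the first field (the user ID)
store B into 'id.out';                    -- write the results to id.out

Run it with: pig -x local id.pig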
There are two types of job flows supported with Pig:
interactive and batch.

In interactive mode, a customer can start a job flow and
run Pig scripts interactively directly on the master node.
Typically, this mode is used to do ad hoc data analyses and
for application development.

In batch mode, the Pig script is stored in Amazon S3 and is
referenced at the start of the job flow. Typically, batch mode
is used for repeatable runs such as report generation.
--
-- set up piggybank functions
--
register file:/home/hadoop/lib/pig/piggybank.jar
DEFINE EXTRACT org.apache.pig.piggybank.evaluation.string.EXTRACT();
DEFINE FORMAT org.apache.pig.piggybank.evaluation.string.FORMAT();
DEFINE REPLACE org.apache.pig.piggybank.evaluation.string.REPLACE();
DEFINE DATE_TIME org.apache.pig.piggybank.evaluation.datetime.DATE_TIME();
DEFINE FORMAT_DT org.apache.pig.piggybank.evaluation.datetime.FORMAT_DT();
--
-- import logs and break into tuples
--
raw_logs =
 -- load the weblogs into a sequence of one-element tuples
 LOAD '$INPUT' USING TextLoader AS (line:chararray);

logs_base =
 -- for each weblog string convert the weblog string into a
 -- structure with named fields
 FOREACH raw_logs
 GENERATE
   FLATTEN (
     EXTRACT(
       line,
       '^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+-]\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)"'
     )
   )
   AS (
     remoteAddr: chararray, remoteLogname: chararray, user: chararray, time: chararray,
     request: chararray, status: int, bytes_string: chararray, referrer: chararray,
     browser: chararray
   )
 ;
What is a Tuple?

In mathematics and computer science, a tuple is an ordered list of
elements. In set theory, an (ordered) n-tuple is a sequence (or ordered
list) of n elements, where n is a non-negative integer. There is only one
0-tuple, the empty sequence.
An n-tuple is defined inductively using the construction of an ordered
pair. Tuples are usually written by listing the elements within
parentheses "( )" and separated by commas; for example, (a, b, c, d, e)
denotes a 5-tuple. Sometimes other delimiters are used, such as square
brackets "[ ]" or angle brackets "< >". Braces "{ }" are almost never used
for tuples, as they are the standard notation for sets.
Tuples are often used to describe other mathematical objects, such as
vectors. In computer science, tuples are directly implemented as
product types in most functional programming languages. More
commonly, they are implemented as record types, where the
components are labeled instead of being identified by position alone.
This approach is also used in relational algebra.
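
In Pig Latin, every record in a relation is a tuple. A small sketch
(the file name and fields here are hypothetical):

-- each row of 'people.csv' becomes a 3-tuple, e.g. (alice,30,seattle)
people = LOAD 'people.csv' USING PigStorage(',')
         AS (name:chararray, age:int, city:chararray);
DUMP people;  -- print the tuples to the console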
This is a regular expression:

 '^(\S+) (\S+) (\S+) \[([\w:/]+\s[+-]\d{4})\] "(.+?)" (\S+) (\S+) "([^"]*)" "([^"]*)"'

Regular expressions can be used to parse data out of a file,
or to validate data in SQL or other programming languages.
We will focus on SQL because Pig is very similar to SQL.
This is a little hard to read because of the wrapping. What you
should see is that Pig is loading the line into a tuple with just a
single element --- the line itself. You now need to split the line
into fields. To do this, use the EXTRACT Piggybank function,
which applies a regular expression to the input and extracts the
matched groups as elements of a tuple. The regular expression
is a little tricky because the Apache log defines a couple of
fields with quotes.

Unfortunately, you can't use it as is, because in Pig strings all
backslashes must be escaped with a backslash. This makes the
regular expression a little bulkier than it is in other
programming languages:

'^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+-]\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)"'
logs_base =
 -- for each weblog string convert the weblog string into a
 -- structure with named fields
 FOREACH raw_logs
 GENERATE
   FLATTEN (
     EXTRACT(
       line,
       '^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+-]\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)"'
     )
   )
   AS (
     remoteAddr: chararray, remoteLogname: chararray, user: chararray, time: chararray,
     request: chararray, status: int, bytes_string: chararray, referrer: chararray,
     browser: chararray
   )
 ;
logs =
 -- convert from string values to typed values such as date_time and integers
 FOREACH logs_base
 GENERATE
   *,
   DATE_TIME(time, 'dd/MMM/yyyy:HH:mm:ss Z', 'UTC') as datetime,
   -- Apache logs use '-' when zero bytes were sent; map it to 0 so the cast succeeds
   (int)REPLACE(bytes_string, '-', '0') as bytes
 ;
--
-- determine total number of requests and bytes served by UTC hour of day
-- aggregating as a typical day across the total time of the logs
--
by_hour_count =
 -- group logs by their hour of day, counting the number of logs in that hour
 -- and the sum of the bytes of rows for that hour
 FOREACH
   (GROUP logs BY FORMAT_DT('HH',datetime))
 GENERATE
   $0,
   COUNT($1) AS num_requests,
   SUM($1.bytes) AS num_bytes
 ;

STORE by_hour_count INTO '$OUTPUT/total_requests_bytes_per_hour';
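
In the Grunt shell you can sanity-check the relation's schema with
DESCRIBE before storing it. The output sketched below is what the
script above implies (COUNT and SUM return long values):

DESCRIBE by_hour_count;
-- expected, roughly: by_hour_count: {group: chararray, num_requests: long, num_bytes: long}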
--
-- top 50 X.X.X.* blocks
--
by_ip_count =
  -- group weblog entries by the ip address from the remote address field
  -- and count the number of entries for each address as well as
  -- the sum of the bytes
  FOREACH
    (GROUP logs BY FORMAT('%s.*', EXTRACT(remoteAddr, '(\\d+\\.\\d+\\.\\d+)')))
  GENERATE
    $0,
    COUNT($1) AS num_requests,
    SUM($1.bytes) AS num_bytes
  ;

by_ip_count_sorted =
 -- order ips by the number of requests they make
 LIMIT (ORDER by_ip_count BY num_requests DESC) 50;

STORE by_ip_count_sorted into '$OUTPUT/top_50_ips';
-- top 50 external referrers
--
by_referrer_count =
 -- group by the referrer URL and count the number of requests
 FOREACH
   (GROUP logs BY EXTRACT(referrer, '(http://[a-z0-9.-]+)'))
 GENERATE
   FLATTEN($0),
   COUNT($1) AS num_requests
 ;

by_referrer_count_filtered =
 -- exclude matches for example.org
 FILTER by_referrer_count BY NOT $0 matches '.*example.org';

by_referrer_count_sorted =
 -- take the top 50 results
 LIMIT (ORDER by_referrer_count_filtered BY num_requests DESC) 50;

STORE by_referrer_count_sorted INTO '$OUTPUT/top_50_external_referrers';
-- top search terms coming from bing or google
--
google_and_bing_urls =
 -- find referrer fields that match either bing or google
 FILTER
   (FOREACH logs GENERATE referrer)
 BY
   referrer matches '.*bing.*'
 OR
   referrer matches '.*google.*'
 ;

search_terms =
 -- extract from each referrer url the search phrases
 FOREACH google_and_bing_urls
 GENERATE
   FLATTEN(EXTRACT(referrer, '.*[&?]q=([^&]+).*')) as (term:chararray)
 ;

search_terms_filtered =
 -- reject urls that contained no search terms
 FILTER search_terms BY NOT $0 IS NULL;

search_terms_count =
 -- for each search phrase count the number of weblogs entries that contained it
 FOREACH
   (GROUP search_terms_filtered BY $0)
 GENERATE
   $0,
   COUNT($1) AS num
 ;

search_terms_count_sorted =
 -- take the top 50 results
 LIMIT (ORDER search_terms_count BY num DESC) 50;


STORE search_terms_count_sorted INTO '$OUTPUT/top_50_search_terms_from_bing_google';
(GROUP logs BY EXTRACT(referrer, '(http://[a-z0-9.-]+)'))

(GROUP logs BY FORMAT('%s.*', EXTRACT(remoteAddr, '(\\d+\\.\\d+\\.\\d+)')))

FLATTEN(EXTRACT(referrer, '.*[&?]q=([^&]+).*')) as (term:chararray)

Learning regular expressions will help you with scripting:

https://www.owasp.org/index.php/Input_Validation_Cheat_Sheet
http://www.regular-expressions.info/
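
As a quick practice sketch (hypothetical, not part of the sample
script), Pig's matches operator tests a field against a regular
expression that must match the whole string:

-- keep only log entries whose request line is an HTTP GET
gets = FILTER logs BY request matches 'GET .*';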