
04 pig data operations


Published in: Software
  1. 1. Apache Pig Data Operations
  2. 2. An Example • Let's look at a simple example: a program in Pig Latin that calculates the maximum recorded temperature by year for the weather dataset. • Data:
     YEAR  TMP  QUALITY
     1950  0    1
     1950  22   1
     1950  -11  1
     1949  111  1
     1949  78   1
     • Start up Grunt in local mode, then enter the first line of the Pig script:
     records = LOAD 'input/ncdc/micro-tab/sample.txt' AS (year:chararray, temperature:int, quality:int);
     • For simplicity, the program assumes that the input is tab-delimited text, with each line having just year, temperature, and quality fields.
  3. 3. An Example records = LOAD 'input/ncdc/micro-tab/sample.txt' AS (year:chararray, temperature:int, quality:int); • This line describes the input data we want to process. • The year:chararray notation describes the field’s name and type; a chararray is like a Java string, and an int is like a Java int. • The LOAD operator takes a URI argument; here we are just using a local file, but we could refer to an HDFS URI. • The AS clause (which is optional) gives the fields names to make it convenient to refer to them in subsequent statements. • The result of the LOAD operator, indeed any operator in Pig Latin, is a relation, which is just a set of tuples. • A tuple is just like a row of data in a database table, with multiple fields in a particular order. • In this example, the LOAD function produces a set of (year, temperature, quality) tuples that are present in the input file. • We write a relation with one tuple per line, where tuples are represented as comma-separated items in parentheses: (1950,0,1)
  4. 4. An Example • Relations are given names, or aliases, so they can be referred to. • This relation is given the records alias. • We can examine the contents of an alias using the DUMP operator:
     DUMP records;
     (1950,0,1)
     (1950,22,1)
     (1950,-11,1)
     (1949,111,1)
     (1949,78,1)
  5. 5. An Example • We can also see the structure of a relation—the relation's schema—using the DESCRIBE operator on the relation's alias:
     DESCRIBE records;
     • The next statement removes records that have a missing temperature (indicated by a value of 9999) or an unsatisfactory quality reading:
     filtered_records = FILTER records BY temperature != 9999 AND (quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9);
     • For this small dataset, no records are filtered out. • The third statement uses the GROUP operator to group the filtered_records relation by the year field:
     grouped_records = GROUP filtered_records BY year;
     • Use DUMP grouped_records; to see what it produces, and DESCRIBE grouped_records; to see its structure.
  6. 6. An Example • We now have two rows, or tuples, one for each year in the input data. The first field in each tuple is the field being grouped by (the year), and the second field is a bag of tuples for that year. • A bag is just an unordered collection of tuples, which in Pig Latin is represented using curly braces. • So now all that remains is to find the maximum temperature for the tuples in each bag: max_temp = FOREACH grouped_records GENERATE group, MAX(filtered_records.temperature); • FOREACH processes every row to generate a derived set of rows, using a GENERATE clause to define the fields in each derived row. • In this example, the first field is group, which is just the year. • The second field is filtered_records.temperature, a reference to the temperature field of the filtered_records bag in the grouped_records relation. • MAX is a built-in function for calculating the maximum value of fields in a bag.
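Collected in one place, the example above forms a complete script (same file path and aliases as in the slides; a DUMP is added at the end to show the result):

```pig
-- Load the tab-delimited sample data
records = LOAD 'input/ncdc/micro-tab/sample.txt'
    AS (year:chararray, temperature:int, quality:int);

-- Drop missing temperatures and bad quality readings
filtered_records = FILTER records BY temperature != 9999
    AND (quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9);

-- One group per year, then the maximum temperature within each group
grouped_records = GROUP filtered_records BY year;
max_temp = FOREACH grouped_records GENERATE group,
    MAX(filtered_records.temperature);

DUMP max_temp;
```

For the five sample rows, DUMP max_temp prints (1949,111) and (1950,22).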
  7. 7. Pig Latin • Supports read-only data analysis workloads that are scan-centric; no transactions! • Fully nested data model. – Does not satisfy First normal form! – By definition will violate the other normal forms. • Extensive support for user-defined functions. – UDF as first class citizen. • Manages plain input files without any schema information. • A novel debugging environment.
  8. 8. Nested data/set model • The nested set model is a particular technique for representing nested sets (also known as trees or hierarchies) in relational databases.
  9. 9. Why Nested Data Model? • Closer to how programmers think and more natural to them. – E.g., To capture information about the positional occurrences of terms in a collection of documents, a programmer may create a structure of the form Idx<documentId, Set<positions>> for each term. – Normalization of the data creates two tables: Term_info: (TermId, termString, ….) Pos_info: (TermId, documentId, position) – Obtain positional occurrence by joining these two tables on TermId and grouping on <TermId, documentId>
  10. 10. Why Nested Data Model? • Data is often stored on disk in an inherently nested fashion. – A web crawler might output, for each url, the set of outlinks from that url. • A nested data model justifies a new algebraic language! • Easier adoption by programmers, because nesting makes it easier to write user-defined functions.
  11. 11. Dataflow Language • The user specifies a sequence of steps, where each step specifies only a single, high-level data transformation. This is similar to relational algebra, and procedural – desirable for programmers. • With SQL, the user instead specifies a set of declarative constraints. Non-procedural – desirable for non-programmers.
  12. 12. Dataflow Language: Example • A high level program that specifies a query execution plan. • Example: Suppose we have a table urls: (url, category, pagerank). The following is a simple SQL query that finds, for each sufficiently large category, the average pagerank of high-pagerank urls in that category. • In PigLatin:
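The SQL query and its Pig Latin translation did not survive the export to text; the well-known version of this example from the original Pig Latin paper looks roughly like this (with 10^6 written out as a literal):

```pig
-- SQL version, for comparison:
--   SELECT category, AVG(pagerank)
--   FROM urls WHERE pagerank > 0.2
--   GROUP BY category HAVING COUNT(*) > 10^6;

-- Pig Latin version: one named step per transformation
good_urls  = FILTER urls BY pagerank > 0.2;
groups     = GROUP good_urls BY category;
big_groups = FILTER groups BY COUNT(good_urls) > 1000000;
output     = FOREACH big_groups GENERATE group, AVG(good_urls.pagerank);
```

Each intermediate alias (good_urls, groups, big_groups) is a handle that can be inspected or reused, which is the point of the dataflow style.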
  13. 13. Lazy Execution • Database-style optimization by lazy processing of expressions. • Example: recall urls: (url, category, pagerank). Find the set of urls of pages that are classified as spam and yet have a high pagerank score.
     1. spam_urls = FILTER urls BY isSpam(url);
     2. culprit_urls = FILTER spam_urls BY pagerank > 0.8;
     • Optimized execution (apply the cheap pagerank comparison first, so the expensive isSpam UDF runs on far fewer tuples):
     1. highrank_urls = FILTER urls BY pagerank > 0.8;
     2. culprit_urls = FILTER highrank_urls BY isSpam(url);
  15. 15. Quick Start/Interoperability • To process a file, the user provides a function that gives Pig the ability to parse the content of the file into records. • Output of a Pig program is formatted based on a user-defined function. • Why don't conventional DBMSs do the same? (They require importing data into system-managed tables) – To enable transactional consistency guarantees, – To enable efficient point lookups (via RIDs), – To curate data on behalf of the user, and to record the schema so that other users can make sense of the data.
  16. 16. Pig Latin - Simple Data Types • Pig Latin statements work with relations, – A relation is a bag (outer bag) – A bag is a collection of tuples – A tuple is an ordered set of fields – A field can be any simple or complex data type • thus Pig supports a nested data model • Simple data types – int => signed 32 bit => 10 – long => signed 64 bit => 10L – float => 32 bit => 10.5f, 10.5e2f – double => 64 bit => 10.5, 10.5e2 – Arrays • chararray => string in UTF-8 => 'Hello World' • bytearray => byte array (blob)
  17. 17. Data Model • Consists of four types: – Atom: Contains a simple atomic value such as a string or a number, e.g., ‘Joe’. – Tuple: Sequence of fields, each of which might be any data type, e.g., (‘Joe’, ‘lakers’) – Bag: A collection of tuples with possible duplicates. Schema of a bag is flexible. – Map: A collection of data items, where each item has an associated key through which it can be looked up. Keys must be data atoms. Flexibility enables data to change without re-writing programs.
  18. 18. A Comparison with Relational Algebra • Pig Latin – Everything is a bag. – Dataflow language. • Relational Algebra – Everything is a table. – Dataflow language.
  19. 19. Pig Latin – NULL support • Same as the SQL definition: unknown or non-existent. • NULL can be used as a constant expression in place of an expression of any type. • If certain fields in the data are missing, it is the load/store function's responsibility to insert NULLs – E.g., the text loader returns NULL in place of empty strings in the data. • Operations that produce NULL: – Division by zero – Dereferencing a field or map key that does not exist – UDFs can return NULL • NULLs and operators: – Comparison, matches, cast, and dereferencing return NULL if one of the inputs is NULL – The AVG, MIN, MAX, and SUM functions ignore NULLs – The COUNT function counts values including NULLs – If a FILTER expression evaluates to NULL, the record is rejected.
  20. 20. Expressions in Pig Latin
  21. 21. Expressions
     A = LOAD 'data.txt' AS (f1:int, f2:bag{t:tuple(n1:int, n2:int)}, f3:map[]);
     -- sample tuple: (1, {(2,3), (4,6)}, ['yahoo'#'mail'])
     • Field referred to by name or by position: A.f1 or A.$0 = 1
     • Projection of a data item: A.f2 or A.$1 = {(2,3), (4,6)}; A.f2.$0 = {(2), (4)}
     • Map lookup: A.f3#'yahoo' = 'mail'
     • Function application: SUM(A.f2.$0) = 6; COUNT(A.f2) = 2L
     • A = name of an outer bag/relation. NOTE: the bag and tuple keywords in the schema are optional.
  22. 22. Comparison Operators
     a = (f1:int, f2:bag{t:tuple(n1:int, n2:int)}, f3:map[]);
     -- sample tuple: (1, {(2,3), (4,6)}, ['yahoo'#'mail'])
     • Numerical comparison (==, !=, >, >=, <, <=): f1 > 5
     • String comparison: f3#'yahoo' == 'mail'
     • Regular expression matching (matches): f3#'yahoo' matches '(?i)MAIL'
     • Logical operators (AND, OR, NOT): f1 == 1 AND f3#'yahoo' eq 'mail'
     • Conditional expression (aka bincond), (condition ? exp1 : exp2): f3#'yahoo' matches '(?i)MAIL' ? 'matched' : 'notmatched'
  23. 23. Pig Built-in Functions • Pig has a variety of built-in functions for each type – Storage • TextLoader: for loading unstructured text files. Each line is loaded as a tuple with a single field, which is the entire line. – Filter • isEmpty: tests whether a bag is empty – Eval Functions • COUNT: computes the number of elements in a bag • SUM: computes the sum of the numeric values in a single-column bag • AVG: computes the average of the numeric values in a single-column bag • MIN/MAX: computes the min/max of the numeric values in a single-column bag • SIZE: returns the size of any datum (e.g., a map) • CONCAT: concatenates two chararrays or two bytearrays • TOKENIZE: splits a string and outputs a bag of words • DIFF: compares the two fields of a tuple of size 2
  24. 24. Specifying Input Data • Use the LOAD command to specify the input data file. • The input file is query_log.txt. • The input file is converted into tuples using the myLoad deserializer. • Loaded tuples have 3 fields. • The USING and AS clauses are optional. – By default, a deserializer that expects plain-text, tab-delimited files is used. • With no schema, fields are referenced by position ($0, $1, …). • The return value, assigned to "queries", is a handle to a bag. – "queries" can be used as input to subsequent Pig Latin expressions. – Handles such as "queries" are logical. No data is actually read and no processing is carried out until an instruction explicitly asks for output (STORE). – Think of it as a "logical view".
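The LOAD statement these bullets describe did not survive the export; a sketch following the Pig Latin paper, with myLoad as the user-supplied deserializer:

```pig
queries = LOAD 'query_log.txt'
          USING myLoad()
          AS (userId, queryString, timestamp);
```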
  25. 25. FOREACH • Once input data file(s) have been specified through LOAD, one can specify the processing that needs to be carried out on the data. • One of the basic operations is that of applying some processing to every tuple of a data set. • This is achieved through the FOREACH command. For example: • The above command specifies that each tuple of the bag queries (loaded by previous command) should be processed independently to produce an output tuple. • The first field of the output tuple is the userId field of the input tuple, and the second field of the output tuple is the result of applying the UDF expandQuery to the queryString field of the input tuple.
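The FOREACH statement the slide refers to is missing from the export; reconstructed from the Pig Latin paper, where expandQuery is the UDF named in the text:

```pig
expanded_queries = FOREACH queries
                   GENERATE userId, expandQuery(queryString);
```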
  26. 26. Per-tuple Processing with FOREACH • Suppose the UDF expandQuery generates a bag of likely expansions of a given query string. • Then the above statement transforms each input tuple into an output tuple whose second field is such a bag of expansions. • Semantics: – No dependence between the processing of different tuples of the input => parallelism! – GENERATE can be followed by a list of any expressions.
  27. 27. FOREACH & Flattening • To eliminate nesting in data, use FLATTEN. • FLATTEN consumes a bag, extracts the fields of the tuples in the bag, and makes them fields of the tuple being output by GENERATE, removing one level of nesting.
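A sketch of the difference, continuing the query-expansion example (expandQuery is the UDF from the earlier slides):

```pig
-- Without FLATTEN: one output tuple per input tuple;
-- the second field is a bag of expansions
expanded_queries = FOREACH queries
                   GENERATE userId, expandQuery(queryString);

-- With FLATTEN: one output tuple per (userId, expansion) pair;
-- the bag's contents become top-level fields
expanded_queries = FOREACH queries
                   GENERATE userId, FLATTEN(expandQuery(queryString));
```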
  28. 28. Discarding Unwanted Data: FILTER • Identical to the select operator of relational algebra. • Syntax: – FILTER bag-id BY expression • An expression has the form: field-name op constant, or field-name op field-name, where either operand may also be a UDF, and op is one of ==, eq, !=, neq, <, >, <=, >= • A filtering condition may combine several comparisons with the boolean operators AND, OR, and NOT. • For example, we can get rid of bot traffic in the bag queries. • Since arbitrary expressions are allowed, it follows that we can use UDFs while filtering. • Thus, in our less ideal world, where bots don't identify themselves, we can use a sophisticated UDF (isBot) to perform the filtering.
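The two FILTER examples the slide alludes to, reconstructed from the Pig Latin paper (isBot is the UDF named in the text):

```pig
-- Drop traffic from users literally named 'bot'
real_queries = FILTER queries BY userId neq 'bot';

-- Bots that don't identify themselves: filter with a UDF instead
real_queries = FILTER queries BY NOT isBot(userId);
```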
  29. 29. A Comparison with Relational Algebra • Pig Latin – Everything is a bag. – Dataflow language. – FILTER is the same as the select operator. • Relational Algebra – Everything is a table. – Dataflow language. – The select operator is the same as the FILTER command.
  30. 30. Grouping related data • COGROUP groups together tuples from one or more data sets that are related in some way. • Example: – Suppose we have two data sets that we have specified through a LOAD command. – results contains, for different query strings, the urls shown as search results and the position at which they were shown. – revenue contains, for different query strings and different ad slots, the average amount of revenue made by the ad for that query string at that slot. – Then, to group together all search result data and revenue data for the same query string, we can write:
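The LOAD and COGROUP statements the slide describes are missing from the export; a sketch following the Pig Latin paper, with myLoad as the deserializer:

```pig
results = LOAD 'results.data' USING myLoad()
          AS (queryString, url, position);
revenue = LOAD 'revenue.data' USING myLoad()
          AS (queryString, adSlot, amount);

-- One output tuple per query string: (group, {results...}, {revenue...})
grouped_data = COGROUP results BY queryString,
                       revenue BY queryString;
```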
  31. 31. COGROUP • The output of a COGROUP contains one tuple for each group. – First field of the tuple, named group, is the group identifier. – Each of the next fields is a bag, one for each input being cogrouped, and is named the same as the alias of that input.
  32. 32. COGROUP is not JOIN • Grouping can be performed according to arbitrary expressions, which may include UDFs. • Grouping is different from JOIN. • JOIN is equivalent to COGROUP followed by taking a cross product of the tuples in the nested bags. While joins are widely applicable, certain custom processing might require access to the tuples of the groups before the cross-product is taken.
  33. 33. Example • Suppose we were trying to attribute search revenue to search-result urls to figure out the monetary worth of each url. We might have a sophisticated model for doing so. To accomplish this task in Pig Latin, we can follow the COGROUP with the following statement: • Where distributeRevenue is a UDF that accepts search results and revenue information for a query string at a time, and outputs a bag of urls and the revenue attributed to them. • For example, distributeRevenue might attribute revenue from the top slot entirely to the first search result, while the revenue from the side slot may be attributed equally to all the results.
  34. 34. Example… • Assign search revenue to search-result urls to figure out the monetary worth of each url. A UDF, distributeRevenue attributes revenue from the top slot entirely to the first search result, while the revenue from the side slot may be attributed equally to all the results.
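The statement both example slides refer to, reconstructed from the Pig Latin paper (distributeRevenue is the UDF described above):

```pig
url_revenues = FOREACH grouped_data GENERATE
               FLATTEN(distributeRevenue(results, revenue));
```

distributeRevenue sees the full results bag and revenue bag for each query string at once, which is exactly the access pattern a JOIN would destroy.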
  35. 35. WITH JOIN • To specify the same operation in SQL, one would have to join by queryString, then group by queryString, and then apply a custom aggregation function. • But while doing the join, the system would compute the cross product of the search and revenue information, which the custom aggregation function would then have to undo. • Thus, the whole process becomes quite inefficient, and the query becomes hard to read and understand.
  36. 36. Special Case of COGROUP: GROUP • A special case of COGROUP when there is only one data set involved. • Example: Find the total revenue for each query string. • In the second statement above, revenue.amount refers to a projection of the nested bag in the tuples of grouped_revenue. • Also, as in SQL, the AS clause is used to assign names to fields on the fly. • To group all tuples of a data set together (e.g., to compute the overall total revenue), one uses the syntax GROUP revenue ALL.
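The two statements the slide discusses, reconstructed from the Pig Latin paper:

```pig
grouped_revenue = GROUP revenue BY queryString;
query_revenues  = FOREACH grouped_revenue GENERATE
                  queryString, SUM(revenue.amount) AS totalRevenue;
```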
  37. 37. JOIN • Pig Latin supports equi-joins. • It is easy to verify that JOIN is only a syntactic shortcut for COGROUP followed by flattening. • The above join command is equivalent to:
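The JOIN and its COGROUP expansion, reconstructed from the Pig Latin paper (temp_var is just an intermediate alias):

```pig
join_result = JOIN results BY queryString,
                   revenue BY queryString;

-- Equivalent COGROUP followed by flattening:
temp_var    = COGROUP results BY queryString,
                      revenue BY queryString;
join_result = FOREACH temp_var GENERATE
              FLATTEN(results), FLATTEN(revenue);
```

FLATTEN over two bags in the same GENERATE produces their cross product, which is exactly the per-key join output.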
  38. 38. MapReduce in Pig Latin • With the GROUP and FOREACH statements, it is trivial to express a mapreduce program in Pig Latin. • Converting to our data-model terminology, a map function operates on one input tuple at a time, and outputs a bag of key-value pairs. • The reduce function then operates on all values for a key at a time to produce the final result. • The first line applies the map UDF to every tuple on the input, and flattens the bag of key value pairs that it produces. • We use the shorthand * as in SQL to denote that all the fields of the input tuples are passed to the map UDF. • Assuming the first field of the map output to be the key, the second statement groups by key. • The third statement then passes the bag of values for every key to the reduce UDF to obtain the final result.
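The three statements the bullets walk through, reconstructed from the Pig Latin paper (map and reduce stand for the user's UDFs, and input is the loaded relation):

```pig
map_result = FOREACH input GENERATE FLATTEN(map(*));
key_groups = GROUP map_result BY $0;
output     = FOREACH key_groups GENERATE reduce(*);
```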
  39. 39. Other Commands • Pig Latin has a number of other commands that are very similar to their SQL counterparts. These are: – UNION: Returns the union of two or more bags. – CROSS: Returns the cross product of two or more bags. – ORDER: Orders a bag by the specified field(s). – DISTINCT: Eliminates duplicate tuples in a bag. This command is just a shortcut for grouping the bag by all fields, and then projecting out the groups.
  40. 40. Asking for Output: STORE • The user can ask for the result of a Pig Latin expression sequence to be materialized to a file by issuing the STORE command. • The above command specifies that the bag query_revenues should be serialized to the file myoutput using the custom serializer myStore. • As with LOAD, the USING clause may be omitted for a default serializer that writes plain-text, tab-delimited files. • Pig also comes with a built-in serializer/deserializer that can load/store arbitrarily nested data.
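The STORE command the slide describes is missing from the export; per the Pig Latin paper:

```pig
STORE query_revenues INTO 'myoutput' USING myStore();
```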
  41. 41. Word Count using Pig
     myinput = LOAD 'input.txt' USING TextLoader() AS (text_line:chararray);
     words   = FOREACH myinput GENERATE FLATTEN(TOKENIZE(text_line));
     grouped = GROUP words BY $0;
     counts  = FOREACH grouped GENERATE group, COUNT(words);
     STORE counts INTO 'pigoutput' USING PigStorage();
     • The output is written to the HDFS files pigoutput/part-*.
  42. 42. Build Inverted Index • Load each file as lines of string:chararray • Associate filenames with their string representation • Union all the <filename, string> entries • For each entry, tokenize the string to generate <filename, word> tuples • Group by word – <word1, {(filename1, word1), (filename2, word1), …}> – For each group, take the records with distinct filenames from the associated bag – Generate <word1, {(filename1), (filename2), …}> • Store it
  43. 43. Build Inverted Index
     t1 = LOAD 'input1.txt' USING TextLoader() AS (string:chararray);
     t2 = FOREACH t1 GENERATE 'input1.txt' AS fname, string;
     t3 = LOAD 'input2.txt' USING TextLoader() AS (string:chararray);
     t4 = FOREACH t3 GENERATE 'input2.txt' AS fname, string;
     text = UNION t2, t4;
     words = FOREACH text GENERATE fname, FLATTEN(TOKENIZE(string));
     word_groups = GROUP words BY $1;
     index = FOREACH word_groups {   -- nested FOREACH
         files = DISTINCT $1.$0;
         GENERATE $0, COUNT(files), files;
     };
     STORE index INTO 'inverted_index' USING PigStorage();
  44. 44. End of session Day – 3: Apache Pig Data Operations