Apache pig presentation_siddharth_mathur
Small Introduction to MapReduce, overview of

Published in: Education, Technology
  1. CSC 5800: CSC 8710 Intelligent Data Management Systems: Big Algorithms and Tools
     Pig Latin: A Not-So-Foreign Language for Data Processing
  2. What we will be covering
     • Introduction
     • MapReduce Overview
     • Pig Overview
     • Pig Features
     • Pig Latin
     • Pig Debugger
     • Demo
  3. Introduction
     • Enormous data
     • Innovation critically depends on analyzing the terabytes of data collected every day
     • SQL can handle structured-data problems
     • Parallel database processing
       – The data is too large to be analyzed serially
       – It has to be analyzed in parallel
       – Shared-nothing clusters are the way to go
  4. Parallel DB Products
     • Teradata, Oracle RAC, Netezza
     • Expensive at web scale
     • Programmers have to write complex SQL queries; because of this, many prefer not to program declaratively
  5. Procedural programming
     • Map-Reduce programming model
     • It can easily perform a group-by aggregation in parallel over a cluster of machines
     • The programmer provides a map function, which acts as a filter or transformation
     • The reduce function performs the aggregation
     • Appealing to the programmer because only two high-level primitives (map and reduce) are needed to enable parallel processing
  6. MapReduce Overview
     • Programming model
       – Caters to large-scale data analytics
       – Runs on Hadoop
       – Java based
       – Splits data into independent chunks and processes them in parallel
     • Program structure
       – Mapper
       – Reducer
       – Driver program
  7. MapReduce Driver Program
     • Works as the ‘Main’ function for an MR job
     • Takes care of
       – Number of arguments
       – Input data location
       – Input data types
       – Output data location
       – Output data types
       – Number of Mappers
       – Number of Reducers
  8. Mapper and Reducer Class
     • Mapper class
       – Main task is to apply the user's processing logic to each input record
       – Performs tasks like:
         • Filtering
         • Splitting
         • Tokenizing
         • Transforming
     • Reducer class
       – Works as an aggregator
       – Aggregates the intermediate results gathered from the Mappers
  9. Word Count Execution
     Input (one line per Map task):
       the quick brown fox
       the fox ate the mouse
       how now brown cow
     Map output:
       Map 1: the, 1; quick, 1; brown, 1; fox, 1
       Map 2: the, 1; fox, 1; ate, 1; the, 1; mouse, 1
       Map 3: how, 1; now, 1; brown, 1; cow, 1
     Shuffle & Sort: pairs with the same word are routed to the same Reduce task
     Reduce output:
       Reduce 1: brown, 2; fox, 2; how, 1; now, 1; the, 3
       Reduce 2: ate, 1; cow, 1; mouse, 1; quick, 1
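The three phases of the word count flow can be imitated in a few lines of plain Python. This is only a single-process sketch of the idea, not Hadoop code; the function names `map_phase`, `shuffle` and `reduce_phase` are made up for illustration.

```python
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for every token in one input line."""
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    """Group the emitted values by key, as the shuffle-and-sort step does."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return dict(sorted(grouped.items()))

def reduce_phase(grouped):
    """Sum the list of counts collected for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

lines = ["the quick brown fox", "the fox ate the mouse", "how now brown cow"]

# Each line would be handled by its own Map task on a cluster.
pairs = [pair for line in lines for pair in map_phase(line)]
word_counts = reduce_phase(shuffle(pairs))
print(word_counts)
```

On a real cluster the grouped keys are spread over several Reduce tasks; here a single dictionary stands in for all of them.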
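  10. MapReduce Word Count Program
      import java.io.IOException;
      import java.util.StringTokenizer;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Mapper;
      import org.apache.hadoop.mapreduce.Reducer;

      // Enclosing class added so the nested classes compile; the driver (main) is not shown on the slide.
      public class WordCount {

          // Mapper: emits (word, 1) for every token of an input line.
          public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
              private final static IntWritable one = new IntWritable(1);
              private Text word = new Text();

              public void map(LongWritable key, Text value, Context context)
                      throws IOException, InterruptedException {
                  String line = value.toString();
                  StringTokenizer tokenizer = new StringTokenizer(line);
                  while (tokenizer.hasMoreTokens()) {
                      word.set(tokenizer.nextToken());
                      context.write(word, one);
                  }
              }
          }

          // Reducer: sums the counts collected for each word.
          public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
              public void reduce(Text key, Iterable<IntWritable> values, Context context)
                      throws IOException, InterruptedException {
                  int sum = 0;
                  for (IntWritable val : values) {
                      sum += val.get();
                  }
                  context.write(key, new IntWritable(sum));
              }
          }
      }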
  11. MapReduce Limitations
      • The one-input, two-stage data flow is extremely rigid
        – Tasks such as a join or an iterative job need a workaround
        – Common operations such as filtering, transformation and projection require custom code
        – The code is difficult to reuse and maintain
      • In addition, its own data types and workflow, and the need to learn Java, make it a tough choice to adopt
  12. Pig
      • An Apache open source project
      • Provides an engine for executing data flows in parallel on Hadoop
      • Includes a language called ‘Pig Latin’ for expressing these data flows
      • Pig Latin is a high-level data flow language
      • It offers the best of both worlds:
        – High-level declarative querying, like SQL
        – Low-level procedural programming, like MapReduce
  13. Hadoop Stack
      • Data Processing Layer: Pig, Hive, …, HBase, Hadoop MR
      • Resource Management Layer: Hadoop YARN
      • Storage Layer: HDFS
  14. Why Choose Pig
      • Written like SQL, compiled into MapReduce
      • Fully nested data model
      • Extensive support for UDFs
      • Can answer multiple questions in a single workflow
          A = load './input.txt';
          B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
          C = group B by word;
          D = foreach C generate COUNT(B), group;
          store D into './output';
  15. Features and Motivation
      • The design goal of Pig is to give programmers an appealing experience for ad-hoc analysis of extremely large data sets
        – Data flow language
        – Quick start and interoperability
        – Nested data model
        – UDFs
        – Debugging environment
  16. Data Flow Language
      • Each step specifies a single high-level data transformation
      • Different from SQL, where the whole computation is stated as one query producing a single result
      • Specifying the steps one by one still leaves the system room to optimize, for example by reordering the filters
        – Example:
            A = LOAD 'input.txt';
            B = FILTER A BY UDF(Column1);
            C = FILTER B BY Column1 > 0.8;
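Why reordering two filters can pay off is easy to see in a small model. The sketch below is plain Python, not Pig; `expensive_udf` is a hypothetical stand-in for a costly UDF and the sample values are invented. Both orders return the same rows, but running the cheap comparison first means the UDF is evaluated on fewer of them.

```python
calls = {"udf": 0}

def expensive_udf(value):
    """Hypothetical costly predicate; counts how often it is evaluated."""
    calls["udf"] += 1
    return round(value * 100) % 2 == 0

def cheap_predicate(value):
    """The inexpensive comparison from the second FILTER."""
    return value > 0.8

column1 = [0.10, 0.50, 0.82, 0.85, 0.90, 0.96]

# Order as written in the script: UDF filter first, then the comparison.
calls["udf"] = 0
as_written = [v for v in column1 if expensive_udf(v)]
as_written = [v for v in as_written if cheap_predicate(v)]
calls_as_written = calls["udf"]

# Reordered: comparison first, so the UDF only sees the surviving rows.
calls["udf"] = 0
reordered = [v for v in column1 if cheap_predicate(v)]
reordered = [v for v in reordered if expensive_udf(v)]
calls_reordered = calls["udf"]

print(as_written == reordered, calls_as_written, calls_reordered)
```

The reordering is only safe because both predicates are side-effect free and depend on the same input column.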
  17. Quick start and Interoperability
      • Data load
        – Supports ad-hoc analysis
        – Queries can run directly on data files, such as dumps from a search engine
        – The user only has to provide a function that tells Pig how to parse the file contents into tuples
        – Similarly for output
          • Any output format
          • These functions can be reused
          • Output can be used for visualization or loaded directly into Excel
  18. Pig as part of workflow
      • Pig easily becomes part of a workflow ecosystem
        – Can take most input types
        – Can produce output in many forms
        – Doesn’t take over the data, i.e., it does not lock the data that is being processed
        – Read-only data analysis
  19. Optional data schemas
      • A schema can be provided by the user:
        – In the beginning
        – On the fly
        – Example:
            A = LOAD 'input.txt' AS (Column1, Column2);
            B = FILTER A BY Column1 > 5;
      • If no schema is provided, the columns can be referred to as $0, $1, $2, … for the 1st, 2nd, 3rd column, etc.
        – Example:
            A = LOAD 'input.txt';
            B = FILTER A BY $0 > 5;
  20. Nested Data Model
      • Suppose that, for each term, we want to record the documents it occurs in and its positions
      • Format of output: Map<documentId, Set<position>>
      • SQL data model:
          Term | Document ID | Position
          Hi   | 1           | 2
          Hi   | 1           | 5
      • Or keep it in normalized form, i.e.,
        – term_info(termid, String)
        – position_info(termid, position, document)
  21. Problem resolved using Pig
      • In Pig, complex data types like map, tuple or bag can occur as a field of a table
      • Example:
          Term | Document ID | Position
          Hi   | 1           | (2,5,8..)
      • This approach is good because it is closer to how a programmer thinks
      • Data is often already stored on disk in a nested fashion
      • It makes writing UDFs easier
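The step from the flat SQL-style rows to the nested form can be sketched in plain Python. This is an illustration only: the helper name `nest_positions` is invented, and the third input row (position 8) is assumed so that the result matches the nested example.

```python
from collections import defaultdict

# Flat rows: (term, document id, position), one row per occurrence.
flat_rows = [("Hi", 1, 2), ("Hi", 1, 5), ("Hi", 1, 8)]

def nest_positions(rows):
    """Collect all positions of a (term, document) pair into one nested field."""
    nested = defaultdict(list)
    for term, doc_id, position in rows:
        nested[(term, doc_id)].append(position)
    return [(term, doc_id, tuple(positions))
            for (term, doc_id), positions in nested.items()]

nested_rows = nest_positions(flat_rows)
print(nested_rows)
```

One nested row now carries what took three flat rows, and a function that needs all positions of a term in a document receives them as a single value.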
  22. UDFs
      • A significant part of data analysis is custom processing
      • For example, a user might want to perform natural language stemming
      • Or check whether a page is spam, or many other tasks
      • To support this, Pig Latin has extensive support for UDFs; most such tasks can be handled with UDFs
      • A UDF can take non-atomic input and can also produce non-atomic output
      • Currently UDFs can be written in Java or Python
  23. Debugging Environment
      • In any language, getting a data processing program to work correctly usually takes many iterations
      • The first few iterations mostly produce errors
      • With large-scale data this results in serious waste of time and resources
      • Debuggers can help
      • Pig has a novel debugging environment
      • It generates concise examples from the input data
      • Data samples are carefully chosen to resemble the real data as far as possible
      • Where needed, sample data is specially constructed
  24. Pig Latin
      • The language in which data flow statements are written
      • It can be run interactively in a shell called ‘Grunt’
      • It has a shared repository named Piggybank
      • We can write custom UDFs and add them to Piggybank
  25. Data Model
      • Rich, yet simple data model
      • Atom
        – A simple atomic value like a string or a number
      • Tuple
        – A collection of fields, each of which can be of any data type
        – Analogous to a row in SQL
      • Bag
        – A collection of tuples
        – Can be heterogeneous: the tuples need not have the same number or types of fields
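  26. Data Model (cont.)
      • Example of a tuple with an atom, a tuple and a bag as its fields:
          T = ('alice', ('labours', 1), {('ipod', 2), ('james')})
          – 'alice' is an atom
          – ('labours', 1) is a tuple
          – {('ipod', 2), ('james')} is a bag
      • A tuple is written with round brackets
      • A bag is written with curly braces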
  27. Specifying Input Data: LOAD
      • It is the first step in a Pig Latin program
      • Specifies what the input files are
      • Specifies how their contents are to be deserialized, i.e., converted to the Pig data model
      • LOAD command
        – Example:
            queries = LOAD 'query_log.csv' USING PigStorage(',') AS (userId, queryString, timestamp);
  28. LOAD (cont.)
      • Both the ‘USING’ clause and the ‘AS’ clause are optional
      • We can work without them as shown earlier ($0 for the first field)
      • PigStorage is a pre-defined load function
      • A custom function can be used instead of PigStorage
  29. Per Tuple Processing: FOREACH
      • Similar to a FOR loop
      • It is used to apply some processing to each tuple of the dataset
      • Example:
          expanded_queries = FOREACH queries GENERATE userId, Expand(queryString), timestamp;
      • It is not a filtering command
      • ‘Expand’ is a UDF that takes atomic input and can generate a bag of outputs
  30. Per Tuple Processing: FOREACH (cont.)
      • The semantics of FOREACH are such that there is no dependency between different tuples of the input, which permits an efficient parallel implementation
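A small Python model of FOREACH shows both points: a UDF that turns one atom into a bag, and the independence between tuples. The `expand` function and the sample rows are hypothetical; the real Expand UDF is not defined in these slides.

```python
def expand(query_string):
    """Hypothetical UDF: return a bag (here a list) of expanded query strings."""
    return [query_string, query_string + " reviews"]

def foreach_generate(relation):
    """Apply the same expression to every tuple, independently of all others."""
    return [(user_id, expand(query_string), timestamp)
            for user_id, query_string, timestamp in relation]

queries = [("alice", "lakers", 1), ("bob", "ipod", 2), ("carol", "pig latin", 3)]

expanded_queries = foreach_generate(queries)

# Because no tuple depends on another, the input can be cut into chunks,
# processed separately (e.g. on different machines) and concatenated.
chunked = foreach_generate(queries[:2]) + foreach_generate(queries[2:])

# Flattening the bag gives one output tuple per expanded query string.
flattened = [(user_id, q, ts) for user_id, bag, ts in expanded_queries for q in bag]
print(flattened)
```

The chunked run giving the same result as the full run is what lets each tuple be handled by any Map task.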
  31. Discarding Unwanted Data: FILTER
      • Used like a WHERE clause
      • Any expression can be used as the condition
          real_queries = FILTER queries BY userId neq 'bot';
      • A UDF can be used as well, like
          real_queries = FILTER queries BY NOT Isbot(userId);
  32. COGROUP
      • Similar to JOIN
      • Groups tuples from different inputs together into bags, one bag per input for each key
      • The nested bags are convenient input for UDFs
          grouped_data = COGROUP results BY queryString, revenue BY queryString;
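What COGROUP produces can be modelled in plain Python as one output tuple per key, holding one bag per input. The function below is a sketch of those semantics only; the sample rows and their fields (url, position, ad slot, amount) are invented for illustration.

```python
def cogroup(left, right, key_index=0):
    """Return (key, bag_from_left, bag_from_right) for every key in either input."""
    keys = sorted({t[key_index] for t in left} | {t[key_index] for t in right})
    return [
        (key,
         [t for t in left if t[key_index] == key],
         [t for t in right if t[key_index] == key])
        for key in keys
    ]

# Hypothetical inputs keyed by query string.
results = [("lakers", "nba.com", 1), ("lakers", "espn.com", 2), ("kings", "nhl.com", 1)]
revenue = [("lakers", "top", 50), ("lakers", "side", 20), ("kings", "top", 30)]

grouped_data = cogroup(results, revenue)
for key, results_bag, revenue_bag in grouped_data:
    print(key, results_bag, revenue_bag)
```

Unlike a join, the two bags are kept apart, so a UDF can decide for itself how to combine them. The sorted key order is a choice made in this sketch, not something to rely on in Pig.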
  33. JOIN
      • Not all users need the full generality of COGROUP
      • Often a simple equi-join is all that is required
        – Example:
            join_result = JOIN results BY queryString, revenue BY queryString;
      • Other types of join are also supported:
        – Left outer
        – Right outer
        – Full outer
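The difference between an equi-join and a left outer join can be shown with a short Python model. It sketches the semantics only and says nothing about how Pig executes joins; the sample rows are invented, and `None` stands in for the nulls of an unmatched row.

```python
from itertools import product

def equi_join(left, right, left_key=0, right_key=0):
    """Concatenate every pair of tuples whose keys are equal."""
    return [l + r for l, r in product(left, right) if l[left_key] == r[right_key]]

def left_outer_join(left, right, left_key=0, right_key=0):
    """Like equi_join, but keep unmatched left tuples, padded with None."""
    width = len(right[0]) if right else 0
    joined = []
    for l in left:
        matches = [r for r in right if r[right_key] == l[left_key]]
        if matches:
            joined.extend(l + r for r in matches)
        else:
            joined.append(l + (None,) * width)
    return joined

results = [("lakers", "nba.com", 1), ("lakers", "espn.com", 2), ("kings", "nhl.com", 1)]
revenue = [("lakers", "top", 50), ("lakers", "side", 20)]

join_result = equi_join(results, revenue)
outer_result = left_outer_join(results, revenue)
print(len(join_result), len(outer_result))
```

The join key appears twice in each output tuple, once from each input, which mirrors how a joined relation keeps the fields of both sides.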
  34. Other Commands
      • Relational operators
        – UNION
        – CROSS
        – ORDER
        – DISTINCT
        – LIMIT
      • Eval functions
        – CONCAT
        – COUNT
        – DIFF
  35. PARALLEL clause
      • It is used to increase the parallelism of a job
      • We can specify the number of reduce tasks of the MR jobs created by Pig
      • It only affects the reduce tasks
      • It gives no control over the map tasks
      • The system can also work out the number of reducers itself
      • In most cases only one reduce task is required
  36. PARALLEL clause (cont.)
      • Can be applied only to those commands that have a reduce phase
        – COGROUP
        – CROSS
        – DISTINCT
        – GROUP
        – JOIN
        – ORDER
          A = LOAD 'File1';
          B = LOAD 'File2';
          C = CROSS A, B PARALLEL 10;
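The effect of asking for 10 reduce tasks can be pictured as spreading keys over 10 buckets. The Python sketch below assumes hash partitioning by key, which is how Hadoop assigns keys to reducers by default; CRC32 is used here only as a deterministic stand-in for the hash function, and the sample pairs are invented.

```python
import zlib

def reducer_for(key, num_reducers):
    """Pick a reduce task for a key; CRC32 is a stand-in for the real hash."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def partition(pairs, num_reducers):
    """Distribute (key, value) pairs over num_reducers buckets by key."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[reducer_for(key, num_reducers)].append((key, value))
    return buckets

pairs = [("the", 1), ("fox", 1), ("the", 1), ("cow", 1), ("fox", 1), ("the", 1)]
buckets = partition(pairs, num_reducers=10)

for index, bucket in enumerate(buckets):
    if bucket:
        print(index, bucket)
```

All pairs with the same key land in the same bucket, which is why grouping still works however many reducers are requested.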
  37. SPLIT clause
      • We can split a relation into several relations by providing conditions
          A = LOAD 'data' AS (F1:int, F2:int, F3:int);
          A contains:
            (1,2,3)
            (2,5,7)
          SPLIT A INTO B IF F1<7, C IF F2==5;
          B: (1,2,3) (2,5,7)
          C: (2,5,7)
      • Any expression can be used as a condition
      • UDFs can be used
      • It is not partitioning: a tuple can go to more than one output, or to none
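That SPLIT is not a partitioning can be checked with a small Python model. The rows and conditions below are illustrative assumptions chosen so that one tuple matches both conditions and one matches neither.

```python
def split(relation, conditions):
    """Send each tuple to every output whose condition it satisfies."""
    outputs = {name: [] for name in conditions}
    for row in relation:
        for name, condition in conditions.items():
            if condition(row):
                outputs[name].append(row)
    return outputs

# Rows are (F1, F2, F3).
a = [(1, 2, 3), (2, 5, 7), (8, 9, 1)]

out = split(a, {
    "B": lambda row: row[0] < 7,   # F1 < 7
    "C": lambda row: row[1] == 5,  # F2 == 5
})
print(out["B"], out["C"])
```

The tuple (2, 5, 7) ends up in both B and C, while (8, 9, 1) ends up in neither.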
  38. Output
      • There are two ways to produce output
        – STORE
          • Use it to store the output in a given location
              STORE output_1 INTO 'hadoopuser/output';
        – DUMP
          • Used to display the result in the Grunt shell itself
          • Dumping doesn’t store the output anywhere
              DUMP query_result;
  39. Building a Logical Plan
      • The Pig interpreter first parses all the commands the client issues
      • It verifies that the input files, bags or columns referred to by each command are valid
      • It builds a logical plan for every bag the user defines
      • No processing is carried out at this point
      • Processing is triggered when the user invokes a STORE or DUMP command
      • This is called lazy execution
      • It helps with optimizations such as FILTER reordering
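Lazy execution can be imitated in a few lines of Python: each operator only appends a step to a plan, and nothing runs until `dump` is called. The `Relation` class and its methods are invented for this sketch and are not Pig's internal API.

```python
class Relation:
    """A relation defined by a plan of steps that runs only on dump()."""

    def __init__(self, plan):
        self.plan = plan  # list of (operator, argument) steps

    @classmethod
    def load(cls, rows):
        return cls([("LOAD", rows)])

    def filter(self, predicate):
        return Relation(self.plan + [("FILTER", predicate)])

    def foreach(self, func):
        return Relation(self.plan + [("FOREACH", func)])

    def dump(self):
        """Execute the recorded plan step by step and return the rows."""
        rows = []
        for operator, argument in self.plan:
            if operator == "LOAD":
                rows = list(argument)
            elif operator == "FILTER":
                rows = [row for row in rows if argument(row)]
            elif operator == "FOREACH":
                rows = [argument(row) for row in rows]
        return rows

evaluated = []

def is_large(row):
    evaluated.append(row)  # record that real work happened
    return row[0] > 5

a = Relation.load([(3,), (7,), (9,)])
b = a.filter(is_large)
c = b.foreach(lambda row: (row[0] * 2,))
steps_before_dump = len(evaluated)  # still 0: only the plan exists so far

result = c.dump()
print(steps_before_dump, result)
```

Since the whole plan is known before anything executes, an optimizer has the chance to rearrange steps first, for instance by moving a filter earlier.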
  40. Debugging Environment
      • It is used to avoid running the complete code on the entire dataset
      • A user could create sample data by hand
      • But such datasets are difficult to tailor, and the user ends up with made-up data
      • Pig Pen is Pig’s debugging environment
      • It automatically creates a side dataset, called the sandbox dataset
      • Pig Pen has its own user interface
  41. Pig Pen
      • Outputs can be easily analyzed
      • Errors can be rectified earlier
  42. Future Work
      • User interface
        – A drag-and-drop style would help
        – Creating the logical plan diagram would be made easy
      • UDF support for other languages
      • Unified environment
        – Currently lacks control structures like loops
        – Pig Latin has to be embedded in another language for all iterative tasks
  43. Summary
      • A not-so-foreign language
      • Aims for a sweet spot between SQL and MapReduce
      • Reusable and easy to use
      • Novel debugging environment: Pig Pen
      • Pig has an active and growing user base at Yahoo!
      • Pigs
        – Eat anything
        – Live anywhere
        – Are domestic animals
