Pig workshop

Slides that I used for my Pig Workshop

Transcript of "Pig workshop"

  1. Pig Workshop
     Sudar Muthu
     http://sudarmuthu.com
     http://twitter.com/sudarmuthu
     https://github.com/sudar
  2. Who am I?
     - Research Engineer by profession
     - I mine useful information from data
     - You might recognize me from other HasGeek events
     - Blog at http://sudarmuthu.com
     - Build robots as a hobby ;)
  3. Special Thanks
     HasGeek
  4. What will I not cover?
  5. What will I not cover?
     - What is Big Data, or why it is needed?
     - What is MapReduce?
     - What is Hadoop?
     - Internal architecture of Pig
     http://sudarmuthu.com/blog/getting-started-with-hadoop-and-pig
  6. What will we see today?
  7. What will we see today?
     - What is Pig
     - How to use it
         - Loading and storing data
         - Pig Latin
         - SQL vs Pig
         - Writing UDFs
     - Debugging Pig scripts
     - Optimizing Pig scripts
     - When to use Pig
  8. So, all of you have Pig installed, right? ;)
  9. What is Pig?
     "Platform for analyzing large sets of data"
  10. Components of Pig
      - Pig Shell (Grunt)
      - Pig Language (Latin)
      - Libraries (Piggy Bank)
      - User Defined Functions (UDF)
  11. Why Pig?
      - It is a data flow language
      - Provides standard data processing operations
      - Insulates you from Hadoop's complexity
      - Abstracts MapReduce
      - Increases programmer productivity
      … but there are cases where Pig is not suitable.
  12. Pig Modes
  13. For this workshop, we will be using Pig only in local mode
  14. Getting to know your Pig shell
  15. pig -x local
      Similar to Python's shell
  16. Different ways of executing Pig scripts
      - Inline in the shell
      - From a file
      - Streaming through another executable
      - Embedding the script in other languages
  17. Loading and Storing data
      Pigs eat anything
  18. Loading Data into Pig
          file = LOAD 'data/dropbox-policy.txt' AS (line);
          data = LOAD 'data/tweets.csv' USING PigStorage(',');
          data = LOAD 'data/tweets.csv' USING PigStorage(',') AS (list, of, fields);
  19. Loading Data into Pig
      - PigStorage – for most cases
      - TextLoader – to load text files
      - JsonLoader – to load JSON files
      - Custom loaders – you can write your own custom loaders as well
  20. Viewing Data
          DUMP input;
      Very useful for debugging, but don't use it on huge datasets
  21. Storing Data from Pig
          STORE data INTO 'output_location';
          STORE data INTO 'output_location' USING PigStorage();
          STORE data INTO 'output_location' USING PigStorage(',');
          STORE data INTO 'output_location' USING BinStorage();
  22. Storing Data
      - Similar to `LOAD`, a lot of options are available
      - Can store locally or in HDFS
      - You can write your own custom storage function as well
  23. Load and Store example
          data = LOAD 'data/data-bag.txt' USING PigStorage(',');
          STORE data INTO 'data/output/load-store' USING PigStorage('|');
      https://github.com/sudar/pig-samples/load-store.pig
  24. Pig Latin
  25. Data Types
      - Scalar Types
      - Complex Types
  26. Scalar Types
      - int, long – (32, 64 bit) integer
      - float, double – (32, 64 bit) floating point
      - boolean (true/false)
      - chararray (String in UTF-8)
      - bytearray (blob) (DataByteArray in Java)
      If you don't specify a type, bytearray is used by default
  27. Complex Types
      - tuple – ordered set of fields (data)
      - bag – collection of tuples
      - map – set of key value pairs
  28. Tuple
      - Row with one or more fields
      - Fields can be of any data type
      - Ordering is important
      - Enclosed inside parentheses ()
      Eg:
          (Sudar, Muthu, Haris, Dinesh)
          (Sudar, 176, 80.2F)
  29. Bag
      - Collection of tuples
      - SQL equivalent is a table
      - Each tuple can have a different set of fields
      - Can have duplicates
      - Inner bag uses curly braces {}
      - Outer bag doesn't use anything
  30. Bag - Example
      Outer bag
          (1,2,3)
          (1,2,4)
          (2,3,4)
          (3,4,5)
          (4,5,6)
      https://github.com/sudar/pig-samples/data-bag.pig
  31. Bag - Example
      Inner bag
          (1,{(1,2,3),(1,2,4)})
          (2,{(2,3,4)})
          (3,{(3,4,5)})
          (4,{(4,5,6)})
      https://github.com/sudar/pig-samples/data-bag.pig
  32. Map
      - Set of key value pairs
      - Similar to HashMap in Java
      - Keys must be unique
      - Keys must be of the chararray data type
      - Values can be of any type
      - Key and value are separated by #
      - Map is enclosed by []
  33. Map - Example
          [name#sudar, height#176, weight#80.5F]
          [name#(sudar, muthu), height#176, weight#80.5F]
          [name#(sudar, muthu), languages#(Java, Pig, Python)]
  34. Null
      - Similar to SQL
      - Denotes that the value of a data element is unknown
      - Any data type can be null
  35. Schemas in Load statement
      We can specify a schema (a collection of data types) in `LOAD` statements
          data = LOAD 'data/data-bag.txt' USING PigStorage(',') AS (f1:int, f2:int, f3:int);
          data = LOAD 'data/nested-schema.txt' AS (f1:int, f2:bag{t:tuple(n1:int, n2:int)}, f3:map[]);
  36. Expressions
      Fields can be looked up by
      - Position
      - Name
      - Map lookup
  37. Expressions - Example
          data = LOAD 'data/nested-schema.txt' AS (f1:int, f2:bag{t:tuple(n1:int, n2:int)}, f3:map[]);
          by_pos = FOREACH data GENERATE $0;
          DUMP by_pos;
          by_field = FOREACH data GENERATE f2;
          DUMP by_field;
          by_map = FOREACH data GENERATE f3#'name';
          DUMP by_map;
      https://github.com/sudar/pig-samples/lookup.pig
  38. Operators
  39. Arithmetic Operators
      All the usual arithmetic operators are supported
      - Addition (+)
      - Subtraction (-)
      - Multiplication (*)
      - Division (/)
      - Modulo (%)
  40. Boolean Operators
      All the usual boolean operators are supported
      - AND
      - OR
      - NOT
  41. Comparison Operators
      All the usual comparison operators are supported
      - ==
      - !=
      - <
      - >
      - <=
      - >=
  42. Relational Operators
      - FOREACH
      - FLATTEN
      - GROUP
      - FILTER
      - COUNT
      - ORDER BY
      - DISTINCT
      - LIMIT
      - JOIN
  43. FOREACH
      Generates data transformations based on columns of data
          x = FOREACH data GENERATE *;
          x = FOREACH data GENERATE $0, $1;
          x = FOREACH data GENERATE $0 AS first, $1 AS second;
  44. FLATTEN
      Un-nests tuples and bags. Most of the time this results in a cross product
          (a, (b, c)) => (a, b, c)
          ({(a,b),(d,e)}) => (a,b) and (d,e)
          (a, {(b,c), (d,e)}) => (a, b, c) and (a, d, e)
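The third FLATTEN case above (a bag next to another field) can be modelled in plain Python to make the "cross product" behaviour concrete. This is only a sketch of the semantics, not how Pig implements it; `flatten_bag` is a hypothetical helper name, and a Python list of tuples stands in for a Pig bag.

```python
def flatten_bag(row, bag_index):
    """Model FLATTEN on a bag field: one output row per tuple in the bag.

    `row` is a tuple whose element at `bag_index` is a list of tuples
    (standing in for a Pig bag). The other fields are repeated on every
    output row, which is where the cross product comes from.
    """
    prefix = row[:bag_index]
    suffix = row[bag_index + 1:]
    return [prefix + inner + suffix for inner in row[bag_index]]


# (a, {(b,c), (d,e)}) => (a, b, c) and (a, d, e)
row = ("a", [("b", "c"), ("d", "e")])
print(flatten_bag(row, 1))
```

Note that a row whose bag is empty produces no output rows at all in this model, so flattening can silently drop records.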
  45. GROUP
      - Groups data in one or more relations
      - Groups tuples that have the same group key
      - Similar to the SQL GROUP BY operator
          outerbag = LOAD 'data/data-bag.txt' USING PigStorage(',') AS (f1:int, f2:int, f3:int);
          DUMP outerbag;
          innerbag = GROUP outerbag BY f1;
          DUMP innerbag;
      https://github.com/sudar/pig-samples/group-by.pig
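To see how GROUP turns the outer bag from slide 30 into the inner bags from slide 31, here is a small Python model of the operation. It is a sketch of the semantics only; `group_by` is a hypothetical helper, and the groups are sorted just to make the output easy to read (Pig itself does not promise any particular group order).

```python
from collections import defaultdict

# The outer bag from slide 30, as a list of tuples.
outerbag = [(1, 2, 3), (1, 2, 4), (2, 3, 4), (3, 4, 5), (4, 5, 6)]


def group_by(relation, key_index):
    """Model GROUP ... BY: return (group key, bag of matching tuples) pairs."""
    groups = defaultdict(list)
    for record in relation:
        groups[record[key_index]].append(record)
    # Sorted by key purely for readable output.
    return sorted(groups.items())


for key, bag in group_by(outerbag, 0):
    print(key, bag)
```

Each result pairs the group key with every original tuple that carried it, which matches the `(1,{(1,2,3),(1,2,4)})` shape shown on slide 31.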
  46. FILTER
      Selects tuples from a relation based on some condition
          data = LOAD 'data/data-bag.txt' USING PigStorage(',') AS (f1:int, f2:int, f3:int);
          DUMP data;
          filtered = FILTER data BY f1 == 1;
          DUMP filtered;
      https://github.com/sudar/pig-samples/filter-by.pig
  47. COUNT
      Counts the number of tuples in a relation
          data = LOAD 'data/data-bag.txt' USING PigStorage(',') AS (f1:int, f2:int, f3:int);
          grouped = GROUP data BY f2;
          counted = FOREACH grouped GENERATE group, COUNT(data);
          DUMP counted;
      https://github.com/sudar/pig-samples/count.pig
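The GROUP followed by COUNT pattern above is the Pig counterpart of SQL's `GROUP BY` with `COUNT(*)`. A minimal Python model, assuming `data-bag.txt` holds the five tuples shown on slide 30, looks like this; `count_by` is a hypothetical helper name.

```python
from collections import Counter

# Assumed contents of data/data-bag.txt (the outer bag from slide 30).
data = [(1, 2, 3), (1, 2, 4), (2, 3, 4), (3, 4, 5), (4, 5, 6)]


def count_by(relation, key_index):
    """Model GROUP BY a field followed by COUNT: (key, tuple count) pairs."""
    counts = Counter(record[key_index] for record in relation)
    return sorted(counts.items())


# Group on f2 (position 1) and count the tuples in each group.
print(count_by(data, 1))
```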
  48. 48. ORDER BySort a relation based on one or more fields. Similar to SQL order bydata = LOAD data/nested-sample.txt USING PigStorage(,) AS (f1:int, f2:int, f3:int);DUMP data;ordera = ORDER data BY f1 ASC;DUMP ordera;orderd = ORDER data BY f1 DESC;DUMP orderd;https://github.com/sudar/pig-samples/order-by.pig
  49. 49. DISTINCTRemoves duplicates from a relationdata = LOAD data/data-bag.txt USING PigStorage(,) AS (f1:int, f2:int, f3:int);DUMP data;unique = DISTINCT data;DUMP unique;https://github.com/sudar/pig-samples/distinct.pig
  50. 50. LIMITLimits the number of tuples in the output.data = LOAD data/data-bag.txt USING PigStorage(,) AS (f1:int, f2:int, f3:int);DUMP data;limited = LIMIT data 3;DUMP limited;https://github.com/sudar/pig-samples/limit.pig
  51. 51. JOINJoins relation based on a field. Both outer and innerjoins are supporteda = LOAD data/data-bag.txt USING PigStorage(,) AS (f1:int, f2:int, f3:int);DUMP a;b = LOAD data/simple-tuples.txt USING PigStorage(,) AS (t1:int, t2:int);DUMP b;joined = JOIN a by f1, b by t1;DUMP joined;https://github.com/sudar/pig-samples/join.pig
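The inner join above can be sketched in Python to show what the joined tuples look like: the fields of both inputs are concatenated for every pair of tuples whose keys match. The contents of `simple-tuples.txt` are not shown in the slides, so the values in `b` below are made up for illustration, and `inner_join` is a hypothetical helper.

```python
from collections import defaultdict

# Relation a: the outer bag from slide 30.
a = [(1, 2, 3), (1, 2, 4), (2, 3, 4), (3, 4, 5), (4, 5, 6)]
# Relation b: hypothetical (t1, t2) pairs, NOT the real simple-tuples.txt.
b = [(1, 10), (2, 20), (5, 50)]


def inner_join(left, left_key, right, right_key):
    """Model an inner JOIN: concatenate tuples whose key fields are equal."""
    index = defaultdict(list)
    for record in right:
        index[record[right_key]].append(record)
    return [l + r for l in left for r in index.get(l[left_key], [])]


# Equivalent in spirit to: joined = JOIN a BY f1, b BY t1;
for record in inner_join(a, 0, b, 0):
    print(record)
```

Tuples without a partner on the other side (keys 3 and 4 in `a`, key 5 in `b`) are dropped; an outer join would keep them and pad the missing side with nulls.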
  52. SQL vs Pig
      - From Table – LOAD file(s)
      - Select – FOREACH GENERATE
      - Where – FILTER BY
      - Group By – GROUP BY + FOREACH GENERATE
      - Having – FILTER BY
      - Order By – ORDER BY
      - Distinct – DISTINCT
  53. Let's see a complete example
      Count the number of words in a text file
      https://github.com/sudar/pig-samples/count-words.pig
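The linked script is not reproduced in the slides, but a word count in Pig typically follows four steps: load the lines, split each line into words, group identical words, and count each group. The Python sketch below mirrors those steps under that assumption; `count_words` is a hypothetical helper and splitting on whitespace is a simplification.

```python
from collections import Counter


def count_words(lines):
    """Count how often each whitespace-separated word occurs in `lines`."""
    # Split each line into words and flatten them into one stream,
    # then let Counter do the "group by word and count" step.
    words = (word for line in lines for word in line.split())
    return Counter(words)


sample = ["pig eats anything", "pig latin"]
for word, count in sorted(count_words(sample).items()):
    print(word, count)
```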
  54. Extending Pig - UDF
  55. Why UDF?
      - Do operations on more than one field
      - Do more than grouping and filtering
      - Programmer is comfortable
      - Want to reuse existing logic
      Traditionally UDFs could be written only in Java. Now other languages like Python are also supported
  56. Different types of UDFs
      - Eval functions
      - Filter functions
      - Load functions
      - Store functions
  57. Eval Functions
      - Can be used in FOREACH statements
      - Most common type of UDF
      - Can return simple types or tuples
          b = FOREACH a GENERATE udf.Function($0);
          b = FOREACH a GENERATE udf.Function($0, $1);
  58. Eval Functions
      - Extend the EvalFunc<T> class
      - The generic <T> should be the return type
      - Input comes as a Tuple
      - Should check for empty and null input
      - Override exec(), which should return the value
      - Override getArgToFuncMapping() to declare the argument mapping
      - Override outputSchema() to declare the output schema
  59. Using Java UDF in Pig Scripts
      - Create a jar file which contains your UDF classes
      - Register the jar at the top of the Pig script
      - Register other jars if needed
      - Define the UDF function
      - Use your UDF function
  60. Let's see an example which returns a string
      https://github.com/sudar/pig-samples/strip-quote.pig
  61. Let's see an example which returns a Tuple
      https://github.com/sudar/pig-samples/get-twitter-names.pig
  62. Filter Functions
      - Can be used in FILTER statements
      - Return a boolean value
      Eg:
          vim_tweets = FILTER data BY FromVim(StripQuote($6));
  63. Filter Functions
      - Extend FilterFunc, which is an EvalFunc<Boolean>
      - Should return a boolean
      - Input is the same as for EvalFunc<T>
      - Should check for empty and null input
      - Override getArgToFuncMapping() to declare the argument mapping
  64. Let's see an example which returns a Boolean
      https://github.com/sudar/pig-samples/from-vim.pig
  65. Error Handling in UDF
      - If the error affects only a particular row, return null.
      - If the error affects other rows but you can recover, throw an IOException.
      - If the error affects other rows and you can't recover, also throw an IOException. Pig and Hadoop will quit if there are many IOExceptions.
  66. Can we try to write some more UDFs?
  67. Writing UDF in other languages
  68. Streaming
  69. Streaming
      - The entire data set is passed through an external task
      - The external task can be written in any language
      - Even a shell script works
      - Uses the `STREAM` operator
  70. Stream through shell script
          data = LOAD 'data/tweets.csv' USING PigStorage(',');
          filtered = STREAM data THROUGH `cut -f6,8`;
          DUMP filtered;
      https://github.com/sudar/pig-samples/stream-shell-script.pig
  71. Stream through Python
          data = LOAD 'data/tweets.csv' USING PigStorage(',');
          filtered = STREAM data THROUGH `strip.py`;
          DUMP filtered;
      https://github.com/sudar/pig-samples/stream-python.pig
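The contents of `strip.py` are not shown in the slides, so the following is only a guess at what such a script might do: remove the double quotes around each field. By default Pig hands each tuple to the streaming command as one tab-separated line on standard input and reads tab-separated lines back from standard output, so the script just transforms lines. The names `strip_quotes` and `strip_stream` are hypothetical, and an in-memory stream replaces stdin so the sketch can run on its own.

```python
import io


def strip_quotes(field):
    """Remove leading and trailing double quotes from one field."""
    return field.strip('"')


def strip_stream(lines, out):
    """Read tab-separated lines, strip quotes from every field, write them out."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        out.write("\t".join(strip_quotes(f) for f in fields) + "\n")


# Demo on an in-memory stream. In a real streaming script the call would be
# strip_stream(sys.stdin, sys.stdout) after importing sys.
sample = io.StringIO('"sudar"\t"hello pig"\n')
result = io.StringIO()
strip_stream(sample, result)
print(result.getvalue(), end="")
```

Keeping the per-field logic in its own function makes it easy to test without Pig and to swap in a different transformation later.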
  72. Debugging Pig Scripts
      - DUMP is your friend, but use it with LIMIT
      - DESCRIBE – will print the schema names
      - ILLUSTRATE – will show the structure of the schema
      - In UDFs, we can use the warn() function. It supports up to 15 different debug levels
      - Use Penny - https://cwiki.apache.org/PIG/pennytoollibrary.html
  73. Optimizing Pig Scripts
      - Project early and often
      - Filter early and often
      - Drop nulls before a join
      - Prefer DISTINCT over GROUP BY
      - Use the right data structure
  74. Using Param substitution
      - -p key=value – substitutes a single key/value pair
      - -m file.ini – substitutes using an ini file
      - default – provides default values
      http://sudarmuthu.com/blog/passing-command-line-arguments-to-pig-scripts
  75. Problems that can be solved using Pig
      Anything data related
  76. When not to use Pig?
      - A lot of custom logic needs to be implemented
      - Need to do a lot of cross lookups
      - Data is mostly binary (processing image files)
      - Real-time processing of data is needed
  77. External Libraries
      - PiggyBank - https://cwiki.apache.org/PIG/piggybank.html
      - DataFu – LinkedIn Pig Library - https://github.com/linkedin/datafu
      - Elephant Bird – Twitter Pig Library - https://github.com/kevinweil/elephant-bird
  78. Useful Links
      - Pig homepage - http://pig.apache.org/
      - My blog about Pig - http://sudarmuthu.com/blog/category/hadoop-pig
      - Sample code – https://github.com/sudar/pig-samples
      - Slides – http://slideshare.net/sudar
  79. Thank you