
Pig workshop


Slides that I used for my Pig Workshop


  1. Pig Workshop. Sudar Muthu, http://sudarmuthu.com
  2. Who am I? Research Engineer by profession. I mine useful information from data. You might recognize me from other HasGeek events. I blog at http://sudarmuthu.com and build robots as a hobby ;)
  3. Special thanks: HasGeek
  4. What I will not cover
  5. What I will not cover: what BigData is, or why it is needed; what MapReduce is; what Hadoop is; the internal architecture of Pig
  6. What we will see today
  7. What we will see today: what Pig is; how to use it (loading and storing data, Pig Latin, SQL vs Pig, writing UDFs); debugging Pig scripts; optimizing Pig scripts; when to use Pig
  8. So, all of you have Pig installed, right? ;)
  9. What is Pig? “A platform for analyzing large sets of data”
  10. Components of Pig: the Pig shell (Grunt); the Pig language (Pig Latin); libraries (Piggy Bank); user defined functions (UDFs)
  11. Why Pig? It is a data flow language; it provides standard data processing operations; it insulates you from Hadoop's complexity; it abstracts MapReduce; it increases programmer productivity... but there are cases where Pig is not suitable.
  12. Pig modes
  13. For this workshop, we will use Pig only in local mode
  14. Getting to know your Pig shell
  15. pig -x local. Similar to Python's shell
  16. Different ways of executing Pig scripts: inline in the shell; from a file; streaming through another executable; embedding scripts in other languages
  17. Loading and storing data. Pigs eat anything
  18. Loading data into Pig: file = LOAD 'data/dropbox-policy.txt' AS (line); data = LOAD 'data/tweets.csv' USING PigStorage(','); data = LOAD 'data/tweets.csv' USING PigStorage(',') AS (list, of, fields);
  19. Loading data into Pig: PigStorage for most cases; TextLoader to load text files; JSONLoader to load JSON files; you can write your own custom loaders as well
  20. Viewing data: DUMP input; It is very useful for debugging, but don't use it on huge datasets
  21. Storing data from Pig: STORE data INTO 'output_location'; STORE data INTO 'output_location' USING PigStorage(); STORE data INTO 'output_location' USING PigStorage(','); STORE data INTO 'output_location' USING BinStorage();
  22. Storing data: similar to LOAD, a lot of options are available; you can store locally or in HDFS; you can write your own custom storage function as well
  23. Load and store example: data = LOAD 'data/data-bag.txt' USING PigStorage(','); STORE data INTO 'data/output/load-store' USING PigStorage('|');
  24. Pig Latin
  25. Data types: scalar types and complex types
  26. Scalar types: int, long (32- and 64-bit integers); float, double (32- and 64-bit floating point); boolean (true/false); chararray (a UTF-8 string); bytearray (a blob; DataByteArray in Java). If you don't specify anything, bytearray is used by default
  27. Complex types: tuple (an ordered set of fields); bag (a collection of tuples); map (a set of key/value pairs)
  28. Tuple: a row with one or more fields; fields can be of any data type; ordering is important; enclosed in parentheses (). E.g. (Sudar, Muthu, Haris, Dinesh) or (Sudar, 176, 80.2F)
  29. Bag: a set of tuples; the SQL equivalent is a table; each tuple can have a different set of fields; can contain duplicates; an inner bag uses curly braces {}; an outer bag doesn't use anything
  30. Bag example, outer bag: (1,2,3) (1,2,4) (2,3,4) (3,4,5) (4,5,6)
  31. Bag example, inner bag: (1,{(1,2,3),(1,2,4)}) (2,{(2,3,4)}) (3,{(3,4,5)}) (4,{(4,5,6)})
  32. Map: a set of key/value pairs; similar to HashMap in Java; keys must be unique and of type chararray; values can be of any type; key and value are separated by #; a map is enclosed in []
  33. Map example: [name#sudar, height#176, weight#80.5F] or [name#(sudar, muthu), height#176, weight#80.5F] or [name#(sudar, muthu), languages#(Java, Pig, Python)]
  34. Null: similar to SQL, null denotes that the value of a data element is unknown; any data type can be null
  35. Schemas in LOAD statements: we can specify a schema (a collection of data types) in LOAD statements. data = LOAD 'data/data-bag.txt' USING PigStorage(',') AS (f1:int, f2:int, f3:int); data = LOAD 'data/nested-schema.txt' AS (f1:int, f2:bag{t:tuple(n1:int, n2:int)}, f3:map[]);
  36. Expressions: fields can be looked up by position, by name, or by map lookup
  37. Expressions example: data = LOAD 'data/nested-schema.txt' AS (f1:int, f2:bag{t:tuple(n1:int, n2:int)}, f3:map[]); by_pos = FOREACH data GENERATE $0; DUMP by_pos; by_field = FOREACH data GENERATE f2; DUMP by_field; by_map = FOREACH data GENERATE f3#'name'; DUMP by_map;
  38. Operators
  39. Arithmetic operators: all the usual arithmetic operators are supported: addition (+), subtraction (-), multiplication (*), division (/), modulo (%)
  40. Boolean operators: all the usual boolean operators are supported: AND, OR, NOT
  41. Comparison operators: all the usual comparison operators are supported: ==, !=, <, >, <=, >=
  43. FOREACH: generates data transformations based on columns of data. x = FOREACH data GENERATE *; x = FOREACH data GENERATE $0, $1; x = FOREACH data GENERATE $0 AS first, $1 AS second;
  44. FLATTEN: un-nests tuples and bags; most of the time this results in a cross product. (a, (b, c)) => (a, b, c); ({(a,b),(d,e)}) => (a,b) and (d,e); (a, {(b,c),(d,e)}) => (a, b, c) and (a, d, e)
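The FLATTEN transformations above can be sketched as a script; this reuses the nested-schema file from the earlier slides and is only a sketch of the idea:

```pig
-- Sketch: un-nesting the bag field from the nested-schema example.
data = LOAD 'data/nested-schema.txt' AS (f1:int, f2:bag{t:tuple(n1:int, n2:int)}, f3:map[]);
-- Each tuple inside the bag f2 becomes its own output row, paired
-- with f1 (a cross product of f1 with the bag's contents).
flat = FOREACH data GENERATE f1, FLATTEN(f2);
DUMP flat;
```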
  45. GROUP: groups data in one or more relations; groups tuples that have the same group key; similar to the SQL GROUP BY operator. outerbag = LOAD 'data/data-bag.txt' USING PigStorage(',') AS (f1:int, f2:int, f3:int); DUMP outerbag; innerbag = GROUP outerbag BY f1; DUMP innerbag;
  46. FILTER: selects tuples from a relation based on some condition. data = LOAD 'data/data-bag.txt' USING PigStorage(',') AS (f1:int, f2:int, f3:int); DUMP data; filtered = FILTER data BY f1 == 1; DUMP filtered;
  47. COUNT: counts the number of tuples in a relation. data = LOAD 'data/data-bag.txt' USING PigStorage(',') AS (f1:int, f2:int, f3:int); grouped = GROUP data BY f2; counted = FOREACH grouped GENERATE group, COUNT(data); DUMP counted;
  48. ORDER BY: sorts a relation based on one or more fields; similar to the SQL ORDER BY. data = LOAD 'data/nested-sample.txt' USING PigStorage(',') AS (f1:int, f2:int, f3:int); DUMP data; ordera = ORDER data BY f1 ASC; DUMP ordera; orderd = ORDER data BY f1 DESC; DUMP orderd;
  49. DISTINCT: removes duplicates from a relation. data = LOAD 'data/data-bag.txt' USING PigStorage(',') AS (f1:int, f2:int, f3:int); DUMP data; unique = DISTINCT data; DUMP unique;
  50. LIMIT: limits the number of tuples in the output. data = LOAD 'data/data-bag.txt' USING PigStorage(',') AS (f1:int, f2:int, f3:int); DUMP data; limited = LIMIT data 3; DUMP limited;
  51. JOIN: joins relations based on a field; both outer and inner joins are supported. a = LOAD 'data/data-bag.txt' USING PigStorage(',') AS (f1:int, f2:int, f3:int); DUMP a; b = LOAD 'data/simple-tuples.txt' USING PigStorage(',') AS (t1:int, t2:int); DUMP b; joined = JOIN a BY f1, b BY t1; DUMP joined;
  52. SQL vs Pig: FROM table – LOAD file(s); SELECT – FOREACH GENERATE; WHERE – FILTER BY; GROUP BY – GROUP BY + FOREACH GENERATE; HAVING – FILTER BY; ORDER BY – ORDER BY; DISTINCT – DISTINCT
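As an illustration of this mapping, here is a SQL query and a rough Pig equivalent; the table, file, and field names are made up for the example:

```pig
-- SQL: SELECT f1, COUNT(*) FROM t WHERE f3 > 0 GROUP BY f1 ORDER BY f1;
t        = LOAD 'data/t.txt' USING PigStorage(',') AS (f1:int, f2:int, f3:int);
filtered = FILTER t BY f3 > 0;                                            -- WHERE
grouped  = GROUP filtered BY f1;                                          -- GROUP BY
counted  = FOREACH grouped GENERATE group AS f1, COUNT(filtered) AS cnt;  -- SELECT
result   = ORDER counted BY f1;                                           -- ORDER BY
DUMP result;
```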
  53. Let's see a complete example: count the number of words in a text file
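One plausible version of this word-count example (the input file is a placeholder, and TOKENIZE's simple whitespace splitting is a simplification):

```pig
lines   = LOAD 'data/dropbox-policy.txt' AS (line:chararray);
-- TOKENIZE splits each line into a bag of words; FLATTEN gives one row per word.
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS cnt;
DUMP counts;
```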
  54. Extending Pig: UDFs
  55. Why UDFs? To perform operations on more than one field; to do more than grouping and filtering; when the programmer is more comfortable writing code; to reuse existing logic. Traditionally, UDFs could be written only in Java; now other languages like Python are also supported
  56. Different types of UDFs: eval functions; filter functions; load functions; store functions
  57. Eval functions: can be used in FOREACH statements; the most common type of UDF; can return simple types or tuples. b = FOREACH a GENERATE udf.Function($0); b = FOREACH a GENERATE udf.Function($0, $1);
  58. Eval functions: extend the EvalFunc<T> class, where the generic <T> is the return type; input comes in as a Tuple; check for empty and null values in the input; implement exec(), which returns the value; override getArgToFuncMapping() to tell Pig about the argument mapping; override outputSchema() to tell Pig about the output schema
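A minimal EvalFunc along these lines might look like the sketch below, assuming Pig's Java API (pig.jar) on the classpath; the package, class name, and behavior are invented for illustration:

```java
package com.example.pig;  // hypothetical package

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Upper-cases its first argument; <String> is the UDF's return type.
public class ToUpper extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        // Check for empty and null input, as the slide advises.
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        return input.get(0).toString().toUpperCase();
    }
}
```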
  59. Using a Java UDF in Pig scripts: create a jar file containing your UDF classes; register the jar at the top of the Pig script; register other jars if needed; define the UDF function; use your UDF function
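Put together, the steps above might look like this in a script; the jar name, package, and class are hypothetical:

```pig
REGISTER 'myudfs.jar';                     -- jar containing your UDF classes
DEFINE ToUpper com.example.pig.ToUpper();  -- short alias for the UDF
data  = LOAD 'data/tweets.csv' USING PigStorage(',');
upper = FOREACH data GENERATE ToUpper($0);
DUMP upper;
```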
  60. Let's see an example that returns a string
  61. Let's see an example that returns a Tuple
  62. Filter functions: can be used in FILTER statements; return a boolean value. E.g.: vim_tweets = FILTER data BY FromVim(StripQuote($6));
  63. Filter functions: extend FilterFunc, which is an EvalFunc<Boolean>; should return a boolean; input is the same as for EvalFunc<T>; check for empty and null values in the input; override getArgToFuncMapping() to tell Pig about the argument mapping
  64. Let's see an example that returns a Boolean
  65. Error handling in UDFs: if the error affects only a particular row, return null; if the error affects other rows but you can recover, throw an IOException; if the error affects other rows and you can't recover, also throw an IOException. Pig and Hadoop will quit if there are too many IOExceptions.
  66. Can we try writing some more UDFs?
  67. Writing UDFs in other languages
  68. Streaming
  69. Streaming: the entire data set is passed through an external task; the external task can be written in any language; even a shell script works; uses the STREAM operator
  70. Stream through a shell script: data = LOAD 'data/tweets.csv' USING PigStorage(','); filtered = STREAM data THROUGH `cut -f6,8`; DUMP filtered;
  71. Stream through Python: data = LOAD 'data/tweets.csv' USING PigStorage(','); filtered = STREAM data THROUGH ``; DUMP filtered;
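The external command for STREAM is elided on the slide above; a streaming task in Python might look like the sketch below. Pig pipes each tuple as a tab-separated line on stdin, and each output line becomes a tuple; the field indices chosen here (5 and 7) are purely illustrative:

```python
import sys


def project(line, indices=(5, 7)):
    """Return the chosen fields of one tab-separated input record."""
    fields = line.rstrip("\n").split("\t")
    return "\t".join(fields[i] for i in indices if i < len(fields))


def stream(infile, outfile):
    """Emit one output line per input tuple, as Pig's STREAM expects."""
    for line in infile:
        outfile.write(project(line) + "\n")


if __name__ == "__main__":
    stream(sys.stdin, sys.stdout)
```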
  72. Debugging Pig scripts: DUMP is your friend, but use it with LIMIT; DESCRIBE prints the schema; ILLUSTRATE shows the structure of the schema with example data; in UDFs, we can use the warn() function, which supports up to 15 different debug levels; use Penny -
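The debugging aids above in one short sketch, reusing the earlier example file and field names:

```pig
data = LOAD 'data/data-bag.txt' USING PigStorage(',') AS (f1:int, f2:int, f3:int);
DESCRIBE data;            -- prints the schema of the relation
ILLUSTRATE data;          -- shows sample data flowing through each step
sample = LIMIT data 10;   -- keep DUMP small on big inputs
DUMP sample;
```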
  73. Optimizing Pig scripts: project early and often; filter early and often; drop nulls before a join; prefer DISTINCT over GROUP BY; use the right data structure
  74. Using param substitution: -p key=value substitutes a single key/value pair; -m file.ini substitutes using an ini file; %default provides default values
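A sketch of a parameterized script; the parameter name and file are examples. It could be run with something like `pig -x local -p input=data/tweets.csv script.pig`:

```pig
-- Fallback when -p input=... is not supplied on the command line:
%default input 'data/tweets.csv'
data = LOAD '$input' USING PigStorage(',');
DUMP data;
```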
  75. Problems that can be solved using Pig: anything data related
  76. When not to use Pig? When a lot of custom logic needs to be implemented; when you need to do a lot of cross lookups; when the data is mostly binary (e.g. processing image files); when real-time processing of data is needed
  77. External libraries: PiggyBank - ; DataFu (LinkedIn's Pig library) - ; Elephant Bird (Twitter's Pig library) -
  78. Useful links: Pig homepage - ; my blog posts about Pig - ; sample code - ; slides -
  79. Thank you