Pig workshop

Slides that I used for my Pig Workshop


  • 1. Pig Workshop
    Sudar Muthu
    http://sudarmuthu.com
    http://twitter.com/sudarmuthu
    https://github.com/sudar
  • 2. Who am I?
    - Research Engineer by profession
    - I mine useful information from data
    - You might recognize me from other HasGeek events
    - Blog at http://sudarmuthu.com
    - Builds robots as a hobby ;)
  • 3. Special Thanks
    HasGeek
  • 4. What I will not cover?
  • 5. What I will not cover?
    - What is BigData, or why it is needed?
    - What is MapReduce?
    - What is Hadoop?
    - Internal architecture of Pig
    http://sudarmuthu.com/blog/getting-started-with-hadoop-and-pig
  • 6. What we will see today?
  • 7. What we will see today?
    - What is Pig
    - How to use it
        - Loading and storing data
        - Pig Latin
        - SQL vs Pig
        - Writing UDFs
    - Debugging Pig Scripts
    - Optimizing Pig Scripts
    - When to use Pig
  • 8. So, all of you have Pig installed right? ;)
  • 9. What is Pig?
    “Platform for analyzing large sets of data”
  • 10. Components of Pig
    - Pig Shell (Grunt)
    - Pig Language (Latin)
    - Libraries (Piggy Bank)
    - User Defined Functions (UDF)
  • 11. Why Pig?
    - It is a data flow language
    - Provides standard data processing operations
    - Insulates Hadoop complexity
    - Abstracts Map Reduce
    - Increases programmer productivity
    … but there are cases where Pig is not suitable.
  • 12. Pig Modes
  • 13. For this workshop, we will be using Pig only in local mode
  • 14. Getting to know your Pig shell
  • 15. pig -x local
    Similar to Python’s shell
  • 16. Different ways of executing Pig Scripts
    - Inline in shell
    - From a file
    - Streaming through other executable
    - Embed script in other languages
  • 17. Loading and Storing data
    Pigs eat anything
  • 18. Loading Data into Pig
    file = LOAD 'data/dropbox-policy.txt' AS (line);
    data = LOAD 'data/tweets.csv' USING PigStorage(',');
    data = LOAD 'data/tweets.csv' USING PigStorage(',') AS (list, of, fields);
  • 19. Loading Data into Pig
    - PigStorage – for most cases
    - TextLoader – to load text files
    - JsonLoader – to load JSON files
    - Custom loaders – you can write your own custom loaders as well
  • 20. Viewing Data
    DUMP input;
    Very useful for debugging, but don’t use it on huge datasets
  • 21. Storing Data from Pig
    STORE data INTO 'output_location';
    STORE data INTO 'output_location' USING PigStorage();
    STORE data INTO 'output_location' USING PigStorage(',');
    STORE data INTO 'output_location' USING BinStorage();
  • 22. Storing Data
    - Similar to `LOAD`, a lot of options are available
    - Can store locally or in HDFS
    - You can write your own custom Storage as well
  • 23. Load and Store example
    data = LOAD 'data/data-bag.txt' USING PigStorage(',');
    STORE data INTO 'data/output/load-store' USING PigStorage('|');
    https://github.com/sudar/pig-samples/load-store.pig
  • 24. Pig Latin
  • 25. Data Types
    - Scalar Types
    - Complex Types
  • 26. Scalar Types
    - int, long – (32, 64 bit) integer
    - float, double – (32, 64 bit) floating point
    - boolean (true/false)
    - chararray (String in UTF-8)
    - bytearray (blob) (DataByteArray in Java)
    If you don’t specify anything, bytearray is used by default
  • 27. Complex Types
    - tuple – ordered set of fields
    - (data) bag – collection of tuples
    - map – set of key value pairs
  • 28. Tuple
    - Row with one or more fields
    - Fields can be of any data type
    - Ordering is important
    - Enclosed inside parentheses ()
    Eg:
    (Sudar, Muthu, Haris, Dinesh)
    (Sudar, 176, 80.2F)
  • 29. Bag
    - Set of tuples
    - SQL equivalent is Table
    - Each tuple can have a different set of fields
    - Can have duplicates
    - Inner bag uses curly braces {}
    - Outer bag doesn’t use anything
  • 30. Bag - Example
    Outer bag
    (1,2,3)
    (1,2,4)
    (2,3,4)
    (3,4,5)
    (4,5,6)
    https://github.com/sudar/pig-samples/data-bag.pig
  • 31. Bag - Example
    Inner bag
    (1,{(1,2,3),(1,2,4)})
    (2,{(2,3,4)})
    (3,{(3,4,5)})
    (4,{(4,5,6)})
    https://github.com/sudar/pig-samples/data-bag.pig
  • 32. Map
    - Set of key value pairs
    - Similar to HashMap in Java
    - Key must be unique
    - Key must be of chararray data type
    - Values can be any type
    - Key/value is separated by #
    - Map is enclosed by []
  • 33. Map - Example
    [name#sudar, height#176, weight#80.5F]
    [name#(sudar, muthu), height#176, weight#80.5F]
    [name#(sudar, muthu), languages#(Java, Pig, Python)]
  • 34. Null
    - Similar to SQL
    - Denotes that value of data element is unknown
    - Any data type can be null
  • 35. Schemas in Load statement
    We can specify a schema (collection of datatypes) in `LOAD` statements
    data = LOAD 'data/data-bag.txt' USING PigStorage(',') AS (f1:int, f2:int, f3:int);
    data = LOAD 'data/nested-schema.txt' AS (f1:int, f2:bag{t:tuple(n1:int, n2:int)}, f3:map[]);
  • 36. Expressions
    Fields can be looked up by
    - Position
    - Name
    - Map Lookup
  • 37. Expressions - Example
    data = LOAD 'data/nested-schema.txt' AS (f1:int, f2:bag{t:tuple(n1:int, n2:int)}, f3:map[]);
    by_pos = FOREACH data GENERATE $0;
    DUMP by_pos;
    by_field = FOREACH data GENERATE f2;
    DUMP by_field;
    by_map = FOREACH data GENERATE f3#'name';
    DUMP by_map;
    https://github.com/sudar/pig-samples/lookup.pig
  • 38. Operators
  • 39. Arithmetic Operators
    All usual arithmetic operators are supported
    - Addition (+)
    - Subtraction (-)
    - Multiplication (*)
    - Division (/)
    - Modulo (%)
  • 40. Boolean Operators
    All usual boolean operators are supported
    - AND
    - OR
    - NOT
  • 41. Comparison Operators
    All usual comparison operators are supported
    ==  !=  <  >  <=  >=
  • 43. FOREACH
    Generates data transformations based on columns of data
    x = FOREACH data GENERATE *;
    x = FOREACH data GENERATE $0, $1;
    x = FOREACH data GENERATE $0 AS first, $1 AS second;
  • 44. FLATTEN
    Un-nests tuples and bags. Most of the time it results in a cross product
    (a, (b, c)) => (a,b,c)
    ({(a,b),(d,e)}) => (a,b) and (d,e)
    (a, {(b,c), (d,e)}) => (a, b, c) and (a, d, e)
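The un-nesting rules on the FLATTEN slide can be modelled in plain Python. This is only a sketch of the semantics, not how Pig implements it: tuples are Python tuples, bags are Python lists, and `flatten_field` is a hypothetical helper name.

```python
def flatten_field(row, index):
    """Model of FLATTEN applied to the field at `index` of one row.

    A tuple field is spliced into the row; a bag field (a list of tuples)
    yields one output row per inner tuple.
    """
    before, field, after = row[:index], row[index], row[index + 1:]
    if isinstance(field, tuple):      # (a, (b, c)) => (a, b, c)
        return [before + field + after]
    if isinstance(field, list):       # (a, {(b,c),(d,e)}) => (a,b,c) and (a,d,e)
        return [before + inner + after for inner in field]
    return [row]                      # scalar fields are left untouched


print(flatten_field(("a", ("b", "c")), 1))
print(flatten_field(("a", [("b", "c"), ("d", "e")]), 1))
```

Flattening a bag repeats the remaining fields once per inner tuple, which is where the cross-product behaviour comes from.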
  • 45. GROUP
    - Groups data in one or more relations
    - Groups tuples that have the same group key
    - Similar to SQL group by operator
    outerbag = LOAD 'data/data-bag.txt' USING PigStorage(',') AS (f1:int, f2:int, f3:int);
    DUMP outerbag;
    innerbag = GROUP outerbag BY f1;
    DUMP innerbag;
    https://github.com/sudar/pig-samples/group-by.pig
  • 46. FILTER
    Selects tuples from a relation based on some condition
    data = LOAD 'data/data-bag.txt' USING PigStorage(',') AS (f1:int, f2:int, f3:int);
    DUMP data;
    filtered = FILTER data BY f1 == 1;
    DUMP filtered;
    https://github.com/sudar/pig-samples/filter-by.pig
  • 47. COUNT
    Counts the number of tuples in a bag
    data = LOAD 'data/data-bag.txt' USING PigStorage(',') AS (f1:int, f2:int, f3:int);
    grouped = GROUP data BY f2;
    counted = FOREACH grouped GENERATE group, COUNT(data);
    DUMP counted;
    https://github.com/sudar/pig-samples/count.pig
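A rough Python model of the GROUP and COUNT slides, assuming the input file holds the five rows shown in the outer bag example. `group_by` is a hypothetical helper; Pig does not promise any particular group order, so the sorting here is only for readable output.

```python
from collections import defaultdict

# Same rows as the outer bag example: (f1, f2, f3)
data = [(1, 2, 3), (1, 2, 4), (2, 3, 4), (3, 4, 5), (4, 5, 6)]


def group_by(rows, key_index):
    """Model of GROUP ... BY: returns (group, bag) pairs sorted by key."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append(row)
    return sorted(groups.items())


innerbag = group_by(data, 0)                          # GROUP data BY f1
grouped = group_by(data, 1)                           # GROUP data BY f2
counted = [(key, len(bag)) for key, bag in grouped]   # GENERATE group, COUNT(data)

print(innerbag)
print(counted)
```

Each group carries its own inner bag of matching tuples, and COUNT is then applied to that bag, one group at a time.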
  • 48. ORDER BY
    Sorts a relation based on one or more fields. Similar to SQL order by
    data = LOAD 'data/nested-sample.txt' USING PigStorage(',') AS (f1:int, f2:int, f3:int);
    DUMP data;
    ordera = ORDER data BY f1 ASC;
    DUMP ordera;
    orderd = ORDER data BY f1 DESC;
    DUMP orderd;
    https://github.com/sudar/pig-samples/order-by.pig
  • 49. DISTINCT
    Removes duplicates from a relation
    data = LOAD 'data/data-bag.txt' USING PigStorage(',') AS (f1:int, f2:int, f3:int);
    DUMP data;
    unique = DISTINCT data;
    DUMP unique;
    https://github.com/sudar/pig-samples/distinct.pig
  • 50. LIMIT
    Limits the number of tuples in the output.
    data = LOAD 'data/data-bag.txt' USING PigStorage(',') AS (f1:int, f2:int, f3:int);
    DUMP data;
    limited = LIMIT data 3;
    DUMP limited;
    https://github.com/sudar/pig-samples/limit.pig
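FILTER, ORDER BY, DISTINCT and LIMIT each map onto a one-line Python expression. The rows below are hypothetical (one is duplicated on purpose so DISTINCT has something to remove), and the sketch only models the results, not how Pig executes the operators.

```python
# Hypothetical rows (f1, f2, f3); (1, 2, 3) appears twice on purpose.
data = [(3, 4, 5), (1, 2, 3), (1, 2, 4), (1, 2, 3), (2, 3, 4)]

# FILTER data BY f1 == 1
filtered = [row for row in data if row[0] == 1]

# ORDER data BY f1 ASC / ORDER data BY f1 DESC
ordera = sorted(data, key=lambda row: row[0])
orderd = sorted(data, key=lambda row: row[0], reverse=True)

# DISTINCT data (sorted only to make the printed result stable)
unique = sorted(set(data))

# LIMIT data 3
limited = data[:3]

for name, rows in [("filtered", filtered), ("ordera", ordera),
                   ("orderd", orderd), ("unique", unique),
                   ("limited", limited)]:
    print(name, rows)
```

One difference worth keeping in mind: a Python slice always returns the first three rows, while Pig's LIMIT on an unordered relation may return any three.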
  • 51. JOIN
    Joins relations based on a field. Both outer and inner joins are supported
    a = LOAD 'data/data-bag.txt' USING PigStorage(',') AS (f1:int, f2:int, f3:int);
    DUMP a;
    b = LOAD 'data/simple-tuples.txt' USING PigStorage(',') AS (t1:int, t2:int);
    DUMP b;
    joined = JOIN a BY f1, b BY t1;
    DUMP joined;
    https://github.com/sudar/pig-samples/join.pig
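The inner join on the JOIN slide can be sketched as a hash join in Python. The contents of `a` and `b` are made up for illustration (the real sample files may differ), and `inner_join` is a hypothetical helper.

```python
# Hypothetical contents: a has (f1, f2, f3), b has (t1, t2).
a = [(1, 2, 3), (1, 2, 4), (2, 3, 4), (3, 4, 5)]
b = [(1, 10), (2, 20), (5, 50)]


def inner_join(left, left_key, right, right_key):
    """Model of JOIN left BY key, right BY key (inner join)."""
    index = {}
    for row in right:
        index.setdefault(row[right_key], []).append(row)
    # Every matching pair produces one row holding the fields of both sides.
    return [l + r for l in left for r in index.get(l[left_key], [])]


joined = inner_join(a, 0, b, 0)   # JOIN a BY f1, b BY t1
print(joined)
```

Rows without a partner on the other side (f1 = 3 and t1 = 5 here) drop out of an inner join; an outer join would keep them and fill the missing side with nulls.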
  • 52. SQL vs Pig
    - From Table – Load file(s)
    - Select – FOREACH GENERATE
    - Where – FILTER BY
    - Group By – GROUP BY + FOREACH GENERATE
    - Having – FILTER BY
    - Order By – ORDER BY
    - Distinct – DISTINCT
  • 53. Let’s see a complete example
    Count the number of words in a text file
    https://github.com/sudar/pig-samples/count-words.pig
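The word count can be pictured as the same pipeline in Python: split each line into words, flatten them into one word per row, group by word, and count each group. The input lines are hypothetical, and the linked Pig script may be structured differently; a common way to write this in Pig Latin uses TOKENIZE, FLATTEN, GROUP and COUNT.

```python
from collections import Counter

# Hypothetical input; the workshop script reads a text file instead.
lines = ["pigs eat anything", "pigs fly"]

# Tokenise and flatten: one word per output row.
words = [word for line in lines for word in line.split()]

# Group by word, then count the size of each group.
counts = Counter(words)

for word, count in sorted(counts.items()):
    print(word, count)
```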
  • 54. Extending Pig - UDF
  • 55. Why UDF?
    - Do operations on more than one field
    - Do more than grouping and filtering
    - Programmer is comfortable
    - Want to reuse existing logic
    Traditionally, UDFs could be written only in Java. Now other languages like Python are also supported
  • 56. Different types of UDFs
    - Eval functions
    - Filter functions
    - Load functions
    - Store functions
  • 57. Eval Functions
    - Can be used in FOREACH statements
    - Most common type of UDF
    - Can return simple types or Tuples
    b = FOREACH a GENERATE udf.Function($0);
    b = FOREACH a GENERATE udf.Function($0, $1);
  • 58. Eval Functions
    - Extend the EvalFunc<T> class
    - The generic <T> should contain the return type
    - Input comes as a Tuple
    - Should check for empty and nulls in input
    - Override the exec() function; it should return the value
    - Override getArgToFuncMapping() to let the UDF know about argument mapping
    - Override outputSchema() to let the UDF know about the output schema
  • 59. Using Java UDF in Pig Scripts
    - Create a jar file which contains your UDF classes
    - Register the jar at the top of the Pig script
    - Register other jars if needed
    - Define the UDF function
    - Use your UDF function
  • 60. Let’s see an example which returns a string
    https://github.com/sudar/pig-samples/strip-quote.pig
  • 61. Let’s see an example which returns a Tuple
    https://github.com/sudar/pig-samples/get-twitter-names.pig
  • 62. Filter Functions
    - Can be used in FILTER statements
    - Returns a boolean value
    Eg:
    vim_tweets = FILTER data BY FromVim(StripQuote($6));
  • 63. Filter Functions
    - Extends FilterFunc, which is an EvalFunc<Boolean>
    - Should return a boolean
    - Input is the same as for EvalFunc<T>
    - Should check for empty and nulls in input
    - Override getArgToFuncMapping() to let the UDF know about argument mapping
  • 64. Let’s see an example which returns a Boolean
    https://github.com/sudar/pig-samples/from-vim.pig
  • 65. Error Handling in UDF
    - If the error affects only a particular row, then return null.
    - If the error affects other rows, but can recover, then throw an IOException.
    - If the error affects other rows, and can’t recover, then also throw an IOException. Pig and Hadoop will quit if there are many IOExceptions.
  • 66. Can we try to write some more UDFs?
  • 67. Writing UDF in other languages
  • 68. Streaming
  • 69. Streaming
    - Entire data set is passed through an external task
    - The external task can be in any language
    - Even a shell script works
    - Uses the `STREAM` operator
  • 70. Stream through shell script
    data = LOAD 'data/tweets.csv' USING PigStorage(',');
    filtered = STREAM data THROUGH `cut -f6,8`;
    DUMP filtered;
    https://github.com/sudar/pig-samples/stream-shell-script.pig
  • 71. Stream through Python
    data = LOAD 'data/tweets.csv' USING PigStorage(',');
    filtered = STREAM data THROUGH `strip.py`;
    DUMP filtered;
    https://github.com/sudar/pig-samples/stream-python.pig
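The slide does not show what `strip.py` contains, so the following is a hypothetical stand-in rather than the workshop's script. It assumes Pig's default streaming format (one tab-delimited line per tuple on stdin, the same format expected back on stdout) and simply removes surrounding double quotes from every field.

```python
import sys


def strip_quotes(field):
    """Remove one pair of surrounding double quotes, if present."""
    if len(field) >= 2 and field[0] == '"' and field[-1] == '"':
        return field[1:-1]
    return field


def process(lines):
    """Yield each tab-separated line with quotes stripped from every field."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        yield "\t".join(strip_quotes(f) for f in fields)


# Only read stdin when data is actually piped in (as Pig streaming does).
if __name__ == "__main__" and not sys.stdin.isatty():
    for out in process(sys.stdin):
        print(out)
```

Because the script only talks to stdin and stdout, it can be checked outside Pig by piping a tab-separated file through it.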
  • 72. Debugging Pig Scripts
    - DUMP is your friend, but use it with LIMIT
    - DESCRIBE – will print the schema names
    - ILLUSTRATE – will show the structure of the schema
    - In UDFs, we can use the warn() function. It supports up to 15 different debug levels
    - Use Penny - https://cwiki.apache.org/PIG/pennytoollibrary.html
  • 73. Optimizing Pig Scripts
    - Project early and often
    - Filter early and often
    - Drop nulls before a join
    - Prefer DISTINCT over GROUP BY
    - Use the right data structure
  • 74. Using Param substitution
    -p key=value - substitutes a single key, value
    -m file.ini – substitutes using an ini file
    default – provide default values
    http://sudarmuthu.com/blog/passing-command-line-arguments-to-pig-scripts
  • 75. Problems that can be solved using Pig
    Anything data related
  • 76. When not to use Pig?
    - A lot of custom logic needs to be implemented
    - Need to do a lot of cross lookups
    - Data is mostly binary (processing image files)
    - Real-time processing of data is needed
  • 77. External Libraries
    - PiggyBank - https://cwiki.apache.org/PIG/piggybank.html
    - DataFu – LinkedIn Pig Library - https://github.com/linkedin/datafu
    - Elephant Bird – Twitter Pig Library - https://github.com/kevinweil/elephant-bird
  • 78. Useful Links
    - Pig homepage - http://pig.apache.org/
    - My blog about Pig - http://sudarmuthu.com/blog/category/hadoop-pig
    - Sample code – https://github.com/sudar/pig-samples
    - Slides – http://slideshare.net/sudar
  • 79. Thank you