Slides from my 2011 Pig hands-on lab at @TelecomBretagne


1. @herberts
2. Simple Types
3. int: signed 32-bit integer, from −2^31 (−2,147,483,648) to 2^31 − 1 (2,147,483,647)
4. long: signed 64-bit integer, from −2^63 (−9,223,372,036,854,775,808) to 2^63 − 1 (9,223,372,036,854,775,807)
5. float: signed 32-bit floating point, magnitude from ±2^−149 up to ±(2 − 2^−23)·2^127
6. double: signed 64-bit floating point, magnitude from ±2^−1074 up to ±(2 − 2^−52)·2^1023
7. chararray: UTF-8 character string, e.g. Hello, World!
8. bytearray: byte array (blob); the default type
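
A minimal sketch using these simple types (file and field names are hypothetical):

```pig
-- Load with typed fields; untyped fields default to bytearray.
A = LOAD 'grades.txt' AS (name:chararray, age:int, gpa:float);
B = FOREACH A GENERATE name, (double)gpa;  -- explicit cast between numeric types
```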
9. Complex Types
10. tuple: ordered set of fields, ( field [, field …] )
11. bag: collection of tuples, { tuple [, tuple …] }
12. map: set of key/value pairs, [ key#val [, key#val …] ]
13. relation: an outer bag of tuples; Pig works on relations
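
As a sketch of how these types nest (file name, field names, and input layout are hypothetical):

```pig
-- One line of skills.txt might look like:  alice	{(pig,1),(hive,2)}
-- The second field is a bag of (skill, years) tuples.
A = LOAD 'skills.txt' AS (name:chararray, skills:bag{t:tuple(skill:chararray, years:int)});
DUMP A;
```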
14. Constants
15. int: 42
    long: 42L
    float: 42.0F or 0.42e2f
    double: 3.14 or 0.314e1
    chararray: 'Hello, World!'
    bytearray: no constant form
    tuple: (3.14, 2.78)
    bag: { (1,2), (3,4) }
    map: [ 'a'#'b', 'c'#42 ]
16. Schemas
17. Simple types: ( alias[:type] [, alias[:type] …] ), e.g. (name:chararray, age:int, gpa:float)
18. Tuple schema: ( alias[:tuple] ( alias[:type] [, alias[:type] …] ) ), e.g. (T: tuple (f1:int, f2:int, f3:int))
19. Bag schema: ( alias[:bag] { tuple_schema } ), e.g. (B: bag {T: tuple(t1:int, t2:int, t3:int)})
20. Map schema: ( alias[:map] [ ] ), e.g. (M: map [])
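
A sketch of declaring a schema at load time (file name and delimiter are hypothetical):

```pig
-- PigStorage with an explicit delimiter, plus a typed schema via AS.
A = LOAD 'students.txt' USING PigStorage('\t')
    AS (name:chararray, age:int, gpa:float);
DESCRIBE A;  -- prints the declared schema
```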
21. Operators
22. Dereference:
    tuple: T.field_name, T.$0, T.($1, field_name)
    bag: B.field_name, B.$0, B.($1, field_name)
    map: M#key
23. Arithmetic: addition +, subtraction -, multiplication *, division /, modulo %, bincond (cond ? v_true : v_false)
24. Comparison: equal ==, not equal !=, less than <, greater than >, less or equal to <=, greater or equal to >=, pattern matching (matches regexp), is null, is not null
25. Sign / Boolean: positive + (has no effect), negative -, AND and, OR or, NOT not
26. FLATTEN un-nests tuples and bags:
    (a,(b,c)) → (a,b,c) with GENERATE $0, FLATTEN($1)
    ({(a,b), (c,d)}) → (a,b) and (c,d) with GENERATE FLATTEN($0)
    (a, {(b,c), (d,e)}) → (a,b,c) and (a,d,e) with GENERATE $0, FLATTEN($1)
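
A minimal script exercising FLATTEN end to end (file and field names hypothetical):

```pig
-- Each input tuple carries a bag; FLATTEN produces one output tuple per bag element.
A = LOAD 'pages.txt' AS (url:chararray, outlinks:bag{t:tuple(to:chararray)});
B = FOREACH A GENERATE url, FLATTEN(outlinks);
```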
27. Relational Operators
29. DESCRIBE: returns the schema of a relation. DESCRIBE alias;
    DUMP: displays results on screen. DUMP alias;
    EXPLAIN: displays execution plans. EXPLAIN alias;
30. STORE: stores results to the file system.
    STORE alias INTO directory [USING function];
    function: PigStorage(field_delimiter) (default: tab), PigDump(), BinStorage(), or a Store UDF
    directory: directory where output files named part-nnnnn will be created
    alias: name of the relation to store
    STORE A INTO 'myoutput' USING PigStorage(',');
31. LIMIT: limits the number of output tuples. alias_1 = LIMIT alias_0 n;
    SAMPLE: selects a random data sample. alias_1 = SAMPLE alias_0 size; (size in [0.0, 1.0])
    DISTINCT: removes duplicate tuples in a relation. alias_1 = DISTINCT alias_0;
32. ORDER BY: sorts a relation on one or more fields.
    alias = ORDER alias BY { * [ASC|DESC] | field_alias [ASC|DESC] [, field_alias [ASC|DESC] …] } [PARALLEL n];
    field_alias: a field in the relation; MUST be a simple type
    n: number of reducers to use
    X = ORDER A BY a3 DESC;
    Retrieve the relation immediately after ORDER BY to retain order.
33. FILTER: selects tuples based on a condition. alias_1 = FILTER alias_0 BY expression;
    expression: a boolean expression; use parentheses for complex expressions; can call UDFs
    X = FILTER A BY (f1 == 8) OR (NOT (f2+f3 > f1));
34. SPLIT: partitions a relation into other relations.
    SPLIT alias0 INTO alias1 IF expression1 [, aliasN IF expressionN …];
    expressionX: a boolean expression; use parentheses for complex expressions; can call UDFs
    SPLIT A INTO X IF f1<7, Y IF f2==5, Z IF (f3<6 OR f3>6);
35. UNION: computes the union of multiple relations.
    alias = UNION [ONSCHEMA] alias_0, alias_1 [, alias_N …];
    ONSCHEMA: the union is based on column names; columns with identical names MUST have compatible types
    U = UNION ONSCHEMA L1, L2;
36. CROSS: computes the cross product of N relations.
    alias = CROSS alias0, alias1 [, aliasN …] [PARTITION BY part] [PARALLEL n];
    part: a custom Hadoop partitioner, the name of a class extending org.apache.hadoop.mapred.Partitioner
    n: number of reduce tasks, to increase parallelism
    X = CROSS A, B;
    CROSS can produce a very large number of tuples.
37. JOIN (inner): performs an inner join of two or more relations.
    alias = JOIN alias0 BY {expression | (expression [, expression …])} [, alias1 BY {expression | (expression [, expression …])} …] [USING 'replicated' | 'skewed' | 'merge'] [PARTITION BY partitioner] [PARALLEL n];
    expression: a field expression
    partitioner: a custom Hadoop partitioner
    n: number of reduce tasks
    'replicated': all but the first relation are kept in memory
    'skewed': counterbalances a skewed key set
    'merge': uses an optimized merge-sort (sorted data)
    Nulls are dropped.
    X = JOIN A BY a1, B BY b1;
38. JOIN (outer): performs an outer join of two relations.
    alias = JOIN left_alias BY left_alias_column [LEFT|RIGHT|FULL] [OUTER], right_alias BY right_alias_column [USING 'replicated' | 'skewed' | 'merge'] [PARTITION BY partitioner] [PARALLEL n];
    partitioner: a custom Hadoop partitioner
    n: number of reduce tasks
    'replicated': all but the first relation are kept in memory
    'skewed': counterbalances a skewed key set
    'merge': uses an optimized merge-sort (sorted data)
    Nulls are dropped.
    C = JOIN A BY $0 LEFT OUTER, B BY $0;
39. GROUP: groups the data in one or more relations.
    alias = GROUP alias { ALL | BY expression } [, alias ALL | BY expression …] [USING 'collected' | 'merge'] [PARTITION BY partitioner] [PARALLEL n];
    partitioner: a custom Hadoop partitioner
    n: number of reduce tasks
    'collected': optimization for a CollectableLoader
    'merge': optimization for sorted data
    ALL: puts all tuples in one group
    B = GROUP A BY age;
    X = COGROUP A BY owner, B BY friend2;
    For readability, GROUP is used in statements involving one relation and COGROUP in statements involving two or more relations.
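
A small grouped-aggregation sketch (file and field names hypothetical):

```pig
A = LOAD 'students.txt' AS (name:chararray, age:int, gpa:float);
B = GROUP A BY age;  -- each tuple is (group, {bag of matching A tuples})
C = FOREACH B GENERATE group AS age, COUNT(A) AS n, AVG(A.gpa) AS avg_gpa;
DUMP C;
```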
40. FOREACH … GENERATE: generates data transformations.
    alias1 = FOREACH alias GENERATE expression [AS schema] [, expression [AS schema] …];
    alias: name of a relation (outer bag)
    schema: schema to use for the output relation; enclose it in parentheses if the FLATTEN operator is used
    expression: an expression, or * to project all fields
    X = FOREACH A GENERATE f1;
41. FOREACH (nested): generates data transformations with nested operations.
    alias1 = FOREACH alias { alias = nested_op; [alias = nested_op; …] GENERATE expression [AS schema] [, expression [AS schema] …] };
    nested_op: an operation on an inner bag; allowed operations are DISTINCT, FILTER, LIMIT, ORDER BY; projections can also be performed
    The GENERATE keyword must appear last.
    X = FOREACH B {
      FA = FILTER A BY outlink == 'www.xyz.org';
      PA = FA.outlink;
      DA = DISTINCT PA;
      GENERATE group, COUNT(DA);
    };
42. MAPREDUCE: executes native MapReduce jobs.
    alias = MAPREDUCE 'mr.jar' STORE in_alias INTO 'inputLocation' USING storeFunc LOAD 'outputLocation' USING loadFunc AS schema [`params, …`];
    mr.jar: a MapReduce jar file that can be run as hadoop jar mr.jar params
    in_alias: the relation to feed as input to the job
    params: MapReduce job parameters in backticks; MUST include the input and output locations
    B = MAPREDUCE 'wordcount.jar' STORE A INTO 'inputDir' LOAD 'outputDir' AS (word:chararray, count:int) `org.myorg.WordCount inputDir outputDir`;
43. STREAM: sends data to an external program.
    alias = STREAM alias [, alias …] THROUGH { `command` | cmd_alias } [AS schema];
    command: a command, including its arguments, enclosed in backticks
    cmd_alias: name of a command created with the DEFINE operator
    DEFINE mycmd `stream.pl -n 5`;
    B = STREAM A THROUGH mycmd;
    C = STREAM A THROUGH `stream.pl -n 5`;
44. UDFs
45. REGISTER path; registers a JAR file so its UDFs can be used.
    path: the path to the JAR file (no quotes)
    A side effect of REGISTER is that the file is copied into the tasks' working directory, so it can also be used for resource files.
    REGISTER src/foo.jar;
46. DEFINE alias {function | [`command` [input] [output] [ship] [cache]]}; assigns an alias to a UDF or a streaming command.
    function: class name of the UDF and its init parameters
    command: the command (with arguments) in backticks
    input: INPUT ({stdin|path} [USING serializer] [, {stdin|path} [USING serializer] …])
    output: OUTPUT ({stdout|stderr|path} [USING deserializer] [, {stdout|stderr|path} [USING deserializer] …])
    The default serializer/deserializer is PigStreaming.
47. DEFINE (continued)
    ship: SHIP(path [, path …]) includes the given paths with the job; use it to ship the script to run
    cache: CACHE(dfs_path#dfs_file [, dfs_path#dfs_file …]) copies data already in HDFS into the script's working directory
    DEFINE Y `stream.pl data.gz` SHIP('/work/stream.pl') CACHE('/input/data.gz#data.gz');
    DEFINE myFunc myfunc.MyEvalfunc('foo');
48. Misc
49. Parameters
    -param name=val: assigns val to $name
    -param_file f: reads parameters from file f
    %declare p val: assigns val to parameter p
    %default p val: assigns val as the default value for parameter p
    $param: dereferences a parameter
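
Putting these together, a parameterized script might look like this (script, file, and parameter names are hypothetical):

```pig
-- script.pig
%default OUT '/tmp/out';                  -- used unless -param OUT=... overrides it
A = LOAD '$IN' AS (line:chararray);       -- $IN must be supplied on the command line
STORE A INTO '$OUT';
```

It would then be run with something like `pig -param IN=/data/in.txt script.pig`.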
50. Builtin UDFs
    Eval: AVG, CONCAT, COUNT, COUNT_STAR, DIFF, MAX, MIN, SIZE, SUM, TOKENIZE
    Math: ABS, ACOS, ASIN, ATAN, CBRT, CEIL, COS, COSH, EXP, FLOOR, LOG, LOG10, RANDOM, ROUND, SIN, SINH, SQRT, TAN, TANH
    String: INDEXOF, LAST_INDEX_OF, LCFIRST, LOWER, REGEX_EXTRACT, REGEX_EXTRACT_ALL, REPLACE, STRSPLIT, SUBSTRING, TRIM, UCFIRST, UPPER
    Bag/Tuple: TOBAG, TOP, TOTUPLE
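
Several of these builtins combine into the classic word-count sketch (input file name hypothetical):

```pig
-- TOKENIZE splits each line into a bag of words; FLATTEN un-nests it.
lines  = LOAD 'input.txt' AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS n;
DUMP counts;
```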
51. Commands
    Shell: fs FSShell_subcommand subcommand_parameters; sh shell_subcommand subcommand_parameters
    Utility: {exec, run} [-param name=value] [-param_file file_name] script; kill jobid; set key value (default_parallel, debug, job.name, job.priority, stream.skippath); help; quit