@herberts
Simple Types
int       Signed 32-bit integer          -(2^31) to 2^31-1
                                         (-2,147,483,648 to 2,147,483,647)
long      Signed 64-bit integer          -(2^63) to 2^63-1
                                         (-9,223,372,036,854,775,808 to 9,223,372,036,854,775,807)
float     Signed 32-bit floating point   magnitudes from 2^-149 to (2 - 2^-23) * 2^127
double    Signed 64-bit floating point   magnitudes from 2^-1074 to (2 - 2^-52) * 2^1023
chararray UTF-8 character string         e.g. Hello, World!
bytearray Byte array (blob)              The default type
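A quick way to sanity-check the numeric bounds in the table above is to compute them in plain Python (a side note of mine, not part of the slides):

```python
import struct

# Integer bounds: -(2^31)..2^31-1 and -(2^63)..2^63-1
assert 2**31 - 1 == 2_147_483_647
assert -(2**63) == -9_223_372_036_854_775_808

# float extremes: largest finite magnitude (2 - 2^-23) * 2^127 and smallest
# positive denormal 2^-149. Round-tripping through a packed 4-byte IEEE 754
# float shows both are exactly representable in 32 bits.
for v in [(2 - 2**-23) * 2**127, 2**-149]:
    assert struct.unpack('<f', struct.pack('<f', v))[0] == v

# double extremes are native Python floats (64-bit IEEE 754)
assert (2 - 2**-52) * 2**1023 == 1.7976931348623157e308
assert 2**-1074 == 5e-324
```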
Complex Types
tuple      Ordered set of fields      ( field [, field …] )
bag        Collection of tuples       { tuple [, tuple …] }
map        Set of key/value pairs     [ key#val <, key#val …> ]
relation   Outer bag of tuples        Pig works on relations
Constants
int         42
long        42L
float       42.0F or 0.42e2f
double      3.14 or 0.314e1
chararray   Hello, World!
bytearray   ∅
tuple       (3.14, 2.78)
bag         { (1,2), (3,4) }
map         [ a#b, c#42 ]
Schemas
Simple types   ( alias[:type] [, alias[:type] …] )
               (name:chararray, age:int, gpa:float)
Tuple schema   ( alias[:tuple] ( alias[:type] [, alias[:type] …] ) )
               (T: tuple (f1:int, f2:int, f3:int))
Bag schema     ( alias[:bag] { tuple_schema } )
               (B: bag {T: tuple(t1:int, t2:int, t3:int)})
Map schema     ( alias[:map] [ ] )
               (M: map [])
Operators
Dereference
tuple   T.field_name    T.$0    T.($1, field_name)
bag     B.field_name    B.$0    B.($1, field_name)
map     M#key
Arithmetic
addition         +
subtraction      -
multiplication   *
division         /
modulo           %
bincond          (cond ? v_true : v_false)
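The bincond operator is an expression-level if/else; its behaviour can be sketched in Python (function name and data are mine):

```python
def bincond(cond, v_true, v_false):
    # Mimics Pig's (cond ? v_true : v_false)
    return v_true if cond else v_false

# e.g. tagging a numeric field, as one might inside FOREACH ... GENERATE
assert bincond(7 % 2 == 0, "even", "odd") == "odd"
assert bincond(8 % 2 == 0, "even", "odd") == "even"
```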
Comparison
equal                   ==
not equal               !=
less than               <
greater than            >
less than or equal      <=
greater than or equal   >=
pattern matching        matches regexp
is null                 is null
is not null             is not null
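Note that matches tests the regexp against the entire string, not a substring; in Python terms (helper name and examples are mine):

```python
import re

def matches(value, pattern):
    # Mimics Pig's "value matches 'pattern'": the whole string must match
    return re.fullmatch(pattern, value) is not None

assert matches("www.example.org", r"www\..*\.org")
assert not matches("www.example.org", "example")  # substring alone does not match
```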
Sign / Boolean
positive   + (has no effect)
negative   -
AND        and
OR         or
NOT        not
FLATTEN        Un-nests tuples and bags
(a,(b,c))             (a,b,c)      (GENERATE $0, FLATTEN($1))
({(a,b), (c,d)})      (a,b)        (GENERATE FLATTEN($0))
                      (c,d)
(a, {(b,c), (d,e)})   (a, b, c)    (GENERATE $0, FLATTEN($1))
                      (a, d, e)
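The FLATTEN cases above can be mimicked in plain Python to make the semantics concrete (tuples model Pig tuples, lists model bags; all names are mine):

```python
def flatten_tuple(record, i, keep):
    # GENERATE <kept>, FLATTEN(<tuple field>): splice the inner fields in
    return tuple(record[k] for k in keep) + record[i]

def flatten_bag(record, i, keep):
    # GENERATE <kept>, FLATTEN(<bag field>): one output tuple per bag
    # element, each crossed with the kept fields
    kept = tuple(record[k] for k in keep)
    return [kept + t for t in record[i]]

# (a,(b,c)) -> (a,b,c)
assert flatten_tuple(("a", ("b", "c")), 1, [0]) == ("a", "b", "c")
# (a, {(b,c),(d,e)}) -> (a,b,c) and (a,d,e)
assert flatten_bag(("a", [("b", "c"), ("d", "e")]), 1, [0]) == [
    ("a", "b", "c"),
    ("a", "d", "e"),
]
```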
Relational Operators
LOAD   Loads data from the file system
alias = LOAD 'data' [USING function] [AS schema];
function   PigStorage(field_delimiter) [default '\t'], BinStorage(), or a load UDF
data       File(s) to load data from
schema     Schema to use when loading the data
A = LOAD 'myfile.txt' USING PigStorage('\t') AS (f1:int, f2:int, f3:int);
DESCRIBE   Returns the schema of a relation       DESCRIBE alias;
DUMP       Dumps or displays results on screen    DUMP alias;
EXPLAIN    Displays execution plans               EXPLAIN alias;
STORE   Stores or saves results to the file system
STORE alias INTO 'directory' [USING function];
function    PigStorage(field_delimiter) [default '\t'], PigDump(), BinStorage(), or a store UDF
directory   Directory where output files named part-nnnnn will be created
alias       Name of the relation to store
STORE A INTO 'myoutput' USING PigStorage(',');
LIMIT      Limits the number of output tuples
           alias_1 = LIMIT alias_0 n;
SAMPLE     Selects a random data sample
           alias_1 = SAMPLE alias_0 size;   (size in [0.0, 1.0])
DISTINCT   Removes duplicate tuples from a relation
           alias_1 = DISTINCT alias_0;
ORDER BY   Sorts a relation on one or more fields
alias = ORDER alias BY { * [ASC|DESC]
                       | field_alias [ASC|DESC] [, field_alias [ASC|DESC] …] }
        [PARALLEL n];
field_alias   A field of the relation; MUST be a simple type
n             Number of reducers to use
X = ORDER A BY a3 DESC;
Use the relation immediately after the ORDER BY to retain the order.
FILTER   Selects tuples based on some condition
alias_1 = FILTER alias_0 BY expression;
expression   A boolean expression; use parentheses for complex expressions; can call UDFs
X = FILTER A BY (f1 == 8) OR (NOT (f2+f3 > f1));
SPLIT   Partitions a relation into other relations
SPLIT alias0 INTO alias1 IF expression1 [, aliasN IF expressionN …];
expressionX   A boolean expression; use parentheses for complex expressions; can call UDFs
SPLIT A INTO X IF f1<7, Y IF f2==5, Z IF (f3<6 OR f3>6);
UNION   Computes the union of multiple relations
alias = UNION [ONSCHEMA] alias_0, alias_1 [, alias_N ...];
ONSCHEMA   Bases the union on column names; columns with identical names MUST have compatible types
U = UNION ONSCHEMA L1, L2;
CROSS   Computes the cross product of N relations
alias = CROSS alias0, alias1 [, aliasN ...] [PARTITION BY part] [PARALLEL n];
part   A custom Hadoop partitioner: the name of a class extending org.apache.hadoop.mapred.Partitioner
n      Number of reduce tasks, to increase parallelism
X = CROSS A, B;
CROSS can produce a very large number of tuples.
JOIN   Performs an inner join of two or more relations
alias = JOIN alias0 BY {expression|(expression [, expression …])}
        [, alias1 BY {expression|(expression [, expression …])} …]
        [USING 'replicated' | 'skewed' | 'merge']
        [PARTITION BY partitioner] [PARALLEL n];
expression    Field expression
partitioner   A custom Hadoop partitioner
n             Number of reduce tasks
replicated    All but the first relation are kept in memory
skewed        Counterbalances a skewed key set
merge         Uses an optimized merge-sort (sorted data)
Nulls are dropped.
X = JOIN A BY a1, B BY b1;
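The closing note ("nulls are dropped") means a tuple whose join key is null matches nothing, not even another null. A hash-join sketch in Python (all names and data are mine):

```python
def inner_join(A, B, a_key, b_key):
    # Mimics X = JOIN A BY a_key, B BY b_key; None stands in for Pig's null.
    index = {}
    for t in A:
        if t[a_key] is not None:          # null keys never join
            index.setdefault(t[a_key], []).append(t)
    out = []
    for u in B:
        if u[b_key] is None:              # null keys never join
            continue
        for t in index.get(u[b_key], []):
            out.append(t + u)             # joined tuple holds fields of both
    return out

A = [(1, "x"), (None, "y"), (2, "z")]
B = [(1, "p"), (None, "q")]
# Only key 1 matches; both null-keyed tuples are dropped
assert inner_join(A, B, 0, 0) == [(1, "x", 1, "p")]
```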
JOIN   Performs an outer join of two relations
alias = JOIN left_alias BY left_alias_column [LEFT|RIGHT|FULL] [OUTER],
        right_alias BY right_alias_column
        [USING 'replicated' | 'skewed' | 'merge']
        [PARTITION BY partitioner] [PARALLEL n];
partitioner   A custom Hadoop partitioner
n             Number of reduce tasks
replicated    All but the first relation are kept in memory
skewed        Counterbalances a skewed key set
merge         Uses an optimized merge-sort (sorted data)
Null join keys never match.
C = JOIN A BY $0 LEFT OUTER, B BY $0;
GROUP   Groups the data in one or more relations
alias = GROUP alias { ALL | BY expression } [, alias ALL | BY expression …]
        [USING 'collected' | 'merge'] [PARTITION BY partitioner] [PARALLEL n];
partitioner   A custom Hadoop partitioner
n             Number of reduce tasks
collected     Optimization for a CollectableLoader
merge         Optimization for sorted data
ALL           Puts all tuples in a single group
B = GROUP A BY age;
X = COGROUP A BY owner, B BY friend2;
For readability, GROUP is used in statements involving one relation and COGROUP
in statements involving two or more relations.
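GROUP emits one tuple per distinct key: the key itself (named group) and a bag containing every matching input tuple, whole. Sketched in Python (lists stand in for bags; names and data are mine):

```python
def group_by(relation, key_index):
    # Mimics B = GROUP A BY <field>: (group, bag) pairs; bags keep full tuples
    groups = {}
    for t in relation:
        groups.setdefault(t[key_index], []).append(t)
    return sorted(groups.items())

A = [("alice", 20), ("bob", 21), ("carol", 20)]
# Grouping on the age field ($1) gives one row per distinct age
assert group_by(A, 1) == [
    (20, [("alice", 20), ("carol", 20)]),
    (21, [("bob", 21)]),
]
```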
FOREACH   Generates data transformations
alias1 = FOREACH alias GENERATE expression [AS schema] [, expression [AS schema] ...];
alias        Name of a relation (outer bag)
schema       Schema to use for the output relation;
             enclose the schema in parentheses if the FLATTEN operator is used
expression   An expression, or * to project all fields
X = FOREACH A GENERATE f1;
FOREACH   Generates data transformations (nested form)
alias1 = FOREACH alias {
    alias = nested_op; [ alias = nested_op; ... ]
    GENERATE expression [AS schema] [, expression [AS schema] ...]
};
nested_op   Operation on an inner bag; allowed operations are
            DISTINCT, FILTER, LIMIT, ORDER BY
            Projections can also be performed
            The GENERATE keyword must appear last
X = FOREACH B {
    FA = FILTER A BY outlink == 'www.xyz.org';
    PA = FA.outlink;
    DA = DISTINCT PA;
    GENERATE group, COUNT(DA);
};
MAPREDUCE   Executes native MapReduce jobs
alias = MAPREDUCE 'mr.jar'
        STORE in_alias INTO 'inputLocation' USING storeFunc
        LOAD 'outputLocation' USING loadFunc AS schema
        [`params, ...`];
mr.jar     MapReduce jar file runnable with: hadoop jar mr.jar params
in_alias   Relation to feed as input to the job
params     MapReduce job params in backticks; MUST include the input and output locations
B = MAPREDUCE 'wordcount.jar'
    STORE A INTO 'inputDir'
    LOAD 'outputDir' AS (word:chararray, count:int)
    `org.myorg.WordCount inputDir outputDir`;
STREAM   Sends data to an external program
alias = STREAM alias [, alias …] THROUGH { `command` | cmd_alias } [AS schema];
command     A command, including its arguments, enclosed in backticks
cmd_alias   Name of a command created using the DEFINE operator
DEFINE mycmd `stream.pl -n 5`;
B = STREAM A THROUGH mycmd;
C = STREAM A THROUGH `stream.pl -n 5`;
UDFs
REGISTER path;   Registers a JAR file so its UDFs can be used
path   The path to the JAR file (no quotes)
A side effect of REGISTER is that the file is copied into the tasks'
working directory, so it can also be used to ship resource files.
REGISTER src/foo.jar;
DEFINE alias { function | [`command` [input] [output] [ship] [cache]] };
Assigns an alias to a UDF or a streaming command.
function   Class name of the UDF and its init parameters
command    Command (with arguments) in backticks
input      INPUT( {stdin|'path'} [USING serializer]
                  [, {stdin|'path'} [USING serializer] …] )
output     OUTPUT( {stdout|stderr|'path'} [USING deserializer]
                   [, {stdout|stderr|'path'} [USING deserializer] …] )
The default serializer/deserializer is PigStreaming.
DEFINE alias { function | [`command` [input] [output] [ship] [cache]] };
ship    SHIP('path' [, 'path' …])
        Ships the given paths with the job; use it to ship the script to run
cache   CACHE('dfs_path#dfs_file' [, 'dfs_path#dfs_file' …])
        Copies data already in HDFS into the script's working directory
DEFINE Y `stream.pl data.gz` SHIP('/work/stream.pl') CACHE('/input/data.gz#data.gz');
DEFINE myFunc myfunc.MyEvalfunc('foo');
Misc
Parameters
-param name=val   Assigns val to $name
-param_file f     Reads parameters from file f
%declare p val    Assigns val to parameter p
%default p val    Assigns val as the default value of parameter p
$param            Dereferences parameter param
Builtin UDFs
Eval:        AVG, CONCAT, COUNT, COUNT_STAR, DIFF, MAX, MIN, SIZE, SUM, TOKENIZE
Math:        ABS, ACOS, ASIN, ATAN, CBRT, CEIL, COS, COSH, EXP, FLOOR, LOG, LOG10,
             RANDOM, ROUND, SIN, SINH, SQRT, TAN, TANH
String:      INDEXOF, LAST_INDEX_OF, LCFIRST, LOWER, REGEX_EXTRACT, REGEX_EXTRACT_ALL,
             REPLACE, STRSPLIT, SUBSTRING, TRIM, UCFIRST, UPPER
Bag/Tuple:   TOBAG, TOP, TOTUPLE
Commands
Shell:
  fs FSShell_subcommand subcommand_parameters
  sh shell_subcommand subcommand_parameters
Utility:
  {exec, run} [-param name=value] [-param_file file_name] script
  kill jobid
  set key value   (default_parallel, debug, job.name, job.priority, stream.skippath)
  help
  quit
Hadoop Pig
Slides from my 2011 Pig hands-on lab at @TelecomBretagne