Pig
Casting, Reference
Casting
Casting enables us to cast or convert data from one type to
another, as long as conversion is supported. For example,
suppose if we have an integer field (int) which you want to
convert to a string. We can cast this field from int to chararray
using chararray
For example:
grunt> select = foreach data generate $0, (chararray)$4,
(chararray)$5;
Grunt> dump select;
(ryan,67,57)
(Bob,77,75)
(Alica,68,)
(Bryan,81,79)
(Kate,66,69)
Rupak Roy
Reference field by position
 We can refer the data fields by name as well as
with there positions( $0,$1,,,,,).
$0 $1 $2 $3 $4 $5
Name Age School Location
Test
Score
1
Test
Score
2
Ryan 22 St.JohnsSchool NewAvenue 67 57
Bob 23 St.EdumndSchool Downtown 77 75
Alica Na Don Bosco ParkAvenue 65 79
Bryan 24 St.JhonsSchool NewAvenue 81 79
Kate 22 Don Bosco ParkAvenue 66 69
Rupak Roy
#filter the data by age >= 22
grunt> age = FILTER data by $1 >= 22;
grunt> dump age;
Here, we are referencing the age column by position $1. However
we can reference them directly by name itself such as
grunt > age = FILTER data by age >=22;
But sometimes it becomes tedious to reference the column by its
name when we will be dealing large datasets with complex
column names.
#filter the data by test score1 <= 66
grunt> testscore = FILTER data $4<= 66;
grunt> dump testscore;
Rupak Roy
grunt> dump testscore;
We will notice that the output will show only
one record that is (kate,22, Don bosco,
ParkAvenue,66,69) but in our original dataset
we have an another record of testscore1<= 66
i.e. Alica’s.
This is because when we defined while loading
the data the column values are separated by
comma (, ) and in Alica row 2nd column have
no values so it automatically took the next
value after comma Don Bosco as the 2nd
column($3) value input for column($1) ‘age’.
Rupak Roy
Filter data based on position of the column
grunt> select = foreach data generate $0,$4,$5;
grunt> dump select;
(ryan,67,57)
(Bob,77,75)
(Alica,68,)
(Bryan,81,79)
(Kate,66,69)
Rupak Roy
Select columns using reference
grunt> select_all= foreach data generate *;
grunt> dump select_all;
Grunt> select_range= foreach data generate $0..$3;
grunt> dump select_range;
(Name,age)
(Ryan,22)
(bob,23)
(Alica,Don Bosco)
(Bryan,24)
(kate,22)
Showing Don Bosco instead of age
because the 2nd value for Alica’s
age is missing, therefore it will
consider the next value as the 2nd
column ‘age’ value. It is advisable
to mark the missing value as NA/NIL
so that it will not get misplaced
with the other column values.
Rupak Roy
Reference range of columns/fields
grunt> leftsidedata = foreach data generate ..$1;
grunt> middle = foreach data generate $0 .. $2;
grunt> from_last= foreach data generate $2.. ;
grunt> random= foreach data generate $0, $4 ..$6;
If schema is not defined while loading the dataset, we can even define
the schema by using a query. For example:
grunt> random = foreach data generate (chararray)$0, (chararray)$3;
Alternatively, we can also assign Alias name to the field like
grunt> random = foreach data generate (chararray)$0 as FC,
chararray)$3 as LC ;
grunt> describe random;
grunt> alias = FILTER alias by fc ==‘Kate’
Rupak Roy
Next
 We will learn PIG relational operators and
how to perform them.
Rupak Roy

Apache PIG casting, reference

  • 1.
  • 2.
    Casting Casting enables usto cast or convert data from one type to another, as long as conversion is supported. For example, suppose if we have an integer field (int) which you want to convert to a string. We can cast this field from int to chararray using chararray For example: grunt> select = foreach data generate $0, (chararray)$4, (chararray)$5; Grunt> dump select; (ryan,67,57) (Bob,77,75) (Alica,68,) (Bryan,81,79) (Kate,66,69) Rupak Roy
  • 3.
    Reference field byposition  We can refer the data fields by name as well as with there positions( $0,$1,,,,,). $0 $1 $2 $3 $4 $5 Name Age School Location Test Score 1 Test Score 2 Ryan 22 St.JohnsSchool NewAvenue 67 57 Bob 23 St.EdumndSchool Downtown 77 75 Alica Na Don Bosco ParkAvenue 65 79 Bryan 24 St.JhonsSchool NewAvenue 81 79 Kate 22 Don Bosco ParkAvenue 66 69 Rupak Roy
  • 4.
    #filter the databy age >= 22 grunt> age = FILTER data by $1 >= 22; grunt> dump age; Here, we are referencing the age column by position $1. However we can reference them directly by name itself such as grunt > age = FILTER data by age >=22; But sometimes it becomes tedious to reference the column by its name when we will be dealing large datasets with complex column names. #filter the data by test score1 <= 66 grunt> testscore = FILTER data $4<= 66; grunt> dump testscore; Rupak Roy
  • 5.
    grunt> dump testscore; Wewill notice that the output will show only one record that is (kate,22, Don bosco, ParkAvenue,66,69) but in our original dataset we have an another record of testscore1<= 66 i.e. Alica’s. This is because when we defined while loading the data the column values are separated by comma (, ) and in Alica row 2nd column have no values so it automatically took the next value after comma Don Bosco as the 2nd column($3) value input for column($1) ‘age’. Rupak Roy
  • 6.
    Filter data basedon position of the column grunt> select = foreach data generate $0,$4,$5; grunt> dump select; (ryan,67,57) (Bob,77,75) (Alica,68,) (Bryan,81,79) (Kate,66,69) Rupak Roy
  • 7.
    Select columns usingreference grunt> select_all= foreach data generate *; grunt> dump select_all; Grunt> select_range= foreach data generate $0..$3; grunt> dump select_range; (Name,age) (Ryan,22) (bob,23) (Alica,Don Bosco) (Bryan,24) (kate,22) Showing Don Bosco instead of age because the 2nd value for Alica’s age is missing, therefore it will consider the next value as the 2nd column ‘age’ value. It is advisable to mark the missing value as NA/NIL so that it will not get misplaced with the other column values. Rupak Roy
  • 8.
    Reference range ofcolumns/fields grunt> leftsidedata = foreach data generate ..$1; grunt> middle = foreach data generate $0 .. $2; grunt> from_last= foreach data generate $2.. ; grunt> random= foreach data generate $0, $4 ..$6; If schema is not defined while loading the dataset, we can even define the schema by using a query. For example: grunt> random = foreach data generate (chararray)$0, (chararray)$3; Alternatively, we can also assign Alias name to the field like grunt> random = foreach data generate (chararray)$0 as FC, chararray)$3 as LC ; grunt> describe random; grunt> alias = FILTER alias by fc ==‘Kate’ Rupak Roy
  • 9.
    Next  We willlearn PIG relational operators and how to perform them. Rupak Roy