This document discusses various ways to reference and select fields or columns from a Pig dataset:
- Fields can be referenced by position (e.g. $0, $1) or by name. When the schema is unknown, position is safer.
- A contiguous range of fields can be selected using the .. syntax (e.g. $0..$3).
- Fields can be cast to different types (e.g. (chararray)$4) during selection.
- Filters should reference fields by position rather than name when the schema is unknown, to avoid errors from missing or misplaced values.
2. Casting
Casting converts data from one type to
another, as long as the conversion is supported. For example,
suppose we have an integer field (int) that we want to
convert to a string. We can cast this field from int to chararray
by prefixing it with (chararray).
For example:
grunt> select = foreach data generate $0, (chararray)$4,
(chararray)$5;
grunt> dump select;
(ryan,67,57)
(Bob,77,75)
(Alica,68,)
(Bryan,81,79)
(Kate,66,69)
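The int-to-chararray case described above can be sketched as follows. This is a minimal sketch, not the deck's own script: the file name students.csv, the field names in the AS clause, and the aliases age_text and age_str are assumptions made for illustration.

```pig
-- Assumption: 'students.csv' is a hypothetical comma-separated file
-- holding the six columns used in this deck.
data = LOAD 'students.csv' USING PigStorage(',')
       AS (name:chararray, age:int, school:chararray,
           location:chararray, score1:int, score2:int);

-- Cast the int field age to chararray and name the result.
age_text = FOREACH data GENERATE name, (chararray)age AS age_str;

-- DESCRIBE prints the schema, so the new type can be checked.
DESCRIBE age_text;
```

When no schema is declared at load time, fields default to bytearray, which is why the listing above casts $4 and $5 by position.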
Rupak Roy
3. Reference field by position
We can refer to data fields by name as well as
by their positions ($0, $1, ...).
$0     $1   $2               $3          $4            $5
Name   Age  School           Location    Test Score 1  Test Score 2
Ryan   22   St.JohnsSchool   NewAvenue   67            57
Bob    23   St.EdumndSchool  Downtown    77            75
Alica  Na   Don Bosco        ParkAvenue  65            79
Bryan  24   St.JhonsSchool   NewAvenue   81            79
Kate   22   Don Bosco        ParkAvenue  66            69
4. #filter the data by age >= 22
grunt> age = FILTER data by $1 >= 22;
grunt> dump age;
Here, we reference the age column by its position, $1. If a schema
with field names is defined, we can also reference the column directly by name:
grunt> age = FILTER data by age >= 22;
Referencing columns by name can become tedious when we are
dealing with large datasets that have complex
column names.
#filter the data by test score1 <= 66
grunt> testscore = FILTER data by $4 <= 66;
grunt> dump testscore;
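Referencing a field by name only works once a schema has been declared. The sketch below shows both styles side by side; the file name students.csv and the field names in the AS clause are assumptions for illustration, not taken from the original load statement.

```pig
-- Assumption: hypothetical file and field names.
data = LOAD 'students.csv' USING PigStorage(',')
       AS (name:chararray, age:int, school:chararray,
           location:chararray, score1:int, score2:int);

-- Positional reference: works with or without a schema.
by_position = FILTER data BY $1 >= 22;

-- Named reference: requires the AS clause above.
by_name = FILTER data BY age >= 22;

DUMP by_name;
```

Both filters express the same condition, so they should select the same records.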
5. grunt> dump testscore;
The output shows only
one record, (Kate,22,Don Bosco,
ParkAvenue,66,69), but our original dataset
has another record with test score 1 <= 66,
i.e. Alica's.
This is because, when loading the data, we
specified that column values are separated by a
comma (,). In Alica's row the 2nd column has
no value, so the next value after the comma,
Don Bosco, was taken as the value for the
2nd column ($1), 'age'.
6. Filter data based on position of the column
grunt> select = foreach data generate $0,$4,$5;
grunt> dump select;
(ryan,67,57)
(Bob,77,75)
(Alica,68,)
(Bryan,81,79)
(Kate,66,69)
7. Select columns using reference
grunt> select_all= foreach data generate *;
grunt> dump select_all;
grunt> select_range= foreach data generate $0..$3;
grunt> dump select_range;
(Name,age)
(Ryan,22)
(bob,23)
(Alica,Don Bosco)
(Bryan,24)
(kate,22)
Alica's row shows Don Bosco instead of an age
because the 2nd value, Alica's
age, is missing, so the next
value is treated as the 2nd
column 'age' value. It is advisable
to mark missing values as NA/NIL
so that they do not get confused
with the other column values.
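Once missing values are marked in the file, the affected rows can be located from Pig. The sketch below assumes a declared schema with age as int; to my understanding, a value such as NA that cannot be converted to int is then loaded as null, though this is worth confirming on your Pig version. The file name, field names, and the alias missing_age are hypothetical.

```pig
-- Assumption: hypothetical file and field names; a missing age is
-- written as NA in the file, so every row keeps six columns.
data = LOAD 'students.csv' USING PigStorage(',')
       AS (name:chararray, age:int, school:chararray,
           location:chararray, score1:int, score2:int);

-- Rows whose age could not be read as an int are expected to hold null.
missing_age = FILTER data BY age IS NULL;

-- Keep only rows with a usable age before filtering on it.
valid_age = FILTER data BY age IS NOT NULL;

DUMP missing_age;
```

Because the placeholder keeps every value in its own column, the school name no longer slides into the age position.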
8. Reference range of columns/fields
grunt> leftsidedata = foreach data generate ..$1;
grunt> middle = foreach data generate $0 .. $2;
grunt> from_last= foreach data generate $2.. ;
grunt> random= foreach data generate $0, $4 ..$6;
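The same range expressions can also be written with field names when a schema is declared. This is a sketch under assumed names: students.csv and the fields in the AS clause are hypothetical.

```pig
-- Assumption: hypothetical file and field names.
data = LOAD 'students.csv' USING PigStorage(',')
       AS (name:chararray, age:int, school:chararray,
           location:chararray, score1:int, score2:int);

-- Everything from the first field up to and including age.
left_side = FOREACH data GENERATE .. age;

-- A closed range from name through school.
middle = FOREACH data GENERATE name .. school;

-- Everything from school to the last field.
from_school = FOREACH data GENERATE school ..;

DESCRIBE middle;
```

Named ranges read more clearly than positional ones, but they depend on the declared field order just as positions do.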
If a schema was not defined while loading the dataset, we can still define
field types in a query. For example:
grunt> random = foreach data generate (chararray)$0, (chararray)$3;
Alternatively, we can also assign an alias to each field:
grunt> random = foreach data generate (chararray)$0 as FC,
(chararray)$3 as LC;
grunt> describe random;
grunt> kate = FILTER random by FC == 'Kate';
9. Next
We will learn Pig relational operators and
how to use them.