Covers the two kinds of data types in the Pig data model, with the complex Pig data types described in detail.
Ping me google #bobrupakroy.
2. The Pig Latin data model has two kinds of data types
1. Single-value or atomic data types:
hold one indivisible value. Atomic values can be int, long,
float, double, chararray or bytearray, and a field is a
single piece of data.
By default Pig treats any value as bytearray if its
type is not explicitly declared.
2. Complex or non-atomic data types:
map, tuple and bag.
Rupak Roy
4. Complex Pig data types in detail
1. Map: contains key-value pairs and is represented by '[ ]'
Example:
[key1#value1, key2#value2]
'#' is the separator between key and value.
key1, i.e. the key, must be a chararray (character array).
value1, i.e. the value, can be of any data type.
2. Tuple: an ordered collection of elements (fields), each of which can be of
any data type. Because the elements are sequentially ordered,
a field can be referred to by its position. A tuple is represented by '( )'
Example: ('Ryan', 22, 'St.JohnsSchool', 'NewAvenue')
We will work through this example in the next chapter.
5. Complex Pig data types in detail
3. Bag:
an unordered collection of tuples.
A bag is represented by '{ }'
Example:
{('Ryan', 22, 'St.JohnsSchool', 'NewAvenue'), ('Bob',
23, 'St.EdmundSchool', 'Downtown'), ('Alica', 22,
'Don bosco', 'ParkAvenue')}
A bag can also be a field inside a tuple of a relation; this is called an inner bag.
Example: (Bob, 23, {(9834514, bob@gmail.com)})
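As a rough analogy only, the three complex types can be pictured with ordinary Python containers. This is not Pig code, and the variable names below are made up for illustration: a map behaves like a dictionary with string keys, a tuple like a Python tuple, and a bag like a collection of tuples whose order should not be relied on.

```python
# Rough Python analogy for Pig's complex types (illustration only, not Pig code).

# Map: keys are character strings, values can be of any type.
student_map = {"name": "Ryan", "age": 22}

# Tuple: ordered fields, so a field can be fetched by position.
student_tuple = ("Ryan", 22, "St.JohnsSchool", "NewAvenue")

# Bag: a collection of tuples. A list is used here only as a container;
# Pig itself gives no ordering guarantee for a bag.
student_bag = [
    ("Ryan", 22, "St.JohnsSchool", "NewAvenue"),
    ("Bob", 23, "St.EdmundSchool", "Downtown"),
]

# Inner bag: a bag stored as one field of a tuple.
bob = ("Bob", 23, [(9834514, "bob@gmail.com")])

first_field = student_tuple[0]   # reference by position
inner_contact = bob[2][0]        # first tuple of the inner bag
print(first_field)
print(inner_contact)
```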
6. Now let's assign data types while
loading data in Pig
# Let's start Pig in local mode (run this from the OS shell, not from grunt)
$ pig -x local
grunt> data = LOAD '/pig/student.csv' AS (name:
chararray, age: bytearray, school: chararray,
location: chararray);
grunt> DESCRIBE data;
data: {name: chararray,age: bytearray,school: chararray,
location: chararray}
7. Load and Store
In the grunt shell use the following commands:
grunt> data = LOAD '/pig/student.csv';
grunt> DESCRIBE data;
Output
Schema for data unknown.
That is, the structure of the data is unknown, since we haven't told Pig
how to interpret it. Now let's describe the schema.
grunt> data = LOAD '/pig/student.csv' AS (name, age, school,
location);
grunt> DESCRIBE data;
Output
data: {name: bytearray,age: bytearray,school: bytearray,
location: bytearray}
Since we haven't assigned data types, Pig falls back to its default
type, bytearray, for every field.
8. Pig loads data only into the fields that are defined; if fewer fields are defined
than the file contains, the remaining fields are not loaded.
Suppose
grunt> data = LOAD '/pig/student.csv' AS (name, age, school);
Here we have defined fields only up to 'school', so the next
field, 'location', will not be loaded.
----------------------------------------------------------------
What if we define more fields than the file actually has?
Suppose
grunt> data = LOAD '/pig/student.csv' AS (name, age, school, location,
abc_extrafield);
Then the 'abc_extrafield' column will contain null values.
----------------------------------------------------------
By default Pig loads data as a tab-delimited file. If no tab delimiter is
found, Pig treats the whole record as one field and loads it into the
first field/column, leaving the other columns null.
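The three loading rules on this slide can be mimicked with a small Python sketch. It is only a model of the behaviour described here, not how Pig is implemented: `load_line` is a hypothetical helper, the sample lines are invented, and plain Python strings stand in for Pig's bytearray values.

```python
def load_line(line, field_names, delimiter="\t"):
    """Mimic how one record is mapped onto the declared fields."""
    values = line.split(delimiter)
    # Fewer declared fields than values: the extra values are not loaded.
    values = values[:len(field_names)]
    # More declared fields than values: the missing ones become null (None).
    values += [None] * (len(field_names) - len(values))
    return dict(zip(field_names, values))

# Invented sample records for illustration.
tab_line = "Ryan\t22\tSt.JohnsSchool\tNewAvenue"
csv_line = "Ryan,22,St.JohnsSchool,NewAvenue"

fewer = load_line(tab_line, ["name", "age", "school"])
more = load_line(tab_line, ["name", "age", "school", "location", "abc_extrafield"])
# No tab in the line: the whole record lands in the first field.
no_tab = load_line(csv_line, ["name", "age", "school", "location"])

print(fewer)
print(more)
print(no_tab)
```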
9. So what if we want to load data that is not tab-delimited? Then we use the
PigStorage function.
PigStorage is a built-in function of Pig, most commonly used
to load text data by parsing it with an arbitrary delimiter.
Suppose student.csv is a comma-separated file. Then
grunt> data = LOAD '/pig/student.csv' USING PigStorage(',') AS
(name: chararray, age: int, school: chararray, location: chararray);
We can also use it for tab-delimited files.
grunt> data = LOAD '/pig/student.csv' USING PigStorage('\t') AS
(name: chararray, age: int, school: chararray, location: chararray);
Or simply call PigStorage() with no argument, since tab is its default delimiter.
grunt> data = LOAD '/pig/student.csv' USING PigStorage() AS
(name: chararray, age: int, school: chararray, location: chararray);
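To make the effect of the delimiter and the declared types concrete, here is a minimal Python sketch of what loading one line with a typed schema amounts to. `pig_storage_load` is a hypothetical stand-in, not the real PigStorage, and it only knows the two types used on this slide. The null on a failed conversion reflects Pig's documented behaviour for values that cannot be cast to the declared type.

```python
def pig_storage_load(line, schema, delimiter="\t"):
    """Split one text line on `delimiter` and convert each field to its declared type."""
    casts = {"chararray": str, "int": int}
    record = {}
    for (name, type_name), raw in zip(schema, line.split(delimiter)):
        try:
            record[name] = casts[type_name](raw)
        except ValueError:
            # A value that cannot be converted to the declared type is loaded as null.
            record[name] = None
    return record

schema = [("name", "chararray"), ("age", "int"),
          ("school", "chararray"), ("location", "chararray")]

# Invented sample lines for illustration.
comma_record = pig_storage_load("Ryan,22,St.JohnsSchool,NewAvenue", schema, ",")
tab_record = pig_storage_load("Ryan\t22\tSt.JohnsSchool\tNewAvenue", schema)
bad_age = pig_storage_load("Ryan,n/a,St.JohnsSchool,NewAvenue", schema, ",")

print(comma_record)
print(bad_age)
```

Leaving out the delimiter argument falls back to tab, which mirrors the default described above.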
10. Storing the Pig Output
To store the output physically in HDFS, type:
grunt> STORE data INTO '/pig/output/data';
By default the output is stored as a tab-delimited file in HDFS.
---------------------------------------------------
Another important operator in Pig is DUMP.
It displays intermediate results on screen without
storing any physical output in HDFS.
DUMP is very useful for debugging.
To use DUMP simply type:
grunt> DUMP data;
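The difference between the two can be pictured with a short Python sketch. It is an analogy under stated assumptions, not Pig's implementation: `store` and `dump` are hypothetical helpers, the records are invented, and the sketch writes one local file whereas Pig writes its output under the given directory.

```python
import os
import tempfile

# Invented sample records for illustration.
records = [("Ryan", 22, "St.JohnsSchool", "NewAvenue"),
           ("Bob", 23, "St.EdmundSchool", "Downtown")]

def store(rows, path, delimiter="\t"):
    """Write one delimited line per tuple to a file (the STORE side)."""
    with open(path, "w") as out:
        for row in rows:
            out.write(delimiter.join(str(v) for v in row) + "\n")

def dump(rows):
    """Format the tuples for the screen; nothing is written to disk (the DUMP side)."""
    return ["(" + ",".join(str(v) for v in row) + ")" for row in rows]

# Hypothetical output file in a temporary directory.
out_path = os.path.join(tempfile.mkdtemp(), "part-00000")
store(records, out_path)
with open(out_path) as f:
    stored_lines = f.read().splitlines()

dumped = dump(records)
for line in dumped:
    print(line)
```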
11. Load and Store in HDFS (cluster mode)
First start Pig in MapReduce mode (from the OS shell)
$ pig -x mapreduce
Load:
grunt> data = LOAD
'hdfs://localhost:9000/pigdata/student.csv' USING
PigStorage(',') AS (name: chararray, age: bytearray,
school: chararray, location: chararray);
Store:
grunt> STORE data INTO
'hdfs://localhost:9000/pigoutput/' USING
PigStorage(',');
12. Next
We will learn Pig casting and how to reference
a field by position.