Pig Latin, Data Model with Load and Store Functions

The Pig Latin data model has two categories of data types:
1. Atomic (single-value) data types:
Each holds a single, indivisible value. Atomic types include int,
long, float, chararray, and bytearray. A field is a single
piece of data.
By default, Pig treats any value as bytearray if its type is
not explicitly declared.
2. Complex (non-atomic) data types:
map, tuple, and bag.
Rupak Roy

Complex Pig data types in detail
1. Map: a set of key-value pairs.
Example: [key1#value1, key2#value2]
'#' separates each key from its value.
The key (key1) must be a chararray (character array).
The value (value1) can be of any data type.
2. Tuple: an ordered collection of fields, each of which can be of
any data type. Because the fields are sequentially ordered, a field
can be referred to by its position.
Example: ('Ryan', 22, 'St.JohnsSchool', 'NewAvenue')
We will work through this example in the next chapter.
3. Bag:
An unordered collection of tuples.
A bag is written with curly braces: { }
Example:
{('Ryan', 22, 'St.JohnsSchool', 'NewAvenue'), ('Bob',
23, 'St.EdmundSchool', 'Downtown'), ('Alica', 22, 'Don
bosco', 'ParkAvenue')}
A bag can also be a field inside a relation (an inner bag).
Example: (Bob, 23, {(9834514, bob@gmail.com)})
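For readers who think more easily in a general-purpose language, the three complex types can be modeled roughly in Python. This is only an analogy, not Pig code: a dict stands in for a map, a Python tuple for a Pig tuple, and a list of tuples for a bag (a Pig bag guarantees no order, which a Python list does not express). All variable names are illustrative.

```python
# Rough Python analogy for Pig's complex types (not Pig code).

# Map: chararray (string) keys, values of any type.
student_map = {"name": "Ryan", "age": 22}

# Tuple: ordered fields, so a field can be read by position.
student = ("Ryan", 22, "St.JohnsSchool", "NewAvenue")
first_field = student[0]  # in Pig, positional references are written $0, $1, ...

# Bag: a collection of tuples with no guaranteed order.
students_bag = [
    ("Ryan", 22, "St.JohnsSchool", "NewAvenue"),
    ("Bob", 23, "St.EdmundSchool", "Downtown"),
    ("Alica", 22, "Don bosco", "ParkAvenue"),
]

# Inner bag: a bag used as one field of a tuple.
bob = ("Bob", 23, [(9834514, "bob@gmail.com")])

print(first_field)
print(len(students_bag))
print(bob[2][0][1])
```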

Now let's assign data types while loading data in Pig.
# Start Pig in local mode (run this from the operating system shell, not from grunt)
$ pig -x local
grunt> data = LOAD '/pig/student.csv' AS (name:
chararray, age: bytearray, school: chararray,
location: chararray);
grunt> DESCRIBE data;
data: {name: chararray, age: bytearray, school: chararray,
location: chararray}

Load and Store
In the grunt shell, use the following commands:
grunt> data = LOAD '/pig/student.csv';
grunt> DESCRIBE data;
Output
Schema for data unknown.
The structure of the data is unknown because we haven't told Pig how
to interpret it. Now let's describe the schema.
grunt> data = LOAD '/pig/student.csv' AS (name, age, school,
location);
grunt> DESCRIBE data;
Output
data: {name: bytearray, age: bytearray, school: bytearray,
location: bytearray}
Since we haven't assigned data types, Pig gives every field the
default type, bytearray.

Pig loads data only into the fields that are defined. If fewer fields are
defined than the file contains, the remaining fields are not loaded.
Suppose
grunt> data = LOAD '/pig/student.csv' AS (name, age, school);
Here we have defined fields only up to 'school', so the next
field, 'location', will not be loaded.
----------------------------------------------------------------
What if we define more fields than the file actually has?
Suppose
grunt> data = LOAD '/pig/student.csv' AS (name, age, school, location,
abc_extrafield);
Then the 'abc_extrafield' column will contain null values.
----------------------------------------------------------
By default, Pig loads data as a tab-delimited file. If no tab delimiter is
found, Pig treats all the fields as one field and loads the entire record
into the first field/column, leaving the other columns null.
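The three rules above (too few fields declared, too many fields declared, delimiter not found) can be sketched as a small Python model. This is an illustration of the described behaviour, not Pig's actual implementation; `load_record` and the sample lines are hypothetical.

```python
# Python model of how a record is fitted to the declared fields
# (an illustration of the rules above, not Pig's implementation).

def load_record(line, num_fields, delimiter="\t"):
    """Split one line and fit it to the declared number of fields."""
    values = line.split(delimiter)
    # Fewer fields declared than present: the extra values are dropped.
    values = values[:num_fields]
    # More fields declared than present: pad with None (standing in for null).
    values += [None] * (num_fields - len(values))
    return tuple(values)

# Assumed sample records, one tab-delimited and one comma-delimited.
tab_line = "Ryan\t22\tSt.JohnsSchool\tNewAvenue"
csv_line = "Ryan,22,St.JohnsSchool,NewAvenue"

print(load_record(tab_line, 3))  # 'location' is not loaded
print(load_record(tab_line, 5))  # the fifth field is None
print(load_record(csv_line, 4))  # no tab found: whole line lands in the first field
```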

What if we want to load data that is not tab delimited? Then we use the
PigStorage function.
PigStorage is a built-in Pig function, most commonly used to load text
data by parsing it with an arbitrary delimiter.
Suppose student.csv is a comma-separated file. Then
grunt> data = LOAD '/pig/student.csv' USING PigStorage(',') AS
(name: chararray, age: int, school: chararray, location: chararray);
We can also use it for tab-delimited files:
grunt> data = LOAD '/pig/student.csv' USING PigStorage('\t') AS ...
Or simply call PigStorage() with no argument, since tab is its default delimiter:
grunt> data = LOAD '/pig/student.csv' USING PigStorage() AS ...
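As a rough picture of what a delimiter plus a typed schema does to each line, here is a minimal Python sketch. It is not how Pig is implemented; `parse_with_schema` is a hypothetical helper, only two types are modeled, and turning an unconvertible value into None (null) is an assumption about Pig's behaviour.

```python
# Python model of LOAD ... USING PigStorage(',') AS (name: chararray, age: int, ...)
# (illustration only; parse_with_schema is a hypothetical helper).

def parse_with_schema(line, schema, delimiter=","):
    """Split a line on the delimiter and convert each value to its declared type."""
    converters = {"chararray": str, "int": int}
    record = {}
    for (name, type_name), raw in zip(schema, line.split(delimiter)):
        try:
            record[name] = converters[type_name](raw)
        except ValueError:
            # Assumption: a value that cannot be converted becomes null.
            record[name] = None
    return record

schema = [
    ("name", "chararray"),
    ("age", "int"),
    ("school", "chararray"),
    ("location", "chararray"),
]

print(parse_with_schema("Ryan,22,St.JohnsSchool,NewAvenue", schema))
print(parse_with_schema("Bob,n/a,St.EdmundSchool,Downtown", schema))
```

Passing `delimiter="\t"` models the tab-delimited case.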

Storing the Pig Output
To store the output physically in HDFS, type:
grunt> STORE data INTO '/pig/output/data';
By default, the output is stored in HDFS as a tab-delimited file.
---------------------------------------------------
Another important operator in Pig is DUMP.
It displays intermediate results on screen without actually
storing any physical output in HDFS.
DUMP is very useful in debugging.
To use DUMP, simply type:
grunt> DUMP data;
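The difference between the two outputs can be pictured with a short Python sketch: STORE writes each record as a delimited line, while DUMP prints each record as a parenthesized tuple. The helper names are hypothetical, and the exact formatting (including nulls rendered as empty strings) reflects my understanding of Pig's usual output rather than anything stated in these slides.

```python
# Python model of the text produced for one record by STORE and by DUMP
# (illustration only; store_line and dump_line are hypothetical helpers).

def store_line(record, delimiter="\t"):
    """One line of a stored file: fields joined by the delimiter."""
    return delimiter.join("" if v is None else str(v) for v in record)

def dump_line(record):
    """One line of DUMP output: the tuple in parentheses, comma-separated."""
    return "(" + ",".join("" if v is None else str(v) for v in record) + ")"

record = ("Ryan", 22, "St.JohnsSchool", "NewAvenue")

print(store_line(record))        # tab-delimited, the default
print(store_line(record, ","))   # as with USING PigStorage(',')
print(dump_line(record))
```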

Load and Store in HDFS (cluster mode)
First start Pig in MapReduce mode (from the operating system shell):
$ pig -x mapreduce
Load:
grunt> data = LOAD
'hdfs://localhost:9000/pigdata/student.csv' USING
PigStorage(',') AS (name: chararray, age: bytearray,
school: chararray, location: chararray);
Store:
grunt> STORE data INTO
'hdfs://localhost:9000/pigoutput/' USING PigStorage(',');

Next
We will learn Pig casting and how to reference
fields by position.