# Pig Latin, Data Model with Load and Store Functions

Covers the two categories of data types in the Pig data model, with the complex Pig data types described in detail.

Let me know if anything is required. Happy to help.
Talk soon!

### Pig Latin, Data Model with Load and Store Functions

1. Pig Latin Data Model with Load and Store Functions
2. There are two categories of data types in the Pig Latin data model:
   1. Single-value, or atomic, data types: each holds one atomic value. Atomic types include int, long, float, chararray, and bytearray. A field is a single piece of data. If no type is declared explicitly, Pig treats the value as a bytearray.
   2. Complex, or non-atomic, data types: map, tuple, and bag.

   Rupak Roy
3. Data Model: complex (non-atomic) data types
4. Complex Pig data types in detail
   1. Map: contains key-value pairs, for example `key1#value1` and `key2#value2`. The `#` separates the key from the value. The key (`key1`) is a chararray (character array); the value (`value1`) can be of any data type.
   2. Tuple: an ordered collection of fields, each of which can be of any data type. Because the fields are ordered, a field can be referenced by its position. Example: `('Ryan', 22, 'St.JohnsSchool', 'NewAvenue')`. We will work through this example in the next chapter.
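
   A minimal sketch of how a map and a tuple might be declared in a schema and then accessed. The file `/pig/complex.txt`, its contents, and the field names `info` and `person` are hypothetical and are not part of the original slides. In Pig's text notation a map is written inside square brackets and a tuple inside parentheses.

   ```pig
   -- Hypothetical tab-delimited input, one record per line, for example:
   -- [school#StJohns,city#NewAvenue]<TAB>(Ryan,22)
   rec = LOAD '/pig/complex.txt'
         AS (info: map[], person: tuple(name: chararray, age: int));

   -- Look up a map value by key with '#'.
   -- Reach into a tuple by field name or by position ($0 is the first field).
   out = FOREACH rec GENERATE info#'school', person.name, person.$1;
   DUMP out;
   ```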
5. Complex Pig data types in detail
   3. Bag: an unordered collection of tuples. A bag is written inside `{ }`. Example: `{('Ryan', 22, 'St.JohnsSchool', 'NewAvenue'), ('Bob', 23, 'St.EdmundSchool', 'Downtown'), ('Alica', 22, 'Don bosco', 'ParkAvenue')}`. A bag can also be a field inside a relation (an inner bag). Example: `(Bob, 23, {(9834514, bob@gmail.com)})`
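
   One common way bags come up in practice is as the result of `GROUP`. The sketch below assumes that `/pig/student.csv` is a comma-separated file with the four columns used elsewhere in these slides; the aliases `by_age` and `counts` are made up for illustration.

   ```pig
   data = LOAD '/pig/student.csv' USING PigStorage(',')
          AS (name: chararray, age: int, school: chararray, location: chararray);

   -- GROUP yields one record per distinct age: the key (named 'group')
   -- plus an inner bag holding every student tuple with that age.
   by_age = GROUP data BY age;
   DESCRIBE by_age;

   -- COUNT takes a bag and returns the number of tuples in it.
   counts = FOREACH by_age GENERATE group AS age, COUNT(data) AS students;
   DUMP counts;
   ```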
6. Now let's assign data types while loading data in Pig. First, start Pig in local mode from the operating system shell:

   ```
   $ pig -x local
   ```

   Then, at the Grunt prompt:

   ```
   grunt> data = LOAD '/pig/student.csv' AS (name: chararray, age: bytearray, school: chararray, location: chararray);
   grunt> DESCRIBE data;
   data: {name: chararray,age: bytearray,school: chararray,location: chararray}
   ```
7. Load and Store. In the Grunt shell, use the following commands:

   ```
   grunt> data = LOAD '/pig/student.csv';
   grunt> DESCRIBE data;
   Schema for data unknown.
   ```

   The structure of the data is unknown because we haven't told Pig how to identify the fields. Now let's describe the schema:

   ```
   grunt> data = LOAD '/pig/student.csv' AS (name, age, school, location);
   grunt> DESCRIBE data;
   data: {name: bytearray,age: bytearray,school: bytearray,location: bytearray}
   ```

   Since we haven't assigned data types, Pig gives every field the default type, bytearray.
8. Pig loads data only into the fields that are defined. If fewer fields are defined than the file contains, the remaining fields are not loaded. Suppose:

   ```
   grunt> data = LOAD '/pig/student.csv' AS (name, age, school);
   ```

   Here we have defined fields only up to `school`, so the next field, `location`, will not be loaded.

   What if we define more fields than the file actually has? Suppose:

   ```
   grunt> data = LOAD '/pig/student.csv' AS (name, age, school, location, abc_extrafield);
   ```

   Then the `abc_extrafield` column will contain null values.

   By default, Pig loads data as a tab-delimited file. If no tab delimiter is found, Pig treats each record as a single field and loads the entire record into the first field, leaving the other fields null.
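
   A small sketch of how the two null cases described above could be checked. It assumes `/pig/student.csv` is a comma-separated file with four columns; the aliases `raw`, `unsplit`, `wide`, and `padded` are hypothetical.

   ```pig
   -- Case 1: load a comma-separated file with the default (tab) delimiter.
   raw = LOAD '/pig/student.csv' AS (name, age, school, location);
   -- With no tab in the line, the whole record sits in 'name' and the
   -- remaining fields are null.
   unsplit = FILTER raw BY age IS NULL AND school IS NULL AND location IS NULL;
   DUMP unsplit;

   -- Case 2: declare one more field than the file contains.
   wide = LOAD '/pig/student.csv' USING PigStorage(',')
          AS (name, age, school, location, abc_extrafield);
   -- Records whose extra field is null were padded by Pig.
   padded = FILTER wide BY abc_extrafield IS NULL;
   DUMP padded;
   ```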
9. So what if we want to load data that is not tab-delimited? Then we use the PigStorage function. PigStorage is a built-in Pig function, and the one most commonly used to load data, that parses text data with an arbitrary delimiter. Suppose student.csv is a comma-separated file. Then:

   ```
   grunt> data = LOAD '/pig/student.csv' USING PigStorage(',') AS (name: chararray, age: int, school: chararray, location: chararray);
   ```

   We can also use it for tab-delimited files:

   ```
   grunt> data = LOAD '/pig/student.csv' USING PigStorage('\t') AS (name: chararray, age: int, school: chararray, location: chararray);
   ```

   Or, because tab is the default delimiter of PigStorage, simply leave the argument out:

   ```
   grunt> data = LOAD '/pig/student.csv' USING PigStorage() AS (name: chararray, age: int, school: chararray, location: chararray);
   ```
10. Storing the Pig output. To store the output physically in HDFS, type the command:

    ```
    grunt> STORE data INTO '/pig/output/data';
    ```

    By default, the output is stored as a tab-delimited file in HDFS.

    Another important operator in Pig is DUMP. It is used to view intermediate results without physically storing the output in HDFS, which makes it very useful for debugging. To use DUMP, simply type:

    ```
    grunt> DUMP data;
    ```
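
    A short sketch combining the two operators, assuming a comma-separated `/pig/student.csv`. The alias `first_rows` and the output path `/pig/output/data_csv` are hypothetical. Limiting the relation before a DUMP keeps the debugging output small on large inputs.

    ```pig
    data = LOAD '/pig/student.csv' USING PigStorage(',')
           AS (name: chararray, age: int, school: chararray, location: chararray);

    -- Inspect only a few records while debugging instead of dumping everything.
    first_rows = LIMIT data 5;
    DUMP first_rows;

    -- Write comma-delimited output instead of the default tab-delimited form.
    -- STORE fails if the output directory already exists.
    STORE data INTO '/pig/output/data_csv' USING PigStorage(',');
    ```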
11. Load and Store in HDFS (cluster mode). First, start Pig in MapReduce mode from the operating system shell:

    ```
    $ pig -x mapreduce
    ```

    Load:

    ```
    grunt> data = LOAD 'hdfs://localhost:9000/pigdata/student.csv' USING PigStorage(',') AS (name: chararray, age: bytearray, school: chararray, location: chararray);
    ```

    Store:

    ```
    grunt> STORE data INTO 'hdfs://localhost:9000/pigoutput/' USING PigStorage(',');
    ```
12. Next: we will learn Pig casting and how to reference fields by position.