
Introduction to Pig


This presentation provides an overview of useful Pig concepts.

Published in: Data & Analytics


  1. Introduction to Pig
  2. Agenda
     • What is Pig, and what is it used for?
     • Pig Philosophy
     • Pig's Data Model
     • Pig Example
     • Pig Latin
     • Pig Latin vs SQL
     • Pig Macros
     • Pig UDFs
  3. What is Pig, and what is it used for?
     • Pig provides an engine for executing data flows in parallel on Hadoop, much as map tasks are distributed among cluster nodes to get a job done in MapReduce.
     • Pig uses its own language, Pig Latin, for expressing these data flows.
     • Pig runs on Hadoop. It makes use of both the Hadoop Distributed File System (HDFS) and Hadoop's processing system, MapReduce.
     • By default, Pig reads input files from HDFS, uses HDFS to store intermediate data between MapReduce jobs, and writes its output to HDFS.
     • Pig Latin use cases tend to fall into three separate categories: traditional extract-transform-load (ETL) data pipelines, research on raw data, and iterative processing.
  4. Pig Philosophy
     • Pigs eat anything — Pig can operate on any data: relational, nested, structured, or unstructured.
     • Pigs live anywhere — Pig is a language for parallel data processing, not tied to one particular framework.
     • Pigs are domestic animals — Pig is designed to be easily controlled and modified by its users.
     • Pigs fly — Pig processes data quickly.
  5. Pig's Data Model: Types
     • Pig's data types can be divided into two categories: scalar types (e.g., int, long, float, double, chararray, bytearray), which contain a single value, and complex types (tuple, bag, map), which can contain other types.
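The complex types can nest inside one another; a minimal Pig Latin sketch of a schema using all three (the file name and field names here are hypothetical):

```
-- tuple: an ordered set of fields
-- bag:   an unordered collection of tuples
-- map:   a set of key/value pairs (keys are chararrays)
players = LOAD 'players.txt'
          AS (name:chararray,
              position:tuple(x:int, y:int),
              scores:bag{t:(score:int)},
              attrs:map[chararray]);
```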
  6. Pig's Data Model: Schemas
     • If a schema for the data is available, Pig will make use of it, both for up-front error checking and for optimization.
       Syntax: loads = LOAD 'data.txt' AS (col1:int, col2:chararray, col3:chararray, col4:float);
     • It is also possible to specify the schema without giving explicit data types; in this case, each field's type is assumed to be bytearray.
       Syntax: loads = LOAD 'data.txt' AS (col1, col2, col3, col4);
  7. Pig Example
     • Grunt is Pig's interactive shell. It enables users to enter Pig Latin interactively and provides a shell for interacting with HDFS.
     • To enter Grunt: pig -x local, pig -x mapreduce, or pig -x tez
       records = LOAD 'input/ncdc/micro-tab/sample.txt' AS (year:chararray, temperature:int, quality:int);
       filtered_records = FILTER records BY temperature != 9999 AND quality IN (0, 1, 4, 5, 9);
       grouped_records = GROUP filtered_records BY year;
       max_temp = FOREACH grouped_records GENERATE group, MAX(filtered_records.temperature);
       DUMP max_temp;
  8. Pig Latin: Relational Operators
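The operator table on this slide did not survive the export; a short Pig Latin sketch of some common relational operators (relation and file names are hypothetical):

```
a = LOAD 'a.txt' AS (id:int, name:chararray);
b = LOAD 'b.txt' AS (id:int, amount:float);
j = JOIN a BY id, b BY id;     -- join two relations on a key
o = ORDER j BY amount DESC;    -- sort by a field
top3 = LIMIT o 3;              -- keep only the first 3 tuples
u = UNION a, a;                -- concatenate relations
d = DISTINCT a;                -- remove duplicate tuples
STORE top3 INTO 'out' USING PigStorage(',');
```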
  9. Pig Latin: Diagnostic/UDF Operators
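The table for this slide is likewise missing from the export; a sketch of the standard diagnostic operators and the UDF statements (relation and jar names are hypothetical):

```
records = LOAD 'data.txt' AS (year:chararray, temperature:int);
DESCRIBE records;     -- print the schema of a relation
EXPLAIN records;      -- show the logical, physical, and MapReduce plans
ILLUSTRATE records;   -- step through the pipeline on a small data sample
DUMP records;         -- run the pipeline and print the result

REGISTER myudfs.jar;                   -- make UDF classes available
DEFINE MYUPPER org.example.Upper();    -- alias a UDF for shorter use
```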
  10. Pig Latin vs SQL
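The comparison table for this slide is not in the export. One core contrast: SQL is declarative and answers a question in a single statement, while Pig Latin describes a data flow as a sequence of named steps. A sketch of the same query in both forms (file and relation names are hypothetical):

```
-- SQL:
--   SELECT year, MAX(temperature)
--   FROM records WHERE temperature != 9999
--   GROUP BY year;

-- Pig Latin, as a step-by-step data flow:
records  = LOAD 'records.txt' AS (year:chararray, temperature:int);
valid    = FILTER records BY temperature != 9999;
grouped  = GROUP valid BY year;
max_temp = FOREACH grouped GENERATE group, MAX(valid.temperature);
DUMP max_temp;
```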
  11. Pig Macros
     • Macros provide a way to package reusable pieces of Pig Latin code from within Pig Latin itself.
       DEFINE max_by_group(X, group_key, max_field) RETURNS Y {
           A = GROUP $X BY $group_key;
           $Y = FOREACH A GENERATE group, MAX($X.$max_field);
       };
       records = LOAD 'input/ncdc/micro-tab/sample.txt' AS (year:chararray, temperature:int, quality:int);
       filtered_records = FILTER records BY temperature != 9999 AND quality IN (0, 1, 4, 5, 9);
       max_temp = max_by_group(filtered_records, year, temperature);
       DUMP max_temp;
  12. Pig UDFs
     • A Filter UDF
     • An Eval UDF
     • A Load UDF
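Each of the three UDF kinds can be illustrated with a built-in function; a minimal sketch, assuming a hypothetical tab-separated input file data.txt:

```
-- Load UDF: PigStorage parses delimited text (it is also the default loader)
records = LOAD 'data.txt' USING PigStorage('\t')
          AS (name:chararray, tags:bag{t:(tag:chararray)});
-- Filter UDF: IsEmpty returns a boolean, so it can be used in FILTER
nonempty = FILTER records BY NOT IsEmpty(tags);
-- Eval UDF: UPPER computes a new value for each tuple
shouting = FOREACH nonempty GENERATE UPPER(name);
```

User-written UDFs follow the same pattern: a class (in Java, or a function in Python via Jython) is made available with REGISTER and then called like any built-in.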
