This document provides an introduction to Pig, a platform for analyzing large datasets. It discusses how Pig works with Hadoop and HDFS to allow for distributed processing and storage of big data. Pig allows users to write scripts using a simple data flow language to analyze large datasets without needing to write MapReduce programs directly. This improves programmer productivity and makes big data analysis accessible to more users without Java expertise.
1. PIG in Big Data
Data keeps growing…
2. BIG DATA
• ‘Big Data’ is similar to ‘small data’, but bigger in size
• It requires different approaches: techniques, tools, and architecture
• The goal is to solve new problems, or old problems in a better way
• It means the storage and processing of very large quantities of digital information that cannot be analyzed with traditional computing techniques
3. INTRODUCTION TO BIG DATA
• Data volume grows with the data source (figure from http://datameer.com):
• ERP (Megabytes): purchase details, purchase records, payment records
• CRM (Gigabytes): segmentation, offer details, customer touches, support contacts
• WEB (Terabytes): weblogs, offer history, A/B testing, dynamic pricing, affiliate network, search marketing, behavioral targeting, dynamic funnels
• … and far far beyond (Petabytes): user generated content, mobile web, user click stream, sentiment, social network, external demographics, business data feeds, HD video, speech to text, product/service logs, SMS/MMS
4. CONT.,
• Walmart handles more than 1 million customer transactions every hour
• Facebook handles 40 billion photos from its user base
• Decoding the human genome originally took 10 years; now it can be achieved in one week
6. HADOOP
• As data grows, we need to be able to scale out computation
• Uses cheap(er) hardware to grow horizontally
• Tolerates a few machines going down, which happens all the time
• Stores all your data from all systems
• No need to throw data away
8. HDFS
• Hadoop Distributed File System
• A distributed, scalable, and portable file system written in Java for the Hadoop framework
• Provides high-throughput access to application data
• Runs on large clusters of commodity machines
• Used to store large datasets
9. CONT.,
• A file we want to store on HDFS … (the slide shows a 600 MB text file as an example)
10. CONT.,
• HDFS splits the file into blocks … (with a 256 MB block size, the 600 MB file becomes two 256 MB blocks plus one 88 MB block)
11. MAP REDUCE
• A distributed data processing model and execution environment that runs on large clusters of commodity machines
• Also called MR
• Programs are inherently parallel
13. PIG-INTRODUCTION
• A high-level data flow language for exploring very large datasets
• Provides an engine for executing data flows in parallel on Hadoop
• Its compiler produces sequences of MapReduce programs
• Its structure is amenable to substantial parallelization
• Operates on files in HDFS
• Metadata is not required, but is used when available
14. KEY PROPERTIES OF PIG
• Ease of programming: trivial to achieve parallel execution of simple, parallel data analysis tasks
• Optimization opportunities: allows the user to focus on semantics rather than efficiency
• Extensibility: users can create their own functions to do special-purpose processing
18. PIG VS HADOOP
• 5% of the MR code
• 5% of the MR development time
• Within 25% of the MR execution time
• Readable and reusable
• An easy-to-learn DSL
• Increases programmer productivity
• No Java expertise required
• Anyone (e.g., BI folks) can trigger the jobs
• Insulates against Hadoop complexity:
• Version upgrades
• Changes in Hadoop interfaces
• JobConf configuration tuning
• Job chains
19. PIG COMMANDS
• Load: Read data from the file system
• Store: Write data to the file system
• Dump: Write output to stdout
• Foreach: Apply an expression to each record and generate one or more records
• Filter: Apply a predicate to each record and remove records where it is false
• Group / Cogroup: Collect records with the same key from one or more inputs
• Join: Join two or more inputs based on a key
• Order: Sort records based on a key
• Distinct: Remove duplicate records
• Union: Merge two datasets
• Limit: Limit the number of records
• Split: Split data into 2 or more sets, based on filter conditions
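Taken together, these statements chain into a single data flow. The following is a minimal sketch of a complete script, assuming a hypothetical comma-separated input file 'sfdcemployees' (the same illustrative dataset used in the later slides):
employees = LOAD 'sfdcemployees' USING PigStorage(',') AS (name:chararray, employeesince:int, age:int);
veterans = FILTER employees BY employeesince < 2010; -- keep long-tenured employees
byyear = GROUP veterans BY employeesince; -- one group per start year
counts = FOREACH byyear GENERATE group AS startyear, COUNT(veterans) AS total;
sorted = ORDER counts BY total DESC;
top5 = LIMIT sorted 5;
DUMP top5; -- no job runs until this line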
20. LOADING DATA
• LOAD
• Reads data from the file system
• Syntax
• LOAD 'input' [USING function] [AS schema];
• E.g., A = LOAD 'input' USING PigStorage('\t') AS (name:chararray, age:int, gpa:float);
21. SCHEMA
• Use schemas to assign types to fields
• A = LOAD 'data' AS (name, age, gpa);
• name, age, gpa default to bytearrays
• A = LOAD 'data' AS (name:chararray, age:int, gpa:float);
• name is now a string (chararray), age is an integer, and gpa is a float
22. DESCRIBING SCHEMA
• DESCRIBE
• Provides the schema of a relation
• Syntax
• DESCRIBE [alias];
• If a schema is not provided, DESCRIBE will say “Schema for alias unknown”
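As a quick illustration, loading with a typed schema and then describing the relation prints the declared schema in the grunt shell (the file name 'data' is illustrative):
grunt> A = LOAD 'data' AS (name:chararray, age:int, gpa:float);
grunt> DESCRIBE A;
A: {name: chararray, age: int, gpa: float}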
23. DUMP AND STORE
• DUMP writes the output to the console
• grunt> A = LOAD 'data';
• grunt> DUMP A; -- prints the contents of A on the console
• STORE writes output to an HDFS location
• grunt> A = LOAD 'data';
• grunt> STORE A INTO '/user/username/output'; -- writes the contents of A to HDFS
• Pig starts a job only when a DUMP or STORE is encountered (lazy evaluation)
24. REFERENCING FIELDS
• Fields are referred to by positional notation or by name (alias)
• Positional notation is generated by the system and starts with $0
• Names are assigned by you using schemas
• E.g., A = LOAD 'data' AS (name:chararray, age:int);
• With positional notation, fields can be accessed as:
• A = LOAD 'data';
• B = FOREACH A GENERATE $0, $1; -- 1st & 2nd columns
25. LIMIT
• Limits the number of output tuples
• Syntax
• alias = LIMIT alias n;
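For example, to keep at most 10 tuples of a relation (the input file 'data' is illustrative):
A = LOAD 'data';
B = LIMIT A 10; -- B holds at most 10 tuples; which ones is arbitrary unless A is ordered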
26. FILTER
• Selects tuples from a relation based on some condition
• Syntax
• alias = FILTER alias BY expression;
• E.g., to filter for ‘marcbenioff’:
• A = LOAD 'sfdcemployees' USING PigStorage(',') AS (name:chararray, employeesince:int, age:int);
• B = FILTER A BY name == 'marcbenioff';
• You can use boolean operators (AND, OR, NOT)
• B = FILTER A BY (employeesince < 2005) AND (NOT(name == 'marcbenioff'));
27. GROUP BY
• Syntax:
• alias = GROUP alias { ALL | BY expression } [, alias ALL | BY expression …] [PARALLEL n];
• E.g., to group by employee start year at Salesforce:
• A = LOAD 'sfdcemployees' USING PigStorage(',') AS (name:chararray, employeesince:int, age:int);
• B = GROUP A BY employeesince;
• You can also group all records together
• B = GROUP A ALL;
• Or group by multiple fields
• B = GROUP A BY (age, employeesince);
28. AGGREGATION
• Pig provides a number of built-in aggregation functions:
• AVG
• COUNT
• COUNT_STAR
• SUM
• MAX
• MIN
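Aggregate functions are typically applied per group inside a FOREACH. A minimal sketch, reusing the hypothetical sfdcemployees file from the previous slides to compute the average age per start year:
A = LOAD 'sfdcemployees' USING PigStorage(',') AS (name:chararray, employeesince:int, age:int);
B = GROUP A BY employeesince;
C = FOREACH B GENERATE group AS startyear, AVG(A.age) AS avgage; -- AVG runs over the bag of grouped records
DUMP C;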
29. DEFINE
• Assigns an alias to a UDF
• Syntax
• DEFINE alias {function}
• Use DEFINE to specify a UDF when:
• The UDF has a long package name
• The UDF constructor takes string parameters
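A minimal sketch; the package and class name below are hypothetical, standing in for any UDF with a long name and a constructor argument:
-- com.example.pig.udfs.MyLongNamedFunction is a hypothetical UDF class
DEFINE shorten com.example.pig.udfs.MyLongNamedFunction('someParam');
A = LOAD 'data' AS (name:chararray);
B = FOREACH A GENERATE shorten(name); -- call the UDF through its short alias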