Practical Hadoop with Pig
Dave Wellman
#openwest @dwellman
How does it all work?
HDFS
Hadoop Shell
MR Data Structures
Pig Commands
Pig Example
HDFS
HDFS has 3 main actors
The Name Node
The Name Node is “The Conductor”.
It directs the performance of the cluster.
The Data Nodes:
A Data Node stores blocks of data.
Clusters can contain thousands of Data Nodes.
*Yahoo has a 40,000 node cluster.
The Client
The client is a window to the
cluster.
The Name Node
The heart of the System.
Maintains a virtual File Directory.
Tracks all the nodes.
Listens for “heartbeats” and “Block Reports”
(more on this later).
If the NameNode is down, the cluster is offline.
Storing Data
The Data Nodes
Add a Data Node:
The Data Node says “Hello” to the Name Node.
The Name Node offers the Data Node a handshake with version requirements.
The Data Node replies “Okay”, or shuts down.
The Name Node hands the Data Node a NodeId that it remembers.
The Data Node is now part of the cluster, and it checks in with the Name Node every 3 seconds.
Data Node Heartbeat:
The “check-in” is a simple HTTP Request/Response.
This check-in is a critical communication protocol that guarantees the health of the cluster.
Block Reports – “what data do I have, and is it okay?”
The Name Node controls the Data Nodes by issuing orders when they check in and report their status:
Replicate Data, Delete Data, Verify Data.
The same process applies to all nodes within the cluster.
Writing Data
The client tells the NameNode the virtual directory location for the file.
The client breaks the file into 64MB “blocks” (e.g., a 156MB file becomes blocks A64, B64, and C28, where the subscript is the block’s size in MB).
The client asks the NameNode where the blocks go.
The client streams the blocks, in parallel, to the DataNodes.
The DataNode(s) tell the NameNode they have the data via the block report.
The NameNode tells the DataNodes where to replicate the blocks.
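From the client’s point of view, the whole write is a single command. A minimal sketch, assuming a Hadoop 1.x-era client and a hypothetical file name; fsck then shows how the file was split and where the blocks landed:
> hadoop fs -put access.log /user/hadoop/access.log
> hadoop fsck /user/hadoop/access.log -files -blocks -locations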
Reading Data
The client tells the NameNode it would like to read a file.
The NameNode replies with the list of blocks and the nodes the blocks are on.
The client requests the first block from a DataNode.
The client compares the checksum of the block against the manifest from the NameNode.
The client moves on to the next block in the sequence until the file has been read.
Failure Recovery
A Data Node fails to “check-in”.
After 10 minutes the Name Node gives up on that Data Node.
When another node that has blocks originally assigned to the lost node checks in, the Name Node sends a block replication command.
The Data Node replicates that block of data (just like a write).
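Replication can also be driven from the shell: setrep changes a file’s target replication factor, and the Name Node issues replicate or delete orders through the same heartbeat replies until the target is met. A sketch with a hypothetical path:
> hadoop fs -setrep -w 3 /user/hadoop/access.log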
Interacting with Hadoop
HDFS Shell Commands
HDFS Shell Commands.
> hadoop fs -ls <args>
Same as the Unix or OS X ls command.
/user/hadoop/file1
/user/hadoop/file2
...
HDFS Shell Commands.
> hadoop fs -mkdir <path>
Creates directories in HDFS using path.
HDFS Shell Commands.
> hadoop fs -copyFromLocal <localsrc>
URI
Copy a file from your client to HDFS.
Similar to the put command, except that the source
is restricted to a local file reference.
HDFS Shell Commands.
> hadoop fs -cat <path>
Copies source paths to stdout.
HDFS Shell Commands.
> hadoop fs -copyToLocal URI
<localdst>
Copy a file from HDFS to your client.
Similar to the get command, except that the
destination is restricted to a local file reference.
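Putting these together, a typical round trip looks like this (all paths are hypothetical):
> hadoop fs -mkdir /user/hadoop/demo
> hadoop fs -copyFromLocal data.tsv /user/hadoop/demo/data.tsv
> hadoop fs -ls /user/hadoop/demo
> hadoop fs -cat /user/hadoop/demo/data.tsv
> hadoop fs -copyToLocal /user/hadoop/demo/data.tsv ./data-copy.tsv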
HDFS Shell Commands.
cat
chgrp
chmod
chown
copyFromLocal
copyToLocal
cp
du
dus
expunge
get
getmerge
ls
lsr
mkdir
moveFromLocal
mv
put
rm
rmr
setrep
stat
tail
test
text
touchz
Map Reduce Data Structures
Basic, Tuples & Bags
Basic Data Types:
Strings, Integers, Doubles, Longs, Bytes, Booleans,
etc.
Advanced Data Types:
Tuples and Bags
Tuples are JSON like and simple.
raw_data: {
date_time: bytearray,
seconds: bytearray
}
Bags hold Tuples and Bags
element: {
date_time: bytearray,
seconds: bytearray,
group: chararray,
ordered_list: {
date: chararray,
hour: chararray,
score: long
}
}
Expert Advice:
Always know your data structures.
They are the foundation for all Map Reduce operations.
Complex (deep) data structures will kill -9 performance.
Keep them simple!
Processing Data
Interacting with Pig using Grunt
GRUNT
Grunt is a command-line interface used to debug
Pig jobs, similar to Ruby IRB or the Groovy CLI.
Grunt is your best weapon against bad pigs.
pig -x local
Grunt> |
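A minimal session, assuming a local tab-delimited file named sample.tsv (the file and schema are invented for illustration):
Grunt> raw_data = LOAD 'sample.tsv' USING PigStorage('\t') AS (date_time, seconds);
Grunt> describe raw_data;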
GRUNT
Grunt> describe Element
Describe will display the data structure of an
Element
Grunt> dump Element
Dump will display the data represented by an
Element
GRUNT
> describe raw_data
Produces the output:
> raw_data: { date_time: bytearray,
items: bytearray }
Or in a more human readable form:
raw_data: {
date_time: bytearray,
items: bytearray
}
GRUNT
> dump raw_data
You can dump terabytes of data to your screen,
so be careful.
(05/10/2011 20:30:00.0,0)
(05/10/2011 20:45:00.0,0)
(05/10/2011 21:00:00.0,0)
(05/10/2011 21:15:00.0,0)
...
Pig Programs
Map Reduce Made Simple
Most Pig commands are assignments.
• The element names the collection of records that exist out in
the cluster.
• It’s not a traditional programming variable.
• It describes the data from the operation.
• It does not change.
Element = Operation;
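For example (the relation and field names here are invented), each statement names a new, immutable relation; nothing is overwritten:
raw = LOAD 'events.tsv' AS (date_time:chararray, items:int);
filtered = FILTER raw BY items > 0; -- 'raw' still names the original records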
The SET command
Used to set a Hadoop job variable, like the name of your Pig
job.
SET job.name 'Day over Day - [$input]';
The REGISTER and DEFINE commands
-- Setup udf jars
REGISTER $jar_prefix/sidekick-hadoop-0.0.1.jar;
DEFINE BUCKET_FORMAT_DATE
com.sidekick.hadoop.udf.UnixTimeFormatter('MM/dd/yyyy HH:mm', 'HH');
The LOAD USING command
-- load in the data from HDFS
raw_data = LOAD '$input' USING
PigStorage('\t') AS (date_time, items);
The FILTER BY command
Selects tuples from a relation based on some condition.
-- filter to the week we want
broadcast_week = FILTER bucket_list BY (date >=
'03-Oct-2011') AND (date <= '10-Oct-2011');
The GROUP BY command
Groups the data in one or multiple relations.
daily_stats = GROUP broadcast_week BY (date,
hour);
The FOREACH command
Generates data transformations based on columns of data.
bucket_list = FOREACH raw_data GENERATE
FLATTEN(DATE_FORMAT_DATE(date_time)) AS date,
MINUTE_BUCKET(date_time) AS hour,
MAX_ITEMS(items) AS items;
*DATE_FORMAT_DATE is a user defined function, an advanced topic we’ll come to in a minute.
The GENERATE command
Use the FOREACH GENERATE operation to work with columns
of data.
bucket_list = FOREACH raw_data GENERATE
FLATTEN(DATE_FORMAT_DATE(date_time)) AS date,
MINUTE_BUCKET(date_time) AS hour,
MAX_ITEMS(items) AS items;
The FLATTEN command
FLATTEN substitutes the fields of a tuple in place of the tuple.
traffic_stats = FOREACH daily_stats GENERATE
FLATTEN(group),
COUNT(broadcast_week) AS cnt,
SUM(broadcast_week.items) AS total;
The STORE INTO USING command
A store function determines how data is stored after a Pig job.
-- All done, now store it
STORE final_results INTO '$output' USING
PigStorage();
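Pulling the pieces above together, a complete day-over-day job reads end to end like this. A hedged sketch: the jar and UDF class names come from the earlier slides, the DEFINE lines for MINUTE_BUCKET and MAX_ITEMS are invented for illustration (the deck uses those UDFs without showing their definitions), and the final relation is stored directly in place of the deck’s unspecified final_results.

SET job.name 'Day over Day - [$input]';

-- Setup udf jars
REGISTER $jar_prefix/sidekick-hadoop-0.0.1.jar;
DEFINE DATE_FORMAT_DATE
  com.sidekick.hadoop.udf.UnixTimeFormatter('MM/dd/yyyy HH:mm', 'HH');
-- hypothetical DEFINEs: class names invented for illustration
DEFINE MINUTE_BUCKET com.sidekick.hadoop.udf.MinuteBucket();
DEFINE MAX_ITEMS com.sidekick.hadoop.udf.MaxItems();

-- load in the data from HDFS
raw_data = LOAD '$input' USING PigStorage('\t') AS (date_time, items);

-- bucket each record into a (date, hour) pair
bucket_list = FOREACH raw_data GENERATE
  FLATTEN(DATE_FORMAT_DATE(date_time)) AS date,
  MINUTE_BUCKET(date_time) AS hour,
  MAX_ITEMS(items) AS items;

-- filter to the week we want
broadcast_week = FILTER bucket_list BY (date >= '03-Oct-2011') AND (date <= '10-Oct-2011');

-- group by (date, hour) and aggregate
daily_stats = GROUP broadcast_week BY (date, hour);
traffic_stats = FOREACH daily_stats GENERATE
  FLATTEN(group),
  COUNT(broadcast_week) AS cnt,
  SUM(broadcast_week.items) AS total;

-- All done, now store it
STORE traffic_stats INTO '$output' USING PigStorage();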
Demo Time!
“Because, it’s all a big lie
until someone demos the code.”
- Genghis Khan
Thank You.
- Genghis Khan