Practical Hadoop with Pig
Dave Wellman
#openwest @dwellman
How does it all work?
HDFS
Hadoop Shell
MR Data Structures
Pig Commands
Pig Example
HDFS
HDFS has 3 main actors
The Name Node
The Name Node is “The Conductor”.
It directs the performance of the cluster.
The Data Nodes:
A Data Node stores blocks of data.
Clusters can contain thousands of Data Nodes.
*Yahoo has a 40,000 node cluster.
The Client
The client is a window to the
cluster.
The Name Node
The heart of the System.
Maintains a virtual File Directory.
Tracks all the nodes.
Listens for “heartbeats” and “Block Reports”
(more on this later).
If the NameNode is down, the cluster is offline.
Storing Data
The Data Nodes
Add a Data Node:
The Data Node says “Hello” to the Name Node.
The Name Node offers the Data Node a handshake with version requirements.
The Data Node replies “Okay”, or shuts down.
The Name Node hands the Data Node a NodeId that it remembers.
The Data Node is now part of the cluster, and it checks in with the Name Node every 3 seconds.
Data Node Heartbeat:
The “check-in” is a simple HTTP Request/Response.
This check-in is a critical communication protocol that guarantees the health of the cluster.
Block Reports – “what data do I have, and is it okay?”
The Name Node controls the Data Nodes by issuing orders when they check in and report their status:
Replicate Data, Delete Data, Verify Data.
The same process applies to all nodes within the cluster.
Writing Data
The client tells the NameNode the virtual directory location for the file.
The client breaks the file into 64MB “blocks” (e.g., a 156MB file becomes blocks A64, B64, and C28, where the subscript is the block’s size in MB).
The client asks the NameNode where the blocks go.
The client streams the blocks, in parallel, to the DataNodes.
The DataNode(s) tell the NameNode they have the data via the block report.
The NameNode tells the DataNodes where to replicate the blocks.
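From the client’s point of view, the whole write is a single command. A minimal sketch, assuming a Hadoop 1.x-era client and a hypothetical file name; fsck then shows how the file was split and where the blocks landed:
> hadoop fs -put access.log /user/hadoop/access.log
> hadoop fsck /user/hadoop/access.log -files -blocks -locations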
Reading Data
The client tells the NameNode it would like to read a file.
The NameNode replies with the list of blocks and the nodes the blocks are on.
The client requests the first block from a DataNode.
The client compares the checksum of the block against the manifest from the NameNode.
The client moves on to the next block in the sequence until the file has been read.
Failure Recovery
A Data Node fails to “check-in”.
After 10 minutes the Name Node gives up on that Data Node.
When another node that has blocks originally assigned to the lost node checks in, the Name Node sends a block replication command.
The Data Node replicates that block of data (just like a write).
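Replication can also be driven from the shell: setrep changes a file’s target replication factor, and the Name Node issues replicate or delete orders through the same heartbeat replies until the target is met. A sketch with a hypothetical path:
> hadoop fs -setrep -w 3 /user/hadoop/access.log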
Interacting with Hadoop
HDFS Shell Commands
HDFS Shell Commands.
> hadoop fs -ls <args>
Same as the Unix or OS X ls command.
/user/hadoop/file1
/user/hadoop/file2
...
HDFS Shell Commands.
> hadoop fs -mkdir <path>
Creates directories in HDFS using path.
HDFS Shell Commands.
> hadoop fs -copyFromLocal <localsrc>
URI
Copy a file from your client to HDFS.
Similar to the put command, except that the source
is restricted to a local file reference.
HDFS Shell Commands.
> hadoop fs -cat <path>
Copies source paths to stdout.
HDFS Shell Commands.
> hadoop fs -copyToLocal URI
<localdst>
Copy a file from HDFS to your client.
Similar to the get command, except that the
destination is restricted to a local file reference.
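Putting these together, a typical round trip looks like this (all paths are hypothetical):
> hadoop fs -mkdir /user/hadoop/demo
> hadoop fs -copyFromLocal data.tsv /user/hadoop/demo/data.tsv
> hadoop fs -ls /user/hadoop/demo
> hadoop fs -cat /user/hadoop/demo/data.tsv
> hadoop fs -copyToLocal /user/hadoop/demo/data.tsv ./data-copy.tsv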
HDFS Shell Commands.
cat
chgrp
chmod
chown
copyFromLocal
copyToLocal
cp
du
dus
expunge
get
getmerge
ls
lsr
mkdir
moveFromLocal
mv
put
rm
rmr
setrep
stat
tail
test
text
touchz
Map Reduce Data Structures
Basic, Tuples & Bags
Basic Data Types:
Strings, Integers, Doubles, Longs, Bytes, Booleans,
etc.
Advanced Data Types:
Tuples and Bags
Tuples are JSON like and simple.
raw_data: {
date_time: bytearray,
seconds: bytearray
}
Bags hold Tuples and Bags
element: {
date_time: bytearray,
seconds: bytearray,
group: chararray,
ordered_list: {
date: chararray,
hour: chararray,
score: long
}
}
Expert Advice:
Always know your data structures.
They are the foundation for all Map Reduce operations.
Complex (deep) data structures will kill -9 performance.
Keep them simple!
Processing Data
Interacting with Pig using Grunt
GRUNT
Grunt is a command-line interface used to debug
Pig jobs, similar to Ruby IRB or the Groovy CLI.
Grunt is your best weapon against bad pigs.
pig -x local
Grunt> |
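A minimal session, assuming a local tab-delimited file named sample.tsv (the file and schema are invented for illustration):
Grunt> raw_data = LOAD 'sample.tsv' USING PigStorage('\t') AS (date_time, seconds);
Grunt> describe raw_data;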
GRUNT
Grunt> describe Element
Describe will display the data structure of an
Element
Grunt> dump Element
Dump will display the data represented by an
Element
GRUNT
> describe raw_data
Produces the output:
> raw_data: { date_time: bytearray,
items: bytearray }
Or in a more human readable form:
raw_data: {
date_time: bytearray,
items: bytearray
}
GRUNT
> dump raw_data
You can dump terabytes of data to your screen,
so be careful.
(05/10/2011 20:30:00.0,0)
(05/10/2011 20:45:00.0,0)
(05/10/2011 21:00:00.0,0)
(05/10/2011 21:15:00.0,0)
...
Pig Programs
Map Reduce Made Simple
Most Pig commands are assignments.
• The element names the collection of records that exist out in
the cluster.
• It’s not a traditional programming variable.
• It describes the data from the operation.
• It does not change.
Element = Operation;
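For example (the relation and field names here are invented), each statement names a new, immutable relation; nothing is overwritten:
raw = LOAD 'events.tsv' AS (date_time:chararray, items:int);
filtered = FILTER raw BY items > 0; -- 'raw' still names the original records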
The SET command
Used to set a Hadoop job variable, like the name of your Pig
job.
SET job.name 'Day over Day - [$input]';
The REGISTER and DEFINE commands
-- Setup udf jars
REGISTER $jar_prefix/sidekick-hadoop-0.0.1.jar;
DEFINE BUCKET_FORMAT_DATE
com.sidekick.hadoop.udf.UnixTimeFormatter('MM/dd/yyyy HH:mm', 'HH');
The LOAD USING command
-- load in the data from HDFS
raw_data = LOAD '$input' USING
PigStorage('\t') AS (date_time, items);
The FILTER BY command
Selects tuples from a relation based on some condition.
-- filter to the week we want
broadcast_week = FILTER bucket_list BY (date >=
'03-Oct-2011') AND (date <= '10-Oct-2011');
The GROUP BY command
Groups the data in one or multiple relations.
daily_stats = GROUP broadcast_week BY (date,
hour);
The FOREACH command
Generates data transformations based on columns of data.
bucket_list = FOREACH raw_data GENERATE
FLATTEN(DATE_FORMAT_DATE(date_time)) AS date,
MINUTE_BUCKET(date_time) AS hour,
MAX_ITEMS(items) AS items;
*DATE_FORMAT_DATE is a user defined function, an advanced topic we’ll come to in a minute.
The GENERATE command
Use the FOREACH GENERATE operation to work with columns
of data.
bucket_list = FOREACH raw_data GENERATE
FLATTEN(DATE_FORMAT_DATE(date_time)) AS date,
MINUTE_BUCKET(date_time) AS hour,
MAX_ITEMS(items) AS items;
The FLATTEN command
FLATTEN substitutes the fields of a tuple in place of the tuple.
traffic_stats = FOREACH daily_stats GENERATE
FLATTEN(group),
COUNT(broadcast_week) AS cnt,
SUM(broadcast_week.items) AS total;
The STORE INTO USING command
A store function determines how data is stored after a Pig job.
-- All done, now store it
STORE final_results INTO '$output' USING
PigStorage();
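Pulling the pieces above together, a complete day-over-day job reads end to end like this. A hedged sketch: the jar and UDF class names come from the earlier slides, the DEFINE lines for MINUTE_BUCKET and MAX_ITEMS are invented for illustration (the deck uses those UDFs without showing their definitions), and the final relation is stored directly in place of the deck’s unspecified final_results.

SET job.name 'Day over Day - [$input]';

-- Setup udf jars
REGISTER $jar_prefix/sidekick-hadoop-0.0.1.jar;
DEFINE DATE_FORMAT_DATE
  com.sidekick.hadoop.udf.UnixTimeFormatter('MM/dd/yyyy HH:mm', 'HH');
-- hypothetical DEFINEs: class names invented for illustration
DEFINE MINUTE_BUCKET com.sidekick.hadoop.udf.MinuteBucket();
DEFINE MAX_ITEMS com.sidekick.hadoop.udf.MaxItems();

-- load in the data from HDFS
raw_data = LOAD '$input' USING PigStorage('\t') AS (date_time, items);

-- bucket each record into a (date, hour) pair
bucket_list = FOREACH raw_data GENERATE
  FLATTEN(DATE_FORMAT_DATE(date_time)) AS date,
  MINUTE_BUCKET(date_time) AS hour,
  MAX_ITEMS(items) AS items;

-- filter to the week we want
broadcast_week = FILTER bucket_list BY (date >= '03-Oct-2011') AND (date <= '10-Oct-2011');

-- group by (date, hour) and aggregate
daily_stats = GROUP broadcast_week BY (date, hour);
traffic_stats = FOREACH daily_stats GENERATE
  FLATTEN(group),
  COUNT(broadcast_week) AS cnt,
  SUM(broadcast_week.items) AS total;

-- All done, now store it
STORE traffic_stats INTO '$output' USING PigStorage();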
Demo Time!
“Because, it’s all a big lie
until someone demos the code.”
- Genghis Khan
Thank You.
- Genghis Khan