This document discusses the Hadoop Distributed File System (HDFS) and MapReduce. It begins by explaining the HDFS architecture, including the NameNode and DataNodes, and then discusses how HDFS stores large files reliably across commodity hardware. The document also provides steps to install Hadoop on a single-node cluster and describes core Hadoop services such as the JobTracker and TaskTracker. It concludes with HDFS commands and a quiz about Hadoop components.
2. TOPICS
Describe the use of Hadoop on commodity hardware.
Explain the various configurations and services of Hadoop.
Differentiate between a regular file system and the Hadoop Distributed File System (HDFS).
Explain the HDFS architecture and MapReduce.
3. TEN STEPS TO INSTALL HADOOP 2.6.0 ON UBUNTU USING A SINGLE-NODE CLUSTER
4. Step 1 – Download the Apache Hadoop 2.6.0 release from https://hadoop.apache.org/docs/r2.6.0 and extract the archive in your home directory.
Step 2 – Download the Java (JDK 7) Linux 64-bit version from http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html and extract the archive in your home directory.
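For example, assuming the two archives were saved to ~/Downloads (the archive file names below are typical, but may differ for your download), the extraction looks like this:

# extract Hadoop and the JDK into the home directory
tar -xzf ~/Downloads/hadoop-2.6.0.tar.gz -C ~
tar -xzf ~/Downloads/jdk-7u80-linux-x64.tar.gz -C ~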
5. Step 3 – Select the Home button at the top left and press Ctrl+H to view the hidden files.
Then open .bashrc in edit mode and add the Hadoop and Java configuration as shown below.
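The original slide showed the entries as a screenshot; here is a minimal sketch of the typical .bashrc additions, assuming Hadoop and the JDK were extracted to the home directory (adjust both paths to match your install):

# point JAVA_HOME and HADOOP_HOME at the extracted directories
export JAVA_HOME=$HOME/jdk1.7.0_80
export HADOOP_HOME=$HOME/hadoop-2.6.0
# put the Java and Hadoop binaries on the PATH
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin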
6. Step 4 – core-site.xml: edit the following XML and flat files, entering your IP address as shown below.
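The slide's screenshot was not preserved; a minimal core-site.xml sketch for a single-node setup looks like this (replace localhost with your machine's IP address; port 9000 is a common choice, not a requirement):

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <!-- URI of the NameNode; use your IP address in place of localhost -->
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>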
11. Step 9 – Once that is complete, open the terminal, type start-all.sh, and press Enter.
12. Step 10 – To check that all the nodes have started properly, type jps.
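As a quick sanity check (the exact set of daemons depends on your configuration):

$ jps
# on a healthy single-node Hadoop 2.x cluster started with start-all.sh,
# the listing should include (process IDs will differ):
# NameNode, DataNode, SecondaryNameNode, ResourceManager, NodeManager, and Jps itself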
13. HADOOP CLUSTER USING COMMODITY HARDWARE
Hadoop supports the concept of a distributed architecture.
The diagram represents the nodes connected and installed with Hadoop.
Let us see what a distributed file system is and how it works in the Hadoop system.
14. FILE SYSTEM
A file system is the underlying structure a computer uses to
organize data on a hard disk.
Without a file system, information placed in a storage area would
be one large body of data with no way to tell where one piece of
information stops and the next begins.
If you are installing a new hard disk, you need to partition and
format it using a file system before you can begin storing data
or programs.
15. WHAT IS “DISTRIBUTED”?
The file system is physically distributed among several machines in the same network.
Even so, the file system is treated as one single coherent system.
Data is replicated within the file system for availability and fault tolerance.
16. WHY DISTRIBUTED? WHY DATA REPLICATION?
Resource sharing.
Combined computation speedup.
Reliability (fault tolerant, no single point of failure).
Location transparency and location independence.
Scaling up to any level.
Best processing capability and availability with cheap hardware.
18. HADOOP DISTRIBUTED FS (HDFS)
HDFS is a file system specially designed for storing huge data sets on clusters of commodity hardware with a streaming access pattern.
Just as Java follows “write once, run anywhere” across N platforms, Hadoop uses the WORM concept: Write Once, Read Many times. Once a file has been written to HDFS, its data is not changed.
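As an illustration of the WORM model, HDFS (2.x) allows appending to an existing file but not rewriting it in place; a quick shell sketch (Sample.txt and the paths are only examples):

# write once ...
hadoop fs -put Sample.txt /Sample.txt
# ... read many times
hadoop fs -cat /Sample.txt
# appending is supported, but existing bytes cannot be overwritten in place
hadoop fs -appendToFile more.txt /Sample.txt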
19. HADOOP DISTRIBUTED FS (HDFS)
HDFS provides interfaces for applications to move themselves closer to where the data is located.
HDFS supports a traditional hierarchical file organization. A user or an application can create directories and store files inside these directories.
Detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.
20. FILE SYSTEM Vs HDFS
Regular file system | HDFS
Each block of data is small in size, approximately 512 bytes | Each block of data is very large in size, 64 MB by default
Large data access suffers from disk I/O problems, mainly because of multiple seek operations | Huge data is read sequentially after a single seek
21. HDFS - CHARACTERISTICS
High fault-tolerance.
High throughput.
Suitable for applications with large data sets.
Suitable for applications with streaming access to file system data.
Can be built on commodity hardware and heterogeneous platforms.
22. HDFS Vs GFS
HDFS (Hadoop Distributed File System) | GFS (Google File System)
Cross-platform | Linux
Developed in Java | Developed in C/C++
First developed at Yahoo; now an open-source framework | Developed by Google
Has a NameNode and DataNodes | Has a master node and chunk servers
Default block size is 128 MB | Default block size is 64 MB
The NameNode receives heartbeats from the DataNodes | The master node receives heartbeats from the chunk servers
Commodity hardware is used | Commodity hardware is used
23. HDFS Vs GFS (CONTINUED)
HDFS | GFS
WORM: Write Once, Read Many times | Multiple-writer, multiple-reader model
Deleted files are renamed into a particular folder and then removed by garbage collection | Deleted files are not reclaimed immediately; they are renamed into a hidden namespace and deleted after three days if not in use
No network stack issue | Network stack issue
Journal / edit log | Operational log
Only appends are possible | Random file writes are possible
24. LET US SEE HDFS AND HOW IT IS USED
25. WHEN TO USE HDFS
HDFS is designed for storing very large files with streaming data access patterns on a cluster of commodity hardware.
What does “very large file” mean? Files of hundreds of megabytes, gigabytes, or terabytes.
Streaming data access means the WORM (Write Once, Read Many) pattern; the dataset is typically generated or copied from a source.
Commodity hardware means that expensive, highly reliable hardware is not required.
26. WHEN NOT TO USE HDFS
Lots of small files: storing a large number of small files is not advisable, because the metadata for every file is held in the HDFS NameNode's memory.
Multiple writers: a file in HDFS may be written to by a single writer only; there is no support for multiple writers.
30. SECONDARY NAMENODE (SINGLE INSTANCE):
It acts as a backup for the NameNode server, keeping a copy of the namespace image up to date by merging in the edit log.
31. NAMENODE (SINGLE INSTANCE):
The NameNode contains the file system namespace, i.e., the metadata.
Any change in the file system or in the storage pattern is tracked by the NameNode; for example, if a file is deleted from HDFS, or changed or modified in any way, the NameNode records the change in its edit log.
It directs the DataNodes to perform the actions.
It maintains a record of how the files in HDFS are split and stored.
It receives heartbeats and block reports from the DataNodes; based on these, communication and replication decisions are made.
32. DATA NODE (MULTIPLE INSTANCES)
The DataNode is responsible for storing the files in HDFS.
It manages the file blocks within the node. It sends information to the NameNode about the files and blocks stored on that node and responds to the NameNode for all file system operations.
DataNodes send heartbeats to the NameNode once every 3 seconds to report the overall health of HDFS.
DataNodes also enable pipelining of data, forwarding data to other nodes.
The DataNodes can talk to each other to rebalance data, move and copy data around, and keep the replication high.
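The NameNode's view of the live DataNodes (capacity, last heartbeat, and so on) can be inspected from the shell; a quick example:

# print cluster status, including each live DataNode and its last contact time
hdfs dfsadmin -report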
33. DATA BLOCKS
Each file is split into one or more blocks that are stored and replicated across DataNodes.
Each block is typically 64 MB or 128 MB in size.
Each block is replicated multiple times; the default is to replicate each block three times, and replicas are stored on different nodes.
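To see this block layout for a concrete file, the fsck tool can be used (/Sample.txt is just an example path):

# list the blocks of a file, their replicas, and which DataNodes hold them
hdfs fsck /Sample.txt -files -blocks -locations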
34. JOB TRACKER & TASK TRACKER
JOB TRACKER:
The JobTracker is the service within Hadoop that farms out MapReduce tasks to specific nodes in the cluster, ideally the nodes that have the data, or at least nodes in the same rack.
35. JOB TRACKER
The JobTracker communicates with the NameNode to determine the location of the data.
The JobTracker locates TaskTracker nodes with available slots at or near the data.
The JobTracker submits the work to the chosen TaskTracker nodes and monitors how it is progressing.
If TaskTrackers do not submit heartbeat signals often enough, they are deemed to have failed and the work is scheduled on a different TaskTracker.
A TaskTracker will notify the JobTracker when a task fails.
36. TASK TRACKER
TaskTrackers run on DataNodes; they run the tasks and report the status of each task to the JobTracker.
The JobTracker runs on the master node (the NameNode machine), whereas TaskTrackers run on DataNodes.
Mapper and Reducer tasks are executed on DataNodes administered by TaskTrackers.
TaskTrackers are assigned Mapper and Reducer tasks to execute by the JobTracker.
A TaskTracker is in constant communication with the JobTracker, signaling the progress of the task in execution.
38. FSIMAGE IN HADOOP
In Hadoop, the NameNode creates the fsimage file, which stores the details of the namespace, i.e., the mapping of blocks to files and the file system properties.
The fsimage is stored both in memory and on the local disk. When Hadoop restarts, it is loaded into memory; during runtime, all operations are recorded in the edit log.
The in-memory copy is always kept synchronized with the NameNode's state, so there is no need to copy the fsimage and log file from the NameNode.
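For reference, the fsimage and edit log live in the NameNode's storage directory; with the default configuration (hadoop.tmp.dir under /tmp) they can be listed like this (the exact directory depends on your dfs.namenode.name.dir setting):

# list checkpoint (fsimage) and edit-log files in the NameNode metadata directory
ls /tmp/hadoop-$USER/dfs/name/current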
39. CHALLENGES
The edit log file may grow drastically, which becomes challenging to manage.
Restarting the NameNode takes longer, because a large number of changes need to be merged.
In the case of a crash, a huge amount of metadata will be lost, since the fsimage is very old.
Therefore, restarting the NameNode takes even longer.
The Secondary NameNode is the solution to this issue. It is another machine with connectivity to the NameNode, and it periodically merges the edit log into the fsimage.
40. HDFS COMMANDS
How to upload files/directories to HDFS:
hadoop fs -put <file or dir> <hdfs path>
This copies a file from the local directory to an HDFS location. In the example below we copy the file Sample.txt to the HDFS root:
hadoop fs -put Sample.txt /
How to list the files/directories available in an HDFS location:
hadoop fs -ls /
How to count the number of files in an HDFS folder:
hadoop fs -count /
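A brief usage note (based on the standard shell behavior, not the original slide): -count prints four columns per path, in the order directory count, file count, total content size in bytes, and the path name:

# output columns: DIR_COUNT  FILE_COUNT  CONTENT_SIZE  PATHNAME
hadoop fs -count /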
41. HDFS COMMANDS
How to remove a file or directory in HDFS:
hadoop fs -rm <HDFS path>
hadoop fs -rmr <HDFS path>   (recursive, for directories)
hadoop fs -rm /testfile
hadoop fs -rmr /file_or_folder
How to view the content of a file:
hadoop fs -cat <path>
hadoop fs -cat /Sample.txt
How to display the last few lines of a file in HDFS:
hadoop fs -tail <path>
hadoop fs -tail /Sample.txt
42. HDFS COMMANDS
How to create a directory in HDFS:
hadoop fs -mkdir <HDFS dir name>
hadoop fs -mkdir /HDFSfolder
How to move a file/directory within HDFS:
hadoop fs -mv <source HDFS path> <destination HDFS path>
hadoop fs -mv /samplefile /testfolder/samplefile
hadoop fs -ls /
hadoop fs -ls /testfolder
43. HDFS COMMANDS
How to download a file from HDFS to local:
hadoop fs -get <hdfs path> <local path>
hadoop fs -get /samplefile .
How to copy a file/directory within HDFS:
hadoop fs -cp <source HDFS path> <destination HDFS path>
hadoop fs -cp /samplefile /copyHDFSfolder/samplefile
hadoop fs -ls /copyHDFSfolder
45. 1. What are the two major components of a Hadoop cluster?
a) Hadoop file system, TaskTracker
b) MapReduce, Hadoop Distributed file system
c) JobTracker, MapReduce
d) JobTracker, TaskTracker
46. 2. Which of the following services run on the master node of Apache Hadoop in cluster mode (fully distributed mode)?
a) JobTracker
b) TaskTracker
c) JobTracker, MapReduce
d) JobTracker, TaskTracker
47. 3. Which are the single-instance critical tasks?
a) NameNode, DataNode
b) NameNode, Secondary NameNode
c) JobTracker, DataNode
d) JobTracker, NameNode
48. 4. Which of the following services are used by MapReduce programs?
a) JobTracker and TaskTracker
b) Secondary NameNode
c) TaskTracker
d) NameNode
49. 5. The NameNode uses RAM for the following purpose:
a) To store the file contents in HDFS
b) To store filenames, list of blocks, and other meta information
c) To store the edits log that keeps track of changes in HDFS
d) To manage distributed read and write locks on files in HDFS
50. 6. Which of the following commands is used to start all Hadoop services?
a) start-dfs.sh
b) stop-dfs.sh
c) start-all.sh
d) stop-all.sh
51. 7. Which of the following files in the Hadoop configuration is responsible for maintaining information about the JobTracker?
a) core-site.xml
b) mapred-site.xml
c) hadoop-env.sh
d) slaves file
54. WHAT IS MAPREDUCE?
MapReduce is a programming model and software framework first developed by Google (Google's MapReduce paper was published in 2004).
It simplifies the processing of vast amounts of data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner.
It handles both structured and unstructured data.
55. MAPREDUCE
The key reason to perform mapping and reducing is to speed up the execution of a specific process by splitting the process into a number of tasks, thus enabling parallel work.
57. MAPREDUCE ENGINE
The JobTracker and the TaskTracker are responsible for performing MapReduce operations.
Generally, the JobTracker is present on the master node of the Hadoop cluster, and the TaskTracker service is present on the slave nodes.
The JobTracker service is responsible for assigning jobs to the DataNodes. Each DataNode runs a TaskTracker, which performs the tasks submitted by the JobTracker and provides the results back to the JobTracker.
58. MAPREDUCE STEPS
Map step:
– The master node takes the large problem input and slices it into smaller sub-problems, then distributes these to worker nodes.
– A worker node may do this again, leading to a multilevel tree structure.
– Each worker processes its smaller problem and hands the result back to the master.
Reduce step:
– The master node takes the answers to the sub-problems and combines them in a predefined way to get the output/answer to the original problem.
59. MAPREDUCE EXAMPLE – WORD COUNT
Let us take a file that has 2 blocks of data: BLOCK1 has some text data and BLOCK2 has some more text data. Counting the words in this file is the classic MapReduce example, sketched below.
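Since the slide's diagram was an image, here is a runnable sketch of the same word count using Hadoop Streaming, which lets plain Unix tools act as the mapper and reducer (the jar path matches a default Hadoop 2.6.0 layout; /Sample.txt and /wordcount_out are example paths, so adjust as needed):

# Mapper: split each input line into one word per line.
# Shuffle: Hadoop sorts the words between map and reduce.
# Reducer: count runs of identical (now adjacent) words.
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar \
  -input /Sample.txt \
  -output /wordcount_out \
  -mapper 'tr -s "[:space:]" "\n"' \
  -reducer 'uniq -c'

# inspect the result (a single reducer produces part-00000)
hadoop fs -cat /wordcount_out/part-00000

Each output line holds a count followed by a word, mirroring the map, shuffle, and reduce steps described on the previous slide.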
61. CHARACTERISTICS OF MAPREDUCE
It handles very large-scale data: petabytes, exabytes, and so on.
It works well on Write Once, Read Many (WORM) data.
It allows parallelism without mutexes.
The runtime takes care of splitting and moving data for operations.
Operations are provisioned near the data, i.e., data locality is preferred.
Commodity hardware and storage are leveraged.
62. REAL-TIME USES OF MAPREDUCE
Simple algorithms such as grep, text indexing, and reverse indexing.
Search engine operations like keyword indexing, ad rendering, and page rank.
Enterprise analytics.
Data-intensive computing such as sorting.
65. 2. What is the sequence of a MapReduce job?
a) Map, input split, reduce
b) Input split, map, reduce
c) Map and then reduce
d) Map, split, map
66. 3. Which of the following is the responsibility of the MapReduce developer?
a) Map, reduce
b) Reduce, input split
c) Input split, map
d) Sort, shuffle