By: Akhil Arora & Shrey Mehrotra
Big data is an all-encompassing term for any collection of data sets so large and complex that it becomes difficult
to process them using traditional data processing applications.
Domains with Large Datasets:
• Meteorology
• Complex physics simulations
• Biological and environmental research
• Internet search
Key challenges:
• Capture & Store
• Search
• Sharing & Transfer
• Analysis
Dec 2004 : Google GFS paper published
July 2005 : Nutch uses MapReduce
Feb 2006 : Becomes Lucene subproject
Apr 2007 : Yahoo! on 1000-node cluster
Jan 2008 : An Apache Top Level Project
April 2009 : Won the minute sort by sorting 500 GB in 59 seconds (on 1400 nodes)
April 2009 : 100 terabyte sort in 173 minutes (on 3400 nodes)
Advertising: Improve effectiveness of advertising and promotions
Financial Services: Mitigate risk while creating opportunity
Government: Decrease budget pressures by offloading expensive SQL workloads
Healthcare: Deliver better care and streamline operations
Manufacturing: Increase production, reduce costs, and improve quality
Oil & Gas: Maximize yields and reduce risk in the supply chain
Retail: Boost sales in-store and online
Telecoms: Telcos and cable companies use Hortonworks for service, security, and sales
Projects Powered by Hadoop
 The Apache™ Hadoop® project develops open-source software for reliable,
scalable, distributed computing.
 The Apache Hadoop software library is a framework that allows for the
distributed processing of large data sets across clusters of computers using simple
programming models.
 It is designed to scale up from single servers to thousands of machines, each
offering local computation and storage.
 The library itself is designed to detect and handle failures at the application layer,
thereby delivering a highly available service.
Hadoop Distributed File System
• Distributed, scalable, and portable file system written in Java for the Hadoop framework
• Fault-tolerant storage system
Map-Reduce Programming Model
• High-performance parallel data processing
• Employs the divide-and-conquer principle
HDFS Layer
Stores files across storage nodes in a Hadoop cluster.
Consists of: Namenode & Datanodes
Map-Reduce Engine
Processes vast amounts of data in parallel on large clusters in a reliable, fault-tolerant manner.
Consists of: JobTracker & TaskTrackers
NameNode
 Maps a block to the Datanodes
 Controls read/write access to files
 Manages Replication Engine for Blocks
DataNode
 Responsible for serving read and write
requests (block creation, deletion, and
replication)
JobTracker
 Accepts Map-Reduce tasks from the Users
 Assigns tasks to the Task Trackers &
monitors their status
TaskTracker
 Runs Map-Reduce tasks
 Sends heart-beat to Job Tracker
 Retrieves Job resources from HDFS
Hadoop Daemons: NameNode, DataNode, JobTracker, TaskTracker
Limitations of Hadoop 1.x:
• Scalability
• Batch-processing only
• Reliability & availability
• Partitioning of resources
• Coupling with MapReduce only
Map-Reduce Programming Model
• High-performance parallel data processing
• Employs the divide-and-conquer principle
YARN
• Yet Another Resource Negotiator
• A framework for cluster resource management
• Efficient task schedulers
Hadoop Distributed File System
• Distributed, scalable, and portable file system written in Java for the Hadoop framework
• Fault-tolerant storage system
HDFS Layer
Stores files across storage nodes in a Hadoop cluster.
Consists of: Namenode & Datanodes
YARN Engine
Processes vast amounts of data in parallel on large clusters in a reliable, fault-tolerant manner.
Consists of: ResourceManager & NodeManager
NameNode
 Maps a block to the Datanodes
 Controls read/write access to files
 Manages Replication Engine for Blocks
DataNode
 Responsible for serving read and write
requests (block creation, deletion, and
replication)
ResourceManager
 Accepts Map-Reduce or Application tasks
from the Users
 Assigns tasks to the NodeManager &
monitors their status
NodeManager
 Runs Application tasks
 Sends heart-beat to ResourceManager
 Retrieves Application resources from HDFS
Hadoop Daemons: NameNode, DataNode, ResourceManager, NodeManager
HDFS Design Goals
 Hardware Failure - Detection of faults and quick, automatic recovery
 Streaming Data Access - High throughput of data access (Batch Processing)
 Large Data Sets - Gigabytes to terabytes in size.
 Simple Coherency Model - Write-once-read-many access model for files
 Moving computation is cheaper than moving data
HDFS Architecture
Storage & Replication of Blocks in HDFS
[Diagram: a file is divided into blocks (Block 1 to Block 4), which the Namenode maps across Datanode_1, Datanode_2, and Datanode_3.]
NameNode and DataNodes : Java Processes responsible for HDFS operations
Data Replication : Blocks of a file are replicated for fault tolerance
Replica Placement : A rack-aware replica placement policy
Replica Selection : Minimize global bandwidth consumption and read latency
File System Namespace : Hierarchical file organization
Safemode : File system consistency check
Blocks
 Minimum unit of data that can be read or written; 128 MB by default
 Minimize the cost of seeks
 A file can be larger than any single disk in the network
 Simplifies the storage subsystem – Same size & eliminating metadata concerns
 Provides fault tolerance and availability
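As a toy illustration of the block arithmetic above (the helper name and values are invented; this is not Hadoop source code):

```python
# Toy sketch of HDFS block arithmetic (invented helper, not Hadoop code).
BLOCK_SIZE = 128 * 1024 * 1024  # 134217728 bytes, the default block size

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the byte sizes of the blocks a file of `file_size` bytes
    occupies. The last block may be partial; HDFS does not pad it."""
    blocks = []
    remaining = file_size
    while remaining > 0:
        blocks.append(min(block_size, remaining))
        remaining -= block_size
    return blocks

# A 300 MB file needs two full blocks plus one 44 MB partial block.
print(split_into_blocks(300 * 1024 * 1024))
```

Because a file is just a sequence of blocks, it can be larger than any single disk, and all blocks share one fixed size, which simplifies the storage subsystem.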
Rack Awareness
 Get maximum performance out of
Hadoop
 Resolution of the slave's DNS name
(also IP address) to a rack id.
 Interface DNSToSwitchMapping
Rack Topology - /rack1 & /rack2
Replica Placement
 Critical to HDFS reliability and
performance
 Improve data reliability,
availability, and network
bandwidth utilization
Distance between Nodes
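The distance between nodes can be illustrated with a toy model of the two-level rack topology (node and rack names are invented; HDFS counts hops to a common ancestor in the topology tree):

```python
# Toy model of HDFS network distance in a two-level rack topology.
# Node and rack names are invented for illustration.
RACK_OF = {
    "node1": "/rack1", "node2": "/rack1",
    "node3": "/rack2", "node4": "/rack2",
}

def distance(a, b):
    """Hops to a common ancestor: 0 = same node, 2 = same rack,
    4 = different racks (within one data centre)."""
    if a == b:
        return 0
    if RACK_OF[a] == RACK_OF[b]:
        return 2
    return 4

print(distance("node1", "node3"))  # 4: node1 and node3 sit on different racks
```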
Replica Placement (cont.)
Default Strategy:
a) The first replica is placed on the same node as the client.
b) The second replica is placed on a different rack from the first (off-rack), chosen at random.
c) The third replica is placed on the same rack as the second, but on a different node chosen at random.
d) Further replicas are placed on random nodes across the cluster.
Replica Selection - HDFS tries to satisfy a read request from the replica that is closest to the reader.
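The default strategy above can be sketched in a few lines (node and rack names are invented; real HDFS also weighs node load and capacity, which this toy version ignores):

```python
import random

# Sketch of the default replica-placement strategy for 3 replicas.
# RACK_OF and the node names are invented for illustration.
RACK_OF = {
    "node1": "/rack1", "node2": "/rack1",
    "node3": "/rack2", "node4": "/rack2",
}

def place_replicas(client_node):
    first = client_node                              # (a) same node as the client
    off_rack = [n for n in RACK_OF
                if RACK_OF[n] != RACK_OF[first]]
    second = random.choice(off_rack)                 # (b) off-rack, at random
    same_rack = [n for n in RACK_OF
                 if RACK_OF[n] == RACK_OF[second] and n != second]
    third = random.choice(same_rack)                 # (c) second's rack, different node
    return [first, second, third]

placement = place_replicas("node1")
print(placement)
```

This layout tolerates the loss of an entire rack while keeping two of the three replicas on one rack, which limits cross-rack write traffic.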
FileSystem Image and Edit Logs
 fsimage file is a persistent checkpoint of the filesystem metadata
 When a client performs a write operation, it is first recorded in the edit log.
 The namenode also has an in-memory representation of the filesystem metadata,
which it updates after the edit log has been modified
 Secondary NameNode is used to produce checkpoints of the primary’s in-memory
filesystem metadata
FileSystem Image Structure
<FS_IMAGE>
<IMAGE_VERSION>-47</IMAGE_VERSION>
<NAMESPACE_ID>415263518</NAMESPACE_ID>
<GENERATION_STAMP>1000</GENERATION_STAMP>
<GENERATION_STAMP_V2>6953</GENERATION_STAMP_V2>
<GENERATION_STAMP_V1_LIMIT>0</GENERATION_STAMP_V1_LIMIT>
<LAST_ALLOCATED_BLOCK_ID>1073747777</LAST_ALLOCATED_BLOCK_ID>
<TRANSACTION_ID>62957</TRANSACTION_ID>
<LAST_INODE_ID>24606</LAST_INODE_ID>
<SNAPSHOT_COUNTER>0</SNAPSHOT_COUNTER>
<NUM_SNAPSHOTS_TOTAL>0</NUM_SNAPSHOTS_TOTAL>
<IS_COMPRESSED>false</IS_COMPRESSED>
<INODES NUM_INODES="1076">
<INODE>
<INODE_PATH>/</INODE_PATH>
<INODE_ID>16385</INODE_ID>
<REPLICATION>0</REPLICATION>
<MODIFICATION_TIME>2014-10-20 16:35</MODIFICATION_TIME>
<ACCESS_TIME>1970-01-01 05:30</ACCESS_TIME>
<BLOCK_SIZE>0</BLOCK_SIZE>
<BLOCKS NUM_BLOCKS="-1"></BLOCKS>
<NS_QUOTA>9223372036854775807</NS_QUOTA>
<DS_QUOTA>-1</DS_QUOTA>
<IS_SNAPSHOTTABLE_DIR>true</IS_SNAPSHOTTABLE_DIR>
<PERMISSIONS>
<USER_NAME>hduser</USER_NAME>
<GROUP_NAME>supergroup</GROUP_NAME>
<PERMISSION_STRING>rwxrwxrwx</PERMISSION_STRING>
</PERMISSIONS>
</INODE>
<SNAPSHOTS NUM_SNAPSHOTS="0">
<SNAPSHOT_QUOTA>0</SNAPSHOT_QUOTA>
</SNAPSHOTS>
<INODE>
<INODE_PATH>/data_in/stock1gbdata</INODE_PATH>
<INODE_ID>24568</INODE_ID>
<REPLICATION>3</REPLICATION>
<MODIFICATION_TIME>2014-10-28 15:58</MODIFICATION_TIME>
<ACCESS_TIME>2014-10-28 15:58</ACCESS_TIME>
<BLOCK_SIZE>134217728</BLOCK_SIZE>
<BLOCKS NUM_BLOCKS="81">
<BLOCK>
<BLOCK_ID>1073747677</BLOCK_ID>
<NUM_BYTES>134217670</NUM_BYTES>
<GENERATION_STAMP>6853</GENERATION_STAMP>
</BLOCK>
<BLOCK>
<BLOCK_ID>1073747678</BLOCK_ID>
<NUM_BYTES>134217646</NUM_BYTES>
<GENERATION_STAMP>6854</GENERATION_STAMP>
</BLOCK>
</BLOCKS>
</INODE>
</INODES>
<INODES_UNDER_CONSTRUCTION NUM_INODES_UNDER_CONSTRUCTION="0"></INODES_UNDER_CONSTRUCTION>
<CURRENT_DELEGATION_KEY_ID>0</CURRENT_DELEGATION_KEY_ID>
<DELEGATION_KEYS NUM_DELEGATION_KEYS="0"></DELEGATION_KEYS>
<DELEGATION_TOKEN_SEQUENCE_NUMBER>0</DELEGATION_TOKEN_SEQUENCE_NUMBER>
<DELEGATION_TOKENS NUM_DELEGATION_TOKENS="0"></DELEGATION_TOKENS>
</FS_IMAGE>
Safe Mode
 On start-up, NameNode loads its image file (fsimage) into memory and applies the edits from the edit
log (edits).
 It performs the checkpointing process itself, without recourse to the Secondary NameNode.
 While doing so, the NameNode runs in safe mode (offering clients only a read-only view of the file system)
 The locations of blocks in the system are not persisted by the NameNode - this information resides with
the DataNodes, in the form of a list of the blocks it is storing.
 Safe mode is needed to give the DataNodes time to check in to the NameNode with their block lists
 Safe mode is exited when the minimal replication condition is reached, plus an extension time of 30
seconds.
Administration
 HDFS Trash
 HDFS Quotas
 Safe Mode
 FS Shell
 dfsadmin Command
HDFS Trash – Recycle Bin
When a file is deleted by a user, it is not immediately removed from HDFS; HDFS moves it into the /trash directory.
A file remains in /trash for a configurable amount of time. After its life in /trash expires, the NameNode deletes the
file from the HDFS namespace.
Undelete a file: the user navigates to the /trash directory and retrieves the file with the mv command.
File : core-site.xml
Property : fs.trash.interval
Description : Number of minutes after which the checkpoint gets deleted.
File : core-site.xml
Property : fs.trash.checkpoint.interval
Description : Number of minutes between trash checkpoints. Should be smaller than or equal to fs.trash.interval.
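The two trash properties above would appear in core-site.xml roughly as follows (an illustrative fragment inside the <configuration> element; the interval values are examples, not recommendations):

```xml
<!-- Illustrative core-site.xml fragment; values are examples only. -->
<property>
  <name>fs.trash.interval</name>
  <value>1440</value>   <!-- keep deleted files in trash for 24 hours -->
</property>
<property>
  <name>fs.trash.checkpoint.interval</name>
  <value>60</value>     <!-- create a trash checkpoint every hour -->
</property>
```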
HDFS Quotas
Name Quota - a hard limit on the number of file and directory names in the tree rooted at that directory.
Space Quota - a hard limit on the number of bytes used by files in the tree rooted at that directory.
Reporting Quota - the count command of the HDFS shell reports quota values and the current count of names and
bytes in use. With the -q option, it also reports the name quota set on each directory, the remaining name quota,
the space quota set, and the remaining space quota.
fs -count -q <directory>...
dfsadmin -setQuota <N> <directory>... Set the name quota to be N for each directory.
dfsadmin -clrQuota <directory>... Remove any name quota for each directory.
dfsadmin -setSpaceQuota <N> <directory>... Set the space quota to be N bytes for each directory.
dfsadmin -clrSpaceQuota <directory>... Remove any space quota for each directory.
DfsAdmin Command
 bin/hadoop dfsadmin [Generic Options] [Command Options]
-safemode enter | leave | get | wait
Safe mode maintenance command. Safe mode can also be entered manually, but then it can only be
turned off manually as well.
-report Reports basic filesystem information and statistics.
-refreshNodes Re-read the hosts and exclude files to update the set of Datanodes that are allowed to connect to the
Namenode and those that should be decommissioned or recommissioned.
-metasave filename Save Namenode's primary data structures to filename in the directory specified by hadoop.log.dir
property. filename is overwritten if it exists. filename will contain one line for each of the following
1. Datanodes heart beating with Namenode
2. Blocks waiting to be replicated
3. Blocks currently being replicated
4. Blocks waiting to be deleted
FS Shell – Some Basic Commands
 cat
 hadoop fs -cat URI [URI …]
 Copies source paths to stdout.
 chgrp
 hadoop fs -chgrp [-R] GROUP URI [URI …]
 Change group association of files. With -R, make the change recursively through the directory structure.
chmod
hadoop fs -chmod [-R] <MODE> URI [URI …] (e.g., hadoop fs -chmod -R 777 hdfs://nn1.example.com/file1)
Change the permissions of files. With -R, make the change recursively through the directory structure.
copyFromLocal / put
hadoop fs -copyFromLocal <localsrc> URI
Copy single src, or multiple srcs from local file system to the destination filesystem
copyToLocal / get
hadoop fs -copyToLocal URI <localdst>
Copy files to the local file system.
FS Shell – Commands Continued…
 expunge
 hadoop fs -expunge
 Empty the Trash.
 mkdir
 hadoop fs -mkdir <paths>
 Takes path uri's as argument and creates directories.
rmr
 hadoop fs -rmr /user/hadoop/dir
 Recursive version of delete.
touchz
 hadoop fs -touchz <pathname>
 Create a file of zero length.
 du
 hadoop fs -du URI [URI …]
 Displays aggregate length of files contained in the directory, or the length of a file in case it is just a file.
Modes: Local Standalone, Pseudo-Distributed, Fully Distributed
 Local Standalone (Non-distributed)
• All Hadoop daemons run as a single Java process on a single system
• Useful for debugging
 Pseudo Distributed
• Daemons run on a single-node
• Each Hadoop daemon runs in a separate Java process
 Fully Distributed
• Master-Slave architecture
• One machine is designated as the NameNode and another as the JobTracker (both can run on the same machine)
• The rest of the machines in the cluster act as both DataNode and TaskTracker
Hadoop cluster setup steps:
(i) Create dedicated user & group
(ii) Establish authentication among nodes
(iii) Create Hadoop folder
(iv) Hadoop configuration
(v) Remote-copy Hadoop folder to slave nodes
(vi) Start Hadoop cluster
(vii) Test Hadoop
(viii) Run a simple WordCount program
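The WordCount program run in the final step boils down to a map phase and a reduce phase; a minimal pure-Python sketch of that logic (not Hadoop API code) is:

```python
from collections import defaultdict

# Minimal pure-Python sketch of WordCount's logic (not Hadoop API code):
# a map phase emits (word, 1) pairs, a shuffle groups them by key,
# and a reduce phase sums the counts per word.
def map_phase(line):
    return [(word, 1) for word in line.split()]

def reduce_phase(pairs):
    counts = defaultdict(int)
    for word, n in pairs:  # shuffle + reduce: group by key, sum the ones
        counts[word] += n
    return dict(counts)

lines = ["big data big cluster", "data cluster data"]
pairs = [p for line in lines for p in map_phase(line)]
print(reduce_phase(pairs))  # {'big': 2, 'data': 3, 'cluster': 2}
```

On a real cluster, map tasks run in parallel on the nodes holding the input blocks, and the framework performs the shuffle between the map and reduce phases.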