Hadoop HDFS
Topics of the Day
• What is Big Data?
• Limitations of the existing solutions
• Solving the problem with Hadoop
• Introduction to Hadoop
• Hadoop Eco-System
• Hadoop Core Components
• HDFS Architecture
• Anatomy of a File Write and Read
What is Big Data?
• Lots of data (terabytes or petabytes)
• Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.
• The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.
[Word cloud: cloud, tools, statistics, NoSQL, compression, storage, support, database, analyze, information, terabytes, mobile, processing, Big Data]
What is Big Data?
• Systems and enterprises generate huge amounts of data, from terabytes to petabytes of information.
NYSE generates about one terabyte of new trade data per day to perform stock-trading analytics to determine trends for optimal trades.

Unstructured Data is Exploding
IBM’s Definition – Big Data Characteristics
http://www-01.ibm.com/software/data/bigdata/
• Three characteristics: Volume, Velocity, Variety
• Examples: web logs, videos, images, audio, sensor data
Common Big Data Customer Scenarios
• Web and e-tailing
– Recommendation Engines
– Ad Targeting
– Search Quality
– Abuse and Click Fraud Detection
• Telecommunications
– Customer Churn Prevention
– Network Performance Optimization
– Call Detail Record (CDR) Analysis
– Analyzing Networks to Predict Failure
Common Big Data Customer Scenarios (Contd.)
• Government
– Fraud Detection and Cyber Security
– Welfare Schemes
– Justice
• Healthcare & Life Sciences
– Health Information Exchange
– Gene Sequencing
– Serialization
– Healthcare Service Quality Improvements
– Drug Safety
Common Big Data Customer Scenarios (Contd.)
• Banks and Financial Services
– Modeling True Risk
– Threat Analysis
– Fraud Detection
– Trade Surveillance
– Credit Scoring and Analysis
• Retail
– Point-of-Sale Transaction Analysis
– Customer Churn Analysis
– Sentiment Analysis
Hidden Treasure
Case Study: Sears Holding Corporation
*Sears was using traditional systems such as Oracle Exadata, Teradata, and SAS to store and process customer activity and sales data.
• Insight into data can provide business advantage.
• Some key early indicators can mean fortunes to a business.
• More precise analysis is possible with more data.
Limitations of Existing Data Analytics Architecture

[Diagram: instrumentation and collection (mostly append) feed a storage-only grid holding the original raw data; an ETL compute grid moves aggregated data into an RDBMS, which serves BI reports and interactive apps. About 90% of the ~2 PB is archived, so a meagre 10% of the ~2 PB of data is available for BI.]

The three limitations:
1. Can’t explore the original high-fidelity raw data.
2. Moving data to compute doesn’t scale.
3. Premature data death: data is archived before its value is extracted.

Source: http://www.informationweek.com/it-leadership/why-sears-is-going-all-in-on-hadoop/d/d-id/1107038?
Solution: A Combined Storage + Compute Layer

*Sears moved to a 300-node Hadoop cluster to keep 100% of its data available for processing, rather than the meagre 10% available with the existing non-Hadoop solutions.

[Diagram: instrumentation and collection (mostly append) feed Hadoop, a combined storage + compute grid; the entire ~2 PB of data is available for processing, with no data archiving. Aggregated data still flows to an RDBMS for BI reports and interactive apps.]

The three gains:
1. Data exploration and advanced analytics.
2. Scalable throughput for ETL and aggregation.
3. Data is kept alive forever.
Hadoop Differentiating Factors
• Accessible
• Robust
• Simple
• Scalable
Hadoop – It’s about Scale and Structure

RDBMS / EDW / MPP / NoSQL vs. Hadoop:
• Data types: structured vs. multi-structured and unstructured
• Processing: limited or no data processing vs. processing coupled with data
• Governance: standards and structured vs. loosely structured
• Schema: required on write vs. required on read
• Cost: software license vs. support only
• Resources: a known entity vs. growing, complex, and wide
• Best-fit use: interactive OLAP analytics and operational data store vs. massive storage/processing
Why DFS?

Read 1 TB of data:
• 1 machine with 4 I/O channels, each channel at 100 MB/s: 45 minutes
• 10 machines, each with 4 I/O channels at 100 MB/s: 4.5 minutes
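The arithmetic behind the slide’s figures, as a quick check (reading 1 TB as $2^{40}$ bytes, which is how the slide’s rounding works out):

$t_{1\,\text{machine}} = \dfrac{2^{40}\ \text{B}}{4 \times 100\ \text{MB/s}} \approx \dfrac{1.1 \times 10^{12}\ \text{B}}{4 \times 10^{8}\ \text{B/s}} \approx 2749\ \text{s} \approx 45\ \text{minutes}$

$t_{10\,\text{machines}} = t_{1\,\text{machine}} / 10 \approx 275\ \text{s} \approx 4.5\ \text{minutes}$

Spreading the file across 10 machines multiplies the aggregate read bandwidth by 10, so the time drops by 10. That parallel-read win is exactly the case a distributed file system is built for.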
What is Hadoop?
• Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model.
• It is an open-source data management platform with scale-out storage and distributed processing.
Hadoop Eco-System
• HDFS (Hadoop Distributed File System): storage
• MapReduce Framework: distributed processing
• HBase: NoSQL database on top of HDFS
• Pig Latin: data analysis
• Hive: data-warehouse (DW) system
• Mahout: machine learning
• Apache Oozie: workflow
• Flume: import/export of unstructured or semi-structured data
• Sqoop: import/export of structured data
Hadoop Core Components

[Diagram: the MapReduce engine (a JobTracker on the admin node driving TaskTrackers) layered on an HDFS cluster (a NameNode on the admin node and DataNodes on the workers); each worker node runs a DataNode paired with a TaskTracker.]
Hadoop Core Components

Hadoop is a system for large-scale data processing. It has two main components:
• HDFS – Hadoop Distributed File System (Storage)
– Distributed across “nodes”
– Natively redundant
– NameNode tracks locations
• MapReduce (Processing)
– Splits a task across processors, “near” the data, and assembles the results
– Self-healing, high bandwidth
– Clustered storage
– JobTracker manages the TaskTrackers
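To make the “simple programming model” concrete, here is the classic WordCount job written against the Hadoop 1.x org.apache.hadoop.mapreduce API; a minimal, self-contained sketch, with input and output paths taken from the command line:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: runs "near" the data, on each input split; emits (word, 1).
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: assembles the results by summing the counts per word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each node
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}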
HDFS Definition

• The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware.
• HDFS is a distributed, scalable, and portable filesystem written in Java for the Hadoop framework. It has many similarities with existing distributed file systems.
• HDFS is the primary storage system used by Hadoop applications. It creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid computations.
• HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware.
• HDFS provides high-throughput access to application data and is suitable for applications that have large data sets.
• HDFS consists of the following components (daemons):
– Name Node
– Data Node
– Secondary Name Node
HDFS Components

NameNode:
The NameNode, a master server, manages the file system namespace and regulates access to files by clients. It maintains and manages the blocks present on the DataNodes.
• Meta-data in Memory
– The entire metadata is in main memory
• Types of Metadata
– List of files
– List of blocks for each file
– List of DataNodes for each block
– File attributes, e.g. creation time, replication factor
• A Transaction Log
– Records file creations, file deletions, etc.
HDFS Components

Data Node: slaves which are deployed on each machine and provide the actual storage. DataNodes, one per node in the cluster, manage the storage attached to the nodes they run on.
• A Block Server
– Stores data in the local file system (e.g. ext3)
– Stores meta-data of a block (e.g. CRC)
– Serves data and meta-data to clients
– Periodically sends a Block Report of all existing blocks to the NameNode
• Facilitates Pipelining of Data
– Forwards data to other specified DataNodes
Understanding the File System

Block placement
• Current strategy (see the sketch below):
– One replica on the local node
– Second replica on a remote rack
– Third replica on the same remote rack
– Additional replicas are randomly placed
• Clients read from the nearest replica
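Block placement is visible through the public FileSystem API. A minimal sketch (the file path is illustrative; any existing HDFS file works) that prints which hosts hold each block’s replicas:

import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockPlacement {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Illustrative path.
    FileStatus status = fs.getFileStatus(new Path("/user/training/hadoop/sample.txt"));
    // One BlockLocation per block, covering the whole file.
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (int i = 0; i < blocks.length; i++) {
      System.out.println("block " + i + " replicas on "
          + Arrays.toString(blocks[i].getHosts()));
    }
    fs.close();
  }
}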
Data Correctness
• Use checksums to validate data
– Use CRC32
• File creation
– Client computes a checksum per 512 bytes
– DataNode stores the checksum
• File access
– Client retrieves the data and checksum from the DataNode
– If validation fails, the client can read from another replica
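The idea in code; this is a minimal sketch of per-chunk CRC32 checksumming with the 512-byte chunk size from the slide, not HDFS’s internal implementation:

import java.util.zip.CRC32;

public class ChunkChecksums {

  static final int BYTES_PER_CHECKSUM = 512; // one checksum per 512-byte chunk

  // Compute one CRC32 per 512-byte chunk of the buffer.
  static long[] checksums(byte[] data) {
    int chunks = (data.length + BYTES_PER_CHECKSUM - 1) / BYTES_PER_CHECKSUM;
    long[] crcs = new long[chunks];
    CRC32 crc = new CRC32();
    for (int i = 0; i < chunks; i++) {
      int off = i * BYTES_PER_CHECKSUM;
      int len = Math.min(BYTES_PER_CHECKSUM, data.length - off);
      crc.reset();
      crc.update(data, off, len);
      crcs[i] = crc.getValue();
    }
    return crcs;
  }

  public static void main(String[] args) {
    byte[] block = new byte[2048];     // stand-in for block data
    long[] stored = checksums(block);  // computed at write time, stored with the block
    block[100] ^= 1;                   // simulate on-disk corruption
    long[] current = checksums(block); // recomputed at read time
    for (int i = 0; i < stored.length; i++) {
      if (stored[i] != current[i]) {
        System.out.println("chunk " + i + " failed validation; read another replica");
      }
    }
  }
}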
Understanding the File System

Data pipelining
• Client retrieves a list of DataNodes on which to place replicas of a block
• Client writes the block to the first DataNode
• The first DataNode forwards the data to the next DataNode in the pipeline
• When all replicas are written, the client moves on to write the next block in the file

Rebalancer
• Goal: the percentage of disk occupied should be similar across DataNodes
– Usually run when new DataNodes are added
– The cluster stays online while the Rebalancer is active
– The Rebalancer is throttled to avoid network congestion
– It is a command-line tool
Secondary NameNode
• Not a hot standby for the NameNode
• Connects to the NameNode every hour (the default checkpoint period; configurable via fs.checkpoint.period)
• Housekeeping, backup of NameNode metadata
• Saved metadata can be used to rebuild a failed NameNode
HDFS Architecture

[Diagram: the NameNode holds the metadata (name, replicas, …) and handles metadata ops from clients; clients read blocks directly from DataNodes and write through a pipeline of DataNodes spread across Rack 1 and Rack 2; the NameNode sends block ops to DataNodes and directs replication between them.]
Anatomy of a File Write

[Diagram: HDFS client, DistributedFileSystem, NameNode, and a pipeline of DataNodes.]
1. The client calls create on the DistributedFileSystem.
2. The DistributedFileSystem asks the NameNode to create the file.
3. The client writes data to the output stream.
4. Write packets flow down the pipeline of DataNodes.
5. Ack packets flow back up the pipeline.
6. The client closes the stream.
7. Complete is reported to the NameNode.
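From the client’s side, the same seven steps reduce to a few FileSystem calls (the pipeline and acks happen inside the stream). A minimal sketch, with an illustrative namenode URI and path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "hdfs://localhost:8020/"); // illustrative namenode URI
    FileSystem fs = FileSystem.get(conf);                  // the DistributedFileSystem
    // Steps 1-2: create() asks the NameNode to create the file.
    FSDataOutputStream out = fs.create(new Path("/user/training/hello.txt"));
    // Steps 3-5: writes are packetized and pushed down the DataNode pipeline.
    out.writeUTF("hello, hdfs");
    // Steps 6-7: close flushes the remaining packets and completes the file.
    out.close();
    fs.close();
  }
}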
Anatomy of a File Read

[Diagram: HDFS client (in the client JVM on the client node), DistributedFileSystem, FSDataInputStream, NameNode, and DataNodes.]
1. The client calls open on the DistributedFileSystem.
2. The DistributedFileSystem gets the block locations from the NameNode.
3. The client reads from the FSDataInputStream.
4.–5. The stream reads each block in turn from the nearest DataNode holding a replica.
6. The client closes the stream.
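The matching client-side sketch for a read (same illustrative URI and path as the write example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "hdfs://localhost:8020/"); // illustrative namenode URI
    FileSystem fs = FileSystem.get(conf);                  // the DistributedFileSystem
    // Steps 1-2: open() fetches the block locations from the NameNode.
    FSDataInputStream in = fs.open(new Path("/user/training/hello.txt"));
    // Steps 3-5: reads are served from the nearest DataNode for each block.
    System.out.println(in.readUTF());
    // Step 6: close the stream.
    in.close();
    fs.close();
  }
}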
Replication and Rack Awareness
Hadoop Cluster Architecture (Contd.)

[Diagram: a client talks to the HDFS NameNode and to the MapReduce JobTracker; each worker node in the cluster runs a DataNode paired with a TaskTracker.]
Hadoop Cluster: A Typical Use Case

Name Node
RAM: 64 GB
Hard disk: 1 TB
Processor: Xeon with 8 cores
Ethernet: 3 × 10 GB/s
OS: 64-bit CentOS
Power: Redundant power supply

Secondary Name Node
RAM: 32 GB
Hard disk: 1 TB
Processor: Xeon with 4 cores
Ethernet: 3 × 10 GB/s
OS: 64-bit CentOS
Power: Redundant power supply

Data Node (per worker)
RAM: 16 GB
Hard disk: 6 × 2 TB
Processor: Xeon with 2 cores
Ethernet: 3 × 10 GB/s
OS: 64-bit CentOS
Hadoop Cluster: Facebook
http://wiki.apache.org/hadoop/PoweredBy (this and other production deployments)
Hadoop Cluster Modes

Hadoop can run in any of the following three modes:
Standalone (or Local) Mode
• No daemons; everything runs in a single JVM.
• Suitable for running MapReduce programs during development.
• Has no DFS.
Pseudo-Distributed Mode
• Hadoop daemons run on the local machine.
Fully-Distributed Mode
• Hadoop daemons run on a cluster of machines.
Hadoop 1.x: Core Configuration Files
• Core: core-site.xml
• HDFS: hdfs-site.xml
• MapReduce: mapred-site.xml
Hadoop Configuration Files

hadoop-env.sh: Environment variables that are used in the scripts to run Hadoop.
core-site.xml: Configuration settings for Hadoop Core, such as I/O settings that are common to HDFS and MapReduce.
hdfs-site.xml: Configuration settings for the HDFS daemons: the namenode, the secondary namenode, and the datanodes.
mapred-site.xml: Configuration settings for the MapReduce daemons: the jobtracker and the tasktrackers.
masters: A list of machines (one per line) that each run a secondary namenode.
slaves: A list of machines (one per line) that each run a datanode and a tasktracker.
core-site.xml and hdfs-site.xml

hdfs-site.xml:

<?xml version="1.0"?>
<!--hdfs-site.xml-->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

core-site.xml:

<?xml version="1.0"?>
<!--core-site.xml-->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:8020/</value>
  </property>
</configuration>
Defining HDFS Details in hdfs-site.xml

dfs.data.dir
Value: /disk1/hdfs/data, /disk2/hdfs/data
Default: ${hadoop.tmp.dir}/dfs/data
Description: A list of directories where the datanode stores blocks. Each block is stored in only one of these directories.

fs.checkpoint.dir
Value: /disk1/hdfs/namesecondary, /disk2/hdfs/namesecondary
Default: ${hadoop.tmp.dir}/dfs/namesecondary
Description: A list of directories where the secondary namenode stores checkpoints. It stores a copy of the checkpoint in each directory in the list.
mapred-site.xml

<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>
</configuration>
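To see how these settings are consumed, here is a minimal sketch using Hadoop’s Configuration class; the resource path is illustrative, and files already on the classpath load automatically:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class ShowJobTracker {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // core-site.xml etc. on the classpath are loaded automatically;
    // additional files can be added explicitly (illustrative path).
    conf.addResource(new Path("/etc/hadoop/conf/mapred-site.xml"));
    // Falls back to "local" (in-process jobtracker) when the property is unset.
    System.out.println("mapred.job.tracker = " + conf.get("mapred.job.tracker", "local"));
  }
}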
Defining mapred-site.xml

mapred.job.tracker
Value: localhost:8021
Description: The hostname and the port that the jobtracker RPC server runs on. If set to the default value of local, then the jobtracker runs in-process on demand when you run a MapReduce job.

mapred.local.dir
Value: ${hadoop.tmp.dir}/mapred/local
Description: A list of directories where MapReduce stores intermediate data for jobs. The data is cleared out when the job ends.

mapred.system.dir
Value: ${hadoop.tmp.dir}/mapred/system
Description: The directory relative to fs.default.name where shared files are stored during a job run.

mapred.tasktracker.map.tasks.maximum
Value: 2
Description: The number of map tasks that may be run on a tasktracker at any one time.

mapred.tasktracker.reduce.tasks.maximum
Value: 2
Description: The number of reduce tasks that may be run on a tasktracker at any one time.
All Properties
1. http://hadoop.apache.org/docs/r1.1.2/core-default.html
2. http://hadoop.apache.org/docs/r1.1.2/mapred-default.html
3. http://hadoop.apache.org/docs/r1.1.2/hdfs-default.html
Slaves and Masters

Two files are used by the startup and shutdown commands:

Masters
• Contains a list of hosts, one per line, that are to host Secondary NameNode servers.

Slaves
• Contains a list of hosts, one per line, that are to host DataNode and TaskTracker servers.
Per-Process Runtime Environment: hadoop-env.sh

• hadoop-env.sh is sourced by all of the Hadoop Core scripts provided in the conf/ directory of the installation.
• It also offers a way to provide custom parameters for each of the servers.
• Set the JAVA_HOME parameter here so the scripts can locate the JVM.
• Examples of environment variables that you can specify:

export HADOOP_DATANODE_HEAPSIZE="128"
export HADOOP_TASKTRACKER_HEAPSIZE="512"
Web UI URLs
• NameNode status: http://localhost:50070/dfshealth.jsp
• JobTracker status: http://localhost:50030/jobtracker.jsp
• TaskTracker status: http://localhost:50060/tasktracker.jsp
• DataBlock Scanner Report: http://localhost:50075/blockScannerReport
Practice HDFS Commands
Open a terminal window to the current working directory.
# /home/notroot
-----------------------------------------------------------------
# 1. Print the Hadoop version
hadoop version
-----------------------------------------------------------------
# 2. List the contents of the root directory in HDFS
hadoop fs -ls /
-----------------------------------------------------------------
# 3. Report the amount of space used and available on currently mounted filesystem
hadoop fs -df hdfs:/
-----------------------------------------------------------------
# 4. Count the number of directories, files and bytes under the paths that match the
# specified file pattern
hadoop fs -count hdfs:/
-----------------------------------------------------------------
# 5. Run a DFS filesystem checking utility
hadoop fsck /
-----------------------------------------------------------------
# 6. Run a cluster balancing utility
hadoop balancer
-----------------------------------------------------------------
# 7. Create a new directory named “hadoop” below the /user/training directory in HDFS.
hadoop fs -mkdir /user/training/hadoop
-----------------------------------------------------------------
# 8. Add a sample text file from the local directory named “data” to the new directory you created in HDFS
hadoop fs -put data/sample.txt /user/training/hadoop
-----------------------------------------------------------------
# 9. List the contents of this new directory in HDFS.
hadoop fs -ls /user/training/hadoop
-----------------------------------------------------------------
# 10. Add the entire local directory called “retail” to the /user/training directory in HDFS.
hadoop fs -put data/retail /user/training/hadoop
-----------------------------------------------------------------
# 11. Since /user/training is your home directory in HDFS, any command that does not have an absolute path is interpreted as relative to that directory.
# The next command will therefore list your home directory, and should show the items you’ve just added there.
hadoop fs -ls
-----------------------------------------------------------------
# 12. See how much space this directory occupies in HDFS.
hadoop fs -du -s -h hadoop/retail
-----------------------------------------------------------------
# 13. Delete a file ‘customers’ from the “retail” directory.
hadoop fs -rm hadoop/retail/customers
-----------------------------------------------------------------
# 14. Ensure this file is no longer in HDFS.
hadoop fs -ls hadoop/retail/customers
-----------------------------------------------------------------
# 15. Delete all files from the “retail” directory using a wildcard.
hadoop fs -rm hadoop/retail/*
-----------------------------------------------------------------
# 16. To empty the trash
hadoop fs -expunge
-----------------------------------------------------------------
# 17. Finally, remove the entire retail directory and all of its contents in HDFS.
hadoop fs -rm -r hadoop/retail
-----------------------------------------------------------------
# 18. Add the purchases.txt file from the local directory named “/home/training/” to the hadoop directory you created in HDFS
hadoop fs -copyFromLocal /home/training/purchases.txt hadoop/
-----------------------------------------------------------------
# 19. To view the contents of your text file purchases.txt which is present in your hadoop directory.
hadoop fs -cat hadoop/purchases.txt
-----------------------------------------------------------------
# 20. Move a directory from one location to other
hadoop fs -mv hadoop apache_hadoop
-----------------------------------------------------------------
# 21. Copy the purchases.txt file from the “hadoop” directory in HDFS to the “data” directory in your local filesystem
hadoop fs -copyToLocal hadoop/purchases.txt /home/training/data
# 26. Default names of owner and group are training,training
# Use ‘-chown’ to change owner name and group name simultaneously
hadoop fs -ls hadoop/purchases.txt
sudo -u hdfs hadoop fs -chown root:root hadoop/purchases.txt
-----------------------------------------------------------------
# 27. Default name of group is training
# Use ‘-chgrp’ command to change group name
hadoop fs -ls hadoop/purchases.txt
sudo -u hdfs hadoop fs -chgrp training hadoop/purchases.txt
-----------------------------------------------------------------
# 28. Default replication factor to a file is 3.
# Use ‘-setrep’ command to change replication factor of a file
hadoop fs -setrep -w 2 apache_hadoop/sample.txt
-----------------------------------------------------------------
# 29. Copy a directory from one cluster (or location) to another
# Use the ‘distcp’ command to copy (note: distcp is its own subcommand, not an ‘fs’ option)
# Use the '-overwrite' option to overwrite existing files
# Use the '-update' option to synchronize both directories
hadoop distcp hdfs://namenodeA/apache_hadoop hdfs://namenodeB/hadoop
-----------------------------------------------------------------
# 30. Command to make the name node leave safe mode
sudo -u hdfs hdfs dfsadmin -safemode leave
-----------------------------------------------------------------
# 31. List all the hadoop file system shell commands
hadoop fs
-----------------------------------------------------------------
# 33. Last but not least, always ask for help!
hadoop fs -help
Thank You