INTRODUCTION TO HADOOP AND HDFS
Contents
• Distributed System
• DFS
• Hadoop
• HDFS Architecture
• Why is it needed?
• Issues
• Mutate / lease
Operating systems
• Operating system: software that supervises and controls tasks on a computer. Types of single-machine OS:
– Batch processing → jobs are collected and placed in a queue; no interaction with a job during processing
– Time shared → computing resources are shared among multiple users; interaction with the program during execution
– Real-time (RT) systems → fast response; processing can be interrupted
Distributed Systems
• Consists of a number of computers that are connected and
managed so that they automatically share the job processing
load among the constituent computers.
• A distributed operating system is one that appears to its users as
a traditional uniprocessor system, even though it is actually
composed of multiple processors.
• It gives its users a single system view and a single point of service.
• The location of files is transparent to users. It provides a virtual computing environment.
E.g. the Internet, ATM banking networks, mobile computing networks, Global Positioning Systems and air traffic control.
A DISTRIBUTED SYSTEM IS A COLLECTION OF INDEPENDENT COMPUTERS THAT APPEARS TO ITS USERS AS A SINGLE COHERENT SYSTEM
Network Operating System
• In a network operating system the users are aware
of the existence of multiple computers.
• The operating system of each computer must provide facilities for communication and shared functionality.
• Each machine runs its own OS and has its own user.
• Remote login and file access
• Less transparent, but more independence
[Diagram: Distributed OS vs. Networked OS. In a distributed OS, the applications run on a single layer of distributed operating system services spanning all machines; in a networked OS, each application runs on its own machine's network OS.]
DFS
• Resource sharing is the motivation behind distributed systems. To share files → a file system.
• File System is responsible for the organization, storage,
retrieval, naming, sharing, and protection of files.
• The file system is responsible for controlling access to
the data and for performing low-level operations such as
buffering frequently used data and issuing disk I/O
requests
• The goal is to allow users of physically distributed
computers to share data and storage resources by
using a common file system.
Hadoop
What is Hadoop?
 A framework for running applications on large clusters of commodity hardware, used to store and process huge volumes of data
 Apache Software Foundation Project
 Open source
 Runs on clusters such as Amazon’s EC2
 An alpha (0.18) release is available for download
Hadoop includes:
 HDFS, a distributed filesystem
 Map/Reduce – Hadoop implements this programming model; it is an offline (batch) computing engine
Concept
Moving computation is more efficient than moving large
data
• Data-intensive applications with petabytes of data.
• Web pages: 20+ billion pages × 20 KB = 400+ terabytes
– One computer can read 30-35 MB/sec from disk → ~four months to read the web
– The same read with 1000 machines takes < 3 hours (a back-of-envelope check follows this list)
• Difficulty with a large number of machines
– communication and coordination
– recovering from machine failure
– status reporting
– debugging
– optimization
– locality
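A back-of-envelope check of these figures, as a small Java sketch (the 20 KB/page and 35 MB/s numbers are the slide's assumptions):

public class WebScaleMath {
  public static void main(String[] args) {
    double pages = 20e9;            // 20+ billion web pages
    double bytesPerPage = 20e3;     // ~20 KB each
    double totalBytes = pages * bytesPerPage;   // ~4e14 B = 400+ TB
    double diskRate = 35e6;         // one disk reads ~35 MB/sec
    double daysOneMachine = totalBytes / diskRate / 86400;
    System.out.printf("one machine: ~%.0f days%n", daysOneMachine);  // ~132 days, i.e. ~four months
    System.out.printf("1000 machines: ~%.1f hours%n",
        daysOneMachine * 24 / 1000);                                 // ~3.2 hours
  }
}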
FACTS
Single-thread performance doesn’t matter: we have large problems, and total throughput/price matters more than peak performance.
Stuff breaks – so reliability must be designed in:
• If you have one server, it may stay up three years (1,000 days)
• If you have 10,000 servers, expect to lose ten a day
“Ultra-reliable” hardware doesn’t really help
At large scales, super-fancy reliable hardware still fails, albeit
less often
– software still needs to be fault-tolerant
– commodity machines without fancy hardware give better
perf/price
DECISION: COMMODITY HARDWARE.
DFS: HADOOP – WHY?
HDFS – Why? Seek vs. Transfer
• CPU & transfer speed, RAM & disk size double every 18
- 24 months
• Seek time nearly constant (~5%/year)
• Time to read an entire drive is growing: capacity outpaces transfer rate.
• Moral: scalable computing must go at transfer rate
• B-Tree (relational DBs)
– operates at seek rate, log(N) seeks/access
– memory / stream based
• Sort/merge flat files (MapReduce)
– operates at transfer rate, log(N) transfers/sort
– batch based
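To see the gap concretely, compare a seek-bound access pattern with a transfer-bound one. A sketch with assumed figures (10 ms per seek, 100 MB/s sequential transfer; neither number is from the slide):

public class SeekVsTransfer {
  public static void main(String[] args) {
    double seekTime = 0.010;        // 10 ms per random seek (assumed)
    double transferRate = 100e6;    // 100 MB/s sequential (assumed)
    double dataBytes = 1e12;        // a 1 TB dataset
    double recordBytes = 100;       // accessed as small records

    // Seek-bound: one seek per record, as in random B-Tree access
    double seekSeconds = (dataBytes / recordBytes) * seekTime;     // 1e8 s
    // Transfer-bound: stream the whole dataset sequentially
    double streamSeconds = dataBytes / transferRate;               // 1e4 s

    System.out.printf("seek-bound: ~%.1f years%n", seekSeconds / 3.15e7);     // ~3 years
    System.out.printf("transfer-bound: ~%.1f hours%n", streamSeconds / 3600); // ~2.8 hours
  }
}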
HDFS ARCHITECTURE
Characteristics
• Fault-tolerant, scalable, efficient, reliable distributed storage system
• Moves computation to the place of the data
• Single cluster with computation and data.
• Process huge amounts of data.
• Scalable: store and process petabytes of data.
• Economical:
– It distributes the data and processing across clusters of
commonly available computers.
– Clusters PCs into a storage and computing platform.
– It minimises the CPU cycles and RAM required on individual machines.
• Efficient:
– By distributing the data, Hadoop can process it in parallel on the nodes where the data is located. This makes it extremely fast.
– Computation is moved to where the data resides.
• Reliable:
– Hadoop automatically maintains multiple copies of data
– Automatically redeploys computing tasks based on failures.
Each cluster node runs both DFS and MR
• Data Model
– Data is organized into files and directories
– Files are divided into uniform sized blocks and
distributed across cluster nodes
– Replicate blocks to handle hardware failure
– Checksums of data for corruption detection
and recovery
– Expose block placement so that computations can be migrated to the data
• Supports large streaming reads and small random reads
• Facility for multiple clients to append to a file
• Assumes commodity hardware that fails
– Files are replicated to handle hardware
failure
– Checksums for corruption detection and
recovery
– Continues operating as nodes / racks are added / removed
• Optimized for fast batch processing
– Data location is exposed to allow computations to move to the data
– Stores data in chunks/blocks across the nodes of the cluster
– Provides VERY high aggregate bandwidth
• Files are broken into large blocks.
– Typically 128 MB block size
– Blocks are replicated for reliability:
one replica on the local node,
another replica on a remote rack,
a third replica on the local rack,
additional replicas placed randomly
(a placement sketch follows under “Block Placement” below)
• Understands rack locality
– Data placement exposed so that computation can be
migrated to data
• Client talks to both NameNode and DataNodes
– Data is not sent through the NameNode; clients access data directly from the DataNodes
– Throughput of file system scales nearly linearly with
the number of nodes.
Block Placement
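A minimal sketch of the placement rule described above (hypothetical node/rack bookkeeping; the real policy lives inside the NameNode and also weighs capacity and load):

import java.util.*;

class ReplicaPlacement {
  // Choose target DataNodes for one block, following the rule above.
  // Assumes every rack has enough distinct nodes to satisfy the request.
  static List<String> chooseTargets(String localNode, String localRack,
                                    Map<String, List<String>> rackToNodes,
                                    int replication, Random rnd) {
    List<String> targets = new ArrayList<>();
    targets.add(localNode);                                        // replica 1: writer's node
    List<String> remoteRacks = new ArrayList<>(rackToNodes.keySet());
    remoteRacks.remove(localRack);
    String remoteRack = remoteRacks.get(rnd.nextInt(remoteRacks.size()));
    targets.add(pick(rackToNodes.get(remoteRack), targets, rnd));  // replica 2: remote rack
    if (replication >= 3)
      targets.add(pick(rackToNodes.get(localRack), targets, rnd)); // replica 3: local rack
    List<String> all = new ArrayList<>();
    rackToNodes.values().forEach(all::addAll);
    while (targets.size() < replication)
      targets.add(pick(all, targets, rnd));                        // extras: random nodes
    return targets;
  }

  private static String pick(List<String> candidates, List<String> used, Random rnd) {
    List<String> fresh = new ArrayList<>(candidates);
    fresh.removeAll(used);
    return fresh.get(rnd.nextInt(fresh.size()));
  }
}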
Hadoop Cluster Architecture:
Components
• DFS Master “Namenode”
– Manages the file system namespace
– Controls read/write access to files
– Manages block replication
– Checkpoints namespace and journals
namespace changes for reliability
Metadata of the NameNode in memory
– The entire metadata is in main memory
– No demand paging of FS metadata
Types of Metadata:
List of files, file and chunk namespaces; list of
blocks, location of replicas; file attributes etc.
DFS SLAVES or DATA NODES
• Serve read/write requests from clients
• Perform replication tasks upon instruction by
namenode
Data nodes act as:
1) A Block Server
– Stores data in the local file system
– Stores metadata of a block (e.g. CRC)
– Serves data and metadata to Clients
2) Block Report: Periodically sends a report of all
existing blocks to the NameNode
3) Periodically sends heartbeat to NameNode (detect
node failures)
4) Facilitates Pipelining of Data (to other specified
DataNodes)
• Map/Reduce Master “Jobtracker”
– Accepts MR jobs submitted by users
– Assigns Map and Reduce tasks to Tasktrackers
– Monitors task and Tasktracker status; re-executes tasks upon failure
• Map/Reduce Slaves “Tasktrackers”
– Run Map and Reduce tasks upon instruction
from the Jobtracker
– Manage storage and transmission of
intermediate output.
SECONDARY NAME NODE
• Copies the FsImage and Transaction Log from the NameNode to a temporary directory
• Merges the FsImage and Transaction Log into a new FsImage in the temporary directory
• Uploads the new FsImage to the NameNode
– Transaction Log on NameNode is purged
HDFS Architecture
• NameNode: maps (filename, offset) → block id, and block id → DataNode
• DataNode: maps block id → location on local disk
• Secondary NameNode: periodically merges edit logs
A block is also called a chunk.
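Conceptually, the two NameNode mappings and the lookup they enable can be sketched with plain Java maps (illustration only, not HDFS code):

import java.util.*;

class NameNodeMetadata {
  // filename -> ordered list of block ids (offset / blockSize picks the index)
  Map<String, List<Long>> fileToBlocks = new HashMap<>();
  // block id -> DataNodes currently holding a replica
  Map<Long, Set<String>> blockToDataNodes = new HashMap<>();

  // Resolve (filename, byte offset) to the DataNodes serving that block.
  Set<String> locate(String file, long offset, long blockSize) {
    long blockId = fileToBlocks.get(file).get((int) (offset / blockSize));
    return blockToDataNodes.get(blockId);
  }
}
// The third mapping is kept by each DataNode locally: block id -> file on its disk.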
JOBTRACKER, TASKTRACKER AND JOBCLIENT
HDFS API
• Most common file and directory operations
supported:
– Create, open, close, read, write, seek, list,
delete etc.
• Files are write-once and have exactly one writer at a time
• Some operations peculiar to HDFS:
– set replication, get block locations
• Support for owners, permissions
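A minimal sketch of these operations against Hadoop's Java FileSystem API (the path /demo/hello.txt is hypothetical; assumes a configured, reachable cluster):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;

public class HdfsApiDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);
    Path p = new Path("/demo/hello.txt");       // hypothetical path

    FSDataOutputStream out = fs.create(p);      // create: one writer, write-once
    out.writeUTF("hello hdfs");
    out.close();

    FSDataInputStream in = fs.open(p);          // open, seek, read
    in.seek(0);
    System.out.println(in.readUTF());
    in.close();

    fs.setReplication(p, (short) 2);            // HDFS-specific: set replication
    FileStatus st = fs.getFileStatus(p);
    for (BlockLocation b :                      // HDFS-specific: get block locations
         fs.getFileBlockLocations(st, 0, st.getLen()))
      System.out.println(b);

    for (FileStatus f : fs.listStatus(new Path("/demo")))   // list
      System.out.println(f.getPath());
    fs.delete(p, false);                        // delete (non-recursive)
  }
}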
DATA CORRECTNESS
• Use Checksums to validate data
– Use CRC32
• File creation
– Client computes a checksum per 512 bytes
– DataNode stores the checksums
• File access
– Client retrieves the data and checksums from the DataNode
– If validation fails, the client tries other replicas
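A sketch of the per-512-byte CRC32 scheme just described (illustrative; the DataNode keeps the checksums in per-block metadata alongside the data):

import java.util.zip.CRC32;

class ChunkChecksum {
  static final int CHUNK = 512;             // bytes covered by each checksum

  // Client side at file creation: one CRC32 per 512-byte chunk.
  static long[] compute(byte[] data) {
    int n = (data.length + CHUNK - 1) / CHUNK;
    long[] sums = new long[n];
    CRC32 crc = new CRC32();
    for (int i = 0; i < n; i++) {
      crc.reset();
      int off = i * CHUNK;
      crc.update(data, off, Math.min(CHUNK, data.length - off));
      sums[i] = crc.getValue();             // stored by the DataNode
    }
    return sums;
  }
  // On access, the client recomputes each chunk's CRC and compares it with the
  // stored value; on a mismatch it retries the read from another replica.
}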
MUTATION ORDER AND LEASES
• A mutation is an operation that changes the contents or metadata of a chunk, such as an append or write.
• Each mutation is performed at all replicas.
• Leases (which fix the order of mutations) are used to maintain consistency.
• Master grants chunk lease to one replica
(primary)
• Primary picks the serial order for all mutations to
the chunk
• All replicas follow this order (consistency)
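A conceptual sketch of lease-based ordering (hypothetical types; this mirrors the GFS-style scheme the slide describes, not actual HDFS internals):

import java.util.*;

interface Replica { void apply(long serial, long offset, byte[] data); }

class PrimaryReplica {                       // the replica holding the lease
  private long nextSerial = 0;
  private final List<Replica> secondaries;

  PrimaryReplica(List<Replica> secondaries) { this.secondaries = secondaries; }

  // The primary picks one serial order; every replica applies mutations in it.
  synchronized void mutate(long offset, byte[] data) {
    long serial = nextSerial++;              // serial order chosen by the primary
    applyLocally(serial, offset, data);
    for (Replica r : secondaries)
      r.apply(serial, offset, data);         // all replicas follow the same order
  }

  private void applyLocally(long serial, long offset, byte[] data) {
    /* write the mutation to the local chunk */
  }
}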
Software Model
• Parallel programming improves performance and
efficiency.
• In a parallel program, the processing is broken up into
parts, each of which can be executed concurrently
• Identify whether the problem can be parallelised (e.g., a naive Fibonacci recurrence cannot, since each term depends on the previous ones)
• Matrix operations with independent sub-computations parallelise well
Master/Worker
• The MASTER:
– initializes the array and splits it up according
to the number of available WORKERS
– sends each WORKER its subarray
– receives the results from each WORKER
• The WORKER:
– receives the subarray from the MASTER
– performs processing on the subarray
– returns results to MASTER
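A minimal Java sketch of this master/worker pattern, summing an array that the master splits across a thread pool (the four-worker count and the sum operation are illustrative choices):

import java.util.*;
import java.util.concurrent.*;

public class MasterWorker {
  public static void main(String[] args) throws Exception {
    int[] data = new int[1_000_000];
    Arrays.fill(data, 1);
    int workers = 4;
    ExecutorService pool = Executors.newFixedThreadPool(workers);
    List<Future<Long>> results = new ArrayList<>();

    int slice = data.length / workers;
    for (int w = 0; w < workers; w++) {              // MASTER: split and send
      final int from = w * slice;
      final int to = (w == workers - 1) ? data.length : from + slice;
      results.add(pool.submit(() -> {                // WORKER: process the subarray
        long sum = 0;
        for (int i = from; i < to; i++) sum += data[i];
        return sum;                                  // WORKER: return the result
      }));
    }
    long total = 0;
    for (Future<Long> f : results) total += f.get(); // MASTER: receive results
    pool.shutdown();
    System.out.println(total);                       // 1000000
  }
}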
The area of the square: As = (2r)^2 = 4r^2.
The area of the circle: Ac = pi * r^2.
• pi = Ac / r^2
• As = 4r^2, so r^2 = As / 4
• pi = 4 * Ac / As
• pi ≈ 4 * (no. of points in the circle) / (no. of points in the square)
CALCULATING PI
• Randomly generate points in the square
• Count how many of the generated points fall inside the circle → MAP
(each map task computes ra = no. of points in the circle / no. of points in the square)
• Gather all the ratios ra from the map tasks
• PI = 4 * ra → REDUCE
Parallelised counting of points in the circle (MAP),
then merged to find PI (REDUCE)
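A compact single-machine sketch of this map/reduce decomposition (it samples the unit square against a quarter circle, which gives the same ratio as the full square and inscribed circle above):

import java.util.Random;
import java.util.stream.IntStream;

public class MonteCarloPi {
  public static void main(String[] args) {
    int maps = 8;                       // number of "map" tasks
    long pointsPerMap = 1_000_000;

    // MAP: each task generates random points and counts those in the circle.
    long inCircle = IntStream.range(0, maps).parallel()
        .mapToLong(task -> {
          Random rnd = new Random(task);             // per-task seed
          long hits = 0;
          for (long i = 0; i < pointsPerMap; i++) {
            double x = rnd.nextDouble(), y = rnd.nextDouble();
            if (x * x + y * y <= 1.0) hits++;
          }
          return hits;
        })
        .sum();                         // REDUCE: merge the partial counts

    System.out.println(4.0 * inCircle / (maps * pointsPerMap));   // ~3.14159
  }
}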
Each cluster node runs both DFS and MR
WHAT IS MAPREDUCE PROGRAMMING?
• Restricted parallel programming model meant
for large clusters
– User implements Map() and Reduce()
• Parallel computing framework (HDFS lib)
– Libraries take care of EVERYTHING else
(abstraction)
• Parallelization
• Fault Tolerance
• Data Distribution
• Load Balancing
• Useful model for many practical tasks
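The canonical illustration of this model is word count. A condensed sketch against Hadoop's org.apache.hadoop.mapreduce API (essentially the classic example shipped with Hadoop):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Map(): emit (word, 1) for every word in the input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    public void map(Object key, Text value, Context ctx)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        ctx.write(word, ONE);
      }
    }
  }

  // Reduce(): sum the counts emitted for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      ctx.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));      // HDFS input dir
    FileOutputFormat.setOutputPath(job, new Path(args[1]));    // HDFS output dir
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Everything else listed above – parallelization, fault tolerance, data distribution, load balancing – is handled by the framework.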