INTRODUCTION TO HADOOP AND HDFS
Contents
• Distributed System
• DFS
• Hadoop
• HDFS Architecture
• Why is it needed?
• Issues
• Mutate / lease
Operating systems
• Operating system: software that supervises and controls tasks on a computer. Types of single-machine OS:
– Batch processing → jobs are collected and placed in a queue; no interaction with a job during processing
– Time shared → computing resources are shared among multiple users; interaction with the program during execution
– Real-time (RT) systems → fast response; processing can be interrupted
Distributed Systems
• Consists of a number of computers that are connected and
managed so that they automatically share the job processing
load among the constituent computers.
• A distributed operating system is one that appears to its users as
a traditional uniprocessor system, even though it is actually
composed of multiple processors.
• It gives its users a single system view and a single point of service.
• The location of files is transparent to users. It provides a virtual computing environment.
E.g. the Internet, ATM banking networks, mobile computing networks, Global Positioning Systems and air traffic control.
A DISTRIBUTED SYSTEM IS A COLLECTION OF INDEPENDENT COMPUTERS THAT APPEARS TO ITS USERS AS A SINGLE COHERENT SYSTEM
Network Operating System
• In a network operating system the users are aware
of the existence of multiple computers.
• The operating system of each computer must provide facilities for communication and shared functionality.
• Each machine runs its own OS and has its own user.
• Remote login and file access
• Less transparent, but more independence
[Diagram: Distributed OS vs. Networked OS. In a distributed OS, the applications run on a single layer of distributed operating system services spanning all machines; in a networked OS, each application runs on its own machine's network OS.]
DFS
• Resource sharing is the motivation behind distributed systems. To share files → a file system.
• File System is responsible for the organization, storage,
retrieval, naming, sharing, and protection of files.
• The file system is responsible for controlling access to
the data and for performing low-level operations such as
buffering frequently used data and issuing disk I/O
requests
• The goal is to allow users of physically distributed
computers to share data and storage resources by
using a common file system.
Hadoop
What is Hadoop?
 A framework for running applications on large clusters of commodity hardware, used to store and process huge volumes of data
 Apache Software Foundation Project
 Open source
 Runs on clusters such as Amazon’s EC2
 An alpha (0.18) release is available for download
Hadoop includes:
 HDFS, a distributed filesystem
 Map/Reduce – Hadoop implements this programming model; it is an offline (batch) computing engine
Concept
Moving computation is more efficient than moving large
data
• Data-intensive applications with petabytes of data.
• Web pages: 20+ billion pages × 20 KB = 400+ terabytes
– One computer can read 30-35 MB/sec from disk → ~four months to read the web
– The same read with 1000 machines takes < 3 hours (a back-of-envelope check follows this list)
• Difficulty with a large number of machines
– communication and coordination
– recovering from machine failure
– status reporting
– debugging
– optimization
– locality
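A back-of-envelope check of these figures, as a small Java sketch (the 20 KB/page and 35 MB/s numbers are the slide's assumptions):

public class WebScaleMath {
  public static void main(String[] args) {
    double pages = 20e9;            // 20+ billion web pages
    double bytesPerPage = 20e3;     // ~20 KB each
    double totalBytes = pages * bytesPerPage;   // ~4e14 B = 400+ TB
    double diskRate = 35e6;         // one disk reads ~35 MB/sec
    double daysOneMachine = totalBytes / diskRate / 86400;
    System.out.printf("one machine: ~%.0f days%n", daysOneMachine);  // ~132 days, i.e. ~four months
    System.out.printf("1000 machines: ~%.1f hours%n",
        daysOneMachine * 24 / 1000);                                 // ~3.2 hours
  }
}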
FACTS
Single-thread performance doesn’t matter: we have large problems, and total throughput/price matters more than peak performance.
Stuff breaks – so reliability must be designed in:
• If you have one server, it may stay up three years (1,000 days)
• If you have 10,000 servers, expect to lose ten a day
“Ultra-reliable” hardware doesn’t really help
At large scales, super-fancy reliable hardware still fails, albeit
less often
– software still needs to be fault-tolerant
– commodity machines without fancy hardware give better
perf/price
DECISION: COMMODITY HARDWARE.
DFS: HADOOP – WHY?
HDFS – Why? Seek vs. Transfer
• CPU & transfer speed, RAM & disk size double every 18
- 24 months
• Seek time nearly constant (~5%/year)
• Time to read an entire drive is growing: capacity outpaces transfer rate.
• Moral: scalable computing must go at transfer rate
• B-Tree (relational DBs)
– operates at seek rate, log(N) seeks/access
– memory / stream based
• Sort/merge flat files (MapReduce)
– operates at transfer rate, log(N) transfers/sort
– batch based
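To see the gap concretely, compare a seek-bound access pattern with a transfer-bound one. A sketch with assumed figures (10 ms per seek, 100 MB/s sequential transfer; neither number is from the slide):

public class SeekVsTransfer {
  public static void main(String[] args) {
    double seekTime = 0.010;        // 10 ms per random seek (assumed)
    double transferRate = 100e6;    // 100 MB/s sequential (assumed)
    double dataBytes = 1e12;        // a 1 TB dataset
    double recordBytes = 100;       // accessed as small records

    // Seek-bound: one seek per record, as in random B-Tree access
    double seekSeconds = (dataBytes / recordBytes) * seekTime;     // 1e8 s
    // Transfer-bound: stream the whole dataset sequentially
    double streamSeconds = dataBytes / transferRate;               // 1e4 s

    System.out.printf("seek-bound: ~%.1f years%n", seekSeconds / 3.15e7);     // ~3 years
    System.out.printf("transfer-bound: ~%.1f hours%n", streamSeconds / 3600); // ~2.8 hours
  }
}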
HDFS ARCHITECTURE
Characteristics
• Fault-tolerant, scalable, efficient, reliable distributed storage system
• Moves computation to the place of the data
• Single cluster with computation and data.
• Process huge amounts of data.
• Scalable: store and process petabytes of data.
• Economical:
– It distributes the data and processing across clusters of
commonly available computers.
– Clusters PCs into a storage and computing platform.
– It minimises the CPU cycles and RAM required on individual machines.
• Efficient:
– By distributing the data, Hadoop can process it in parallel on the nodes where the data is located. This makes it extremely fast.
– Computation is moved to where the data resides.
• Reliable:
– Hadoop automatically maintains multiple copies of data
– Automatically redeploys computing tasks based on failures.
Each cluster node runs both DFS and MR
• Data Model
– Data is organized into files and directories
– Files are divided into uniform sized blocks and
distributed across cluster nodes
– Replicate blocks to handle hardware failure
– Checksums of data for corruption detection
and recovery
– Expose block placement so that computations can be migrated to the data
• Supports large streaming reads and small random reads
• Facility for multiple clients to append to a file
• Assumes commodity hardware that fails
– Files are replicated to handle hardware
failure
– Checksums for corruption detection and
recovery
– Continues operating as nodes / racks are added / removed
• Optimized for fast batch processing
– Data location is exposed to allow computations to move to the data
– Stores data in chunks/blocks across the nodes of the cluster
– Provides VERY high aggregate bandwidth
• Files are broken into large blocks.
– Typically 128 MB block size
– Blocks are replicated for reliability:
one replica on the local node,
another replica on a remote rack,
a third replica on the local rack,
additional replicas placed randomly
(a placement sketch follows under “Block Placement” below)
• Understands rack locality
– Data placement exposed so that computation can be
migrated to data
• Client talks to both NameNode and DataNodes
– Data is not sent through the NameNode; clients access data directly from the DataNodes
– Throughput of file system scales nearly linearly with
the number of nodes.
Block Placement
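A minimal sketch of the placement rule described above (hypothetical node/rack bookkeeping; the real policy lives inside the NameNode and also weighs capacity and load):

import java.util.*;

class ReplicaPlacement {
  // Choose target DataNodes for one block, following the rule above.
  // Assumes every rack has enough distinct nodes to satisfy the request.
  static List<String> chooseTargets(String localNode, String localRack,
                                    Map<String, List<String>> rackToNodes,
                                    int replication, Random rnd) {
    List<String> targets = new ArrayList<>();
    targets.add(localNode);                                        // replica 1: writer's node
    List<String> remoteRacks = new ArrayList<>(rackToNodes.keySet());
    remoteRacks.remove(localRack);
    String remoteRack = remoteRacks.get(rnd.nextInt(remoteRacks.size()));
    targets.add(pick(rackToNodes.get(remoteRack), targets, rnd));  // replica 2: remote rack
    if (replication >= 3)
      targets.add(pick(rackToNodes.get(localRack), targets, rnd)); // replica 3: local rack
    List<String> all = new ArrayList<>();
    rackToNodes.values().forEach(all::addAll);
    while (targets.size() < replication)
      targets.add(pick(all, targets, rnd));                        // extras: random nodes
    return targets;
  }

  private static String pick(List<String> candidates, List<String> used, Random rnd) {
    List<String> fresh = new ArrayList<>(candidates);
    fresh.removeAll(used);
    return fresh.get(rnd.nextInt(fresh.size()));
  }
}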
Hadoop Cluster Architecture:
Components
• DFS Master “Namenode”
– Manages the file system namespace
– Controls read/write access to files
– Manages block replication
– Checkpoints namespace and journals
namespace changes for reliability
Metadata of the NameNode in memory
– The entire metadata is in main memory
– No demand paging of FS metadata
Types of Metadata:
List of files, file and chunk namespaces; list of
blocks, location of replicas; file attributes etc.
DFS SLAVES or DATA NODES
• Serve read/write requests from clients
• Perform replication tasks upon instruction by
namenode
Data nodes act as:
1) A Block Server
– Stores data in the local file system
– Stores metadata of a block (e.g. CRC)
– Serves data and metadata to Clients
2) Block Report: Periodically sends a report of all
existing blocks to the NameNode
3) Periodically sends heartbeat to NameNode (detect
node failures)
4) Facilitates Pipelining of Data (to other specified
DataNodes)
• Map/Reduce Master “Jobtracker”
– Accepts MR jobs submitted by users
– Assigns Map and Reduce tasks to Tasktrackers
– Monitors task and Tasktracker status; re-executes tasks upon failure
• Map/Reduce Slaves “Tasktrackers”
– Run Map and Reduce tasks upon instruction
from the Jobtracker
– Manage storage and transmission of
intermediate output.
SECONDARY NAME NODE
• Copies the FsImage and Transaction Log from the NameNode to a temporary directory
• Merges the FsImage and Transaction Log into a new FsImage in the temporary directory
• Uploads the new FsImage to the NameNode
– Transaction Log on NameNode is purged
HDFS Architecture
• NameNode: maps (filename, offset) → block id, and block id → DataNode
• DataNode: maps block id → location on local disk
• Secondary NameNode: periodically merges edit logs
A block is also called a chunk.
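Conceptually, the two NameNode mappings and the lookup they enable can be sketched with plain Java maps (illustration only, not HDFS code):

import java.util.*;

class NameNodeMetadata {
  // filename -> ordered list of block ids (offset / blockSize picks the index)
  Map<String, List<Long>> fileToBlocks = new HashMap<>();
  // block id -> DataNodes currently holding a replica
  Map<Long, Set<String>> blockToDataNodes = new HashMap<>();

  // Resolve (filename, byte offset) to the DataNodes serving that block.
  Set<String> locate(String file, long offset, long blockSize) {
    long blockId = fileToBlocks.get(file).get((int) (offset / blockSize));
    return blockToDataNodes.get(blockId);
  }
}
// The third mapping is kept by each DataNode locally: block id -> file on its disk.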
JOBTRACKER, TASKTRACKER AND JOBCLIENT
HDFS API
• Most common file and directory operations
supported:
– Create, open, close, read, write, seek, list,
delete etc.
• Files are write-once and have exactly one writer at a time
• Some operations peculiar to HDFS:
– set replication, get block locations
• Support for owners, permissions
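A minimal sketch of these operations against Hadoop's Java FileSystem API (the path /demo/hello.txt is hypothetical; assumes a configured, reachable cluster):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;

public class HdfsApiDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);
    Path p = new Path("/demo/hello.txt");       // hypothetical path

    FSDataOutputStream out = fs.create(p);      // create: one writer, write-once
    out.writeUTF("hello hdfs");
    out.close();

    FSDataInputStream in = fs.open(p);          // open, seek, read
    in.seek(0);
    System.out.println(in.readUTF());
    in.close();

    fs.setReplication(p, (short) 2);            // HDFS-specific: set replication
    FileStatus st = fs.getFileStatus(p);
    for (BlockLocation b :                      // HDFS-specific: get block locations
         fs.getFileBlockLocations(st, 0, st.getLen()))
      System.out.println(b);

    for (FileStatus f : fs.listStatus(new Path("/demo")))   // list
      System.out.println(f.getPath());
    fs.delete(p, false);                        // delete (non-recursive)
  }
}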
DATA CORRECTNESS
• Use Checksums to validate data
– Use CRC32
• File creation
– Client computes a checksum per 512 bytes
– DataNode stores the checksums
• File access
– Client retrieves the data and checksums from the DataNode
– If validation fails, the client tries other replicas
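A sketch of the per-512-byte CRC32 scheme just described (illustrative; the DataNode keeps the checksums in per-block metadata alongside the data):

import java.util.zip.CRC32;

class ChunkChecksum {
  static final int CHUNK = 512;             // bytes covered by each checksum

  // Client side at file creation: one CRC32 per 512-byte chunk.
  static long[] compute(byte[] data) {
    int n = (data.length + CHUNK - 1) / CHUNK;
    long[] sums = new long[n];
    CRC32 crc = new CRC32();
    for (int i = 0; i < n; i++) {
      crc.reset();
      int off = i * CHUNK;
      crc.update(data, off, Math.min(CHUNK, data.length - off));
      sums[i] = crc.getValue();             // stored by the DataNode
    }
    return sums;
  }
  // On access, the client recomputes each chunk's CRC and compares it with the
  // stored value; on a mismatch it retries the read from another replica.
}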
MUTATION ORDER AND LEASES
• A mutation is an operation that changes the contents or metadata of a chunk, such as an append or write.
• Each mutation is performed at all replicas.
• Leases (which fix the order of mutations) are used to maintain consistency.
• Master grants chunk lease to one replica
(primary)
• Primary picks the serial order for all mutations to
the chunk
• All replicas follow this order (consistency)
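A conceptual sketch of lease-based ordering (hypothetical types; this mirrors the GFS-style scheme the slide describes, not actual HDFS internals):

import java.util.*;

interface Replica { void apply(long serial, long offset, byte[] data); }

class PrimaryReplica {                       // the replica holding the lease
  private long nextSerial = 0;
  private final List<Replica> secondaries;

  PrimaryReplica(List<Replica> secondaries) { this.secondaries = secondaries; }

  // The primary picks one serial order; every replica applies mutations in it.
  synchronized void mutate(long offset, byte[] data) {
    long serial = nextSerial++;              // serial order chosen by the primary
    applyLocally(serial, offset, data);
    for (Replica r : secondaries)
      r.apply(serial, offset, data);         // all replicas follow the same order
  }

  private void applyLocally(long serial, long offset, byte[] data) {
    /* write the mutation to the local chunk */
  }
}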
Software Model
• Parallel programming improves performance and
efficiency.
• In a parallel program, the processing is broken up into
parts, each of which can be executed concurrently
• Identify whether the problem can be parallelised (e.g., a naive Fibonacci recurrence cannot, since each term depends on the previous ones)
• Matrix operations with independent sub-computations parallelise well
Master/Worker
• The MASTER:
– initializes the array and splits it up according
to the number of available WORKERS
– sends each WORKER its subarray
– receives the results from each WORKER
• The WORKER:
– receives the subarray from the MASTER
– performs processing on the subarray
– returns results to MASTER
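A minimal Java sketch of this master/worker pattern, summing an array that the master splits across a thread pool (the four-worker count and the sum operation are illustrative choices):

import java.util.*;
import java.util.concurrent.*;

public class MasterWorker {
  public static void main(String[] args) throws Exception {
    int[] data = new int[1_000_000];
    Arrays.fill(data, 1);
    int workers = 4;
    ExecutorService pool = Executors.newFixedThreadPool(workers);
    List<Future<Long>> results = new ArrayList<>();

    int slice = data.length / workers;
    for (int w = 0; w < workers; w++) {              // MASTER: split and send
      final int from = w * slice;
      final int to = (w == workers - 1) ? data.length : from + slice;
      results.add(pool.submit(() -> {                // WORKER: process the subarray
        long sum = 0;
        for (int i = from; i < to; i++) sum += data[i];
        return sum;                                  // WORKER: return the result
      }));
    }
    long total = 0;
    for (Future<Long> f : results) total += f.get(); // MASTER: receive results
    pool.shutdown();
    System.out.println(total);                       // 1000000
  }
}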
The area of the square: As = (2r)^2 = 4r^2.
The area of the circle: Ac = pi * r^2.
• pi = Ac / r^2
• As = 4r^2, so r^2 = As / 4
• pi = 4 * Ac / As
• pi ≈ 4 * (no. of points in the circle) / (no. of points in the square)
CALCULATING PI
• Randomly generate points in the square
• Count how many of the generated points fall inside the circle → MAP
(each map task computes ra = no. of points in the circle / no. of points in the square)
• Gather all the ratios ra from the map tasks
• PI = 4 * ra → REDUCE
Parallelised counting of points in the circle (MAP),
then merged to find PI (REDUCE)
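A compact single-machine sketch of this map/reduce decomposition (it samples the unit square against a quarter circle, which gives the same ratio as the full square and inscribed circle above):

import java.util.Random;
import java.util.stream.IntStream;

public class MonteCarloPi {
  public static void main(String[] args) {
    int maps = 8;                       // number of "map" tasks
    long pointsPerMap = 1_000_000;

    // MAP: each task generates random points and counts those in the circle.
    long inCircle = IntStream.range(0, maps).parallel()
        .mapToLong(task -> {
          Random rnd = new Random(task);             // per-task seed
          long hits = 0;
          for (long i = 0; i < pointsPerMap; i++) {
            double x = rnd.nextDouble(), y = rnd.nextDouble();
            if (x * x + y * y <= 1.0) hits++;
          }
          return hits;
        })
        .sum();                         // REDUCE: merge the partial counts

    System.out.println(4.0 * inCircle / (maps * pointsPerMap));   // ~3.14159
  }
}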
Each cluster node runs both DFS and MR
WHAT IS MAPREDUCE PROGRAMMING?
• Restricted parallel programming model meant
for large clusters
– User implements Map() and Reduce()
• Parallel computing framework (HDFS lib)
– Libraries take care of EVERYTHING else
(abstraction)
• Parallelization
• Fault Tolerance
• Data Distribution
• Load Balancing
• Useful model for many practical tasks
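The canonical illustration of this model is word count. A condensed sketch against Hadoop's org.apache.hadoop.mapreduce API (essentially the classic example shipped with Hadoop):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Map(): emit (word, 1) for every word in the input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    public void map(Object key, Text value, Context ctx)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        ctx.write(word, ONE);
      }
    }
  }

  // Reduce(): sum the counts emitted for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      ctx.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));      // HDFS input dir
    FileOutputFormat.setOutputPath(job, new Path(args[1]));    // HDFS output dir
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Everything else listed above – parallelization, fault tolerance, data distribution, load balancing – is handled by the framework.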