Hadoop HDFS Detailed Introduction
Introduction of HDFS, for training.

Presentation Transcript

  • Detailed Intro. to HDFS
    July 10, 2012
    Clay Jiang, Big Data Engineering Team, Hanborq Inc.
  • HDFS Intro.
    • Overview
    • HDFS Internal
    • HDFS O&M, Tools
    • HDFS Future
  • Overview
  • What is HDFS?
    • Hadoop Distributed FileSystem
    • Good for:
       Large files
       Streaming data access
    • NOT for:
      x Lots of small files
      x Random access
      x Low-latency access
  • Design of HDFS
    • GFS-like: http://research.google.com/archive/gfs.html
    • Master-slave design
      • Master: a single NameNode managing filesystem metadata
      • Slaves: multiple DataNodes storing data
      • One more: a SecondaryNameNode for checkpointing
  • HDFS Architecture (architecture diagram)
  • HDFS Storage
    • HDFS files are broken into blocks
      • The basic unit of reading/writing, like a disk block
      • Defaults to 64MB; may be larger in production environments (see the sketch below)
      • Makes HDFS good for large files & high throughput
    • A block may have multiple replicas
      • One block is stored at multiple locations
      • Makes HDFS storage fault tolerant
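A minimal Java sketch of choosing both knobs per file at create time; the path, buffer size, and values are illustrative assumptions, not recommendations:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockSettingsExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            short replication = 3;                // replicas per block
            long blockSize = 128L * 1024 * 1024;  // 128MB instead of the 64MB default
            FSDataOutputStream out = fs.create(
                    new Path("/tmp/big.dat"),     // illustrative path
                    true,                         // overwrite if the file exists
                    4096,                         // io buffer size
                    replication,
                    blockSize);
            out.close();
            fs.close();
        }
    }

Larger blocks mean fewer blocks per file and therefore less NameNode metadata, which is one reason production clusters raise the default.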
  • HDFS Storage (blocks & replicas diagram)
  • HDFS Internal
  • HDFS Internal
    • NameNode
    • SecondaryNameNode
    • DataNode
  • NameNode
    • Filesystem metadata
      • FSNames
      • FSName  Blocks
      • Block  Replicas
    • Interacts with
      • Client
      • DataNode
      • SecondaryNameNode
  • NameNode FS Meta
    • FSImage
      • FSNames & FSName  Blocks
      • Saved as replicas in multiple name directories
      • Recovered on startup
    • EditLog
      • Logs every FS modification
    • Block  Replicas (DataNodes)
      • Held only in memory; rebuilt from block reports on startup
  • NameNode Interface
    • Exposed through different protocol interfaces:
      • ClientProtocol: create, addBlock, delete, rename, fsync, …
      • NameNodeProtocol: rollEditLog, rollFsImage, …
      • DataNodeProtocol: sendHeartbeat, blockReceived, blockReport, …
      • …
  • NameNode Startup
    • On startup:
      • Load fsimage
      • Check safe mode
      • Start daemons:
        • HeartbeatMonitor
        • LeaseManager
        • ReplicationMonitor
        • DecommissionManager
      • Start RPC services
      • Start the HTTP info server
      • Start the Trash emptier
  • Load FSImage
    • Name directories
      • dfs.name.dir: can list multiple dirs
      • Check consistency of all name dirs
      • Load the fsimage file
      • Load the edit logs
      • Save the namespace (mainly sets up dirs & files properly)
  • Check Safemode
    • Safemode
      • The fsimage is loaded, but the locations of blocks are not known yet!
      • Exit when the minimal replication condition is met
        • dfs.safemode.threshold.pct
        • dfs.replication.min
        • Default case: 99.9% of blocks have 1 replica
      • Start SafeModeMonitor to periodically check whether to leave safe mode
      • Leave safe mode manually:
        • hadoop dfsadmin -safemode leave
        • (or enter it / get its status with: hadoop dfsadmin -safemode enter/get; a Java sketch of the status query follows below)
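The status query is also available programmatically. A hedged sketch, assuming a Hadoop 1.x-era API and that the default filesystem is HDFS (hence the cast):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.hdfs.DistributedFileSystem;
    import org.apache.hadoop.hdfs.protocol.FSConstants;

    public class SafeModeCheck {
        public static void main(String[] args) throws Exception {
            DistributedFileSystem dfs =
                    (DistributedFileSystem) FileSystem.get(new Configuration());
            // SAFEMODE_GET only queries the state, like `hadoop dfsadmin -safemode get`
            boolean inSafeMode = dfs.setSafeMode(FSConstants.SafeModeAction.SAFEMODE_GET);
            System.out.println("NameNode in safe mode: " + inSafeMode);
        }
    }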
  • Start Daemons
    • HeartbeatMonitor
      • Detects lost DNs & schedules necessary replication
    • LeaseManager
      • Checks for expired leases
    • ReplicationMonitor
      • computeReplicationWork
      • computeInvalidateWork
      • dfs.replication.interval, default 3 secs
    • DecommissionManager
      • Checks nodes and marks them decommissioned
  • Trash Emptier
    • /user/{user.name}/.Trash
      • Set fs.trash.interval > 0 to enable (see the sketch below)
      • On delete, the file is moved to .Trash
    • Trash.Emptier
      • Runs every fs.trash.interval minutes
      • Deletes old checkpoints (from fs.trash.interval minutes ago)
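A small sketch of using the trash programmatically, assuming fs.trash.interval has been set > 0 on the cluster; the path is illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.Trash;

    public class TrashExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Trash trash = new Trash(conf);
            // Moves the file under /user/<user>/.Trash instead of deleting it outright
            boolean moved = trash.moveToTrash(new Path("/tmp/obsolete.dat"));
            System.out.println("Moved to trash: " + moved);
        }
    }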
  • HDFS Internal
    • NameNode
    • SecondaryNameNode
    • DataNode
  • SecondaryNameNode
    • Not a standby/backup NameNode
      • Only for checkpointing
      • Though it does hold a NON-realtime copy of the FSImage
    • Needs as much memory as the NN to do the checkpointing
      • Estimation: 1GB for every one million blocks
  • SecondaryNameNode
    • Does the checkpointing:
      • Copy the NN's fsimage & editlogs
      • Merge them into a new fsimage
      • Replace the NN's fsimage with the new one & clean the editlogs
    • Timing (see the config sketch below)
      • Size of editlog > fs.checkpoint.size (polled every 5 min)
      • Every fs.checkpoint.period secs
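The two triggers correspond to these configuration keys. They normally live in the config files; setting them in code is just for illustration, and the values below are the commonly cited 1.x defaults, stated here as assumptions:

    import org.apache.hadoop.conf.Configuration;

    public class CheckpointConfig {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            conf.setLong("fs.checkpoint.period", 3600);            // checkpoint every hour
            conf.setLong("fs.checkpoint.size", 64L * 1024 * 1024); // or once the editlog exceeds 64MB
            System.out.println(conf.get("fs.checkpoint.period") + "s / "
                    + conf.get("fs.checkpoint.size") + " bytes");
        }
    }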
  • HDFS Internal
    • NameNode
    • SecondaryNameNode
    • DataNode
  • DataNode
    • Stores data blocks
      • Has no knowledge of FSNames
    • Receives blocks from clients
    • Receives blocks from DataNode peers
      • Replication
      • Pipelined writing
    • Receives delete commands from the NameNode
  • Block Placement Policy (cluster level)
    • With replication = 3:
      • First replica local to the client
      • Second & third on two nodes of the same remote rack
  • Block Placement Policy (single node level)
    • Write to each disk in turn
      • No balancing is considered!
    • Skip a disk when it is almost full or has failed
    • The DataNode may go offline when disks fail
      • dfs.datanode.failed.volumes.tolerated (see the sketch below)
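A matching sketch for the tolerated-failed-volumes knob named above; the value is an illustrative assumption and normally belongs in hdfs-site.xml on each DataNode:

    import org.apache.hadoop.conf.Configuration;

    public class FailedVolumesConfig {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Tolerate one failed data dir before the DataNode takes itself offline
            conf.setInt("dfs.datanode.failed.volumes.tolerated", 1);
            System.out.println(conf.getInt("dfs.datanode.failed.volumes.tolerated", 0));
        }
    }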
  • DataNode Startup
    • On DN startup:
      • Load data dirs
      • Register with the NameNode
      • Start the IPC server
      • Start DataXceiverServer
        • Transfers blocks
      • Run the main loop:
        • Start BlockScanner
        • Send heartbeats
        • Process commands from the NN
        • Send block reports
  • DataXceiverServer
    • Accepts data connections & starts a DataXceiver per connection
      • Max num: dfs.datanode.max.xcievers [256]
    • DataXceiver
      • Handles blocks:
        • Read block
        • Write block
        • Replace block
        • Copy block
        • …
  • HDFS Routines Analysis
    • Write file
    • Read file
    • Decrease replication factor
    • One DN down
  • Write File
    • Sample code (a runnable version follows below):
        DFSClient dfsclient = …;
        outputStream = dfsclient.create(…);
        outputStream.write(someBytes);
        …
        outputStream.close();
        dfsclient.close();
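A runnable version of the same flow through the public FileSystem API (which wraps DFSClient); the default Configuration, path, and payload are illustrative assumptions:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();  // picks up core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);      // DistributedFileSystem when defaults point at HDFS
            FSDataOutputStream out = fs.create(new Path("/tmp/hello.txt"));
            out.write("hello, HDFS".getBytes("UTF-8")); // buffered into packets, pipelined to the DNs
            out.close();                                // completes the file on the NameNode
            fs.close();
        }
    }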
  • Write File (diagram)
  • Write File
    • DFSClient.create
      • NameNode.create
        • Check existence
        • Check permissions
        • Check and get the lease
        • Add a new INode to rootDir
  • Write File
    • outputStream.write
      • Get the DNs to write to from the NN
      • Break the bytes into packets
      • Write packets to the first DataNode's DataXceiver
      • The DN mirrors each packet to the downstream DNs (pipeline)
      • When complete, confirm to the NN via blockReceived
  • Write File
    • outputStream.close
      • NameNode.complete
        • Remove the lease
        • Change the file from "under construction" to "complete"
  • Lease
  • Lease
    • What is a lease?
      • A write lock for file modification
      • No lease is needed for reading files
    • Avoids concurrent writes on the same file
      • These would cause inconsistent & undefined behavior
  • Lease
    • LeaseManager
      • Leases are managed in the NN
      • When a file is created (or appended), a lease is added
    • DFSClient.LeaseChecker
      • The client starts a thread to renew its leases periodically
  • Lease Expiration
    • Soft limit
      • No renewal for 1 min
      • Other clients may compete for the lease
    • Hard limit
      • No renewal for 60 min (60 * softLimit)
      • The lease is reclaimed even with no competing client
  • Read File
    • Sample code (a runnable version follows below):
        DFSClient dfsclient = …;
        FSDataInputStream is = dfsclient.open(…);
        is.read(…);
        is.close();
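And a runnable counterpart for reads, with the same assumptions as the write sketch:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsReadExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FSDataInputStream in = fs.open(new Path("/tmp/hello.txt")); // NN returns block locations
            try {
                IOUtils.copyBytes(in, System.out, 4096, false); // reads block by block, checksums verified
            } finally {
                IOUtils.closeStream(in);
            }
            fs.close();
        }
    }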
  • Read File (diagram)
  • Read File
    • DFSClient.open
      • Creates an FSDataInputStream
        • Gets the block locations of the file from the NN
    • FSDataInputStream.read
      • Reads data from the DNs block by block
        • Read the data
        • Verify the checksums
  • Decrease Replication Factor
    • Sample code (a FileSystem-API sketch follows below):
        DFSClient dfsclient = …;
        dfsclient.setReplication(…, 2);
    • Or use the CLI:
        hadoop fs -setrep -w 2 /path/to/file
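The same change through the public FileSystem API, with an illustrative path:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SetReplicationExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // Asks the NameNode to change the target factor; excess replicas
            // are then deleted asynchronously, as the next slides walk through
            boolean accepted = fs.setReplication(new Path("/path/to/file"), (short) 2);
            System.out.println("Replication change accepted: " + accepted);
            fs.close();
        }
    }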
  • Decrease Replication Factor (diagram)
  • Decrease Replication Factor
    • Change the FSName replication factor
    • Choose the excess replicas
      • The number of racks must not decrease
      • Prefer the replica on the node with the least available disk space
    • Add them to invalidateSets (the to-be-deleted block set)
    • ReplicationMonitor computes the blocks to be deleted for each DN
    • On the next DN heartbeat, send the delete-block command to the DN
    • The DN deletes the specified blocks
    • Update blocksMap when the DN sends its next blockReport
  • One DN Down
    • The DataNode stops sending heartbeats
    • NameNode
      • HeartbeatMonitor finds the DN dead during its heartbeat check
      • Removes all blocks belonging to the DN
      • Updates neededReplications (the set of blocks needing one or more replicas)
      • ReplicationMonitor computes the blocks to be replicated by each DN
      • On the next DN heartbeat, the NameNode sends a replicate-block command
    • DataNode
      • Replicates the block
  • O&M, Tools
  • High Availability
    • NameNode SPOF
      • The NameNode holds all the metadata
      • If the NN crashes, the whole cluster is unavailable
    • Though the fsimage can be recovered from the SNN
      • It is not an up-to-date fsimage
    • HA solutions are needed
  • HA Solutions
    • DRBD
    • AvatarNode
    • BackupNode
  • HA - DRBD
    • DRBD (http://www.drbd.org)
      • Block devices designed as a building block to form high availability (HA) clusters
      • Like network-based RAID-1
    • Use DRBD to back up the NN's fsimage & editlogs
      • A cold backup for the NN
      • Restarting the NN costs no more than 10 minutes
  • HA - DRBD
    • Mirror one of the NN's name dirs to a remote node
      • All name dirs hold the same content
    • When the NN fails:
      • Copy the mirrored name dir to all name dirs
      • Restart the NN
      • All of this takes no more than 20 mins
  • HA Solutions
    • DRBD
    • AvatarNode
    • BackupNode
  • HA - AvatarNode
    • Complete hot standby
      • NFS for storage of the fsimage and editlogs
      • The standby node consumes transactions from the editlogs on NFS continuously
      • DataNodes send messages to both the primary and the standby node
    • Fast switchover
      • Less than a minute
  • HA - AvatarNode
    • Active-standby pair
      • Coordinated via ZooKeeper
      • Failover in a few seconds
      • Wrapper over NameNode
    • Active AvatarNode
      • Writes the transaction log to an NFS filer
    • Standby AvatarNode
      • Reads/consumes transactions from the NFS filer
      • Processes all block location messages from the DataNodes
      • Latest metadata in memory
    (Diagram: the client retrieves block locations from the primary or the standby;
    each AvatarNode wraps a NameNode; DataNodes send block location messages to both.)
  • HA - AvatarNode
    • Four steps to failover:
      • Wipe the ZooKeeper entry. Clients will know the failover is in progress. (0 seconds)
      • Stop the primary NameNode. The last bits of data will be flushed to the transaction log and it will die. (seconds)
      • Switch the standby to primary. It will consume the rest of the transaction log and get out of SafeMode, ready to serve traffic. (seconds)
      • Update the entry in ZooKeeper. All the clients waiting for failover will pick up the new connection. (0 seconds)
    • After: start the first node in standby mode
      • Takes a while, but the cluster is up and running
  • HA - AvatarNode (diagram)
  • HA Solutions
    • DRBD
    • AvatarNode
    • BackupNode
  • HA - BackupNode
    • The NN synchronously streams the transaction log to the BackupNode
    • The BackupNode applies the log to its in-memory and on-disk image
    • The BN always commits to disk before reporting success to the NN
    • If the BN restarts, it has to catch up with the NN
    (Diagram: the client retrieves block locations from the NN; the NN streams
    transaction logs synchronously to the BN; DataNodes send block location messages.)
  • Tools
    • More tools…
      • Balancer
      • Fsck
      • Distcp
  • Tools - Balancer
    • Re-balancing is needed
      • When a new node is added to the cluster
    • bin/start-balancer.sh
      • Moves blocks from over-utilized nodes to under-utilized nodes
    • dfs.balance.bandwidthPerSec
      • Controls the impact on the business workload
    • -t <threshold>
      • Default 10%
      • Stop when the difference from average utilization is less than the threshold
  • Tools - Fsck
    • hadoop fsck /path/to/file
    • Checks HDFS's health
      • Missing blocks, corrupt blocks, mis-replicated blocks, …
    • Gets the blocks & locations of files
      • hadoop fsck /path/to/file -files -blocks -locations
  • Tools - Distcp
    • Inter-cluster copy
      • hadoop distcp -i -pp -log /logdir hdfs://srcip/srcpath/ /destpath
      • Uses MapReduce (actually, maps only) to run the copy in a distributed fashion
    • Also a fast copy within the same cluster
  • HDFS Future
  • Hadoop Future
    • Short-circuit local reads
      • dfs.client.read.shortcircuit = true (see the sketch after this list)
      • Available in hadoop-1.x or cdh3u4
    • Native checksums (HDFS-2080)
    • BlockReader keepalive to DN (HDFS-941)
    • "Zero-copy read" support (HDFS-3051)
    • NN HA (HDFS-3042)
    • HDFS Federation
    • HDFS RAID
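A hedged sketch of enabling the first item from the client side, assuming the 1.x/cdh3u4-era key named on the slide:

    import org.apache.hadoop.conf.Configuration;

    public class ShortCircuitConfig {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Let the client read local replicas directly, bypassing the DataNode
            conf.setBoolean("dfs.client.read.shortcircuit", true);
            System.out.println(conf.getBoolean("dfs.client.read.shortcircuit", false));
        }
    }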
  • References
    • Tom White, "Hadoop: The Definitive Guide"
    • http://hadoop.apache.org/hdfs/
    • Hadoop Wiki - HDFS: http://wiki.apache.org/hadoop/HDFS
    • Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung, "The Google File System": http://research.google.com/archive/gfs.html
    • Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler, "The Hadoop Distributed File System": http://storageconference.org/2010/Papers/MSST/Shvachko.pdf
  • The End
    Thank You Very Much!
    chiangbing@gmail.com