HDFS User Reference

Reference for HDFS users

Published in: Technology

  1. HDFS User Reference, Biju Nair
  2. Local File System
     [Diagram: a directory maps FileA, FileB and FileC to Inode-n, Inode-m and Inode-p. Each inode holds the file attributes and the addresses of the file's blocks (Block 0, Block 1, Block 2, ...). Disk layout: MBR, partition table, boot block, super block, free-space tracking, i-nodes, root dir.]
     File block size is based on what is used when the FS is defined.
  3. Hadoop Distributed File System
     [Diagram: the master host (NN) keeps the HDFS directory, which maps each file to its blocks and the hosts that store them: FileA = H1:blk0, H2:blk1; FileB = H3:blk0, H1:blk1; FileC = H2:blk0, H3:blk1. Each host stores its blocks as ordinary files with inodes in its local file system on disk: Host 1 holds FileA0 and FileB1, Host 2 holds FileA1 and FileC0, Host 3 holds FileB0 and FileC1.]
     Files created are of size equal to the HDFS block size.
  4. HDFS
     [Diagram: components are the Hadoop CLI, WebHDFS, the HDFS UI, the name node, the secondary name node and the data nodes. Protocols shown are RPC, HTTP, HTTP/S and the HDFS Data Transfer Protocol.]
     • Name Node: ${dfs.name.dir}/current/VERSION, /edits, /fsimage, /fstime
     • Secondary Name Node: ${fs.checkpoint.dir}/current/VERSION, /edits, /fsimage, /fstime
     • Data Node: ${dfs.data.dir}/current/VERSION, /blk_<id_1>, /blk_<id_1>.meta, /..., /subdir2/
  5. HDFS Config Files and Ports
     • Default configuration
       – core-default.xml, hdfs-default.xml
     • Site specific configuration
       – core-site.xml, hdfs-site.xml under conf
     • Configuration of daemon processes
       – hadoop-env.sh under conf
     • List of slave/data nodes
       – "slaves" file under conf
     • Ports
       – Default NN UI port 50070 (HTTP), 50470 (HTTPS)
       – Default NN port 8020/9000
       – Default DN UI port 50075 (HTTP), 50475 (HTTPS)
  6. HDFS - Write Flow
     [Diagram: the client, the name node (namespace metadata and blockmap, backed by the fsimage and edit files) and three data nodes, with arrows numbered 1 to 8 matching the steps below.]
     1. Client requests to open a file for writing through the fs.create() call. This overwrites an existing file.
     2. Name node responds with a lease on the file path.
     3. Client buffers the data locally and, when the data reaches the block size, asks the name node for a block to write.
     4. Name node responds with a new block id and the destination data nodes for the write and its replication.
     5. Client sends the first data node the data and the checksum generated on the data to be written.
     6. First data node writes the data and checksum and, in parallel, pipelines the replicas to the other DNs.
     7. Each data node that receives a replica reports success or failure back to the first DN.
     8. First data node in turn informs the name node that the write request for the block is complete, and the name node updates its block map.
     Note: there can be only one writer at a time on a file.
  7. HDFS - Read Flow
     [Diagram: the client, the name node (namespace metadata and blockmap, backed by the fsimage and edit files) and three data nodes, with arrows numbered 1 to 7 matching the steps below.]
     1. Client requests to open a file for reading through the fs.open() call.
     2. Name node responds with a lease to the file path.
     3. Client requests to read the data in the file.
     4. Name node responds with the block ids in sequence and the corresponding data nodes.
     5. Client reaches out directly to the DNs for each block of data in the file.
     6. When a DN sends back data along with its checksum, the client verifies it by generating a checksum of its own.
     7. If the checksum verification fails, the client reaches out to other DNs where there is a replica.
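Steps 5 to 7 of the read flow can be sketched as a small simulation. This is only an illustration, not HDFS client code: replicas are plain in-memory tuples, `zlib.crc32` stands in for whatever checksum the cluster actually uses, and `read_block` and the data node names are hypothetical.

```python
import zlib


def checksum(data: bytes) -> int:
    # Stand-in checksum; the real HDFS checksum algorithm is not modelled here.
    return zlib.crc32(data)


def read_block(replicas):
    """Return (datanode, data) for the first replica whose checksum verifies.

    `replicas` is a list of (datanode_name, data, stored_checksum) tuples in
    the order the name node returned them (a hypothetical structure).
    """
    for datanode, data, stored in replicas:
        if checksum(data) == stored:
            return datanode, data
        # Verification failed: fall through and try the next replica.
    raise IOError("no replica passed checksum verification")


good = b"block-0 contents"
replicas = [
    ("dn1", b"block-0 c0ntents", checksum(good)),  # one corrupted byte
    ("dn2", good, checksum(good)),                 # healthy copy
]
source, data = read_block(replicas)
print(source)
```

The client never asks the name node for the data itself; it only falls back to the next data node in the list it was given.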
  8. HDFS - Name Node
     • Fsimage (metadata): namespace, ownership, permissions, create/mod/access time, is hidden
     • EditFile (journal): changes to metadata
     • BlockMap (in-memory): details on file blocks and where they are stored
     1. Name node manages the HDFS file system using the fsimage/edit file and block map data structures.
     2. Fsimage and edit file data are stored on disk. When HDFS starts they are read, merged and stored in memory.
     3. Data nodes send details about the blocks they store when they start and also at regular intervals.
     4. Name node uses the block maps sent by the data nodes to build the BlockMap data structure.
     5. The BlockMap data is used when requests for reads on files come to the file system.
     6. The BlockMap data is also used to identify under/over replicated files that require correction.
     7. At no point does the name node store file data locally or get directly involved in transferring data from files to the client.
     8. The client reading/writing data receives metadata details from the NN and then works directly with the DNs.
     9. Name nodes require large memory since they need to hold all the in-memory data structures.
     10. If the NN is lost, the data in the file system can't be accessed.
  9. FS Meta Data Change Management
     [Diagram: at start-up the NameNode merges Fsimage (metadata) and EditFile (journal) into Fsimage_1 and EditFile_1; periodically the Secondary NameNode performs the same merge.]
     1. When HDFS is up and running, changes to file system metadata are stored in edit files.
     2. When the NN starts, it looks for edit files in the system and merges their content with the fsimage on disk.
     3. The merging process creates a new fsimage and edit file, and discards the old fsimage and edit files.
     4. Since the edit files can be large for a very active HDFS cluster, the NN start-up can take a long time.
     5. The secondary name node merges the edit file and the fsimage file at a regular interval or after the edit file reaches a certain size.
     6. The merge process creates a new fsimage file and an edit file. The secondary NN copies the new fsimage file back to the NN.
     7. This shortens the NN start-up, and the fsimage can also be used to restore the NN if its server fails.
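The checkpoint merge described on this slide amounts to replaying the journal over the last image. The sketch below is a toy model under stated assumptions: the fsimage is a plain dict of path to attributes, the edit file is a list of `(op, path, value)` tuples, and the three operation names are invented for illustration; none of this reflects the real on-disk formats.

```python
def apply_edits(fsimage: dict, edits: list) -> dict:
    """Replay journal entries over a copy of the fsimage and return the merged image."""
    merged = dict(fsimage)
    for op, path, value in edits:
        if op == "create":
            merged[path] = value
        elif op == "delete":
            merged.pop(path, None)
        elif op == "rename":
            # For a rename, `value` carries the new path.
            merged[value] = merged.pop(path)
        else:
            raise ValueError(f"unknown op: {op}")
    return merged


fsimage = {"/data/a.txt": {"owner": "hdfs"}}
edits = [
    ("create", "/data/b.txt", {"owner": "biju"}),
    ("rename", "/data/a.txt", "/data/a_old.txt"),
    ("delete", "/data/b.txt", None),
]
new_fsimage = apply_edits(fsimage, edits)
new_edits = []  # a fresh, empty edit file starts after the checkpoint
print(sorted(new_fsimage))
```

The longer the edit list, the longer the replay takes, which is why a periodic merge by the secondary name node shortens NN start-up.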
  10. HDFS - Data Node
      [Diagram: three data nodes send heartbeats and block maps to the name node, which holds the metadata and the BlockMap.]
      1. Data nodes store the blocks of data for each file stored in HDFS; the default block size is 128 MB.
      2. Blocks of data are replicated n times; the default is 3.
      3. A data node periodically sends a heartbeat to the name node to inform the NN that it is alive.
      4. If the NN doesn't receive a heartbeat, it marks the DN as dead and stops sending further requests to it.
      5. At periodic intervals, a data node also sends out a block map which includes all the file blocks it stores.
      6. When a DN is dead, all the files for which blocks were stored on that DN get marked as under replicated.
      7. The NN rectifies under replication by replicating the blocks to other data nodes.
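Points 4, 6 and 7 above can be illustrated with a small bookkeeping sketch. The heartbeat timestamps, the 30-second timeout and the helper names are assumptions made for this example; they are not values or APIs taken from HDFS.

```python
def live_nodes(last_heartbeat: dict, now: float, timeout: float) -> set:
    """Data nodes whose most recent heartbeat is within `timeout` seconds."""
    return {dn for dn, ts in last_heartbeat.items() if now - ts <= timeout}


def under_replicated(block_map: dict, live: set, replication: int = 3) -> dict:
    """Map block id -> number of missing replicas, counting only live data nodes."""
    missing = {}
    for block, nodes in block_map.items():
        alive = len(set(nodes) & live)
        if alive < replication:
            missing[block] = replication - alive
    return missing


# dn3 has been silent for 60 seconds, so it is treated as dead.
last_heartbeat = {"dn1": 100.0, "dn2": 98.0, "dn3": 40.0, "dn4": 99.0}
block_map = {
    "blk_1": ["dn1", "dn2", "dn3"],
    "blk_2": ["dn1", "dn2", "dn4"],
}
live = live_nodes(last_heartbeat, now=100.0, timeout=30.0)
todo = under_replicated(block_map, live)
print(todo)
```

Each entry in `todo` is a block the name node would schedule for re-replication onto another live data node.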
  11. Ensuring Data Integrity
      • Through replication/replication assurance
        – First replica closer to the client node
        – Second replica on a different rack
        – Third replica on the same rack as the second replica
      • File system checks run manually
      • Block scanning over a period of time
      • Storing checksums along with block data
  12. Permission and Quotas
      • Files and directories use much of the POSIX model
        – Associated with an owner and a group
        – Permissions for owner, group and others
        – Files: r to read, w to write or append
        – Directories: r to list files, w to create/delete files
        – x to access child directories
        – Sticky bit on dirs prevents deletions by others
        – User identification can be simple (OS) or Kerberos
  13. Permission and Quotas
      • Quota for number of files
        – Name quota
        – dfsadmin -setQuota <N> <dir>...<dir>
        – dfsadmin -clrQuota <dir>...<dir>
      • Quota on the size of data
        – Space quota can be set to restrict space usage
        – dfsadmin -setSpaceQuota <N> <dir>...<dir>
        – dfsadmin -clrSpaceQuota <dir>...<dir>
        – Replicated data also consumes quota
      • Reporting
        – fs -count -q <dir>...<dir>
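Because replicas count against a space quota, the figure to compare with the quota is the logical size multiplied by the replication factor. A minimal arithmetic sketch, assuming a replication factor of 3 and ignoring any block-level accounting the name node may apply; the helper names are hypothetical:

```python
GB = 1024 ** 3


def space_consumed(file_sizes, replication=3):
    """Bytes charged against a space quota: every replica counts."""
    return sum(file_sizes) * replication


def fits_space_quota(file_sizes, quota_bytes, replication=3):
    return space_consumed(file_sizes, replication) <= quota_bytes


files = [2 * GB, 1 * GB]              # 3 GB of logical data
consumed = space_consumed(files)       # charged three times over
print(consumed // GB)
print(fits_space_quota(files, 8 * GB))
```

With 3 GB of files and three replicas, 9 GB is charged, so an 8 GB space quota on the directory would be exceeded.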
  14. HDFS snapshot
      • No copy of data blocks. Only the metadata (block list and file names) is copied
      • Allow snapshot on a directory
        – hdfs dfsadmin -allowSnapshot <path>
      • Create snapshot
        – hdfs dfs -createSnapshot <path> [<name>]
        – Default name is 's' + timestamp
      • Verify snapshot
        – hadoop fs -ls <path>/.snapshot
      • A directory with snapshots can't be deleted or renamed
      • Disallow snapshot
        – hdfs dfsadmin -disallowSnapshot <path>
        – All existing snapshots need to be deleted before disallow
      • Delete snapshot
        – hdfs dfs -deleteSnapshot <path> <name>
      • Rename snapshot
        – hdfs dfs -renameSnapshot <path> <oldname> <newname>
      • Snapshot differences
        – hdfs snapshotDiff <path> <starting snapshot name> <ending snapshot name>
      • List all snapshottable directories
        – hdfs lsSnapshottableDir
  15. HDFS back-up using snapshot
      • Create a snapshot on the source cluster
      • Perform a "distcp" of the snapshot to the backup cluster
      • Create a snapshot of the copy on the backup cluster
      • Clean up any old back-up copies to comply with the enterprise retention policy
      • The reverse can be followed to recover data from the backup
        – Data needs to be removed on the production cluster before the restore
        – During deletion, the -skipTrash option of "rm" will help reduce space usage
  16. distcp
      • Tool to perform inter and intra cluster copy of data
      • Utilizes MapReduce to perform the copy
      • It can be used to
        – Copy data within a cluster
        – Copy data between clusters
        – Copy files or directories
        – Copy data from multiple sources
      • Can be used to create a backup cluster
      • Starts up containers on both source and target
      • Consumes network bandwidth between clusters
      • Needs to be scheduled at an appropriate time
      • Can control resource utilization using parameters
  17. distcp
      • hadoop distcp [options] <srcURL> … <srcURL> <destURL>
        – Source paths need to be absolute
        – Destination directory will be created if not present
        – "update" option will update only the changed files
        – "skipcrccheck" option to disable the checksum comparison
        – "overwrite" option to overwrite existing files, which by default are skipped if present
        – "delete" option to delete files in the destination which are not in the source
        – "hftp" fs needs to be used to copy between different versions of HDFS
        – "m" option to specify the number of mappers
  18. distcp
      – "atomic" option to commit all changes or none
      – "async" to run distcp asynchronously, i.e. non-blocking
      – "i" option to ignore failures during copy
      – "log" directory on DFS where logs are to be saved
      – "p [rbugp]" to preserve file status as in the source
      – "strategy [static|dynamic]"
      – "bandwidth [MB]" bandwidth per map in MB
  19. HDFS Java APIs
      Function | API
      Directory Create | FileSystem.mkdirs(path, permission)
      Directory Rename/Move | FileSystem.rename(oldpath, newpath)
      Directory Delete | FileSystem.delete(path, true)
      File Create | FileSystem.createNewFile(path)
      File Open | FileSystem.open(path)
      File Read | FSDataInputStream.read*
      File Write | FSDataOutputStream.write*
      File Rename/Move | FileSystem.rename(oldpath, newpath)
      File Delete | FileSystem.delete(path, false)
      File Append | FileSystem.append(path)
      File Seek | FSDataInputStream.seek(long)
      File System | FileSystem.get(conf)
  20. HDFS Federation (diagram source: hadoop.apache.org, JIRA HDFS-1052)
      HDFS without Federation
      - Namespace management and block management together
      - Supports one namespace
      - Hinders scalability above 4000 nodes
      - Doesn't support some multi-tenancy requirements
      HDFS with Federation
      - Namespace management and block management separated
      - Block management can run on a node of its own
      - Supports more than one namespace/NN
      - Scalable beyond 4000 nodes and millions of rows
      - Can meet multi-tenancy requirements like an NN for specific departments, and isolation
      - A namespace and its block pool together are called a namespace volume
  21. Enabling HDFS federation
      • Identify a unique cluster id
      • Identify nameservice ids for the name nodes
      • Add dfs.nameservices to hdfs-site.xml
        – Comma separated nameservice (ns) names
      • Update hdfs-site.xml on all NNs and DNs
        – dfs.namenode.rpc-address.ns
        – dfs.namenode.http-address.ns
        – dfs.namenode.servicerpc-address.ns
        – dfs.namenode.https-address.ns
        – dfs.namenode.secondary.http-address.ns
        – dfs.namenode.backup.address.ns
      • Format all name nodes using the cluster id
        – hdfs namenode -format -clusterId <cluster id>
  22. HDFS Rack Awareness
      • Rack awareness enables efficient data placement
        – Data writes
        – Balancer
        – Decommissioning/commissioning of nodes
      • Each node is assigned to a rack (rack id)
        – Rack id is used in the path names
      • Data placement
        – First replica of a block is placed near the client or on a random node/rack
        – Second replica of the block is placed on a node in a second rack
        – Third replica is placed on a different node in the second rack
        – If HDFS is not rack aware, the second and third replicas are placed on random nodes
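The placement rules on this slide can be sketched as follows. This is a simplified model, not the name node's actual placement code: `topology` is a hypothetical dict of rack id to node names, the cluster is assumed to have at least two racks, and the second rack is assumed to hold at least two nodes.

```python
import random


def place_replicas(topology: dict, client_node: str, rng=random):
    """Pick three target nodes following the first/second/third replica rules."""
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}

    # First replica: the client's own node if it is part of the cluster,
    # otherwise a random node.
    first = client_node if client_node in rack_of else rng.choice(sorted(rack_of))

    # Second replica: a node on a different rack from the first.
    other_racks = sorted(rack for rack in topology if rack != rack_of[first])
    second_rack = rng.choice(other_racks)
    second = rng.choice(topology[second_rack])

    # Third replica: a different node on the same rack as the second.
    third = rng.choice([n for n in topology[second_rack] if n != second])
    return [first, second, third]


topology = {"/rack1": ["h1", "h2"], "/rack2": ["h3", "h4"]}
targets = place_replicas(topology, "h1", random.Random(0))
print(targets)
```

Keeping two replicas on one remote rack means the write pipeline crosses racks only once while the block still survives the loss of either rack.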
  23. Enabling HDFS Rack Awareness
      • Update core-site.xml with topology properties
        – topology.script.file.name
          • Script can be a shell script, Python, Java
        – topology.script.number.args
      • Copy the script to the conf directory
      • Distribute the script and core-site.xml
      • Stop and start the name node
      • Verify that the racks are recognized by HDFS
        – hdfs fsck -racks
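A topology script receives one or more host names or IP addresses as arguments and prints one rack id per argument. The sketch below shows what such a script could look like in Python; the addresses and rack names in `RACKS` are made up for illustration, and unknown hosts fall back to `/default-rack`.

```python
#!/usr/bin/env python3
import sys

# Hypothetical host-to-rack mapping; a real script might read this from a file.
RACKS = {
    "10.0.1.11": "/rack1",
    "10.0.1.12": "/rack1",
    "10.0.2.21": "/rack2",
}
DEFAULT_RACK = "/default-rack"


def resolve(hosts):
    """Return the rack id for each host, in the same order as the arguments."""
    return [RACKS.get(host, DEFAULT_RACK) for host in hosts]


if __name__ == "__main__":
    # One rack id per argument, separated by spaces.
    print(" ".join(resolve(sys.argv[1:])))
```

The script has to be executable by the user running the name node, and topology.script.number.args caps how many hosts are passed in a single invocation.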
  24. HDFS NFS Gateway
      • Allows HDFS to be mounted as part of the local FS
      • Stateless daemon translates NFS to the HDFS access protocol
      • DFSClient is part of the gateway daemon
        – Averages 30 MB/s for writes
      • Multiple gateways can be used for scalability
      • Gateway machine requires all the software and configs of an HDFS client
        – Gateway can be run on HDFS cluster nodes
      • Random writes are not supported
      [Diagram: a client mounts HDFS over NFSv3 through the NFS Gateway (DFSClient), which talks RPC, as an HDFS client, to the HDFS cluster (NN and DNs).]
  25. HDFS NFS Gateway Configuration
      • Consists of two daemons
        – portmap and nfs3
      • Configuration
        – dfs.namenode.accesstime.precision; 3600000 (1 hr)
          • Requires a name node restart
        – dfs.nfs3.dump.dir; dir to store out-of-sequence data
          • Needs enough space to store data for all concurrent file writes
          • Use NFS for smaller file transfers, in the order of 1 GB
        – dfs.nfs.exports.allowed.hosts; host access
          • client*.abc.com r;client*.xyc.com rw
        – Update the log4j.properties file
          • log4j.logger.org.apache.hadoop.hdfs.nfs=DEBUG
          • log4j.logger.org.apache.hadoop.oncrpc=DEBUG
  26. HDFS NFS Gateway Configuration
      • Stop the nfs & rpcbind services provided by the OS
        – service nfs stop
        – service rpcbind stop
      • Start hadoop portmap as root
        – hadoop-daemon.sh start portmap
        – To stop, use "stop" instead of "start" as the parameter
      • Start mountd and nfsd as the user starting HDFS
        – hadoop-daemon.sh start nfs3
        – To stop, use "stop" instead of "start" as the parameter
  27. HDFS NFS Gateway Configuration
      • Validate that the NFS services are running
        – rpcinfo -p $nfs_server_ip
        – Should see entries for mountd, portmapper & nfs
      • Verify that the HDFS namespace is exported for mount
        – showmount -e $nfs_server_ip
        – Should see the export list
      • Mount HDFS on the client
        – Create a mount point as root
        – Change ownership of the mount point to the user running the HDFS cluster
        – mount -t nfs -o vers=3,proto=tcp,nolock $nfs_server:/ $mount_point
        – Client sends the UID of the user to NFS
        – NFS looks up the username for the UID and uses it to access HDFS
        – User name and UID should be the same on the client and NFS
  28. HDFS Name Node HA
      [Diagram: an active name node and a passive name node, each paired with a ZKFC, share storage; the ZKFCs connect to a ZooKeeper quorum of three ZK nodes; the data nodes send heartbeats (HB) to both name nodes.]
      • ZooKeeper does failure detection and helps with active name node election
      • ZKFC (ZooKeeper Failover Controller)
        – Monitors the health of the name node
        – Holds a session open on ZK and a lock for the active NN
        – If no other NN holds the zlock, it tries to acquire it to make its NN active
      • Shared storage can be an NFS mount or a quorum of journal nodes
      • Fencing is defined to prevent the split-brain scenario of two NNs writing
  29. HDFS NN HA Configuration
      • Define dfs.nameservices
        – Nameservice ID
      • Define dfs.ha.namenodes.[nameservice ID]
        – Comma separated list of name nodes
      • Define dfs.namenode.rpc-address.[nameservice ID].[name node ID]
        – Fully qualified machine name and port
      • Define dfs.namenode.http-address.[nameservice ID].[name node ID]
        – Fully qualified machine name and port
      • Define dfs.namenode.shared.edits.dir
        – For NFS: file:///mnt/...
        – For journal nodes: qjournal://node1:8485;node2.com:8485;
      • Define dfs.client.failover.proxy.provider.[nameservice ID]
        – org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider
      • Define dfs.ha.fencing.methods
        – sshfence; requires passwordless ssh into the name nodes from one another
        – shell
      • Define fs.defaultFS as the HA enabled logical URI
      • For journal nodes
        – Define dfs.journalnode.edits.dir, where edits and other local state used by the JNs will be stored
  30. HDFS NN HA Configuration
      • Define dfs.ha.automatic-failover.enabled
        – Set to true
      • Define ha.zookeeper.quorum
        – Host and port of ZK
      • To enable HA in an existing cluster
        – Run hdfs dfsadmin -safemode enter
        – Run hdfs dfsadmin -saveNamespace
        – Stop the HDFS cluster: stop-dfs.sh
        – Start the journal node daemons: hadoop-daemon.sh start journalnode
        – Run hdfs zkfc -formatZK on the existing NN
        – Run hdfs namenode -initializeSharedEdits on the existing NN
        – Run hdfs namenode -bootstrapStandby on the new NN
        – Delete the secondary name node
        – Start the HDFS cluster: start-dfs.sh
  31. hdfs haadmin
      • -ns <nameserviceId>
      • -transitionToActive <serviceId>
      • -transitionToStandby <serviceId>
      • -failover <serviceId> <serviceId>
        – [--forcefence] [--forceactive]
      • -getServiceState <serviceId>
      • -checkHealth <serviceId>
      • -help <command>
  32. hdfs dfsadmin
      • -report
      • -safemode [enter|leave|get|wait]
      • -finalizeUpgrade
      • -refreshNodes: uses the files defined in dfs.hosts & dfs.hosts.exclude
      • -upgradeProgress status
      • -metasave
      • -setQuota <quota> / -clrQuota <dirname>…<dirname>
      • Related hadoop fs commands: -lsr; -setrep [-w] <rep> <path/file>
  33. hdfs fsck
      • hdfs fsck [options] path
        – move
        – delete
        – openforwrite
        – files
        – blocks
        – locations
        – racks
  34. Balancer
      • start-balancer.sh
        – policy datanode|blockpool
        – threshold <percentage>; default 10%
        – dfs.datanode.balance.bandwidthPerSec, specified in bytes
          • Default 1 MB/sec
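The threshold can be read as the allowed distance, in percentage points, between a data node's utilization and the cluster-wide average. The sketch below classifies nodes on that basis; it assumes the datanode policy, uses invented usage figures, and `classify` is a hypothetical helper rather than balancer code.

```python
def classify(nodes: dict, threshold: float = 10.0):
    """Split nodes into over- and under-utilized relative to the cluster average.

    `nodes` maps a data node name to (used_bytes, capacity_bytes).
    Returns (average_utilization_percent, over_utilized, under_utilized).
    """
    total_used = sum(used for used, _ in nodes.values())
    total_capacity = sum(cap for _, cap in nodes.values())
    average = 100.0 * total_used / total_capacity

    over, under = [], []
    for name, (used, cap) in sorted(nodes.items()):
        utilization = 100.0 * used / cap
        if utilization > average + threshold:
            over.append(name)    # candidate source of block moves
        elif utilization < average - threshold:
            under.append(name)   # candidate target of block moves
    return average, over, under


nodes = {"dn1": (90, 100), "dn2": (50, 100), "dn3": (10, 100)}
average, over, under = classify(nodes)
print(average, over, under)
```

A smaller threshold gives a more even cluster at the cost of more block movement, which the bandwidth setting then throttles.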
  35. Adding New Nodes
      • Add the node address to the dfs.hosts file
        – Update the mapred.hosts file if using mapred
      • Update the namenode with the new set of nodes
        – hadoop dfsadmin -refreshNodes
        – Update the jobtracker with the new set of nodes
          • hadoop mradmin -refreshNodes
      • Update the "slaves" file with the new node names
      • Start the new datanodes (and tasktrackers)
      • Check the availability of the new nodes in the UI
      • Run the balancer so that data is distributed
  36. Decommissioning Nodes
      • Add the node address to the exclude file
        – dfs.hosts.exclude
        – mapred.hosts.exclude
      • Update the namenode (and jobtracker)
        – hadoop dfsadmin -refreshNodes
        – hadoop mradmin -refreshNodes
      • Verify that all the nodes are decommissioned (UI)
      • Remove the nodes from the dfs.hosts (and mapred.hosts) file
      • Update the namenode (and jobtracker)
      • Remove the nodes from the "slaves" file
  37. HDFS Upgrade
      • No file system layout change
        – Install the new version of HDFS (and MapReduce)
        – Stop the old daemons
        – Update the configuration files
        – Start the new daemons
        – Update clients to use the new libraries
        – Remove the old install and the configuration files
        – Update application code for deprecated APIs
  38. HDFS Upgrade
      • With file system layout changes
        – When there is a layout change the NN will not start
        – Run fsck to make sure that the FS is healthy
        – Keep a copy of the fsck output for verification
        – Clear HDFS and MapReduce temporary files
        – Make sure that any previous upgrade is finalized
        – Shut down MapReduce and kill orphaned tasks
        – Shut down HDFS and make a copy of the NN directories
        – Install the new versions of HDFS and MapReduce
        – Start HDFS with the -upgrade option
          • start-dfs.sh -upgrade
        – Once the upgrade is complete, perform manual spot checks
          • hadoop dfsadmin -upgradeProgress status
        – Start MapReduce
        – Roll back or finalize the upgrade
          • stop-dfs.sh; start-dfs.sh -rollback
          • hadoop dfsadmin -finalizeUpgrade
  39. Key Parameters
      Parameter | Description | Default Value
      dfs.blocksize | File block size | 128 MB
      dfs.replication | File block replication count | 3
      dfs.datanode.numblocks | No. of blocks after which a new subdirectory gets created in the DN |
      io.bytes.per.checksum | Number of data bytes for which a checksum is calculated | 512
      dfs.datanode.scan.period.hours | Timeframe in hours to complete block scanning | 504 (3 weeks)
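To see how these defaults interact, the following sketch works out the footprint of a single file. It is plain arithmetic under the default values listed above; `block_layout` and its result keys are hypothetical names, and per-block metadata overhead is ignored.

```python
MB = 1024 * 1024


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division."""
    return -(-a // b)


def block_layout(file_size, block_size=128 * MB, replication=3,
                 bytes_per_checksum=512):
    """Estimate blocks, replicas, raw storage and checksum chunks for one file."""
    blocks = ceil_div(file_size, block_size)
    return {
        "blocks": blocks,
        "block_replicas": blocks * replication,
        "raw_bytes": file_size * replication,
        "checksum_chunks": ceil_div(file_size, bytes_per_checksum),
    }


layout = block_layout(300 * MB)
print(layout)
```

A 300 MB file therefore occupies three blocks (the last one only partly filled), nine block replicas across the cluster and 900 MB of raw disk.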
  40. bnair@asquareb.com, blog.asquareb.com, https://github.com/bijugs, @gsbiju
