HBase and HDFS: Understanding FileSystem Usage in HBase



  1. © Hortonworks Inc. 2011
     HBase and HDFS: Understanding file system usage in HBase
     Enis Söztutar
     enis [ at ] apache [dot] org
     @enissoz
  2. About Me
     Architecting the Future of Big Data
     • In the Hadoop space since 2007
     • Committer and PMC Member in Apache HBase and Hadoop
     • Working at Hortonworks as Member of Technical Staff
     • Twitter: @enissoz
  3. Motivation
     • HBase, as a database, depends on the FileSystem for many things
     • HBase has to work over HDFS, Linux and Windows
     • HBase is the most advanced user of HDFS
     • To tune IO performance, you have to understand how HBase does IO
     MapReduce:
       – Large files
       – Few random seeks
       – Batch oriented
       – High throughput
       – Failure handling at task level
       – Computation moves to data
     HBase:
       – Large files
       – A lot of random seeks
       – Latency sensitive
       – Durability guarantees with sync
       – Computation generates local data
       – Large number of open files
  4. Agenda
     • Overview of file types in HBase
     • Durability semantics
     • IO Fencing / Lease recovery
     • Data locality
       – Short circuit reads (SSR)
       – Checksums
       – Block Placement
     • Open topics
  5. HBase file types
  6. Overview of file types
     • Mainly three types of files in HBase
       – Write Ahead Logs (a.k.a. WALs, logs)
       – Data files (a.k.a. store files, hfiles)
       – References / symbolic or logical links (0 length files)
     • Every file is 3-way replicated
  7. Overview of file types
     /hbase/.archive
     /hbase/.logs/
     /hbase/.logs/server1,60020,1370043265148/
     /hbase/.logs/server1,60020,1370043265148/server1%2C60020%2C1370043265148.1370050467720
     /hbase/.logs/server1,60020,1370043265105/server1%2C60020%2C1370043265105.1370046867591
     …
     /hbase/.oldlogs
     /hbase/usertable/0711fd70ce0df641e9440e4979d67995/family/449e2fa173c14747b9d2e5..
     /hbase/usertable/0711fd70ce0df641e9440e4979d67995/family/9103f38174ab48aa898a4b..
     /hbase/table1/565bfb6220ca3edf02ac1f425cf18524/f1/49b32d3ee94543fb9055..
     /hbase/.hbase-snapshot/usertable_snapshot/0ae3d2a93d3cf34a7cd30../family/12f114..
     …
     Labels on the slide: Write Ahead Logs, Data files, Links
  8. Data Files (HFile)
     • Immutable once written
     • Generated by flush or compactions (sequential writes)
     • Read randomly (preads), or sequentially
     • Big in size (flushsize -> tens of GBs)
     • All data is in blocks (HFile blocks, not to be confused with HDFS blocks)
     • Data blocks have target size:
       – BLOCKSIZE in column family descriptor
       – 64K by default
       – Uncompressed and un-encoded size
     • Index blocks (leaf, intermediate, root) have target size:
       – hfile.index.block.max.size, 128K by default
     • Bloom filter blocks have target size:
       – io.storefile.bloom.block.size, 128K by default
  9. Data Files (HFile version 2.x)
  10. Data Files
      • IO happens at block boundaries
        – Random reads => disk seek + read whole block sequentially
        – Read blocks are put into the block cache
        – Leaf index blocks and bloom filter blocks also go to the block cache
      • Use smaller block sizes for faster random access
        – Smaller read + faster in-block search
        – Block index becomes bigger, more memory consumption
      • Larger block sizes for faster scans
      • Think about how many key values will fit in an average block
      • Try compression and Data Block Encoding (PREFIX, DIFF, FAST_DIFF, PREFIX_TREE)
        – Minimizes file sizes + on disk block sizes
      KeyValue layout (field: type):
        Key length:        Int (4)
        Value length:      Int (4)
        Row length:        Short (2)
        Row key:           Byte[]
        Family length:     byte
        Family:            Byte[]
        Column qualifier:  Byte[]
        Timestamp:         Long (8)
        KeyType:           byte
        Value:             Byte[]
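The advice to think about how many key values fit in an average block can be turned into a quick estimate from the KeyValue layout on the slide. The following is a rough sketch only: the class name, method names, and the example field lengths are hypothetical, and it ignores block headers, compression, and data block encoding.

```java
final class KeyValueSizeEstimate {
    // Fixed-width fields from the slide's layout:
    // key length (4) + value length (4) + row length (2)
    // + family length (1) + timestamp (8) + key type (1) = 20 bytes.
    static final int FIXED_OVERHEAD = 4 + 4 + 2 + 1 + 8 + 1;

    // Serialized size of one KeyValue for the given variable-length parts.
    static int keyValueSize(int rowLen, int familyLen, int qualifierLen, int valueLen) {
        return FIXED_OVERHEAD + rowLen + familyLen + qualifierLen + valueLen;
    }

    // Whole KeyValues that fit into a block of the given target size.
    static int keyValuesPerBlock(int blockSize, int keyValueSize) {
        return blockSize / keyValueSize;
    }

    public static void main(String[] args) {
        // Hypothetical schema: 16-byte row key, 1-byte family name,
        // 8-byte qualifier, 100-byte value.
        int size = keyValueSize(16, 1, 8, 100);
        System.out.println(size + " bytes per KeyValue");
        System.out.println(keyValuesPerBlock(64 * 1024, size) + " KeyValues per 64K block");
    }
}
```

With long row keys and small values, most of each block is key material, which is the case where data block encoding tends to pay off.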
  11. Reference Files / Links
      • When a region is split, “reference files” are created referring to the top or bottom half of the parent store file according to the split key
      • HBase does not delete data/WAL files, it just “archives” them:
        – /hbase/.oldlogs
        – /hbase/.archive
      • Logs/hfiles are kept until the TTL expires and no replication or snapshot refers to them
        – (hbase.master.logcleaner.ttl, 10min)
        – (hbase.master.hfilecleaner.ttl, 5min)
      • HFileLink: an application-specific kind of hard / soft link
      • HBase snapshots are logical links to files (with backrefs)
  12. Write Ahead Logs
      • One logical WAL per region / one physical per regionserver
      • Rolled frequently
        – hbase.regionserver.logroll.multiplier (0.95)
        – hbase.regionserver.hlog.blocksize (default file system block size)
      • Chronologically ordered set of files, only the last one is open for writing
      • Exceeding hbase.regionserver.maxlogs (32) will force a flush
      • Old log files are deleted as a whole
      • Every edit is appended
      • Sequential writes to the WAL, synced very frequently (hundreds of times per sec)
      • Only sequential reads, from replication and crash recovery
      • One log file per region server limits the write throughput per Region Server
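To make the roll settings concrete, the sketch below works through the arithmetic for a log that is rolled once it reaches the block size times the multiplier, and the resulting ceiling on WAL data before a flush is forced. The 64 MB block size is an assumption (the slide only says "default file system block size"), and the class and method names are illustrative.

```java
final class WalRollEstimate {
    // Size at which a log is rolled: blocksize x multiplier.
    static long rollSizeBytes(long blockSizeBytes, double multiplier) {
        return (long) (blockSizeBytes * multiplier);
    }

    // Rough ceiling on WAL bytes kept before maxlogs forces a flush.
    static long maxWalBytes(long rollSizeBytes, int maxLogs) {
        return rollSizeBytes * maxLogs;
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024;          // assumed 64 MB block size
        long rollSize = rollSizeBytes(blockSize, 0.95);
        System.out.println("roll at " + rollSize + " bytes");
        System.out.println("at most about " + maxWalBytes(rollSize, 32) + " bytes of WAL");
    }
}
```

Under these assumptions a region server carries roughly 2 GB of unflushed WAL data at most, which also bounds how much has to be replayed after a crash.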
  13. Durability (as in ACID)
  14. Overview of Write Path
      1. Client sends the operations over RPC (Put/Delete)
      2. Obtain row locks
      3. Obtain the next mvcc write number
      4. Tag the cells with the mvcc write number
      5. Add the cells to the memstores (changes not visible yet)
      6. Append a WALEdit to WAL, do not sync
      7. Release row locks
      8. Sync WAL
      9. Advance mvcc, make changes visible
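The ordering of the write path steps matters: the WAL sync happens after the row locks are released, and the edit becomes visible only once mvcc is advanced. The toy trace below records that order for a single put. It is a single-threaded illustration under loud assumptions: `WritePathSketch` and all of its members are hypothetical, and nothing here is HBase code or models real locking, memstores, or WAL IO.

```java
import java.util.ArrayList;
import java.util.List;

final class WritePathSketch {
    final List<String> trace = new ArrayList<>();
    private long nextWriteNumber = 1; // next mvcc write number to hand out
    private long readPoint = 0;       // highest write number visible to readers

    long readPoint() {
        return readPoint;
    }

    // Records steps 2 to 9 of the write path for one row, in order.
    long put(String row, String value) {
        trace.add("lock row " + row);                          // 2. obtain row lock
        long writeNumber = nextWriteNumber++;                  // 3. next mvcc write number
        String cell = row + "=" + value + "@" + writeNumber;   // 4. tag cell with it
        trace.add("memstore add " + cell);                     // 5. not visible yet
        trace.add("wal append");                               // 6. append, no sync
        trace.add("unlock row " + row);                        // 7. release row lock
        trace.add("wal sync");                                 // 8. sync outside the lock
        readPoint = writeNumber;                               // 9. make change visible
        trace.add("mvcc advance " + writeNumber);
        return writeNumber;
    }

    public static void main(String[] args) {
        WritePathSketch region = new WritePathSketch();
        region.put("row1", "v1");
        region.trace.forEach(System.out::println);
    }
}
```

Syncing after the lock is released keeps the expensive step out of the critical section, while the late mvcc advance keeps readers from seeing an edit that is not yet durable.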
  15. Durability
      • 0.94 and before:
        – HTable property “DEFERRED_LOG_FLUSH” and
        – Mutation.setWriteToWAL(false)
      • 0.94 and 0.96:
        Durability     Semantics
        USE_DEFAULT    Use global hbase default, OR table default (SYNC_WAL)
        SKIP_WAL       Do not write updates to WAL
        ASYNC_WAL      Write entries to WAL asynchronously (hbase.regionserver.optionallogflushinterval, 1 sec default)
        SYNC_WAL       Write entries to WAL, flush to datanodes
        FSYNC_WAL      Write entries to WAL, fsync in datanodes
  16. Durability
      • 0.94: Durability setting per Mutation (HBASE-7801) / per table (HBASE-8375)
      • Allows intermixing different durability settings for updates to the same table
      • Durability is chosen from the mutation, unless it is USE_DEFAULT, in which case the Table’s Durability is used
      • Limit the amount of time an edit can live in the memstore (HBASE-5930)
        – hbase.regionserver.optionalcacheflushinterval
        – Default 1hr
        – Important for SKIP_WAL
        – Causes a flush if there are unflushed edits that are older than optionalcacheflushinterval
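The rule for choosing the effective durability can be written as a small function. This is a sketch of the rule as stated on the slides, not HBase's own code: the enum is redeclared locally so the example is self-contained, `effective` is a hypothetical helper, and the SYNC_WAL fallback follows the USE_DEFAULT row of the semantics table.

```java
final class DurabilityResolution {
    // Local copy of the enum values shown in the talk.
    enum Durability { USE_DEFAULT, SKIP_WAL, ASYNC_WAL, SYNC_WAL, FSYNC_WAL }

    // The mutation's setting wins unless it is USE_DEFAULT; then the table's
    // setting applies; if that is also USE_DEFAULT, fall back to SYNC_WAL.
    static Durability effective(Durability mutation, Durability table) {
        if (mutation != Durability.USE_DEFAULT) {
            return mutation;
        }
        if (table != Durability.USE_DEFAULT) {
            return table;
        }
        return Durability.SYNC_WAL;
    }

    public static void main(String[] args) {
        System.out.println(effective(Durability.USE_DEFAULT, Durability.ASYNC_WAL));
        System.out.println(effective(Durability.SKIP_WAL, Durability.ASYNC_WAL));
        System.out.println(effective(Durability.USE_DEFAULT, Durability.USE_DEFAULT));
    }
}
```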
  17. Durability
      public enum Durability {
        USE_DEFAULT,
        SKIP_WAL,
        ASYNC_WAL,
        SYNC_WAL,
        FSYNC_WAL
      }
      Per Table:
        HTableDescriptor htd = new HTableDescriptor("myTable");
        htd.setDurability(Durability.ASYNC_WAL);
        admin.createTable(htd);
      Shell:
        hbase(main):007:0> create 't12', 'f1', DURABILITY => 'ASYNC_WAL'
      Per mutation:
        Put put = new Put(rowKey);
        put.setDurability(Durability.ASYNC_WAL);
        table.put(put);
  18. Durability (Hflush / Hsync)
      • Hflush(): Flush the data packet down the datanode pipeline. Wait for acks.
      • Hsync(): Flush the data packet down the pipeline. Have datanodes execute FSYNC equivalent. Wait for acks.
      • hflush is currently the default; hsync() usage in HBase is not implemented (HBASE-5954). It is also not optimized (2x slower) and only available in Hadoop 2.0.
      • hflush does not lose data, unless all 3 replicas die without syncing to disk (datacenter power failure)
      • Ensure that the log is replicated 3 times
        – hbase.regionserver.hlog.tolerable.lowreplication defaults to FileSystem default replication count (3 for HDFS)
      public interface Syncable {
        public void hflush() throws IOException;
        public void hsync() throws IOException;
      }
  20. IO Fencing
      Fencing is the process of isolating a node of a computer cluster or protecting shared resources when a node appears to be malfunctioning
  21. IO Fencing
      [Diagram. Labels: Client; Region Server A (dying) with Region1 and WAL; Region Server B with Region1 and WAL; edit, Append+sync, ack; Master; Zookeeper; RegionServer A session timeout; RegionServer A znode deleted; assign; assignment table (Region1: Region Server A, Region 2: …); YouAreDeadException; abort]
  22. IO Fencing
      • Split Brain
      • Ensure that a region is only hosted by a single region server at any time
      • If the master thinks that a region server no longer hosts the region, the RS should not be able to accept and sync() updates
      • Master renames the region server logs directory on HDFS:
        – Current WAL cannot be rolled, new log file cannot be created
        – For each WAL, recoverLease() is called before replaying
        – recoverLease => lease recovery + block recovery
        – Ensure that WAL is closed, and all data is visible (file length)
      • Guarantees for region data files:
        – Compactions => Remove files + add files
        – Flushes => Allowed since resulting data is idempotent
      • HBASE-2231, HBASE-7878, HBASE-8449
  23. Data Locality
      Short circuit reads, checksums, block placement
  24. HDFS local reads (short circuit reads)
      • Bypasses the datanode layer and goes directly to the OS files
      • Hadoop 1.x implementation:
        – DFSClient asks the local datanode for the local paths of a block
        – Datanode checks whether the user has permission
        – Client gets the path for the block, opens the file with FileInputStream
      hdfs-site.xml
        dfs.block.local-path-access.user = hbase
        dfs.datanode.data.dir.perm = 750
      hbase-site.xml
        dfs.client.read.shortcircuit = true
      [Diagram. Labels: HBase Client; RPC; RegionServer; Hadoop FileSystem; DFSClient; BlockReader; Datanode; OS Filesystem (ext3); Disks]
  25. HDFS local reads (short circuit reads)
      • Hadoop 2.0 implementation (HDFS-347)
        – Keep the legacy implementation
        – Use Unix Domain sockets to pass the File Descriptor (FD)
        – Datanode opens the block file and passes the FD to the BlockReaderLocal running in the Regionserver process
        – More secure than the previous implementation
        – Windows also supports domain sockets, need to implement native APIs
      • Local buffer size dfs.client.read.shortcircuit.buffer.size
        – BlockReaderLocal will fill this whole buffer every time HBase tries to read an HFile block
        – dfs.client.read.shortcircuit.buffer.size = 1MB vs 64KB HFile block size
        – SSR buffer is a direct buffer (in Hadoop 2, not in Hadoop 1)
        – # regions x # stores x # avg store files x # avg blocks per file x SSR buffer size
        – 10 regions x 2 x 4 x (1GB / 64MB) x 1MB = 1280MB (about 1.28GB) non-heap memory usage
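The non-heap estimate for short circuit read buffers is a plain product, so it is easy to recompute for a different cluster shape. The sketch below reproduces the slide's example figures; the class and parameter names are illustrative, and the formula is the slide's own worst-case estimate rather than a measurement.

```java
final class SsrBufferEstimate {
    // regions x stores per region x files per store x blocks per file x buffer size
    static long estimateBytes(long regions, long storesPerRegion, long filesPerStore,
                              long blocksPerFile, long bufferBytes) {
        return regions * storesPerRegion * filesPerStore * blocksPerFile * bufferBytes;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024;
        long blocksPerFile = (1024 * mb) / (64 * mb); // 1GB store file / 64MB HDFS block = 16
        long bytes = estimateBytes(10, 2, 4, blocksPerFile, 1 * mb);
        System.out.println(bytes / mb + " MB of direct buffers");
    }
}
```

Lowering dfs.client.read.shortcircuit.buffer.size toward the HFile block size shrinks this estimate proportionally, since the buffer size is a linear factor.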
  26. Checksums
      • HDFS checksums are not inlined
      • Two files per block, one for data, one for checksums (HDFS-2699)
      • A random positioned read causes 2 seeks
      • HBase checksums come with 0.94 (HDP 1.2+). HBASE-5074.
      Files on the slide: blk_123456789, .blk_123456789.meta
      Legend: Data chunk (dfs.bytes-per-checksum, 512 bytes); Checksum chunk (4 bytes)
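The size of the separate checksum file follows from the chunk sizes in the legend: 4 checksum bytes for every 512 data bytes. The sketch below works that out for one block. The 64 MB block size is an assumption for illustration, the names are hypothetical, and any header in the checksum file is ignored.

```java
final class HdfsChecksumOverhead {
    // Checksum bytes needed for a block: one checksum per (possibly partial) data chunk.
    static long checksumBytes(long blockBytes, int bytesPerChecksum, int checksumSize) {
        long chunks = (blockBytes + bytesPerChecksum - 1) / bytesPerChecksum; // ceiling division
        return chunks * checksumSize;
    }

    public static void main(String[] args) {
        long block = 64L * 1024 * 1024; // assumed 64 MB HDFS block
        long sums = checksumBytes(block, 512, 4);
        System.out.println(sums / 1024 + " KB of checksums per block");
    }
}
```

The overhead is under 1% of the data, so the cost being discussed is the extra seek into a second file, not the space.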
  27. Checksums
      • HFile version 2.1 writes checksums per HFile block
      • HDFS checksum verification is bypassed on block read, it is done by HBase instead
      • If a checksum fails, we go back to reading checksums from HDFS for “some time”
      • Due to a double checksum bug (HDFS-3429) in remote reads in Hadoop 1, not enabled by default for now. Benchmark it yourself
      hbase.regionserver.checksum.verify = true
      hbase.hstore.bytes.per.checksum = 16384
      hbase.hstore.checksum.algorithm = CRC32C
      Never set this:
      dfs.client.read.shortcircuit.skip.checksum = false
      [Diagram. Labels: HFile; Hfile block; Block header; Hfile data block chunk; Checksum chunk]
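Per-block checksums amount to computing one checksum for every bytes-per-checksum chunk of a block and verifying them again on read. The sketch below illustrates that idea with the JDK's `java.util.zip.CRC32`. It is not HBase's implementation: it uses CRC32 rather than CRC32C for portability, keeps the checksums in a separate array instead of appending them to the block, and all names are hypothetical.

```java
import java.util.Arrays;
import java.util.zip.CRC32;

final class InlineChecksumSketch {
    // One CRC32 value per chunk of bytesPerChecksum bytes (last chunk may be shorter).
    static int[] checksums(byte[] block, int bytesPerChecksum) {
        int chunks = (block.length + bytesPerChecksum - 1) / bytesPerChecksum;
        int[] sums = new int[chunks];
        CRC32 crc = new CRC32();
        for (int i = 0; i < chunks; i++) {
            int offset = i * bytesPerChecksum;
            int length = Math.min(bytesPerChecksum, block.length - offset);
            crc.reset();
            crc.update(block, offset, length);
            sums[i] = (int) crc.getValue(); // low 32 bits hold the CRC
        }
        return sums;
    }

    // Recomputes the checksums on read and compares them with the stored ones.
    static boolean verify(byte[] block, int bytesPerChecksum, int[] stored) {
        return Arrays.equals(checksums(block, bytesPerChecksum), stored);
    }

    public static void main(String[] args) {
        byte[] block = new byte[64 * 1024];          // one 64K data block
        int[] stored = checksums(block, 16384);      // 4 chunks of 16384 bytes
        System.out.println("intact block verifies: " + verify(block, 16384, stored));
        block[20000] ^= 0x01;                        // simulate a flipped bit on disk
        System.out.println("corrupted block verifies: " + verify(block, 16384, stored));
    }
}
```

Because the checksums travel with the block, a random read needs only one seek, which is the motivation given for moving verification into HBase.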
  28. Default Block Placement Policy
      [Diagram. Labels: Rack 1 / Server 1, Rack N / Server M, Rack L / Server K, Rack X / Server Y, each with a RegionServer and a DataNode; Region A and Region B with their StoreFiles on Rack 1 / Server 1; block replicas b1, b2, b3, b9 on the DataNodes]
  29. Data locality for HBase
      • Poor data locality when the region is moved:
        – As a result of load balancing
        – Region server crash + failover
      • Most of the data won’t be local unless the files are compacted
      • Idea (from Facebook): Regions have affiliated nodes (primary, secondary, tertiary), HBASE-4755
      • When writing a data file, give hints to the NN that we want these locations for block replicas (HDFS-2576)
      • LB should assign the region to one of the affiliated nodes on server crash
        – Keep data locality
        – SSR will still work
      • Reduces data loss probability
  30. Default Block Placement Policy
      [Diagram. Labels: Rack 1 / Server 1, Rack N / Server M, Rack L / Server K, Rack X / Server Y, each with a RegionServer and a DataNode; Region A and Region B with their StoreFiles on Rack 1 / Server 1; block replicas b1, b2, b3, b9 on the DataNodes]
  31. Other considerations
      • HBase riding over Namenode HA
        – Both Hadoop 1 (NFS based) and Hadoop 2 HA (QJM, etc)
        – Heavily tested with full stack HA
      • Retry HDFS operations
      • Isolate FileSystem usage from HBase internals
      • Hadoop 2 vs Hadoop 1 performance
        – Hadoop 2 is coming!
      • HDFS snapshots vs HBase snapshots
        – HBase DOES NOT use HDFS snapshots
        – Need hardlinks
        – Super flush API
      • HBase security vs HDFS security
        – All files are owned by the HBase principal
        – No ACLs in HDFS. Allowing a user to read HFiles / snapshots directly is hard
  32. Open Topics
      • HDFS hard links
        – Rethink how we do snapshots, backups, etc
      • Parallel writes for WAL
        – Reduce latency on WAL syncs
      • SSD storage, cache
        – SSD storage type in Hadoop or local filesystem
        – Using SSDs as a secondary cache
        – Selectively place tables / column families on SSD
      • HDFS zero-copy reads (HDFS-3051, HADOOP-8148)
      • HDFS inline checksums (HDFS-2699)
      • HDFS Quorum reads (HBASE-7509)
  33. Thanks
      Questions?
      Enis Söztutar
      enis [ at ] apache [dot] org
      @enissoz