HBaseCon 2013: Apache HBase and HDFS - Understanding Filesystem Usage in HBase
Presented by: Enis Soztutar, Hortonworks
Slide 1
© Hortonworks Inc. 2011, "Architecting the Future of Big Data"
HBase and HDFS: Understanding file system usage in HBase
Enis Söztutar, enis [ at ] apache [dot] org, @enissoz
Slide 2: About Me
• In the Hadoop space since 2007
• Committer and PMC member in Apache HBase and Hadoop
• Working at Hortonworks as a member of the Technical Staff
• Twitter: @enissoz
Slide 3: Motivation
• HBase as a database depends on the file system for many things
• HBase has to work over HDFS, on Linux and Windows
• HBase is the most advanced user of HDFS
• To tune for IO performance, you have to understand how HBase does IO

MapReduce vs. HBase IO patterns:
• MapReduce: large files, few random seeks, batch oriented, high throughput, failure handling at the task level, computation moves to the data
• HBase: large files, a lot of random seeks, latency sensitive, durability guarantees with sync, computation generates local data, large number of open files
Slide 4: Agenda
• Overview of file types in HBase
• Durability semantics
• IO fencing / lease recovery
• Data locality
  – Short circuit reads (SSR)
  – Checksums
  – Block placement
• Open topics
Slide 5: HBase file types
Slide 6: Overview of file types
• Mainly three types of files in HBase:
  – Write-ahead logs (a.k.a. WALs, logs)
  – Data files (a.k.a. store files, hfiles)
  – References / symbolic or logical links (0-length files)
• Every file is 3-way replicated
Slide 7: Overview of file types

Write-ahead logs:
/hbase/.logs/
/hbase/.logs/server1,60020,1370043265148/
/hbase/.logs/server1,60020,1370043265148/server1%2C60020%2C1370043265148.1370050467720
/hbase/.logs/server1,60020,1370043265105/server1%2C60020%2C1370043265105.1370046867591
…
/hbase/.oldlogs

Data files:
/hbase/usertable/0711fd70ce0df641e9440e4979d67995/family/449e2fa173c14747b9d2e5..
/hbase/usertable/0711fd70ce0df641e9440e4979d67995/family/9103f38174ab48aa898a4b..
/hbase/table1/565bfb6220ca3edf02ac1f425cf18524/f1/49b32d3ee94543fb9055..

Links:
/hbase/.hbase-snapshot/usertable_snapshot/0ae3d2a93d3cf34a7cd30../family/12f114..
…
/hbase/.archive
Slide 8: Data files (HFile)
• Immutable once written
• Generated by flush or compactions (sequential writes)
• Read randomly (preads), or sequentially
• Big in size (flush size up to tens of GBs)
• All data is in blocks (HFile blocks, not to be confused with HDFS blocks)
• Data blocks have a target size:
  – BLOCKSIZE in the column family descriptor, 64K by default
  – Uncompressed and un-encoded size
• Index blocks (leaf, intermediate, root) have a target size:
  – hfile.index.block.max.size, 128K by default
• Bloom filter blocks have a target size:
  – io.storefile.bloom.block.size, 128K by default
Slide 9: Data files (HFile version 2.x)
(diagram-only slide)
Slide 10: Data files
• IO happens at block boundaries
  – A random read => disk seek + read the whole block sequentially
  – Read blocks are put into the block cache
  – Leaf index blocks and bloom filter blocks also go to the block cache
• Use smaller block sizes for faster random access
  – Smaller reads + faster in-block search
  – But the block index becomes bigger: more memory consumption
• Use larger block sizes for faster scans
• Think about how many key values will fit in an average block
• Try compression and data block encoding (PREFIX, DIFF, FAST_DIFF, PREFIX_TREE)
  – Minimizes file sizes + on-disk block sizes

KeyValue layout:
  Key length: Int (4 bytes)
  Value length: Int (4 bytes)
  Row length: Short (2 bytes)
  Row key: Byte[]
  Family length: Byte (1 byte)
  Family: Byte[]
  Column qualifier: Byte[]
  Timestamp: Long (8 bytes)
  KeyType: Byte (1 byte)
  Value: Byte[]
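The KeyValue layout above makes on-disk size easy to reason about. The helper below is an illustrative sketch, not HBase code (the class and method names are invented): it simply adds up the fixed-width fields and the variable-length byte arrays from the table.

```java
// Hypothetical helper (not from the HBase codebase): serialized size of one
// KeyValue, following the field layout in the slide above.
public class KeyValueSize {
    static int serializedSize(int rowLen, int familyLen, int qualifierLen, int valueLen) {
        int keyLen = 2 /* row length (Short) */ + rowLen
                   + 1 /* family length (Byte) */ + familyLen
                   + qualifierLen
                   + 8 /* timestamp (Long) */
                   + 1 /* key type (Byte) */;
        return 4 /* key length (Int) */ + 4 /* value length (Int) */
             + keyLen + valueLen;
    }

    public static void main(String[] args) {
        // A 10-byte row, 2-byte family, 4-byte qualifier, 100-byte value:
        System.out.println(serializedSize(10, 2, 4, 100)); // 136
    }
}
```

Note that even an empty cell carries 20 bytes of fixed overhead, which is one reason short row keys and family names matter.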
Slide 11: Reference files / links
• When a region is split, "reference files" are created that refer to the top or bottom half of the parent store file according to the split key
• HBase does not delete data/WAL files, it just "archives" them in /hbase/.oldlogs and /hbase/.archive
• Logs/hfiles are kept until their TTL expires and no replication or snapshot refers to them
  – hbase.master.logcleaner.ttl (10 min)
  – hbase.master.hfilecleaner.ttl (5 min)
• HFileLink: a kind of application-specific hard/soft link
• HBase snapshots are logical links to files (with back references)
Slide 12: Write-ahead logs
• One logical WAL per region / one physical WAL per region server
• Rolled frequently
  – hbase.regionserver.logroll.multiplier (0.95)
  – hbase.regionserver.hlog.blocksize (default file system block size)
• A chronologically ordered set of files; only the last one is open for writing
• Exceeding hbase.regionserver.maxlogs (32) will cause a forced flush
• Old log files are deleted as a whole
• Every edit is appended
• Sequential writes to the WAL, synced very frequently (hundreds of times per second)
• Only sequential reads, from replication and crash recovery
• One log file per region server limits the write throughput per region server
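The two roll settings above combine into a simple threshold: a WAL is rolled once it reaches multiplier x block size, so a log file stays within a single HDFS block. A back-of-the-envelope sketch (the helper name is invented for illustration):

```java
public class WalRollThreshold {
    // WAL roll point = hlog block size x logroll multiplier
    // (values from the slide: multiplier 0.95, block size = the
    // underlying file system's block size).
    static long rollThresholdBytes(long hlogBlockSizeBytes, double multiplier) {
        return Math.round(hlogBlockSizeBytes * multiplier);
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024; // assuming a 128 MB HDFS block size
        System.out.println(rollThresholdBytes(blockSize, 0.95));
    }
}
```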
Slide 13: Durability (as in ACID)
Slide 14: Overview of the write path
1. Client sends the operations over RPC (Put/Delete)
2. Obtain row locks
3. Obtain the next mvcc write number
4. Tag the cells with the mvcc write number
5. Add the cells to the memstores (changes not visible yet)
6. Append a WALEdit to the WAL, do not sync
7. Release row locks
8. Sync the WAL
9. Advance mvcc, make changes visible
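The ordering of these steps is the important part: an edit sits in the memstore before the WAL sync, but readers cannot see it until the mvcc read point advances after the sync. A toy sketch of that visibility rule (illustrative code only, not HBase internals; locking and the real WAL are elided):

```java
import java.util.ArrayList;
import java.util.List;

public class WritePathSketch {
    static long mvccReadPoint = 0;  // readers see writes <= this number
    static long mvccNext = 0;
    static final List<long[]> memstore = new ArrayList<>(); // {writeNumber, value}

    // A reader scans the memstore but skips cells newer than the read point.
    static List<Long> visible() {
        List<Long> out = new ArrayList<>();
        for (long[] cell : memstore)
            if (cell[0] <= mvccReadPoint) out.add(cell[1]);
        return out;
    }

    public static void main(String[] args) {
        long writeNumber = ++mvccNext;              // step 3: next mvcc write number
        memstore.add(new long[]{writeNumber, 42});  // step 5: in memstore, not visible
        System.out.println(visible());              // [] : WAL not synced yet
        // steps 6 and 8: append the WALEdit and sync it (durability point)
        mvccReadPoint = writeNumber;                // step 9: advance mvcc
        System.out.println(visible());              // [42] : visible after the sync
    }
}
```

This is why a crash between steps 5 and 8 loses nothing a client was ever allowed to read.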
Slide 15: Durability
• 0.94 and before:
  – the HTable property "DEFERRED_LOG_FLUSH", and
  – Mutation.setWriteToWAL(false)
• 0.94 and 0.96 durability semantics:
  – USE_DEFAULT: use the global HBase default, or the table default (SYNC_WAL)
  – SKIP_WAL: do not write updates to the WAL
  – ASYNC_WAL: write entries to the WAL asynchronously (hbase.regionserver.optionallogflushinterval, 1 sec default)
  – SYNC_WAL: write entries to the WAL, flush to datanodes
  – FSYNC_WAL: write entries to the WAL, fsync in datanodes
Slide 16: Durability
• 0.94: durability setting per mutation (HBASE-7801) / per table (HBASE-8375)
• Allows intermixing different durability settings for updates to the same table
• Durability is chosen from the mutation, unless it is USE_DEFAULT, in which case the table's durability is used
• Limit the amount of time an edit can live in the memstore (HBASE-5930)
  – hbase.regionserver.optionalcacheflushinterval, default 1 hr
  – Important for SKIP_WAL
  – Causes a flush if there are unflushed edits older than optionalcacheflushinterval
Slide 17: Durability

public enum Durability {
  USE_DEFAULT,
  SKIP_WAL,
  ASYNC_WAL,
  SYNC_WAL,
  FSYNC_WAL
}

Per table:
HTableDescriptor htd = new HTableDescriptor("myTable");
htd.setDurability(Durability.ASYNC_WAL);
admin.createTable(htd);

Shell:
hbase(main):007:0> create 't12', 'f1', DURABILITY => 'ASYNC_WAL'

Per mutation:
Put put = new Put(rowKey);
put.setDurability(Durability.ASYNC_WAL);
table.put(put);
Slide 18: Durability (hflush / hsync)
• hflush(): flush the data packet down the datanode pipeline; wait for acks
• hsync(): flush the data packet down the pipeline and have the datanodes execute the equivalent of fsync; wait for acks
• hflush is currently the default; hsync() usage in HBase is not implemented (HBASE-5954). It is also not optimized (2x slower) and is Hadoop 2.0 only.
• hflush does not lose data unless all 3 replicas die without syncing to disk (e.g. a datacenter power failure)
• Ensure that the log is replicated 3 times: hbase.regionserver.hlog.tolerable.lowreplication defaults to the file system's default replication count (3 for HDFS)

public interface Syncable {
  public void hflush() throws IOException;
  public void hsync() throws IOException;
}
Slide 19: (diagram-only slide)
Slide 20: IO fencing
Fencing is the process of isolating a node of a computer cluster, or protecting shared resources, when a node appears to be malfunctioning.
Slide 21: IO fencing
(diagram: a dying Region Server A keeps accepting append+sync edits for Region1 from a client while its ZooKeeper session times out; the Master sees the znode deleted, assigns Region1 to Region Server B, and Region Server A aborts on a YouAreDeadException)
Slide 22: IO fencing
• Split brain
• Ensure that a region is hosted by only a single region server at any time
• If the master thinks that a region server no longer hosts the region, the RS should not be able to accept and sync() updates
• The master renames the region server's log directory on HDFS:
  – The current WAL cannot be rolled, and a new log file cannot be created
  – For each WAL, recoverLease() is called before replaying
  – recoverLease => lease recovery + block recovery
  – Ensures that the WAL is closed and all data is visible (file length)
• Guarantees for region data files:
  – Compactions => remove files + add files
  – Flushes => allowed, since the resulting data is idempotent
• HBASE-2231, HBASE-7878, HBASE-8449
Slide 23: Data locality
Short circuit reads, checksums, block placement
Slide 24: HDFS local reads (short circuit reads)
• Bypasses the datanode layer and goes directly to the OS files
• Hadoop 1.x implementation:
  – The DFSClient asks the local datanode for the local paths of a block
  – The datanode checks whether the user has permission
  – The client gets the path for the block and opens the file with FileInputStream

hdfs-site.xml:
  dfs.block.local-path-access.user = hbase
  dfs.datanode.data.dir.perm = 750

hbase-site.xml:
  dfs.client.read.shortcircuit = true

(diagram: HBase client -> RegionServer, whose DFSClient/BlockReader talks over RPC to the datanode and then reads the OS file system (ext3) and disks directly)
Slide 25: HDFS local reads (short circuit reads)
• Hadoop 2.0 implementation (HDFS-347):
  – Keeps the legacy implementation
  – Uses Unix domain sockets to pass the file descriptor (FD)
  – The datanode opens the block file and passes the FD to the BlockReaderLocal running in the region server process
  – More secure than the previous implementation
  – Windows also supports domain sockets; the native APIs need to be implemented
• Local buffer size: dfs.client.read.shortcircuit.buffer.size
  – BlockReaderLocal will fill this whole buffer every time HBase tries to read an HFile block
  – dfs.client.read.shortcircuit.buffer.size = 1 MB vs a 64 KB HFile block size
  – The SSR buffer is a direct buffer (in Hadoop 2, not in Hadoop 1)
  – Memory usage: # regions x # stores x # avg store files x # avg blocks per file x SSR buffer size
  – e.g. 10 regions x 2 x 4 x (1 GB / 64 MB) x 1 MB = 1.28 GB of non-heap memory
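The memory formula on this slide multiplies out as below. This is a back-of-the-envelope sketch (the class and method names are invented); it uses binary MB/GB arithmetic, so the slide's "1.28 GB" shows up here as 1280 MB:

```java
public class SsrBufferEstimate {
    // Non-heap memory for short-circuit read buffers, per the slide's formula:
    // regions x stores x avg store files x avg HDFS blocks per file x buffer size.
    static long estimateBytes(int regions, int storesPerRegion, int filesPerStore,
                              long fileSizeBytes, long hdfsBlockBytes,
                              long ssrBufferBytes) {
        long blocksPerFile = fileSizeBytes / hdfsBlockBytes;
        return (long) regions * storesPerRegion * filesPerStore
                * blocksPerFile * ssrBufferBytes;
    }

    public static void main(String[] args) {
        long bytes = estimateBytes(10, 2, 4,
                1L << 30,   // 1 GB store files
                64L << 20,  // 64 MB HDFS blocks
                1L << 20);  // 1 MB SSR buffer
        System.out.println(bytes >> 20); // 1280 (MB)
    }
}
```

Worth running with your own region and store-file counts before enabling a 1 MB SSR buffer, since this memory sits outside the JVM heap.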
Slide 26: Checksums
• HDFS checksums are not inlined
• Two files per block: one for data, one for checksums (HDFS-2699)
• A random positioned read causes 2 seeks
• HBase checksums come with 0.94 (HDP 1.2+). HBASE-5074.

blk_123456789        : data chunks (dfs.bytes-per-checksum, 512 bytes each)
.blk_123456789.meta  : checksum chunks (4 bytes each)
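With one 4-byte checksum per 512-byte data chunk, the size of the .meta sidecar file is easy to estimate. A hedged sketch (the helper name is invented, and the meta file's small fixed header is ignored):

```java
public class ChecksumMetaSize {
    // Approximate size of the .meta checksum file for one HDFS block:
    // one 4-byte CRC per dfs.bytes-per-checksum-sized data chunk.
    static long metaFileBytes(long blockBytes, int bytesPerChecksum) {
        long chunks = (blockBytes + bytesPerChecksum - 1) / bytesPerChecksum;
        return chunks * 4;
    }

    public static void main(String[] args) {
        // A full 64 MB block with the 512-byte default chunk size:
        System.out.println(metaFileBytes(64L << 20, 512)); // 524288 bytes (512 KB)
    }
}
```

The overhead is tiny (under 1%); the cost that motivates HBase-level checksums is the extra seek to the second file, not the bytes.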
Slide 27: Checksums
• HFile version 2.1 writes checksums per HFile block
• HDFS checksum verification is bypassed on block read; it is done by HBase instead
• If a checksum fails, we go back to reading checksums from HDFS for "some time"
• Due to a double-checksum bug (HDFS-3429) in remote reads in Hadoop 1, this is not enabled by default for now. Benchmark it yourself.

hbase.regionserver.checksum.verify = true
hbase.hstore.bytes.per.checksum = 16384
hbase.hstore.checksum.algorithm = CRC32C
Never set this: dfs.client.read.shortcircuit.skip.checksum = false

(diagram: an HFile made of HFile blocks, each with a block header, data chunks, and checksum chunks)
Slide 28: Default block placement policy
(diagram: the store-file blocks of Region A and Region B on a region server's datanode, with replicas spread across datanodes on Rack 1 / Server 1, Rack N / Server M, Rack L / Server K, and Rack X / Server Y)
Slide 29: Data locality for HBase
• Poor data locality when a region is moved:
  – As a result of load balancing
  – Region server crash + failover
• Most of the data won't be local unless the files are compacted
• Idea (from Facebook): regions have affiliated nodes (primary, secondary, tertiary), HBASE-4755
• When writing a data file, give hints to the NameNode that we want these locations for the block replicas (HDFS-2576)
• The load balancer should assign the region to one of the affiliated nodes on a server crash
  – Keeps data locality
  – SSR will still work
• Reduces data loss probability
Slide 30: Default block placement policy
(diagram: the same store-file block replicas, shown on the datanodes co-located with the region servers across the four racks/servers)
Slide 31: Other considerations
• HBase riding over NameNode HA
  – Both Hadoop 1 (NFS based) and Hadoop 2 HA (QJM, etc.)
  – Heavily tested with full-stack HA
• Retry HDFS operations
• Isolate FileSystem usage from HBase internals
• Hadoop 2 vs Hadoop 1 performance
  – Hadoop 2 is coming!
• HDFS snapshots vs HBase snapshots
  – HBase DOES NOT use HDFS snapshots
  – Needs hard links
  – Super flush API
• HBase security vs HDFS security
  – All files are owned by the HBase principal
  – No ACLs in HDFS; allowing a user to read HFiles / snapshots directly is hard
Slide 32: Open topics
• HDFS hard links
  – Rethink how we do snapshots, backups, etc.
• Parallel writes for the WAL
  – Reduce latency on WAL syncs
• SSD storage, cache
  – An SSD storage type in Hadoop or the local file system
  – Using SSDs as a secondary cache
  – Selectively placing tables / column families on SSD
• HDFS zero-copy reads (HDFS-3051, HADOOP-8148)
• HDFS inline checksums (HDFS-2699)
• HDFS quorum reads (HBASE-7509)
Slide 33: Thanks
Questions?
Enis Söztutar, enis [ at ] apache [dot] org, @enissoz