Apache HBase has been widely adopted at many enterprises. In this talk we will cover a few war stories from troubleshooting, tuning, and fixing problems with HBase clusters. We will also cover some of the best practices, tools, utilities, and lessons learned from evaluating deployments at different organizations.
4. Intro to HBase
• Fault tolerant
• Horizontally scalable
• Real-time random read-write access to data stored in HDFS
• Millions of queries per second
• Support for transactions at a single-row level
• Bloom filters
• Automatic sharding
• Implemented in Java
5. Data Model
• Data is stored in tables
• Tables contain rows
– Rows are referenced by a unique key, the rowkey
• Rows are made of columns, which are grouped into column families
• Rows are sorted by rowkey
• Everything is stored as a sequence of bytes
• All entries are versioned and timestamped
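The data model above can be pictured as a sorted, nested map. The following is a minimal in-memory sketch, not HBase code: the class name `ToyTable` and its methods are hypothetical and exist only to illustrate byte-ordered rows and timestamped versions.

```python
from collections import defaultdict


class ToyTable:
    """In-memory toy model of the logical layout (not the real storage engine)."""

    def __init__(self):
        # row key -> (family, qualifier) -> {timestamp: value}
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row: bytes, family: bytes, qualifier: bytes,
            value: bytes, ts: int) -> None:
        """Store one version of a cell; everything is bytes except the timestamp."""
        self._rows[row][(family, qualifier)][ts] = value

    def get(self, row: bytes, family: bytes, qualifier: bytes,
            max_versions: int = 1):
        """Return up to max_versions (timestamp, value) pairs, newest first."""
        versions = self._rows.get(row, {}).get((family, qualifier), {})
        return sorted(versions.items(), reverse=True)[:max_versions]

    def scan(self):
        """Return row keys in byte-lexicographic order, as a scan would visit them."""
        return sorted(self._rows)


table = ToyTable()
table.put(b"row-2", b"cf", b"a", b"x", ts=1)
table.put(b"row-10", b"cf", b"a", b"v1", ts=1)
table.put(b"row-10", b"cf", b"a", b"v2", ts=2)
print(table.scan())                       # [b'row-10', b'row-2']
print(table.get(b"row-10", b"cf", b"a"))  # [(2, b'v2')]
```

Note that `row-10` sorts before `row-2` because keys are compared byte by byte, which is why numeric key parts are usually zero-padded.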
10. HBase API
• API is simple
• Operations
– Get, Put, Delete, Scan, MapReduce
• Connection
– Create this instance only once per application and share it during its runtime
• HTable
– ZooKeeper
• hbase:meta
11. Column Families
• All columns that are accessed together should be grouped into a column family
• No need to access or load data that is not used
• Settings such as the following can be defined at the column family level
– Compression, version retention policy, cache priority
– Understand the data and the access pattern, then group columns into families accordingly
• Column family names and column qualifiers are stored as bytes with every cell
– Avoid verbose names
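The advice to avoid verbose names can be made concrete with a little arithmetic. The sketch below assumes the classic uncompressed KeyValue layout (length fields, row, family, qualifier, timestamp, type) and ignores tags, compression, and data block encoding, so treat the numbers as rough estimates. The function name and the sample row, family, and qualifier values are hypothetical.

```python
def cell_key_overhead(row: bytes, family: bytes, qualifier: bytes) -> int:
    """Approximate bytes of key material stored alongside every cell value.

    Assumed layout: 4-byte key length, 4-byte value length, 2-byte row
    length, row, 1-byte family length, family, qualifier, 8-byte timestamp,
    1-byte key type.
    """
    fixed = 4 + 4 + 2 + 1 + 8 + 1  # 20 bytes of fixed-size fields
    return fixed + len(row) + len(family) + len(qualifier)


row = b"user-000001"
verbose = cell_key_overhead(row, b"customer_profile", b"email_address")
short = cell_key_overhead(row, b"p", b"em")
print(verbose, short)  # 60 34
```

With these sample names, every cell carries 26 extra bytes before compression, and that cost is repeated for each cell in the table.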
13. HBase Compactions
• HDFS does not support in-place updates
– HFiles are immutable
– New HFiles are created on each flush
• Minor compactions
– Small HFiles are merged into larger HFiles
– Deletes are not applied
• Major compactions
– HFiles within a column family are merged into a single HFile
– Deletes are applied
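The difference between the two compaction types can be sketched with plain lists. This is a simplified simulation, not the HBase implementation: each "HFile" is a list of hypothetical `(row, timestamp, value)` tuples, a value of `None` stands in for a delete marker, and real compactions also handle file selection, version limits, and TTLs.

```python
TOMBSTONE = None  # hypothetical delete marker used only in this sketch


def minor_compact(hfiles):
    """Merge several files into one sorted file, keeping every entry,
    including delete markers."""
    merged = [cell for hfile in hfiles for cell in hfile]
    # Sort by row, newest timestamp first, delete markers ahead of puts
    # that share the same timestamp.
    return sorted(merged, key=lambda c: (c[0], -c[1], c[2] is not TOMBSTONE))


def major_compact(hfiles):
    """Merge all files and apply deletes: drop each delete marker and every
    version of the row at or below the marker's timestamp."""
    kept = []
    deleted_at = {}
    for row, ts, value in minor_compact(hfiles):
        if value is TOMBSTONE:
            deleted_at[row] = max(ts, deleted_at.get(row, ts))
            continue
        if row in deleted_at and ts <= deleted_at[row]:
            continue
        kept.append((row, ts, value))
    return kept


older = [(b"a", 1, b"x"), (b"b", 1, b"y")]
newer = [(b"a", 2, TOMBSTONE), (b"b", 2, b"z")]
print(len(minor_compact([older, newer])))  # 4
print(major_compact([older, newer]))       # [(b'b', 2, b'z'), (b'b', 1, b'y')]
```

After the minor compaction the deleted row still occupies space; only the major compaction physically removes it.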
18. Pre-Splitting
• Region splitting
– A region grows until it needs to be split
– A region is served by only one region server at a time
• Pre-split a table into regions at table creation time
– Uniformly distribute write load across region servers
– Understand the keyspace
• Risk of uneven load distribution
• Auto splitting
– Constant size region split policy
– IncreasingToUpperBoundRegionSplitPolicy
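As an illustration of pre-splitting, the sketch below computes evenly spaced split keys for a keyspace of fixed-width hexadecimal row keys, similar in spirit to the HexStringSplit algorithm of the RegionSplitter utility. The function name and the default key width are assumptions made for this example, and the result spreads load evenly only if the real keys are themselves uniformly distributed (for example, hashed).

```python
def hex_split_keys(num_regions: int, key_width: int = 8) -> list:
    """Return num_regions - 1 evenly spaced split keys over a keyspace of
    fixed-width lowercase hex strings."""
    if num_regions < 2:
        return []  # a single region needs no split keys
    space = 16 ** key_width
    return [
        format(space * i // num_regions, f"0{key_width}x").encode("ascii")
        for i in range(1, num_regions)
    ]


print(hex_split_keys(4))  # [b'40000000', b'80000000', b'c0000000']
```

If the keys are sequential or skewed, the same split keys leave most regions idle, which is the uneven load risk noted above.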
19. Bulk Loading
• Native API
– Disable WAL
• MapReduce job to generate HFiles
– Load using the completebulkload / ImportTsv tool
• Loads into the relevant region
– Faster than going through the normal write path
• No writes to WAL and MemStore
• No flushing and compacting
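Bulk loading works because the generated files already match the table's region boundaries and are sorted by key. The following sketch shows only that partition-and-sort idea in plain Python; it does not write HFiles, and the function name and sample keys are hypothetical.

```python
import bisect
from collections import defaultdict


def partition_for_bulk_load(rows, split_keys):
    """Group (row_key, value) pairs by target region and sort each group.

    split_keys must be sorted. Region 0 holds keys below split_keys[0];
    region i holds keys from split_keys[i - 1] (inclusive) up to
    split_keys[i] (exclusive).
    """
    regions = defaultdict(list)
    for key, value in rows:
        regions[bisect.bisect_right(split_keys, key)].append((key, value))
    return {idx: sorted(cells) for idx, cells in sorted(regions.items())}


splits = [b"40000000", b"80000000", b"c0000000"]
rows = [(b"9abc0000", b"v1"), (b"00ff0000", b"v2"), (b"80000000", b"v3")]
print(partition_for_bulk_load(rows, splits))
# {0: [(b'00ff0000', b'v2')], 2: [(b'80000000', b'v3'), (b'9abc0000', b'v1')]}
```

Each sorted group corresponds to one file that can be handed to a single region, so no WAL write, MemStore insert, or flush is needed for that data.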
20. Troubleshooting
• ulimit -n
– Limits on the number of open files (ulimit -n) and processes (ulimit -u)
• HBase is a database and needs to open a large number of files
• dfs.datanode.max.transfer.threads
• Network
• OS parameters
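A quick way to inspect the limits mentioned above is to query them from the account that runs HBase. The sketch below uses Python's Unix-only `resource` module; the threshold of 10240 open files is an assumed starting point for this example, not a value taken from the slides, and the function name is hypothetical.

```python
import resource

RECOMMENDED_NOFILE = 10240  # assumed threshold; adjust for your cluster


def check_limits(min_nofile: int = RECOMMENDED_NOFILE) -> dict:
    """Report the open-file and process limits of the current process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    report = {
        "nofile_soft": soft,
        "nofile_hard": hard,
        # RLIM_INFINITY means "unlimited", which always passes the check.
        "nofile_ok": soft == resource.RLIM_INFINITY or soft >= min_nofile,
    }
    # RLIMIT_NPROC is not defined on every platform, so look it up defensively.
    nproc = getattr(resource, "RLIMIT_NPROC", None)
    if nproc is not None:
        report["nproc_soft"], report["nproc_hard"] = resource.getrlimit(nproc)
    return report


if __name__ == "__main__":
    for name, value in check_limits().items():
        print(f"{name}: {value}")
```

Run it as the same user that starts the RegionServer and DataNode processes, since limits are applied per user and per session.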
23. Tuning
• Heavy writes
– Flushes, compactions, and splits increase I/O and degrade cluster performance
– Keep region sizes larger
– Keep HFile size large
• Heavy sequential reads
– Higher block size
– Avoid block caching on the table
• Heavy random reads
– Larger block cache
– Lower MemStore limit
– Smaller block size
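The block size advice for random versus sequential reads comes down to simple arithmetic. The sketch below is a back-of-the-envelope model, assuming an uncached random get reads the whole block that contains the row and that each block needs one index entry; the function names and sample sizes are hypothetical, while 64 KB is the default HFile block size.

```python
def read_amplification(block_size: int, row_size: int) -> float:
    """Bytes fetched per byte requested for an uncached random get,
    assuming the entire block holding the row is read."""
    return max(block_size, row_size) / row_size


def index_entries(hfile_size: int, block_size: int) -> int:
    """Approximate number of block index entries for one HFile."""
    return -(-hfile_size // block_size)  # ceiling division


KB, GB = 1024, 1024 ** 3
print(read_amplification(64 * KB, 256))  # 256.0
print(read_amplification(16 * KB, 256))  # 64.0
print(index_entries(1 * GB, 64 * KB))    # 16384
print(index_entries(1 * GB, 256 * KB))   # 4096
```

Smaller blocks cut the wasted bytes per random get, while larger blocks shrink the index and suit scans that read blocks end to end anyway.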
24. Apache Phoenix
• SQL over HBase
– Compiles into HBase scans
– Orchestrates parallel execution
– Aggregate queries
• JDBC APIs over the native HBase API
• Salting buckets, pre-splitting
• Trafodion
– Transactional SQL on HBase
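The salting idea mentioned for Phoenix can be illustrated in a few lines: a salt byte derived from a hash of the row key is prepended so that sequential keys land in different buckets. This is a sketch of the concept only. CRC32 is a stand-in hash chosen for this example, not the function Phoenix uses, and the function name and sample keys are hypothetical.

```python
import zlib


def salted_key(row_key: bytes, buckets: int) -> bytes:
    """Prefix row_key with a one-byte salt in the range [0, buckets)."""
    if not 1 <= buckets <= 256:
        raise ValueError("buckets must fit in one byte (1 to 256)")
    salt = zlib.crc32(row_key) % buckets  # stand-in hash, see lead-in
    return bytes([salt]) + row_key


# Sequential keys that would otherwise hit a single region get spread out.
for i in range(3):
    key = f"event-{i:06d}".encode("ascii")
    print(salted_key(key, buckets=8))
```

The trade-off is that a range scan over the original key order must now fan out to every bucket and merge the results.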
25. Hannibal
• Monitor and maintain HBase clusters
• How well are regions balanced over the cluster?
• How well are regions split for each table?
• How do regions evolve over time?
• How long do compactions take?
• Integration with HUE