HBase is a distributed column-oriented database built on top of Hadoop that provides quick random access to large amounts of structured data. It uses a key-value structure to store table data by row key, column family, column, and timestamp. Tables consist of rows, column families, and columns, with a version dimension to store multiple values over time. HBase is well-suited for applications requiring real-time read/write access and is commonly used to store web crawler results or search indexes.
2. Introduction
● HBase is a distributed column-oriented database built on top of the Hadoop
file system.
● It is an open-source project and is horizontally scalable.
● HBase's data model is similar to Google's Bigtable and is designed to provide
quick random access to huge amounts of structured data.
● It is a part of the Hadoop ecosystem that provides random real-time read/write
access to data in the Hadoop File System.
4. HBase Architecture and Data Model
● An HBase table consists of rows and columns and has a third dimension,
version, to maintain the different values of a row and column intersection over
time.
● Example: a customer doing online shopping
● For this type of application, real-time access is required.
● Thus, batch processing with Pig, Hive, or Hadoop's MapReduce is not a
reasonable implementation approach.
● HBase stores the data and provides real-time read and write access.
5. HBase Architecture and Data Model (cont’d)
● HBase uses a key/value structure to store the contents of an HBase table
● (row key, column family, column, timestamp) -> value
● Each value is the data to be stored at the intersection of the row, column, and
version
● Each key consists of the following elements
○ Row length
○ Row (sometimes called the row key)
○ Column family length
○ Column family
○ Column qualifier
○ Version
○ Key type
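The key layout above can be sketched by packing the fields in order. This is a simplified illustration, not HBase's exact on-disk `KeyValue` encoding; the field widths (2-byte row length, 1-byte family length, 8-byte version, 1-byte key type) are assumptions for the sketch.

```python
import struct

def encode_key(row: bytes, family: bytes, qualifier: bytes,
               version: int, key_type: int) -> bytes:
    """Pack the key elements in the order listed above (simplified sketch;
    field widths are illustrative, not HBase's exact wire format)."""
    return (struct.pack(">H", len(row)) + row           # row length + row
            + struct.pack(">B", len(family)) + family   # family length + family
            + qualifier                                  # column qualifier
            + struct.pack(">Q", version)                 # version (timestamp)
            + struct.pack(">B", key_type))               # key type (e.g. Put)

key = encode_key(b"000700", b"cf1", b"cq1", 1393866138714, 4)
```

Because the family and qualifier names are embedded in every key, their lengths directly affect the size of each stored key/value pair.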
6. HBase Architecture and Data Model (cont’d)
● Table is a collection of rows.
● Row is a collection of column families.
● Column family is a collection of columns.
● Column is a collection of key value pairs.
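The containment hierarchy above can be modeled as nested maps. A minimal in-memory sketch (not an HBase client; the names here are illustrative):

```python
from collections import defaultdict

def make_table():
    # table[row][family][qualifier] -> {timestamp: value}
    return defaultdict(lambda: defaultdict(lambda: defaultdict(dict)))

table = make_table()
table["000700"]["cf1"]["cq1"][1393866138714] = "data1"
table["000700"]["cf2"]["cq3"][1393866138715] = "data3"
```

Each level mirrors one line of the hierarchy: the table holds rows, a row holds column families, a family holds columns, and a column holds timestamped values.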
7. HBase Architecture and Data Model (cont’d)
● Create HBase Table
$ hbase shell
hbase> create 'my_table', 'cf1', 'cf2',
{SPLITS => ['250000', '500000', '750000']}
● Verify my_table is stored in HBase's directory in HDFS
$ hadoop fs -ls -R /hbase
● Add data to the table
hbase> put 'my_table', '000700', 'cf1:cq1', 'data1'
hbase> put 'my_table', '000700', 'cf1:cq2', 'data2'
hbase> put 'my_table', '000700', 'cf2:cq3', 'data3'
8. HBase Architecture and Data Model (cont’d)
● Data retrieved from table
○ hbase> get 'my_table', '000700', 'cf2:cq3'
● Scan function
○ hbase> scan 'my_table', {STARTROW => '000600', STOPROW =>'000800'}
● Delete the oldest entry for column
○ hbase> delete 'my_table', '000700', 'cf2:cq3', 1393866138714
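The get/scan/delete semantics above can be sketched with a toy versioned store (all function names here are illustrative Python, not the HBase API; real clients use the shell or the Java API):

```python
store = {}  # (row, "family:qualifier") -> {timestamp: value}

def put(row, col, value, ts):
    store.setdefault((row, col), {})[ts] = value

def get(row, col):
    # Like the shell's get: return the newest version of the cell.
    versions = store[(row, col)]
    return versions[max(versions)]

def scan(start, stop):
    # Like STARTROW/STOPROW: row keys in [start, stop), in sorted order.
    return sorted({r for (r, _) in store if start <= r < stop})

def delete(row, col, ts):
    # Like the shell's delete with an explicit timestamp.
    del store[(row, col)][ts]

put("000700", "cf2:cq3", "data3-old", 1393866138714)
put("000700", "cf2:cq3", "data3", 1393866200000)
delete("000700", "cf2:cq3", 1393866138714)
```

Note that a get returns only the newest version by default, while a delete with an explicit timestamp removes exactly that version.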
9. Use Cases for HBase
● A common use case for a data store such as HBase is to store the results from
a web crawler
○ The row key com.cnn.www corresponds to a website URL, www.cnn.com
○ A column family, called anchor, is defined to capture the website URLs that provide links to the
row's website
○ The anchoring website URLs are used as the column qualifiers
○ Additional websites that provide links to www.cnn.com appear as additional column qualifiers.
○ The value stored in the cell is simply the text on the website that provides the link.
○ hbase> get 'web_table', 'com.cnn.www', {VERSIONS => 2}
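The row key com.cnn.www is the site's host name with its dotted components reversed, the Bigtable convention that makes pages from the same domain sort adjacently. A small sketch of that transformation:

```python
def domain_row_key(url_host: str) -> str:
    """Reverse the dotted host name so related sites sort adjacently
    (the convention behind row keys like 'com.cnn.www')."""
    return ".".join(reversed(url_host.split(".")))

key = domain_row_key("www.cnn.com")
```

Since HBase stores rows in sorted order by row key, this keeps all cnn.com subdomains in one contiguous range, which a single scan can cover.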
10. Use Cases for HBase (cont’d)
● This use case illustrates several important points
1. It is possible to get to a billion rows and millions of columns in an HBase
table.
2. The row key needs to be defined based on how the data will be accessed.
3. It may be advantageous to use the column qualifiers to actually store the
data of interest, rather than simply storing it in a cell.
● A second use case is the storage and search access of messages.
○ The row was defined to be the user ID.
○ The column qualifier was set to a word that appears in the message.
○ The version was the message ID.
○ The cell's content was the offset of the word in the message.
● This implementation allowed Facebook to provide auto-complete capability in
the search box and to return the results of the query quickly
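The message-search schema above maps (user, word, message ID) to a word offset. A toy sketch of that index shape (the function names and data here are illustrative, not Facebook's implementation):

```python
index = {}  # (user_id, word) -> {message_id: offset}

def index_message(user_id, message_id, text):
    for offset, word in enumerate(text.lower().split()):
        # row = user, column qualifier = word, version = message id,
        # cell = offset of the word within the message
        index.setdefault((user_id, word), {})[message_id] = offset

def search(user_id, word):
    # Return every message (and offset) in which the word appears.
    return index.get((user_id, word.lower()), {})

index_message("user42", "msg-001", "meeting moved to Friday")
index_message("user42", "msg-002", "Friday works for me")
```

A lookup by (user, word) is a single-row, single-column read, which is why the query returns quickly enough to drive auto-complete.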
11. Use Cases for HBase (cont’d)
● This illustrates the power of being able to add new columns on demand, simply
by adding new column qualifiers.
● In an RDBMS implementation, new columns require the involvement of a DBA to
alter the structure of the table.
12. Other HBase Usage Considerations
● Java API
○ The shell commands are useful for exploring the data in an HBase environment and illustrating
its use.
○ In a production environment, the HBase Java API would typically be used to program the desired
operations and the conditions in which to execute them.
● Column family and column qualifier names
○ Keep the name lengths of the column families and column qualifiers as short as possible,
because the column family name and the column qualifier are stored as part of the key of each
key/value pair.
○ Since three copies of each HDFS block are replicated across the Hadoop cluster, every extra
byte in these names triples in storage cost.
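Rough arithmetic makes the point concrete. The cell count and name lengths below are made-up illustration values, not figures from the text:

```python
cells = 1_000_000_000  # hypothetical number of key/value pairs
replication = 3        # default HDFS replication factor

def name_overhead_bytes(family: str, qualifier: str) -> int:
    # Family and qualifier names are repeated in every key/value pair,
    # and each byte is stored on 3 replicas.
    return (len(family) + len(qualifier)) * cells * replication

long_names = name_overhead_bytes("customer_details", "first_name")
short_names = name_overhead_bytes("cd", "fn")
saved_gb = (long_names - short_names) / 1e9
```

Under these assumptions, shortening the names saves on the order of tens of gigabytes of raw storage for a single billion-cell table.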
13. Other HBase Usage Considerations (cont’d)
● Defining rows
○ The definition of the row key is the main mechanism used to perform read/write operations on
an HBase table.
○ The row key needs to be constructed in such a way that the requested columns can be easily and
quickly retrieved.
● Avoid creating sequential row keys
○ If row keys are sequential (for example, an incrementing user ID), all the new users and their
data are written to just one region, which does not distribute the workload across the cluster
as intended.
○ A common remedy is to randomly assign a prefix to the sequential number.
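One way to sketch the prefixing idea: derive a small, stable prefix (a "salt") from the sequential ID so that consecutive IDs land in different regions. The bucket count and helper name below are illustrative assumptions:

```python
import hashlib

NUM_BUCKETS = 4  # illustrative; often matched to the number of region splits

def salted_row_key(seq_id: str) -> str:
    # A stable hash (rather than a random draw) lets readers recompute
    # the same prefix when looking the row back up.
    bucket = int(hashlib.md5(seq_id.encode()).hexdigest(), 16) % NUM_BUCKETS
    return f"{bucket}-{seq_id}"
```

The trade-off is that a range scan over the original sequence now requires one scan per bucket, since the salt destroys the global sort order.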
14. Other HBase Usage Considerations (cont’d)
● Versioning control
○ Control how long a version of a cell's contents will exist.
○ A Time to Live (TTL) can be set, after which any older versions are deleted.
○ A minimum and maximum number of versions to maintain can also be configured.
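The retention rules above can be sketched as a filter over a cell's version map: drop versions older than the TTL, then trim to the configured maximum, keeping the newest. (A simplified illustration using timestamps in seconds; HBase itself uses millisecond timestamps and applies these rules during compaction.)

```python
import time

def retained_versions(versions, ttl_seconds, max_versions, now=None):
    """versions: {timestamp: value}. Return the versions a cell keeps."""
    now = now if now is not None else time.time()
    live = {ts: v for ts, v in versions.items()
            if now - ts <= ttl_seconds}                  # TTL expiry
    newest = sorted(live, reverse=True)[:max_versions]   # version cap
    return {ts: live[ts] for ts in newest}

cell = {100: "v1", 200: "v2", 300: "v3", 900: "v4"}
kept = retained_versions(cell, ttl_seconds=800, max_versions=2, now=1000)
```

With these settings, version 100 expires under the TTL, and of the survivors only the two newest (900 and 300) are retained.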
● Zookeeper
○ HBase uses Apache ZooKeeper to coordinate and manage the various regions running on the
distributed cluster.
○ ZooKeeper is "a centralized service for maintaining configuration information, naming, providing
distributed synchronization, and providing group services."
○ Instead of building its own coordination service, HBase uses ZooKeeper.