In this session you will learn:
•HBase Introduction
•Row & Column storage
•Characteristics of a huge DB
•What is HBase?
•HBase Data-Model
•HBase vs RDBMS
•HBase architecture
•HBase in operation
•Loading Data into HBase
•HBase shell commands
•HBase operations through Java
•HBase operations through MR
To know more, click here: https://www.mindsmapped.com/courses/big-data-hadoop/big-data-and-hadoop-training-for-beginners/
Classification: Restricted
What is HBase?
• Open source project built on top of Apache Hadoop
• NoSQL database
• Distributed, scalable store
• Column-family datastore
How do you choose between SQL and NoSQL?
• What does your data look like?
• Is your data model likely to change?
• Is your data growing exponentially?
• Will you be doing real-time analytics on operational data?
Inspiration for HBase
•Google’s Bigtable is the inspiration for HBase
•Like Bigtable, HBase is designed to run on a cluster of computers.
Characteristics of Bigtable:
•Data is ‘Sparse’
•Data is stored as a ‘Sorted Map’
•‘Distributed’
•‘Multi-dimensional’
•‘Consistent’
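The "sparse, sorted, multi-dimensional map" idea can be sketched in a few lines of plain Python. This is only a toy model: the names table, put and scan are hypothetical and are not HBase or Bigtable API calls, and a real store keeps rows sorted on disk rather than sorting at scan time.

```python
# Toy model of a Bigtable-style table: a sparse, sorted, multi-dimensional map.
# Coordinates: row key -> "family:qualifier" -> timestamp -> value.
# All names here are illustrative, not HBase API calls.

table = {}  # row key -> {column -> {timestamp -> value}}


def put(row, column, timestamp, value):
    """Store one value; only cells that are written take up space (sparse)."""
    table.setdefault(row, {}).setdefault(column, {})[timestamp] = value


def scan(start_row, stop_row):
    """Yield (row, columns) in row-key order for start_row <= row < stop_row."""
    for row in sorted(table):
        if start_row <= row < stop_row:
            yield row, table[row]


put("user#002", "info:email", 1, "bob@example.com")
put("user#001", "info:name", 1, "Alice")
put("user#001", "info:name", 2, "Alice B.")  # a second version of the same cell
put("user#003", "stats:logins", 1, "17")

rows = [row for row, _ in scan("user#001", "user#003")]
print(rows)  # ['user#001', 'user#002']
```

Rows come back in key order regardless of insertion order, and user#001 simply has no info:email entry: nothing is stored for a cell that was never written.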
HBase vs RDBMS
HBase | RDBMS
Data that is accessed together is stored together | Data is normalized
Column-oriented | Row-oriented (mostly)
Flexible schema, can add columns on the fly | Fixed schema
Good with sparse tables | Not optimized for sparse tables
No joins | Optimized for joins
Horizontal scalability | Hard to shard and scale
Good for structured and semi-structured data | Good for structured data
Row-based transactions | Distributed transactions
Row & Column Storage
•Column-oriented store: many queries (typically in analytical databases) need only some of a table’s columns, so the values of each column are stored together
•Advantages of column-oriented storage:
  •Reduced I/O, because columns a query does not reference are never read
  •Values within a column tend to be similar, so they compress better
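A small sketch of why a column layout reduces I/O for a single-column query. The data and variable names are made up for illustration, and the layout is deliberately simplified (HBase itself groups data by column family rather than by individual column).

```python
# The same three-column table laid out two ways. All data is made up.
rows = [
    ("r1", "Pune", 31),
    ("r2", "Pune", 45),
    ("r3", "Delhi", 28),
]

# Row-oriented layout: each record's values sit next to each other.
row_store = [value for record in rows for value in record]

# Column-oriented layout: each column's values sit next to each other.
column_store = {
    "id": [r[0] for r in rows],
    "city": [r[1] for r in rows],
    "age": [r[2] for r in rows],
}

# Query: collect all ages.
# Row layout: the whole store is walked to pick out every third value.
values_read_row_layout = len(row_store)
ages_from_rows = row_store[2::3]

# Column layout: only the "age" column is touched.
ages_from_columns = column_store["age"]
values_read_column_layout = len(ages_from_columns)

print(values_read_row_layout, values_read_column_layout)  # 9 3
```

Both layouts return the same ages, but the column layout reads a third of the values. The city column also shows the compression point: neighbouring values repeat.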
HBase Data Model
Component | Description
Table | Data is organized into tables; a table comprises rows
Row key | Data is stored in rows; rows are identified by row keys (the primary key); rows are sorted by this value
Column family | Columns are grouped into families
Column qualifier | Identifies the column within its family
Cell | Combination of the row key, column family, column qualifier, and timestamp; contains the value
Version | Values within a cell are versioned by a version number (by default, the timestamp)
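The cell coordinates in the table above can be sketched as a dictionary keyed by (row key, column family, column qualifier), with each entry holding timestamped versions. The functions put and get below are hypothetical helpers written for this illustration, not the HBase client API.

```python
# Toy model of HBase cell addressing and versioning (not the HBase client API).
from collections import defaultdict

# (row key, column family, column qualifier) -> {timestamp: value}
cells = defaultdict(dict)


def put(row, family, qualifier, value, timestamp):
    """Write one version of a cell."""
    cells[(row, family, qualifier)][timestamp] = value


def get(row, family, qualifier, max_versions=1):
    """Return up to max_versions (timestamp, value) pairs, newest first."""
    versions = cells.get((row, family, qualifier), {})
    newest_first = sorted(versions.items(), reverse=True)
    return newest_first[:max_versions]


put("row1", "cf", "city", "Pune", timestamp=100)
put("row1", "cf", "city", "Mumbai", timestamp=200)

latest = get("row1", "cf", "city")
history = get("row1", "cf", "city", max_versions=2)
print(latest)   # [(200, 'Mumbai')]
print(history)  # [(200, 'Mumbai'), (100, 'Pune')]
```

A read that does not ask for more versions sees only the newest value; older versions remain addressable by timestamp.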
HBase Data Model
• Regions: horizontal partitions of an HBase table.
• A region is denoted by the table it belongs to, its first row (inclusive), and its last row (exclusive).
• Regions are the units that get distributed over an entire cluster.
• Initially, a table comprises a single region. As the region grows, it eventually crosses a configurable size threshold, at which point it splits at a row boundary into two new regions of approximately equal size.
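The split behaviour described above can be sketched as follows. This is a simplified model under stated assumptions: the Region class, the insert helper and MAX_ROWS_PER_REGION are hypothetical, and the threshold counts rows only to keep the example small, whereas a real cluster measures region size in bytes.

```python
# Toy model of a region splitting once it crosses a size threshold.
# MAX_ROWS_PER_REGION stands in for the configurable size threshold.
MAX_ROWS_PER_REGION = 4


class Region:
    def __init__(self, table, start_row, end_row, rows=None):
        self.table = table
        self.start_row = start_row  # inclusive; "" means start of table
        self.end_row = end_row      # exclusive; "" means end of table
        self.rows = sorted(rows or [])

    def contains(self, row):
        return self.start_row <= row and (self.end_row == "" or row < self.end_row)

    def split(self):
        """Split at the middle row key into two daughter regions."""
        mid = self.rows[len(self.rows) // 2]
        left = Region(self.table, self.start_row, mid,
                      [r for r in self.rows if r < mid])
        right = Region(self.table, mid, self.end_row,
                       [r for r in self.rows if r >= mid])
        return left, right


def insert(regions, row):
    """Add a row key to the region covering it; split that region if too big."""
    for i, region in enumerate(regions):
        if region.contains(row):
            region.rows.append(row)
            region.rows.sort()
            if len(region.rows) > MAX_ROWS_PER_REGION:
                regions[i:i + 1] = region.split()
            return


regions = [Region("TestTable", "", "")]  # a new table starts as one region
for key in ["a", "b", "c", "d", "e"]:
    insert(regions, key)

layout = [(r.start_row, r.end_row, r.rows) for r in regions]
print(layout)  # [('', 'c', ['a', 'b']), ('c', '', ['c', 'd', 'e'])]
```

The fifth insert pushes the single region over the threshold, so it splits at row key "c": the first daughter covers everything before "c" and the second covers "c" onwards.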
HBase Architecture
• HBase Master: the master node
  • bootstraps a fresh install
  • assigns regions to registered regionservers
  • recovers from regionserver failures
• Regionservers: the slave nodes
  • carry zero or more regions
  • take client read/write requests
  • manage region splits, informing the master about the new daughter regions
HBase Architecture
• ZooKeeper: the authority on cluster state.
• The ZooKeeper ensemble holds the location of the hbase:meta catalog table and the address of the current cluster master.
• Assignment of regions is mediated via ZooKeeper, in case participating servers crash mid-assignment.
• An HBase client must know the location of the ZooKeeper ensemble.
• Thereafter, the client navigates the ZooKeeper hierarchy to learn cluster attributes such as server locations.
HBase in Operation
• hbase:meta: holds the list, state, and locations of all regions on the cluster.
• Entries in hbase:meta are keyed by region name.
• Region name: the table name of the region, the region’s start row, its time of creation, and an MD5 hash of all of these.
  • E.g.: TestTable,xyz,1279729913622.1b6e176fb8d8aa88fd4ab6bc80247ece.
• As row keys are sorted, finding the region that hosts a particular key is easy.
• Whenever a region changes state (it is split, enabled, disabled, deleted, and so on), the catalog table is updated.
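Because region start rows are sorted, locating the region for a row key reduces to a binary search. The sketch below assumes a simplified catalog of just start rows and server names; the real hbase:meta is keyed by full region names, and the server names and the locate function here are made up for illustration.

```python
import bisect

# Toy stand-in for hbase:meta: sorted region start rows and the server hosting
# each region. "" is the start row of the table's first region.
region_start_rows = ["", "g", "n", "t"]
region_servers = [
    "rs1.example.com",
    "rs2.example.com",
    "rs3.example.com",
    "rs1.example.com",
]


def locate(row_key):
    """Return (start_row, server) of the region whose range covers row_key."""
    # The covering region is the last one whose start row is <= row_key.
    index = bisect.bisect_right(region_start_rows, row_key) - 1
    return region_start_rows[index], region_servers[index]


print(locate("apple"))  # ('', 'rs1.example.com')
print(locate("n"))      # ('n', 'rs3.example.com')
print(locate("zebra"))  # ('t', 'rs1.example.com')
```

A key equal to a start row ("n") belongs to the region that starts there, which matches the inclusive-start, exclusive-end convention for region boundaries.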
HBase in Operation
• Fresh clients connect to the ZooKeeper ensemble first to learn the location of hbase:meta, then consult hbase:meta to find which user-space regions host the rows they need and where those regions are served.
• Then, clients interact directly with the hosting regionservers.
• Clients cache what they learn from these lookups, which works fine until there is a fault (for example, a region has moved).
• When a fault happens, clients consult hbase:meta again. If hbase:meta itself has moved, clients go back to ZooKeeper.
• Writes arriving at a regionserver are first appended to a commit log and then added to an in-memory memstore. When a memstore fills, its content is flushed to the filesystem.
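The write path in the last bullet (commit log, then memstore, then flush) can be sketched as below. This is a minimal model, not regionserver code: MEMSTORE_LIMIT and the write function are hypothetical, the limit counts entries rather than bytes, and plain Python lists stand in for the log and the flushed files.

```python
# Toy model of the regionserver write path: commit log first, then memstore,
# with a flush to an immutable sorted file when the memstore fills.
MEMSTORE_LIMIT = 3  # hypothetical; a real limit is expressed in bytes

commit_log = []   # append-only record of every edit
memstore = {}     # in-memory buffer: row key -> value
flush_files = []  # each flush produces one sorted, immutable "file"


def write(row, value):
    commit_log.append((row, value))      # 1. log the edit for durability
    memstore[row] = value                # 2. buffer it in memory
    if len(memstore) >= MEMSTORE_LIMIT:  # 3. memstore full: flush it
        flush_files.append(sorted(memstore.items()))
        memstore.clear()


for i, row in enumerate(["r3", "r1", "r2", "r4"]):
    write(row, f"v{i}")

print(len(commit_log))  # 4
print(flush_files)      # [[('r1', 'v1'), ('r2', 'v2'), ('r3', 'v0')]]
print(memstore)         # {'r4': 'v3'}
```

Every edit reaches the log before the memstore, so edits still held only in memory could be replayed from the log after a crash. The flushed file is sorted by row key even though the writes arrived out of order.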
HBase in Operation
• When reading, the region’s memstore is consulted first. If sufficient versions are found in the memstore alone, the query completes there. Otherwise, flush files are consulted in order, from newest to oldest, until either enough versions are found to satisfy the query or the flush files run out.
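The read path can be sketched in the same toy style. The read function and its data shapes are assumptions made for this illustration: the memstore and each flush file are modelled as dictionaries mapping a row key to a list of values, newest first.

```python
# Toy model of the read path: memstore first, then flush files from newest to
# oldest, stopping as soon as enough versions have been collected.

def read(row, memstore, flush_files, versions=1):
    """Return (values, files_consulted) for row.

    memstore and each flush file map row -> list of values, newest first.
    flush_files is ordered from oldest to newest.
    """
    found = list(memstore.get(row, []))[:versions]
    files_consulted = 0
    for flush_file in reversed(flush_files):  # newest file first
        if len(found) >= versions:
            break                             # query already satisfied
        files_consulted += 1
        found.extend(flush_file.get(row, []))
        found = found[:versions]
    return found, files_consulted


memstore = {"r1": ["v3"]}
flush_files = [{"r1": ["v1"]}, {"r1": ["v2"], "r2": ["x1"]}]  # oldest, newest

print(read("r1", memstore, flush_files, versions=1))  # (['v3'], 0)
print(read("r1", memstore, flush_files, versions=2))  # (['v3', 'v2'], 1)
print(read("r2", memstore, flush_files, versions=1))  # (['x1'], 1)
```

Asking for one version of r1 is answered from the memstore without touching any file; asking for two versions needs only the newest flush file; a row that exists nowhere would exhaust all the files and return nothing.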
HBase Use Cases
•Capturing incremental data (time-series data): high-volume, high-velocity writes
  •e.g. sensor readings, system metrics, events, stock prices, server logs, rainfall data
•Information exchange: high-volume, high-velocity writes and reads
  •e.g. email, chat
•Content serving, web application backend: high-volume, high-velocity reads
  •e.g. eBay, Groupon