This document summarizes a presentation about HBase storage internals and future developments. It discusses how HBase provides random read/write access on HDFS using tables, regions, and region servers. It describes the write path involving the client, master, and region servers as well as the read path. It also covers topics like snapshots, compactions, and future plans to improve encryption, security, write-ahead logs, and compaction policies.
This document discusses tuning HBase and HDFS for performance and correctness. Some key recommendations include:
- Enable HDFS sync on close and sync behind writes for correctness on power failures.
- Tune HBase compaction settings like blockingStoreFiles and compactionThreshold based on whether the workload is read-heavy or write-heavy.
- Size RegionServer machines based on disk size, heap size, and number of cores to optimize for the workload.
- Set client and server RPC chunk sizes like hbase.client.write.buffer to 2MB to maximize network throughput.
- Configure various garbage collection settings in HBase like -Xmn512m and -XX:+UseCMSInitiatingOccupancyOnly.
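To make the RPC and compaction settings concrete, here is a small hedged sketch using the HBase client Configuration API (the property names are real HBase settings; the values are illustrative and would normally live in hbase-site.xml, not in code):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class TuningSketch {
    public static void main(String[] args) {
        Configuration conf = HBaseConfiguration.create();
        // Batch client puts into ~2MB RPCs to maximize network throughput.
        conf.setLong("hbase.client.write.buffer", 2L * 1024 * 1024);
        // Compaction knobs to tune for read-heavy vs write-heavy
        // workloads (illustrative values).
        conf.setInt("hbase.hstore.blockingStoreFiles", 10);
        conf.setInt("hbase.hstore.compactionThreshold", 3);
    }
}
```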
This document discusses techniques for improving latency in HBase. It analyzes the write and read paths, identifying sources of latency such as networking, HDFS flushes, garbage collection, and machine failures. For writes, it finds that single puts can achieve millisecond latency while streaming puts can hide latency spikes. For reads, it notes cache hits are sub-millisecond while cache misses and seeks add latency. GC pauses of 25-100ms are common, and failures hurt locality and require cache rebuilding. The document outlines ongoing work to reduce GC, use off-heap memory, improve compactions and caching to further optimize for low latency.
This talk explores the many ways a user can put HBase to work in a project. Lars will look at practical examples based on real applications in production, for example at Facebook and eBay, and the right approach for those looking to build their own implementation. He will also discuss advanced concepts such as counters, coprocessors, and schema design.
Apache Tez - A New Chapter in Hadoop Data Processing (DataWorks Summit)
Apache Tez is a framework for accelerating Hadoop query processing. It is based on expressing a computation as a dataflow graph and executing it in a highly customizable way. Tez is built on top of YARN and provides benefits like better performance, predictability, and utilization of cluster resources compared to traditional MapReduce. It allows applications to focus on business logic rather than Hadoop internals.
Supporting Apache HBase: Troubleshooting and Supportability Improvements (DataWorks Summit)
This document discusses supporting Apache HBase and improving troubleshooting and supportability. It introduces two Cloudera employees who work on HBase support and provides an overview of typical troubleshooting scenarios for HBase like performance degradation, process crashes, and inconsistencies. The agenda covers using existing tools like logs and metrics to troubleshoot HBase performance issues with a general approach, and introduces htop as a real-time monitoring tool for HBase.
Kudu is popularly referred to as "Fast Analytics on Fast Data," capable of performing both OLAP and OLTP operations. This covers everything from the essentials to a deep dive into Kudu internals and architecture, for building applications based on Kudu and integrating with the Hadoop ecosystem.
Read about Kudu clusters, architecture, operations, primary key design and column optimizations, partitioning and other performance considerations.
HBase Accelerated introduces an in-memory flush and compaction pipeline for HBase to improve performance of real-time workloads. By keeping data in memory longer and avoiding frequent disk flushes and compactions, it reduces I/O and improves read and scan latencies. Evaluation on workloads with high update rates and small working sets showed the new approach significantly outperformed the default HBase implementation by serving most data from memory. Work is ongoing to further optimize the in-memory representation and memory usage.
The document summarizes Apache Phoenix and its past, present, and future as a SQL interface for HBase. It describes Phoenix's architecture and key features like secondary indexes, joins, aggregations, and transactions. Recent releases added functional indexes, the Phoenix Query Server, and initial transaction support. Future plans include improvements to local indexes, integration with Calcite and Hive, and adding JSON and other SQL features. The document aims to provide an overview of Phoenix's capabilities and roadmap for building a full-featured SQL layer over HBase.
This document summarizes a presentation about optimizing HBase performance through caching. It discusses how baseline tests showed low cache hit rates and CPU/memory utilization. Reducing the table block size improved cache hits but increased overhead. Adding an off-heap bucket cache to store table data minimized JVM garbage collection latency spikes and improved memory utilization by caching frequently accessed data outside the Java heap. Configuration parameters for the bucket cache are also outlined.
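As a hedged sketch of what such bucket cache parameters look like (the property names are real HBase settings, but the values here are illustrative and normally belong in the RegionServer's hbase-site.xml):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class BucketCacheConfig {
    public static void main(String[] args) {
        Configuration conf = HBaseConfiguration.create();
        // Keep cached data blocks off the Java heap, out of the GC's reach.
        conf.set("hbase.bucketcache.ioengine", "offheap");
        // Off-heap cache size in MB (illustrative value).
        conf.setInt("hbase.bucketcache.size", 4096);
        // The JVM also needs matching direct memory, e.g.
        // -XX:MaxDirectMemorySize=5g on the RegionServer.
    }
}
```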
Practical advice on how to achieve persistence in Redis: a detailed overview of the pros and cons of RDB snapshots and AOF logging, plus tips and tricks for proper persistence configuration with Redis pools and master/slave replication.
Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update... (Databricks)
The convergence of big data technology toward the traditional database domain has become an industry trend. At present, open source big data processing engines such as Apache Spark, Apache Hadoop, and Apache Flink already support SQL interfaces, and SQL usage basically occupies a dominant position. Companies use the above open source software to build their own ETL frameworks and OLAP technology. However, OLTP remains a strong point of traditional databases, one main reason being their support for ACID.
The document discusses Apache Tez, a framework for building data processing applications on Hadoop. It provides an introduction to Tez and describes key features like expressing computations as directed acyclic graphs (DAGs), container reuse, dynamic parallelism, integration with YARN timeline service, and recovery from failures. The document also outlines improvements to Tez around performance, debuggability, and status/roadmap.
ClickHouse Data Warehouse 101: The First Billion Rows, by Alexander Zaitsev a... (Altinity Ltd)
Columnar stores like ClickHouse enable users to pull insights from big data in seconds, but only if you set things up correctly. This talk will walk through how to implement a data warehouse that contains 1.3 billion rows using the famous NY Yellow Cab ride data. We'll start with basic data implementation including clustering and table definitions, then show how to load efficiently. Next, we'll discuss important features like dictionaries and materialized views, and how they improve query efficiency. We'll end by demonstrating typical queries to illustrate the kind of inferences you can draw rapidly from a well-designed data warehouse. It should be enough to get you started--the next billion rows is up to you!
Sparklens: Understanding the Scalability Limits of Spark Applications with R... (Databricks)
One of the common requests we receive from customers (at Qubole) is debugging slow Spark applications. Usually this is done with trial and error, which takes time and requires running clusters beyond normal usage (read: wasted resources). Moreover, it doesn't tell us where to look for further improvements. We at Qubole are looking into making this process more self-serve. Toward this goal we have built a tool (OSS, https://github.com/qubole/sparklens) based on the Spark event listener framework.
From a single run of the application, Sparklens provides insights about the scalability limits of a given Spark application. In this talk we will cover what Sparklens does and the theory behind it. We will talk about how the structure of a Spark application puts important constraints on its scalability, how we can find these structural constraints, and how to use them as a guide in solving performance and scalability problems of Spark applications.
This talk will help the audience answer the following questions about their Spark applications: 1) Will the application run faster with more executors? 2) How will cluster utilization change as the number of executors changes? 3) What is the absolute minimum time this application will take even if we give it infinite executors? 4) What is the expected wall clock time for the application once we fix its most important structural limits? Sparklens makes the ROI of additional executors extremely obvious for a given application, and needs just a single run to determine how the application will behave with different executor counts. Specifically, it will help managers take the correct side of the tradeoff between spending developer time optimizing applications and spending money on compute bills.
Tez is the next generation Hadoop Query Processing framework written on top of YARN. Computation topologies in higher level languages like Pig/Hive can be naturally expressed in the new graph dataflow model exposed by Tez. Multi-stage queries can be expressed as a single Tez job, resulting in lower latency for short queries and improved throughput for large scale queries. MapReduce has been the workhorse for Hadoop, but its monolithic structure has made innovation slower. YARN separates resource management from application logic and thus enables the creation of Tez, a more flexible and generic framework for data processing for the benefit of the entire Hadoop query ecosystem.
This document provides an overview of Hive and its performance capabilities. It discusses Hive's SQL interface for querying large datasets stored in Hadoop, its architecture which compiles SQL queries into MapReduce jobs, and its support for SQL semantics and datatypes. The document also covers techniques for optimizing Hive performance, including data abstractions like partitions, buckets and skews. It describes different join strategies in Hive like shuffle joins, broadcast joins and sort-merge bucket joins and how they are implemented in MapReduce. The overall presentation aims to explain how Hive provides scalable SQL processing for big data.
Meta/Facebook's database serving social workloads runs on top of MyRocks (MySQL on RocksDB). This means our performance and reliability depend a lot on RocksDB. Beyond MyRocks, we also have other important systems running on top of RocksDB. We have learned many lessons from operating and debugging RocksDB at scale.
In this session, we will offer an overview of RocksDB, key differences from InnoDB, and share a few interesting lessons learned from production.
The document discusses average active sessions (AAS) as a single metric for measuring database performance and load. It provides methods for calculating AAS by sampling active session history (ASH) data or using time statistics, and compares the AAS value to metrics like CPU count to understand whether the database is under- or over-utilized.
It also describes how the components of AAS like CPU usage and wait times can provide more insight, and how tools like the Oracle Enterprise Manager (OEM) can show AAS over time as well as its subcomponents to help identify performance bottlenecks.
HBase HUG Presentation: Avoiding Full GCs with MemStore-Local Allocation Buffers (Cloudera, Inc.)
Todd Lipcon presents a solution to avoid full garbage collections (GCs) in HBase by using MemStore-Local Allocation Buffers (MSLABs). The document outlines that write operations in HBase can cause fragmentation in the old generation heap, leading to long GC pauses. MSLABs address this by allocating each MemStore's data into contiguous 2MB chunks, eliminating fragmentation. When MemStores flush, the freed chunks are large and contiguous. With MSLABs enabled, the author saw basically zero full GCs during load testing. MSLABs improve performance and stability by preventing GC pauses caused by fragmentation.
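To make the idea concrete, here is a toy Java sketch of MSLAB-style allocation (an illustration of the technique, not HBase's actual implementation): cell bytes are copied into private 2MB chunks so that a MemStore flush frees whole contiguous chunks at once instead of scattering small objects across the old generation.

```java
import java.util.ArrayList;
import java.util.List;

public class MslabSketch {
    private static final int CHUNK_SIZE = 2 * 1024 * 1024; // 2MB chunks, as in MSLAB
    private final List<byte[]> chunks = new ArrayList<>();
    private byte[] current;
    private int offset;

    public MslabSketch() {
        current = new byte[CHUNK_SIZE];
        chunks.add(current);
    }

    /** Copies one cell's bytes into the current chunk; returns its offset. */
    public synchronized int copyCell(byte[] cell) {
        // (the real MSLAB hands oversized cells to the normal allocator)
        if (cell.length > CHUNK_SIZE - offset) {
            current = new byte[CHUNK_SIZE]; // retire the chunk, start a fresh one
            chunks.add(current);
            offset = 0;
        }
        System.arraycopy(cell, 0, current, offset, cell.length);
        int start = offset;
        offset += cell.length;
        return start;
    }
}
```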
Espresso: LinkedIn's Distributed Data Serving Platform (Paper), Amy W. Tang
This paper, written by the LinkedIn Espresso Team, appeared at the ACM SIGMOD/PODS Conference (June 2013). To see the talk given by Swaroop Jagadish (Staff Software Engineer @ LinkedIn), go here:
http://www.slideshare.net/amywtang/li-espresso-sigmodtalk
How Netflix Tunes EC2 Instances for Performance (Brendan Gregg)
CMP325 talk for AWS re:Invent 2017, by Brendan Gregg. "
At Netflix we make the best use of AWS EC2 instance types and features to create a high performance cloud, achieving near bare metal speed for our workloads. This session will summarize the configuration, tuning, and activities for delivering the fastest possible EC2 instances, and will help other EC2 users improve performance, reduce latency outliers, and make better use of EC2 features. We'll show how we choose EC2 instance types, how we choose between EC2 Xen modes: HVM, PV, and PVHVM, and the importance of EC2 features such as SR-IOV for bare-metal performance. SR-IOV is used by EC2 enhanced networking, and recently for the new i3 instance type for enhanced disk performance as well. We'll also cover kernel tuning and observability tools, from basic to advanced. Advanced performance analysis includes the use of Java and Node.js flame graphs, and the new EC2 Performance Monitoring Counter (PMC) feature released this year."
RedisConf17 - Using Redis at Scale @ Twitter (Redis Labs)
The document discusses Nighthawk, Twitter's distributed caching system which uses Redis. It provides caching services at a massive scale of over 10 million queries per second and 10 terabytes of data across 3000 Redis nodes. The key aspects of Nighthawk's architecture that allow it to scale are its use of a client-oblivious proxy layer and cluster manager that can independently scale and rebalance partitions across Redis nodes. It also employs replication between data centers to provide high availability even in the event of node failures. Some challenges discussed are handling "hot keys" that get an unusually high volume of requests and more efficiently warming up replicas when nodes fail.
The document provides an overview of the InnoDB storage engine used in MySQL. It discusses InnoDB's architecture including the buffer pool, log files, and indexing structure using B-trees. The buffer pool acts as an in-memory cache for table data and indexes. Log files are used to support ACID transactions and enable crash recovery. InnoDB uses B-trees to store both data and indexes, with rows of variable length stored within pages.
Enabling the Active Data Warehouse with Apache Kudu (Grant Henke)
Apache Kudu is an open source data storage engine that makes fast analytics on fast and changing data easy. In this presentation, Grant Henke from Cloudera will provide an overview of what Kudu is, how it works, and how it makes building an active data warehouse for real time analytics easy. Drawing on experiences from some of our largest deployments, this talk will also include an overview of common Kudu use cases and patterns. Additionally, some of the newest Kudu features and what is coming next will be covered.
MySQL Sharding: Tools and Best Practices for Horizontal Scaling (Mats Kindahl)
This presentation provides an introduction to what you need to consider when implementing a sharding solution, and introduces MySQL Fabric as a tool to help you easily set up a sharded database.
What's the time? ...and why? (Mattias Sax, Confluent) - Kafka Summit SF 2019 (confluent)
Data stream processing is built on the core concept of time. However, understanding time semantics and reasoning about time is not simple, especially if deterministic processing is expected. In this talk, we explain the difference between processing, ingestion, and event time and what their impact is on data stream processing. Furthermore, we explain how Kafka clusters and stream processing applications must be configured to achieve specific time semantics. Finally, we deep dive into the time semantics of the Kafka Streams DSL and KSQL operators, and explain in detail how the runtime handles time. Apache Kafka offers many ways to handle time on the storage layer, i.e., the brokers, allowing users to build applications with different semantics. Time semantics in the processing layer, i.e., Kafka Streams and KSQL, are even richer, more powerful, but also more complicated. Hence, it is paramount for developers to understand different time semantics and to know how to configure Kafka to achieve them. Therefore, this talk enables developers to design applications with their desired time semantics, helps them reason about the runtime behavior with regard to time, and allows them to understand processing/query results.
Speaker: Jesse Anderson (Cloudera)
As optional pre-conference prep for attendees who are new to HBase, this session will offer a brief Cliff's Notes-level overview of architecture, API, and schema design. The architecture section will cover the daemons and their functions; the API section will cover HBase's GET, PUT, and SCAN classes; and the schema design section will cover how HBase differs from an RDBMS and how much effort to place on schema and row-key design.
The document discusses designing robust data architectures for decision making. It advocates for building architectures that can easily add new data sources, improve and expand analytics, standardize metadata and storage for easy data access, discover and recover from mistakes. The key aspects discussed are using Kafka as a data bus to decouple pipelines, retaining all data for recovery and experimentation, treating the filesystem as a database by storing intermediate data, leveraging Spark and Spark Streaming for batch and stream processing, and maintaining schemas for integration and evolution of the system.
HBase is a distributed, column-oriented database that stores data in tables divided into rows and columns. It is optimized for random, real-time read/write access to big data. The document discusses HBase's key concepts like tables, regions, and column families. It also covers performance tuning aspects like cluster configuration, compaction strategies, and intelligent key design to spread load evenly. Different use cases are suitable for HBase depending on access patterns, such as time series data, messages, or serving random lookups and short scans from large datasets. Proper data modeling and tuning are necessary to maximize HBase's performance.
Jesse Anderson (Smoking Hand)
This early-morning session offers an overview of what HBase is, how it works, its API, and considerations for using HBase as part of a Big Data solution. It will be helpful for people who are new to HBase, and also serve as a refresher for those who may need one.
Jingwei Lu and Jason Zhang (Airbnb)
AirStream is a realtime stream computation framework built on top of Spark Streaming and HBase that allows our engineers and data scientists to easily leverage HBase to get real-time insights and build real-time feedback loops. In this talk, we will introduce AirStream, and then go over a few production use cases.
This document introduces HBase, an open-source, non-relational, distributed database modeled after Google's BigTable. It describes what HBase is, how it can be used, and when it is applicable. Key points include that HBase stores data in columns and rows accessed by row keys, integrates with Hadoop for MapReduce jobs, and is well-suited for large datasets, fast random access, and write-heavy applications. Common use cases involve log analytics, real-time analytics, and messages-centered systems.
Apache HBase is Hadoop's open source, distributed, versioned storage manager, well suited for random, realtime read/write access. This talk will give an overview of how HBase achieves random I/O, focusing on the storage layer internals: starting from how the client interacts with Region Servers and the Master, then going into WAL, MemStore, Compactions, and on-disk format details. It also looks at how the storage is used by features like snapshots, and how it can be improved to gain flexibility, performance, and space efficiency.
1) HBase satisfied Facebook's requirements for a real-time data store by providing excellent write performance, horizontal scalability, and features like atomic operations.
2) At Facebook, HBase is used for messaging and user activity tracking applications that involve massive write-throughput and petabytes of data.
3) HBase's integration with HDFS provides fault tolerance and scalability, while its column orientation enables complex queries on user activity data.
The document introduces Maxtable, an open-source distributed database. It consists of three components: a metadata server that manages the global namespace, Ranger servers that hold data partitions, and client libraries. Data is automatically partitioned and scaled across servers. The document describes Maxtable's architecture, features like scalability and recovery, its query language, and how to operate and maintain the system. Future work may include secondary indexes and join queries.
Near-realtime analytics with Kafka and HBase (dave_revell)
A presentation at OSCON 2012 by Nate Putnam and Dave Revell about Urban Airship's analytics stack. Features Kafka, HBase, and Urban Airship's own open source projects statshtable and datacube.
Siebel Server Cloning available in 8.1.1.9 / 8.2.2.2 (Jeroen Burgers)
Installation Cloning
Siebel server cloning
Enterprise cloning
Patch Deployment
Capture installation changes
Apply changes to target environments
Server Configuration Deployment
Extract server configuration settings
Migrate server configurations to target environments
This document summarizes a talk about Facebook's use of HBase for messaging data. It discusses how Facebook migrated data from MySQL to HBase to store metadata, search indexes, and small messages in HBase for improved scalability. It also outlines performance improvements made to HBase, such as for compactions and reads, and future plans such as cross-datacenter replication and running HBase in a multi-tenant environment.
The document discusses Facebook's use of HBase to store messaging data. It provides an overview of HBase, including its data model, performance characteristics, and how it was a good fit for Facebook's needs due to its ability to handle large volumes of data, high write throughput, and efficient random access. It also describes some enhancements Facebook made to HBase to improve availability, stability, and performance. Finally, it briefly mentions Facebook's migration of messaging data from MySQL to their HBase implementation.
The document discusses Facebook's use of HBase as the database storage engine for its messaging platform. It provides an overview of HBase, including its data model, architecture, and benefits like scalability, fault tolerance, and simpler consistency model compared to relational databases. The document also describes Facebook's contributions to HBase to improve performance, availability, and achieve its goal of zero data loss. It shares Facebook's operational experiences running large HBase clusters and discusses its migration of messaging data from MySQL to a de-normalized schema in HBase.
Webinar: Deep Dive on Apache Flink State - Seth Wiesman (Ververica)
Apache Flink is a world-class stateful stream processor that presents a huge variety of optional features and configuration choices to the user. Determining the optimal choice for any production environment and use case can be challenging. In this talk, we will explore and discuss the universe of Flink configuration with respect to state and state backends.
We will start with a closer look under the hood, at core data structures and algorithms, to build the foundation for understanding the impact of tuning parameters and the cost-benefit tradeoffs that come with certain features and options. In particular, we will focus on state backend choices (Heap vs RocksDB), tuning checkpointing (incremental checkpoints, ...) and recovery (local recovery), serializers, and Apache Flink's new state migration capabilities.
This document provides an overview of WebLogic Server topology, configuration, and administration. It describes key concepts such as domains, servers, clusters, and configuration files. It also discusses administration tools for configuring and managing WebLogic domains including the Configuration Wizard, Administration Console, and WLST scripting tool. The Configuration Wizard is a GUI tool for creating domains from templates, while the Administration Console is a browser-based interface for ongoing domain administration.
This document discusses WebLogic server domains, clusters, and high availability configurations. It defines domains as logically related groups of WebLogic servers managed from a single configuration, and notes they contain servers and server clusters. It describes the administration server's role in central configuration and deployment, and managed servers which host applications. It explains how server clusters provide scalability and high availability through load balancing, failover, and replication across multiple servers.
Omid: Efficient Transaction Mgmt and Processing for HBase (DataWorks Summit)
This document discusses Omid, a system for providing efficient transaction management and incremental processing for HBase. Omid implements an optimistic concurrency control model called snapshot isolation without locking. It has a simple API based on Java Transaction API and HBase API. Omid's architecture involves a centralized server for transaction metadata coordination and replication of metadata to HBase clients. It uses BookKeeper for fault tolerance. An example application described is performing TF-IDF indexing of tweets incrementally using Omid transactions.
Apache CloudStack is open source software for building public, private, and hybrid Infrastructure as a Service (IaaS) clouds. It allows users to provision virtual servers, storage, and networking resources through a web interface, and provides APIs for management and integration with other systems. It supports various hypervisors including KVM, Xen, VMware, and Oracle VM VirtualBox, as well as storage systems like iSCSI, NFS, and object storage.
Rigorous and Multi-tenant HBase Performance Measurement (DataWorks Summit)
The document discusses techniques for rigorously measuring HBase performance in both standalone and multi-tenant environments. It begins with an overview of HBase and the Yahoo! Cloud Serving Benchmark (YCSB) for evaluating databases. It then discusses best practices for cluster setup, data loading, and benchmarking techniques like warming the cache, setting target throughput, and using appropriate workloads. Finally, it covers challenges in measuring HBase performance when used alongside other frameworks like MapReduce and Solr in a multi-tenant setting.
Rigorous and Multi-tenant HBase Performance (Cloudera, Inc.)
The document discusses techniques for rigorously measuring Apache HBase performance in both standalone and multi-tenant environments. It introduces the Yahoo! Cloud Serving Benchmark (YCSB) and best practices for cluster setup, workload generation, data loading, and measurement. These include pre-splitting tables, warming caches, setting target throughput, and using appropriate workload distributions. The document also covers challenges in achieving good multi-tenant performance across HBase, MapReduce and Apache Solr.
[Hic2011] Using Hadoop/Lucene/Solr for large-scale search, by Systex (James Chen)
This document discusses using Hadoop/MapReduce with Solr/Lucene for large scale distributed search. It begins with an introduction to the speaker and his experience with Hadoop. The agenda then outlines discussing why search big data, an overview of Lucene, Solr and Zookeeper, distributed searching and indexing with Hadoop, and a case study on web log categorization.
The document summarizes CloudStack architecture plans for the future. It discusses moving to management server clusters per availability zone rather than per region. It also discusses using an object storage system for templates and snapshots rather than a separate NFS server. Finally, it discusses a possible future model where CloudStack manages existing virtualization clusters rather than deploying and managing its own system VMs.
1. HBase Storage Internals, present and future!
Matteo Bertozzi | @Cloudera
March 2013 - Hadoop Summit Europe
2. What is HBase?
• Open source Storage Manager that provides random read/write on top of HDFS
• Provides Tables with a “Key:Column/Value” interface
• Dynamic columns (qualifiers), no schema needed
• “Fixed” column groups (families)
• table[row:family:column] = value
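As a rough mental model (an illustration only, not HBase's implementation), the “Key:Column/Value” interface behaves like a sorted map of sorted maps:

```java
import java.util.TreeMap;

// Conceptual sketch of table[row:family:column] = value:
// rows sorted by key, each row holding "family:column" -> value.
// (Timestamps/versions are omitted for brevity.)
public class ConceptualTable {
    private final TreeMap<String, TreeMap<String, byte[]>> rows = new TreeMap<>();

    public void put(String row, String familyColumn, byte[] value) {
        rows.computeIfAbsent(row, r -> new TreeMap<>()).put(familyColumn, value);
    }

    public byte[] get(String row, String familyColumn) {
        TreeMap<String, byte[]> cols = rows.get(row);
        return cols == null ? null : cols.get(familyColumn);
    }
}
```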
3. HBase ecosystem
• Apache Hadoop HDFS for data durability and reliability (Write-Ahead Log)
• Apache ZooKeeper for distributed coordination
• Apache Hadoop MapReduce built-in support for running MapReduce jobs
[Diagram: an App and MapReduce jobs on top of HBase, which sits on ZooKeeper and HDFS]
5. Master, Region Servers and Regions
[Diagram: a Client talks to ZooKeeper, the Master, and the Region Servers, each hosting a set of Regions on top of HDFS]
• Region Server
  • Server that contains a set of Regions
  • Responsible to handle reads and writes
• Region
  • The basic unit of scalability in HBase
  • Subset of the table’s data
  • Contiguous, sorted range of rows stored together
• Master
  • Coordinates the HBase Cluster
  • Assignment/Balancing of the Regions
  • Handles admin operations (create/delete/modify table, …)
6. Autosharding and .META. table
• A Region is a Subset of the table’s data
• When there is too much data in a Region…
  • a split is triggered, creating 2 regions
• The association “Region -> Server” is stored in a System Table
• The Location of .META. is stored in ZooKeeper

  Table       Start Key   Region ID   Region Server
  testTable   Key-00      1           machine01.host
  testTable   Key-31      2           machine03.host
  testTable   Key-65      3           machine02.host
  testTable   Key-83      4           machine01.host
  …           …           …           …
  users       Key-AB      1           machine03.host
  users       Key-KG      2           machine02.host

[Diagram: machine01 hosts Region 1 and Region 4 of testTable; machine02 hosts Region 3 of testTable and Region 1 of users; machine03 hosts Region 2 of testTable and Region 2 of users]
7. The Write Path – Create a New Table
• The client asks the master to create a new Table
  • hbase> create ‘myTable’, ‘cf’
• The Master
  • Stores the Table information (“schema”)
  • Creates Regions based on the key-splits provided
    • if no splits are provided, one single region by default
  • Assigns the Regions to the Region Servers (“enable”)
  • The assignment Region -> Server is written to a system table called “.META.”
[Diagram: Client calls createTable() on the Master, which stores the Table “Metadata” and assigns the Regions to the Region Servers]
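As an illustrative client-side sketch of this step, using the 0.94-era Java API contemporary with this deck (‘myTable’, ‘cf’, and the split keys are example values):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class CreateTableExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        HTableDescriptor desc = new HTableDescriptor("myTable");
        desc.addFamily(new HColumnDescriptor("cf"));

        // Pre-split into 3 regions: (-inf, Key-31), [Key-31, Key-65), [Key-65, +inf)
        byte[][] splitKeys = { Bytes.toBytes("Key-31"), Bytes.toBytes("Key-65") };
        admin.createTable(desc, splitKeys);
        admin.close();
    }
}
```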
8. The Write Path – “Inserting” data
• table.put(row-key:family:column, value)
• The client asks ZooKeeper the location of .META.
• The client scans .META. searching for the Region Server responsible to handle the Key
• The client asks that Region Server to insert/update/delete the specified key/value
• The Region Server processes the request and dispatches it to the Region responsible to handle the Key
  • The operation is written to a Write-Ahead Log (WAL)
  • …and the KeyValues added to the Store: “MemStore”
[Diagram: Client asks ZooKeeper “Where is .META.?”, scans .META. on its Region Server, then sends the KeyValue insert to the Region Server hosting the Key]
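A minimal client sketch of the put described above (the table, row, and column names are example values); the ZooKeeper and .META. lookups happen transparently inside the client library:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class PutExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // The HTable client performs the .META. lookup behind the scenes,
        // then routes the Put to the right Region Server.
        HTable table = new HTable(conf, "myTable");
        Put put = new Put(Bytes.toBytes("row-key"));
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("column"), Bytes.toBytes("value"));
        table.put(put);
        table.close();
    }
}
```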
9. The Write Path – Append Only to Random R/W
• Files in HDFS are
  • Append-Only
  • Immutable once closed
• HBase provides Random Writes?
  • …not really from a storage point of view
  • KeyValues are stored in memory and written to disk on pressure
    • Don’t worry, your data is safe in the WAL!
    • (The Region Server can recover data from the WAL in case of crash)
  • But this allows sorting data by Key before writing to disk
• Deletes are like Inserts but with a “remove me flag”
[Diagram: a Region Server with its Regions, a WAL, and a MemStore flushing sorted KeyValues (Key0 – value 0 … Key5 – value 5) to Store Files (HFiles)]
10. The Read Path – “reading” data
• The client asks ZooKeeper the location of .META.
• The client scans .META. searching for the Region Server responsible to handle the Key
• The client asks that Region Server to get the specified key/value
• The Region Server processes the request and dispatches it to the Region responsible to handle the Key
  • MemStore and Store Files are scanned to find the key
[Diagram: Client asks ZooKeeper “Where is .META.?”, scans .META. on its Region Server, then sends the Get to the Region Server hosting the Key]
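And the matching read, again a minimal sketch with example names:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class GetExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "myTable");
        // Server-side, this triggers the MemStore + Store File scan
        // described above.
        Result result = table.get(new Get(Bytes.toBytes("row-key")));
        byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("column"));
        System.out.println(Bytes.toString(value));
        table.close();
    }
}
```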
11. The Read Path – Append Only to Random R/W
• On each flush a new file is created
• Each file has KeyValues sorted by key
• Two or more files can contain the same key (updates/deletes)
• To find a Key you need to scan all the files
  • …with some optimizations
  • Filter Files by Start/End Key
  • Having a bloom filter on each file
[Diagram: three store files with overlapping sorted keys, e.g. an older file holding Key0 – value 0.0 … Key9 – value 9.0 and a newer one holding Key0 – value 0.1 and Key5 – [deleted]]
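A toy sketch of the start/end-key file filtering mentioned above (purely illustrative; the class and method names are hypothetical): only files whose key range covers the requested key need to be scanned, and a per-file bloom filter can then prune further.

```java
import java.util.ArrayList;
import java.util.List;

class StoreFileRange {
    final String firstKey, lastKey; // the file's smallest and largest keys

    StoreFileRange(String firstKey, String lastKey) {
        this.firstKey = firstKey;
        this.lastKey = lastKey;
    }

    // A file can only contain the key if it falls inside [firstKey, lastKey].
    boolean mayContain(String key) {
        return firstKey.compareTo(key) <= 0 && lastKey.compareTo(key) >= 0;
    }

    static List<StoreFileRange> candidates(List<StoreFileRange> files, String key) {
        List<StoreFileRange> out = new ArrayList<>();
        for (StoreFileRange f : files) {
            if (f.mayContain(key)) out.add(f);
        }
        return out; // only these files need an actual lookup
    }
}
```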
13. HFile Format
• Only Sequential Writes, just append(key, value)
• Large Sequential Reads are better
• Why group records in blocks?
  • Easy to split
  • Easy to read
  • Easy to cache
  • Easy to index (if records are sorted)
  • Block Compression (snappy, lz4, gz, …)
[Diagram: an HFile as a sequence of blocks: Header, record blocks (Record 0 … Record N), Index 0 … Index N, Trailer; each Key/Value record is Key Length : int, Value Length : int, Key : byte[], Value : byte[]]
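A simplified sketch of serializing one record in the layout shown above (the real HFile writer adds block headers, indexes, compression, and a trailer on top of this):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class RecordWriterSketch {
    // Encodes a record as (key length : int)(value length : int)(key)(value),
    // which is then appended sequentially to the file.
    public static byte[] encode(byte[] key, byte[] value) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeInt(key.length);
        out.writeInt(value.length);
        out.write(key);
        out.write(value);
        out.flush();
        return buf.toByteArray();
    }
}
```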
14. Data Block Encoding
• “Be aware of the data”
• Block Encoding allows compressing the Key based on what we know
  • Keys are sorted… prefix may be similar in most cases
  • One file contains keys from one Family only
  • Timestamps are “similar”, we can store the diff
  • Type is “put” most of the time…
[Diagram: the “on-disk” KeyValue layout: Row Length : short, Row : byte[], Family Length : byte, Family : byte[], Qualifier : byte[], Timestamp : long, Type : byte]
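A toy illustration of the idea behind prefix-style key encoding (not HBase's actual DataBlockEncoding code): since keys in a block are sorted, each key can be stored as a shared-prefix length plus its suffix.

```java
public class PrefixEncodingSketch {
    // Length of the byte prefix two consecutive keys share.
    static int commonPrefix(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length), i = 0;
        while (i < n && a[i] == b[i]) i++;
        return i;
    }

    public static void main(String[] args) {
        byte[] prev = "row-0001/cf:col".getBytes();
        byte[] curr = "row-0002/cf:col".getBytes();
        int shared = commonPrefix(prev, curr);
        // Store only (shared, suffix) instead of the full key.
        System.out.println("shared=" + shared
            + " suffix=" + new String(curr, shared, curr.length - shared));
    }
}
```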
16. Compactions
• Reduces the number of files to look into during a scan
• Removes duplicated keys (updated values)
• Removes deleted keys
• Creates a new file by merging the content of two or more files
• Removes the old files
[Diagram: two store files with overlapping keys (e.g. Key0 – value 0.0 vs Key0 – value 0.1, and Key5 – [deleted]) merged into one file containing only the latest live values, Key0 – value 0.1 … Key9 – value 9.0]
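A toy merge illustrating what a compaction achieves (illustrative only; real compactions stream sorted files through a heap rather than materializing whole maps in memory):

```java
import java.util.TreeMap;

public class CompactionSketch {
    // Toy tombstone marker standing in for HBase's delete markers.
    static final byte[] DELETED = new byte[0];

    /** Merges store files (oldest first); newest value wins, deletes are dropped. */
    @SafeVarargs
    static TreeMap<String, byte[]> compact(TreeMap<String, byte[]>... files) {
        TreeMap<String, byte[]> merged = new TreeMap<>();
        for (TreeMap<String, byte[]> file : files) {
            merged.putAll(file); // newer files overwrite older values
        }
        merged.values().removeIf(v -> v == DELETED); // drop deleted keys
        return merged;
    }
}
```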
17. Pluggable Compactions
• Try different algorithms
• Be aware of the data
  • Time Series? I guess no updates from the 80s
• Be aware of the requests
• Compact based on statistics
  • which files are hot and which are not
  • which keys are hot and which are not
[Diagram: the same store-file merge example as the previous slide]
18. Snapshots
Zero-Copy Snapshots and Table Clones
19. What Is a Snapshot?
• “a Snapshot is not a copy of the table”
• a Snapshot is a set of metadata information
  • The table “schema” (column families and attributes)
  • The Regions information (start key, end key, …)
  • The list of Store Files
  • The list of active WALs
[Diagram: the Master and two Region Servers coordinating through ZooKeeper; each Region Server holds Regions, a WAL, and Store Files (HFiles)]
20. How Taking a Snapshot Works
• The master orchestrates the Region Servers
  • the communication is done via ZooKeeper
  • using a “2-phase commit like” transaction (prepare/commit)
• Each Region Server is responsible to take its “piece” of the snapshot
  • For each Region, store the metadata information needed
  • (list of Store Files, WALs, region start/end keys, …)
[Diagram: the Master and two Region Servers coordinating through ZooKeeper; each Region Server holds Regions, a WAL, and Store Files (HFiles)]
21. Cloning a Table from a Snapshot
• hbase> clone_snapshot ‘snapshotName’, ‘tableName’
• Creates a new table with the data “contained” in the snapshot
  • No data copies involved
  • HFiles are immutable, and shared between tables and snapshots
• You can insert/update/remove data from the new table
  • No repercussions on the snapshot, original tables or other cloned tables
22. Compactions & Archiving
• HFiles are immutable, and shared between tables and snapshots
• On compaction or table deletion, files are removed from disk
• If one of these files is referenced by a snapshot or a cloned table
  • The file is moved to an “archive” directory
  • And deleted later, when there are no references to it
24. 0.96 is coming up
• Moving RPC to Protobuf
• Allows rolling upgrades with no surprises
• HBase Snapshots
• Pluggable Compactions
• Remove -ROOT-
• Table Locks
25. 0.98 and Beyond
• Transparent Table/Column-Family Encryption
• Cell-level security
• Multiple WALs per Region Server (MTTR)
• Data Placement Awareness (MTTR)
• Data Type Awareness
• Compaction policies, based on the data needs
• Managing blocks directly (instead of files)
26. Questions?
Matteo Bertozzi | @Cloudera