HBase
Леонид Налчаджи
leonid.nalchadzhi@gmail.com
HBase

• Overview
• Table layout
• Architecture
• Client API
• Key design
Overview

• NoSQL
• Column oriented
• Versioned

• All rows are ordered by row key
• All cells are uninterpreted arrays of bytes
• Auto-sharding

• Distributed
• HDFS based
• Highly fault tolerant
• CP in CAP terms

• Coordinates: <RowKey, CF:CN, Version>
• Data is grouped by column family
• Null values for free

• No actual deletes (tombstones)
• TTL for cells
Disadvantages

• No secondary indexes out of the box
  • Available through an additional table and coprocessors
• No transactions
  • CAS support, row locks and counters are available
• Not good for multiple data centers
  • Only master-slave replication
• No complex query processing
Table layout

[Slides 10-13: table layout diagrams]
Architecture
Terms

• Region
• Region server
• WAL
• Memstore
• StoreFile/HFile
Architecture

[Slides 16-18: architecture diagrams]

• One master (with backup), many slaves (region servers)
• Very tight integration with ZooKeeper
• Client calls ZK, RS, Master
Master

• Handles schema changes
• Cluster housekeeping operations
• Not a SPOF
• Checks RS status through ZK
Region server

• Stores data in regions
• Region is a container of data
• Does the actual data manipulation
• Collocated with an HDFS DataNode
ZooKeeper

• Distributed synchronization service
• FS-like tree structure contains keys
• Provides primitives to achieve a lot of goals
Storage

• Handled by the region server
• Memstore: sorted write buffer
• 2 types of file:
  • WAL log
  • Data storage (HFiles)
Client connection

• Asks ZK for the address of the -ROOT- region
• From -ROOT- gets the RS with .META.
• From .META. gets the address of the RS
• Caches all these data
Write path

• Write to HLog (one per RS)
• Write to Memstore (one per CF)
• If the Memstore is full, a flush is requested
  • This request is processed asynchronously by another thread
On Memstore flush

• On flush the Memstore is written to an HDFS file
• The Memstore is cleared
• The last operation sequence number is stored
Region splits

• Region is closed
• .META. is updated (parent and daughters)
• Directory structure for the new regions is created (multithreaded)
• Master is notified for load balancing
• All steps are tracked in ZK
• Parent is cleaned up eventually
Compaction

• When the Memstore is flushed, a new store file is created
• Compaction is a housekeeping mechanism which merges store files
• Comes in two varieties: minor and major
• The type is determined when the compaction check is executed
Minor compaction

• Fast
• Selects the files to include
• Params: max size and min number of files
• Processes files from the oldest to the newest

[Slide 31: minor compaction diagram]
Major compaction

• Compacts all files into a single file
• Starts periodically (every 24 hours by default)
• Checks tombstone markers and removes data
• Might be promoted from a minor compaction
HFile format

[Slide 33: HFile layout diagram]

• Immutable (the trailer is written once)
• The trailer has pointers to the other blocks
• Indexes store offsets to data and metadata
• Block size is 64K by default
• A block contains a magic header and a number of key values
KeyValue format

[Slide 35: KeyValue layout diagram]
Write-Ahead log

• One per region server
• Contains data from several regions
• Stores data sequentially
• Rolled periodically (every hour by default)
• Applied on cluster restart and on server failure
• Split is distributed
Read path

• There is no single index with logical rows
• Gets are the same as scans
• Data is stored in several files and a Memstore per column family

[Slide 40: read path diagram]
Block cache

• Each region server has an LRU block cache
• Based on priorities:
  1. Single access
  2. Multi access priority
  3. Catalog CF: ROOT and META

• Size of the cache:
  number of region servers * heap size * hfile.block.cache.size * 0.85
  • 1 RS with 1 Gb RAM = 217 Mb
  • 20 RS with 8 Gb RAM = 34 Gb
  • 100 RS with 24 Gb RAM with 0.5 = 1 Tb

• Always in cache:
  • Catalog tables
  • HFile indexes
  • Keys (row keys, CF, TS)
Client API
CRUD

• All operations go through an HTable object
• All operations are per-row atomic
• Row lock is available
• Batch is available
• Looks like this:
  HTable table = new HTable("table");
  table.put(new Put(...));
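A slightly fuller sketch of the same round trip, assuming the 0.94-era client API and an already created table "table" with column family "data" (all names are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class PutGetSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "table");   // the table must already exist

    Put put = new Put(Bytes.toBytes("row-1"));
    put.add(Bytes.toBytes("data"), Bytes.toBytes("q"), Bytes.toBytes("v1"));
    table.put(put);                             // per-row atomic write

    Result res = table.get(new Get(Bytes.toBytes("row-1")));
    byte[] value = res.getValue(Bytes.toBytes("data"), Bytes.toBytes("q"));

    table.close();
  }
}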
Put

• Put(byte[] row)
• Put(byte[] row, RowLock rowLock)
• Put(byte[] row, long ts)
• Put(byte[] row, long ts, RowLock rowLock)

• Put add(byte[] family, byte[] qualifier, byte[] value)
• Put add(byte[] family, byte[] qualifier, long ts, byte[] value)
• Put add(KeyValue kv)

• setWriteToWAL()
• heapSize()
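Putting these pieces together (a sketch against the 0.94-era API; "table", "data" and the qualifiers are made-up names, and table is an open HTable instance):

// imports: org.apache.hadoop.hbase.client.*, org.apache.hadoop.hbase.util.Bytes
Put put = new Put(Bytes.toBytes("row-1"), System.currentTimeMillis());
put.add(Bytes.toBytes("data"), Bytes.toBytes("q1"), Bytes.toBytes("v1"));
put.add(Bytes.toBytes("data"), Bytes.toBytes("q2"), Bytes.toBytes("v2"));
put.setWriteToWAL(false);   // skips the WAL: faster, but data is lost if the RS dies
table.put(put);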
Write buffer

• There is a client-side write buffer
• Sorts data by region server
• 2 Mb by default
• Controlled by:
  table.setAutoFlush(false);
  table.flushCommits();

[Slide 50: write buffer diagram]
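A sketch of how the buffer is typically driven (0.94-era API; assumes an open HTable named table with column family "data"):

// imports: org.apache.hadoop.hbase.client.*, org.apache.hadoop.hbase.util.Bytes
table.setAutoFlush(false);              // enable the client-side buffer
for (int i = 0; i < 1000; i++) {
  Put put = new Put(Bytes.toBytes("row-" + i));
  put.add(Bytes.toBytes("data"), Bytes.toBytes("q"), Bytes.toBytes("v" + i));
  table.put(put);                       // buffered locally, not sent yet
}
table.flushCommits();                   // ships the buffered Puts, grouped by region server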
What else

• CAS:
  boolean checkAndPut(byte[] row, byte[] family,
      byte[] qualifier, byte[] value, Put put) throws IOException
• Batch:
  void put(List<Put> puts) throws IOException
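For example (a sketch; the row, family and values are illustrative, table is an open HTable):

// imports: java.util.*, org.apache.hadoop.hbase.client.*, org.apache.hadoop.hbase.util.Bytes
Put put = new Put(Bytes.toBytes("row-1"));
put.add(Bytes.toBytes("data"), Bytes.toBytes("state"), Bytes.toBytes("locked"));

// the Put is applied only if data:state currently equals "free"
boolean applied = table.checkAndPut(Bytes.toBytes("row-1"),
    Bytes.toBytes("data"), Bytes.toBytes("state"),
    Bytes.toBytes("free"), put);

List<Put> puts = new ArrayList<Put>();   // batch: one call, many rows
puts.add(put);
table.put(puts);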
Get

• Returns strictly one row:
  Result res = table.get(new Get(row));
• Can ask if a row exists:
  boolean e = table.exists(new Get(row));
• A filter can be applied
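A sketch of a narrowed Get with a filter (0.94-era API; names are illustrative):

// imports: org.apache.hadoop.hbase.client.*, org.apache.hadoop.hbase.filter.*, org.apache.hadoop.hbase.util.Bytes
Get get = new Get(Bytes.toBytes("row-1"));
get.addColumn(Bytes.toBytes("data"), Bytes.toBytes("q"));   // only this column
get.setMaxVersions(3);                                      // up to 3 versions per cell
get.setFilter(new ValueFilter(CompareFilter.CompareOp.EQUAL,
    new BinaryComparator(Bytes.toBytes("v1"))));
Result res = table.get(get);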
Row lock

• Region server provides a row lock
• Client can ask for an explicit lock:
  RowLock lockRow(byte[] row) throws IOException;
  void unlockRow(RowLock rl) throws IOException;
• Do not use it if you do not have to
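If you do use it, release the lock in a finally block (a sketch, 0.94-era API, names made up):

// imports: org.apache.hadoop.hbase.client.*, org.apache.hadoop.hbase.util.Bytes
RowLock lock = table.lockRow(Bytes.toBytes("row-1"));
try {
  Put put = new Put(Bytes.toBytes("row-1"), lock);   // the Put reuses the lock
  put.add(Bytes.toBytes("data"), Bytes.toBytes("q"), Bytes.toBytes("v1"));
  table.put(put);
} finally {
  table.unlockRow(lock);   // otherwise the lock is held until the lease expires
}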
Scan

• Iterator over a number of rows
• Start and end rows can be defined
• Leased for a limited amount of time
• Batching and caching are available
Scan

final Scan scan = new Scan();
scan.addFamily(Bytes.toBytes("colfam1"));
scan.setBatch(5);
scan.setCaching(5);
scan.setMaxVersions(1);
scan.setTimeStamp(0);
scan.setStartRow(Bytes.toBytes(0));
scan.setStopRow(Bytes.toBytes(1000));
ResultScanner scanner = table.getScanner(scan);
for (Result res : scanner) {
  ...
}
scanner.close();
Scan

• Caching is for rows
• Batching is for columns
• Result.next() will return the next set of batched columns (more than one Result per row)

[Slide 57: scan caching and batching diagram]
Counters

• Any column can be treated as a counter
• Atomic increment without locking
• Can be applied to multiple rows
• Looks like this:
  long cnt = table.incrementColumnValue(Bytes.toBytes("20131028"),
      Bytes.toBytes("daily"), Bytes.toBytes("hits"), 1);
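Several counters in the same row can be bumped in one atomic call with the Increment object (a sketch; the column names are made up):

// imports: org.apache.hadoop.hbase.client.*, org.apache.hadoop.hbase.util.Bytes
Increment inc = new Increment(Bytes.toBytes("20131028"));
inc.addColumn(Bytes.toBytes("daily"), Bytes.toBytes("hits"), 1);
inc.addColumn(Bytes.toBytes("daily"), Bytes.toBytes("visitors"), 1);
Result res = table.increment(inc);   // both columns are incremented atomically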
HTable pooling

• Creating an HTable instance is expensive
• HTable is not thread safe
• Should be created on start up and closed on application exit

• Provided pool: HTablePool
• Thread safe
• Shares common resources

• HTablePool(Configuration config, int maxSize)
• HTableInterface getTable(String tableName)
• void putTable(HTableInterface table)
• void closeTablePool(String tableName)
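Typical pool usage looks roughly like this (a sketch; table and family names are illustrative):

// imports: org.apache.hadoop.hbase.HBaseConfiguration, org.apache.hadoop.hbase.client.*, org.apache.hadoop.hbase.util.Bytes
HTablePool pool = new HTablePool(HBaseConfiguration.create(), 10);
HTableInterface table = pool.getTable("table");
try {
  Put put = new Put(Bytes.toBytes("row-1"));
  put.add(Bytes.toBytes("data"), Bytes.toBytes("q"), Bytes.toBytes("v1"));
  table.put(put);              // used like a plain HTable
} finally {
  pool.putTable(table);        // return to the pool instead of closing
}
pool.closeTablePool("table");  // on application shutdown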
Key design

• The only index is the row key
• Row keys are sorted
• Key is an uninterpreted array of bytes

[Slide 64: key design diagram]
Email example

Approach #1: user id is the row key, all messages in one cell
<userId> : <cf> : <col> : <ts> : <messages>
12345 : data : msgs : 1307097848 : ${msg1}, ${msg2}...

• All in one request
• Hard to find anything
• Users with a lot of messages are a problem
Email example

Approach #2: user id is the row key, message id is the column qualifier
<userId> : <cf> : <msgId> : <timestamp> : <email-message>
12345 : data : 5fc38314-e290-ae5da5fc375d : 1307097848 : "Hi, ..."
12345 : data : 725aae5f-d72e-f90f3f070419 : 1307099848 : "Welcome, and ..."
12345 : data : cc6775b3-f249-c6dd2b1a7467 : 1307101848 : "To Whom It ..."
12345 : data : dcbee495-6d5e-6ed48124632c : 1307103848 : "Hi, how are ..."

• Can load data on demand
• Can find anything only with a filter
• Users with a lot of messages (very wide rows)
Email example

Approach #3: row key is <userId>-<msgId>
<userId>-<msgId> : <cf> : <qualifier> : <timestamp> : <email-message>
12345-5fc38314-e290-ae5da5fc375d : data : : 1307097848 : "Hi, ..."
12345-725aae5f-d72e-f90f3f070419 : data : : 1307099848 : "Welcome, and ..."
12345-cc6775b3-f249-c6dd2b1a7467 : data : : 1307101848 : "To Whom It ..."
12345-dcbee495-6d5e-6ed48124632c : data : : 1307103848 : "Hi, how are ..."

• Can load data on demand
• Can find a message by row key alone
• Users with a lot of messages are spread across many rows
Email example

• Go further:
  <userId>-<date>-<messageId>-<attachmentId>
• To read the newest entries first, store the date inverted:
  <userId>-(MAX_LONG - <date>)-<messageId>-<attachmentId>
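A sketch of composing such a key with fixed-width fields so the byte order matches the intended sort (the field layout is illustrative):

// imports: org.apache.hadoop.hbase.util.Bytes
long userId = 12345L;
long date = System.currentTimeMillis();
byte[] msgId = Bytes.toBytes("5fc38314-e290-ae5da5fc375d");

// fixed-width longs keep rows sorted; inverting the date puts newest first
byte[] rowKey = Bytes.add(
    Bytes.toBytes(userId),
    Bytes.toBytes(Long.MAX_VALUE - date),
    msgId);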
Performance

• 10 Gb data, 10 parallel clients, 1kb per entry
• HDFS replication = 3
• 3 region servers, 2 CPUs (24 cores), 48Gb RAM, 10GB for App

          nobuf         buf:100       buf:1000
noWAL     12 MB/s       53 MB/s       48 MB/s
          11k rows/s    50k rows/s    45k rows/s
WAL       5 MB/s        11 MB/s       31 MB/s
          4.7k rows/s   10k rows/s    30k rows/s
Sources

• HBase: The Definitive Guide (O'Reilly)
• HBase in Action (Manning)
• Official documentation
  http://hbase.apache.org/0.94/book.html
• Cloudera articles
  http://blog.cloudera.com/blog/category/hbase/
HBase
Леонид Налчаджи
leonid.nalchadzhi@gmail.com