The strength of an open source project resides entirely in its developer community; a strong democratic culture of participation and hacking makes for a better piece of software. The key requirement is having developers who are not only willing to contribute, but also knowledgeable about the project’s internal structure and architecture. This session will introduce developers to the core internal architectural concepts of HBase: not just “what” it does from the outside, but “how” it works internally, and “why” it does things a certain way. We’ll walk through key sections of code and discuss key concepts like the MVCC implementation and memstore organization. The goal is to convert serious “HBase Users” into “HBase Developer Users”, and give voice to some of the deep knowledge locked in the committers’ heads.
6. In the end it comes down to a
sorting problem.
How do you sort 1PB with 16GB
of memory?
Unsorted input: 756329184
Split into chunks that fit in memory: 756 329 184
Sort each chunk: 567 239 148
Merge the sorted chunks: 123456789
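The digits example above can be sketched in code: sort each chunk on its own, then merge the sorted chunks with a heap. This is only an illustration of the idea, not HBase code; the class name `ChunkedSortSketch` and its methods are hypothetical, and in-memory arrays stand in for files on disk.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

class ChunkedSortSketch {
    // Sort each chunk independently (stands in for "sort what fits in memory, then flush").
    static List<int[]> sortChunks(int[] input, int chunkSize) {
        List<int[]> runs = new ArrayList<>();
        for (int start = 0; start < input.length; start += chunkSize) {
            int end = Math.min(start + chunkSize, input.length);
            int[] run = Arrays.copyOfRange(input, start, end);
            Arrays.sort(run);
            runs.add(run);
        }
        return runs;
    }

    // K-way merge: the heap holds one cursor {runIndex, position} per sorted run,
    // ordered by the value the cursor currently points at.
    static int[] merge(List<int[]> runs) {
        PriorityQueue<int[]> heap = new PriorityQueue<>(
            (a, b) -> Integer.compare(runs.get(a[0])[a[1]], runs.get(b[0])[b[1]]));
        int total = 0;
        for (int i = 0; i < runs.size(); i++) {
            total += runs.get(i).length;
            if (runs.get(i).length > 0) {
                heap.add(new int[] {i, 0});
            }
        }
        int[] out = new int[total];
        int n = 0;
        while (!heap.isEmpty()) {
            int[] cursor = heap.poll();
            int[] run = runs.get(cursor[0]);
            out[n++] = run[cursor[1]];
            if (cursor[1] + 1 < run.length) {
                heap.add(new int[] {cursor[0], cursor[1] + 1});
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[] input = {7, 5, 6, 3, 2, 9, 1, 8, 4};
        List<int[]> runs = sortChunks(input, 3);   // [5,6,7] [2,3,9] [1,4,8]
        System.out.println(Arrays.toString(merge(runs)));
    }
}
```

Only one element per run has to be in memory during the merge, which is why the approach scales beyond the available RAM.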
7. Overview
• All changes collected/sorted in memory
• Periodically flushed to HDFS
– Writes to disk are batched
• HFiles periodically compacted into larger files
• Scanning/Compacting: Merge Sort
13. Updates
• All changes are written to the WAL first
• All changes are written to memory (the "MemStore")
• MemStores are flushed to disk (creating a new HFile)
• HFiles are periodically and asynchronously compacted into
fewer files.
• HFiles are immutable
14. ROW Atomicity
• Snapshot isolation and locking (per row)
o Row is locked while memory is changed
o Each row-update "creates" a new snapshot
o Each row-read sees exactly one snapshot
• Done with MultiVersionConcurrencyControl (MVCC)
15. Locking
• All KVs for a “row” are co-located in a region
• Locks are per row
• Stored in-memory at region server
16. MultiVersionConcurrencyControl
Wikipedia: "implement updates not by deleting an old piece of data
and overwriting it with a new one, but instead by marking the old
data as obsolete and adding the newer version"
Note that HBase calls it
MultiVersionConsistencyControl
17. MVCC writing
• Each write gets a monotonically increasing "writenumber"
• Each KV written is tagged with this writenumber
(called memstoreTS in HBase)
• "Committing" a writenumber:
– wait for all prior writes to finish
– set the current readpoint to the writenumber
18. MVCC reading
• Each read reads as of a "readpoint"
– Filters KVs with a newer memstoreTS
• This is per regionserver (cheap in memory data
structures)
– if regionserver dies, current read is lost anyway
• but... "writenumbers" are persisted to disk
– for scanners that outlive a Memstore flush
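The write number and read point mechanics described on the two slides above can be sketched as follows. This is a simplified illustration, not the HBase class: `MvccSketch`, `Cell`, `beginWrite`, and `completeWrite` are hypothetical names. As a further simplification, `completeWrite` does not block waiting for prior writes; instead the read point only advances past a write number once every earlier write has completed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

class MvccSketch {
    // A KV tagged with the write number of the write that produced it.
    static final class Cell {
        final String value;
        final long memstoreTS;
        Cell(String value, long memstoreTS) {
            this.value = value;
            this.memstoreTS = memstoreTS;
        }
    }

    private long lastWriteNumber = 0;
    private long readPoint = 0;
    private final TreeSet<Long> pending = new TreeSet<>();  // begun but not yet completed

    // Hand out a monotonically increasing write number.
    synchronized long beginWrite() {
        long w = ++lastWriteNumber;
        pending.add(w);
        return w;
    }

    // Mark a write as done and move the read point up to the newest write
    // number below which no write is still pending.
    synchronized void completeWrite(long writeNumber) {
        pending.remove(writeNumber);
        readPoint = pending.isEmpty() ? lastWriteNumber : pending.first() - 1;
    }

    synchronized long readPoint() {
        return readPoint;
    }

    // A read filters out every cell with a memstoreTS newer than its read point.
    static List<String> visible(List<Cell> cells, long readPoint) {
        List<String> out = new ArrayList<>();
        for (Cell c : cells) {
            if (c.memstoreTS <= readPoint) {
                out.add(c.value);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        MvccSketch mvcc = new MvccSketch();
        List<Cell> memstore = new ArrayList<>();
        long w1 = mvcc.beginWrite();
        long w2 = mvcc.beginWrite();
        memstore.add(new Cell("from-w1", w1));
        memstore.add(new Cell("from-w2", w2));
        mvcc.completeWrite(w2);  // w1 is still pending, so nothing becomes visible
        System.out.println(visible(memstore, mvcc.readPoint()));
        mvcc.completeWrite(w1);  // both done: the read point moves to 2
        System.out.println(visible(memstore, mvcc.readPoint()));
    }
}
```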
20. MVCC
• Readers do not lock
• Transactions are committed strictly serially
• HBase has no client demarcated transactions
– A transaction does not outlive an RPC
21. Anatomy of an "update"
– Acquire a MVCC "writenumber"
– Lock the "row" (the row-key)
– Tag all KVs with the writenumber
– Write change to the WAL and sync to file system
– Apply update in memory ("Memstore")
– Unlock "row"
– Roll MVCC forward -> now change is visible
– If Memstore reaches configurable size, take a snapshot, flush it
to disk (asynchronously)
22. Anatomy of an "update"
Puts are optimized
– Acquire a MVCC "writenumber"
– Lock the "row" (the row-key)
– Tag all KVs with the writenumber
– Write change to the WAL and sync to file system
– Apply update in memory ("Memstore")
– Unlock "row"
o Then sync the WAL to HDFS without the row lock held
If that fails, undo the changes in Memstore;
that works because the changes are not yet visible
– Roll MVCC forward -> now change is visible
– If Memstore reaches configurable size, take a snapshot, flush it
to disk (asynchronously)
23. Anatomy of an "update"
Puts are optimized, and batched
– Acquire a MVCC "writenumber"
– lock as many "rows" as possible
– Tag all KVs with the writenumber
– write all changes to the WAL and sync to file system
– apply all updates in memory ("Memstore")
– unlock all "rows"
o Then sync the WAL to HDFS without the row locks held
If that fails, undo the changes in Memstore;
that works because the changes are not yet visible
– roll MVCC forward -> now changes are visible
– if Memstore reaches configurable size, take a snapshot, flush it
to disk (asynchronously)
24. Undo
• In-memory only
• Changes are not visible until MVCC is rolled
• Changes are tagged with the writenumber
• Undo removes KVs tagged with the writenumber
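A minimal sketch of the undo idea, assuming a sorted in-memory set as the memstore. The names `MemstoreUndoSketch`, `Kv`, and `undo` are hypothetical, and the sketch finds the KVs to remove by scanning for the write number, which is simpler than tracking the exact KVs a failed operation added.

```java
import java.util.concurrent.ConcurrentSkipListSet;

class MemstoreUndoSketch {
    static final class Kv {
        final String row;
        final String column;
        final long memstoreTS;  // write number the KV was tagged with
        Kv(String row, String column, long memstoreTS) {
            this.row = row;
            this.column = column;
            this.memstoreTS = memstoreTS;
        }
    }

    // Order by row, then column, then newest write number first.
    static int compareKvs(Kv a, Kv b) {
        int c = a.row.compareTo(b.row);
        if (c != 0) return c;
        c = a.column.compareTo(b.column);
        if (c != 0) return c;
        return Long.compare(b.memstoreTS, a.memstoreTS);
    }

    private final ConcurrentSkipListSet<Kv> kvset =
        new ConcurrentSkipListSet<Kv>(MemstoreUndoSketch::compareKvs);

    void add(String row, String column, long writeNumber) {
        kvset.add(new Kv(row, column, writeNumber));
    }

    // Undo: drop every KV tagged with the failed write number. Readers never saw
    // them, because the read point was not rolled forward to that write number.
    void undo(long writeNumber) {
        kvset.removeIf(kv -> kv.memstoreTS == writeNumber);
    }

    int size() {
        return kvset.size();
    }

    long visibleCount(long readPoint) {
        long n = 0;
        for (Kv kv : kvset) {
            if (kv.memstoreTS <= readPoint) n++;
        }
        return n;
    }

    public static void main(String[] args) {
        MemstoreUndoSketch memstore = new MemstoreUndoSketch();
        memstore.add("row1", "a", 1);  // write 1 is committed (read point = 1)
        memstore.add("row1", "b", 2);  // write 2 applied in memory, WAL sync pending
        memstore.add("row2", "c", 2);
        System.out.println(memstore.size() + " stored, " + memstore.visibleCount(1) + " visible");
        memstore.undo(2);              // pretend the WAL sync failed
        System.out.println(memstore.size() + " stored, " + memstore.visibleCount(1) + " visible");
    }
}
```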
25. Deletes
• Nothing is deleted in place.
• a Delete sets a "tombstone marker" with a timestamp
• upon compaction deleted KeyValues are removed
• upon major compactions the tombstones are removed
• Three different tombstone marker scopes (all within a row)
o version - mark a specific version of a column as deleted
o column - mark all versions of a column as deleted
o column family - mark all versions of all columns of a column
family as deleted
26. Deletes, cont
• Delete markers always sort before KVs
• A scanner remembers markers it encounters
• Each KV is checked against remembered
markers
• Only one-pass is required for scanning
• Deletes are just KVs stored in HFiles
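The one-pass idea can be sketched as below for a single row and column family. This is an illustrative simplification, not the HBase tracker: `DeleteMarkerScanSketch`, `Cell`, and `Type` are hypothetical, family markers are left out, and the input is assumed to be sorted by column, then newest timestamp first, with markers ahead of puts at the same timestamp.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class DeleteMarkerScanSketch {
    enum Type { DELETE_COLUMN, DELETE_VERSION, PUT }

    static final class Cell {
        final String column;
        final long timestamp;
        final Type type;
        Cell(String column, long timestamp, Type type) {
            this.column = column;
            this.timestamp = timestamp;
            this.type = type;
        }
    }

    // Single pass: remember the markers seen so far for the current column
    // and check every put against them.
    static List<Cell> scan(List<Cell> sorted) {
        List<Cell> visible = new ArrayList<>();
        String column = null;
        boolean columnDeleted = false;
        long columnDeleteTs = 0;
        boolean versionDeleted = false;
        long versionDeleteTs = 0;
        for (Cell cell : sorted) {
            if (!cell.column.equals(column)) {
                // New column: forget the markers of the previous one.
                column = cell.column;
                columnDeleted = false;
                versionDeleted = false;
            }
            switch (cell.type) {
                case DELETE_COLUMN:
                    if (!columnDeleted) {  // first marker seen is the newest one
                        columnDeleted = true;
                        columnDeleteTs = cell.timestamp;
                    }
                    break;
                case DELETE_VERSION:
                    versionDeleted = true;
                    versionDeleteTs = cell.timestamp;
                    break;
                default:  // PUT
                    boolean hidden = (columnDeleted && cell.timestamp <= columnDeleteTs)
                        || (versionDeleted && cell.timestamp == versionDeleteTs);
                    if (!hidden) {
                        visible.add(cell);
                    }
            }
        }
        return visible;
    }

    public static void main(String[] args) {
        List<Cell> row = Arrays.asList(
            new Cell("a", 7, Type.PUT),
            new Cell("a", 5, Type.DELETE_COLUMN),
            new Cell("a", 5, Type.PUT),
            new Cell("a", 3, Type.PUT),
            new Cell("b", 2, Type.DELETE_VERSION),
            new Cell("b", 2, Type.PUT),
            new Cell("b", 1, Type.PUT));
        for (Cell c : scan(row)) {
            System.out.println(c.column + "@" + c.timestamp);
        }
    }
}
```

Because the markers arrive before the KVs they affect, the scanner never has to look back.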
35. RegionScanner
  ...
  // Get all StoreScanners
  for (Map.Entry<byte[], NavigableSet<byte[]>> entry :
      scan.getFamilyMap().entrySet()) {
    Store store = stores.get(entry.getKey());
    StoreScanner scanner = store.getScanner(...);
    scanners.add(scanner);
  }
  // Heap of StoreScanners
  this.storeHeap = new KeyValueHeap(scanners, comparator);
}
36. RegionScanner, cont.
public synchronized boolean next(
    List<KeyValue> outResults, int limit) throws IOException {
  ...
  // MVCC: set this scanner's read point for the current thread
  MVCC.setThreadReadPoint(this.readPt);
  boolean returnResult = nextInternal(limit);
  ...
}
private boolean nextInternal(int limit) throws IOException {
  if (isStopRow(currentRow)) {
    ...
    return false;
  } else {
    // Get next KV from StoreScanners
    byte [] nextRow;
    do {
      this.storeHeap.next(results, limit - results.size());
    } while (Bytes.equals(currentRow, nextRow = peekRow()));
    final boolean stopRow = isStopRow(nextRow);
  }
  ...
}
38. Seek Hints
• Allow skipping many KVs without “touching”
• Seek-to(KV) instead of skip, skip, skip, …
• Used internally (for deletes, skipping older
versions, TTL)
• Used by filters
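The difference between skipping and seeking can be illustrated with a sorted set. This is only a stand-in: `SeekHintSketch` is a hypothetical name and a `TreeSet` of row keys replaces the real scanners, but it shows why a seek touches far fewer entries than repeated skips.

```java
import java.util.TreeSet;

class SeekHintSketch {
    // Skip, skip, skip: walk forward one key at a time until reaching a key >= target.
    // Returns how many keys had to be touched.
    static int skipTo(TreeSet<String> keys, String target) {
        int touched = 0;
        for (String key : keys) {
            touched++;
            if (key.compareTo(target) >= 0) {
                break;
            }
        }
        return touched;
    }

    // Seek-to: jump directly to the first key >= target (null if there is none).
    static String seekTo(TreeSet<String> keys, String target) {
        return keys.ceiling(target);
    }

    public static void main(String[] args) {
        TreeSet<String> keys = new TreeSet<>();
        for (int i = 0; i < 1000; i++) {
            keys.add(String.format("row%04d", i));
        }
        System.out.println("skipping touched " + skipTo(keys, "row0500") + " keys");
        System.out.println("seek landed on " + seekTo(keys, "row0500"));
    }
}
```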
39. Memstore Scanner
public synchronized KeyValue next() {
  if (theNext == null) {
    return null;
  }
  final KeyValue ret = theNext;
  // Advance one of the iterators
  // (iterators: KeyValueSkipListSet.iterator(); snapshot during flushes)
  if (theNext == kvsetNextRow) {
    kvsetNextRow = getNext(kvsetIt);
  } else {
    snapshotNextRow = getNext(snapshotIt);
  }
  // Calculate the next value
  theNext = getLowest(kvsetNextRow, snapshotNextRow);
  return ret;
}
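The same "advance one iterator, then pick the lowest" pattern can be shown in a self-contained form. This is a sketch with hypothetical names (`TwoWayMergeSketch`, `advance`), using strings instead of KeyValues and plain sorted lists instead of the memstore's skip-list sets.

```java
import java.util.Arrays;
import java.util.Iterator;

class TwoWayMergeSketch {
    private final Iterator<String> kvsetIt;
    private final Iterator<String> snapshotIt;
    private String kvsetNext;
    private String snapshotNext;

    // Both inputs must already be sorted.
    TwoWayMergeSketch(Iterable<String> kvset, Iterable<String> snapshot) {
        kvsetIt = kvset.iterator();
        snapshotIt = snapshot.iterator();
        kvsetNext = advance(kvsetIt);
        snapshotNext = advance(snapshotIt);
    }

    private static String advance(Iterator<String> it) {
        return it.hasNext() ? it.next() : null;
    }

    // Return the lowest pending entry and advance only the iterator it came from.
    String next() {
        if (kvsetNext == null && snapshotNext == null) {
            return null;
        }
        boolean takeKvset = snapshotNext == null
            || (kvsetNext != null && kvsetNext.compareTo(snapshotNext) <= 0);
        String ret;
        if (takeKvset) {
            ret = kvsetNext;
            kvsetNext = advance(kvsetIt);
        } else {
            ret = snapshotNext;
            snapshotNext = advance(snapshotIt);
        }
        return ret;
    }

    public static void main(String[] args) {
        TwoWayMergeSketch scanner = new TwoWayMergeSketch(
            Arrays.asList("b", "d"), Arrays.asList("a", "c", "e"));
        for (String kv = scanner.next(); kv != null; kv = scanner.next()) {
            System.out.println(kv);
        }
    }
}
```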
45. Batch Put
private long doMiniBatchPut(BatchOperationInProgress<…> batchOp) {
  WALEdit walEdit = new WALEdit();
  ...
  MultiVersionConsistencyControl.WriteEntry w = null;
  ...
  try {
    // STEP 1. Try to acquire as many locks as we can
    // STEP 2. Update any LATEST_TIMESTAMP timestamps
    // Begin transaction: acquire the latest mvcc number
    w = mvcc.beginMemstoreInsert();
    // STEP 3. Write back to memstore
    for (int i = firstIndex; i < lastIndexExclusive; i++) {
      addedSize += applyFamilyMapToMemstore(familyMaps[i], w);
    }
    // STEP 4. Build WAL edit
    for (int i = firstIndex; i < lastIndexExclusive; i++) {
      addFamilyMapToWALEdit(familyMaps[i], walEdit);
    }
    ...
49. Memstore changes
  ...
      Store store = getStore(family);
      for (KeyValue kv: edits) {
        // Tag KV with write number (MVCC)
        kv.setMemstoreTS(localizedWriteEntry.getWriteNumber());
        size += store.add(kv);
      }
    }
  } finally {
    if (freemvcc) {
      // This makes the changes visible
      mvcc.completeMemstoreInsert(localizedWriteEntry);
    }
  }
  return size;
}
52. ScanDeleteTracker (checking markers)
public DeleteResult isDeleted(byte[] buffer, int qualifierOffset,
    int qualifierLength, long timestamp) {
  // Family marker case
  if (hasFamilyStamp && timestamp <= familyStamp) {
    return DeleteResult.FAMILY_DELETED;
  }
  if (deleteBuffer != null) {
    int ret = Bytes.compareTo(deleteBuffer, deleteOffset, deleteLength,
        buffer, qualifierOffset, qualifierLength);
    if (ret == 0) {
      // Column marker case
      if (deleteType == KeyValue.Type.DeleteColumn.getCode()) {
        return DeleteResult.COLUMN_DELETED;
      }
      // Version marker case
      // If the timestamp is the same, keep this one
      if (timestamp == deleteTimestamp) {
        return DeleteResult.VERSION_DELETED;
      }
      // different timestamp, let's clear the buffer.
      // (HFiles are scanned newest TS first)
      deleteBuffer = null;
    } else if (ret < 0) {
      // Next column case.
      deleteBuffer = null;
    } else {
      throw new IllegalStateException(...);
    }
  }
  return DeleteResult.NOT_DELETED;
}
53. Questions?
Comments?
More details on http://hadoop-hbase.blogspot.com