
HBaseCon 2012 | Learning HBase Internals - Lars Hofhansl, Salesforce

The strength of an open source project resides entirely in its developer community; a strong democratic culture of participation and hacking makes for a better piece of software. The key requirement is having developers who are not only willing to contribute, but also knowledgeable about the project’s internal structure and architecture. This session will introduce developers to the core internal architectural concepts of HBase, not just “what” it does from the outside, but “how” it works internally, and “why” it does things a certain way. We’ll walk through key sections of code and discuss key concepts like the MVCC implementation and memstore organization. The goal is to convert serious “HBase Users” into “HBase Developer Users”, and give voice to some of the deep knowledge locked in the committers’ heads.

  1. 1. HBase Internals Lars Hofhansl Architect @ Salesforce.com HBase Committer
  2. 2. HBase Internals A.K.A. Get ready to:
  3. 3. Agenda • Overview • Scanning • Atomicity/MVCC • Updates – Put/Delete • Code!
  4. 4. A sparse, consistent, distributed, multi-dimensional, persistent, sorted map.
  5. 5. In the end it comes down to a sorting problem. How do you sort 1PB with 16GB of memory? Example: split 756329184 into chunks (756, 329, 184), sort each chunk in memory (567, 239, 148), then merge the sorted chunks into 123456789.
  6. 6. Overview • All changes collected/sorted in memory • Periodically flushed to HDFS – Writes to disk are batched • HFiles periodically compacted into larger files • Scanning/Compacting: Merge Sort
  7. 7. HDFS Directory Hierarchy (storage is per CF)
      /hbase
        /-ROOT-
        /.META.
        /.logs
        /<table1>
          /<region>
            [/.recovered_logs]
            /<column family>
              /<HFile>
              /...
              /...
            /<column family>
              /...
        /<table2>
          ...
  8. 8. Scanning • "Log Structured Mergetrees" • Multiway mergesort • Efficient scanning/seeking (diagram: a Region contains CFs, each CF contains HFiles)
  9. 9. KeyValueHeap • Maintains PriorityQueue of “lower” Scanners • TopScanner in the Queue has the next KV
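
The heap-of-scanners idea can be sketched in a few lines of plain Java. This is a minimal illustration under stated assumptions, not HBase's KeyValueHeap: `SortedScanner`, `ScannerHeap` and `mergeAll` are hypothetical names, and keys are plain strings rather than KeyValues. Each lower scanner exposes the key it would return next, and the heap always serves the scanner with the smallest such key.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

class MergeHeapSketch {
    /** Hypothetical stand-in for a "lower" scanner over one sorted source. */
    static final class SortedScanner {
        private final Iterator<String> it;
        private String current;

        SortedScanner(List<String> sorted) {
            this.it = sorted.iterator();
            this.current = it.hasNext() ? it.next() : null;
        }

        String peek() { return current; }

        String next() {
            String ret = current;
            current = it.hasNext() ? it.next() : null;
            return ret;
        }
    }

    /** Heap of scanners, ordered by the key each one would return next. */
    static final class ScannerHeap {
        private final PriorityQueue<SortedScanner> heap =
            new PriorityQueue<>((a, b) -> a.peek().compareTo(b.peek()));

        ScannerHeap(List<SortedScanner> scanners) {
            for (SortedScanner s : scanners) {
                if (s.peek() != null) heap.add(s); // exhausted scanners are dropped
            }
        }

        String next() {
            SortedScanner top = heap.poll();       // top scanner holds the next key
            if (top == null) return null;
            String ret = top.next();
            if (top.peek() != null) heap.add(top); // re-insert under its new head key
            return ret;
        }
    }

    /** Multiway merge of already sorted sources. */
    static List<String> mergeAll(List<List<String>> sources) {
        List<SortedScanner> scanners = new ArrayList<>();
        for (List<String> src : sources) scanners.add(new SortedScanner(src));
        ScannerHeap heap = new ScannerHeap(scanners);
        List<String> out = new ArrayList<>();
        for (String k = heap.next(); k != null; k = heap.next()) out.add(k);
        return out;
    }
}
```

Because a scanner is removed before it advances and re-inserted afterwards, the heap never holds an entry whose ordering key changed underneath it.
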
  10. 10. Scanning • "Log Structured Mergetrees" • Multiway mergesort • Efficient scanning/seeking (diagram: a Region contains CFs, each CF contains HFiles)
  11. 11. Updates • All changes are written to the WAL first • All changes are written to memory (the "MemStore") • MemStores are flushed to disk (creating a new HFile) • HFiles are periodically and asynchronously compacted into fewer files. • HFiles are immutable
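
The update flow on this slide (log first, then memory, flush to immutable files, compact later) can be sketched as a toy store. This is an illustration only: `TinyLsmSketch` and its members are hypothetical, the "WAL" and "HFiles" are in-memory collections, and each key keeps a single value, whereas HBase keeps multiple versions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

class TinyLsmSketch {
    final List<String> wal = new ArrayList<>();                  // stand-in for the write-ahead log
    NavigableMap<String, String> memstore = new TreeMap<>();     // sorted in-memory buffer
    final List<NavigableMap<String, String>> files = new ArrayList<>(); // immutable files, oldest first
    final int flushThreshold;

    TinyLsmSketch(int flushThreshold) { this.flushThreshold = flushThreshold; }

    void put(String key, String value) {
        wal.add(key + "=" + value);  // 1. the change goes to the log first
        memstore.put(key, value);    // 2. then into memory
        if (memstore.size() >= flushThreshold) flush();
    }

    /** Turn the current memstore into a new immutable file. */
    void flush() {
        if (memstore.isEmpty()) return;
        files.add(Collections.unmodifiableNavigableMap(memstore));
        memstore = new TreeMap<>();
    }

    /** Merge all files into one; values from newer files win. */
    void compact() {
        NavigableMap<String, String> merged = new TreeMap<>();
        for (NavigableMap<String, String> f : files) merged.putAll(f);
        files.clear();
        if (!merged.isEmpty()) files.add(Collections.unmodifiableNavigableMap(merged));
    }

    /** Newest data wins: check memory first, then files from newest to oldest. */
    String get(String key) {
        if (memstore.containsKey(key)) return memstore.get(key);
        for (int i = files.size() - 1; i >= 0; i--) {
            String v = files.get(i).get(key);
            if (v != null) return v;
        }
        return null;
    }
}
```

Files are never modified after a flush; compaction only replaces several files with one merged file.
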
  12. 12. ROW Atomicity • Snapshot isolation and locking (per row) o Row is locked while memory is changed o Each row-update "creates" a new snapshot o Each row-read sees exactly one snapshot • Done with MultiVersionConcurrencyControl (MVCC)
  13. 13. Locking • All KVs for a “row” are co-located in a region • Locks are per row • Stored in-memory at region server
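
Per-row, in-memory locks can be sketched with a map from row key to lock. This is an assumption-laden illustration, not the region server's implementation: `RowLockSketch` and its methods are hypothetical, and the sketch never removes lock objects from the map, which a real server would have to do.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

class RowLockSketch {
    // One lock object per row key, created on first use and kept in memory only.
    private final ConcurrentHashMap<String, ReentrantLock> locks = new ConcurrentHashMap<>();

    /** Block until the row is locked by the calling thread. */
    void lockRow(String rowKey) {
        locks.computeIfAbsent(rowKey, k -> new ReentrantLock()).lock();
    }

    /** Release a lock previously taken by the calling thread. */
    void unlockRow(String rowKey) {
        ReentrantLock lock = locks.get(rowKey);
        if (lock != null && lock.isHeldByCurrentThread()) lock.unlock();
    }

    boolean isLocked(String rowKey) {
        ReentrantLock lock = locks.get(rowKey);
        return lock != null && lock.isLocked();
    }
}
```

Since all KVs of a row live in one region, one in-memory lock per row key is enough to serialize writers to that row.
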
  14. 14. MultiVersionConcurrencyControl Wikipedia: "implement updates not by deleting an old piece of data and overwriting it with a new one, but instead by marking the old data as obsolete and adding the newer version" Note that HBase calls it MultiVersionConsistencyControl
  15. 15. MVCC writing • Each write gets a monotonically increasing "writenumber" • Each KV written is tagged with this writenumber (called memstoreTS in HBase) • "Committing" a writenumber: – wait for all prior writes to finish – set the current readpoint to the writenumber
  16. 16. MVCC reading • Each read reads as of a "readpoint" – Filters KVs with a newer memstoreTS • This is per regionserver (cheap in memory data structures) – if regionserver dies, current read is lost anyway • but... "writenumbers" are persisted to disk – for scanners that outlive a Memstore flush
  17. 17. MVCC • Readers do not lock • Transactions are committed strictly serially • HBase has no client demarcated transactions – A transaction does not outlive an RPC
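
The writenumber/readpoint protocol from the three MVCC slides can be sketched as follows. This is a simplified illustration, not HBase's MultiVersionConsistencyControl: `MvccSketch` and its members are hypothetical, and `complete` does not block the caller until its own write becomes visible, it only advances the readpoint over the completed prefix of the write queue.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class MvccSketch {
    static final class WriteEntry {
        final long writeNumber;
        boolean completed;
        WriteEntry(long writeNumber) { this.writeNumber = writeNumber; }
    }

    static final class Cell {
        final String value;
        final long memstoreTS; // the writenumber this cell was tagged with
        Cell(String value, long memstoreTS) { this.value = value; this.memstoreTS = memstoreTS; }
    }

    private long lastWriteNumber = 0;
    private long readPoint = 0;
    private final Deque<WriteEntry> writeQueue = new ArrayDeque<>();
    private final List<Cell> cells = new ArrayList<>();

    /** Each write gets a monotonically increasing writenumber. */
    synchronized WriteEntry begin() {
        WriteEntry e = new WriteEntry(++lastWriteNumber);
        writeQueue.add(e);
        return e;
    }

    /** Every cell written is tagged with the writenumber of its transaction. */
    synchronized void write(WriteEntry e, String value) {
        cells.add(new Cell(value, e.writeNumber));
    }

    /** The readpoint only moves past a write once all prior writes have completed. */
    synchronized void complete(WriteEntry e) {
        e.completed = true;
        while (!writeQueue.isEmpty() && writeQueue.peekFirst().completed) {
            readPoint = writeQueue.removeFirst().writeNumber;
        }
    }

    synchronized long readPoint() { return readPoint; }

    /** Readers take no row locks; they filter out cells newer than their readpoint. */
    synchronized List<String> read(long asOfReadPoint) {
        List<String> out = new ArrayList<>();
        for (Cell c : cells) {
            if (c.memstoreTS <= asOfReadPoint) out.add(c.value);
        }
        return out;
    }
}
```

If the second of two writes completes first, the readpoint stays put until the first one completes as well, which keeps commits strictly serial from a reader's point of view.
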
  18. 18. Anatomy of an "update" – Acquire a MVCC "writenumber" – Lock the "row" (the row-key) – Tag all KVs with the writenumber – Write change to the WAL and sync to file system – Apply update in memory ("Memstore") – Unlock "row" – Roll MVCC forward -> now change is visible – If Memstore reaches configurable size, take a snapshot, flush it to disk (asynchronously)
  19. 19. Anatomy of an "update" Puts are optimized – Acquire a MVCC "writenumber" – Lock the "row" (the row-key) – Tag all KVs with the writenumber – Write change to the WAL and sync to file system – Apply update in memory ("Memstore") – Unlock "row" o 5.5 sync WAL to HDFS without the row lock held  If that fails undo changes in Memstore  that works because changes are not visible, yet – Roll MVCC forward -> now change is visible – If Memstore reaches configurable size, take a snapshot, flush it to disk (asynchronously)
  20. 20. Anatomy of an "update" Puts are optimized, and batched – Acquire a MVCC "writenumber" – lock as many "rows" as possible – Tag all KVs with the writenumber – write all changes to the WAL and sync to file system – apply all updates in memory ("Memstore") – unlock all "rows" o 5.5 sync WAL to HDFS without the row locks held  If that fails undo changes in Memstore  that works because changes are not visible, yet – roll MVCC forward -> now changes are visible – if Memstore reaches configurable size, take a snapshot, flush it to disk (asynchronously)
  21. 21. Undo • In-memory only • Changes are not visible until MVCC is rolled • Changes are tagged with the writenumber • Undo removes KVs tagged with the writenumber
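
Undo by writenumber can be sketched on top of a sorted in-memory set. This is an illustration under assumptions: `UndoSketch` and `Cell` are hypothetical names, and the set merely stands in for the Memstore. Because the readpoint has not been rolled forward yet, no reader could have seen the cells that are removed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentSkipListSet;

class UndoSketch {
    static final class Cell implements Comparable<Cell> {
        final String key;
        final long writeNumber;

        Cell(String key, long writeNumber) { this.key = key; this.writeNumber = writeNumber; }

        @Override
        public int compareTo(Cell o) {
            int c = key.compareTo(o.key);
            // Same key: the newer write sorts first.
            return c != 0 ? c : Long.compare(o.writeNumber, writeNumber);
        }
    }

    final ConcurrentSkipListSet<Cell> memstore = new ConcurrentSkipListSet<>();

    void add(String key, long writeNumber) {
        memstore.add(new Cell(key, writeNumber));
    }

    /** Remove every cell tagged with the failed writenumber; returns how many were removed. */
    int rollback(long writeNumber) {
        int before = memstore.size();
        memstore.removeIf(c -> c.writeNumber == writeNumber);
        return before - memstore.size();
    }

    /** Keys visible as of a readpoint. */
    List<String> visibleKeys(long readPoint) {
        List<String> out = new ArrayList<>();
        for (Cell c : memstore) {
            if (c.writeNumber <= readPoint) out.add(c.key);
        }
        return out;
    }
}
```
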
  22. 22. Deletes • Nothing is deleted in place. • a Delete sets a "tombstone marker" with a timestamp • upon compaction deleted KeyValues are removed • upon major compactions the tombstones are removed • Three different tombstone marker scopes (all within a row) o version - mark a specific version of a column as deleted o column - mark all versions of a column as deleted o column family - mark all versions of all columns of a column family as deleted
  23. 23. Deletes, cont • Delete markers always sort before KVs • A scanner remembers markers it encounters • Each KV is checked against remembered markers • Only one pass is required for scanning • Deletes are just KVs stored in HFiles
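
The one-pass check against remembered markers can be sketched for a single row and column family. This is a simplified illustration, not HBase's ScanDeleteTracker: `DeleteScanSketch`, `Cell` and `Type` are hypothetical, the family marker is modelled with an empty qualifier so that it sorts first, and cells are ordered by qualifier, then newest timestamp first, with markers ahead of puts at the same coordinates.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class DeleteScanSketch {
    // Declaration order doubles as sort order at equal qualifier and timestamp.
    enum Type { DELETE_FAMILY, DELETE_COLUMN, DELETE_VERSION, PUT }

    static final class Cell {
        final String qualifier; final long ts; final Type type; final String value;
        Cell(String qualifier, long ts, Type type, String value) {
            this.qualifier = qualifier; this.ts = ts; this.type = type; this.value = value;
        }
    }

    static final Comparator<Cell> ORDER = (a, b) -> {
        int c = a.qualifier.compareTo(b.qualifier);
        if (c != 0) return c;
        c = Long.compare(b.ts, a.ts);              // newest timestamp first
        if (c != 0) return c;
        return a.type.compareTo(b.type);           // markers before puts
    };

    /** Returns the values of all puts in the row that are not covered by a marker. */
    static List<String> visible(List<Cell> rowCells) {
        List<Cell> sorted = new ArrayList<>(rowCells);
        sorted.sort(ORDER);
        long familyTs = Long.MIN_VALUE;            // remembered family marker
        String colQ = null; long colTs = 0;        // remembered column marker
        String verQ = null; long verTs = 0;        // remembered version marker
        List<String> out = new ArrayList<>();
        for (Cell c : sorted) {
            switch (c.type) {
                case DELETE_FAMILY:
                    familyTs = Math.max(familyTs, c.ts);
                    break;
                case DELETE_COLUMN:
                    // The first marker seen for a column is the newest one; keep it.
                    if (!c.qualifier.equals(colQ)) { colQ = c.qualifier; colTs = c.ts; }
                    break;
                case DELETE_VERSION:
                    verQ = c.qualifier; verTs = c.ts;
                    break;
                default:
                    boolean deleted = c.ts <= familyTs
                        || (c.qualifier.equals(colQ) && c.ts <= colTs)
                        || (c.qualifier.equals(verQ) && c.ts == verTs);
                    if (!deleted) out.add(c.value);
            }
        }
        return out;
    }
}
```

Since a marker is always encountered before the puts it covers, the scanner needs to remember only a handful of markers and never has to look back.
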
  24. 24. Let’s look at some code!
  25. 25. MVCC
  26. 26. MultiVersionConsistencyControl.java
      // Acquire a new writenumber
      public WriteEntry beginMemstoreInsert() {
        synchronized (writeQueue) {
          long nextWriteNumber = ++memstoreWrite;
          WriteEntry e = new WriteEntry(nextWriteNumber);
          writeQueue.add(e);
          return e;
        }
      }

      public void completeMemstoreInsert(WriteEntry e) {
        advanceMemstore(e);  // roll forward the readpoint
        waitForRead(e);      // wait for prior transactions to finish
      }
  27. 27. MultiVersionConsistencyControl.java
      boolean advanceMemstore(WriteEntry e) {
        synchronized (writeQueue) {
          e.markCompleted();
          long nextReadValue = -1;
          // Roll forward completed transactions. Ordering is preserved.
          while (!writeQueue.isEmpty()) {
            WriteEntry queueFirst = writeQueue.getFirst();
            ...
            if (queueFirst.isCompleted()) {
              nextReadValue = queueFirst.getWriteNumber();
              writeQueue.removeFirst();
            } else {
              break;
            }
          }
          ...
  28. 28. MultiVersionConsistencyControl.java
          ...
          if (nextReadValue > 0) {
            synchronized (readWaiters) {
              memstoreRead = nextReadValue;
              readWaiters.notifyAll();  // notify later transactions
            }
          }
          if (memstoreRead >= e.getWriteNumber()) {
            return true;
          }
          return false;
        }
      }
  29. 29. MultiVersionConsistencyControl.java
      // Wait until the write entry was applied.
      public void waitForRead(WriteEntry e) {
        boolean interrupted = false;
        synchronized (readWaiters) {
          while (memstoreRead < e.getWriteNumber()) {
            try {
              readWaiters.wait(0);
            } catch (InterruptedException ie) {
              interrupted = true;
            }
          }
        }
        if (interrupted) Thread.currentThread().interrupt();
      }
  30. 30. Scanning
  31. 31. RegionScanner, creation
      RegionScannerImpl(Scan scan,
          List<KeyValueScanner> additionalScanners) {
        ...
        // MVCC protocol
        IsolationLevel isolationLevel = scan.getIsolationLevel();
        synchronized(scannerReadPoints) {
          if (isolationLevel == IsolationLevel.READ_UNCOMMITTED) {
            // This scan can read even uncommitted transactions
            this.readPt = Long.MAX_VALUE;
            MVCC.setThreadReadPoint(this.readPt);
          } else {
            this.readPt = MVCC.resetThreadReadPoint(mvcc);
          }
          scannerReadPoints.put(this, this.readPt);
        }
        ...
  32. 32. RegionScanner
        ...
        // Get all StoreScanners
        for (Map.Entry<byte[], NavigableSet<byte[]>> entry :
            scan.getFamilyMap().entrySet()) {
          Store store = stores.get(entry.getKey());
          StoreScanner scanner = store.getScanner(...);
          scanners.add(scanner);
        }
        // Heap of StoreScanners
        this.storeHeap = new KeyValueHeap(scanners, comparator);
      }
  33. 33. RegionScanner, cont.
      public synchronized boolean next(
          List<KeyValue> outResults, int limit) {
        ...
        // MVCC
        MVCC.setThreadReadPoint(this.readPt);
        boolean returnResult = nextInternal(limit);
        ...
      }

      private boolean nextInternal(int limit) throws IOException {
        if (isStopRow(currentRow)) {
          ...
          return false;
        } else {
          // Get next KV from StoreScanners
          byte [] nextRow;
          do {
            this.storeHeap.next(results, limit - results.size());
          } while (Bytes.equals(currentRow, nextRow = peekRow()));
          final boolean stopRow = isStopRow(nextRow);
        }
        ...
      }
  34. 34. StoreScanner
      public synchronized boolean next(…) {
        ...
        // this.heap: heap of Memstore/StoreFile scanners
        LOOP: while((kv = this.heap.peek()) != null) {
          // What to do with the KV: versions, TTL, deletes
          ScanQueryMatcher.MatchCode qcode = matcher.match(kv);
          switch(qcode) {
            case INCLUDE:
              results.add(kv);
              this.heap.next();
              ...
              continue;
            case DONE_SCAN: ...
            case SEEK_NEXT_ROW:
              reseek(matcher.getKeyForNextRow(kv));
              break;
            case SEEK_NEXT_COL: ...
            case SKIP:
              this.heap.next();
              break;
            case SEEK_NEXT_USING_HINT:
              KeyValue nextKV = matcher.getNextKeyHint(kv);
              ...
              reseek(nextKV);
              break;
            ...
          }
        }
      }
  35. 35. Seek Hints • Allow skipping many KVs without “touching” • Seek-to(KV) instead of skip, skip, skip, … • Used internally (for deletes, skipping older versions, TTL) • Used by filters
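
The difference between skipping and seeking can be sketched with a sorted set of keys. This is an illustration only: `SeekHintSketch` is a hypothetical name, keys are assumed to have the form `row:column` with no `:` inside the row name, and the task is to return the first key of every row. The seek variant builds a hint that sorts after every key of the current row and jumps straight to it.

```java
import java.util.List;
import java.util.NavigableSet;

class SeekHintSketch {
    static String rowOf(String key) {
        return key.substring(0, key.indexOf(':'));
    }

    /** Skip, skip, skip: every key is touched. Returns the number of keys touched. */
    static int scanBySkipping(NavigableSet<String> keys, List<String> out) {
        int touched = 0;
        String lastRow = null;
        for (String key : keys) {
            touched++;
            String row = rowOf(key);
            if (!row.equals(lastRow)) {
                out.add(key);
                lastRow = row;
            }
        }
        return touched;
    }

    /** Seek-to(hint): only the first key of each row is touched. */
    static int scanBySeeking(NavigableSet<String> keys, List<String> out) {
        int touched = 0;
        String key = keys.isEmpty() ? null : keys.first();
        while (key != null) {
            touched++;
            out.add(key);
            // ';' is the character right after ':', so this hint sorts
            // after every "row:*" key and before the next row's keys.
            String hint = rowOf(key) + ";";
            key = keys.ceiling(hint);
        }
        return touched;
    }
}
```

Both variants return the same keys; the seek variant simply lets the sorted structure do the skipping, which is what makes hints worthwhile when many KVs can be passed over.
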
  36. 36. Memstore Scanner
      public synchronized KeyValue next() {
        if (theNext == null) {
          return null;
        }
        final KeyValue ret = theNext;
        // Advance one of the iterators
        if (theNext == kvsetNextRow) {
          kvsetNextRow = getNext(kvsetIt);        // KeyValueSkipListSet.iterator()
        } else {
          snapshotNextRow = getNext(snapshotIt);  // snapshot during flushes
        }
        // Calculate the next value
        theNext = getLowest(kvsetNextRow, snapshotNextRow);
        return ret;
      }
  37. 37. Memstore Scanner
      protected KeyValue getNext(Iterator<KeyValue> it) {
        long readPoint = MVCC.getThreadReadPoint();
        while (it.hasNext()) {
          KeyValue v = it.next();
          // MVCC
          if (v.getMemstoreTS() <= readPoint) {
            return v;
          }
        }
        return null;
      }
  38. 38. Memstore Scanner
      // For seek find the right tailSet.
      // seekInSubLists() is almost identical to next().
      @Override
      public synchronized boolean seek(KeyValue key) {
        ...
        // kvset and snapshot will never be null.
        // if tailSet can't find anything, SortedSet is empty (not null).
        kvTail = kvsetAtCreation.tailSet(key);
        snapshotTail = snapshotAtCreation.tailSet(key);
        return seekInSubLists(key);
      }
  39. 39. StoreFileScanner
      private final HFileScanner hfs;
      private KeyValue cur = null;
      ...
      public KeyValue next() throws IOException {
        KeyValue retKey = cur;
        try {
          // only seek if we aren't at the end.
          // cur == null implies 'end'.
          if (cur != null) {
            hfs.next();
            cur = hfs.getKeyValue();
            skipKVsNewerThanReadpoint();
          }
        } catch(IOException e) {
          throw new IOException("Could not iterate " + this, e);
        }
        return retKey;
      }
  40. 40. HFileScanner/Reader
      @Override
      public boolean next() throws IOException {
        ...
        blockBuffer.position(...);
        ...
        // Still on current block?
        if (blockBuffer.remaining() <= 0) {
          long lastDataBlockOffset =
              reader.getTrailer().getLastDataBlockOffset();
          // If not, read the next block
          HFileBlock nextBlock = readNextDataBlock();
          if (nextBlock == null) {
            return false;
          }
          updateCurrBlock(nextBlock);
          return true;
        }
        // We are still in the same block.
        // Mark the next KV in the buffer
        readKeyValueLen();
        return true;
      }
  41. 41. Puts
  42. 42. Batch Put
      private long doMiniBatchPut(BatchOperationInProgress<…> batchOp){
        WALEdit walEdit = new WALEdit();
        ...
        MultiVersionConsistencyControl.WriteEntry w = null;
        ...
        try {
          // STEP 1. Try to acquire as many locks as we can
          // STEP 2. Update any LATEST_TIMESTAMP timestamps
          // Acquire the latest mvcc number (begin transaction)
          w = mvcc.beginMemstoreInsert();
          // STEP 3. Write back to memstore
          for (int i = firstIndex; i < lastIndexExclusive; i++) {
            addedSize += applyFamilyMapToMemstore(familyMaps[i], w);
          }
          // STEP 4. Build WAL edit
          for (int i = firstIndex; i < lastIndexExclusive; i++) {
            addFamilyMapToWALEdit(familyMaps[i], walEdit);
          }
  43. 43. Batch Put, cont.
          // STEP 5. Append the edit to WAL. Do not sync wal.
          // (write the WAL record but don't sync!)
          txid = this.log.appendNoSync(regionInfo, …, walEdit, …);
          // STEP 6. Release row locks, etc.
          this.updatesLock.readLock().unlock();  // guard against concurrent flushes
          for (Integer toRelease : acquiredLocks) {
            releaseRowLock(toRelease);
          }
          // STEP 7. Sync wal (sync after locks are released).
          this.log.sync(txid);
          walSyncSuccessful = true;
          // STEP 8. Advance mvcc (commit).
          // This will make this put visible to scanners and getters.
          if (w != null) {
            mvcc.completeMemstoreInsert(w);
            w = null;
          }
          ...
          return addedSize;
        }
  44. 44. Batch Put, something went wrong
        } finally {
          if (!walSyncSuccessful) {
            rollbackMemstore(batchOp, familyMaps,
                             firstIndex, lastIndexExclusive);
          }
          // Always complete the transaction!
          if (w != null) mvcc.completeMemstoreInsert(w);
          if (locked) {
            this.updatesLock.readLock().unlock();
          }
          for (Integer toRelease : acquiredLocks) {
            releaseRowLock(toRelease);
          }
          ...
        }
  45. 45. Memstore changes
      // Can pass a write entry that spans multiple calls
      private long applyFamilyMapToMemstore(
          Map<byte[], List<KeyValue>> familyMap,
          WriteEntry localizedWriteEntry) {
        long size = 0;
        boolean freemvcc = false;
        try {
          if (localizedWriteEntry == null) {
            // This begins the transaction (MVCC)
            localizedWriteEntry = mvcc.beginMemstoreInsert();
            freemvcc = true;
          }
          for (Map.Entry<…> e : familyMap.entrySet()) {
            byte[] family = e.getKey();
            List<KeyValue> edits = e.getValue();
            ...
  46. 46. Memstore changes
            ...
            Store store = getStore(family);
            for (KeyValue kv: edits) {
              // Tag KV with write number (MVCC)
              kv.setMemstoreTS(localizedWriteEntry.getWriteNumber());
              size += store.add(kv);
            }
          }
        } finally {
          if (freemvcc) {
            // This makes the changes visible
            mvcc.completeMemstoreInsert(localizedWriteEntry);
          }
        }
        return size;
      }
  47. 47. Deletes
  48. 48. ScanDeleteTracker (checking markers)
      public void add(buffer, qualifierOffset, qualifierLength, ts, type) {
        if (!hasFamilyStamp || ts > familyStamp) {
          if (type == KeyValue.Type.DeleteFamily.getCode()) {
            // Only remember TS for family deletes
            hasFamilyStamp = true;
            familyStamp = ts;
            return;
          }
          if (deleteBuffer != null && type < deleteType) {
            // same column, so ignore less specific delete
            // (a version delete marker can be ignored if there is a column marker already)
            if (Bytes.equals(deleteBuffer, deleteOffset, deleteLength,
                buffer, qualifierOffset, qualifierLength)){
              return;
            }
          }
          // new column, or more general delete type: remember the KV
          deleteBuffer = buffer;
          deleteOffset = qualifierOffset;
          deleteLength = qualifierLength;
          deleteType = type;
          deleteTimestamp = ts;
        }
      }
  49. 49. ScanDeleteTracker (checking markers)
      public DeleteResult isDeleted(buffer, qualifierOffset, qualifierLength,
          timestamp) {
        // Family marker case
        if (hasFamilyStamp && timestamp <= familyStamp) {
          return DeleteResult.FAMILY_DELETED;
        }
        if (deleteBuffer != null) {
          int ret = Bytes.compareTo(deleteBuffer, deleteOffset, deleteLength,
              buffer, qualifierOffset, qualifierLength);
          if (ret == 0) {
            // Column marker case
            if (deleteType == KeyValue.Type.DeleteColumn.getCode()) {
              return DeleteResult.COLUMN_DELETED;
            }
            // Version marker case
            // If the timestamp is the same, keep this one
            if (timestamp == deleteTimestamp) {
              return DeleteResult.VERSION_DELETED;
            }
            // different timestamp, let's clear the buffer.
            // (HFiles are scanned newest TS first)
            deleteBuffer = null;
          } else if(ret < 0){
            // Next column case.
            deleteBuffer = null;
          } else {
            throw new IllegalStateException(...);
          }
        }
        return DeleteResult.NOT_DELETED;
      }
  50. 50. Questions? Comments? More details on http://hadoop-hbase.blogspot.com
