HBaseCon 2012 | Learning HBase Internals - Lars Hofhansl, Salesforce





The strength of an open source project resides entirely in its developer community; a strong democratic culture of participation and hacking makes for a better piece of software. The key requirement is having developers who are not only willing to contribute, but also knowledgeable about the project’s internal structure and architecture. This session will introduce developers to the core internal architectural concepts of HBase, not just “what” it does from the outside, but “how” it works internally, and “why” it does things a certain way. We’ll walk through key sections of code and discuss key concepts like the MVCC implementation and memstore organization. The goal is to convert serious “HBase Users” into “HBase Developer Users”, and give voice to some of the deep knowledge locked in the committers’ heads.




Presentation Transcript

  • HBase Internals
    Lars Hofhansl
    Architect @ Salesforce.com
    HBase Committer
  • HBase Internals A.K.A. Get ready to:
  • Agenda
    • Overview
    • Scanning
    • Atomicity/MVCC
    • Updates
      – Put/Delete
    • Code!
  • A sparse, consistent, distributed, multi-dimensional, persistent, sorted map.
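
To make "multi-dimensional sorted map" concrete, here is a minimal in-memory sketch, assuming string keys and values for readability. The class and method names are hypothetical and not HBase APIs; distribution and persistence are left out entirely.

```java
import java.util.Comparator;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical model of one table: row -> column -> timestamp -> value.
class SortedMapSketch {
    // Rows and columns sort ascending; timestamps sort descending,
    // so the newest version of a cell is always the first entry.
    private final NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> rows =
        new TreeMap<>();

    void put(String row, String column, long timestamp, String value) {
        rows.computeIfAbsent(row, r -> new TreeMap<>())
            .computeIfAbsent(column, c -> new TreeMap<>(Comparator.<Long>reverseOrder()))
            .put(timestamp, value);
    }

    // Newest value of a cell, or null if the cell was never written (the map is sparse).
    String getLatest(String row, String column) {
        NavigableMap<String, NavigableMap<Long, String>> columns = rows.get(row);
        if (columns == null) {
            return null;
        }
        NavigableMap<Long, String> versions = columns.get(column);
        return versions == null ? null : versions.firstEntry().getValue();
    }

    // First row key at or after the given key; sorted keys are what make range scans possible.
    String firstRowAtOrAfter(String row) {
        return rows.ceilingKey(row);
    }
}
```

Only cells that were written occupy space, and every dimension (row, column, version) is kept in sorted order.
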
  • In the end it comes down to a sorting problem.
    How do you sort 1PB with 16GB of memory?
    756329184
    756 329 184
    567 239 148
    123456789
  • Overview
    • All changes collected/sorted in memory
    • Periodically flushed to HDFS
      – Writes to disk are batched
    • HFiles periodically compacted into larger files
    • Scanning/Compacting: Merge Sort
  • HDFS Directory Hierarchy
    /hbase
      /-ROOT-
      /.META.
      /.logs
      /<table1>
        /<region>
          [/.recovered_logs]
          /<column family>
            /<HFile>        (storage is per CF)
            /...
            /...
          /<column family>
            /...
      /<table2>
        ...
  • Scanning
    • "Log Structured Mergetrees"
    • Multiway mergesort
    • Efficient scanning/seeking
    /Region
     |
     +--/CF
         |
         +--/HFile
  • KeyValueHeap
    • Maintains PriorityQueue of “lower” Scanners
    • Top Scanner in the Queue has the next KV
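
The heap-of-scanners idea can be sketched as a plain k-way merge. This is a simplified illustration over sorted integer runs, not the KeyValueHeap source; `MergeHeapSketch` and `PeekingScanner` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

class MergeHeapSketch {
    // Wraps one sorted source and exposes its current head, like a "lower" scanner.
    static final class PeekingScanner implements Comparable<PeekingScanner> {
        private final Iterator<Integer> source;
        private Integer head;

        PeekingScanner(Iterator<Integer> source) {
            this.source = source;
            this.head = source.hasNext() ? source.next() : null;
        }

        Integer peek() {
            return head;
        }

        Integer next() {
            Integer result = head;
            head = source.hasNext() ? source.next() : null;
            return result;
        }

        // Scanners are ordered by their current head; exhausted scanners never enter the heap.
        @Override
        public int compareTo(PeekingScanner other) {
            return Integer.compare(head, other.head);
        }
    }

    static List<Integer> merge(List<List<Integer>> sortedRuns) {
        PriorityQueue<PeekingScanner> heap = new PriorityQueue<>();
        for (List<Integer> run : sortedRuns) {
            PeekingScanner scanner = new PeekingScanner(run.iterator());
            if (scanner.peek() != null) {
                heap.add(scanner);
            }
        }
        List<Integer> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            PeekingScanner top = heap.poll();   // the top scanner holds the next smallest element
            merged.add(top.next());
            if (top.peek() != null) {
                heap.add(top);                  // re-insert so it is re-ordered by its new head
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // The three sorted runs from the sorting slide: 567, 239, 148.
        System.out.println(merge(Arrays.asList(
            Arrays.asList(5, 6, 7), Arrays.asList(2, 3, 9), Arrays.asList(1, 4, 8))));
    }
}
```

Each step costs O(log k) for k runs, and only one element per run has to be held in memory at a time.
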
  • Updates
    • All changes are written to the WAL first
    • All changes are written to memory (the "MemStore")
    • MemStores are flushed to disk (creating a new HFile)
    • HFiles are periodically and asynchronously compacted into fewer files.
    • HFiles are immutable
  • ROW Atomicity
    • Snapshot isolation and locking (per row)
      o Row is locked while memory is changed
      o Each row-update "creates" a new snapshot
      o Each row-read sees exactly one snapshot
    • Done with MultiVersionConcurrencyControl (MVCC)
  • Locking
    • All KVs for a “row” are co-located in a region
    • Locks are per row
    • Stored in-memory at region server
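
A per-row, in-memory lock table could look roughly like the sketch below. It is an assumption-laden illustration: `RowLockSketch` is a hypothetical class, rows are keyed by string, and entries are never evicted, which a real region server would have to handle.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical in-memory lock table: one lock per row key, nothing persisted.
class RowLockSketch {
    private final ConcurrentHashMap<String, ReentrantLock> locks = new ConcurrentHashMap<>();

    // Blocks until the calling thread holds the lock for this row key.
    ReentrantLock lockRow(String rowKey) {
        ReentrantLock lock = locks.computeIfAbsent(rowKey, k -> new ReentrantLock());
        lock.lock();
        return lock;
    }

    void unlockRow(ReentrantLock lock) {
        lock.unlock();
    }

    // True if some thread currently holds the lock for this row key.
    boolean isLocked(String rowKey) {
        ReentrantLock lock = locks.get(rowKey);
        return lock != null && lock.isLocked();
    }
}
```

Because all KVs of a row live in one region, a lock held in that region server's memory is enough to serialize writers to the row.
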
  • MultiVersionConcurrencyControl
    Wikipedia: "implement updates not by deleting an old piece of data and overwriting it with a new one, but instead by marking the old data as obsolete and adding the newer version"
    Note that HBase calls it MultiVersionConsistencyControl
  • MVCC writing
    • Each write gets a monotonically increasing "writenumber"
    • Each KV written is tagged with this writenumber (called memstoreTS in HBase)
    • "Committing" a writenumber:
      – wait for all prior writes to finish
      – set the current readpoint to the writenumber
  • MVCC reading
    • Each read reads as of a "readpoint"
      – Filters KVs with a newer memstoreTS
    • This is per regionserver (cheap in-memory data structures)
      – if regionserver dies, current read is lost anyway
    • but... "writenumbers" are persisted to disk
      – for scanners that outlive a Memstore flush
  • MVCC
    • Readers do not lock
    • Transactions are committed strictly serially
    • HBase has no client-demarcated transactions
      – A transaction does not outlive an RPC
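
The interplay of writenumbers and the readpoint can be sketched in a few lines. This is a simplified model, not the HBase class: `MvccSketch` and its members are hypothetical, and unlike the real commit path, `complete` here does not block until the read point reaches the write; it only advances the read point over the completed prefix of the queue.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class MvccSketch {
    static final class Write {
        final long number;
        boolean completed;

        Write(long number) {
            this.number = number;
        }
    }

    static final class Cell {
        final String value;
        final long writeNumber;   // plays the role of memstoreTS

        Cell(String value, long writeNumber) {
            this.value = value;
            this.writeNumber = writeNumber;
        }
    }

    private final Deque<Write> pending = new ArrayDeque<>();
    private final List<Cell> cells = new ArrayList<>();
    private long nextWrite = 0;
    private long readPoint = 0;

    // Hands out a monotonically increasing write number.
    synchronized Write begin() {
        Write w = new Write(++nextWrite);
        pending.addLast(w);
        return w;
    }

    // Stores a value tagged with the write number; not visible to readers yet.
    synchronized void add(Write w, String value) {
        cells.add(new Cell(value, w.number));
    }

    // Marks w complete, then moves the read point across all leading completed writes.
    synchronized void complete(Write w) {
        w.completed = true;
        while (!pending.isEmpty() && pending.peekFirst().completed) {
            readPoint = pending.removeFirst().number;
        }
    }

    // Readers take no row lock; they filter out cells newer than the read point.
    synchronized List<String> read() {
        List<String> visible = new ArrayList<>();
        for (Cell c : cells) {
            if (c.writeNumber <= readPoint) {
                visible.add(c.value);
            }
        }
        return visible;
    }
}
```

If write 2 completes before write 1, nothing becomes visible until write 1 completes as well, which is what "committed strictly serially" means here.
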
  • Anatomy of an "update"
    – Acquire a MVCC "writenumber"
    – Lock the "row" (the row-key)
    – Tag all KVs with the writenumber
    – Write change to the WAL and sync to file system
    – Apply update in memory ("Memstore")
    – Unlock "row"
    – Roll MVCC forward -> now change is visible
    – If Memstore reaches configurable size, take a snapshot, flush it to disk (asynchronously)
  • Anatomy of an "update": Puts are optimized
    – Acquire a MVCC "writenumber"
    – Lock the "row" (the row-key)
    – Tag all KVs with the writenumber
    – Write change to the WAL and sync to file system
    – Apply update in memory ("Memstore")
    – Unlock "row"
      o 5.5 sync WAL to HDFS without the row lock held
        - If that fails undo changes in Memstore
        - that works because changes are not visible, yet
    – Roll MVCC forward -> now change is visible
    – If Memstore reaches configurable size, take a snapshot, flush it to disk (asynchronously)
  • Anatomy of an "update": Puts are optimized, and batched
    – Acquire a MVCC "writenumber"
    – lock as many "rows" as possible
    – Tag all KVs with the writenumber
    – write all changes to the WAL and sync to file system
    – apply all updates in memory ("Memstore")
    – unlock all "rows"
      o 5.5 sync WAL to HDFS without the row locks held
        - If that fails undo changes in Memstore
        - that works because changes are not visible, yet
    – roll MVCC forward -> now changes are visible
    – if Memstore reaches configurable size, take a snapshot, flush it to disk (asynchronously)
  • Undo
    • In-memory only
    • Changes are not visible until MVCC is rolled
    • Changes are tagged with the writenumber
    • Undo removes KVs tagged with the writenumber
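
A rough sketch of such an undo, assuming a skip-list set as the in-memory store. `UndoSketch` and `TaggedKv` are hypothetical, and scanning the whole set is a simplification for illustration; it is not a claim about how HBase locates the KVs to remove.

```java
import java.util.Iterator;
import java.util.concurrent.ConcurrentSkipListSet;

class UndoSketch {
    static final class TaggedKv implements Comparable<TaggedKv> {
        final String key;
        final long writeNumber;

        TaggedKv(String key, long writeNumber) {
            this.key = key;
            this.writeNumber = writeNumber;
        }

        // Sort by key first, then by write number, so equal keys from different writes coexist.
        @Override
        public int compareTo(TaggedKv other) {
            int byKey = key.compareTo(other.key);
            return byKey != 0 ? byKey : Long.compare(writeNumber, other.writeNumber);
        }
    }

    private final ConcurrentSkipListSet<TaggedKv> memstore = new ConcurrentSkipListSet<>();

    void add(String key, long writeNumber) {
        memstore.add(new TaggedKv(key, writeNumber));
    }

    // Removes every KV tagged with the failed write number and returns how many were removed.
    // Safe only because readers cannot see these KVs before the read point is rolled forward.
    int rollback(long writeNumber) {
        int removed = 0;
        for (Iterator<TaggedKv> it = memstore.iterator(); it.hasNext(); ) {
            if (it.next().writeNumber == writeNumber) {
                it.remove();
                removed++;
            }
        }
        return removed;
    }

    int size() {
        return memstore.size();
    }
}
```
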
  • Deletes
    • Nothing is deleted in place.
    • A Delete sets a "tombstone marker" with a timestamp
    • Upon compaction deleted KeyValues are removed
    • Upon major compactions the tombstones are removed
    • Three different tombstone marker scopes (all within a row)
      o version - mark a specific version of a column as deleted
      o column - mark all versions of a column as deleted
      o column family - mark all versions of all columns of a column family as deleted
  • Deletes, cont.
    • Delete markers always sort before KVs
    • A scanner remembers markers it encounters
    • Each KV is checked against remembered markers
    • Only one pass is required for scanning
    • Deletes are just KVs stored in HFiles
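
Why one pass suffices can be shown with a small sketch for a single column. It assumes the entries arrive newest timestamp first with markers ahead of puts, and that a column marker covers versions at or before its own timestamp. All names are hypothetical, and family markers are omitted.

```java
import java.util.ArrayList;
import java.util.List;

class DeleteScanSketch {
    enum Type { DELETE_COLUMN, DELETE_VERSION, PUT }

    static final class Entry {
        final Type type;
        final long timestamp;
        final String value;   // null for markers

        Entry(Type type, long timestamp, String value) {
            this.type = type;
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    // Input: entries of ONE column, newest first, markers before puts of the same timestamp.
    static List<String> visibleValues(List<Entry> sorted) {
        boolean hasColumnDelete = false;
        long columnDeleteTs = 0;     // covers all versions at or before this timestamp
        boolean hasVersionDelete = false;
        long versionDeleteTs = 0;    // covers exactly this timestamp
        List<String> visible = new ArrayList<>();
        for (Entry e : sorted) {
            switch (e.type) {
                case DELETE_COLUMN:
                    if (!hasColumnDelete || e.timestamp > columnDeleteTs) {
                        columnDeleteTs = e.timestamp;
                    }
                    hasColumnDelete = true;
                    break;
                case DELETE_VERSION:
                    versionDeleteTs = e.timestamp;
                    hasVersionDelete = true;
                    break;
                default:
                    // A put is checked only against markers already seen in this pass.
                    boolean deleted = (hasColumnDelete && e.timestamp <= columnDeleteTs)
                        || (hasVersionDelete && e.timestamp == versionDeleteTs);
                    if (!deleted) {
                        visible.add(e.value);
                    }
            }
        }
        return visible;
    }
}
```

Because a marker always sorts ahead of the KVs it affects, the scanner never has to look back.
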
  • Let’s look at some code!
  • MVCC
  • MultiVersionConsistencyControl.java
    // Acquire a new writenumber
    public WriteEntry beginMemstoreInsert() {
      synchronized (writeQueue) {
        long nextWriteNumber = ++memstoreWrite;
        WriteEntry e = new WriteEntry(nextWriteNumber);
        writeQueue.add(e);
        return e;
      }
    }
    public void completeMemstoreInsert(WriteEntry e) {
      advanceMemstore(e);  // Roll forward the readpoint
      waitForRead(e);      // Wait for prior transactions to finish
    }
  • MultiVersionConsistencyControl.java
    boolean advanceMemstore(WriteEntry e) {
      synchronized (writeQueue) {
        e.markCompleted();
        long nextReadValue = -1;
        // Roll forward completed transactions. Ordering is preserved.
        while (!writeQueue.isEmpty()) {
          WriteEntry queueFirst = writeQueue.getFirst();
          ...
          if (queueFirst.isCompleted()) {
            nextReadValue = queueFirst.getWriteNumber();
            writeQueue.removeFirst();
          } else {
            break;
          }
        }
        ...
  • MultiVersionConsistencyControl.java
        ...
        if (nextReadValue > 0) {
          synchronized (readWaiters) {
            memstoreRead = nextReadValue;
            readWaiters.notifyAll();  // Notify later transactions
          }
        }
        if (memstoreRead >= e.getWriteNumber()) {
          return true;
        }
        return false;
      }
    }
  • MultiVersionConsistencyControl.java
    // Wait until write entry was applied.
    public void waitForRead(WriteEntry e) {
      boolean interrupted = false;
      synchronized (readWaiters) {
        while (memstoreRead < e.getWriteNumber()) {
          try {
            readWaiters.wait(0);
          } catch (InterruptedException ie) {
            interrupted = true;
          }
        }
      }
      if (interrupted) Thread.currentThread().interrupt();
    }
  • Scanning
  • RegionScanner, creation
    RegionScannerImpl(Scan scan,
        List<KeyValueScanner> additionalScanners) {
      ...
      // MVCC protocol
      IsolationLevel isolationLevel = scan.getIsolationLevel();
      synchronized(scannerReadPoints) {
        if (isolationLevel == IsolationLevel.READ_UNCOMMITTED) {
          // This scan can read even uncommitted transactions
          this.readPt = Long.MAX_VALUE;
          MVCC.setThreadReadPoint(this.readPt);
        } else {
          this.readPt = MVCC.resetThreadReadPoint(mvcc);
        }
        scannerReadPoints.put(this, this.readPt);
      }
      ...
  • RegionScanner
      ...
      // Get all StoreScanners
      for (Map.Entry<byte[], NavigableSet<byte[]>> entry :
          scan.getFamilyMap().entrySet()) {
        Store store = stores.get(entry.getKey());
        StoreScanner scanner = store.getScanner(...);
        scanners.add(scanner);
      }
      // Heap of StoreScanners
      this.storeHeap = new KeyValueHeap(scanners, comparator);
    }
  • RegionScanner, cont.
    public synchronized boolean next(
        List<KeyValue> outResults, int limit) {
      ...
      // MVCC
      MVCC.setThreadReadPoint(this.readPt);
      boolean returnResult = nextInternal(limit);
      ...
    }
    private boolean nextInternal(int limit) throws IOException {
      if (isStopRow(currentRow)) {
        ...
        return false;
      } else {
        // Get next KV from StoreScanners
        byte [] nextRow;
        do {
          this.storeHeap.next(results, limit - results.size());
        } while (Bytes.equals(currentRow, nextRow = peekRow()));
        final boolean stopRow = isStopRow(nextRow);
      }
      ...
    }
  • StoreScanner
    public synchronized boolean next(…) {
      ...
      // this.heap: heap of Memstore/StoreFile scanners
      LOOP: while((kv = this.heap.peek()) != null) {
        // What to do with the KV: versions, TTL, deletes
        ScanQueryMatcher.MatchCode qcode = matcher.match(kv);
        switch(qcode) {
          case INCLUDE:
            results.add(kv);
            this.heap.next();
            ...
            continue;
          case DONE_SCAN: ...
          case SEEK_NEXT_ROW:
            reseek(matcher.getKeyForNextRow(kv));
            break;
          case SEEK_NEXT_COL: ...
          case SKIP:
            this.heap.next();
            break;
          case SEEK_NEXT_USING_HINT:
            KeyValue nextKV = matcher.getNextKeyHint(kv);
            ...
            reseek(nextKV);
            break;
          ...
        }
      }
    }
  • Seek Hints
    • Allow skipping many KVs without “touching”
    • Seek-to(KV) instead of skip, skip, skip, …
    • Used internally (for deletes, skipping older versions, TTL)
    • Used by filters
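
The difference between skipping and seeking can be illustrated on any sorted set. The sketch below uses a `NavigableSet` of string keys as a stand-in for sorted KVs; `SeekHintSketch` and its methods are hypothetical names, not HBase APIs.

```java
import java.util.NavigableSet;
import java.util.TreeSet;

class SeekHintSketch {
    // Counts the keys touched when stepping one by one from 'from' until reaching 'target'.
    static int touchedBySkipping(NavigableSet<String> keys, String from, String target) {
        int touched = 0;
        for (String key : keys.tailSet(from, true)) {
            touched++;
            if (key.compareTo(target) >= 0) {
                break;
            }
        }
        return touched;
    }

    // Jumps straight to the first key at or after 'target' using the sort order.
    static String seekTo(NavigableSet<String> keys, String target) {
        return keys.ceiling(target);
    }

    public static void main(String[] args) {
        NavigableSet<String> keys = new TreeSet<>();
        for (int i = 1; i <= 5; i++) {
            keys.add("row1/c" + i);
        }
        keys.add("row2/c1");
        System.out.println(touchedBySkipping(keys, "row1/c1", "row2"));
        System.out.println(seekTo(keys, "row2"));
    }
}
```

With a hint such as "next row", the scanner can position itself in logarithmic time instead of visiting every KV of a wide row.
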
  • Memstore Scanner
    public synchronized KeyValue next() {
      if (theNext == null) {
          return null;
      }
      final KeyValue ret = theNext;
      // Advance one of the iterators
      if (theNext == kvsetNextRow) {
        kvsetNextRow = getNext(kvsetIt);        // KeyValueSkipListSet.iterator()
      } else {
        snapshotNextRow = getNext(snapshotIt);  // Snapshot during flushes
      }
      // Calculate the next value
      theNext = getLowest(kvsetNextRow, snapshotNextRow);
      return ret;
    }
  • Memstore Scanner
    protected KeyValue getNext(Iterator<KeyValue> it) {
      long readPoint = MVCC.getThreadReadPoint();
      while (it.hasNext()) {
        KeyValue v = it.next();
        // MVCC
        if (v.getMemstoreTS() <= readPoint) {
          return v;
        }
      }
      return null;
    }
  • Memstore Scanner
    // For seek find the right tailSet
    @Override
    public synchronized boolean seek(KeyValue key) {
      ...
      // kvset and snapshot will never be null.
      // if tailSet can't find anything, SortedSet is empty (not null).
      kvTail = kvsetAtCreation.tailSet(key);
      snapshotTail = snapshotAtCreation.tailSet(key);
      return seekInSubLists(key);
    }
    seekInSubLists() almost identical to next()
  • StoreFileScanner
    private final HFileScanner hfs;
    private KeyValue cur = null;
    ...
    public KeyValue next() throws IOException {
      KeyValue retKey = cur;
      try {
        // only seek if we aren't at the end.
        // cur == null implies end.
        if (cur != null) {
          hfs.next();
          cur = hfs.getKeyValue();
          skipKVsNewerThanReadpoint();
        }
      } catch(IOException e) {
        throw new IOException("Could not iterate " + this, e);
      }
      return retKey;
    }
  • HFileScanner/Reader
    @Override
    public boolean next() throws IOException {
      ...
      blockBuffer.position(...);
      ...
      // Still on current block?
      if (blockBuffer.remaining() <= 0) {
        long lastDataBlockOffset =
            reader.getTrailer().getLastDataBlockOffset();
        // read the next block
        HFileBlock nextBlock = readNextDataBlock();
        if (nextBlock == null) {
          return false;
        }
        updateCurrBlock(nextBlock);  // If not, read the next block
        return true;
      }
      // We are still in the same block.
      readKeyValueLen();  // Mark the next KV in the buffer
      return true;
    }
  • Puts
  • Batch Put
    private long doMiniBatchPut(BatchOperationInProgress<…> batchOp) {
      WALEdit walEdit = new WALEdit();
      ...
      MultiVersionConsistencyControl.WriteEntry w = null;
      ...
      try {
        // STEP 1. Try to acquire as many locks as we can
        // STEP 2. Update any LATEST_TIMESTAMP timestamps
        // Acquire the latest mvcc number (begin transaction)
        w = mvcc.beginMemstoreInsert();
        // STEP 3. Write back to memstore
        for (int i = firstIndex; i < lastIndexExclusive; i++) {
          addedSize += applyFamilyMapToMemstore(familyMaps[i], w);
        }
        // STEP 4. Build WAL edit
        for (int i = firstIndex; i < lastIndexExclusive; i++) {
          addFamilyMapToWALEdit(familyMaps[i], walEdit);
        }
  • Batch Put, cont.
        // STEP 5. Append the edit to WAL. Do not sync wal.
        // (Write WAL record but don’t sync!)
        txid = this.log.appendNoSync(regionInfo, …, walEdit, …);
        // STEP 6. Release row locks, etc.
        // (updatesLock: guard against concurrent flushes)
        this.updatesLock.readLock().unlock();
        for (Integer toRelease : acquiredLocks) {
          releaseRowLock(toRelease);
        }
        // STEP 7. Sync wal. (Sync after locks are released)
        this.log.sync(txid);
        walSyncSuccessful = true;
        // STEP 8. Advance mvcc. (Commit)
        // This will make this put visible to scanners and getters.
        if (w != null) {
          mvcc.completeMemstoreInsert(w);
          w = null;
        }
        ...
        return addedSize;
      }
  • Batch Put, something went wrong
      } finally {
        if (!walSyncSuccessful) {
          rollbackMemstore(batchOp, familyMaps,
                           firstIndex, lastIndexExclusive);
        }
        // Always complete the transaction!
        if (w != null) mvcc.completeMemstoreInsert(w);
        if (locked) {
          this.updatesLock.readLock().unlock();
        }
        for (Integer toRelease : acquiredLocks) {
          releaseRowLock(toRelease);
        }
        ...
      }
  • Memstore changes
    // Can pass a write entry that spans multiple calls
    private long applyFamilyMapToMemstore(
        Map<byte[], List<KeyValue>> familyMap,
        WriteEntry localizedWriteEntry) {
      long size = 0;
      boolean freemvcc = false;
      try {
        if (localizedWriteEntry == null) {
          // This begins the transaction (MVCC)
          localizedWriteEntry = mvcc.beginMemstoreInsert();
          freemvcc = true;
        }
        for (Map.Entry<…> e : familyMap.entrySet()) {
          byte[] family = e.getKey();
          List<KeyValue> edits = e.getValue();
          ...
  • Memstore changes
          ...
          Store store = getStore(family);
          for (KeyValue kv: edits) {
            // Tag KV with write number (MVCC)
            kv.setMemstoreTS(localizedWriteEntry.getWriteNumber());
            size += store.add(kv);
          }
        }
      } finally {
        if (freemvcc) {
          // This makes the changes visible
          mvcc.completeMemstoreInsert(localizedWriteEntry);
        }
      }
      return size;
    }
  • Deletes
  • ScanDeleteTracker (checking markers)
    public void add(buffer, qualifierOffset, qualifierLength, ts, type) {
      if (!hasFamilyStamp || ts > familyStamp) {
        if (type == KeyValue.Type.DeleteFamily.getCode()) {
          // Only remember TS for family deletes
          hasFamilyStamp = true;
          familyStamp = ts;
          return;
        }
        if (deleteBuffer != null && type < deleteType) {
          // same column, so ignore less specific delete
          // (a version delete marker can be ignored if there is a column marker already)
          if (Bytes.equals(deleteBuffer, deleteOffset, deleteLength,
              buffer, qualifierOffset, qualifierLength)){
            return;
          }
        }
        // new column, or more general delete type
        // Remember the KV
        deleteBuffer = buffer;
        deleteOffset = qualifierOffset;
        deleteLength = qualifierLength;
        deleteType = type;
        deleteTimestamp = ts;
      }
    }
  • ScanDeleteTracker (checking markers)
    public DeleteResult isDeleted(buffer, qualifierOffset,
        qualifierLength, timestamp) {
      // Family marker case
      if (hasFamilyStamp && timestamp <= familyStamp) {
        return DeleteResult.FAMILY_DELETED;
      }
      if (deleteBuffer != null) {
        int ret = Bytes.compareTo(deleteBuffer, deleteOffset, deleteLength,
            buffer, qualifierOffset, qualifierLength);
        if (ret == 0) {
          // Column marker case
          if (deleteType == KeyValue.Type.DeleteColumn.getCode()) {
            return DeleteResult.COLUMN_DELETED;
          }
          // Version marker case
          // If the timestamp is the same, keep this one
          if (timestamp == deleteTimestamp) {
            return DeleteResult.VERSION_DELETED;
          }
          // different timestamp, let's clear the buffer.
          // (HFiles scanned newest TS first)
          deleteBuffer = null;
        } else if(ret < 0){
          // Next column case.
          deleteBuffer = null;
        } else {
          throw new IllegalStateException(...);
        }
      }
      return DeleteResult.NOT_DELETED;
    }
  • Questions? Comments?
    More details on http://hadoop-hbase.blogspot.com