Summary of Google Percolator
Original Paper: “Large-scale Incremental Processing Using Distributed
Transactions and Notifications” by Daniel Peng and Frank Dabek
Available at: Google Research #36726
Outline
• Introduction
• Transaction
• Notification
Motivation
• Problem: Consider a computation Data -> Result, where the input data
change over time. Can we update the result without re-running the
computation over the entire repository?
• Idea: Given a large repository of data, update the computation result via
small, independent mutations, so that latency is proportional to the size
of an update rather than the size of the repository.
• Intended Use Case: dynamically updating Google’s web search index (a
multi-petabyte repository of input data)
• Existing infrastructure falls short:
o Databases: do not provide the required storage capacity or throughput
o MapReduce: batch-processing systems are inefficient for small, independent
updates
Underlying Infrastructure
• BigTable
• Timestamp Oracle (a batching sketch follows this list)
o Hands out timestamps in strictly increasing order
 Batches timestamp requests: ~2 million timestamps per second from a single machine
o Guarantees that Get() returns all writes committed before the transaction’s
start timestamp
• Chubby lock server
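
The paper does not show the oracle’s implementation; the following is a minimal sketch, assuming an oracle that persists only the upper bound of each allocated batch, so a restart can skip timestamps but never reissue one. All names here are illustrative.

// Hypothetical sketch of a batching timestamp oracle (not the paper's code).
#include <cstdint>
#include <mutex>

class TimestampOracle {
 public:
  // Returns a strictly increasing timestamp.
  uint64_t GetTimestamp() {
    std::lock_guard<std::mutex> guard(mu_);
    if (next_ == limit_) {  // current batch exhausted: allocate a new one
      limit_ = PersistHighWaterMark(next_ + kBatchSize);
    }
    return next_++;
  }

 private:
  static constexpr uint64_t kBatchSize = 10000;

  // Stand-in for durably recording the new high-water mark; a real oracle
  // must never hand out a timestamp above the last persisted limit.
  uint64_t PersistHighWaterMark(uint64_t new_limit) {
    // ... write new_limit to stable storage ...
    return new_limit;
  }

  std::mutex mu_;
  uint64_t next_ = 0;
  uint64_t limit_ = 0;
};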
Percolator Library
• BigTable operations wrapped in Percolator-specific computations
• To each column named “c”, Percolator attaches the following columns:

Column      Use
c:lock      An uncommitted transaction is writing this cell; contains the location of the primary lock
c:write     Committed data present; stores the BigTable timestamp of the data
c:data      Stores the data itself
c:notify    Hint: observers may need to run
c:ack_O     Observer “O” has run; stores the start timestamp of the last successful run
Evaluation
• Percolator performs significantly better than MapReduce when the
updates are relatively small compared to the size of the repository
Transaction
Concept of Transaction
• Transactions run under snapshot isolation:
o each transaction reads from a stable snapshot taken at its start_timestamp (drawn in the paper’s timeline figure as an open square) and
o writes at a different, later commit_timestamp (a closed circle)
o each timestamp represents a consistent snapshot
o if two concurrently running transactions both write the same cell, at most one will commit
• In the paper’s example timeline (not reproduced here):
o Transaction 2 does not see writes from transaction 1, since transaction 2’s start timestamp is before
transaction 1’s commit timestamp.
o Transaction 3 sees writes from both 1 and 2.
o Transactions 1 and 2 run concurrently: if they both write the same cell, at least one will abort.
• Snapshot isolation does not provide serializability (transactions are subject to write skew)
Transaction APIs: Semantics
• Set(<row, col>, val): adds the write to this transaction’s list of writes
• Get(<row, col>, *ptr): returns the current value of <row, col> in *ptr
o Get() returns true upon success
o Get() returns false if there is no data at the indicated <row, col>
o Get() may block, waiting while another transaction is writing
• Commit(): uses a two-phase commit sequence to perform this transaction’s
list of writes
o Commit() returns true upon success
o If Commit() returns false, the transaction has conflicted and should be retried
after a backoff (see the usage sketch below)
• Calls to Get() and Commit() are blocking
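
A hedged usage sketch of the API above: a read-modify-write retried on conflict. Transaction, Get(), Set(), and Commit() follow the semantics just listed; Backoff() is an assumed helper (e.g., a randomized exponential delay), not Percolator API.

#include <string>

// Hypothetical caller: increment a counter cell, retrying on conflict.
void IncrementCounter(const std::string& row, const std::string& col) {
  for (int attempt = 0; ; ++attempt) {
    Transaction t;                  // obtains start_timestamp on construction
    std::string value;
    int count = 0;
    if (t.Get(row, col, &value))    // false just means "no data here yet"
      count = std::stoi(value);
    t.Set(row, col, std::to_string(count + 1));
    if (t.Commit()) return;         // true: all writes applied atomically
    Backoff(attempt);               // false: conflict, back off and retry
  }
}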
Transaction APIs: Implementation
• Each transaction, in its constructor, obtains a start_timestamp from
the timestamp oracle
o Writes committed before or at the start timestamp are visible to this
transaction
• Set(<row, col>, val) just adds to the list of buffered writes; writes are not
visible until Commit().
Transaction APIs: Implementation
• Commit(), Phase #1 (Prewrite; sketched below)
• One of the writes is designated as the primary; the rest are secondaries.
• For each (primary or secondary) write <row, col, val>, check for conflicts:
o If the <row, col:write> column has an entry versioned later than start_timestamp, abort.
 It means another transaction committed a write after this transaction started. This is the
write-write conflict that snapshot isolation guards against.
 Checking the lock column does not detect this case: the lock column is cleared after commit.
o If <row, col:lock> has an entry versioned at any timestamp, abort.
 Another transaction is in progress (phase #1 of that transaction has succeeded).
• What if the other transaction crashed before releasing its locks? Stranded locks are cleaned
up lazily by readers; see Get() below.
• If there is no conflict:
o Write the new data: <row, col:data, start_timestamp> <= val
o Write the lock: <row, col:lock, start_timestamp> <= <primary.row, primary.col>
 <row, col> is now locked; another transaction trying to write <row, col> will abort.
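
A sketch of the prewrite of one cell, in the spirit of the paper’s C++ pseudocode. bigtable::Txn (a single-row BigTable transaction), EncodeLocation(), and kMaxTimestamp are assumed names, not confirmed API; the row transaction makes the conflict checks and the two writes atomic per cell.

bool Transaction::Prewrite(const Write& w, const Write& primary) {
  bigtable::Txn T(w.row);  // hypothetical single-row BigTable transaction
  // Write-write conflict: a commit record newer than our start timestamp.
  if (T.Read(w.row, w.col + ":write", start_ts_ + 1, kMaxTimestamp))
    return false;
  // Any lock at any timestamp: another writer is (or was) in progress.
  if (T.Read(w.row, w.col + ":lock", 0, kMaxTimestamp))
    return false;
  // Stage the data, and lock the cell with a pointer to the primary.
  T.Write(w.row, w.col + ":data", start_ts_, w.value);
  T.Write(w.row, w.col + ":lock", start_ts_,
          EncodeLocation(primary.row, primary.col));  // assumed encoder
  return T.Commit();
}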
Transaction APIs: Implementation
• Commit(), Phase #2 (sketched below)
• The transaction obtains a commit_timestamp from the timestamp oracle
• For the primary:
o If <row, col:lock, start_timestamp> does not have an entry, abort.
 The lock was lost somehow: maybe this transaction took too long and a cleanup erased the lock.
o Write <row, col:write, commit_timestamp> <= start_timestamp
 Recall that <row, col:data, start_timestamp> holds the data.
o Erase <row, col:lock, 0..commit_timestamp>
 This and the previous write are bundled into one BigTable row transaction, so they are atomic.
 If another transaction tries to erase this primary lock at the same time as this transaction tries to commit the
primary write, only one will succeed. If this transaction did not succeed, its Commit() has failed.
• (Commit point) If a crash happens before the primary write succeeds, this commit will roll
back; if a crash happens after this point, it will roll forward.
• For each of the secondaries:
o Write <row, col:write, commit_timestamp> <= start_timestamp
o Erase <row, col:lock, 0..commit_timestamp>
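
Phase #2, sketched under the same assumptions as the prewrite sketch above (bigtable::Txn, EncodeTimestamp(), and the oracle handle are assumed names). The single-row transaction on the primary is the commit point.

bool Transaction::CommitPhase2(const Write& primary,
                               const std::vector<Write>& secondaries) {
  uint64_t commit_ts = oracle.GetTimestamp();  // assumed oracle handle

  bigtable::Txn T(primary.row);
  // Our lock must still be there; a lazy cleanup may have rolled us back.
  if (!T.Read(primary.row, primary.col + ":lock", start_ts_, start_ts_))
    return false;
  T.Write(primary.row, primary.col + ":write", commit_ts,
          EncodeTimestamp(start_ts_));   // points readers at the data version
  T.Erase(primary.row, primary.col + ":lock", 0, commit_ts);
  if (!T.Commit()) return false;         // the commit point

  // Past the commit point: if we crash here, cleanup will roll forward.
  for (const Write& w : secondaries) {
    bigtable_.Write(w.row, w.col + ":write", commit_ts,
                    EncodeTimestamp(start_ts_));
    bigtable_.Erase(w.row, w.col + ":lock", 0, commit_ts);
  }
  return true;
}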
Transaction APIs: Implementation
• Get(<row, col>, *ptr): reading data from <row, col> (sketched below)
o If <row, col:lock> has an entry at or before start_timestamp:
 The lock entry, written at some_start_timestamp, contains <primary.row, primary.col>. (Not spelled out in the
paper:) from this it is possible to deduce the Percolator worker that locked <primary.row, primary.col> at
some_start_timestamp.
 If the Chubby lock file indicates that the owning worker is dead, or that the owning worker is alive but has not
updated its Chubby lock file for too long, start the cleanup process.
 The cleanup process asks: has the primary lock been replaced by a write record?
 If <primary.row, primary.col:lock, 0..some_start_timestamp> is clear and some <primary.row, primary.col:write,
some_commit_timestamp> entry contains some_start_timestamp (with some_commit_timestamp after
some_start_timestamp), the primary write has taken place and we must roll forward; otherwise we roll back. Note
that a search is necessary to identify some_commit_timestamp.
 To roll forward, replace the stranded lock with a write record, as the original transaction would have done: write
<row, col:write, some_commit_timestamp> <= some_start_timestamp and erase <row, col:lock,
0..some_commit_timestamp>.
 To roll back, simply erase the primary lock. Erasing the lock must be done as an atomic BigTable row transaction.
 Otherwise, or if the cleanup process fails (e.g., the owning worker comes back in time), Get() waits and retries.
o Read the latest entry in <row, col:write> at or before start_timestamp.
 If there is no entry, there is no data to read; return false.
 Otherwise the <row, col:write> entry holds data_ts, the start timestamp of the transaction that committed the
data; <row, col:data, data_ts> is the data. Return the data in *ptr; Get() returns true.
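
The read path, sketched under the same assumed single-row-transaction API as above; BackoffMaybeCleanupLock() stands for the lazy cleanup just described (checking the primary and rolling the stranded transaction forward or back).

bool Transaction::Get(const std::string& row, const std::string& col,
                      std::string* value) {
  while (true) {
    bigtable::Txn T(row);  // hypothetical single-row BigTable transaction
    // A pending lock at or before our snapshot: the writer might still
    // commit with a timestamp we would have to see, so we cannot read yet.
    if (T.Read(row, col + ":lock", 0, start_ts_)) {
      BackoffMaybeCleanupLock(row, col);  // may roll forward or roll back
      continue;                           // then retry the read
    }
    // Find the latest committed write visible in our snapshot.
    uint64_t data_ts;
    if (!T.ReadLatest(row, col + ":write", 0, start_ts_, &data_ts))
      return false;                       // no committed data in this cell
    // data_ts is the writing transaction's start timestamp; the value
    // itself lives in the data column at that timestamp.
    *value = T.ReadValue(row, col + ":data", data_ts);
    return true;
  }
}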
Notification
• A Percolator system consists of three binaries running on every machine in the cluster: a Percolator
worker, a BigTable tablet server and a GFS chunkserver
• All the observers are linked into the Percolator worker binary
o Each observer is explicitly constructed in the main() of the worker binary
• Percolator applications are structured as a series of observers; each observer completes a task and
creates more work for “downstream” observers by writing to the table
• Each observed column col has a col:notify column
o When a transaction writes to an observed cell, it also sets col:notify
o The notify columns are in a separate BigTable locality group to make the scan (below) more efficient
• Percolator workers perform a distributed scan over the notify columns to find dirty cells
• Each observed column col has a col:ack_O column for its observer “O”
o Contains the start timestamp of the last observer transaction that ran
o When a dirty cell is discovered, Percolator starts a transaction that reads both col:write and col:ack_O. If col was
written after the observer last ran, the worker runs the observer and updates col:ack_O to the new
start_timestamp (see the sketch below).
• The triggered observer runs in a separate transaction from the triggering write, so the triggering write
and the triggered observer’s writes are not atomic.
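
A sketch of how a worker might run an observer at most once per change, with assumed helpers (LatestTimestampOf, RunObserver, EncodeTimestamp) around the transaction API. Racing workers both write col:ack_O, a write-write conflict under snapshot isolation, so at most one ack update commits.

void MaybeRunObserver(const std::string& row, const std::string& col) {
  Transaction t;  // snapshot at t's start_timestamp
  uint64_t last_write = LatestTimestampOf(&t, row, col + ":write");
  uint64_t last_ack   = LatestTimestampOf(&t, row, col + ":ack_O");
  if (last_write > last_ack) {       // the cell changed since the last run
    RunObserver(&t, row, col);       // observer's reads/writes join t
    t.Set(row, col + ":ack_O", EncodeTimestamp(t.start_timestamp()));
  }
  t.Commit();                        // the loser of a race aborts here
}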
Notification
• It is possible for multiple observers to observe the same column, but
this feature is avoided so that it is clear which observer will run when a
particular column is written
• One guarantee: at most one observer’s transaction will commit for
each change of an observed column
• Percolator does nothing to prevent infinite cycles of notifications; users
have to be careful when constructing their observers
Missing From the Paper
• Missing from the paper’s main presentation:
o What does an observer look like?
o How does a client application register an observer for a column?
• My guess is that there is a Percolator API to register an observer for a
column, because this information has to come from the user. (The Editor’s
Notes below quote Sec 2.4, which confirms such an API: observers register a
function and a set of columns.)
• The Percolator library should be able to hide from the user the
implementation details of the notify column and the ack_O column, as well
as the parallel scan for dirty cells
• An alternative (but less likely) possibility: since each observer is a user-
implemented transaction, the Percolator library might parse the
transaction for all Get() calls and thereby deduce the observed column(s).
Performance Optimizations
• To make the scan for dirty cells more efficient
o The notify columns are in a separate BigTable locality group
o Each Percolator worker dedicates a few threads to the scan. Each thread’s starting point is
chosen at random. To avoid “platooning” (or “bus clumping”), when a scanning thread discovers
that it is scanning the same row as another thread, it jumps to a new random location in the
table (sketched after this list)
• Many RPCs to check a lock and write a lock
o Solution: Add conditional mutations in BigTable API
• Batch read/write operations
o Conditional mutations destined for the same BigTable tablet server can be batched into a single
RPC
o Delaying lock operations and read operations for several seconds to collect them into batches
• Prefetch
o Percolator predicts which other columns in the same row may be read later
• Get() and Commit() are blocking
o Relies on running thousands of threads per machine to provide enough parallelism
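
A sketch of the anti-clumping rule for notify scanners; all helpers (RandomRow, RowBeingScannedByAnotherThread, ProcessNotifyCellsAt, NextRow) are assumed. Jumping to a random row on collision keeps threads from platooning behind one another over freshly cleaned cells.

void NotifyScanLoop() {
  std::string pos = RandomRow();    // each scan thread starts at a random row
  while (true) {
    if (RowBeingScannedByAnotherThread(pos)) {
      pos = RandomRow();            // collision detected: jump somewhere new
      continue;
    }
    ProcessNotifyCellsAt(pos);      // trigger observers for dirty cells here
    pos = NextRow(pos);
  }
}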


Editor's Notes

  • On Commit(), Phase #2: it seems that only Get() can remove stale locks.
  • On the commit point: the two-phase commit is what makes it possible to roll forward or roll back.
  • On “Missing From the Paper”: Sec 2.4 answers the registration question: “Each observer registers a function and a set of columns with Percolator, and Percolator invokes the function after data is written to one of those columns in any row.”