Summary of Google Percolator
Original Paper: “Large-scale Incremental Processing Using Distributed
Transactions and Notifications” by Daniel Peng and Frank Dabek
Available at: Google Research #36726
Outline
• Introduction
• Transaction
• Notification
Motivation
• Problem: Consider a computation Data -> Result, where the input data
change over time. Can we update the result without re-running the
computation over the entire repository?
• Idea: Given a large repository of data, update the computation result via
small, independent mutations, so that latency is proportional to the size
of an update rather than the size of the repository.
• Intended Use Case: dynamically updating Google’s web search index (a
multi-petabyte repository of input data)
• Existing infrastructure falls short:
o Databases: do not provide the required storage capacity or throughput
o MapReduce: batch-processing systems are inefficient for small, independent
updates
Underlying Infrastructure
• BigTable
• Timestamp Oracle (a batching sketch follows this list)
o Hands out timestamps in strictly increasing order
 Batches timestamp requests: ~2 million timestamps per second from a single machine
o Guarantees that Get() returns all writes committed before the transaction’s
start timestamp
• Chubby lock server
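
The paper does not show the oracle’s implementation; the following is a minimal sketch, assuming an oracle that persists only the upper bound of each allocated batch, so a restart can skip timestamps but never reissue one. All names here are illustrative.

// Hypothetical sketch of a batching timestamp oracle (not the paper's code).
#include <cstdint>
#include <mutex>

class TimestampOracle {
 public:
  // Returns a strictly increasing timestamp.
  uint64_t GetTimestamp() {
    std::lock_guard<std::mutex> guard(mu_);
    if (next_ == limit_) {  // current batch exhausted: allocate a new one
      limit_ = PersistHighWaterMark(next_ + kBatchSize);
    }
    return next_++;
  }

 private:
  static constexpr uint64_t kBatchSize = 10000;

  // Stand-in for durably recording the new high-water mark; a real oracle
  // must never hand out a timestamp above the last persisted limit.
  uint64_t PersistHighWaterMark(uint64_t new_limit) {
    // ... write new_limit to stable storage ...
    return new_limit;
  }

  std::mutex mu_;
  uint64_t next_ = 0;
  uint64_t limit_ = 0;
};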
Percolator Library
• BigTable operations wrapped in Percolator-specific computations
• To each column named “c”, Percolator attaches the following columns:

Column      Use
c:lock      An uncommitted transaction is writing this cell; contains the location of the primary lock
c:write     Committed data present; stores the BigTable timestamp of the data
c:data      Stores the data itself
c:notify    Hint: observers may need to run
c:ack_O     Observer “O” has run; stores the start timestamp of the last successful run
Evaluation
• Percolator performs significantly better than MapReduce when the
updates are relatively small compared to the size of the repository
Transaction
Concept of Transaction
• Transactions run under snapshot isolation:
o each transaction reads from a stable snapshot taken at its start_timestamp (drawn in the paper’s timeline figure as an open square) and
o writes at a different, later commit_timestamp (a closed circle)
o each timestamp represents a consistent snapshot
o if two concurrently running transactions both write the same cell, at most one will commit
• In the paper’s example timeline (not reproduced here):
o Transaction 2 does not see writes from transaction 1, since transaction 2’s start timestamp is before
transaction 1’s commit timestamp.
o Transaction 3 sees writes from both 1 and 2.
o Transactions 1 and 2 run concurrently: if they both write the same cell, at least one will abort.
• Snapshot isolation does not provide serializability (transactions are subject to write skew)
Transaction APIs: Semantics
• Set(<row, col>, val): adds the write to this transaction’s list of writes
• Get(<row, col>, *ptr): returns the current value of <row, col> in *ptr
o Get() returns true upon success
o Get() returns false if there is no data at the indicated <row, col>
o Get() may block, waiting while another transaction is writing
• Commit(): uses a two-phase commit sequence to perform this transaction’s
list of writes
o Commit() returns true upon success
o If Commit() returns false, the transaction has conflicted and should be retried
after a backoff (see the usage sketch below)
• Calls to Get() and Commit() are blocking
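
A hedged usage sketch of the API above: a read-modify-write retried on conflict. Transaction, Get(), Set(), and Commit() follow the semantics just listed; Backoff() is an assumed helper (e.g., a randomized exponential delay), not Percolator API.

#include <string>

// Hypothetical caller: increment a counter cell, retrying on conflict.
void IncrementCounter(const std::string& row, const std::string& col) {
  for (int attempt = 0; ; ++attempt) {
    Transaction t;                  // obtains start_timestamp on construction
    std::string value;
    int count = 0;
    if (t.Get(row, col, &value))    // false just means "no data here yet"
      count = std::stoi(value);
    t.Set(row, col, std::to_string(count + 1));
    if (t.Commit()) return;         // true: all writes applied atomically
    Backoff(attempt);               // false: conflict, back off and retry
  }
}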
Transaction APIs: Implementation
• Each transaction, in its constructor, obtains a start_timestamp from
the timestamp oracle
o Writes committed before or at the start timestamp are visible to this
transaction
• Set(<row, col>, val) just adds to the list of buffered writes; writes are not
visible until Commit().
Transaction APIs: Implementation
• Commit(), Phase #1 (Prewrite; sketched below)
• One of the writes is designated as the primary; the rest are secondaries.
• For each (primary or secondary) write <row, col, val>, check for conflicts:
o If the <row, col:write> column has an entry versioned later than start_timestamp, abort.
 It means another transaction committed a write after this transaction started. This is the
write-write conflict that snapshot isolation guards against.
 Checking the lock column does not detect this case: the lock column is cleared after commit.
o If <row, col:lock> has an entry versioned at any timestamp, abort.
 Another transaction is in progress (phase #1 of that transaction has succeeded).
• What if the other transaction crashed before releasing its locks? Stranded locks are cleaned
up lazily by readers; see Get() below.
• If there is no conflict:
o Write the new data: <row, col:data, start_timestamp> <= val
o Write the lock: <row, col:lock, start_timestamp> <= <primary.row, primary.col>
 <row, col> is now locked; another transaction trying to write <row, col> will abort.
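
A sketch of the prewrite of one cell, in the spirit of the paper’s C++ pseudocode. bigtable::Txn (a single-row BigTable transaction), EncodeLocation(), and kMaxTimestamp are assumed names, not confirmed API; the row transaction makes the conflict checks and the two writes atomic per cell.

bool Transaction::Prewrite(const Write& w, const Write& primary) {
  bigtable::Txn T(w.row);  // hypothetical single-row BigTable transaction
  // Write-write conflict: a commit record newer than our start timestamp.
  if (T.Read(w.row, w.col + ":write", start_ts_ + 1, kMaxTimestamp))
    return false;
  // Any lock at any timestamp: another writer is (or was) in progress.
  if (T.Read(w.row, w.col + ":lock", 0, kMaxTimestamp))
    return false;
  // Stage the data, and lock the cell with a pointer to the primary.
  T.Write(w.row, w.col + ":data", start_ts_, w.value);
  T.Write(w.row, w.col + ":lock", start_ts_,
          EncodeLocation(primary.row, primary.col));  // assumed encoder
  return T.Commit();
}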
Transaction APIs: Implementation
• Commit(), Phase #2 (sketched below)
• The transaction obtains a commit_timestamp from the timestamp oracle
• For the primary:
o If <row, col:lock, start_timestamp> does not have an entry, abort.
 The lock was lost somehow: maybe this transaction took too long and a cleanup erased the lock.
o Write <row, col:write, commit_timestamp> <= start_timestamp
 Recall that <row, col:data, start_timestamp> holds the data.
o Erase <row, col:lock, 0..commit_timestamp>
 This and the previous write are bundled into one BigTable row transaction, so they are atomic.
 If another transaction tries to erase this primary lock at the same time as this transaction tries to commit the
primary write, only one will succeed. If this transaction did not succeed, its Commit() has failed.
• (Commit point) If a crash happens before the primary write succeeds, this commit will roll
back; if a crash happens after this point, it will roll forward.
• For each of the secondaries:
o Write <row, col:write, commit_timestamp> <= start_timestamp
o Erase <row, col:lock, 0..commit_timestamp>
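
Phase #2, sketched under the same assumptions as the prewrite sketch above (bigtable::Txn, EncodeTimestamp(), and the oracle handle are assumed names). The single-row transaction on the primary is the commit point.

bool Transaction::CommitPhase2(const Write& primary,
                               const std::vector<Write>& secondaries) {
  uint64_t commit_ts = oracle.GetTimestamp();  // assumed oracle handle

  bigtable::Txn T(primary.row);
  // Our lock must still be there; a lazy cleanup may have rolled us back.
  if (!T.Read(primary.row, primary.col + ":lock", start_ts_, start_ts_))
    return false;
  T.Write(primary.row, primary.col + ":write", commit_ts,
          EncodeTimestamp(start_ts_));   // points readers at the data version
  T.Erase(primary.row, primary.col + ":lock", 0, commit_ts);
  if (!T.Commit()) return false;         // the commit point

  // Past the commit point: if we crash here, cleanup will roll forward.
  for (const Write& w : secondaries) {
    bigtable_.Write(w.row, w.col + ":write", commit_ts,
                    EncodeTimestamp(start_ts_));
    bigtable_.Erase(w.row, w.col + ":lock", 0, commit_ts);
  }
  return true;
}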
Transaction APIs: Implementation
• Get(<row, col>, *ptr): reading data from <row, col> (sketched below)
o If <row, col:lock> has an entry at or before start_timestamp:
 The lock entry, written at some_start_timestamp, contains <primary.row, primary.col>. (Not spelled out in the
paper:) from this it is possible to deduce the Percolator worker that locked <primary.row, primary.col> at
some_start_timestamp.
 If the Chubby lock file indicates that the owning worker is dead, or that the owning worker is alive but has not
updated its Chubby lock file for too long, start the cleanup process.
 The cleanup process asks: has the primary lock been replaced by a write record?
 If <primary.row, primary.col:lock, 0..some_start_timestamp> is clear and some <primary.row, primary.col:write,
some_commit_timestamp> entry contains some_start_timestamp (with some_commit_timestamp after
some_start_timestamp), the primary write has taken place and we must roll forward; otherwise we roll back. Note
that a search is necessary to identify some_commit_timestamp.
 To roll forward, replace the stranded lock with a write record, as the original transaction would have done: write
<row, col:write, some_commit_timestamp> <= some_start_timestamp and erase <row, col:lock,
0..some_commit_timestamp>.
 To roll back, simply erase the primary lock. Erasing the lock must be done as an atomic BigTable row transaction.
 Otherwise, or if the cleanup process fails (e.g., the owning worker comes back in time), Get() waits and retries.
o Read the latest entry in <row, col:write> at or before start_timestamp.
 If there is no entry, there is no data to read; return false.
 Otherwise the <row, col:write> entry holds data_ts, the start timestamp of the transaction that committed the
data; <row, col:data, data_ts> is the data. Return the data in *ptr; Get() returns true.
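
The read path, sketched under the same assumed single-row-transaction API as above; BackoffMaybeCleanupLock() stands for the lazy cleanup just described (checking the primary and rolling the stranded transaction forward or back).

bool Transaction::Get(const std::string& row, const std::string& col,
                      std::string* value) {
  while (true) {
    bigtable::Txn T(row);  // hypothetical single-row BigTable transaction
    // A pending lock at or before our snapshot: the writer might still
    // commit with a timestamp we would have to see, so we cannot read yet.
    if (T.Read(row, col + ":lock", 0, start_ts_)) {
      BackoffMaybeCleanupLock(row, col);  // may roll forward or roll back
      continue;                           // then retry the read
    }
    // Find the latest committed write visible in our snapshot.
    uint64_t data_ts;
    if (!T.ReadLatest(row, col + ":write", 0, start_ts_, &data_ts))
      return false;                       // no committed data in this cell
    // data_ts is the writing transaction's start timestamp; the value
    // itself lives in the data column at that timestamp.
    *value = T.ReadValue(row, col + ":data", data_ts);
    return true;
  }
}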
Notification
• A Percolator system consists of three binaries running on every machine in the cluster: a Percolator
worker, a BigTable tablet server and a GFS chunkserver
• All the observers are linked into the Percolator worker binary
o Each observer is explicitly constructed in the main() of the worker binary
• Percolator applications are structured as a series of observers; each observer completes a task and
creates more work for “downstream” observers by writing to the table
• Each observed column col has a col:notify column
o When a transaction writes to an observed cell, it also sets col:notify
o The notify columns are in a separate BigTable locality group to make the scan (below) more efficient
• Percolator workers perform a distributed scan over the notify columns to find dirty cells
• Each observed column col has a col:ack_O column for its observer “O”
o Contains the start timestamp of the last observer transaction that ran
o When a dirty cell is discovered, Percolator starts a transaction that reads both col:write and col:ack_O. If col was
written after the observer last ran, the worker runs the observer and updates col:ack_O to the new
start_timestamp (see the sketch below).
• The triggered observer runs in a separate transaction from the triggering write, so the triggering write
and the triggered observer’s writes are not atomic.
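
A sketch of how a worker might run an observer at most once per change, with assumed helpers (LatestTimestampOf, RunObserver, EncodeTimestamp) around the transaction API. Racing workers both write col:ack_O, a write-write conflict under snapshot isolation, so at most one ack update commits.

void MaybeRunObserver(const std::string& row, const std::string& col) {
  Transaction t;  // snapshot at t's start_timestamp
  uint64_t last_write = LatestTimestampOf(&t, row, col + ":write");
  uint64_t last_ack   = LatestTimestampOf(&t, row, col + ":ack_O");
  if (last_write > last_ack) {       // the cell changed since the last run
    RunObserver(&t, row, col);       // observer's reads/writes join t
    t.Set(row, col + ":ack_O", EncodeTimestamp(t.start_timestamp()));
  }
  t.Commit();                        // the loser of a race aborts here
}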
Notification
• It is possible for multiple observers to observe the same column, but
this feature is avoided so that it is clear which observer will run when a
particular column is written
• One guarantee: at most one observer’s transaction will commit for
each change of an observed column
• Percolator does nothing to prevent infinite cycles of notifications; users
have to be careful when constructing their observers
Missing From the Paper
• Missing from the paper’s main presentation:
o What does an observer look like?
o How does a client application register an observer for a column?
• My guess is that there is a Percolator API to register an observer for a
column, because this information has to come from the user. (The Editor’s
Notes below quote Sec 2.4, which confirms such an API: observers register a
function and a set of columns.)
• The Percolator library should be able to hide from the user the
implementation details of the notify column and the ack_O column, as well
as the parallel scan for dirty cells
• An alternative (but less likely) possibility: since each observer is a user-
implemented transaction, the Percolator library might parse the
transaction for all Get() calls and thereby deduce the observed column(s).
Performance Optimizations
• To make the scan for dirty cells more efficient
o The notify columns are in a separate BigTable locality group
o Each Percolator worker dedicates a few threads to the scan. Each thread’s starting point is
chosen at random. To avoid “platooning” (or “bus clumping”), when a scanning thread discovers
that it is scanning the same row as another thread, it jumps to a new random location in the
table (sketched after this list)
• Many RPCs to check a lock and write a lock
o Solution: Add conditional mutations in BigTable API
• Batch read/write operations
o Conditional mutations destined for the same BigTable tablet server can be batched into a single
RPC
o Delaying lock operations and read operations for several seconds to collect them into batches
• Prefetch
o Percolator predicts which other columns in the same row may be read later
• Get() and Commit() are blocking
o Relies on running thousands of threads per machine to provide enough parallelism
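
A sketch of the anti-clumping rule for notify scanners; all helpers (RandomRow, RowBeingScannedByAnotherThread, ProcessNotifyCellsAt, NextRow) are assumed. Jumping to a random row on collision keeps threads from platooning behind one another over freshly cleaned cells.

void NotifyScanLoop() {
  std::string pos = RandomRow();    // each scan thread starts at a random row
  while (true) {
    if (RowBeingScannedByAnotherThread(pos)) {
      pos = RandomRow();            // collision detected: jump somewhere new
      continue;
    }
    ProcessNotifyCellsAt(pos);      // trigger observers for dirty cells here
    pos = NextRow(pos);
  }
}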


Editor's Notes

  • On Commit(), Phase #2: it seems that only Get() can remove stale locks.
  • On the commit point: the two-phase commit is what makes it possible to roll forward or roll back.
  • On “Missing From the Paper”: Sec 2.4 answers the registration question: “Each observer registers a function and a set of columns with Percolator, and Percolator invokes the function after data is written to one of those columns in any row.”