Percolator

Published in: Technology

  1. 1. Large-scale Incremental Processing Using Distributed Transactions and Notifications
  2. 2. Agenda <ul><li>Introduction </li></ul><ul><li>Design </li></ul><ul><ul><li>Bigtable overview </li></ul></ul><ul><ul><li>Transaction </li></ul></ul><ul><ul><li>Timestamps </li></ul></ul><ul><ul><li>Notifications </li></ul></ul><ul><ul><li>Discussion </li></ul></ul><ul><li>Reference </li></ul>
  3. 3. Introduction <ul><li>Why Percolator? </li></ul><ul><ul><li>Some data processing tasks transform a large repository of data via small, independent mutations. </li></ul></ul><ul><ul><ul><li>RDBMS? Does not meet the storage or throughput requirements. </li></ul></ul></ul><ul><ul><ul><li>MapReduce? Relies on large batches and cannot process small updates individually. </li></ul></ul></ul><ul><ul><li>Percolator is a system for incrementally processing updates to a large data set. </li></ul></ul><ul><ul><ul><li>used to create the Google web search index </li></ul></ul></ul><ul><ul><ul><li>reduced the average age of documents in Google search results by 50% </li></ul></ul></ul>
  4. 4. Introduction(cont.) <ul><li>Percolator </li></ul><ul><ul><li>features </li></ul></ul><ul><ul><ul><li>random access to a multi-PB repository </li></ul></ul></ul><ul><ul><ul><li>ACID-compliant transactions </li></ul></ul></ul><ul><ul><ul><li>snapshot isolation semantics </li></ul></ul></ul><ul><ul><ul><li>observers: like triggers in DBMS, applications are structured as a series of observers </li></ul></ul></ul><ul><ul><li>user scenarios </li></ul></ul><ul><ul><ul><li>computation should be very large in some dimension </li></ul></ul></ul><ul><ul><ul><li>can be broken down into small updates </li></ul></ul></ul><ul><ul><ul><li>have some strong consistency requirements </li></ul></ul></ul>
  5. 5. Design <ul><li>Two main abstractions </li></ul><ul><ul><li>ACID transactions over a random-access repository </li></ul></ul><ul><ul><li>observers </li></ul></ul><ul><li>Components </li></ul><ul><ul><li>every machine runs a Percolator worker, a Bigtable tablet server, and a GFS chunkserver </li></ul></ul><ul><ul><li>timestamp oracle </li></ul></ul><ul><ul><li>lightweight lock service </li></ul></ul>
  6. 6. Design: Bigtable overview <ul><li>Bigtable </li></ul><ul><ul><li>single-row transactions only (comparable to HBase?) </li></ul></ul><ul><ul><li>Percolator’s API closely resembles Bigtable’s API </li></ul></ul><ul><ul><li>the Percolator library largely consists of Bigtable operations wrapped in Percolator-specific computations </li></ul></ul><ul><li>Challenges </li></ul><ul><ul><li>multirow transactions </li></ul></ul><ul><ul><li>the observer framework </li></ul></ul>
  7. 7. Design: Transactions <ul><li>Cross-row, cross-table transactions with ACID snapshot-isolation semantics </li></ul><ul><ul><li>snapshot isolation does not guarantee serializability </li></ul></ul><ul><li>No central transaction manager; transactions are implemented in a client library that accesses Bigtable </li></ul><ul><ul><li>a lock service would need to be replicated, distributed, and balanced, and would need to write to a persistent data store </li></ul></ul><ul><ul><li>instead, locks are stored in special in-memory columns in the same Bigtable that stores the data </li></ul></ul>
  8. 8. Design: Transactions (cont.) <ul><li>The transaction’s constructor asks the timestamp oracle for a start timestamp. </li></ul><ul><ul><li>The start timestamp determines the consistent snapshot seen by Get(). </li></ul></ul><ul><ul><li>Calls to Set() are buffered until commit time. </li></ul></ul><ul><li>Two-phase commit </li></ul><ul><ul><li>Phase 1: try to lock all the cells being written. </li></ul></ul><ul><ul><li>Phase 2: obtain the commit timestamp, then release each lock and make the write visible by replacing the lock with a write record. </li></ul></ul>
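The commit protocol on this slide can be sketched in a few lines of Python. This is a minimal single-process model, not Percolator's implementation: `Oracle`, `Store`, `Transaction`, and the three dictionaries standing in for the data, lock, and write columns are all hypothetical names, and a real client would issue each step as a Bigtable row transaction.

```python
import itertools


class Oracle:
    """Hypothetical stand-in for the timestamp oracle: strictly increasing integers."""

    def __init__(self):
        self._counter = itertools.count(1)

    def next(self):
        return next(self._counter)


class Store:
    """Toy in-process stand-in for one Bigtable table with Percolator's extra columns."""

    def __init__(self):
        self.data = {}   # (key, start_ts) -> value
        self.lock = {}   # key -> (start_ts, primary_key)
        self.write = {}  # (key, commit_ts) -> start_ts of the committed data


class Transaction:
    def __init__(self, store, oracle):
        self.store = store
        self.oracle = oracle
        self.start_ts = oracle.next()  # fixes the snapshot seen by get()
        self.writes = {}               # set() calls are buffered here until commit

    def set(self, key, value):
        self.writes[key] = value

    def get(self, key):
        lock = self.store.lock.get(key)
        if lock is not None and lock[0] <= self.start_ts:
            # A real worker would wait or clean up the lock; the sketch just fails.
            raise RuntimeError(f"{key!r} is locked by a pending transaction")
        visible = [(cts, sts) for (k, cts), sts in self.store.write.items()
                   if k == key and cts <= self.start_ts]
        if not visible:
            return None
        _, data_ts = max(visible)  # latest commit inside the snapshot
        return self.store.data[(key, data_ts)]

    def commit(self):
        keys = list(self.writes)
        if not keys:
            return True
        primary = keys[0]
        # Phase 1: lock every cell, aborting on a newer write or any existing lock.
        for key in keys:
            newer_write = any(k == key and cts >= self.start_ts
                              for (k, cts) in self.store.write)
            if newer_write or key in self.store.lock:
                self._rollback()
                return False
            self.store.data[(key, self.start_ts)] = self.writes[key]
            self.store.lock[key] = (self.start_ts, primary)
        # Phase 2: replace each lock with a write record, primary first.
        commit_ts = self.oracle.next()
        for key in keys:
            self.store.write[(key, commit_ts)] = self.start_ts
            del self.store.lock[key]
        return True

    def _rollback(self):
        # Undo only the cells this transaction managed to prewrite.
        for key in self.writes:
            lock = self.store.lock.get(key)
            if lock is not None and lock[0] == self.start_ts:
                del self.store.lock[key]
                del self.store.data[(key, self.start_ts)]
```

In this model, two transactions that start before either commits and then write the same key conflict: the first `commit()` returns `True` and the second returns `False`, while the loser's earlier `get()` calls still read from its own snapshot.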
  9. 9. Design: Transactions (cont.) <ul><li>Error recovery </li></ul><ul><ul><li>client failure while a transaction is being committed </li></ul></ul><ul><ul><ul><li>lazy approach to cleanup </li></ul></ul></ul><ul><ul><ul><li>the primary lock decides whether the transaction committed </li></ul></ul></ul><ul><ul><ul><li>primary lock still present: roll back </li></ul></ul></ul><ul><ul><li>client failure during the second phase of commit </li></ul></ul><ul><ul><ul><li>the transaction is past the commit point </li></ul></ul></ul><ul><ul><ul><li>roll forward </li></ul></ul></ul><ul><li>Lock cleanup </li></ul><ul><ul><li>only clean up a lock that belongs to a dead or stuck worker (liveness is tracked through Chubby) </li></ul></ul>
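The roll-back versus roll-forward decision can be illustrated with a short sketch. It assumes a toy store made of three dictionaries (`lock`, `data`, `write`) and a caller-supplied `worker_is_alive` callback standing in for the Chubby liveness check; all of these names are hypothetical, and in the real system each cleanup step runs inside a Bigtable row transaction so that concurrent cleaners cannot disagree.

```python
def resolve_stale_lock(store, key, worker_is_alive):
    """Clean up the lock on `key` if its owner is dead; return what was done.

    store["lock"]:  key -> (start_ts, primary_key, owner_id)
    store["data"]:  (key, start_ts) -> value
    store["write"]: (key, commit_ts) -> start_ts
    """
    lock = store["lock"].get(key)
    if lock is None:
        return "no-lock"
    start_ts, primary, owner = lock
    if worker_is_alive(owner):
        return "wait"  # never clean up after a live worker

    primary_lock = store["lock"].get(primary)
    if primary_lock is not None and primary_lock[0] == start_ts:
        # Primary lock still present: the commit point was never reached.
        # Erase the primary first, then this cell.
        for k in dict.fromkeys([primary, key]):
            del store["lock"][k]
            store["data"].pop((k, start_ts), None)
        return "rolled-back"

    # Primary lock is gone: look for the primary's write record.
    commit_ts = next((cts for (k, cts), sts in store["write"].items()
                      if k == primary and sts == start_ts), None)
    if commit_ts is None:
        # Someone else already rolled the primary back; finish the job here.
        del store["lock"][key]
        store["data"].pop((key, start_ts), None)
        return "rolled-back"

    # Past the commit point: replace the stranded lock with a write record.
    store["write"][(key, commit_ts)] = start_ts
    del store["lock"][key]
    return "rolled-forward"
```

Storing the primary's key in every secondary lock is what lets any later reader make this decision without a central coordinator.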
  10. 10. Design: Timestamps <ul><li>The timestamp oracle hands out timestamps in strictly increasing order. </li></ul><ul><ul><li>batches timestamp requests </li></ul></ul><ul><ul><li>2 million timestamps per second from a single machine </li></ul></ul><ul><li>Guarantees that Get() returns all writes committed before the transaction’s start timestamp. </li></ul><ul><ul><li>a write with commit timestamp T_W is visible to a reader with start timestamp T_R when T_W < T_R </li></ul></ul>
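One way an oracle can stay strictly increasing while serving most requests from memory is to reserve timestamps in ranges and persist only the upper bound of each range. The sketch below assumes that approach; `TimestampOracle`, the `stable` dictionary standing in for durable storage, and the `batch` size are hypothetical names chosen for illustration.

```python
import threading


class TimestampOracle:
    """Toy oracle: reserves ranges of timestamps and persists only the high-water mark."""

    def __init__(self, stable, batch=1000):
        self.stable = stable                        # dict standing in for stable storage
        self.batch = batch
        self.limit = stable.get("max_allocated", 0)  # highest timestamp already reserved
        self.next_ts = self.limit + 1               # after a restart, jump past the old range
        self._mutex = threading.Lock()

    def get_timestamps(self, n=1):
        """Return n consecutive, never-before-issued timestamps as a range."""
        with self._mutex:
            last_needed = self.next_ts + n - 1
            if last_needed > self.limit:
                # One durable write covers this request plus the next `batch` timestamps.
                self.limit = last_needed + self.batch
                self.stable["max_allocated"] = self.limit
            first = self.next_ts
            self.next_ts += n
            return range(first, first + n)
```

A restarted oracle skips whatever remained of the previously reserved range, so timestamps may jump forward but never repeat or go backwards. A client can also ask for several timestamps in one call, which is the request batching the slide mentions.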
  11. 11. Design: Notifications <ul><li>Observer </li></ul><ul><ul><li>registers a function and a set of columns with Percolator </li></ul></ul><ul><ul><li>Percolator scans two special columns and calls the corresponding observers </li></ul></ul><ul><ul><ul><li>Ack </li></ul></ul></ul><ul><ul><ul><li>Notify </li></ul></ul></ul><ul><ul><li>in practice there are very few observers (about 10), and a single observer runs for a change to a particular column </li></ul></ul>
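The interplay of the Notify and Ack columns can be sketched as follows. This is a simplified in-memory model under stated assumptions: `ObserverTable`, its logical clock, and the column names in the usage note are hypothetical, and the real system runs each observer inside a Percolator transaction rather than as a plain function call.

```python
class ObserverTable:
    """Toy table with a notify set (dirty cells) and an ack map (last observer run)."""

    def __init__(self):
        self.observers = {}  # column -> observer function
        self.cells = {}      # (row, column) -> (value, write_ts)
        self.notify = set()  # cells whose observed column changed
        self.ack = {}        # (row, column) -> timestamp at which the observer last ran
        self.clock = 0       # logical clock standing in for the timestamp oracle

    def _tick(self):
        self.clock += 1
        return self.clock

    def register(self, column, fn):
        self.observers[column] = fn

    def write(self, row, column, value):
        self.cells[(row, column)] = (value, self._tick())
        if column in self.observers:
            self.notify.add((row, column))  # hint for the scanner

    def scan_once(self):
        """Run observers for currently dirty cells; return how many observers ran."""
        ran = 0
        for cell in sorted(self.notify):    # snapshot, so observers may write safely
            self.notify.discard(cell)
            value, write_ts = self.cells[cell]
            if write_ts > self.ack.get(cell, 0):
                self.ack[cell] = self._tick()  # acknowledge before running
                row, column = cell
                self.observers[column](self, row, value)
                ran += 1
        return ran
```

For example, an observer registered on a hypothetical `raw` column can write a `parsed` column, which in turn notifies a second observer on the next scan. Chaining observers this way is how an application is structured as a series of small steps.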
  12. 12. Design: Discussion <ul><li>Many RPCs per work unit </li></ul><ul><ul><li>about 50 Bigtable operations to process a single document </li></ul></ul><ul><ul><li>solutions </li></ul></ul><ul><ul><ul><li>add conditional mutations to the Bigtable API </li></ul></ul></ul><ul><ul><ul><li>batch operations </li></ul></ul></ul><ul><ul><ul><li>prefetch </li></ul></ul></ul><ul><li>All API calls are blocking </li></ul><ul><ul><li>relies on running thousands of threads to provide enough parallelism </li></ul></ul>
  13. 13. Reference <ul><li>“Large-scale Incremental Processing Using Distributed Transactions and Notifications”, OSDI’10 </li></ul><ul><li>http://www.infoq.com/cn/news/2010/10/google-percolator </li></ul>
  14. 14. HBase Coprocessor <ul><li>provides a framework for both distributed computation directly within the HBase server processes and flexible, generic extensions. </li></ul><ul><ul><li>Observer </li></ul></ul><ul><ul><ul><li>RegionObserver </li></ul></ul></ul><ul><ul><ul><li>MasterObserver </li></ul></ul></ul><ul><ul><ul><li>WALObserver </li></ul></ul></ul><ul><ul><li>Endpoint </li></ul></ul>
  15. 15. <ul><li>Thank you! </li></ul>
