Google Megastore


Published in: Technology, Business

  • 1. Query Local: Query the local replica's coordinator to determine if the entity group is up-to-date locally.
    2. Find Position: Determine the highest possibly-committed log position, and select a replica that has applied through that log position.
       (a) (Local read) If step 1 indicates that the local replica is up-to-date, read the highest accepted log position and timestamp from the local replica.
       (b) (Majority read) If the local replica is not up-to-date (or if step 1 or step 2a times out), read from a majority of replicas to find the maximum log position that any replica has seen, and pick a replica to read from. We select the most responsive or up-to-date replica, not always the local replica.
    3. Catchup: As soon as a replica is selected, catch it up to the maximum known log position as follows:
       (a) For each log position in which the selected replica does not know the consensus value, read the value from another replica. For any log positions without a known-committed value available, invoke Paxos to propose a no-op write. Paxos will drive a majority of replicas to converge on a single value: either the no-op or a previously proposed write.
       (b) Sequentially apply the consensus value of all unapplied log positions to advance the replica's state to the distributed consensus state. In the event of failure, retry on another replica.
    4. Validate: If the local replica was selected and was not previously up-to-date, send the coordinator a validate message asserting that the (entity group, replica) pair reflects all committed writes. Do not wait for a reply: if the request fails, the next read will retry.
    5. Query Data: Read the selected replica using the timestamp of the selected log position. If the selected replica becomes unavailable, pick an alternate replica, perform catchup, and read from it instead. The results of a single large query may be assembled transparently from multiple replicas.
    Note that in practice steps 1 and 2a are executed in parallel.
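Steps 1 to 3 of the read algorithm above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not Megastore's implementation: the `Replica` class, the `current_read` function, and the `local_is_up_to_date` flag (standing in for the coordinator query) are hypothetical names, the "majority" is simply the first replicas in the list, and the Paxos no-op of step 3a is replaced by an error. Steps 4 and 5 (validate, query data) are omitted.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical in-memory replica of one entity group's log."""
    name: str
    log: dict = field(default_factory=dict)    # log position -> consensus value
    applied: int = 0                           # highest position applied to state
    state: list = field(default_factory=list)  # applied values, in log order

    def highest_position(self) -> int:
        return max(self.log, default=0)


def current_read(local, replicas, local_is_up_to_date):
    """Pick a replica, catch it up, and return (replica, log position)."""
    if local_is_up_to_date:
        # Step 2a (local read): trust the local replica's highest position.
        selected, target = local, local.highest_position()
    else:
        # Step 2b (majority read): find the maximum position any polled
        # replica has seen. Assumption: the first majority in the list answers.
        majority = len(replicas) // 2 + 1
        polled = replicas[:majority]
        target = max(r.highest_position() for r in polled)
        selected = max(polled, key=lambda r: r.applied)  # most up-to-date

    # Step 3a: fetch consensus values the selected replica does not know.
    for pos in range(selected.applied + 1, target + 1):
        if pos not in selected.log:
            donors = [r for r in replicas if pos in r.log]
            if not donors:
                # The real protocol proposes a Paxos no-op here; not modeled.
                raise RuntimeError(f"no known value at position {pos}")
            selected.log[pos] = donors[0].log[pos]

    # Step 3b: apply unapplied positions sequentially.
    for pos in range(selected.applied + 1, target + 1):
        selected.state.append(selected.log[pos])
        selected.applied = pos
    return selected, target


a = Replica("a", log={1: "w1"}, applied=1, state=["w1"])
b = Replica("b", log={1: "w1", 2: "w2", 3: "w3"}, applied=2, state=["w1", "w2"])
c = Replica("c", log={1: "w1", 2: "w2"}, applied=2, state=["w1", "w2"])
chosen, position = current_read(a, [a, b, c], local_is_up_to_date=False)
print(chosen.name, position, chosen.state)
```

In this run the local replica `a` is stale, so the majority read selects `b`, which then applies the one log position it had accepted but not yet applied.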
  • 1. Accept Leader: Ask the leader to accept the value as proposal number zero. If successful, skip to step 3.
    2. Prepare: Run the Paxos Prepare phase at all replicas with a higher proposal number than any seen so far at this log position. Replace the value being written with the highest-numbered proposal discovered, if any.
    3. Accept: Ask remaining replicas to accept the value. If this fails on a majority of replicas, return to step 2 after a randomized backoff.
    4. Invalidate: Invalidate the coordinator at all full replicas that did not accept the value. Fault handling at this step is described in Section 4.7 of the Megastore paper.
    5. Apply: Apply the value's mutations at as many replicas as possible. If the chosen value differs from that originally proposed, return a conflict error.
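The five write steps above can be sketched for a single log position as follows. This is a simplified illustration under stated assumptions, not Megastore's code: `Acceptor`, `accept_zero`, `write`, and `ConflictError` are hypothetical names, replicas are in-memory objects, the proposer picks proposal numbers by simple increment, and invalidation just flips a flag (the fault handling mentioned in step 4 is not modeled).

```python
import random
import time
from dataclasses import dataclass
from typing import Optional


class ConflictError(Exception):
    """Raised when a different value won consensus at this log position."""


@dataclass
class Acceptor:
    name: str
    up: bool = True
    promised: int = -1                  # highest proposal number seen
    accepted_n: int = -1                # proposal number of the accepted value
    accepted_value: Optional[str] = None
    coordinator_valid: bool = True
    applied: Optional[str] = None

    def prepare(self, n):
        if not self.up or n <= self.promised:
            return None
        self.promised = n
        return self.accepted_n, self.accepted_value

    def accept(self, n, value):
        if not self.up or n < self.promised:
            return False
        self.promised = self.accepted_n = n
        self.accepted_value = value
        return True

    def accept_zero(self, value):
        # Leader fast path: only the first writer wins proposal number zero.
        if not self.up or self.promised >= 0:
            return False
        return self.accept(0, value)


def write(value, leader, replicas, max_rounds=5):
    majority = len(replicas) // 2 + 1
    n, chosen = 0, value
    fast_path = leader.accept_zero(value)                  # Step 1
    for _ in range(max_rounds):
        if not fast_path:                                  # Step 2: Prepare
            n += 1
            promises = [p for p in (r.prepare(n) for r in replicas) if p]
            if len(promises) < majority:
                time.sleep(random.uniform(0, 0.01))
                continue
            prev_n, prev_value = max(promises, key=lambda p: p[0])
            chosen = prev_value if prev_n >= 0 else value
        acked = [r.name for r in replicas                  # Step 3: Accept
                 if (r is leader and fast_path) or r.accept(n, chosen)]
        if len(acked) >= majority:
            break
        fast_path = False
        time.sleep(random.uniform(0, 0.01))                # randomized backoff
    else:
        raise RuntimeError("no consensus reached")
    for r in replicas:
        if r.name not in acked:                            # Step 4: Invalidate
            r.coordinator_valid = False
        if r.up:                                           # Step 5: Apply
            r.applied = chosen
    if chosen != value:
        raise ConflictError(f"{chosen!r} was chosen instead of {value!r}")
    return chosen


a, b, c = Acceptor("a"), Acceptor("b"), Acceptor("c", up=False)
print(write("tx-A", leader=a, replicas=[a, b, c]))
```

A second writer calling `write("tx-B", a, [a, b, c])` for the same position would lose the fast path at the leader, discover `tx-A` during Prepare, and end with a `ConflictError`, which matches the conflict behavior described in step 5.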
  • Google Megastore

    1. Megastore: Providing Scalable, Highly Available Storage for Interactive Services. J. Baker, C. Bond, J.C. Corbett, JJ Furman, A. Khorlin, J. Larson, J-M Léon, Y. Li, A. Lloyd, V. Yushprakh. Google Inc. [email_address] May 2011
    2. Agenda
         • Motivation
         • Architecture
         • ACID over NoSQL Database
         • Replication via Paxos
         • Operational Results
    3. Motivation
         • Build a system to please everyone (users, admins, developers).
    4. Motivation
         • High availability: fully functional during planned maintenance periods, as well as most unplanned infrastructure issues.
         • Scalability: serve a huge audience of potential users.
         • ACID: makes applications easier to write and deploy.
    5. Agenda
         • Motivation
         • Megastore Architecture
         • ACID over NoSQL Database
         • Replication via Paxos
         • Operational Results
    6. Megastore Overview
         • Widely deployed in Google for several years.
         • Used by more than 100 production applications.
         • Handles more than 3 billion write and 20 billion read transactions daily.
         • Stores nearly a petabyte of primary data across many global datacenters.
         • Available on GAE since Jan 2011.
    7. Architecture
         • Built on top of Bigtable and Chubby.
         • Blends the scalability of a NoSQL datastore with the convenience of a traditional RDBMS.
         • Synchronous replication based on Paxos across datacenters.
    8. Architecture
    9. Architecture
         • Scalable replication.
    10. Architecture
         • Operations across entity groups.
    11. Agenda
         • Motivation
         • Megastore Architecture
         • ACID over NoSQL Database
         • Replication via Paxos
         • Operational Results
    12. Data Model
         • Somewhere between an RDBMS and the row-column storage of NoSQL.
             • Schemas
             • Tables (entity group root table or child table; a child table must have a single distinguished foreign key referencing a root table)
             • Entities
             • Properties
    13. Sample Schema
    14. Mapping to Bigtable
         • Primary keys are chosen to cluster entities that will be read together.
         • Each entity is mapped into a single Bigtable row.
         • “IN TABLE” instructs Megastore to colocate tables in the same Bigtable, and key ordering ensures that Photo entities are stored adjacent to the corresponding User.
         • Bigtable column name = Megastore table name + property name.
    15. Indexes
         • Two levels of indexes:
             • Local index: a separate index for each entity group. Stored in the entity group and updated atomically and consistently.
             • Global index: spans entity groups. Not guaranteed to reflect all recent updates.
    16. Transactions & Concurrency
         • Each entity group is a mini-database providing serializable ACID semantics.
         • MVCC (multiversion concurrency control) using the transaction timestamp.
         • Reads and writes are isolated.
    17. Transactions & Concurrency
         • Three levels of read consistency:
             • Current: apply all previously committed log entries before reading, within a single entity group.
             • Snapshot: read at the last known fully applied transaction, within a single entity group.
             • Inconsistent: ignore the state of the log and read the latest value directly.
    18. Transactions & Concurrency
         • Write transaction:
             • Current read: obtain the timestamp and log position of the last committed transaction.
             • Application logic: read from Bigtable and gather writes into a log entry.
             • Commit: use Paxos to achieve consensus for appending the log entry to the log.
             • Apply: write mutations to the entities and indexes in Bigtable.
             • Clean up: delete temporary data.
    19. Transactions & Concurrency
         • Queues provide transactional messaging between entity groups. Declaring a queue automatically creates an inbox on each entity group, so queues scale automatically.
         • Two-phase commit is also available for atomic updates across entity groups.
         • Queues are recommended over two-phase commit.
    20. Agenda
         • Motivation
         • Megastore Architecture
         • ACID over NoSQL Database
         • Replication via Paxos
         • Operational Results
    21. Paxos
         • Basic Paxos
         • Multi-Paxos
    22. Reads
    23. Writes
    24. Failure Detection
         • Coordinators obtain specific Chubby locks in remote datacenters at startup.
         • If a coordinator ever loses a majority of its locks because of a crash or network partition, it considers all entity groups in its purview to be out-of-date.
         • Reads at that replica must query the log position from a majority of replicas until the locks are regained and the coordinator entries are revalidated.
         • All writers must wait for the coordinator's Chubby locks to expire before writes can complete.
    25. Agenda
         • Motivation
         • Megastore Architecture
         • ACID over NoSQL Database
         • Replication via Paxos
         • Operational Results
    26. Distribution of Availability
    27. Distribution of Average Latencies
    28. Conclusion
         • Most users see five nines availability.
         • Average read latencies are tens of milliseconds, indicating that most reads are local.
         • Most writes cost 100-400 milliseconds.
    29. Questions?
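As a small illustration of the Bigtable mapping on slide 14, the sketch below maps entities to rows and shows how key ordering places a user's photos next to the user. It is a toy model only: `to_bigtable_row`, the tuple-shaped keys, the sample entities, and the `.` separator between table name and property name are assumptions made for this example, not Megastore's actual encoding.

```python
def to_bigtable_row(table, key, properties):
    """Map one entity to (row_key, {column_name: value}).

    Assumption: the column name is "<table>.<property>".
    """
    columns = {f"{table}.{prop}": value for prop, value in properties.items()}
    return key, columns


# Hypothetical entities: User keys are (user_id,), Photo keys are
# (user_id, photo_id), so a Photo key extends its owner's key.
rows = dict([
    to_bigtable_row("Photo", (101, 502), {"tag": "Rome"}),
    to_bigtable_row("User", (102,), {"name": "Mary"}),
    to_bigtable_row("Photo", (101, 500), {"tag": "Paris"}),
    to_bigtable_row("User", (101,), {"name": "John"}),
])

# Sorting by row key, as a key-ordered store would, clusters each user's
# photos directly after that user's own row.
for row_key in sorted(rows):
    print(row_key, rows[row_key])
```

Because a child key shares its parent's key as a prefix, one range scan starting at a user's row key returns the user followed by that user's photos.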