
Learning from Google Megastore (Part 1)



This is a study of Megastore; it does not cover Paxos replication.



  1. Learning from Google Megastore. Part 1: Data Model and Transactions in a Single Datacenter (w/o Replication and Paxos). Schubert Zhang, April 9, 2011
  2. Megastore Introduction
  3. Three Aspects of Megastore
     • Data Model to be a DB
       – Data layout
       – Indexing
     • Transactions and ACID
       – Within an entity group
       – Across entity groups
     • Replication across datacenters (not researched in detail in this presentation)
       – Synchronous replication
       – Optimized Paxos
  4. What is Megastore?
     Megastore is a database over Bigtable, with high availability across datacenters.
     Bigdata philosophy: fine-grained partitioning to make things easy, data placement for relations, and Paxos; then a simple API/language for convenience of use!
  5. Target Applications
     • Interactive online services
       – User-facing applications
     • Conflicting requirements
       – Highly scalable (size, throughput)
       – Rapid development, fast time-to-market
       – Responsive, low latency
       – Consistent view of data
       – Highly available
     • Reads vs. writes
       – 20 billion : 3 billion daily @Google, about 7:1
     • Bigdata
       – Petabytes of primary data
       – Across datacenters
     • Application developers
       – Familiar with RDBMS and SQL
       – Find it difficult to give up the "read-modify-write" idiom
       – But now need high scalability for bigdata
  6. NoSQL + RDBMS = Megastore
     • NoSQL datastore (Bigtable)
       – Pros
         • Highly scalable
         • Highly available within a DC (across hosts)
       – Cons
         • Limited API
         • Loose consistency models complicate application development
     • RDBMS
       – Pros
         • Rich set of features for convenient, rapid application development
         • Transactions
         • ACID semantics
       – Cons
         • Difficult to scale
     • Megastore database
       – High scalability
       – Distributed transactions
       – Consistency guarantees
         • Fully serializable ACID semantics within entity groups
       – Convenience, rapid development for applications
       – Plus high availability
         • Within-DC (Bigtable)
         • Across-DC replication via Paxos (synchronous writes within an EG)
         • Strong consistency guarantees (synchronous replication)
         • Reasonable latency, seamless failover
  7. Design Principles
     • Take a middle ground in the RDBMS vs. NoSQL design space:
       – partition the datastore and
       – replicate each partition separately,
       – providing full ACID semantics within partitions,
       – but only limited/loose consistency guarantees across them.
     • Use Paxos to build a highly available system that:
       – provides reasonable latencies for interactive applications while
       – synchronously replicating writes across geographically distributed datacenters,
       – to achieve across-DC high availability and a consistent view of the data.
     • Approaches:
       – for database scale, partition data into a vast space of small databases, each with its own replicated log stored in a per-replica Bigtable;
       – for availability, implement a synchronous, fault-tolerant log replicator optimized for cross-DC replication.
  8. EG: Entity Groups
     • The entity-group concept is the cornerstone of scalability and availability!
       – Fine-grained partitions of data
       – Fine-grained control over data partitioning and locality
       – Like many mini-databases
       – Scales throughput and localizes outages
       – Each EG is independently and synchronously replicated across DCs
     • A physical EG in Bigtable consists of (see the sketch after this list)
       – A write-ahead log (for ACID transactions)
       – Related data (pre-joined)
       – Local indexes (also with ACID)
       – An inbox for receiving cross-EG messages
       – ... like a mini-database (locally complete)
     • Size of an EG
       – Not too large, not too small
       – An a priori/natural or deliberate grouping of data for fast operations
       – If too large: serializable ACID causes long latency and low throughput
       – If too small: many expensive cross-EG consistency operations (e.g., 2PC), or looser-consistency asynchronous messaging
     • Notes: The data for most Internet services can be suitably partitioned (e.g., by user) to make this approach viable. Nearly all applications built on Megastore have found natural ways to draw EG boundaries.
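     A minimal Python sketch of the per-EG state listed above. The EntityGroup and LogEntry names and fields are illustrative assumptions for this study, not Megastore's actual types; later sketches in this deck reuse them.

     # Illustrative sketch only; not Megastore's real data structures.
     from dataclasses import dataclass, field

     @dataclass
     class LogEntry:
         position: int    # log position chosen at commit time
         timestamp: int   # timestamp assigned to this entry's mutations
         mutations: list  # gathered (row, column, value) tuples

     @dataclass
     class EntityGroup:
         root_key: bytes                                    # the EG (root entity) key
         wal: list = field(default_factory=list)            # write-ahead log of LogEntry
         data: dict = field(default_factory=dict)           # pre-joined entity rows
         local_indexes: dict = field(default_factory=dict)  # ACID-consistent local indexes
         inbox: list = field(default_factory=list)          # queue for cross-EG messages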
  9. Schematic Diagrams
     [Diagram: an EG as a mini-DB, containing a WAL (logs), primary data, local indexes, and an inbox/queue for messages; EG 2 ... EG n are laid out the same way. Caption: Megastore layout in Bigtable.]
  10. Many WALs vs. a Single WAL
     • Many replicated logs, each governing its own EG, to improve availability and throughput.
       – Independent, concurrent operations on different EGs
       – Only operations within an EG need to be serialized
       – Temporarily long-waiting or failed operations do not impact other EGs
     • Many WALs scale throughput and localize outages
     • The WAL is stored with each EG in Bigtable
     • Examples with the same tenet
       – The asynchronous, concurrent RPC communication frameworks of HBase and Hadoop IPC.
  11. Consistency Levels and the Approaches
     • Within each EG: full ACID semantics
       – Single-phase-commit ACID transactions
       – And each committed log entry is replicated via Paxos across DCs
     • Across EGs: limited consistency guarantees (two methods for two levels)
       – Two-phase commit (expensive, long latency) -> strong consistency
       – Or, typically, efficient asynchronous messaging (queues!, inexpensive, low latency) -> loose (or eventual) consistency; see the sketch after this list
     • Two-phase commit vs. asynchronous messaging
       – Two-phase-commit transactions
         • Strong consistency
         • Expensive
         • Long latency and low throughput
         • Usually for low-traffic operations
       – Asynchronous messaging
         • Loose consistency; may be inconsistent (or eventually consistent)
         • Inexpensive
         • Usually for heavy-traffic operations
     • Objects to be made consistent:
       – Data and local indexes, within an EG: strong (via WAL, ACID)
       – Data and global indexes, across EGs: strong (via 2PC) or looser (via messaging)
       – Replicas within a DC: strong (via GFS and Bigtable)
       – Replicas across DCs: strong (via Paxos)
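     A hedged Python sketch of the asynchronous-messaging level: plain dicts stand in for two entity groups, and the point is only the shape of the protocol (the sender's commit is atomic within its own EG; the receiver applies the message later, so the pair is only eventually consistent). None of this is Megastore's actual API.

     def send_credit(sender_eg: dict, receiver_eg: dict, amount: int) -> None:
         # Within the sender's EG this is one ACID commit: the balance change
         # and the outgoing message are logged and applied together.
         sender_eg["balance"] -= amount
         receiver_eg["inbox"].append({"credit": amount})  # delivered, not yet applied

     def drain_inbox(receiver_eg: dict) -> None:
         # The receiver consumes its inbox later, in its own transaction; until
         # then the two EGs are only eventually consistent.
         while receiver_eg["inbox"]:
             msg = receiver_eg["inbox"].pop(0)
             receiver_eg["balance"] += msg["credit"]

     a = {"balance": 10, "inbox": []}
     b = {"balance": 0, "inbox": []}
     send_credit(a, b, 5)  # a is consistent immediately ...
     drain_inbox(b)        # ... b catches up when its inbox is drained
     assert (a["balance"], b["balance"]) == (5, 5)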
  12. The Two Faces of ACID Transactions
     • Front face:
       – Simplifies application development
       – Makes reasoning about correctness easier
     • Back face:
       – Reduced performance
       – Latency
       – Throughput
  13. Architecture of Megastore: How Is It Deployed?
     • Deployment
       – A client library (the DB logic)
       – And auxiliary servers (for cross-DC replication)
     • Applications link to the client library
  14. Data Model and Semantics to Be a Database ...
  15. Principles to Be a DBMS
     • Provides traditional database features, such as secondary indexes, etc.,
     • but only those features that can scale within user-tolerable latency limits,
     • and only with the semantics that the EG partitioning scheme can support.
     The feature set is carefully chosen, with tradeoffs.
  16. Data Model (database concepts)
     • A data model is a notation for describing data or information.
     • It generally consists of three parts:
       – Structure of the data
       – Operations on the data
       – Constraints on the data
     • Megastore data model: relational model + scale
       – A limited relational model
       – Bigtable's scalability
     • High-level model vs. physical-level model
       – Physical level
         • Complicates application development
         • Bigtable's data model is at the physical level
       – High level
         • Lets programmers write code conveniently
         • Language, SQL
  17. Data Model
     • Schemaful
       – Strongly typed properties (primitives or Protocol Buffers)
       – Required, optional, or repeated properties
       – All entities in a table have the same set of allowable properties
       – Nested Protocol Buffers?
     • Hierarchy of concepts: schemas (name) contain tables (name); tables contain entities (primary key); entities have properties (name, type)
     • Primary key
       – Built from a sequence of properties
       – Must be unique within the table
     • An EG = a root entity + all entities in child tables that reference it
     [Diagram: a root table (EG key) with child tables whose foreign key is the EG key; example entity schemas User (root), Photo, and Book holding schema-related hierarchical data.]
  18. SQL-Like Schema Language (DDL)
     Additional key qualifiers: DESC | ASC | SCATTER

     CREATE SCHEMA DemoApp;

     CREATE TABLE User {
       required int64 userId;
       required string name;
     } PRIMARY KEY(userId), ENTITY GROUP ROOT;

     CREATE TABLE Photo {
       required int64 userId;
       required int32 photoId;
       required int64 time;
       required string url;
       optional string thumbUrl;
       repeated string tag;
     } PRIMARY KEY(userId, photoId),
       IN TABLE User,
       ENTITY GROUP KEY(userId) REFERENCES User;

     CREATE TABLE Book {
       required int64 userId;
       required int32 bookId;
       required int64 time;
       required string url;
       repeated string tag;
     } PRIMARY KEY([DESC|ASC|SCATTER] userId, [DESC|ASC|SCATTER] bookId),
       IN TABLE User,
       ENTITY GROUP KEY(userId) REFERENCES User;

     CREATE LOCAL INDEX PhotosByTime ON Photo(userId, time);
     CREATE LOCAL INDEX BooksByTime ON Book([DESC|ASC|SCATTER] userId, [DESC|ASC] time);
     CREATE GLOBAL INDEX PhotosByTag ON Photo(tag) STORING (thumbUrl);
  19. Data Placement in Bigtable (principles)
     Pre-join with keys, for performance ...
     • Lets applications control the placement of hierarchical/related data, to minimize latency and maximize throughput
       – Storing data that is accessed together in nearby rows, or
       – denormalized into the same row
     • The data for an EG are held in contiguous ranges of Bigtable rows, for
       – Low latency
       – High throughput
       – Cache efficiency
     • Pre-joining with keys (see the sketch below)
       – Primary keys cluster entities that will be read together.
       – Each entity maps into a single Bigtable row.
       – Primary-key values are concatenated to form the Bigtable row key.
       – Each remaining property occupies its own Bigtable column.
       – The entity-group key is the prefix of the primary key (row key).
       – Keys are sorted ascending or descending.
       – SCATTER (a two-byte hash prefix) prevents hotspots in Bigtable.
       – Recursive for arbitrary join depths (multiple levels of "IN TABLE").
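     A hedged Python sketch of the row-key construction described above. The byte encodings (big-endian packing, MD5 for the SCATTER prefix) are illustrative assumptions, not Megastore's actual format.

     import hashlib
     import struct

     def scatter_prefix(value: int) -> bytes:
         # SCATTER: a two-byte hash prefix that spreads hot sequential keys
         # across Bigtable tablets.
         return hashlib.md5(struct.pack(">q", value)).digest()[:2]

     def photo_row_key(user_id: int, photo_id: int, scatter: bool = False) -> bytes:
         key = scatter_prefix(user_id) if scatter else b""
         # The EG key (userId) comes first, keeping all of an EG's rows
         # contiguous; big-endian packing preserves order for non-negative ids.
         key += struct.pack(">q", user_id)
         key += struct.pack(">i", photo_id)
         return key

     # All of user 1's Photo rows sort together, before any row of user 2.
     assert photo_row_key(1, 2) < photo_row_key(2, 1)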
  20. Data Placement in Bigtable (details)
     Pre-join with keys, for performance ...
     • Bigtable row key = the primary key of each table
     • Bigtable column name = <table name>.<property name>
       – Allows entities from different Megastore tables to be mapped into the same Bigtable row without collision.
     • The transaction/replication log and metadata for the EG are stored in the root entity's Bigtable row,
       – because Bigtable provides per-row transactions.
     • Indexes: each index entry is represented as a single Bigtable row (see the sketch below)
       – Bigtable row key = <indexed property values> + <primary key>
       – Bigtable cell columns: denormalized (STORING) properties
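     A hedged sketch of one index-entry row under the rule above, reusing photo_row_key from the previous sketch; the separator byte and column naming are assumptions for illustration.

     def photos_by_tag_index_row(tag: str, user_id: int, photo_id: int,
                                 thumb_url: str):
         # Row key = indexed property value(s) + the entity's primary key.
         row_key = tag.encode("utf-8") + b"\x00" + photo_row_key(user_id, photo_id)
         # STORING (thumbUrl): denormalize the property into the index row so
         # readers can skip the expensive random access back to the entity.
         columns = {"PhotosByTag.thumbUrl": thumb_url}
         return row_key, columns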
  21. Data Placement in Bigtable (examples)
     [Diagram: the EG for user U1 as contiguous Bigtable rows, with the transaction metadata and WAL on the root row and the STORING column denormalized into the global index.]
     • Row <U1> (root entity): Root.WAL holds the log entries (Log1, Log2, Log3); Root.meta holds the commit offset and applied offset; User.name = Jack.
     • Row <U1,P1> (Photo entity): Photo.time = T1, Photo.url = URL1, Photo.thumbUrl = TURL1, Photo.tag = {girl, car}.
     • Row <U1,P2> (Photo entity): Photo.time = T2, Photo.url = URL2, Photo.thumbUrl = TURL2, Photo.tag = {dress, girl}.
     • Local index PhotosByTime rows: <U1,T1><U1,P1>; <U1,T2><U1,P2>.
     • Global index PhotosByTag rows (with denormalized PhotosByTag.thumbUrl): <car><U1,P1> TURL1; <dress><U1,P2> TURL2; <girl><U1,P1> TURL1; <girl><U1,P2> TURL2.
  22. Secondary Indexes
     • Secondary indexes can be declared on any list of entity properties (optional properties are OK), including repeated properties, as well as sub-fields within Protocol Buffers, and full-text indexes.
     • Local indexes
       – Within an EG
       – Obey ACID semantics
         • The index entries are stored in the entity group and are updated atomically and consistently with the primary entity data.
     • Global indexes
       – Span EGs
       – Looser consistency (possibly eventual)
         • Not guaranteed to reflect all recent updates (may be inconsistent with the primary data?)
         • Keeping global indexes consistent with the primary data is tricky!
           – A special two-phase commit? and
           – read-repair?
  23. Secondary Indexes and Denormalization
     • The STORING clause copies data into index entries
       – Avoids the indirect access back to primary entities, which is very expensive random access.
       – But keeping the copy consistent is an issue!
     • Inline indexes
       – Index entries from the source entities appear as a virtual repeated column in the target entity.
       – An inline index can be created on any table (child) that has a foreign key referencing another table (parent), by using the first primary key of the target entity as the first component of the index.
     [Diagram: parent table User, child table Photo. Parent row <U1> (User.name = Jack) carries virtual repeated columns PhotosByTime.T1 = <P1> and PhotosByTime.T2 = <P2>; child rows <U1,P1> (Photo.time = T1, Photo.thumbUrl = TURL1) and <U1,P2> (Photo.time = T2, Photo.thumbUrl = TURL2).]
     CREATE INLINE INDEX PhotosByTime ON Photo(userId, time);
  24. Inline Indexes for Many-to-Many Relationships
     • Coupled with repeated indexes, inline indexes can also be used to implement many-to-many relationships more efficiently than by maintaining a many-to-many link table; see the sketch below.
     [Diagram: parent table User, child table Photo. Row <U1> (User.name = Jack) carries virtual repeated columns PhotosByTag.car = <P1>, PhotosByTag.dress = <P2>, PhotosByTag.girl = <P1>,<P2>; row <U2> (User.name = Tom) carries a PhotosByTag column with <P1>. Child rows: <U1,P1> (T1, TURL1), <U1,P2> (T2, TURL2), <U2,P1> (T3, TURL3).]
     CREATE INLINE INDEX PhotosByTag ON Photo(userId, tag);
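     A small Python illustration of the virtual-repeated-column idea above; the dict layout only mimics the diagram and is not Bigtable's actual representation.

     # Parent row <U1>: one repeated PhotosByTag column per tag, listing the
     # child Photo keys, so a per-user tag lookup is a single-row read.
     user_u1_row = {
         "User.name": "Jack",
         "PhotosByTag.car": ["<P1>"],
         "PhotosByTag.dress": ["<P2>"],
         "PhotosByTag.girl": ["<P1>", "<P2>"],  # many-to-many, no link table
     }

     def photos_with_tag(user_row: dict, tag: str) -> list:
         return user_row.get("PhotosByTag." + tag, [])

     assert photos_with_tag(user_u1_row, "girl") == ["<P1>", "<P2>"]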
  25. API
     • A cost-transparent API
       – Matches application developers' intuitions
       – High-volume interactive workloads benefit more from predictable performance than from an expressive query language.
     • Normalized relational schemas rely on joins at query time to service user operations; this is not the right model for Megastore applications.
       – Pre-joins
       – Denormalization
     • SQL-like schema language (DDL, for data structures and data placement)
       – Fine-grained control over physical locality
         • Hierarchical layouts (pre-joins)
         • Declarative denormalization
       – Eliminates the need for most joins
     • Query API against particular tables and indexes
       – Range scans
       – Lookups
     • Schema changes require corresponding modifications to the query implementation code
  26. Query Joins
     • Query joins, when required, are implemented in application code.
     • Index-based joins
     • Merge joins
       – Multiple queries return primary keys for the same table, in the same order.
       – Then take the intersection of their key sets.
     • Outer joins
       – An index lookup (returning a small result set),
       – then parallel index lookups using the results of the lookup above
     • Other joins ...?
  27. Query Joins: Merge Joins
     Query 1: SELECT * FROM Photo WHERE tag=girl
     Query 2: SELECT * FROM Photo WHERE tag=car
     Both use the global index PhotosByTag and return primary keys in the same order; take the intersection (&, AND) or union (|, OR) of the key streams.
     Just like:
       SELECT * FROM Photo WHERE tag=girl AND tag=car
       or SELECT * FROM Photo WHERE tag=girl OR tag=car
     Strictly, a merge join is not a real join in the lingo of SQL, but it really is a "join".
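     A hedged Python sketch of the merge intersection: because both index scans yield primary keys in the same ascending order, they can be intersected in a single pass without buffering. This is a minimal sketch, not Megastore's implementation.

     def merge_intersect(stream_a, stream_b):
         # Intersect two ascending iterators of primary keys in one pass.
         it_a, it_b = iter(stream_a), iter(stream_b)
         a, b = next(it_a, None), next(it_b, None)
         while a is not None and b is not None:
             if a == b:
                 yield a
                 a, b = next(it_a, None), next(it_b, None)
             elif a < b:
                 a = next(it_a, None)
             else:
                 b = next(it_b, None)

     # PhotosByTag scan results; primary keys are (userId, photoId) pairs.
     girl_keys = [(1, 1), (1, 2)]  # tag=girl
     car_keys = [(1, 1)]           # tag=car
     assert list(merge_intersect(girl_keys, car_keys)) == [(1, 1)]  # AND semantics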
  28. Query Joins: Outer Joins
     Query 1 (an index lookup, returning a small result set), supposing there is an index UsersByName:
       SELECT name, userId FROM User WHERE name=Jack
       -> userId = U1, U2
     Query 2 (parallel index lookups using Query 1's results, one per userId):
       SELECT thumbUrl FROM Photo WHERE time>T1 AND time<T10
     Just like:
       SELECT User.name, User.userId, Photo.thumbUrl
       FROM User LEFT OUTER JOIN Photo ON Photo.userId=User.userId
       WHERE User.name=Jack AND Photo.time>T1 AND Photo.time<T10
     Example result:
       Jack, U1, TURL1
       Jack, U2, NULL
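     A hedged Python sketch of this application-side outer join; the two lookup functions are hypothetical stand-ins for the index scans above, and the parallelism is a plain thread pool.

     from concurrent.futures import ThreadPoolExecutor

     def users_by_name(name):
         # Stand-in for the UsersByName index lookup (Query 1).
         return [("Jack", "U1"), ("Jack", "U2")]

     def photos_in_range(user_id, t1, t10):
         # Stand-in for a per-user PhotosByTime index lookup (Query 2).
         return {"U1": ["TURL1"]}.get(user_id, [])

     def left_outer_join(name, t1, t10):
         users = users_by_name(name)         # small driving lookup
         with ThreadPoolExecutor() as pool:  # parallel per-user lookups
             photo_lists = list(pool.map(
                 lambda u: photos_in_range(u[1], t1, t10), users))
         rows = []
         for (uname, uid), photos in zip(users, photo_lists):
             if photos:
                 rows.extend((uname, uid, p) for p in photos)
             else:
                 rows.append((uname, uid, None))  # LEFT OUTER: keep unmatched users
         return rows

     print(left_outer_join("Jack", "T1", "T10"))
     # [('Jack', 'U1', 'TURL1'), ('Jack', 'U2', None)]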
  29. Transactions and Concurrency Control
     • An EG is a mini-database with serializable ACID transactions.
     • Transactions within an EG
       – A transaction writes its mutations into the EG's WAL, then the mutations are applied to the data.
       – Readers use the timestamp of the last fully applied transaction to avoid seeing partial updates.
     • MVCC: multi-version concurrency control (very important)
       – Uses Bigtable cells' timestamps/versions.
       – Readers and writers don't block each other, and reads are isolated from writes for the duration of a transaction. (How? See MVCC in Wikipedia.)
     • Write pattern (see the sketch below)
       – A write transaction always begins with a current read to determine the next available log position. (This current read only ensures that all previously committed writes have been applied.)
       – The commit operation gathers mutations into a log entry, assigns it a timestamp higher than any previous one, and appends it to the log (using Paxos to replicate across DCs).
       – The write operation can return to the client at any point after commit.
     • Read patterns: current read, snapshot read, inconsistent read.
     [Diagram: a write op commits to the metadata and WAL in the EG root row in Bigtable; a read op checks for committed logs and recovers/applies them as needed; the apply (to table data and index data in Bigtable) may be asynchronous.]
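     A hedged sketch of this write pattern, continuing the EntityGroup/LogEntry sketch from slide 8; Paxos replication of the log append is elided, and apply_pending is a simplified stand-in for recovery plus apply.

     def write_transaction(eg, mutations):
         # 1. Current read: apply any committed-but-unapplied log entries; this
         #    also determines the next available log position and timestamp.
         apply_pending(eg)
         position = len(eg.wal)
         timestamp = eg.wal[-1].timestamp + 1 if eg.wal else 1  # strictly higher

         # 2. Commit: gather the mutations into one log entry and append it to
         #    the WAL (Megastore replicates this append across DCs via Paxos).
         eg.wal.append(LogEntry(position, timestamp, mutations))

         # 3. The write may return to the client here; applying the mutations
         #    to the data and indexes can happen asynchronously.
         apply_pending(eg)

     def apply_pending(eg):
         # Idempotent apply: each mutation lands in an MVCC cell keyed by its
         # log entry's timestamp (standing in for Bigtable cell versions).
         for entry in eg.wal:
             for row, column, value in entry.mutations:
                 eg.data.setdefault(row, {})[(column, entry.timestamp)] = value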
  30. Transactions and Concurrency Control: Write
     [Figure: the state of the transaction system for an EG. The metadata tracks the last committed position (ts2) and the last fully applied position (ts1). The WAL holds Transaction-1 (committed and fully applied; Mutation-11 and Mutation-12 at ts1), Transaction-2 (committed but not fully applied; Mutation-21 and Mutation-22 at ts2), and Transaction-3 (ongoing: writing, not yet gathered and appended into the log). The data cells carry Bigtable timestamps for MVCC: Data-part1-ts1, Data-part2-ts1, Data-part1-ts2.]
     When a failure occurs at this point:
     • Transaction-1 (committed, fully applied): very safe.
     • Transaction-2 (committed but not fully applied): safe, no data loss; the partially applied data is recovered from the committed log when a later "current read" or "write" operation runs.
     • Transaction-3 (not complete): failed; the application gets a failure return.
     Note: the commit operation gathers mutations into a log entry and assigns a timestamp to it. A write transaction always begins by ensuring that all previously committed writes are applied (via a current read)!
  31. Transaction Read Patterns and Lifecycle
     • Current read (see the sketch after this slide)
       – Only within an EG.
       – When starting a current read, first ensure that all previously committed writes are applied (just like recovery from commit logs).
       – Then the application reads at the timestamp of the latest committed transaction.
     • Snapshot read
       – Only within an EG.
       – Picks up the timestamp of the last known fully applied transaction and reads from there.
       – Some committed transactions may not yet be applied.
     • Inconsistent read
       – Reads the latest values directly; may see partially applied data; for aggressive latency.
     • A complete transaction lifecycle
       – Read: get the timestamp of the last committed transaction from the metadata.
       – Application logic: read-modify-write.
       – Commit: gather mutations into a log entry, assign it a higher timestamp, replicate across DCs via Paxos; can return to the client here.
       – (The following steps may be asynchronous.)
       – Apply: write mutations into the data and indexes.
       – Clean up: delete fully applied log entries.
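     A hedged sketch of the three read patterns, continuing the earlier sketches; the last-fully-applied bookkeeping is simplified to an explicit parameter.

     def read_at(eg, row, column, ts):
         # MVCC read: the newest value of (row, column) with timestamp <= ts.
         versions = [(t, v) for (c, t), v in eg.data.get(row, {}).items()
                     if c == column and t <= ts]
         return max(versions)[1] if versions else None

     def current_read(eg, row, column):
         apply_pending(eg)  # first ensure committed writes are applied
         latest_committed = eg.wal[-1].timestamp if eg.wal else 0
         return read_at(eg, row, column, latest_committed)

     def snapshot_read(eg, row, column, last_fully_applied_ts):
         # May not see committed-but-unapplied transactions.
         return read_at(eg, row, column, last_fully_applied_ts)

     def inconsistent_read(eg, row, column):
         # Latest values directly; the caller tolerates partially applied data.
         return read_at(eg, row, column, float("inf"))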
  32. Transaction Read Patterns: Current Read
     [Figure: (1) check the latest committed writes in the metadata (last committed position ts2, last fully applied position ts1); (2) apply the previously committed writes (Transaction-2's mutations at ts2) to the data; (3) update the metadata (last fully applied position ts1 -> ts2); (4) read the data at ts2.]
     Do the recovery write before reading the data.
  33. Transaction Read Patterns: Snapshot Read
     [Figure: (1) get the last fully applied timestamp (ts1) from the metadata; (2) read the data at ts1. Committed-but-not-fully-applied transactions (Transaction-2 at ts2) are not seen.]
     The very easy read pattern.
  34. Transaction Read Patterns: Inconsistent Read
     [Figure: (1) directly read the latest data, which may include partially applied mutations (e.g., Data-part1 already at ts2 while Data-part2 is still at ts1).]
     The application must tolerate stale or partially applied data.
  35. Two-Phase Commit: Expensive, Long Latency
  36. Replication for High Availability ... (I need to study Paxos more, so it is not covered in detail here.)
  37. Replication
     • Within a DC
       – Across hosts
       – Built in, from Bigtable and GFS
     • Across DCs
       – ... synchronous and consistent for each write
  38. Replication Across DCs
     • Traditional strategies (do not work)
       – Asynchronous master/slave
         • Asynchronously propagates each write
         • The master supports fast ACID transactions; low latency
         • Data-loss risk; downtime for failover; heavyweight master
         • Requires a mediating mastership service (e.g., ZooKeeper)
       – Synchronous master/slave
         • No data loss
         • Downtime for failover; long latency; heavyweight master
         • Requires a mediating mastership service (e.g., ZooKeeper)
       – Optimistic replication
         • No distinguished master; asynchronously propagates
         • Availability and latency are excellent
         • But no mutation ordering, so transactions are impossible
         • Like Cassandra/Dynamo
     • Megastore: EG-based synchronous replication of each write
       – Uses Paxos; no distinguished master
       – Replicates the write-ahead log
       – Writes are replicated synchronously: each log append blocks on acknowledgments from a majority of replicas, and replicas in the minority catch up as they are able (see the sketch below)
       – Any node can initiate writes and reads
       – Reasonable latency
       – Extensions
         • Allows local reads at any up-to-date replica
         • Permits single-roundtrip writes
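     A hedged Python sketch of the majority-acknowledgment rule for log appends; it deliberately elides Paxos proposal numbers, leader-free coordination, and catch-up, so it is a shape sketch rather than Megastore's replicator. The Replica class is a hypothetical stand-in.

     from concurrent.futures import ThreadPoolExecutor, as_completed

     class Replica:
         def __init__(self, up: bool = True):
             self.up = up
             self.log = []

         def append(self, entry) -> bool:
             if self.up:
                 self.log.append(entry)
             return self.up  # ack only if the replica is reachable

     def replicate_append(replicas, log_entry) -> bool:
         # The write is durable once a majority acks; the minority catches up later.
         majority = len(replicas) // 2 + 1
         pool = ThreadPoolExecutor(max_workers=len(replicas))
         futures = [pool.submit(r.append, log_entry) for r in replicas]
         acks, committed = 0, False
         for future in as_completed(futures):
             if future.result():
                 acks += 1
             if acks >= majority:
                 committed = True
                 break
         pool.shutdown(wait=False)  # don't block on slow minority replicas
         return committed

     assert replicate_append([Replica(), Replica(), Replica(up=False)], "entry-1")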
  39. Paxos
     • Traditional usages
       – Locking
       – Master election
       – Replication of metadata and configurations
     • Megastore's use of Paxos
       – Replicate primary user data across DCs on every write
       – For across-DC high availability
  40. To Study More ...
  41. Valuable References
     • P. Helland. Life beyond distributed transactions: an apostate's opinion. In CIDR, pages 132-141, 2007.
       – The philosophical inspiration for Megastore