Learning from Google Megastore

 Part 1: Data Model and Transactions in a Single Datacenter
                (w/o Replication and Paxos)

                     Schubert Zhang
                      April 9, 2011
Megastore Introduction




2012/3/25                            2
Three Aspects of Megastore

• Data Model to be a DB
   – Data layout
   – Indexing


• Transactions and ACID
   – Within Entity Group
   – Across Entity Group


• Replication across datacenters (not covered in detail in
  this presentation)
   – Synchronous replication
   – Optimized Paxos




What is Megastore?

                              Megastore is:

                        A database over Bigtable,
                with High Availability across datacenters.



                          Bigdata Philosophy:

          fine-grained partitioning to make things easy,
                   data placement for relations,
                             and Paxos

    then, a simple API/language for convenience of use!


Target Applications
•   Interactive online services        •   Application developers
     – User facing applications            – Be familiar with RDBMS, SQL
                                           – Difficult to give up “read-
•   Conflicting requirements                 modify-write” idiom
     – Highly scalable (size,              – But now need high scalability
       throughput)                           for bigdata
     – Rapid development, fast time-
       to-market
     – Responsive, Low latency
     – Consistent view of data
     – Highly available

•   Reads vs. Writes
     – 20 billion:3 billion, daily
       @Google
     – 7:1

•   Bigdata
     – Petabyte of primary data
     – Across datacenters
       2012/3/25                                                 5
NoSQL + RDBMS = Megastore

•   NoSQL datastore (Bigtable)
    – Pros
         • Highly scalable
         • Highly available within a DC (across hosts)
    – Cons
         • Limited API
         • Loose consistency models
         • Complicates application development

•   RDBMS
    – Pros
         • Rich set of features for convenient, rapid application development
         • Transactions
         • ACID semantics
    – Cons
         • Difficult to scale

•   Megastore database (a blend of both)
    – High scalability
    – Distributed transactions
    – Consistency guarantees
    – Fully serializable ACID semantics within entity groups
    – Convenience, rapid development for applications

•   + High Availability
    – Within-DC (Bigtable)
    – Across-DC replication, Paxos (synchronous writes within an EG)
    – Strong consistency guarantees (synchronous replication)
    – Reasonable latency, seamless failover
Design Principles

•   Taking a middle ground in the RDBMS vs. NoSQL design space:
    –   partition the datastore and
    –   replicate each partition separately,
    –   providing full ACID semantics within partitions,
    –   but only limited/loose consistency guarantees across them.


•   Use Paxos to build a highly available system:
    – provides reasonable latencies for interactive applications while
    – synchronously replicating writes across geographically distributed
      datacenters,
    – to achieve across-DC high availability and a consistent view of the data.


•   Approaches:
    – for database scale, partitioning data into a vast space of small
      databases, each with its own replicated log stored in a per-replica
      Bigtable;
    – for availability, implementing a synchronous, fault-tolerant log replicator
      optimized for cross-DC replication.

EG: Entity-Groups
•   The Entity-Group concept is the cornerstone of scalability and availability!
     –   Fine-grained partitions of data
     –   Fine-grained control over data partitioning and locality
     –   Like many mini-databases
     –   Scales throughput and localizes outages
     –   Each independently and synchronously replicated across DCs

•   A physical EG in Bigtable consists of
     –   A write-ahead log (for ACID transactions)
     –   Related data (pre-joined)
     –   Local indexes (also with ACID)
     –   An inbox for receiving across-EG messages
     –   … like a mini-database (locally complete)

•   Size of an EG
     –   Not too large, not too small
     –   An a priori/natural or deliberate grouping of data for fast operations
     –   If too large: serializable ACID causes long latency and low throughput
     –   If too small: many expensive across-EG consistency operations (e.g. 2PC),
         or looser-consistency asynchronous messaging

     “The data for most Internet services can be suitably partitioned (e.g., by
     user) to make this approach viable.”
     “Nearly all applications built on Megastore have found ways to draw EG
     boundaries.”
Schematic Diagrams
[Figure: Megastore layout in Bigtable — each EG is like a mini-DB, consisting
 of a WAL (logs), Primary Data, Local Indexes, and an Inbox for queue
 messages; EG 1, EG 2, …, EG n are laid out side by side.]
Many WAL vs. Single WAL

• Many replicated logs, each governing its own EG, improve
  availability and throughput.
   – Independent and concurrent operations for different EGs
   – Only operations within an EG need to be serialized
   – Temporary long waits and failed operations do not impact
     other EGs

• Many WALs scale throughput and localize outages

• The WAL is stored with each EG in Bigtable

• Examples with the same tenet
   – The asynchronous and concurrent RPC communication
     framework of HBase and Hadoop IPC.



Consistency Levels and the Approaches
•   Within each EG: Full ACID semantics
     –   Single-Phase-Commit ACID transactions
     –   And the commit entry is replicated via Paxos across DCs

•   Across-EG: Limited consistency guarantees (two methods for two levels)
     –   Two-Phase-Commit (expensive, long latency) -> strong consistency
     –   Or, typically, leverage efficient asynchronous messaging (queues!, inexpensive, low latency) ->
         loose (or eventual) consistency


•   Two-phase-commit vs. asynchronous-messaging
     –   Two-Phase-Commit transactions
           •   Strong consistency
           •   Expensive
           •   Long latency and low throughput
           •   Usually for low-traffic operations
     –   Asynchronous-messaging
            •   Loose consistency; may be temporarily inconsistent (eventually consistent)
           •   Inexpensive
           •   Usually for heavy-traffic operations


•   Objects to be made consistent:
     –   Data, Local Indexes, within EG : strong (via WAL, ACID)
     –   Data, Global Indexes, cross-EG : strong (via 2PC) or looser (via messaging)
     –   Replicas within DC : strong (via GFS and Bigtable)
     –   Replicas across DC : strong (via Paxos)
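The asynchronous-messaging path above can be sketched as a toy model (class and method names are hypothetical; in Megastore the enqueue happens inside the sender's ACID transaction and the inbox drain inside the receiver's own transaction):

```python
from collections import deque

class EG:
    """Toy entity group: some data plus an inbox for across-EG messages."""
    def __init__(self):
        self.data = {}
        self.inbox = deque()

    def send(self, other, message):
        # In Megastore the enqueue is part of the sending EG's ACID
        # transaction; here it is just an append to the peer's inbox.
        other.inbox.append(message)

    def drain_inbox(self):
        # Applied atomically within the receiving EG's own transaction;
        # until it runs, the receiver's data is stale (eventual consistency).
        while self.inbox:
            key, value = self.inbox.popleft()
            self.data[key] = value

a, b = EG(), EG()
a.send(b, ("likes", 1))   # b.data is stale until b drains its inbox
b.drain_inbox()
```

This illustrates why the queue approach is cheap: the sender never waits on the receiver, at the cost of a window of inconsistency.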
The two Faces of ACID Transactions

• Front face:
  – Simplifies application development
  – Makes reasoning about correctness easier

• Back face:
  – Reduced performance
  – Higher latency
  – Lower throughput




Architecture of Megastore – How is it deployed?

• How it is deployed
   – a client library (DB logic)
   – and auxiliary servers (for across-DC replication)


• Applications link to the client library




Data Model and Semantics
                to be a database …




Principles to be a DBMS


• Provides traditional database features, such as secondary
  indexes, etc.

• but only those features that can scale within user-tolerable
  latency limits,

• and only with the semantics that EG partitioning scheme can
  support.



                 Feature set carefully chosen, tradeoffs.




Data Model (concepts for database)

•   A Data Model is a notation for describing data or information.

•   Consists of 3 parts, generally
     – Structure of the data
     – Operations on the data
     – Constraints on the data


•   Megastore Data Model: Relational Model + Scale
     – Limited relational model
     – Bigtable’s scalability


•   High-Level Model vs. Physical-Level Model
     – Physical level
         • Complicates application development
         • Bigtable’s data model is at the physical level
     – High level
         • Lets programmers write code conveniently
         • Language, SQL

Data Model

•   Schemaful
     −   Strongly typed (primitives or Protocol Buffers)
     −   Required, optional or repeated
     −   All entities in a table have the same set of allowable properties
     −   Nested Protocol Buffers?

•   Primary key
     –   Built from a sequence of properties
     –   Must be unique within the table

•   An EG = a root entity + all entities in child tables that reference it

Schema hierarchy: a Schema (name) contains Tables (name); each Table contains
Entities (primary key); each Entity has Properties (name, type).
E.g. Schema -> Table-1 (Entity-11, Entity-12), Table-2 (Entity-21, Entity-22).

Related hierarchical data: an EG root table such as User (EG key), with child
tables such as Photo and Book (foreign key = EG key), each holding entities.
SQL-Like Schema Language (DDL)

CREATE SCHEMA DemoApp;

CREATE TABLE User {
    required int64 userId;
    required string name;
} PRIMARY KEY(userId), ENTITY GROUP ROOT;

CREATE TABLE Photo {
    required int64 userId;
    required int32 photoId;
    required int64 time;
    required string url;
    optional string thumbUrl;
    repeated string tag;
} PRIMARY KEY(userId, photoId),
  IN TABLE User,
  ENTITY GROUP KEY(userId) REFERENCES User;

CREATE TABLE Book {
    required int64 userId;
    required int32 bookId;
    required int64 time;
    required string url;
    repeated string tag;
} PRIMARY KEY([DESC|ASC|SCATTER] userId, [DESC|ASC|SCATTER] bookId),
  IN TABLE User,
  ENTITY GROUP KEY(userId) REFERENCES User;

CREATE LOCAL INDEX PhotosByTime
       ON Photo(userId, time);

CREATE LOCAL INDEX BooksByTime
       ON Book([DESC|ASC|SCATTER] userId, [DESC|ASC] time);

CREATE GLOBAL INDEX PhotosByTag
       ON Photo(tag) STORING (thumbUrl);

Additional qualifiers: DESC | ASC | SCATTER
Data Placement in Bigtable (principles)
Pre-join with Keys, for performance …
•   Lets applications control the placement of hierarchical/related data, to
    minimize latency and maximize throughput
     –   Storing data that is accessed together in nearby rows, or
     –   Denormalized into the same row


•   The data for an EG are held in contiguous ranges of Bigtable rows, for
     –   Low latency
     –   High throughput
     –   Cache efficiency


•   Pre-Joining with keys
     –   Primary keys to cluster entities that will be read together.
     –   Each entity maps into a single Bigtable row.
     –   Primary key values are concatenated to form the Bigtable row key
     –   Each remaining property occupies its own Bigtable column
     –   Entity-group key as the prefix of Primary key (row key)
     –   Sorted keys ascending or descending
     –   SCATTER (two-byte hash prefix), to prevent hotspots in Bigtable

     –   Recursive for arbitrary join depths (multiple levels of “IN TABLE”)
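The key-concatenation rules above can be sketched roughly as follows (the function and encoding are hypothetical illustrations; a real codec must be fully order-preserving, escape the separator, and handle sign bits):

```python
import hashlib
import struct

def row_key(primary_key_values, scatter=False):
    """Illustrative only: concatenate primary-key components into one
    Bigtable row key. The EG key is the first component, so all of an
    EG's entities sort into a contiguous row range."""
    parts = []
    for v in primary_key_values:
        if isinstance(v, int):
            # Fixed-width big-endian keeps numeric order (for non-negatives).
            parts.append(struct.pack(">q", v))
        else:
            parts.append(v.encode("utf-8"))
    key = b"\x00".join(parts)
    if scatter:
        # SCATTER: a two-byte hash prefix spreads sequential keys
        # across tablets to prevent Bigtable hotspots.
        key = hashlib.md5(key).digest()[:2] + key
    return key

# The EG key (userId) prefixes every entity's row key, e.g. Photo (1, 7):
assert row_key([1]) == row_key([1, 7])[:8]
```

Note the tradeoff SCATTER makes explicit: hashing kills the contiguous range scan in exchange for write-load balance.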

Data Placement in Bigtable (details)
Pre-join with Keys, for performance …
•   Bigtable row key = primary key of each table

•   Bigtable column name = <table name>.<property name>
    – Allowing entities from different Megastore tables to be mapped into the
      same Bigtable row without collision.


•   Store the transaction and replication log and metadata for the EG
    in the root entity’s Bigtable row.
    – Because Bigtable provides per-row transactions.


•   Indexes: Each index entry is represented as a single Bigtable row
    – Bigtable row key = <indexed property values> + <primary key>
    – Bigtable cell columns: denormalized properties
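A rough sketch of how such index rows could be derived from entities (all names and structures here are hypothetical stand-ins; the real row-key encoding is more involved):

```python
def index_rows(entities, prop, storing=None):
    """Hypothetical sketch: build global-index rows for one (possibly
    repeated) property. Index row key = <indexed property value> +
    <primary key>; a STORING property is copied (denormalized) into
    the index row to avoid a random access back to the entity."""
    rows = {}
    for pkey, props in entities.items():
        values = props.get(prop, [])
        if not isinstance(values, list):
            values = [values]
        for v in values:                      # one index row per value
            cols = {storing: props[storing]} if storing else {}
            rows[(v,) + pkey] = cols
    return rows

photos = {
    ("U1", "P1"): {"tag": ["girl", "car"], "thumbUrl": "TURL1"},
    ("U1", "P2"): {"tag": ["dress", "girl"], "thumbUrl": "TURL2"},
}
idx = index_rows(photos, "tag", storing="thumbUrl")
# e.g. the row keyed ("girl", "U1", "P1") stores {"thumbUrl": "TURL1"}
```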




Data Placement in Bigtable (examples)

                                                                                                                              STORING
                                                    Transaction Meta User Table                 Photo Table                 Denormalized

                                   Row Key          Root. Root.        User.      Photo. Photo. Photo.        Photo.        PhotosByTag.
                                                    WAL meta           name       time   url    thumbUrl      tag           thumbUrl
                                   <U1>             Log3    commit     Jack
                                                    Log2    offset
   Root
   User




                                                    Log1    applied
                                                            offset …




                                                                                                                                           EG for U1
                                   <U1,P1>                                        T1     URL1     TURL1       girl, car
Photo Local Index Global Index
 Data PhotosByTime PhotosByTag




                                   <U1,P2>                                        T2     URL2     TURL2       dress, girl

                                   <U1,T1><U1,P1>

                                   <U1,T2><U1,P2>

                                   <car><U1,P1>                                                                             TURL1

                                   <dress><U1,P2>                                                                           TURL2

                                   <girl><U1,P1>                                                                            TURL1

                                   <girl><U1,P2>                                                                            TURL2


                                       2012/3/25                                                                            21
Secondary Indexes

•   Secondary indexes can be declared on any list of entity
    properties (optional properties are OK), including repeated properties,
    sub-fields within Protocol Buffers, and full-text indexes.

•   Local Indexes
     – Within EG
     – Obey ACID semantics
         • The index entries are stored in the entity group and are updated atomically
           and consistently with the primary entity data.


•   Global Indexes
     – Span EGs
     – Looser consistency (possibly only eventual)
         • Not guaranteed to reflect all recent updates. (may be inconsistent
           with the primary data?)
         • Keeping Global Indexes consistent with the primary data is tricky!?
                   – Special Two-Phase-Commit? and
                   – Read-Repair?



Secondary Indexes and Denormalization
•   STORING clause for copied data in index entities
      –   Avoids the indirect access to primary entities, which is very
          expensive random access.
      –   But keeping the copies consistent is an issue!

•   Inline Indexes
      –   Index entries from the source entities appear as a virtual repeated
          column in the target entity.
      –   An inline index can be created on any table (child) that has a
          foreign key referencing another table (parent), by using the first
          primary key of the target entity as the first component of the index.

Inline index layout (repeated columns inlined into the parent row):

  Row Key   User.name   PhotosByTime.T1   PhotosByTime.T2   Photo.time   Photo.thumbUrl
  <U1>      Jack        <P1>              <P2>
  <U1,P1>                                                   T1           TURL1
  <U1,P2>                                                   T2           TURL2

          CREATE INLINE INDEX PhotosByTime ON Photo(userId, time);
Inline Indexes for many-to-many
Relationships
•      Coupled with repeated indexes, inline indexes can also be used to
       implement many-to-many relationships more efficiently than by maintaining
       a many-to-many link table.

Inline index layout (repeated columns inlined into the parent row):

  Row Key   User.name   PhotosByTag.car   PhotosByTag.dress   PhotosByTag.girl   Photo.time   Photo.thumbUrl
  <U1>      Jack        <P1>              <P2>                <P1>, <P2>
  <U1,P1>                                                                        T1           TURL1
  <U1,P2>                                                                        T2           TURL2
  <U2>      Tom                                               <P1>
  <U2,P1>                                                                        T3           TURL3

          CREATE INLINE INDEX PhotosByTag ON Photo(userId, tag);
API
•   Cost-transparent API
     –   Match application developers’ intuitions
     –   High-volume interactive workloads benefit more from predictable performance than from an
         expressive query language.


•   Normalized relational schemas, which rely on joins at query time to service
    user operations, are not the right model for Megastore applications.
     –   Pre-joins
     –   Denormalization


•   SQL-Like Schema language (DDL, for data structures and data placement)
     –   Fine-grained control over physical locality
           •   Hierarchical layouts (pre-joins)
           •   Declarative denormalization
     –   Eliminate the need for most joins


•   Queries API against particular tables and indexes
     –   Range Scans
     –   Lookups


•   Schema changes require corresponding modifications to the query implementation
    code



Query Joins

• Query Joins, when required, are implemented in application
  code.

• Index-based join

• Merge joins
   – Multiple queries return primary keys for the same table, in the
     same order.
   – Then take the intersection of the returned keys.


• Outer joins
   – Index lookup (return small result set)
   – Parallel index lookups using the results of the above lookup


• Other joins …?

Query Joins - Merge Joins

  Query-1: SELECT * FROM Photo WHERE tag=girl
  Query-2: SELECT * FROM Photo WHERE tag=car

Both queries use the global index PhotosByTag and return primary keys in the
same order; the key streams are then intersected (&) or unioned (|):
girl & car, or girl | car.

                                          Just like:
                        SELECT * FROM Photo WHERE tag=girl AND tag=car
                                              or
                         SELECT * FROM Photo WHERE tag=girl OR tag=car

  Strictly, a Merge Join is not a real join in the lingo of SQL, but it is
  really a “Join”.
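The merge-join described above can be sketched as a stand-alone function over already-sorted key streams (names and data are hypothetical; Megastore performs this in application code over index query results):

```python
def merge_join(stream_a, stream_b, mode="and"):
    """Merge two sorted streams of primary keys, as an application-level
    merge join would: each index query returns keys for the same table
    in the same order, so one linear pass suffices."""
    result, i, j = [], 0, 0
    while i < len(stream_a) and j < len(stream_b):
        if stream_a[i] == stream_b[j]:          # key present in both streams
            result.append(stream_a[i])
            i, j = i + 1, j + 1
        elif stream_a[i] < stream_b[j]:
            if mode == "or":                    # union keeps unmatched keys
                result.append(stream_a[i])
            i += 1
        else:
            if mode == "or":
                result.append(stream_b[j])
            j += 1
    if mode == "or":                            # drain the leftovers
        result.extend(stream_a[i:])
        result.extend(stream_b[j:])
    return result

# Primary keys returned by the PhotosByTag index for tag=girl and tag=car:
girl = [("U1", "P1"), ("U1", "P2")]
car = [("U1", "P1")]
# mode="and" -> girl & car;  mode="or" -> girl | car
```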
Query Joins - Outer Joins

Query-1 (index lookup, returning a small result set):
  SELECT name, userId FROM User WHERE name=Jack
  (suppose there is an index: UsersByName)
  -> name=Jack, userId=U1,U2

Query-2 (parallel index lookups, one per userId from Query-1):
  SELECT thumbUrl FROM Photo WHERE time>T1 AND time<T10;
  … in parallel for each userId.

                                          Just like:
                  SELECT User.name, User.userId, Photo.thumbUrl FROM User
                     LEFT OUTER JOIN Photo ON Photo.userId=User.userId
                 WHERE User.name=Jack AND Photo.time>T1 AND Photo.time<T10

                                      Example of result:
                                      Jack, U1, TURL1
                                      Jack, U2, NULL
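The two-step outer join above can be sketched as follows (all data, names, and the thread pool are hypothetical stand-ins; dictionaries play the roles of the UsersByName index and the Photo table, and integer times stand in for T1..T10):

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-ins for the UsersByName index and the
# Photo table: userId -> [(time, thumbUrl)].
USERS_BY_NAME = {"Jack": [("U1", "Jack"), ("U2", "Jack")]}
PHOTOS = {"U1": [(5, "TURL1")], "U2": []}

def outer_join(name, t_lo=1, t_hi=10):
    """Application-level outer join: one index lookup for the users,
    then parallel per-user index lookups for their photos; users with
    no matching photo are kept with a NULL (left outer join)."""
    users = USERS_BY_NAME.get(name, [])

    def photos_for(user):
        uid, uname = user
        rows = [(uname, uid, thumb)
                for t, thumb in PHOTOS.get(uid, [])
                if t_lo < t < t_hi]
        return rows or [(uname, uid, None)]   # keep the unmatched user

    with ThreadPoolExecutor() as pool:        # the "parallel index lookups"
        per_user = pool.map(photos_for, users)
    return [row for rows in per_user for row in rows]
```

Here `outer_join("Jack")` yields the Jack/U1/TURL1 row plus a Jack/U2/None row, matching the example result on the slide.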
Transactions and Concurrency Control
•   An EG is a mini-database with serializable ACID transactions.
•   Transactions within-EG
     –   A transaction writes its mutations into the EG's WAL, then the mutations are applied to the data.
     –   Readers use the timestamp of the last fully applied transaction to avoid seeing partial updates.

•   MVCC: Multi-Version Concurrency Control (very important)
     –   Uses Bigtable cells’ timestamps/versions
     –   Readers and writers don’t block each other, and reads are isolated from writes for the duration
         of a transaction. (How? See MVCC in Wikipedia.)

•   Write patterns
     –   A write transaction always begins with a current read to determine the next available log
         position. (This current read ensures that all previously committed writes have been applied.)
     –   The commit operation gathers mutations into a log entry, assigns it a timestamp higher than
         any previous one, and appends it to the log (using Paxos to replicate across DCs).
     –   The write operation can return to the client at any point after Commit.
•   Read patterns
     –   Current Read
     –   Snapshot Read
     –   Inconsistent Read

[Figure: a Write Op commits to the metadata and WAL of the EG root in
 Bigtable; the apply step (which may be asynchronous) then writes the table
 and index data in Bigtable; a Read Op first checks for and recovers
 committed-but-unapplied log entries.]
Transactions and Concurrency Control -
                      Write

[Figure: the state of the transaction system for an EG. The WAL holds log
 entries; metadata records the last committed position (ts2) and the last
 fully applied position (ts1). The commit operation gathers a transaction's
 mutations into a log entry and assigns it a timestamp; the data cells use
 Bigtable timestamps for MVCC.]

When a failure occurs, the three transaction states behave differently:

•  Transaction-1 (committed and applied): Mutation-11-ts1 and Mutation-12-ts1
   are in the WAL; Data-part1-ts1 and Data-part2-ts1 are in the data.
   Very safe.

•  Transaction-2 (committed but not fully applied): Mutation-21-ts2 and
   Mutation-22-ts2 are in the WAL, but only Data-part1-ts2 has been applied.
   Safe, no data loss, but the partially applied data must be recovered from
   log to data when a later “current read” or “write” operation runs.

•  Transaction-3 (ongoing, writing but not committed): its mutations have not
   yet been gathered and appended into the log. Not complete, failed; the
   application will get a failure return.

   Note: The commit operation gathers mutations into a log entry and assigns
   a timestamp to it. A write transaction always begins by ensuring that all
   previously committed writes are applied (via a current read)!
Transactions Read Patterns and Lifecycle

•   Current Read
    – Only within-EG.
    – When starting a current read, the transaction system first ensures that all
      previously committed writes are applied (just like the recovery of commit logs).
    – Then the application reads at the timestamp of the latest committed transaction.

•   Snapshot Read
    – Only within-EG.
    – Picks up the timestamp of the last known fully applied transaction and reads
      from there.
    – Some committed transactions may not yet be applied.

•   Inconsistent Read
    – Reads the latest values directly; may see partially applied data.
      For aggressive latency.

•   A complete transaction lifecycle
    – Read
        • Get the timestamp of the last committed transaction from the metadata.
    – Application logic
        • Read-modify-write.
    – Commit
        • Gather mutations into a log entry and assign it a higher timestamp.
        • Replicate across-DC via Paxos.
        • Can return to the client here.
    ----------------------------------------
    (the following steps may be asynchronous)
    ----------------------------------------
    – Apply
        • Write mutations into data and indexes.
    – Clean up
        • Delete fully applied log entries.

       2012/3/25                                                         31
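The lifecycle above (commit to a WAL, apply asynchronously, clean up) can be sketched in a few lines. This is a minimal in-memory toy, not the Megastore API: all class and method names here (`EntityGroup`, `commit`, `apply`, `clean_up`) are illustrative, and the across-DC Paxos replication step is elided.

```python
import itertools

class EntityGroup:
    """Toy in-memory entity group: a WAL plus versioned data (MVCC by timestamp).
    Illustrative sketch only; names do not come from Megastore itself."""

    def __init__(self):
        self.wal = {}            # timestamp -> list of (key, value) mutations
        self.data = {}           # key -> {timestamp: value}  (Bigtable-style cell versions)
        self.last_committed = 0  # metadata: last committed log position
        self.last_applied = 0    # metadata: last fully applied log position
        self._clock = itertools.count(1)

    def commit(self, mutations):
        """Commit: gather mutations into one log entry with a fresh, higher timestamp.
        (Across-DC Paxos replication of the entry is elided in this sketch.)"""
        ts = next(self._clock)
        self.wal[ts] = list(mutations)
        self.last_committed = ts
        return ts                # the write could return to the client here

    def apply(self):
        """Apply (may run asynchronously): move committed log entries into the data."""
        for ts in sorted(t for t in self.wal if t > self.last_applied):
            for key, value in self.wal[ts]:
                self.data.setdefault(key, {})[ts] = value
            self.last_applied = ts

    def clean_up(self):
        """Clean up: delete fully applied log entries."""
        for ts in [t for t in self.wal if t <= self.last_applied]:
            del self.wal[ts]

eg = EntityGroup()
ts = eg.commit([("photo:1", "url1")])   # committed, not yet applied
eg.apply()                              # mutations reach the data tables
eg.clean_up()                           # the log entry can now be dropped
```

Note how `commit` can return before `apply` runs; the gap between `last_committed` and `last_applied` is exactly what distinguishes the three read patterns on this slide.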
Transactions Read Patterns – Current Read

Figure: The state of the transaction system for an EG (Bigtable timestamps are used
for MVCC). A current read proceeds as:

    (1) Check the latest committed writes (last committed position, ts2, in the metadata).
    (2) Apply previously committed-but-not-yet-applied writes from the WAL.
    (3) Update the metadata: last fully applied position (ts1) -> (ts2).
    (4) Read the data at ts2.

                               Do the recovery write before reading the data.

    2012/3/25                                                                                32
Transactions Read Patterns – Snapshot Read

Figure: The state of the transaction system for an EG (Bigtable timestamps are used
for MVCC). A snapshot read proceeds as:

    (1) Get the last fully applied timestamp (ts1) from the metadata.
    (2) Read the data at ts1 (transactions committed but not yet applied, e.g. at ts2,
        are not seen).

                                           The simplest read pattern.

    2012/3/25                                                                                33
Transactions Read Patterns – Inconsistent Read

Figure: The state of the transaction system for an EG (Bigtable timestamps are used
for MVCC). An inconsistent read proceeds as:

    (1) Directly read the latest data, ignoring the state of the log; the reader
        may see partially applied data.

                    The application must tolerate stale or partially applied data.

    2012/3/25                                                                                34
Two-Phase-Commit


                     Expensive, long latency.


    2012/3/25                                35
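Why cross-EG two-phase commit is expensive: it needs two blocking rounds (prepare, then commit) across every participating entity group, and a single slow or failed participant stalls all of them. A toy coordinator sketch (the `Participant` class and its methods are hypothetical illustrations, not Megastore's actual protocol):

```python
class Participant:
    """Toy participant (e.g., one entity group) in a two-phase commit.
    Illustrative only; the real cross-EG protocol details are not shown here."""

    def __init__(self, name):
        self.name = name
        self.staged = None   # mutation durably staged during phase 1
        self.value = None    # visible value after phase 2

    def prepare(self, value):
        self.staged = value  # stage the mutation and vote "yes"
        return True

    def commit(self):
        self.value, self.staged = self.staged, None

    def abort(self):
        self.staged = None

def two_phase_commit(participants, value):
    # Phase 1: prepare. Every participant must vote yes (first blocking round-trip).
    if all(p.prepare(value) for p in participants):
        # Phase 2: commit everywhere (second blocking round-trip).
        for p in participants:
            p.commit()
        return True
    # Any "no" vote (or timeout) aborts everywhere -- the whole transaction
    # pays the latency of both rounds either way.
    for p in participants:
        p.abort()
    return False

egs = [Participant("EG-A"), Participant("EG-B")]
ok = two_phase_commit(egs, "x")
print(ok)  # True
```

The two sequential rounds are the "long latency" on this slide, and the reason Megastore usually prefers asynchronous messaging across entity groups.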
Replication
             for High Availability …

Paxos needs further study, so it is not covered in detail here.




 2012/3/25                                   36
Replication

• Within-DC
   – Across hosts
   – Provided by Bigtable and GFS


• Across-DC
   – … synchronous and consistent for each write




     2012/3/25                                     37
Replication cross-DC

•   Traditional strategies (do not work)
     –   Asynchronous Master/Slave
           •   Asynchronously propagates writes
           •   Master supports fast ACID transactions
           •   Low latency
           •   Data-loss risk
           •   Downtime during failover
           •   Heavyweight master
           •   Requires a service to mediate mastership (e.g., ZooKeeper)

     –   Synchronous Master/Slave
           •   No data loss
           •   Downtime during failover
           •   Long latency
           •   Heavyweight master
           •   Requires a service to mediate mastership (e.g., ZooKeeper)

     –   Optimistic Replication
           •   No distinguished master
           •   Asynchronously propagates writes
           •   Availability and latency are excellent
           •   No mutation ordering, so transactions are impossible
           •   Like Cassandra/Dynamo

•   Megastore: EG-based, synchronously replicate each write
     –   Use Paxos
           •   No distinguished master
           •   Replicate the write-ahead log
           •   Synchronously replicate writes (each log append blocks on
               acknowledgments from a majority of replicas; replicas in the
               minority catch up as they are able)
           •   Any node can initiate reads and writes
           •   Reasonable latency
     –   Extensions
           •   Allow local reads at any up-to-date replica
           •   Permit single-roundtrip writes

         2012/3/25                                                                        38
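The key property above, "each log append blocks on acknowledgments from a majority of replicas," can be sketched as follows. This is a simplified model of the quorum rule only (the `Replica` class and `reachable` parameter are illustrative assumptions, and real Paxos also handles the minority catching up safely):

```python
class Replica:
    """One replica of an EG's write-ahead log in some datacenter (toy model)."""
    def __init__(self, name):
        self.name = name
        self.log = []

    def append(self, entry):
        self.log.append(entry)
        return True              # acknowledgment back to the writer

def replicated_append(replicas, entry, reachable):
    """Synchronous replication in the style described above: the append succeeds
    once a majority of replicas acknowledge it; minority replicas are expected
    to catch up later. `reachable` models which datacenters answered."""
    acks = sum(1 for r in replicas if r.name in reachable and r.append(entry))
    return acks > len(replicas) // 2

dcs = [Replica("us"), Replica("eu"), Replica("asia")]
# A majority (2 of 3) is reachable: the write commits despite one DC being down.
ok = replicated_append(dcs, "log-entry-1", reachable={"us", "eu"})
print(ok)  # True
```

Because only a majority must answer, losing one datacenter out of three neither loses data nor blocks writes, which is the availability argument of this slide.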
Paxos

• Traditional usages
   – Locking
   – Master election
   – Replication of metadata and configurations


• Megastore uses Paxos
   – Replicate primary user data across-DC on every write
   – For across-DC high availability




     2012/3/25                                              39
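For intuition on why Paxos can safely replicate the log with no distinguished master, here is a textbook single-decree sketch (prepare/promise, then accept), not Megastore's optimized variant; all names are illustrative:

```python
class Acceptor:
    """Minimal single-decree Paxos acceptor (textbook sketch)."""
    def __init__(self):
        self.promised = 0        # highest proposal number promised so far
        self.accepted = None     # (number, value) already accepted, or None

    def prepare(self, n):
        if n > self.promised:
            self.promised = n
            return True, self.accepted   # promise, reporting any prior accepted value
        return False, self.accepted

    def accept(self, n, value):
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return True
        return False

def propose(acceptors, n, value):
    """Proposer: phase 1 (prepare) then phase 2 (accept), each needing a majority."""
    promises = [a.prepare(n) for a in acceptors]
    granted = [prior for ok, prior in promises if ok]
    if len(granted) <= len(acceptors) // 2:
        return None
    # If any acceptor already accepted a value, we must re-propose the one with
    # the highest proposal number instead of our own -- this is what makes a
    # chosen value stick.
    prior = max((p for p in granted if p), default=None)
    chosen = prior[1] if prior else value
    acks = sum(a.accept(n, chosen) for a in acceptors)
    return chosen if acks > len(acceptors) // 2 else None

quorum = [Acceptor(), Acceptor(), Acceptor()]
print(propose(quorum, 1, "write-A"))   # "write-A"
print(propose(quorum, 2, "write-B"))   # "write-A" -- the first chosen value wins
```

Any node can play proposer (no distinguished master), yet once a value is chosen by a majority, later proposals can only re-confirm it, which is the consistency Megastore relies on for each log position.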
To Study More …




2012/3/25                     40
Valuable References

•   P. Helland. Life beyond distributed transactions: an apostate's
    opinion. In CIDR, pages 132-141, 2007.
     – The philosophical inspiration for Megastore




       2012/3/25                                               41

Learning from google megastore (Part-1)

  • 1.
    Learning from GoogleMegastore Part1: Data Model and Transactions in single datacenter (w/o Replication and Paxos) Schubert Zhang April 9, 2011
  • 2.
  • 3.
    Three Aspects ofMegastore • Data Model to be a DB – Data layout – Indexing • Transactions and ACID – Within Entity Group – Across Entity Group • Replication across datacenter (not be researched in detail in this presentation) – Synchronous replication – Optimized Paxos 2012/3/25 3
  • 4.
    What is? Megastore is: A database over Bigtable, with High Availability across datacenters. Bigdata Philosophy: fine-grained partitioning to make things easy, data placement for relations, and Paxos then, a simple API/Language for convenience of usage! 2012/3/25 4
  • 5.
    Target Applications • Interactive online services • Application developers – User facing applications – Be familiar with RDBMS, SQL – Difficult to give up “read- • Conflicting requirements modify-write” idiom – Highly scalable (size, – But now need high scalability throughput) for bigdata – Rapid development, fast time- to-market – Responsive, Low latency – Consistent view of data – Highly available • Reads vs. Writes – 20 billion:3 billion, daily @Google – 7:1 • Bigdata – Petabyte of primary data – Across datacenters 2012/3/25 5
  • 6.
    NoSQL + RDBMS= Megastore • NoSQL datastore (Bigtable) • Megastore database – Pros – High scalability • Highly scalable – Distributed transactions • Highly available within DC – Consistency guarantees (across hosts) – Fully serializable ACID – Cons semantics within entity-groups • Limited API – Convenience, rapid • Loose consistency models development for applications • Complicate application blend development • + High Availability • RDBMS – Within-DC (Bigtable) – Pros – Across-DC replication, Paxos (synchronously write within EG) • Rich set of features for convenience, rapid – Strong consistency guarantees development for applications (synchronously replicate) • Transactions – Reasonable latency, seamless failover • ACID semantics – Cons • Difficult to scale 2012/3/25 6
  • 7.
    Design Principles • Taking a middle ground in the RDBMS vs. NoSQL design space: – partition the datastore and – replicate each partition separately, – providing full ACID semantics within partitions, – but only limited/loose consistency guarantees across them. • Use Paxos to build a highly available system: – provides reasonable latencies for interactive applications while – synchronously replicating writes across geographically distributed datacenters, – to achieve across-DC high availability and a consistent view of the data. • Approachs: – for database scale, partitioning data into a vast space of small databases, each with its own replicated log stored in a per-replica Bigtable; – for availability, implementing a synchronous, fault-tolerant log replicator optimized for cross-DC replication. 2012/3/25 7
  • 8.
    EG: Entity-Groups • Entity-Group concept is the footstone of scalability and availability! – Fine-grained partitions of data – Fine-grained control over data’s partitioning and locality – Like many mini-databases – To scale throughput and localize outages – Each independently and synchronously replicated across-DC The data for most Internet • An physical EG in Bigtable consist of services can be suitably – A write-ahead-log (for ACID transactions) partitioned (e.g., by user) to – Related data (pre-joined) make this approach viable. – Local indexes (with also ACID) – … Like a mini-database (locally complete) Nearly all applications built on – And a inbox for receiving across-EG messages Megastore have found ways to draw EG boundaries. • Size of a EG – Not too large, Not too small – A priori/natural or deliberate grouping of data for fast operations – If too large: serializable ACID make long latency and low throughput – If too small: many across-EG expensive consistency operations (e.g. 2PC), or looser consistency asynchronous messaging 2012/3/25 8
  • 9.
    Schematic Diagrams A EG like a mini-DB WAL (logs) Primary Data Local Indexes Inbox for Queue Messages EG 2 …… EG n Megastore layout in Bigtable 2012/3/25 9
  • 10.
    Many WAL vs.Single WAL • Many replicated logs each governing its own EG, to improve availability and throughput. – Independent and concurrent operations for different EG – Only operations within a EG need to be serialized – Temporary long-wait and failed operations does not impact other EG • Many WAL to scale throughput and localize outages • WAL is stored with each EG in Bigtable • Examples with the same tenet – The asynchronous and concurrent RPC communication framework of HBase and Hadoop IPC. 2012/3/25 10
  • 11.
    Consistency Levels andthe Approaches • Within each EG: Full ACID semantics – Single-Phase-Commit ACID transactions – And commit entity is replicated via Paxos across-DC • Across-EG: Limited consistency guarantees (two methods for tow levels) – Two-Phase-Commit (expensive, long latency) -> strong consistency – Or, Typically leverage efficient asynchronous messaging (queue!, inexpensive, low latency) -> loose (or eventual) consistency • Two-phase-commit vs. asynchronous-messaging – Two-Phase-Commit transactions • Strong consistency • Expensive • Long latency and low throughput • Usually for low-traffic operations – Asynchronous-messaging • Loose consistency, may be inconsistent (or may be eventual consistency) • Inexpensive • Usually for heavy-traffic operations • Objects to be made consistent: – Data, Local Indexes, within EG : strong (via WAL, ACID) – Data, Global Indexes, cross-EG : strong (via 2PC) or looser (via messaging) – Replicas within DC : strong (via GFS and Bigtable) – Replicas across DC : strong (via Paxos) 2012/3/25 11
  • 12.
    The two Facesof ACID Transactions • Frontface: – Simplify development for applications – Reasoning about correctness • Backface: – Performance reduce – Latency – Throughput 2012/3/25 12
  • 13.
    Architecture of Megastore– How it deploy? • How it deploy – a client library (DB logic) – and auxiliary servers (for across-DC replication) • Applications link to the client library 2012/3/25 13
  • 14.
    Data Model andSemantics to be a database … 2012/3/25 14
  • 15.
    Principles to bea DBMS • Provides traditional database features, such as secondary indexes, etc. • but only those features that can scale within user-tolerable latency limits, • and only with the semantics that EG partitioning scheme can support. Feature set carefully chosen, tradeoffs. 2012/3/25 15
  • 16.
    Data Model (conceptsfor database) • A Data Model is a notation for describing data or information. • Consists of 3 parts, generally – Structure of the data – Operations on the data – Constraints on the data • Megastore Data Model: Relational Model + Scale – Limited relational model – Bigtable’s scalability • High Level Model vs. Physical Level Model – Physical Level • Complicate application development • Bigtable’s data model is at physical level – High Level • Let programmers to write code conveniently • Language, SQL 2012/3/25 16
  • 17.
    Data Model • Schemaful • Primary key − Strongly typed (Primitives or PB) – Built from a sequence of − Required, optional or repeated properties − All entities in a table have the same set of allowable properties. – Must be unique within the table − Nested Protocol-Buffers? An EG= a root entity + all entities Entities Properties Schemas Tables in child tables that reference it (primary (name, (name) (name) key) type) EG Root Child tables Property- table (foreign Entities 111 (EG key) key=EG key) Property- Entity-11 112 Entity Table-1 Property- Photo Entity-12 113 Entity Schema User Entity-21 Entity Table-2 Book Entity-22 Entity schema related hierarchical data 2012/3/25 17
  • 18.
    SQL-Like Schema Language(DDL) CREATE SCHEMA DemoApp; Additional Qualifiers: CREATE TABLE User { DESC|ASC|SCATTER required int64 userId; required string name; ------------------------------------ } PRIMARY KEY(userId), ENTITY GROUP ROOT; CREATE TABLE Book{ required int64 userId; CREATE TABLE Photo { required int32 bookId; required int64 userId; required int64 time; required int32 photoId; required int64 time; required string url; required string url; repeated string tag; optional string thumbUrl; } PRIMARY KEY([DESC|ASC|SCATTER] userId, repeated string tag; [DESC|ASC|SCATTER] bookId), } PRIMARY KEY(userId, photoId), IN TABLE User, IN TABLE User, ENTITY GROUP KEY(userId) REFERENCES User; ENTITY GROUP KEY(userId) REFERENCES User; CREATE LOCAL INDEX PhotosByTime CREATE LOCAL INDEX BooksByTime ON Photo(userId, time); ON Book([DESC|ASC|SCATTER] userId, [DESC|ASC] time); CREATE GLOBAL INDEX PhotosByTag ON Photo(tag) STORING (thumbUrl); 2012/3/25 18
  • 19.
    Data Placement inBigtable (principles) Pre-join with Keys, for performance … • Lets applications control the placement of hierarchical/related data, to minimize latency and maximize throughput – Storing data that is accessed together in nearby rows, or – Denormalized into the same row • The data for a EG are held in contiguous ranges of Bigtable rows, for – Low latency – High throughput – Cache efficiency • Pre-Joining with keys – Primary keys to cluster entities that will be read together. – Each entity maps into a single Bigtable row. – Primary key values are concatenated to form the Bigtable row key – Each remaining property occupies its own Bigtable column – Entity-group key as the prefix of Primary key (row key) – Sorted keys ascending or descending – SCATTER (two-byte hash prefix), to prevent hotspots in Bigtable – Recursive for arbitrary join depths (multiple levels of “IN TABLE”) 2012/3/25 19
  • 20.
    Data Placement inBigtable (details) Pre-join with Keys, for performance … • Bigtable row key = primary key of each table • Bigtable column name = <table name>.<property name> – Allowing entities from different Megastore tables to be mapped into the same Bigtable row without collision. • Store the transaction and replication log and metadata for the EG in root entity’s Bigtable row. – Because Bigtable provides per-row transactions. • Indexes: Each index entry is represented as a single Bigtable row – Bigtable row key = <indexed property values> + <primary key> – Bigtable cell columns: denormalized properties 2012/3/25 20
  • 21.
    Data Placement inBigtable (examples) STORING Transaction Meta User Table Photo Table Denormalized Row Key Root. Root. User. Photo. Photo. Photo. Photo. PhotosByTag. WAL meta name time url thumbUrl tag thumbUrl <U1> Log3 commit Jack Log2 offset Root User Log1 applied offset … EG for U1 <U1,P1> T1 URL1 TURL1 girl, car Photo Local Index Global Index Data PhotosByTime PhotosByTag <U1,P2> T2 URL2 TURL2 dress, girl <U1,T1><U1,P1> <U1,T2><U1,P2> <car><U1,P1> TURL1 <dress><U1,P2> TURL2 <girl><U1,P1> TURL1 <girl><U1,P2> TURL2 2012/3/25 21
  • 22.
    Secondary Indexes • Secondary indexes can be declared on any list of entity properties(optional is ok), including repeated properties, as well as sub-fields within ProtocolBuffers, and full-text index. • Local Indexes – Within EG – Obey ACID semantics • The index entries are stored in the entity group and are updated atomically and consistently with the primary entity data. • Global Indexes – Span EGs – Looser consistency (or may eventual) • Not guaranteed to reflect all recent updates. (may inconsistent with the primary data?) • It is a trick to keep consistent between Global Indexes and primary data!? – Special Two-Phase-Commit? and – Read-Repair? 2012/3/25 22
  • 23.
    Secondary Indexes andDemoralization • STORING clause for copied data in index entities – Avoid the indirect access of primary entities, it is very expensive random access. – But, keeping consistent is a issue! • Inline Indexes – Index entries from the source entities appear as a virtual repeated column in the target entry. – An inline index can be created on any table (child) that has a foreign key referencing another table (parent) by using the first primary key of the target entity as the first components of the index. Inline Index Repeated Columns Inline User Row Key User. PhotosByTime. PhotosByTime. Photo. Photo. Parent Table name T1 T2 time thumbUrl <U1> Jack <P1> <P2> Photo Child Table <U1,P1> T1 TURL1 <U1,P2> T2 TURL2 CREATE INLINE INDEX PhotosByTime ON Photo(userId, time); 2012/3/25 23
  • 24.
    Inline Indexes formany-to-many Relationships • Coupled with repeated indexes, inline indexes can also be used to implement many-to-many relationships more efficiently than by maintaining a many-to-many link table. Inline Index many-to-many Repeated Columns Inline Row User. PhotosByTag. PhotosByTag. PhotosByTag. Photo. Photo. User Key name car dress girl time thumbUrl Parent Table <U1> Jack <P1> <P2> <P1> <P2> Photo <U1,P1> T1 TURL1 Child Table <U1,P2> T2 TURL2 <U2> Tom <P1> <U2,P1> T3 TURL3 CREATE INLINE INDEX PhotosByTag ON Photo(userId, tag); 2012/3/25 24
  • 25.
    API • Cost-transparent API – Match application developers’ intuitions – High-volume interactive workloads benefit more from predictable performance than from an expressive query language. • Normalized relational schemas rely on joins at query time to service user operations, is not the right model for Megastore applications. – Pre-joins – Denormalization • SQL-Like Schema language (DDL, for data structures and data placement) – Fine-grained control over physical locality • Hierarchical layouts (pre-joins) • Declarative denormalization – Eliminate the need for most joins • Queries API against particular tables and indexes – Range Scans – Lookups • Schema changes require corresponding modifications to the query implementation code 2012/3/25 25
  • 26.
    Query Joins • QueryJoins, when required, are implemented in application code. • Index-based join • Merge joins – Multiple queries returns primary keys for the same table, in the same order. – Then intersection of keys for them. • Outer joins – Index lookup (return small result set) – Parallel index lookups using the results of the above lookup • Other joins …? 2012/3/25 26
  • 27.
    Query Joins -Merge Joins Query-1 SELECT * FROM Photo WHERE tag=girl girl & car Intersection or & or | girl | car SELECT * FROM Photo WHERE tag=car Query-2 Use the global index: PhotosByTag Just like: SELECT * FROM Photo WHERE tag=girl AND tag=car or SELECT * FROM Photo WHERE tag=girl OR tag=car Strictly, Merge Join is not a real join in the lingo of SQL, but is really a “Join”. 2012/3/25 27
  • 28.
    Query Joins -Outer Joins name=Jack, userId=U1,U2 Query-2 Query-1 userId=U1,U2 Parallel Index Lookup Query-2 Index lookup T1<time<T10 Parallel Index Lookup SELECT name, userId FROM User SELECT thumbnUrl FROM Photo WHERE name=Jack WHERE time>T1 AND time<T10; (suppose there is a index: … Parallel for each userId. UsersByName) Just like: SELECT User.name, User.userId, Photo.thumbUrl FROM User LEFT OUTER JOIN Photo ON Photo.userId=User.userId WHERE User.name=Jack AND Photo.time>T1 and Photo.time<T10 Example of result: Jack, U1, TURL1 Jack, U2, NULL 2012/3/25 28
  • 29.
    Transactions and ConcurrencyControl • An EG as a mini-database, serializable ACID transactions . • Transactions within-EG – A transaction writes its mutations into the EG's WAL, then the mutations are applied to the data. – Readers use the timestamp of the last fully applied transaction to avoid seeing partial updates. • MVCC: Multi-Version Concurrency Control (very important) – Use Bigtable cell’s timestamps/versions – Readers and writers don't block each other, and reads are isolated from writes for the duration of a transaction. (How? See MVCC in Wikipedia) • Write patterns – A write transaction always begins with a current read to determine the next available log position. (This current read only ensures that all previously committed writes to be applied.) – The commit operation gathers mutations into a log entry, assigns it a timestamp higher than any previous one, and appends it to the log (and using Paxos for replicate across-DC). – The write operation can return to the client at any point after Commit. Write Op Commit Read Op • Read patterns – Current Read Metadata and WAL of EG root Check for – Snapshot Read recover committed logs ad In Bigtable Re – Inconsistent Read The apply may be async Apply Tables data and Indexes data in Bigtable 2012/3/25 29
Transactions and Concurrency Control - Write

• Figure: the state of the transaction system for an EG (Bigtable timestamps are used for MVCC). The metadata records the last committed position (ts2) and the last fully applied position (ts1); the WAL holds the committed log entries (Mutation-11/12 at ts1, Mutation-21/22 at ts2); the data holds the applied versions (Data-part1-ts1, Data-part2-ts1, Data-part1-ts2).
• When a failure occurs at this point:
   – Transaction-1 (committed and fully applied): very safe.
   – Transaction-2 (committed but not fully applied): safe, no data loss; its mutations are recovered from the log and applied to the data when a later "current read" or "write" operation runs.
   – Transaction-3 (still writing, mutations not yet gathered and appended to the log): not complete, failed; the application gets a failure return.
• Note: the commit operation gathers mutations into a log entry and assigns a timestamp to it. A write transaction always begins by ensuring that all previously committed writes are applied (via a current read)!
Transaction Read Patterns and Lifecycle

• Current Read
   – Only within an EG.
   – When starting a current read, first ensure that all previously committed writes are applied (just like recovery from commit logs).
   – Then the application reads at the timestamp of the latest committed transaction.
• Snapshot Read
   – Only within an EG.
   – Picks up the timestamp of the last known fully applied transaction and reads from there.
   – Some committed transactions may not yet be applied.
• Inconsistent Read
   – Reads the latest values directly; may see partially applied data; used for aggressive latency.

• A complete transaction lifecycle
   – Read: get the timestamp of the last committed transaction from the metadata.
   – Application logic: read-modify-write.
   – Commit: gather mutations into a log entry, assign it a higher timestamp; replicate across DCs via Paxos. Can return to the client here.
   – (The following steps may be asynchronous.)
   – Apply: write the mutations into data and indexes.
   – Clean up: delete fully applied log entries.
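The three read patterns can be sketched over a toy EG whose MVCC versions are (timestamp, value) lists; all names and the pre-seeded state here are assumptions for illustration. Current read applies pending committed writes first, snapshot read uses the last fully applied timestamp, and inconsistent read just takes the latest cell.

```python
class EGReader:
    def __init__(self):
        # Pre-seeded state: ts1 applied, ts2 committed but not yet applied.
        self.wal = [(1, {"k": "v1"}), (2, {"k": "v2"})]
        self.data = {"k": [(1, "v1")]}
        self.last_committed = 2
        self.last_applied = 1

    def _read_at(self, key, ts):
        versions = [v for (t, v) in self.data.get(key, []) if t <= ts]
        return versions[-1] if versions else None

    def current_read(self, key):
        # Recovery: apply committed-but-unapplied log entries first.
        for ts, mutations in self.wal:
            if ts > self.last_applied:
                for k, v in mutations.items():
                    self.data.setdefault(k, []).append((ts, v))
                self.last_applied = ts
        # Then read at the last committed timestamp.
        return self._read_at(key, self.last_committed)

    def snapshot_read(self, key):
        return self._read_at(key, self.last_applied)      # may be slightly stale

    def inconsistent_read(self, key):
        versions = self.data.get(key, [])
        return versions[-1][1] if versions else None      # may be partially applied

eg = EGReader()
print(eg.snapshot_read("k"))      # 'v1'  (ts2 committed but not applied)
print(eg.current_read("k"))       # 'v2'  (recovery applied ts2 first)
```

The contrast in the output shows the trade-off: snapshot read is cheap but can lag behind commits, while current read pays the recovery cost to see the latest committed state.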
Transaction Read Patterns - Current Read

• (1) Check the latest committed writes: last committed position (ts2) vs. last fully applied position (ts1) in the metadata.
• (2) Apply the previously committed writes (Transaction-2's mutations at ts2) to the data.
• (3) Update the metadata: last fully applied position (ts1) → (ts2).
• (4) Read the data at ts2.
• Do the recovery write before reading the data.
• (Figure: the state of the transaction system for an EG; Bigtable timestamps are used for MVCC.)
Transaction Read Patterns - Snapshot Read

• (1) Get the last fully applied timestamp (ts1) from the metadata.
• (2) Read the data at ts1 (ts2 is committed but not yet applied).
• The easiest read pattern.
• (Figure: the state of the transaction system for an EG; Bigtable timestamps are used for MVCC.)
Transaction Read Patterns - Inconsistent Read

• (1) Directly read the latest data, which may include partially applied data.
• The application must tolerate stale or partially applied data.
• (Figure: the state of the transaction system for an EG; Bigtable timestamps are used for MVCC.)
Two-Phase Commit

• Expensive, long latency.
Replication for High Availability

• … I need to study Paxos further, so this part is not covered in detail.
Replication

• Within a DC
   – Across hosts
   – Built in from Bigtable and GFS
• Across DCs
   – … synchronous and consistent for each write
Replication across DCs

• Traditional strategies (do not work)
   – Asynchronous Master/Slave
      • Asynchronously propagates each write
      • Master supports fast ACID transactions; low latency
      • Data loss risk; downtime for failover
      • Heavyweight master; requires a mediated mastership (e.g. ZooKeeper)
   – Synchronous Master/Slave
      • No data loss
      • Downtime for failover; long latency
      • Heavyweight master; requires a mediated mastership (e.g. ZooKeeper)
   – Optimistic Replication
      • No distinguished master; asynchronously propagates writes
      • Availability and latency are excellent
      • No mutation ordering, so transactions are impossible
      • Like Cassandra/Dynamo
• Megastore: EG-based synchronous replication
   – Uses Paxos; no distinguished master
   – Replicates the write-ahead log: each log append blocks on acknowledgments from a majority of replicas, and replicas in the minority catch up as they are able
   – Any node can initiate reads and writes
   – Reasonable latency
   – Extensions: allows local reads at any up-to-date replica; permits single-roundtrip writes
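The majority-acknowledgment rule for log appends can be illustrated with a toy quorum check. This sketch shows only the majority arithmetic, not the actual Paxos protocol; the class and function names are assumptions.

```python
class Replica:
    def __init__(self, up=True):
        self.up = up
        self.log = []

    def accept(self, entry):
        """A live replica appends the entry and acknowledges it."""
        if self.up:
            self.log.append(entry)
            return True
        return False     # a down replica catches up after it recovers

def append_to_log(replicas, entry):
    """A log append succeeds once a majority of replicas acknowledge it."""
    acks = sum(1 for r in replicas if r.accept(entry))
    return acks >= len(replicas) // 2 + 1

# Three replicas, one datacenter down: the write still commits.
replicas = [Replica(), Replica(), Replica(up=False)]
print(append_to_log(replicas, "entry-1"))   # True (2 of 3 acked)

# Majority down: the write blocks/fails rather than risking data loss.
print(append_to_log([Replica(up=False), Replica(up=False), Replica()], "e"))  # False
```

This is why the scheme tolerates a minority of unavailable replicas with no data loss and no distinguished master, at the cost of one cross-DC round of acknowledgments per log append.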
Paxos

• Traditional usages
   – Locking
   – Master election
   – Replication of metadata and configurations
• Megastore uses Paxos to:
   – Replicate primary user data across DCs on every write
   – Provide across-DC high availability
To Study More…
Valuable References

• P. Helland. "Life beyond Distributed Transactions: An Apostate's Opinion." In CIDR, pages 132-141, 2007.
   – The philosophical inspiration for Megastore.