Google Cloud Datastore Inside-Out

Explains the index structure of Google Cloud Datastore and its underlying components: Google File System (Colossus), Bigtable and Megastore.

Session video (Japanese)
https://youtu.be/H-tZUZGBo60?t=8524

2016/11/08 ver1.0 Published
2016/11/11 ver2.0 Add notes on Spanner
2017/02/09 ver2.1 Fix on Spanner's consistency description.

Published in: Technology

  1. 1. Google Cloud Datastore Inside-Out Etsuji Nakai Cloud Solutions Architect at Google February 9, 2017 ver2.1
  2. 2. Etsuji Nakai Cloud Solutions Architect at Google Twitter @enakai00 Now On Sale! 2
  3. 3. Cloud Datastore 101 The mystery of entity groups
  4. 4. Dual nature of entities ● An entity represents a row of a specific "kind". ● You can think of "kind" as a table in the relational data model. ● An entity is identified by an ID (user-specified string or auto-generated UUID) plus its (mysterious) parent key. [Figure: a row of a kind and its unique identifier]
  5. 5. Dual nature of entities ● An entity represents a node of an "entity group" tree. ● An entity group can contain entities from multiple kinds. ● An entity is identified by a key (ancestor path + ID). ○ A key must contain every ancestor from the root. ○ Some entities in the ancestor path may not exist. ● Example key: (Organization, 'Flywheel', User, 'Alice', Mail, '15de6'), made up of the ancestor path followed by the ID. In this example the Organization 'Flywheel' entity itself doesn't exist. [Figure: a node of an entity group]
  6. 6. The bright/dark side of an entity ● It's safe to treat an entity as a member of an entity group. ○ Entities treated as part of an entity group are guaranteed to be strongly consistent. ● An ancestor query is a query that specifies an ancestor. ○ The search range is limited to the descendants of the specified ancestor. ○ Ancestor queries are strongly consistent. In other words, they always retrieve the latest data. ○ You can use a single-phase transaction inside an entity group. ○ A cross-group transaction can also be used, but it is slower than a single-phase transaction. ● A global query is a query that does not specify an ancestor. ○ Global queries are eventually consistent. ○ You may see old content and/or fail to find newly created entities.
  7. 7. Mystery of composite indexes ● Can you tell which query requires an additional (non-default) index? ○ Global query ■ SELECT * FROM Mail WHERE size>256 ⇒ ◯ (OK) ■ SELECT * FROM Mail WHERE size=256 and access_count>5 ⇒ △ (Need an additional index) ■ SELECT * FROM Mail WHERE size>256 and access_count>5 ⇒ ✕ (This is not allowed) ■ SELECT size FROM Mail WHERE size>256 ⇒ ◯ (OK) ■ SELECT title FROM Mail WHERE size>256 ⇒ △ (Need an additional index) ○ Ancestor query ■ SELECT * FROM Mail WHERE __key__ HAS ANCESTOR Key(Organization, 'Flywheel', User, 'Alice') ⇒ ◯ ■ SELECT * FROM Mail WHERE size=256 AND __key__ HAS ANCESTOR Key(Organization, 'Flywheel', User, 'Alice') ⇒ △ ■ SELECT * FROM Mail WHERE size>256 AND __key__ HAS ANCESTOR Key(Organization, 'Flywheel', User, 'Alice') ⇒ △
  8. 8. What's happening under the covers? ● How is strong consistency guaranteed for ancestor queries? ● Why do I have to define additional indexes for some queries? ● When and why do I need to specify "ancestor = True" for an index?
  9. 9. Truth is here ● Cloud Datastore is implemented on top of Megastore, which is layered over Bigtable and Google File System. The internal architectures of Megastore, Bigtable and Google File System are explained in published research papers. ● Megastore: Providing Scalable, Highly Available Storage for Interactive Services ○ http://research.google.com/pubs/pub36971.html ● Bigtable: A Distributed Storage System for Structured Data ○ http://research.google.com/archive/bigtable.html ● The Google File System ○ http://research.google.com/archive/gfs.html [Figure: layered stack of Google File System, Bigtable and Megastore]
  10. 10. Notes on Colossus ● Colossus is the successor to Google File System and overcomes its shortcomings. Today it underpins Google Cloud Platform as well as Google's internal systems. ● The following characteristics were mentioned at Google Faculty Summit 2010. ○ Next-generation cluster-level file system ○ Automatically sharded metadata layer ○ Data typically written using Reed-Solomon (1.5x) ○ Client-driven replication, encoding and replication ○ Metadata space has enabled availability analyses ● Since the architectural details of Colossus have not yet been published, this presentation explains the architecture of Google File System.
  11. 11. Google File System
  12. 12. What is Google File System? ● A large-scale distributed file system used in Google's internal systems to store large files. ● Optimized for appends to and sequential reads of large files. ○ Other operations are supported but may be very slow. ● Transparent file replication for redundancy. ○ Each file is split into multiple 64MB chunks and each chunk is stored on (at least) three chunk servers. ● Typical access patterns: ○ Handing over large data between servers ○ Streaming data aggregation
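The fixed 64MB chunk size means that a byte range in a file maps to chunk indexes by integer division. The sketch below illustrates only that arithmetic; the function name `chunk_span` is hypothetical and not part of any Google File System API.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64MB, the chunk size stated on the slide


def chunk_span(offset: int, length: int) -> range:
    """Return the indexes of the chunks touched by `length` bytes starting at `offset`."""
    if length <= 0:
        return range(0)
    first = offset // CHUNK_SIZE
    last = (offset + length - 1) // CHUNK_SIZE
    return range(first, last + 1)


# A 2-byte read that straddles the first chunk boundary touches chunks 0 and 1.
print(list(chunk_span(CHUNK_SIZE - 1, 2)))
```

A client would then ask for the location of each chunk index in the span and read from one of the replicas of that chunk.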
  13. 13. Optimized dataflow ● Data is transferred serially from a client through the chunk servers. Each chunk server starts forwarding the data as soon as it starts receiving it. ○ This is faster than sending data from the client to all chunk servers in parallel. ● Control messages are handled by the primary chunk server to keep the replicas consistent. [Figure: dataflow to append data (client and chunk servers) and control flow to commit the write (client, primary and two secondaries)]
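A rough estimate shows why serial forwarding wins. The pipelined formula B/T + R*L is the ideal transfer time given in the Google File System paper. The parallel formula is an assumption made here for comparison: it treats the client's outbound link as the bottleneck, shared equally by all replicas. The function names and example numbers are illustrative only.

```python
def pipelined_seconds(num_bytes: float, throughput: float,
                      replicas: int, hop_latency: float) -> float:
    """Ideal time when each server forwards data while still receiving it: B/T + R*L."""
    return num_bytes / throughput + replicas * hop_latency


def parallel_seconds(num_bytes: float, throughput: float, replicas: int) -> float:
    """Assumed model: the client sends R full copies over one shared outbound link."""
    return replicas * num_bytes / throughput


# 1 MB to 3 replicas over a 100 Mbps link (12.5e6 bytes/s) with 1 ms per hop.
print(pipelined_seconds(1e6, 12.5e6, 3, 0.001))
print(parallel_seconds(1e6, 12.5e6, 3))
```

Under these assumptions the pipelined transfer costs little more than a single copy, while the parallel transfer costs roughly one copy per replica.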
  14. 14. Data corruption detection ● Each chunk is associated with a checksum to detect data corruption. ● On a read, the whole chunk is read and validated against the checksum. ○ This is optimized for sequential reads. ● On a write, a new checksum is calculated from the appended data and the existing checksum. ○ This is optimized for file appends.
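The append optimization relies on a checksum that can be extended from its previous value without re-reading the existing data. The slides don't name the checksum algorithm, so the sketch below uses CRC-32 from Python's `zlib` module purely as an assumed stand-in with that property.

```python
import zlib


def extend_checksum(checksum: int, appended: bytes) -> int:
    """Update a running CRC-32 with newly appended bytes only."""
    return zlib.crc32(appended, checksum)


existing = b"record-1\n"
appended = b"record-2\n"

checksum = zlib.crc32(existing)                  # checksum stored with the chunk
checksum = extend_checksum(checksum, appended)   # no need to re-read `existing`

# A reader validates by recomputing the checksum over the whole chunk.
print(checksum == zlib.crc32(existing + appended))
```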
  15. 15. Bigtable
  16. 16. What is Bigtable? ● A large-scale distributed key-value style datastore used in Google's internal systems to store structured data with varying data sizes (from web page URLs to satellite imagery). ● Google Cloud Platform offers a managed Bigtable service with HBase-compatible APIs. [Figure: column family design to store HTML contents and inverse links (excerpt from the research paper)]
  17. 17. Row as a Database ● Data is identified with "Row Key + Column family: Column" (+ timestamp). ● You can think of a single row as a small database. ○ A column family represents a table. ○ Columns can be dynamically added to a column family. ○ Atomic operations can be used within a single row. [Figure: column family design for user profiles and query histories]
  18. 18. Global view of the "big" table ● Rows are stored in lexicographic order by row key. The row range for a table is dynamically partitioned into units called 'tablets'. ○ This strategy is optimized for fast row range scans. ● Tablet servers provide access to tablets. The tablet assignment is managed by a master node.
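Because rows are ordered by key and each tablet covers a contiguous key range, finding the tablet for a row is a binary search over tablet boundaries. The sketch below is a simplified illustration: the boundary keys and the `locate_tablet` helper are hypothetical, and real Bigtable stores tablet locations in a metadata hierarchy rather than in a flat list.

```python
import bisect

# Hypothetical tablets, each identified by the last row key it holds (inclusive).
tablet_end_keys = ["g", "n", "t", "\uffff"]


def locate_tablet(row_key: str) -> int:
    """Return the index of the tablet whose key range contains `row_key`."""
    return bisect.bisect_left(tablet_end_keys, row_key)


def tablets_for_scan(start_key: str, end_key: str) -> list[int]:
    """A row range scan touches a contiguous run of tablets."""
    return list(range(locate_tablet(start_key), locate_tablet(end_key) + 1))


print(locate_tablet("apple"))
print(tablets_for_scan("k", "p"))
```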
  19. 19. Tablet representation ● Tablet data consists of in-memory data (memtable) and immutable files (SSTables) stored in Google File System. ○ SSTables store a frozen view of a tablet at some point in time. Updates are appended to a tablet log and applied to the memtable. ○ A tablet server constructs the unified view of the tablet by combining the memtable and the SSTables. ● When the memtable becomes too large, a new memtable is created and the old one is frozen into a new SSTable (minor compaction). ● When there are too many SSTables, they are merged into a single SSTable, discarding obsolete entries (major compaction). [Figure: tablet representation mechanism (excerpt from the research paper)]
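The interplay of memtable, SSTables and the two compactions can be mimicked with plain dictionaries. This is a toy model under stated assumptions: the `Tablet` class, the `memtable_limit` parameter and the tombstone marker are invented for illustration, the tablet log is omitted, and real SSTables are sorted files rather than in-memory dicts.

```python
TOMBSTONE = object()  # marks a deleted key until a major compaction drops it


class Tablet:
    def __init__(self, memtable_limit: int = 2):
        self.memtable = {}
        self.sstables = []  # frozen snapshots, oldest first
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.minor_compaction()

    def delete(self, key):
        self.write(key, TOMBSTONE)

    def read(self, key):
        # Unified view: the memtable first, then SSTables from newest to oldest.
        for table in [self.memtable, *reversed(self.sstables)]:
            if key in table:
                value = table[key]
                return None if value is TOMBSTONE else value
        return None

    def minor_compaction(self):
        # Freeze the current memtable into a new SSTable and start a fresh one.
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}

    def major_compaction(self):
        # Merge all SSTables into one, keeping only the newest live value per key.
        merged = {}
        for table in self.sstables:
            merged.update(table)
        live = {k: v for k, v in sorted(merged.items()) if v is not TOMBSTONE}
        self.sstables = [live]


tablet = Tablet(memtable_limit=2)
tablet.write("a", 1)
tablet.write("b", 2)   # triggers a minor compaction
tablet.write("a", 3)
tablet.delete("b")     # triggers a second minor compaction
tablet.major_compaction()
print(tablet.sstables)
```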
  20. 20. Cloud Datastore / Megastore
  21. 21. Overview of Megastore ● Megastore provides ACID semantics for globally distributed datasets using a fast synchronous replication mechanism based on (an enhanced version of) Paxos. ● This part explains the index structure of Cloud Datastore implemented on top of Megastore. ● Note that ancestor queries and global queries are additional features of Cloud Datastore. They are not part of Megastore. [Figure: multi-datacenter replication architecture of Megastore (excerpt from the research paper)]
  22. 22. Index structure for ancestor queries
  23. 23. How are entities stored in Bigtable? ● Row key: entity key (ancestor path + ID). ○ The whole entity group can be scanned by a row range scan (depth-first search). ● Column family: properties of an entity. ○ An independent column family is used for each property. ● Transaction log and replication status are recorded for operations with strong consistency. Entity rows, in row range scan order (Row key | status of the group | email | title | size | access_count; "-" means no value):
        Organization, 'Flywheel' | (transaction log, replication status) | - | - | - | -
        Organization, 'Flywheel', User, 'Alice' | - | xxxx | - | - | -
        Organization, 'Flywheel', User, 'Alice', Mail, '15de6' | - | - | xxxx | 1024 | 9
        Organization, 'Flywheel', User, 'Alice', Mail, '65067' | - | - | xxxx | 128 | 5
        Organization, 'Flywheel', User, 'Alice', Mail, 'c4c0d' | - | - | xxxx | 256 | 3
        Organization, 'Flywheel', User, 'Bob' | - | xxxx | - | - | -
        ...
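Since row keys are ordered and a descendant's key extends its ancestor's key, all descendants of an ancestor sit in one contiguous run of rows. The sketch below models row keys as Python tuples, whose ordering places a prefix before its extensions. The `ancestor_scan` helper is hypothetical, and real row keys are encoded byte strings rather than tuples.

```python
import bisect

ORG = ("Organization", "Flywheel")
ALICE = ORG + ("User", "Alice")
BOB = ORG + ("User", "Bob")

# Entity rows kept in row key order, as in the table on the slide.
rows = sorted([
    ORG,
    ALICE,
    ALICE + ("Mail", "15de6"),
    ALICE + ("Mail", "65067"),
    ALICE + ("Mail", "c4c0d"),
    BOB,
])


def ancestor_scan(ancestor: tuple, kind: str) -> list:
    """Scan forward from the ancestor's row until the key prefix changes."""
    start = bisect.bisect_left(rows, ancestor)
    found = []
    for key in rows[start:]:
        if key[:len(ancestor)] != ancestor:
            break  # left the ancestor's subtree, so the scan can stop
        if key[-2] == kind:  # the kind is the second-to-last key element
            found.append(key)
    return found


print(ancestor_scan(ALICE, "Mail"))
```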
  24. 24. Ancestor query without inequality filters ● The following queries don't require an additional index since they can be done by a row range scan. ○ SELECT * FROM Mail WHERE __key__ HAS ANCESTOR Key(Organization, 'Flywheel', User, 'Alice') ○ SELECT * FROM Mail WHERE size=256 AND __key__ HAS ANCESTOR Key(Organization, 'Flywheel', User, 'Alice') ● The scan starts from the row with the specified ancestor key. Entity rows (Row key | status of the group | email | title | size; "-" means no value):
        Organization, 'Flywheel' | - | - | - | -
        Organization, 'Flywheel', User, 'Alice' | - | xxxx | - | - (the scan starts from here)
        Organization, 'Flywheel', User, 'Alice', Mail, '15de6' | - | - | xxxx | 1024
        Organization, 'Flywheel', User, 'Alice', Mail, '65067' | - | - | xxxx | 128
  25. 25. Ancestor query with inequality filters ● The following query requires an additional index. ○ SELECT * FROM Mail WHERE size>256 AND __key__ HAS ANCESTOR Key(Organization, 'Flywheel', User, 'Alice') ● Theoretically it's possible to do the same table scan, but it may not be efficient enough. Instead, the following index should be used. ○ The row key of this index table consists of: ■ "Ancestor of the entity" + "Property value" + "Entity key (ancestor path + ID)" ○ See the next pages for details. Index definition:
        indexes:
        - kind: Mail
          ancestor: yes
          properties:
          - name: size
  26. 26. Single-property indexes for ancestor queries ● Each entity is mapped to multiple rows corresponding to all its ancestors. ○ The following example shows the rows for two entities. ○ These rows will be sorted in the order of row keys, then... Index rows (row key: Ancestors | Property value | Entity key (ancestor path + id); column: pointer to entity):
        Organization, 'Flywheel' | 1024 | Organization, 'Flywheel', User, 'Alice', Mail, '15de6'
        Organization, 'Flywheel', User, 'Alice' | 1024 | Organization, 'Flywheel', User, 'Alice', Mail, '15de6'
        Organization, 'Flywheel', User, 'Alice', Mail, '15de6' | 1024 | Organization, 'Flywheel', User, 'Alice', Mail, '15de6'
        Organization, 'Flywheel' | 128 | Organization, 'Flywheel', User, 'Alice', Mail, '65067'
        Organization, 'Flywheel', User, 'Alice' | 128 | Organization, 'Flywheel', User, 'Alice', Mail, '65067'
        Organization, 'Flywheel', User, 'Alice', Mail, '65067' | 128 | Organization, 'Flywheel', User, 'Alice', Mail, '65067'
  27. 27. Single-property indexes for ancestor queries ● Using the row keys, which are sorted in lexicographic order: ○ First, the row range is limited by the specified ancestor. ○ The row range is narrowed further by the inequality filter. ● Example query: SELECT * FROM Mail WHERE size>256 AND __key__ HAS ANCESTOR Key(Organization, 'Flywheel') Index rows (Ancestors | size | Entity key):
        Organization, 'Flywheel' | 64 | Organization, 'Flywheel', User, 'Bob', Mail, '15de6'
        Organization, 'Flywheel' | 128 | Organization, 'Flywheel', User, 'Alice', Mail, '65067'
        Organization, 'Flywheel' | 256 | Organization, 'Flywheel', User, 'Alice', Mail, 'c4c0d'
        Organization, 'Flywheel' | 256 | Organization, 'Flywheel', User, 'Bob', Mail, 'c6f4c'
        Organization, 'Flywheel' | 1024 | Organization, 'Flywheel', User, 'Alice', Mail, '15de6'
        Organization, 'Flywheel' | 1024 | Organization, 'Flywheel', User, 'Bob', Mail, 'f67de'
        Organization, 'Flywheel', User, 'Alice' | 128 | Organization, 'Flywheel', User, 'Alice', Mail, '65067'
        Organization, 'Flywheel', User, 'Alice' | 256 | Organization, 'Flywheel', User, 'Alice', Mail, 'c4c0d'
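The two narrowing steps can be reproduced with a sorted list and a binary search. The following sketch assumes integer sizes and models each index row as a tuple of (ancestor, size, entity key). The helper names `index_rows` and `mails_larger_than` are hypothetical, and the real index stores encoded byte strings in Bigtable.

```python
import bisect

ORG = ("Organization", "Flywheel")
ALICE = ORG + ("User", "Alice")
BOB = ORG + ("User", "Bob")

mail_sizes = {
    BOB + ("Mail", "15de6"): 64,
    ALICE + ("Mail", "65067"): 128,
    ALICE + ("Mail", "c4c0d"): 256,
    BOB + ("Mail", "c6f4c"): 256,
    ALICE + ("Mail", "15de6"): 1024,
    BOB + ("Mail", "f67de"): 1024,
}


def index_rows(entity_key: tuple, size: int) -> list:
    """One index row per ancestor prefix of the key, including the key itself."""
    return [(entity_key[:end], size, entity_key)
            for end in range(2, len(entity_key) + 1, 2)]


# The index table: all rows of all entities, sorted by row key.
index = sorted(row for key, size in mail_sizes.items()
               for row in index_rows(key, size))


def mails_larger_than(ancestor: tuple, threshold: int) -> list:
    """Answer `size > threshold` under `ancestor` with one contiguous range scan."""
    # Sizes are integers here, so the range starts at threshold + 1.
    start = bisect.bisect_left(index, (ancestor, threshold + 1))
    found = []
    for row_ancestor, _size, entity_key in index[start:]:
        if row_ancestor != ancestor:
            break  # the range for this ancestor has ended
        found.append(entity_key)
    return found


print(mails_larger_than(ORG, 256))
```

Each entity contributes one row per ancestor, which is why the same scan works whether the query names the organization or a single user as the ancestor.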
  28. 28. Composite indexes for multiple conditions ● Indexes with multiple properties are used for queries with multiple conditions. ● The following query requires a composite index. ○ SELECT * FROM Mail WHERE size=256 and access_count<5 AND __key__ HAS ANCESTOR Key(Organization, 'Flywheel') ● The order of properties in the index definition is significant. ○ The property for the equality filter must come first. Index definition:
        indexes:
        - kind: Mail
          ancestor: yes
          properties:
          - name: size
          - name: access_count
     Index rows in sorted order (Ancestors | size | access_count | Entity key):
        Organization, 'Flywheel' | 64 | 1 | Organization, 'Flywheel', User, 'Bob', Mail, '15de6'
        Organization, 'Flywheel' | 128 | 5 | Organization, 'Flywheel', User, 'Alice', Mail, '65067'
        Organization, 'Flywheel' | 256 | 3 | Organization, 'Flywheel', User, 'Alice', Mail, 'c4c0d'
        Organization, 'Flywheel' | 256 | 8 | Organization, 'Flywheel', User, 'Bob', Mail, 'c6f4c'
        Organization, 'Flywheel' | 1024 | 2 | Organization, 'Flywheel', User, 'Bob', Mail, 'f67de'
        Organization, 'Flywheel' | 1024 | 9 | Organization, 'Flywheel', User, 'Alice', Mail, '15de6'
  29. 29. Multiple inequality filters are not allowed! ● The following query is not allowed. ○ The matching rows of the index table do not form a single contiguous range for this condition. SELECT * FROM Mail WHERE size>128 AND access_count<5 AND __key__ HAS ANCESTOR Key(Organization, 'Flywheel', User, 'Alice')
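A small experiment makes the restriction concrete. The sketch below uses the (size, access_count) values from the composite index example for ancestor Organization 'Flywheel' and checks whether the rows matching a filter are contiguous in sorted index order. The shortened entity labels and helper names are illustrative only.

```python
# (size, access_count, entity) rows of a composite index, in sorted order.
index = sorted([
    (64, 1, "Bob/15de6"),
    (128, 5, "Alice/65067"),
    (256, 3, "Alice/c4c0d"),
    (256, 8, "Bob/c6f4c"),
    (1024, 9, "Alice/15de6"),
    (1024, 2, "Bob/f67de"),
])


def matching_positions(predicate) -> list:
    """Positions of the index rows that satisfy the filter."""
    return [i for i, (size, count, _entity) in enumerate(index)
            if predicate(size, count)]


def is_single_range(positions: list) -> bool:
    """True when the positions form one contiguous run (or are empty)."""
    return not positions or positions == list(range(positions[0], positions[-1] + 1))


equality_then_inequality = matching_positions(lambda s, c: s == 256 and c < 5)
two_inequalities = matching_positions(lambda s, c: s > 128 and c < 5)

print(equality_then_inequality, is_single_range(equality_then_inequality))
print(two_inequalities, is_single_range(two_inequalities))
```

With an equality filter on the first property, the inequality on the second property selects one contiguous slice. With two inequalities, a non-matching row sits between the matches, so no single range scan can answer the query.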
  30. 30. Strong consistency of ancestor queries ● Indexes with "ancestor: yes" are used for ancestor queries, where independent indexes are created for each ancestor tree. ○ A single index table contains entries only for one entity group. ● Indexes are created in each datacenter and replicated. ○ Replication status is checked before starting a query to guarantee strong consistency. Entity rows (Row key | status of the group | email | title | size | access_count; "-" means no value):
        Organization, 'Flywheel' (root entity) | Replication Status | - | - | - | -
        Organization, 'Flywheel', User, 'Alice' | - | xxxx | - | - | -
        Organization, 'Flywheel', User, 'Alice', Mail, '15de6' | - | - | xxxx | 1024 | 9
        Organization, 'Flywheel', User, 'Alice', Mail, '65067' | - | - | xxxx | 128 | 5
  31. 31. Index structure for global queries
  32. 32. Indexes for global queries ● Indexes with "ancestor: no" are used for global queries, where indexes are created for each kind. ○ One index table contains all entities of a specific kind, including entities from multiple entity groups. ● Megastore handles operations across entity groups with weaker consistency unless two-phase commit is used. ● On the Cloud Datastore layer, this results in the eventual consistency of global queries. [Figure: operation across entity groups (excerpt from the research paper)]
  33. 33. Default single-property indexes ● Single-property indexes for global queries are automatically created (in both asc and desc orders). ○ Ancestors are not included in row keys of the index table. ● For example, the following queries use the default indexes. ○ SELECT * FROM Mail WHERE size>256 ○ SELECT size FROM Mail WHERE size>256 Index rows (size | Entity key):
        64 | Organization, 'Flywheel', User, 'Bob', Mail, '15de6'
        128 | Organization, 'Flywheel', User, 'Alice', Mail, '65067'
        256 | Organization, 'Flywheel', User, 'Alice', Mail, 'c4c0d'
        256 | Organization, 'Flywheel', User, 'Bob', Mail, 'c6f4c'
        1024 | Organization, 'Flywheel', User, 'Alice', Mail, '15de6'
        1024 | Organization, 'Flywheel', User, 'Bob', Mail, 'f67de'
  34. 34. Composite indexes for global queries ● Indexes with multiple properties (composite indexes) need to be created manually. ○ Projection queries also need composite indexes so that values can be retrieved directly from the index table. ● Query with multiple conditions: SELECT * FROM Mail WHERE size=256 and access_count>5 ● Projection query: SELECT title FROM Mail WHERE size>256 ○ 'title' can be retrieved directly from the index table. Index definitions:
        indexes:
        - kind: Mail
          ancestor: no
          properties:
          - name: size
          - name: access_count
        - kind: Mail
          ancestor: no
          properties:
          - name: size
          - name: title
  35. 35. Index direction matters for sort orders ● "ORDER BY" requires the corresponding index. ● When used with an equality filter, the index direction needs to match the sort order. ○ SELECT * FROM Mail WHERE size=256 ORDER BY access_count DESC ● "ORDER BY" cannot be mixed with an inequality filter on another property. ○ The following query is not allowed: SELECT * FROM Mail WHERE size>256 ORDER BY access_count DESC Index definition for the allowed query:
        indexes:
        - kind: Mail
          ancestor: no
          properties:
          - name: size
          - name: access_count
            direction: desc
  36. 36. Design guide for entity groups
  37. 37. Design guide for entity groups ● Avoid global queries (queries without specifying an ancestor) unless you understand what you are doing. ○ Global queries may not retrieve the latest data. ● Split data into entity groups so that updates within a single entity group are infrequent. ○ Entities in a single entity group should receive less than 1 update/sec in total. ● Examples: ○ Web mail service ■ An entity group of mails for each user. ○ SNS user group service ■ An entity group of the user profile for each user. ■ An entity group of posts for each user group. ■ An entity group of group names and pointers to group sites, which provides a catalog of user groups. ○ Online map service ■ An entity group of patches for an arbitrary region of the globe.
  38. 38. References ● Under the Covers of the Google App Engine Datastore ● How Entities and Indexes are Stored ● Balancing Strong and Eventual Consistency with Google Cloud Datastore 38
  39. 39. Notes on Spanner
  40. 40. What is Spanner? ● Spanner: Google's Globally-Distributed Database ○ http://research.google.com/archive/spanner.html ● Spanner is Google's scalable, multi-version, globally distributed, and synchronously replicated database. It is used as the successor to Megastore in Google's internal systems. ● It is designed to overcome the shortcomings of Megastore and to support general-purpose transactions with an SQL-based query language. ● Examples of Megastore's shortcomings: ○ It doesn't support the relational data model or an SQL-based query language. ○ Transactions and strong consistency are limited to a single entity group. ○ Updates are limited to 1 update/sec for each entity group.
  41. 41. Infrastructure design ● The overall server architecture of Spanner resembles Megastore over Bigtable. ○ A cluster in each zone contains multiple spanservers. Zones are distributed across data centers. ○ Each spanserver manages tablets which hold the key-value mappings: (key: string, timestamp: int64) → value: string ○ Backend data files are stored in Colossus. ● Unlike Bigtable, rows in a tablet are versioned with a system time instead of user-specified timestamps. ○ The versioning mechanism is used for snapshot reads and lock-free read-only transactions. [Figure: Spanner server organization (excerpt from the research paper)]
  42. 42. Paxos-based tablet replication ● Tablets in different zones are replicated with a Paxos-based algorithm. ○ A leader in each replication group takes care of row-range write locks during read-write transactions. A leader is re-elected through Paxos if necessary. ○ For transactions that involve multiple replication groups, the transaction managers of each group cooperate to perform two-phase commit. [Figure: replication between tablets (excerpt from the research paper)]
  43. 43. So..., what's the problem? ● The problem with a Paxos-based algorithm is that part of the replication is done asynchronously. ○ When a majority of the replicas have agreed to write the data, the write is considered committed. The remaining replicas are updated asynchronously. ○ If you enforced genuine full replication on each write, performance would degrade significantly. (This is partly the reason for the limited strongly consistent updates on Megastore.) ● Spanner associates timestamps with all writes, and every replica tracks a value called "safe time" (t-safe), which is the maximum timestamp at which the replica is up-to-date. ○ A replica can satisfy a read request for a timestamp t if t <= t-safe. If not, another replica is used. ○ t-safe advances at each Paxos write. During a transaction, the advancement is delayed until the transaction finishes.
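The read rule t <= t-safe can be expressed as a simple replica selection. This is a minimal sketch of the rule as stated on the slide, with a hypothetical `Replica` record and integer timestamps. It does not model how t-safe is computed or how a real client chooses among replicas.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Replica:
    name: str
    t_safe: int  # maximum timestamp at which this replica is up-to-date


def pick_replica(replicas: list, read_timestamp: int) -> Optional[Replica]:
    """Return the first replica that can serve a read at `read_timestamp`."""
    for replica in replicas:
        if read_timestamp <= replica.t_safe:
            return replica
    return None  # no replica has caught up to this timestamp yet


replicas = [Replica("zone-a", t_safe=90), Replica("zone-b", t_safe=120)]
print(pick_replica(replicas, 100))
```

In this example the replica in zone-a lags behind timestamp 100, so the read is served by zone-b.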
  44. 44. So..., again, what's the problem? ● The timestamp-based tracking requires that the clocks on all replicas be synchronized. ○ At the very least, clocks should be calibrated to within a bounded uncertainty, and the range of uncertainty must be known to the system. ● Spanner clusters are equipped with the TrueTime API, a system consisting of multiple time servers using GPS and atomic clocks. ○ The TrueTime API provides a time interval that is guaranteed to contain the current time. [Figure: fluctuations of time drifts from time servers (excerpt from the research paper), annotated with hardware maintenance of two time servers and a network latency improvement]
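An interval-based clock can be sketched as a local reading plus a known error bound. The code below is an illustration under assumptions: `tt_now` takes the local clock reading and the uncertainty `epsilon` as plain arguments, whereas the real TrueTime service derives its bounds from GPS and atomic-clock time servers. The helper names are hypothetical.

```python
from typing import NamedTuple


class TTInterval(NamedTuple):
    earliest: float
    latest: float


def tt_now(local_clock: float, epsilon: float) -> TTInterval:
    """Interval assumed to contain the true current time."""
    return TTInterval(local_clock - epsilon, local_clock + epsilon)


def definitely_after(interval: TTInterval, t: float) -> bool:
    """True only if t has passed on every clock consistent with the interval."""
    return interval.earliest > t


def definitely_before(interval: TTInterval, t: float) -> bool:
    """True only if t has not yet arrived on any clock consistent with the interval."""
    return interval.latest < t


now = tt_now(local_clock=100.0, epsilon=0.5)
print(now, definitely_after(now, 99.0), definitely_after(now, 100.0))
```

Because the uncertainty is explicit, a system can wait until a timestamp is definitely in the past before acting on it, rather than trusting a single clock reading.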
  45. 45. Thank you!
