Megastore: Providing Scalable, Highly Available Storage for Interactive Services
Niels Claeys
Outline
1. Introduction
2. Availability and Scale
3. Megastore features
4. Replication
5. Results
6. Conclusion
1. Introduction (1)
• Interactive online services demand
– High scalability
– Rapid development
– Low latency
– Consistency of data
– High availability
→ conflicting requirements
• Solution: Megastore
– Scalability of NoSQL
→ partition + replicate
– Convenience of an RDBMS
→ ACID semantics within each partition
– High availability
1. Introduction (2)
• Widely deployed within Google for several years
• >100 production applications
• 3 billion writes and 20 billion reads daily
• Nearly a petabyte of data across multiple datacenters
• Available on Google App Engine (GAE) since Jan 2011
2.1 Availability and scalability
• Availability: Paxos
→ fault-tolerant consensus algorithm
– No master
– Replicate logs
• Scale:
– Partition data into many small databases
– Each partition has its own replicated log
2.2 Partitioning [figure]
3.1 Megastore features: API
• Megastore = cost-transparent API
– No expressive queries
– Storing and querying hierarchical data in a key-value store is easy
– Joins are done in application logic:
• Merge phase supported
• Outer joins based on indexes
→ understandable performance implications
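
To make "joins in application logic" concrete, here is a minimal sketch of the merge phase of a merge join, assuming both inputs are scans that already return rows sorted by the same key. The function name `merge_join` and the `users` / `photos` data are hypothetical and are not part of Megastore's API.

```python
def merge_join(parents, children):
    """Join two (key, value) streams that are both sorted by key.

    Assumes parent keys are unique; a parent may have many children.
    Yields (key, parent_value, child_value) for every matching pair.
    """
    parents = iter(parents)
    current = next(parents, None)
    for child_key, child_value in children:
        # Advance the parent cursor until it reaches or passes the child's key.
        while current is not None and current[0] < child_key:
            current = next(parents, None)
        if current is None:
            break  # no parents left, so no further matches are possible
        if current[0] == child_key:
            yield child_key, current[1], child_value


# Hypothetical sorted scans (illustration only).
users = [(1, "ann"), (2, "bob"), (4, "dan")]
photos = [(1, "a.jpg"), (1, "b.jpg"), (3, "x.jpg"), (4, "d.jpg")]
joined = list(merge_join(users, photos))
```

Each input is read once, front to back, so the cost of the join is visible to the developer: it is the cost of the two scans.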
3.2 Megastore features: Data model
• Megastore tables:
– Entity group root table
– Child table: references the root
• Entity: single row
→ identified by the concatenation of its key columns
3.2 Megastore features: Indexes
• 2 levels of indexes:
– Local: one per entity group
• updated atomically and consistently
– Global: spans entity groups
• Find entities without knowing their entity group key
• May not reflect all recent updates
3.2 Megastore features: Bigtable
• Primary keys cluster entities together
• Each entity = single Bigtable row
• “IN TABLE” places several tables in a single Bigtable
→ key ordering ensures related entities are stored adjacently
• Bigtable column name = Megastore table name + property name
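
The mapping above can be sketched with a plain dictionary standing in for one Bigtable. This is only an illustration under assumptions: the `User` and `Photo` tables, the tuple encoding of row keys, and the "." separator in column names are all hypothetical choices made for the example.

```python
def row_key(*key_parts):
    # Assumption: the row key is the concatenation of the key columns,
    # modelled here as a tuple so that sorting follows key order.
    return tuple(key_parts)


def column_name(table, prop):
    # Column name = table name + property name (separator is an assumption).
    return f"{table}.{prop}"


rows = {}  # stand-in for a single Bigtable shared by both tables


def put(table, key_parts, **props):
    row = rows.setdefault(row_key(*key_parts), {})
    for prop, value in props.items():
        row[column_name(table, prop)] = value


# Root entities are keyed by user id; child entities by (user id, photo id).
put("User", (101,), name="John")
put("Photo", (101, 500), tag="vacation")
put("User", (102,), name="Mary")
put("Photo", (101, 502), tag="dinner")

sorted_keys = sorted(rows)
```

Sorting the keys places both photos of user 101 directly after that user's row and before user 102, which is the adjacency that key ordering is meant to provide.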
3.3 Megastore features: Transactions (1)
• Entity group (EG) = mini-database
→ serializable ACID semantics
• MVCC (multiversion concurrency control)
→ each transaction's writes are stored under its timestamp
• Reads and writes are isolated
3.3 Megastore features: Transactions (2)
• Three levels of read consistency
– Current: read the EG after all committed writes in the log have been applied
– Snapshot: read the EG as of the last fully applied transaction
– Inconsistent: ignore the log and read the latest values directly
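
The difference between the three read levels can be shown with a small in-memory model. This is a sketch under assumptions, not Megastore code: the class `EntityGroup`, its methods, and the use of the log position as the timestamp are hypothetical.

```python
class EntityGroup:
    def __init__(self):
        self.log = []       # committed log entries; position (1-based) is the timestamp
        self.applied = 0    # number of log entries fully applied to the store
        self.versions = {}  # key -> {timestamp: value}

    def commit(self, mutations):
        self.log.append(dict(mutations))

    def _write(self, key, ts, value):
        self.versions.setdefault(key, {})[ts] = value

    def apply_pending(self):
        # Apply every committed entry that has not been applied yet.
        while self.applied < len(self.log):
            ts = self.applied + 1
            for key, value in self.log[self.applied].items():
                self._write(key, ts, value)
            self.applied = ts

    def _read_at(self, key, ts):
        candidates = [t for t in self.versions.get(key, {}) if t <= ts]
        return self.versions[key][max(candidates)] if candidates else None

    def current_read(self, key):
        self.apply_pending()
        return self._read_at(key, len(self.log))

    def snapshot_read(self, key):
        return self._read_at(key, self.applied)

    def inconsistent_read(self, key):
        history = self.versions.get(key, {})
        return history[max(history)] if history else None


eg = EntityGroup()
eg.commit({"a": 1, "b": 1})
eg.apply_pending()
eg.commit({"a": 2, "b": 2})  # committed, but not applied yet
eg._write("a", 2, 2)         # simulate a transaction that is only partly applied

snapshot = (eg.snapshot_read("a"), eg.snapshot_read("b"))
inconsistent = (eg.inconsistent_read("a"), eg.inconsistent_read("b"))
current = (eg.current_read("a"), eg.current_read("b"))
```

In this run the snapshot read still sees the first transaction, the inconsistent read sees a mix of both, and the current read first applies the pending entry and then sees the second transaction in full.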
3.3 Megastore features: Transactions (3)
• Write transaction:
― Current read: obtain the timestamp and log position of the last committed transaction
― Application logic: read from Bigtable and gather writes into a log entry
― Commit: use Paxos to achieve consensus for appending the log entry to the log
― Apply: write mutations to the entities and indexes in Bigtable
― Clean up: delete data that is no longer required
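
The steps above can be sketched as optimistic concurrency on the log position. This is a simplified, single-process illustration: the Paxos round in the commit step is replaced by a local check that the next log position is still free, and the names `EntityGroupLog`, `ConflictError`, and `run_transaction` are hypothetical. The clean-up step is left out.

```python
class ConflictError(Exception):
    """Raised when another writer already took the next log position."""


class EntityGroupLog:
    def __init__(self):
        self.log = []   # committed log entries, in order
        self.data = {}  # stand-in for the entity data in Bigtable

    def current_read(self):
        # Step 1: log position of the last committed transaction.
        return len(self.log)

    def commit(self, read_position, entry):
        # Step 3 (stand-in for Paxos): only one writer wins each position.
        if read_position != len(self.log):
            raise ConflictError(f"position {read_position + 1} already taken")
        self.log.append(dict(entry))
        return len(self.log)

    def apply(self, position):
        # Step 4: write the entry's mutations to the data.
        self.data.update(self.log[position - 1])


def run_transaction(eg, application_logic):
    position = eg.current_read()
    entry = application_logic(dict(eg.data))  # Step 2: gather writes
    new_position = eg.commit(position, entry)
    eg.apply(new_position)
    return new_position


eg = EntityGroupLog()
run_transaction(eg, lambda data: {"balance": 10})

# Two writers read the same log position; only the first commit succeeds.
pos_a = eg.current_read()
pos_b = eg.current_read()
eg.apply(eg.commit(pos_a, {"balance": 20}))
try:
    eg.commit(pos_b, {"balance": 30})
    second_writer_conflicted = False
except ConflictError:
    second_writer_conflicted = True
```

The losing writer has to start again from a new current read, which is why write throughput within one entity group is limited.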
3.3 Megastore features: Transactions (4)
• Queues: transactional messaging between EGs
― A transaction atomically sends or receives messages
― Allow operations that span many EGs
― A queue is associated with each EG (scalable)
• Two-phase commit
• Queues are preferred over two-phase commit
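
A rough sketch of the queue idea, assuming one inbox per entity group. The sender's transaction updates its own entity group and enqueues a message; the receiver applies the message later in its own transaction. All names here (`EntityGroup`, `send`, `receive_all`) are hypothetical, and real atomicity is not modelled.

```python
from collections import deque


class EntityGroup:
    def __init__(self, name):
        self.name = name
        self.data = {}
        self.inbox = deque()  # queue associated with this entity group


def send(sender, receiver, local_update, message):
    # Stand-in for one sender-side transaction: local write plus enqueue.
    sender.data.update(local_update)
    receiver.inbox.append(message)


def receive_all(receiver):
    # Stand-in for receiver-side transactions: consume and apply each message.
    while receiver.inbox:
        key, value = receiver.inbox.popleft()
        receiver.data[key] = value


alice = EntityGroup("alice")
bob = EntityGroup("bob")
send(alice, bob, {"sent_invite": "lunch"}, ("invite_from_alice", "lunch"))
before = dict(bob.data)  # message is queued, not yet applied
receive_all(bob)
after = dict(bob.data)
```

The receiver's state changes only when it processes its inbox, so the update across the two entity groups is asynchronous rather than immediately consistent.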
4.1 Replication: Paxos
• Reach consensus among replicas
– Tolerates delayed and reordered messages
– A majority of replicas must be reachable
• Roles: proposers, acceptors, learners
• Proposers: issue requests with monotonically increasing sequence numbers
• Problem
– High latency: multiple round trips
→ adapted for use in Megastore
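
For readers new to Paxos, here is a minimal single-value sketch of the basic protocol (not Megastore's optimized variant). It runs synchronously in one process, and the `reachable` flag, the class `Acceptor`, and the function `propose` are assumptions made for the illustration.

```python
class Acceptor:
    def __init__(self):
        self.reachable = True
        self.promised = 0     # highest proposal number promised so far
        self.accepted = None  # (number, value) of the last accepted proposal

    def prepare(self, number):
        if number > self.promised:
            self.promised = number
            return True, self.accepted
        return False, None

    def accept(self, number, value):
        if number >= self.promised:
            self.promised = number
            self.accepted = (number, value)
            return True
        return False


def propose(acceptors, number, value):
    """Run one round; return the chosen value, or None without a majority."""
    majority = len(acceptors) // 2 + 1
    live = [a for a in acceptors if a.reachable]

    # Phase 1: prepare / promise.
    promises = []
    for acceptor in live:
        ok, prior = acceptor.prepare(number)
        if ok:
            promises.append(prior)
    if len(promises) < majority:
        return None

    # If any acceptor already accepted a value, adopt the highest-numbered one.
    prior_accepted = [p for p in promises if p is not None]
    if prior_accepted:
        value = max(prior_accepted, key=lambda p: p[0])[1]

    # Phase 2: accept.
    acks = sum(1 for acceptor in live if acceptor.accept(number, value))
    return value if acks >= majority else None


acceptors = [Acceptor() for _ in range(3)]
acceptors[2].reachable = False
first = propose(acceptors, 1, "A")   # two of three replicas are reachable
second = propose(acceptors, 2, "B")  # must adopt the value already accepted
acceptors[1].reachable = False
third = propose(acceptors, 3, "C")   # only one replica reachable
```

The two phases are the "multiple round trips" mentioned above: every proposal pays for a prepare round and an accept round.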
4.1 Replication: Paxos illustration [figure]
4.2 Replication: Paxos adaptation
• Fast reads: local, through coordinators
– Avoid inter-replica round trips
– Coordinator: tracks which EGs are up to date at its replica
→ simple because it has no database
• Fast writes: through leaders
– Eliminate the prepare phase
– Multiple writes issued to the same leader
– Leader = replica closest to the writer
4.3 Replication: Algorithms (1)
• Each replica stores the log entries of an EG
→ can accept them out of order
• Read:
→ at least one replica must be up to date
4.3 Replication: Algorithms (2)
• Prep: package changes + timestamp + leader as a log entry
• Replica does not accept the write: invalidate its coordinator
• Data only becomes visible after the invalidate step
4.4 Replication: Coordinator availability
• Coordinator: one in each datacenter
→ keeps state only about its local replica
→ simple process = more stable
• Failure detection:
– Chubby locks: used to detect whether a coordinator is online
→ loses a majority of its locks: treats all EGs as out of date
– Datacenter failure: writers wait for the coordinator's locks to expire before the write can complete
• Validation races:
– Always send the log position
– Higher number wins
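
The "higher number wins" rule can be sketched as follows. The representation is an assumption made for this example: the coordinator remembers, per entity group, the highest validated and the highest invalidated log position, and unknown entity groups are treated as out of date. The class and method names are hypothetical.

```python
class Coordinator:
    def __init__(self):
        self.validated = {}    # entity group -> highest validated log position
        self.invalidated = {}  # entity group -> highest invalidated log position

    def validate(self, eg, position):
        self.validated[eg] = max(position, self.validated.get(eg, 0))

    def invalidate(self, eg, position):
        self.invalidated[eg] = max(position, self.invalidated.get(eg, 0))

    def is_up_to_date(self, eg):
        # Conservative default: an entity group never validated is out of date.
        if eg not in self.validated:
            return False
        return self.validated[eg] >= self.invalidated.get(eg, 0)

    def lose_chubby_locks(self):
        # Losing the locks resets the state: every entity group is out of date.
        self.validated.clear()
        self.invalidated.clear()


c = Coordinator()
c.validate("eg1", 5)
c.invalidate("eg1", 7)  # the local replica missed the write at position 7
c.validate("eg1", 6)    # a delayed validate for an earlier write arrives late
stale = c.is_up_to_date("eg1")
c.validate("eg1", 7)    # the replica has caught up through position 7
fresh = c.is_up_to_date("eg1")
c.lose_chubby_locks()
after_loss = c.is_up_to_date("eg1")
```

Because every message carries its log position, a late validate for position 6 cannot override the invalidate for position 7.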
4.5 Replication: Replica types
• Full replica:
– What we have seen until now
• Witness replica:
– Can vote
– Stores write-ahead logs but not the data
• Read-only replica:
– Cannot vote
– Stores a full snapshot of the data
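
A small sketch of how the replica types affect quorum size, assuming a simple majority over the voting replicas. The `Replica` class, the `kind` strings, and the datacenter names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    name: str
    kind: str  # "full", "witness", or "read_only"

    @property
    def can_vote(self):
        return self.kind in ("full", "witness")

    @property
    def serves_data(self):
        return self.kind in ("full", "read_only")


def quorum_size(replicas):
    # Only voting replicas count towards the majority.
    voters = sum(1 for r in replicas if r.can_vote)
    return voters // 2 + 1


replicas = [
    Replica("dc-a", "full"),
    Replica("dc-b", "full"),
    Replica("dc-c", "witness"),
    Replica("dc-d", "read_only"),
]
needed = quorum_size(replicas)
data_sites = [r.name for r in replicas if r.serves_data]
```

With two full replicas and one witness there are three voters, so two votes form a quorum; the read-only replica adds a place to read data without changing the quorum.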
4.6 Replication: Architecture [figure]
5. Results
• Read latency: 10+ ms
• Write latency: 300 ms
• Issue: replica unavailable
• Solutions:
– Reroute traffic to servers nearby
– Disable the replica's coordinator
6. Conclusion
• Scalability and availability
• Simpler reasoning and usage
→ ACID semantics
• Latency:
→ best effort (low enough for interactive apps)
• Write throughput within an EG: a few writes per second
→ if that is not enough: shard the EGs or place replicas near each other
Questions?
Question 1
As stated several times, only two of the three CAP properties can be kept. Which two does Megastore focus on, and how?
Reasoning:
– Partition tolerance: dividing the database into EGs and replicating these over multiple datacenters
– Availability: providing a highly available service through Paxos
– Consistency: relaxed consistency between EGs and for global indexes
Question 2
Current reads have the following guarantees:
- A read always observes the last-acknowledged write.
- After a write has been observed, all future reads observe that write. (A write might be observed before it is acknowledged.)
Contradiction?
→ No, I do not think so, but it is confusing
Reasoning:
– Both guarantees concern current reads (these are the reads that preserve consistency)
– The sentence in parentheses mentions that inconsistent reads are also possible, but they are not current reads
Question 3
In my opinion, much of their focus goes towards making the system consistent. But their API also lets you request current, snapshot, and inconsistent data. Do you think this is a valuable addition?
Reasoning:
→ Mainly due to the performance bottleneck of a consistent system
→ Depends on the application: there are applications where you do not mind reading something inconsistent
→ Current and snapshot reads still maintain consistency
=> Personally I think the value is limited: I cannot think of an application where latency is so critical that it would rather have inconsistent data than wait a bit longer
Question 4
4.4.3: It is not clear to me how the 'read-only' replicas receive their data, since that data needs to be consistent. Do you have an idea?
→ not mentioned in the paper
Idea:
– The coordinator of a replica keeps track of the up-to-date EGs
– A mechanism periodically takes a snapshot of these up-to-date EGs and copies it to the read-only replicas
Question 5
Megastore is compared with Bigtable several times. What do you see as the biggest differences in implementation and in types of usage?
Reasoning:
– Built on top of Bigtable and based on different requirements:
→ consistency guarantees + wide-area communication
– Bigtable is used within one datacenter ↔ Megastore across multiple
→ increased availability (Paxos) but higher latency
– Consistency guarantees from Megastore: I suspect lower performance and throughput than with Bigtable
– Implementation:
• Bigtable: master ensures replication ↔ Megastore: Paxos (no master recovery)
• Bigtable: one log per tablet server ↔ Megastore: one log per EG in each replica
• Very different APIs: Megastore supports schemas + indexes
– Note: Google App Engine moved from Bigtable to Megastore
References
MacDonald A., Paxos by example, http://angusmacdonald.me/writing/paxos-by-example/, accessed 06-05-14
Google App Engine, Switch from Bigtable to Megastore, http://googleappengine.blogspot.be/2009/09/migration-to-better-datastore.html, accessed 08-05-13

More Related Content

What's hot

Flink Forward San Francisco 2019: Moving from Lambda and Kappa Architectures ...
Flink Forward San Francisco 2019: Moving from Lambda and Kappa Architectures ...Flink Forward San Francisco 2019: Moving from Lambda and Kappa Architectures ...
Flink Forward San Francisco 2019: Moving from Lambda and Kappa Architectures ...Flink Forward
 
Performance Tuning RocksDB for Kafka Streams’ State Stores
Performance Tuning RocksDB for Kafka Streams’ State StoresPerformance Tuning RocksDB for Kafka Streams’ State Stores
Performance Tuning RocksDB for Kafka Streams’ State Storesconfluent
 
New Generation Oracle RAC Performance
New Generation Oracle RAC PerformanceNew Generation Oracle RAC Performance
New Generation Oracle RAC PerformanceAnil Nair
 
Achieving compliance With MongoDB Security
Achieving compliance With MongoDB Security Achieving compliance With MongoDB Security
Achieving compliance With MongoDB Security Mydbops
 
Achieving High Availability in PostgreSQL
Achieving High Availability in PostgreSQLAchieving High Availability in PostgreSQL
Achieving High Availability in PostgreSQLMydbops
 
Understand oracle real application cluster
Understand oracle real application clusterUnderstand oracle real application cluster
Understand oracle real application clusterSatishbabu Gunukula
 
Oracle Active Data Guard: Best Practices and New Features Deep Dive
Oracle Active Data Guard: Best Practices and New Features Deep Dive Oracle Active Data Guard: Best Practices and New Features Deep Dive
Oracle Active Data Guard: Best Practices and New Features Deep Dive Glen Hawkins
 
Building an Effective Data Warehouse Architecture
Building an Effective Data Warehouse ArchitectureBuilding an Effective Data Warehouse Architecture
Building an Effective Data Warehouse ArchitectureJames Serra
 
Understanding oracle rac internals part 1 - slides
Understanding oracle rac internals   part 1 - slidesUnderstanding oracle rac internals   part 1 - slides
Understanding oracle rac internals part 1 - slidesMohamed Farouk
 
Presto as a Service - Tips for operation and monitoring
Presto as a Service - Tips for operation and monitoringPresto as a Service - Tips for operation and monitoring
Presto as a Service - Tips for operation and monitoringTaro L. Saito
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeDatabricks
 
Cross Data Center Replication with Redis using Redis Enterprise
Cross Data Center Replication with Redis using Redis EnterpriseCross Data Center Replication with Redis using Redis Enterprise
Cross Data Center Replication with Redis using Redis EnterpriseCihan Biyikoglu
 
Webinar slides: An Introduction to Performance Monitoring for PostgreSQL
Webinar slides: An Introduction to Performance Monitoring for PostgreSQLWebinar slides: An Introduction to Performance Monitoring for PostgreSQL
Webinar slides: An Introduction to Performance Monitoring for PostgreSQLSeveralnines
 
Best Practices for Oracle Exadata and the Oracle Optimizer
Best Practices for Oracle Exadata and the Oracle OptimizerBest Practices for Oracle Exadata and the Oracle Optimizer
Best Practices for Oracle Exadata and the Oracle OptimizerEdgar Alejandro Villegas
 
Oracle RAC Internals - The Cache Fusion Edition
Oracle RAC Internals - The Cache Fusion EditionOracle RAC Internals - The Cache Fusion Edition
Oracle RAC Internals - The Cache Fusion EditionMarkus Michalewicz
 
Streaming using Kafka Flink & Elasticsearch
Streaming using Kafka Flink & ElasticsearchStreaming using Kafka Flink & Elasticsearch
Streaming using Kafka Flink & ElasticsearchKeira Zhou
 
Distributed Database Architecture for GDPR
Distributed Database Architecture for GDPRDistributed Database Architecture for GDPR
Distributed Database Architecture for GDPRYugabyte
 
Apache Tez - Accelerating Hadoop Data Processing
Apache Tez - Accelerating Hadoop Data ProcessingApache Tez - Accelerating Hadoop Data Processing
Apache Tez - Accelerating Hadoop Data Processinghitesh1892
 

What's hot (20)

Flink Forward San Francisco 2019: Moving from Lambda and Kappa Architectures ...
Flink Forward San Francisco 2019: Moving from Lambda and Kappa Architectures ...Flink Forward San Francisco 2019: Moving from Lambda and Kappa Architectures ...
Flink Forward San Francisco 2019: Moving from Lambda and Kappa Architectures ...
 
Performance Tuning RocksDB for Kafka Streams’ State Stores
Performance Tuning RocksDB for Kafka Streams’ State StoresPerformance Tuning RocksDB for Kafka Streams’ State Stores
Performance Tuning RocksDB for Kafka Streams’ State Stores
 
New Generation Oracle RAC Performance
New Generation Oracle RAC PerformanceNew Generation Oracle RAC Performance
New Generation Oracle RAC Performance
 
Achieving compliance With MongoDB Security
Achieving compliance With MongoDB Security Achieving compliance With MongoDB Security
Achieving compliance With MongoDB Security
 
Achieving High Availability in PostgreSQL
Achieving High Availability in PostgreSQLAchieving High Availability in PostgreSQL
Achieving High Availability in PostgreSQL
 
Understand oracle real application cluster
Understand oracle real application clusterUnderstand oracle real application cluster
Understand oracle real application cluster
 
Oracle Active Data Guard: Best Practices and New Features Deep Dive
Oracle Active Data Guard: Best Practices and New Features Deep Dive Oracle Active Data Guard: Best Practices and New Features Deep Dive
Oracle Active Data Guard: Best Practices and New Features Deep Dive
 
Building an Effective Data Warehouse Architecture
Building an Effective Data Warehouse ArchitectureBuilding an Effective Data Warehouse Architecture
Building an Effective Data Warehouse Architecture
 
Understanding oracle rac internals part 1 - slides
Understanding oracle rac internals   part 1 - slidesUnderstanding oracle rac internals   part 1 - slides
Understanding oracle rac internals part 1 - slides
 
The Impala Cookbook
The Impala CookbookThe Impala Cookbook
The Impala Cookbook
 
Presto as a Service - Tips for operation and monitoring
Presto as a Service - Tips for operation and monitoringPresto as a Service - Tips for operation and monitoring
Presto as a Service - Tips for operation and monitoring
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
 
Cross Data Center Replication with Redis using Redis Enterprise
Cross Data Center Replication with Redis using Redis EnterpriseCross Data Center Replication with Redis using Redis Enterprise
Cross Data Center Replication with Redis using Redis Enterprise
 
Webinar slides: An Introduction to Performance Monitoring for PostgreSQL
Webinar slides: An Introduction to Performance Monitoring for PostgreSQLWebinar slides: An Introduction to Performance Monitoring for PostgreSQL
Webinar slides: An Introduction to Performance Monitoring for PostgreSQL
 
Best Practices for Oracle Exadata and the Oracle Optimizer
Best Practices for Oracle Exadata and the Oracle OptimizerBest Practices for Oracle Exadata and the Oracle Optimizer
Best Practices for Oracle Exadata and the Oracle Optimizer
 
HDFS Analysis for Small Files
HDFS Analysis for Small FilesHDFS Analysis for Small Files
HDFS Analysis for Small Files
 
Oracle RAC Internals - The Cache Fusion Edition
Oracle RAC Internals - The Cache Fusion EditionOracle RAC Internals - The Cache Fusion Edition
Oracle RAC Internals - The Cache Fusion Edition
 
Streaming using Kafka Flink & Elasticsearch
Streaming using Kafka Flink & ElasticsearchStreaming using Kafka Flink & Elasticsearch
Streaming using Kafka Flink & Elasticsearch
 
Distributed Database Architecture for GDPR
Distributed Database Architecture for GDPRDistributed Database Architecture for GDPR
Distributed Database Architecture for GDPR
 
Apache Tez - Accelerating Hadoop Data Processing
Apache Tez - Accelerating Hadoop Data ProcessingApache Tez - Accelerating Hadoop Data Processing
Apache Tez - Accelerating Hadoop Data Processing
 

Similar to Megastore: Providing scalable and highly available storage

Hardware Provisioning
Hardware Provisioning Hardware Provisioning
Hardware Provisioning MongoDB
 
Hardware Provisioning
Hardware ProvisioningHardware Provisioning
Hardware ProvisioningMongoDB
 
20141206 4 q14_dataconference_i_am_your_db
20141206 4 q14_dataconference_i_am_your_db20141206 4 q14_dataconference_i_am_your_db
20141206 4 q14_dataconference_i_am_your_dbhyeongchae lee
 
Db presentation google_megastore
Db presentation google_megastoreDb presentation google_megastore
Db presentation google_megastoreAlanoud Alqoufi
 
Cloud computing UNIT 2.1 presentation in
Cloud computing UNIT 2.1 presentation inCloud computing UNIT 2.1 presentation in
Cloud computing UNIT 2.1 presentation inRahulBhole12
 
Silicon Valley Code Camp 2014 - Advanced MongoDB
Silicon Valley Code Camp 2014 - Advanced MongoDBSilicon Valley Code Camp 2014 - Advanced MongoDB
Silicon Valley Code Camp 2014 - Advanced MongoDBDaniel Coupal
 
Toronto High Scalability meetup - Scaling ELK
Toronto High Scalability meetup - Scaling ELKToronto High Scalability meetup - Scaling ELK
Toronto High Scalability meetup - Scaling ELKAndrew Trossman
 
How Kafka and Modern Databases Benefit Apps and Analytics
How Kafka and Modern Databases Benefit Apps and AnalyticsHow Kafka and Modern Databases Benefit Apps and Analytics
How Kafka and Modern Databases Benefit Apps and AnalyticsSingleStore
 
Cloud infrastructure. Google File System and MapReduce - Andrii Vozniuk
Cloud infrastructure. Google File System and MapReduce - Andrii VozniukCloud infrastructure. Google File System and MapReduce - Andrii Vozniuk
Cloud infrastructure. Google File System and MapReduce - Andrii VozniukAndrii Vozniuk
 
Study of various factors affecting performance of multi core processors
Study of various factors affecting performance of multi core processorsStudy of various factors affecting performance of multi core processors
Study of various factors affecting performance of multi core processorsateeq ateeq
 
Dori Exterman, Considerations for choosing the parallel computing strategy th...
Dori Exterman, Considerations for choosing the parallel computing strategy th...Dori Exterman, Considerations for choosing the parallel computing strategy th...
Dori Exterman, Considerations for choosing the parallel computing strategy th...Sergey Platonov
 
Software architecture for data applications
Software architecture for data applicationsSoftware architecture for data applications
Software architecture for data applicationsDing Li
 
HDFS_architecture.ppt
HDFS_architecture.pptHDFS_architecture.ppt
HDFS_architecture.pptvijayapraba1
 
Choosing the right parallel compute architecture
Choosing the right parallel compute architecture Choosing the right parallel compute architecture
Choosing the right parallel compute architecture corehard_by
 
Scalability Considerations
Scalability ConsiderationsScalability Considerations
Scalability ConsiderationsNavid Malek
 
Taking Splunk to the Next Level - Architecture Breakout Session
Taking Splunk to the Next Level - Architecture Breakout SessionTaking Splunk to the Next Level - Architecture Breakout Session
Taking Splunk to the Next Level - Architecture Breakout SessionSplunk
 
Managing 50K+ Redis Databases Over 4 Public Clouds ... with a Tiny Devops Team
Managing 50K+ Redis Databases Over 4 Public Clouds ... with a Tiny Devops TeamManaging 50K+ Redis Databases Over 4 Public Clouds ... with a Tiny Devops Team
Managing 50K+ Redis Databases Over 4 Public Clouds ... with a Tiny Devops TeamRedis Labs
 

Similar to Megastore: Providing scalable and highly available storage (20)

Hardware Provisioning
Hardware Provisioning Hardware Provisioning
Hardware Provisioning
 
Hardware Provisioning
Hardware ProvisioningHardware Provisioning
Hardware Provisioning
 
20141206 4 q14_dataconference_i_am_your_db
20141206 4 q14_dataconference_i_am_your_db20141206 4 q14_dataconference_i_am_your_db
20141206 4 q14_dataconference_i_am_your_db
 
Db presentation google_megastore
Db presentation google_megastoreDb presentation google_megastore
Db presentation google_megastore
 
Cloud computing UNIT 2.1 presentation in
Cloud computing UNIT 2.1 presentation inCloud computing UNIT 2.1 presentation in
Cloud computing UNIT 2.1 presentation in
 
Google file system
Google file systemGoogle file system
Google file system
 
Silicon Valley Code Camp 2014 - Advanced MongoDB
Silicon Valley Code Camp 2014 - Advanced MongoDBSilicon Valley Code Camp 2014 - Advanced MongoDB
Silicon Valley Code Camp 2014 - Advanced MongoDB
 
Toronto High Scalability meetup - Scaling ELK
Toronto High Scalability meetup - Scaling ELKToronto High Scalability meetup - Scaling ELK
Toronto High Scalability meetup - Scaling ELK
 
How Kafka and Modern Databases Benefit Apps and Analytics
How Kafka and Modern Databases Benefit Apps and AnalyticsHow Kafka and Modern Databases Benefit Apps and Analytics
How Kafka and Modern Databases Benefit Apps and Analytics
 
Cloud infrastructure. Google File System and MapReduce - Andrii Vozniuk
Cloud infrastructure. Google File System and MapReduce - Andrii VozniukCloud infrastructure. Google File System and MapReduce - Andrii Vozniuk
Cloud infrastructure. Google File System and MapReduce - Andrii Vozniuk
 
Study of various factors affecting performance of multi core processors
Study of various factors affecting performance of multi core processorsStudy of various factors affecting performance of multi core processors
Study of various factors affecting performance of multi core processors
 
Dori Exterman, Considerations for choosing the parallel computing strategy th...
Dori Exterman, Considerations for choosing the parallel computing strategy th...Dori Exterman, Considerations for choosing the parallel computing strategy th...
Dori Exterman, Considerations for choosing the parallel computing strategy th...
 
Software architecture for data applications
Software architecture for data applicationsSoftware architecture for data applications
Software architecture for data applications
 
HDFS_architecture.ppt
HDFS_architecture.pptHDFS_architecture.ppt
HDFS_architecture.ppt
 
Choosing the right parallel compute architecture
Choosing the right parallel compute architecture Choosing the right parallel compute architecture
Choosing the right parallel compute architecture
 
try
trytry
try
 
Scalability Considerations
Scalability ConsiderationsScalability Considerations
Scalability Considerations
 
Taking Splunk to the Next Level - Architecture Breakout Session
Taking Splunk to the Next Level - Architecture Breakout SessionTaking Splunk to the Next Level - Architecture Breakout Session
Taking Splunk to the Next Level - Architecture Breakout Session
 
Managing 50K+ Redis Databases Over 4 Public Clouds ... with a Tiny Devops Team
Managing 50K+ Redis Databases Over 4 Public Clouds ... with a Tiny Devops TeamManaging 50K+ Redis Databases Over 4 Public Clouds ... with a Tiny Devops Team
Managing 50K+ Redis Databases Over 4 Public Clouds ... with a Tiny Devops Team
 
Master.pptx
Master.pptxMaster.pptx
Master.pptx
 

Recently uploaded

Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 

Recently uploaded (20)

Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 

Megastore: Providing scalable and highly available storage

  • 1. MegaStore: Providing Scalable, Highly Available Storage for Interactive Services Niels Claeys
  • 2. Outline 1. Introduction 2. Availability and Scale 3. Megastore features 4. Replication 5. Results 6. Conclusion
  • 3. 1. Introduction (1) • Interactive online services demand – High scalability – Rapid development – Low latency – Consistency of data – High availability → conflicting requirements • Solution: Megastore – Scalability of NoSQL → partition + replicate – Convenience of an RDBMS → ACID semantics within a partition – High availability
  • 4. 1. Introduction (2) • Widely deployed in Google for several years • >100 production applications • 3 billion writes and 20 billion reads daily • A petabyte of data across multiple datacenters • Available on GAE since Jan 2011
  • 5. 2.1 Availability and scalability • Availability: Paxos → fault-tolerant consensus algorithm – No master – Replicate logs • Scale: – Partition data into small databases – Each partition has its own replicated log
  • 8. 3.1 Megastore features: API • Megastore = cost-transparent API – No expressive queries – Storing and querying hierarchical data in a key-value store is easy – Joins in application logic: • Merge phase supported • Outer joins based on indexes → understandable performance implications
  • 9. 3.2 Megastore features: Data model • Megastore tables: – Entity group root – Child table: reference to root • Entity: single row → identified by concatenation of keys
  • 10. 3.2 Megastore features: Indexes • 2 levels of indexes: – Local: for each entity group • Updated atomically and consistently – Global: spans entity groups • Find entities without keys • Not all updates visible
  • 11. 3.2 Megastore features: Bigtable • Primary keys cluster entities together • Each entity = single Bigtable row • “IN TABLE” places tables in a single Bigtable → key ordering ensures entities are stored adjacently • Bigtable column name = Megastore table name + property name
  • 12. 3.3 Megastore features: Transactions (1) • Entity group (EG) = mini-database → serializable ACID semantics • MVCC (multiversion concurrency control) → transaction timestamp • Reads and writes are isolated
  • 13. 3.3 Megastore features: Transactions (2) • Three levels of read consistency – Current: read the EG after all committed writes have been applied – Snapshot: read at the last fully applied transaction of the EG – Inconsistent: ignore the log and read the latest values
  • 14. 3.3 Megastore features: Transactions (3) • Write transaction: – Current read: obtain the timestamp and log position of the last committed transaction – Application logic: read from Bigtable and gather writes into a log entry – Commit: use Paxos to achieve consensus for appending the log entry to the log – Apply: write mutations to the entities and indexes in Bigtable – Clean up: delete temporary data
  • 15. 3.3 Megastore features: Transactions (4) • Queue: transactional messaging between EGs – Transaction atomically handles messages – Perform operations on many EGs – Associated with each EG (scalable) • Two-phase commit • Queues are preferred over two-phase commit
  • 17. 4.1 Replication: Paxos • Reach consensus between replicas – Tolerates delayed and reordered messages – A majority of replicas must be reachable • Proposers, acceptors, learners • Proposers: requests with monotonically increasing sequence numbers • Problem – High latency: multiple round trips → adapted for use in Megastore
  • 19. 4.2 Replication: Paxos adaptation • Fast reads: local through coordinators – Eliminates prepare phase – Coordinator: tracks which EGs are up to date → simple because it has no database • Fast writes: through leaders – Eliminates prepare phase – Multiple writes issued to the same leader – Leader = closest replica to the writer
  • 20. 4.3 Replication: Algorithms (1) • Replica stores log entries of each EG → can accept them out of order • Read: → at least 1 replica must be up to date
  • 21. 4.3 Replication: Algorithms (2) • Prepare: package changes + timestamp + leader as a log entry • Write does not succeed: invalidate • Data only visible after the invalidate step
  • 22. 4.4 Replication: Coordinator availability • Coordinator: in each datacenter → keeps state of the local replica → simple process = more stable • Failure detection: – Chubby locks: detect whether other coordinators are online → loses majority of its locks: all EGs considered out of date – Datacenter failure: writers wait for the coordinator's locks to expire before the write can be completed • Validation races: – Always send log position – Higher number wins
  • 23. 4.5 Replication: Replica types • Full replica: – What we have seen until now • Witness replica: – Can vote – Stores write-ahead logs but not the data • Read-only replica: – Cannot vote – Stores a full snapshot of the data
  • 25. 5. Results • Read latency: 10+ ms • Write latency: 300 ms • Issue: replica unavailable • Solution – Reroute traffic to servers nearby – Disable the replica's coordinator
  • 26. 6. Conclusion • Scalability and availability • Simpler reasoning and usage → ACID semantics • Latency: → best effort (low enough for interactive apps) • Throughput within an EG: a few writes per second → if not enough: shard EGs or place replicas near each other
  • 28. Question 1: As has been stated several times, only 2 elements of CAP can be kept. Which two does Megastore focus on, and how? Reasoning: – Partition tolerance: dividing the database into EGs and replicating these over multiple datacenters – Availability: providing a service that is highly available through Paxos – Consistency: relaxed consistency between EGs and for global indexes
  • 29. Question 2: Current reads have the following guarantees: - A read always observes the last-acknowledged write. - After a write has been observed, all future reads observe that write. (A write might be observed before it is acknowledged.) Contradiction? → No, I do not think so, but it is confusing. Reasoning: – The two guarantees concern current reads (these are the reads that preserve consistency) – The sentence in parentheses mentions that inconsistent reads are also possible, but they are not current reads
  • 30. Question 3: In my opinion, a lot of their focus goes towards making the system consistent. But their API also gives you the possibility to request current, snapshot and inconsistent data. Do you think this is a valuable addition? Reasoning: → Mainly due to the performance bottleneck of a consistent system → Depends on the application: there exist applications where you do not mind reading something inconsistent → Current and snapshot reads still maintain consistency => Personally I think the value is limited: I cannot think of an application where latency is so critical that one would rather have inconsistent data than wait a bit longer
  • 31. Question 4: 4.4.3: For me it is not clear how the 'read-only' replicas receive their data, as they need to get consistent data. Do you have an idea? → Not mentioned in the paper. Idea: – Coordinator of the replica keeps track of the up-to-date EGs – A mechanism periodically takes a snapshot of these up-to-date EGs and copies it to the read-only replicas
  • 32. Question 5: Megastore is compared with Bigtable several times. What, in your view, are the biggest differences in implementation and in types of usage? Reasoning: – Built on top of Bigtable and based on different requirements: → consistency guarantees + wide-area communication – Bigtable used within one datacenter ↔ Megastore across multiple → increased availability (Paxos) but higher latency – Consistency guarantees from Megastore: I suspect lower performance and throughput than with Bigtable – Implementation: • Bigtable: master ensures replication ↔ Paxos (no master recovery) • Bigtable: one log for each tablet server ↔ 1 log per EG in each replica • Very different APIs: Megastore supports schemas + indexes – Note: Google App Engine moved from Bigtable to Megastore
  • 33. References: MacDonald A., Paxos by example, http://angusmacdonald.me/writing/paxos-by-example/, accessed 06-05-14; Google App Engine, Switch from Bigtable to MegaStore, http://googleappengine.blogspot.be/2009/09/migration-to-better-datastore.html, accessed 08-05-13
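
The write-transaction steps on the Transactions (3) slide (current read, application logic, commit, apply) can be sketched roughly as below. This is a toy, single-process illustration and not Megastore's API: `EntityGroup`, `commit` and `run_transaction` are hypothetical names, the Paxos round is replaced by a plain log-position check, and a dict stands in for Bigtable.

```python
from dataclasses import dataclass, field


@dataclass
class EntityGroup:
    """Toy entity group (hypothetical): a log plus the applied data."""
    log: list = field(default_factory=list)    # committed log entries, in order
    data: dict = field(default_factory=dict)   # stand-in for the Bigtable rows

    def current_read(self):
        # Step 1: log position of the last committed transaction.
        return len(self.log)

    def commit(self, read_position, mutations):
        # Step 3: stand-in for the Paxos round. Only one writer can win a
        # given log position, so a stale read_position means this writer lost.
        if read_position != len(self.log):
            return False
        self.log.append(dict(mutations))
        # Step 4: apply the mutations to the stored entities.
        self.data.update(mutations)
        return True


def run_transaction(eg, logic, max_attempts=3):
    """Retry the read / logic / commit cycle until the commit wins (assumed policy)."""
    for _ in range(max_attempts):
        position = eg.current_read()
        # Step 2: application logic reads a copy and gathers its writes.
        mutations = logic(dict(eg.data))
        if eg.commit(position, mutations):
            return True
    return False
```

Two writers that start from the same log position cannot both commit: the second `commit` call returns `False` and the writer has to redo its current read, which is one reason write throughput within a single EG stays low.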

Editor's Notes

  1. Catchup: if there is no known-committed value for a log position → initiate a no-op Paxos round → Paxos will converge on the accepted value or on the no-op
  2. Coordinators: out-of-band protocol to check whether a coordinator is offline. Loses majority of its locks: subsequent reads each ensure that the replica has a new lock → if it has a majority: can handle requests again
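
The coordinator behaviour described on the Coordinator availability slide and in note 2 (validation races resolved by log position, and a conservative reset when locks are lost) might look roughly like this. It is a sketch under assumptions: the `Coordinator` class and its method names are hypothetical, Chubby is reduced to a boolean flag, and letting the later message win a tie at the same log position is a choice made here, not something the slides specify.

```python
class Coordinator:
    """Toy coordinator (hypothetical): tracks which EGs are up to date locally."""

    def __init__(self):
        self.has_locks = True   # stand-in for holding a majority of Chubby locks
        self.positions = {}     # entity group -> highest log position seen
        self.valid = {}         # entity group -> is the local replica up to date?

    def _update(self, eg, log_position, is_valid):
        # Higher log positions win: a late message carrying an older
        # position is ignored. Ties go to the later message (assumption).
        if log_position >= self.positions.get(eg, -1):
            self.positions[eg] = log_position
            self.valid[eg] = is_valid

    def validate(self, eg, log_position):
        self._update(eg, log_position, True)

    def invalidate(self, eg, log_position):
        self._update(eg, log_position, False)

    def lose_locks(self):
        # Conservative default: consider every entity group out of date.
        self.has_locks = False
        self.valid.clear()

    def regain_locks(self):
        # Requests can be handled again, but EGs must be revalidated first.
        self.has_locks = True

    def is_up_to_date(self, eg):
        return self.has_locks and self.valid.get(eg, False)
```

A local read would consult `is_up_to_date` first and fall back to catchup from other replicas when it returns `False`.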