Noha mega store

MegaStore
Google Inc.

Jason Baker, Chris Bond, James C Corbett, JJ Furman,
Andrey Khorlin, James Larson, Jean-Michel Leon, Yawei Li,
Alexander Lloyd, Vadim Yushprakh. CIDR 2011.

Presented by: Noha Elprince
22 June, 2011

What is Megastore?
§  A storage system developed to meet the requirements of today’s online interactive services.

§  Megastore is the data engine supporting the Google App Engine (GAE) https://appengine.google.com/

§  GAE cloud computing technology: hosts and virtualizes web apps across multiple servers on Google’s platform.
Ø  Fast development and deployment.
Ø  Simple administration.
Ø  No need to worry about hardware, patches, backups, or scalability.

Outline
—  Motivation & Problem
—  Methodology
—  Design of Megastore
    —  Data Model
    —  Data Storage
    —  Transactions and Concurrency Control
—  How Megastore achieves Availability and Scalability
    —  PAXOS
    —  Megastore’s approach
—  Experience
—  Related Work
—  Conclusion

Megastore - Motivation
•  Storage requirements of today’s interactive online applications:
    —  Highly scalable
    —  Rapid development
    —  Low latency
    —  Durability and consistency
    —  Availability and fault tolerance

•  These requirements are in conflict!

CAP Theorem – Eric Brewer 2000
“In a distributed database system, you can only have at most two of the following three characteristics:
Ø  Consistency
Ø  Availability
Ø  Partition tolerance”

ACID = Atomicity, Consistency, Isolation, Durability.

Problem
§  Conflicts between available systems:
—  RDBMS
Rich set of features and an expressive language that helps development, but difficult to scale.
E.g. MySQL, PostgreSQL, MS SQL Server, Oracle RDB.

—  NoSQL datastores
Highly scalable, but with a limited API and loose consistency models.
E.g. Google’s Bigtable, Apache Hadoop’s HBase, Facebook’s Cassandra.

§  The reliability of a single datacenter can’t be guaranteed 100%.
[“Always expect the unexpected”, James Patterson]

Methodology
—  Megastore blends the scalability of NoSQL with the convenience of a traditional RDBMS.

—  High reliability is achieved by:
Ø  Keeping data in multiple datacenters.
Ø  Writing to a majority of datacenters synchronously.
Ø  Letting the infrastructure decide which datacenter to read from and write to.

Outline
✓ —  Motivation & Problem
✓ —  Methodology
—  Design of Megastore
    —  Data Model
    —  Data Storage
    —  Transactions and Concurrency Control
—  How Megastore achieves Availability and Scalability
    —  PAXOS
    —  Megastore’s approach

Design of Megastore : Data Model
—  The data model is declared in a schema.
—  Each schema has a set of tables: root tables or child tables.
—  Entity Group – consists of a root entity along with all of its child entities.

CREATE SCHEMA PhotoApp;

CREATE TABLE User {
  required int64 user_id;
  required string name;
} PRIMARY KEY(user_id),
  ENTITY GROUP ROOT;

CREATE TABLE Photo {
  required int64 user_id;
  required int32 photo_id;
  required int64 time;
  required string full_url;
  optional string thumbnail_url;
  repeated string tag;
} PRIMARY KEY(user_id, photo_id),
  IN TABLE User,
  ENTITY GROUP KEY(user_id) REFERENCES User;

Design of Megastore : Data Model
•  Hierarchical data is denormalized to eliminate join costs.
•  Joins are implemented at the application level.
•  Outer joins are performed with parallel queries over secondary indexes.
•  This provides an efficient stand-in for SQL-style joins.

Design of Megastore : Data Storage
How is it stored in Bigtable?

“A Bigtable is a compressed, high performance, and proprietary database system built on Google File System (GFS), Chubby Lock service and other Google programs.”

Example:
User {user_id:101, name:'John'}
Photo{user_id:101, photo_id:501, time:2009, full_url:'john-pic1',
      tag:'vacation', tag:'holiday', tag:'Paris'}
Photo{user_id:101, photo_id:502, time:2010, full_url:'john-pic2',
      tag:'office', tag:'friends', tag:'pub'}

User {user_id:102, name:'Mary'}
Photo{user_id:102, photo_id:600, time:2009, full_url:'mary-pic1',
      tag:'office', tag:'picnic', tag:'Paris'}
Photo{user_id:102, photo_id:601, time:2011, full_url:'mary-pic2',
      tag:'birthday', tag:'friends'}

Row Key  | User.name | Photo.time | Photo.tag                | Photo URL
101      | John      |            |                          |
101,501  |           | 2009       | Vacation, Holiday, Paris | …
101,502  |           | 2010       | Office, Friends, Pub     | …
102      | Mary      |            |                          |
102,600  |           | 2009       | Office, Picnic, Paris    | …
102,601  |           | 2011       | Birthday, Friends        | …
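
The row layout in the example above can be imitated with a short Python sketch. It is only an illustration: tuples stand in for Bigtable row keys, the helper names `user_row_key` and `photo_row_key` are made up for this example, and sorting a dictionary imitates Bigtable's key ordering. The point it shows is that prefixing each child key with the root key keeps an entity group in adjacent rows.

```python
# Hypothetical illustration: tuples stand in for Bigtable row keys.
users = [
    {"user_id": 101, "name": "John"},
    {"user_id": 102, "name": "Mary"},
]
photos = [
    {"user_id": 102, "photo_id": 601, "time": 2011},
    {"user_id": 101, "photo_id": 501, "time": 2009},
    {"user_id": 102, "photo_id": 600, "time": 2009},
    {"user_id": 101, "photo_id": 502, "time": 2010},
]


def user_row_key(user):
    # Root entity: the row key is just the primary key (user_id).
    return (user["user_id"],)


def photo_row_key(photo):
    # Child entity: the row key starts with the root's key, then photo_id.
    return (photo["user_id"], photo["photo_id"])


rows = {}
for u in users:
    rows[user_row_key(u)] = {"User.name": u["name"]}
for p in photos:
    rows[photo_row_key(p)] = {"Photo.time": p["time"]}

# Rows are kept sorted by key; sorting the tuples imitates that ordering,
# so each user row is followed directly by that user's photo rows.
sorted_keys = sorted(rows)
for key in sorted_keys:
    print(key, rows[key])
```

Because a shorter tuple sorts before any longer tuple that shares its prefix, each root row lands immediately before its children, whatever order the entities were inserted in.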

—  Indexing
—  Local Index – finds data within an entity group.
CREATE LOCAL INDEX PhotosByTime ON Photo(user_id, time);

—  Global Index – spans entity groups.
CREATE GLOBAL INDEX PhotosByTag ON Photo(tag) STORING (thumbnail_url);

—  The STORING clause
Ø  Copies the listed properties into the index entries, so they can be retrieved faster.

How is it stored in Bigtable?

PhotosByTime
Row Key
101,2009,101,501
101,2010,101,502
102,2009,102,600
102,2011,102,601

PhotosByTag
Row Key           | Thumbnail.Url
Birthday,102,601  | …
Friends,101,502   | …
Friends,102,601   | …
Holiday,101,501   | …
Office,101,502    | …
Office,102,600    | …
Paris,101,501     | …
Paris,102,600     | …
Pub,101,502       | …
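
As a rough illustration of how index rows like these could be derived, the Python sketch below builds entries for a local index on (user_id, time) and a global index on the repeated tag field. The function names and the in-memory lists are assumptions made for this example, not Megastore's API. Lowercasing the tags is also a choice made here so lookups are case-insensitive.

```python
photos = [
    {"user_id": 101, "photo_id": 501, "time": 2009,
     "tags": ["vacation", "holiday", "Paris"]},
    {"user_id": 101, "photo_id": 502, "time": 2010,
     "tags": ["office", "friends", "pub"]},
    {"user_id": 102, "photo_id": 600, "time": 2009,
     "tags": ["office", "picnic", "Paris"]},
    {"user_id": 102, "photo_id": 601, "time": 2011,
     "tags": ["birthday", "friends"]},
]


def local_index_entries(photos):
    # PhotosByTime: indexed columns (user_id, time) followed by the
    # photo's primary key (user_id, photo_id).
    return sorted(
        (p["user_id"], p["time"], p["user_id"], p["photo_id"]) for p in photos
    )


def global_index_entries(photos):
    # PhotosByTag: tag is a repeated field, so each tag value gets its
    # own index entry pointing back at the photo's primary key.
    entries = []
    for p in photos:
        for tag in p["tags"]:
            entries.append((tag.lower(), p["user_id"], p["photo_id"]))
    return sorted(entries)


def photos_with_tag(entries, tag):
    # Look up the primary keys of all photos carrying the given tag.
    return [(uid, pid) for t, uid, pid in entries if t == tag.lower()]


by_time = local_index_entries(photos)
by_tag = global_index_entries(photos)
print(photos_with_tag(by_tag, "Paris"))
```

The global lookup returns photos from both users, which is what "spans entity groups" means in practice; the local index entries all begin with a single user_id and therefore stay inside one entity group.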

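Outline
✓ —  Design of Megastore
    ✓ —  Data Model
    ✓ —  Data Storage
    —  Transactions and Concurrency Control
—  How Megastore achieves Availability and Scalability
    —  PAXOS
    —  Megastore’s approach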

Design of Megastore : Transactions and Concurrency Control
•  Each entity group acts as a mini-database that provides ACID semantics.

•  Transactions are managed using write-ahead logging (WAL).

•  Bigtable feature: the ability to store multiple values for the same row/column with different timestamps.

•  Cross-entity-group transactions are supported via two-phase commit (2PC).

•  Entities in an entity group use multiversion concurrency control (MVCC).

—  MVCC: multiversion concurrency control
Using timestamps, reads and writes do not block each other.

—  Read consistency
—  Current: waits until all previously committed writes are applied, then reads the last committed value.
—  Snapshot: doesn’t wait; reads the last committed values that are already applied.
—  Inconsistent: ignores the state of the log and reads the latest values directly (data may be stale).

—  Write consistency
—  Determines the next available log position.
—  Assigns the mutations in the write-ahead log (WAL) a timestamp higher than any previous one.
—  Employs Paxos to settle contention: one writer wins the right to write to a given entity group at that log position, and the others abort and retry their operations.
—  Uses optimistic concurrency control (OCC) for mutations (write operations): it assumes that transactions’ data will not conflict, so they proceed without locks.
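
The interplay of timestamped versions and optimistic writes can be sketched in a few lines of Python. This is a toy model under stated assumptions: the `EntityGroup` class and its methods are invented for illustration, and a plain comparison of log positions stands in for the Paxos round that picks the winning writer.

```python
class EntityGroup:
    """Toy in-memory entity group: a log of committed mutations plus
    timestamped cell versions (a stand-in for Bigtable versions)."""

    def __init__(self):
        self.log = []       # log[i] holds the mutations committed at position i
        self.versions = {}  # key -> list of (timestamp, value), oldest first

    def read_position(self):
        # A transaction notes the next free log position when it starts.
        return len(self.log)

    def snapshot_read(self, key, timestamp):
        # Return the newest value written at or before `timestamp`.
        result = None
        for ts, value in self.versions.get(key, []):
            if ts <= timestamp:
                result = value
        return result

    def commit(self, expected_position, mutations):
        # Optimistic check: only one writer may claim each log position.
        if expected_position != len(self.log):
            return False  # another writer won this position; caller retries
        timestamp = len(self.log) + 1  # higher than every earlier timestamp
        self.log.append(mutations)
        for key, value in mutations.items():
            self.versions.setdefault(key, []).append((timestamp, value))
        return True


group = EntityGroup()
pos_a = group.read_position()
pos_b = group.read_position()                    # two writers, same position
ok_a = group.commit(pos_a, {"name": "John"})     # first writer wins
ok_b = group.commit(pos_b, {"name": "Johnny"})   # second writer must retry
ok_retry = group.commit(group.read_position(), {"name": "Johnny"})
print(ok_a, ok_b, ok_retry)
```

No locks are taken at any point: the losing writer simply finds out at commit time that its log position is gone, and older versions remain readable by timestamp while new ones are written.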

q  Queues
§  Provide transactional messaging
between entity groups.
§  Each message either is :
Ø  Synchronous: has a single
sending and receiving entity group.
Ø  Asynchronous: has different
sending and receiving entity group.
Fig. Operations across entity groups

Ø  Useful to perform operations that affect many entity groups.

18

q  Two-Phase Commit (2PC)
§  Coordinator: the component that receives the commit/abort request
§  Participants: the resource managers that did work on behalf of
the transaction (by reading/updating resources).
* Goal: Ensure that the coordinator and all participants either
commit/abort the transaction => Atomicity is satisfied. Source: Ref[2]

Disadv. High latency
Adv.
Simplify code for unique secondary key enforcement.

19
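
A minimal Python sketch of the two-phase commit idea follows. The `Participant` class and the `two_phase_commit` function are hypothetical names for this illustration, and failure handling such as timeouts and coordinator recovery is deliberately left out.

```python
class Participant:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self):
        # Phase 1: vote yes only if the local work can be committed.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(participants):
    # Phase 1: the coordinator collects a vote from every participant.
    votes = [p.prepare() for p in participants]
    # Phase 2: commit only if every vote was yes; otherwise abort everyone.
    if all(votes):
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        p.abort()
    return "aborted"


good = [Participant("group-101"), Participant("group-102")]
mixed = [Participant("group-101"), Participant("group-102", can_commit=False)]
print(two_phase_commit(good), two_phase_commit(mixed))
```

Each transaction needs two message rounds to every participant before it finishes, which is where the latency cost of 2PC comes from.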

Other Features
—  Integrated Backup System
Ø  Used to restore an entity group’s state to any point in time.

—  Data Encryption
Ø  Uses a distinct key per entity group.

Outline
✓ —  Transactions and Concurrency Control
—  How Megastore achieves Availability and Scalability
    —  PAXOS
    —  Megastore’s approach

Megastore – Availability / Scalability
§  Megastore Replication System
•  Replication is done per entity group by synchronously replicating the group’s transaction log to a number of replicas.
•  Reads and writes can be initiated from any replica.
•  Writes require one round of inter-datacenter communication.
•  ACID semantics are preserved regardless of which replica a client starts from.

Fig. Scalable Replication

Megastore – Replication
—  PAXOS Algorithm
•  A way to reach consensus among a group of replicas on a single value.
•  Databases typically use PAXOS to replicate a transaction log, where a separate instance of PAXOS is used for each position in the log.

Source: Ref[3]

Adv. Tolerates delayed or reordered messages and replicas that fail by stopping (with N replicas, fewer than N/2 may fail).
Disadv. High latency, because it demands multiple rounds of communication, so Megastore uses an improved version.
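
The majority arithmetic behind this fault tolerance can be worked through in Python. The function names below are made up for this sketch. Note one simplification: `chosen_value` counts acceptances per value, whereas real Paxos counts them per numbered proposal.

```python
def majority(n_replicas):
    # Smallest group size that must overlap with any other majority.
    return n_replicas // 2 + 1


def tolerated_failures(n_replicas):
    # Replicas that may stop while a majority can still be assembled.
    return n_replicas - majority(n_replicas)


def chosen_value(accepted, n_replicas):
    # `accepted` maps replica name -> value it accepted for one log position.
    # A value counts as chosen once a majority of replicas accepted it.
    counts = {}
    for value in accepted.values():
        counts[value] = counts.get(value, 0) + 1
    for value, count in counts.items():
        if count >= majority(n_replicas):
            return value
    return None


for n in (3, 4, 5):
    print(n, majority(n), tolerated_failures(n))

print(chosen_value({"a": "x", "b": "x", "c": "y"}, 3))
```

With three replicas one may fail, and with five replicas two may fail. Going from three to four replicas adds no extra tolerance, which is why odd replica counts are the usual choice.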

•  Master-Based Approach
Ø  A master-slave model is generally used, where the master handles all replication of writes.
Ø  But the master becomes a bottleneck.

•  Megastore Replication System (PAXOS-modified)
§  Fast Reads
- Allow local reads from anywhere.
- A coordinator at each replica tracks the set of entity groups for which that replica has observed all PAXOS writes, and the replica serves local reads for those groups.

§  Fast Writes
- A specific replica is chosen as the leader.
- The leader decides the proposal number and sends it to the other writers.
- The first writer to submit a value to the leader wins the right to ask all replicas to accept that value.
- The next write’s leader is selected using the closest-replica heuristic (aim: minimize writer-leader latency, based on the observation that most apps submit writes from the same region repeatedly).
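
The fast-read rule described above can be sketched as follows. The tracker is modelled as a `Coordinator` class, the name the Megastore paper uses for this component. The `read` function, the `local_data` dictionary, and the `catch_up` callback are assumptions made for this illustration only.

```python
class Coordinator:
    """Hypothetical stand-in for a replica's tracker of entity groups
    whose writes the replica has fully observed."""

    def __init__(self):
        self.up_to_date = set()

    def validate(self, group):
        self.up_to_date.add(group)

    def invalidate(self, group):
        self.up_to_date.discard(group)

    def is_up_to_date(self, group):
        return group in self.up_to_date


def read(group, coordinator, local_data, catch_up):
    # Fast path: serve locally when the replica has seen every write.
    if coordinator.is_up_to_date(group):
        return local_data[group], "local"
    # Slow path: fetch what was missed, then mark the group as current.
    local_data[group] = catch_up(group)
    coordinator.validate(group)
    return local_data[group], "catchup"


coord = Coordinator()
data = {"user-101": "v1"}
coord.validate("user-101")
first = read("user-101", coord, data, lambda g: "v2")    # served locally
coord.invalidate("user-101")                             # replica missed a write
second = read("user-101", coord, data, lambda g: "v2")   # must catch up
third = read("user-101", coord, data, lambda g: "v3")    # local again
print(first, second, third)
```

Only the second read pays for communication with other replicas; the reads before and after it are answered from local state.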

Outline
✓ —  Transactions and Concurrency Control
✓ —  How Megastore achieves Availability and Scalability
    —  PAXOS
    —  Megastore’s approach

Experience
•  Real-world deployment
—  More than 100 production applications use Megastore (e.g. Google App Engine).
—  Most applications see extremely high availability.
—  Most users see average write latencies of 100 to 400 ms.

Related Work
—  NoSQL data storage systems
—  Bigtable, Cassandra, Yahoo PNUTS, Amazon SimpleDB

—  Data replication process
—  HBase, CouchDB, Dynamo, …
—  Extend replication scheme of traditional RDBMS
systems

—  Paxos algorithm
—  SCALARIS, Keyspace, …
—  Few have used Paxos to achieve synchronous replication

Conclusion
Megastore
Ø  A scalable, highly available datastore for interactive internet services.
Ø  Paxos is used for synchronous replication.
Ø  Uses Bigtable as the scalable datastore while adding richer primitives (ACID, indexes).
Ø  Has over 100 applications in production.

References
—  [1] Jason Baker et al. “Megastore: Providing Scalable, Highly Available Storage for Interactive Services.” CIDR 2011.
—  [2] Philip A. Bernstein and Eric Newcomer. “Principles of Transaction Processing.” Morgan Kaufmann, 2009.
—  [3] http://paprika.umw.edu/~ernie/cpsc321/10312006.html
—  [4] Google Megastore presentation at SIGMOD 2008. http://perspectives.mvdirona.com/2008/07/10/GoogleMegastore.aspx

Megastore Read Process
—  Each replica stores mutations and metadata for the log entries.
—  Read process
    —  1. Query Local: up-to-date check.
    —  2. Find Position: determine the highest log position and select a replica.
    —  3. Catchup: check the consensus value from other replicas.
    —  4. Validate: synchronize so that the replica is up to date.
    —  5. Query Data: read data with the timestamp.

Megastore Write Process
—  Each replica stores mutations and metadata for the log entries.
—  Write process
    —  1. Accept Leader: ask the leader to accept the value as proposal number zero.
    —  2. Prepare: run the Paxos Prepare phase at all replicas.
    —  3. Accept: ask the remaining replicas to accept the value.
    —  4. Invalidate: fault handling for replicas which did not accept the value.
    —  5. Apply: apply the value’s mutation at as many replicas as possible.
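
The five write steps can be traced with a small Python sketch. Everything in it is an assumption made for illustration: the `Replica` class, the `write` function, and the `leader_grants_fast_path` flag are invented names, and the Prepare phase is only recorded as a step rather than implemented. Skipping Prepare when the leader accepts the value follows the Megastore paper's description of the fast path.

```python
class Replica:
    def __init__(self, name, reachable=True):
        self.name = name
        self.reachable = reachable
        self.accepted = None
        self.applied = None
        self.coordinator_valid = True  # may this replica serve local reads?

    def accept(self, value):
        if self.reachable:
            self.accepted = value
        return self.reachable


def write(value, leader, replicas, leader_grants_fast_path=True):
    steps = ["accept_leader"]
    # 1. Accept leader: try the fast path through the leader.
    fast = leader_grants_fast_path and leader.accept(value)
    # 2. Prepare: only needed when the fast path was not granted.
    if not fast:
        steps.append("prepare")  # a real Paxos Prepare round is omitted here
    # 3. Accept: ask the remaining replicas to accept the value.
    steps.append("accept")
    acks = [r for r in replicas if r.accepted == value or r.accept(value)]
    if len(acks) <= len(replicas) // 2:
        return False, steps  # no majority; a caller would back off and retry
    # 4. Invalidate: replicas that missed the write stop serving local reads.
    steps.append("invalidate")
    for r in replicas:
        if r not in acks:
            r.coordinator_valid = False
    # 5. Apply: apply the mutation wherever the value was accepted.
    steps.append("apply")
    for r in acks:
        r.applied = value
    return True, steps


a, b, c = Replica("a"), Replica("b"), Replica("c", reachable=False)
ok, steps = write("photo-503", leader=a, replicas=[a, b, c])
print(ok, steps, c.coordinator_valid)
```

In this run the unreachable replica is invalidated but the write still succeeds, because two of the three replicas accepted the value.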
