This document provides an overview of PNUTS, Yahoo!'s hosted data serving platform. The key points are:
1. PNUTS is a massively parallel and geographically distributed database system that provides a consistency model between full serializability and eventual consistency, through per-record timeline consistency.
2. It uses asynchronous geographic replication to replicas around the world for high availability and low latency updates. Notifications allow subscribers to stream updates. Bulk loading facilitates rapid data ingestion.
3. The architecture involves record-level mastering, storage units, tablet controllers, routers, and a message broker to propagate updates between regions while ensuring consistency.
2. Outline
• What does Yahoo! need?
• Consistency Levels
• What is PNUTS?
• Features
• Contributions
• FUNCTIONALITY
• SYSTEM ARCHITECTURE
• Experiments
• Future Work
3. What does Yahoo! need?
• Scalability - Flickr and del.icio.us.
• Response time and geographic scope
• High availability and fault tolerance
• Relaxed consistency guarantees
Characteristics of Web traffic
• Simple query needs
• Manipulate one record at a time
• Relaxed consistency
4. Consistency Levels
• Eventual consistency
o Transactions:
• Alice changes status from “Sleeping” to “Awake”
• Alice changes location from “Home” to “Work”
Region 1: (Alice, Home, Sleeping) → (Alice, Home, Awake) → (Alice, Work, Awake)
Region 2: (Alice, Home, Sleeping) → (Alice, Work, Sleeping) → (Alice, Work, Awake)
Final state consistent in both regions.
“Invalid” state (Alice, Work, Sleeping) visible in Region 2, which applies “Work” before “Awake”.
5. Consistency Levels
• Timeline consistency
o Transactions:
• Alice changes status from “Sleeping” to “Awake”
• Alice changes location from “Home” to “Work”
Region 1: (Alice, Home, Sleeping) → (Alice, Home, Awake) → (Alice, Work, Awake)
Region 2: (Alice, Home, Sleeping) → (Alice, Work, Awake)
Both regions apply “Awake” and then “Work” in the same order, so no invalid state is visible.
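The contrast between the eventual consistency and timeline consistency examples can be sketched as a small simulation. This is an illustration only, not PNUTS code: the `apply_updates` helper and the record layout are assumptions made for the example, and each replica is shown applying updates one at a time even though a real replica may skip ahead along the timeline.

```python
# Illustrative simulation; the helper and record layout are assumptions.
INITIAL = {"name": "Alice", "location": "Home", "status": "Sleeping"}

SET_AWAKE = ("status", "Awake")   # first update accepted for the record
SET_WORK = ("location", "Work")   # second update accepted for the record


def apply_updates(updates):
    """Apply updates in the given order and return every visible state."""
    record = dict(INITIAL)
    states = [(record["location"], record["status"])]
    for field, value in updates:
        record[field] = value
        states.append((record["location"], record["status"]))
    return states


# Eventual consistency: replicas may apply the updates in different orders.
region1_eventual = apply_updates([SET_AWAKE, SET_WORK])
region2_eventual = apply_updates([SET_WORK, SET_AWAKE])

# Timeline consistency: every replica applies the updates in the same order.
region1_timeline = apply_updates([SET_AWAKE, SET_WORK])
region2_timeline = apply_updates([SET_AWAKE, SET_WORK])

print(region2_eventual)  # passes through ("Work", "Sleeping")
print(region2_timeline)  # never shows ("Work", "Sleeping")
```

Both models end in the same final state; only the eventual-consistency run exposes the intermediate (Work, Sleeping) state.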
6. What is PNUTS?
• PNUTS is a massively parallel and geographically distributed database system for Yahoo!’s web applications.
7. Features
1. Data model and features (scatter-gather, asynchronous notification, bulk loading)
2. Fault tolerance
3. Pub-sub message system protocol (for geographically distant replicas)
4. Asynchronous writes to multiple copies around the world
8. Features (continued)
5. Record-level mastering
6. Flexible access: hashed or ordered tables, indexes, views; flexible schemas
7. Centrally managed
8. Delivery of data management as a hosted service
9. Contributions
1. An architecture based on record-level, asynchronous geographic replication
2. A consistency model
3. A careful choice of features to include or exclude
4. Delivery of data management as a hosted service
10. FUNCTIONALITY
• Data and Query Model
• Consistency Model: Hiding the Complexity of Replication
• Notification
• Bulk Load
11. Data and Query Model
Data representation
Table of records with attributes
Additional data types: Blob
- Flexible Schemas
- Point Access Vs Range Access
- Hash tables Vs Ordered tables
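The trade-off between point access and range access can be sketched with two toy tables. This is a minimal sketch, not the PNUTS storage layer: the class names `HashTable` and `OrderedTable` and their methods are assumptions for illustration. Records are plain dictionaries, so different records may carry different attributes (flexible schema).

```python
import bisect


class HashTable:
    """Point access only: look up a record by its exact primary key."""

    def __init__(self):
        self._rows = {}

    def set(self, key, record):
        self._rows[key] = record

    def get(self, key):
        return self._rows.get(key)


class OrderedTable:
    """Point access plus range scans over keys kept in sorted order."""

    def __init__(self):
        self._keys = []   # primary keys, kept sorted
        self._rows = {}

    def set(self, key, record):
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows[key] = record

    def get(self, key):
        return self._rows.get(key)

    def scan(self, start, end):
        """Return (key, record) pairs with start <= key < end, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        return [(k, self._rows[k]) for k in self._keys[lo:hi]]


listings = OrderedTable()
listings.set("cherry", {"price": 3})
listings.set("apple", {"price": 1})
listings.set("banana", {"price": 2, "color": "yellow"})  # extra attribute
print(listings.scan("b", "c"))
```

A hash table spreads keys evenly but cannot answer the `scan` call efficiently, which is why range queries call for an ordered table.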
12. Consistency Model
PNUTS provides a consistency model that is between the two extremes of general
serializability and eventual consistency.
Web applications typically manipulate one record at a time, while different records
may have activity with different geographic locality.
- PNUTS provides per-record timeline consistency: all replicas of a given record apply all
updates to the record in the same order.
14. Consistency Model
API calls
Read-any: Returns a possibly stale version of the record.
[Figure: timeline of record versions v. 1 to v. 8 (Generation 1), marking stale versions and the current version; read-any may return a stale version.]
15. Consistency Model
Read latest: Returns the latest copy of the record that reflects all writes that have succeeded.
[Figure: timeline of record versions v. 1 to v. 8 (Generation 1), marking stale versions and the current version; read-latest returns the current version.]
16. Consistency Model
Read-critical(required version): Returns a version of the record that is the same as, or strictly newer than, the required version.
[Figure: timeline of record versions v. 1 to v. 8 (Generation 1), marking stale versions and the current version; a read-critical call with required version v. 6 (Read ≥ v.6) returns v. 6 or a newer version.]
17. Consistency Model
Test-and-set-write(required version): Performs the requested write to the record if and only if the present version of the record is the same as the required version.
[Figure: timeline of record versions v. 1 to v. 8 (Generation 1), marking stale versions and the current version; a write conditioned on v. 7 (Write if = v.7) returns ERROR because v. 7 is no longer the current version.]
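The four API calls (read-any, read-latest, read-critical, test-and-set-write) can be sketched on a single record with one master copy and lagging replicas. This is a minimal sketch under stated assumptions, not the PNUTS client API: the class `ReplicatedRecord`, the exception `VersionConflict`, and the explicit `propagate` step that stands in for asynchronous replication are all hypothetical.

```python
class VersionConflict(Exception):
    """Raised when a test-and-set write sees an unexpected version."""


class ReplicatedRecord:
    """One record: a master copy plus replicas that may lag behind it."""

    def __init__(self, value, n_replicas=2):
        self.master = (1, value)                 # (version, value)
        self.replicas = [(1, value)] * n_replicas

    def write(self, value):
        version = self.master[0] + 1
        self.master = (version, value)
        return version

    def propagate(self, replica_index):
        """Stand-in for asynchronous replication reaching one replica."""
        self.replicas[replica_index] = self.master

    def read_any(self, replica_index=0):
        return self.replicas[replica_index]      # possibly stale

    def read_latest(self):
        return self.master                       # reflects all successful writes

    def read_critical(self, required_version):
        for copy in self.replicas:
            if copy[0] >= required_version:
                return copy                      # a replica is fresh enough
        return self.master                       # otherwise fall back to master

    def test_and_set_write(self, required_version, value):
        if self.master[0] != required_version:
            raise VersionConflict(
                f"expected v.{required_version}, found v.{self.master[0]}")
        return self.write(value)


record = ReplicatedRecord("v.1")
for n in range(2, 7):
    record.write(f"v.{n}")
record.propagate(0)      # replica 0 catches up to v.6
record.write("v.7")
record.write("v.8")      # master is now at v.8

print(record.read_any(0))        # stale copy held by replica 0
print(record.read_latest())      # current version from the master
print(record.read_critical(6))   # any copy at v.6 or newer is acceptable
try:
    record.test_and_set_write(7, "v.9")
except VersionConflict as err:
    print("ERROR:", err)
```

The weaker calls can be served by a nearby replica, while read-latest and test-and-set-write depend on the master copy and so may pay a cross-region round trip.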
18. Notification
- Trigger-like notifications are important for applications
such as ad serving, which must invalidate cached copies of
ads when the advertising contract expires.
- PNUTS allows users to subscribe to the stream of updates on a
table.
19. Bulk Load
Necessary for applications such as comparison shopping,
which upload large blocks of new sale listings into the
database every day. Bulk inserts can be done in parallel to
multiple storage units for fast loading.
23. Data Storage and Retrieval
- Storage unit
Store tablets
Respond to get(), scan() and set() requests.
- Tablet controller
Owns the mapping of tablets to storage units
Routers poll periodically to get mapping updates.
Performs load-balancing and recovery
- Router
Determines which tablet contains the record
Determines which storage unit has the tablet
Interval mapping: binary search over a sorted mapping, similar to a large B+ tree root node.
Tablet controller does not become a bottleneck, because it is not on the data path.
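The router's two lookups (key to tablet, tablet to storage unit) can be sketched with a binary search over sorted interval boundaries. This is a minimal sketch for an ordered table; the `Router` class, its field names, and the sample boundaries are assumptions for illustration, and the mapping refresh from the tablet controller is not modeled.

```python
import bisect


class Router:
    """Routes a primary key to a tablet, then to a storage unit."""

    def __init__(self, lower_bounds, tablets, tablet_to_unit):
        # lower_bounds[i] is the inclusive lower bound of tablets[i];
        # the list is sorted and starts with "" so every key falls somewhere.
        self.lower_bounds = lower_bounds
        self.tablets = tablets
        self.tablet_to_unit = tablet_to_unit

    def tablet_for(self, key):
        # Binary search: the last interval whose lower bound is <= key.
        index = bisect.bisect_right(self.lower_bounds, key) - 1
        return self.tablets[index]

    def storage_unit_for(self, key):
        return self.tablet_to_unit[self.tablet_for(key)]


router = Router(
    lower_bounds=["", "g", "p"],
    tablets=["tablet-1", "tablet-2", "tablet-3"],
    tablet_to_unit={"tablet-1": "su-a", "tablet-2": "su-b", "tablet-3": "su-a"},
)

print(router.tablet_for("apple"))        # tablet-1
print(router.tablet_for("grape"))        # tablet-2
print(router.storage_unit_for("zebra"))  # su-a
```

Because the router holds only a cached copy of the mapping, it can be rebuilt at any time from the tablet controller's authoritative copy.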
24. Replication and Consistency
Asynchronous replication to ensure low latency updates. We use
the Yahoo! message broker, a publish/subscribe system
developed at Yahoo!
25. Yahoo! Message Broker
- Topic based Publish/subscribe system
- Used for logging and replication
- PNUTS + YMB = Sherpa data services platform
- Data updates are considered committed once published to YMB.
- Updates are asynchronously propagated to different regions (post-publishing).
- Messages are purged after they have been applied to all replicas.
- Per-record mastership mechanism.
26. Consistency via YMB and mastership
- Mastership is assigned on a record-by-record basis.
- All update requests for a record are directed to its master.
- Different records in same table can be mastered in different
clusters.
- Basis: Write requests locality
- Record stores its master as metadata.
- Tablet master: receives all inserts into a tablet, to enforce primary key constraints
- Multiple values for primary keys.
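The interplay of per-record mastership and the message broker can be sketched as follows. This is a simplified model under stated assumptions, not the YMB protocol: the `Broker` and `Region` classes, the `create_record` helper, and the explicit `deliver_all` step that stands in for asynchronous propagation are all hypothetical.

```python
from collections import deque


class Broker:
    """Stand-in for a pub/sub broker: a FIFO queue of committed updates."""

    def __init__(self):
        self.pending = deque()

    def publish(self, update):
        self.pending.append(update)   # the update counts as committed here

    def deliver_all(self, regions):
        """Apply queued updates to every region, then drop (purge) them."""
        while self.pending:
            key, value, version = self.pending.popleft()
            for region in regions.values():
                region.apply(key, value, version)


class Region:
    def __init__(self, name, broker, regions):
        self.name = name
        self.broker = broker
        self.regions = regions
        self.store = {}               # key -> {"value", "version", "master"}
        regions[name] = self

    def apply(self, key, value, version):
        record = self.store[key]
        if version > record["version"]:   # ignore updates already applied
            record["value"] = value
            record["version"] = version

    def write(self, key, value):
        master = self.store[key]["master"]    # mastership is record metadata
        if master != self.name:
            return self.regions[master].write(key, value)   # forward to master
        version = self.store[key]["version"] + 1
        self.broker.publish((key, value, version))
        self.apply(key, value, version)
        return version


def create_record(regions, key, value, master):
    for region in regions.values():
        region.store[key] = {"value": value, "version": 1, "master": master}


broker = Broker()
regions = {}
west = Region("west", broker, regions)
east = Region("east", broker, regions)
create_record(regions, "alice", "Sleeping", master="west")

new_version = east.write("alice", "Awake")   # forwarded to the master in west
stale_value = east.store["alice"]["value"]   # east has not applied it yet
broker.deliver_all(regions)                  # asynchronous propagation
fresh_value = east.store["alice"]["value"]
print(new_version, stale_value, fresh_value)
```

Since a single master orders all updates to a record, every replica sees the same version sequence, which is what gives per-record timeline consistency.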
27. Recovery
- Any committed update is recoverable from a remote replica.
Three step recovery
1- Tablet controller requests copy from remote (source) replica.
2- “Checkpoint message” published to YMB, for in-flight updates.
3- Source tablet is copied to destination region.
Support for recovery
Synchronized tablet boundaries
Tablet splits at the same time (two-phase commit)
Backup regions: back-up replicas kept in the same or a nearby datacenter, to avoid copying tablets from a distant region.
28. Query Processing
Scatter-gather engine
- Receives a multi-record request
- Splits it into multiple individual requests for single
records/tablet scans
- Initiates requests in parallel.
- Gathers the results and passes them to the client.
Why a server-side design?
- Avoids many parallel connections from each client.
- Allows server-side optimization (grouping requests to the same storage unit)
Range scan
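The scatter-gather steps for a multi-record get can be sketched as follows. This is a minimal sketch, not the PNUTS engine: the `scatter_gather` function, the `locate` and `fetch_batch` callbacks, and the toy `STORAGE` layout are assumptions for illustration. Requests bound for the same storage unit are grouped into one batch, in line with the server-side optimization listed above.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def scatter_gather(keys, locate, fetch_batch, max_workers=4):
    """Fetch many records in parallel and return them in request order.

    locate(key) returns the storage unit holding the key;
    fetch_batch(unit, keys) returns a dict of key -> record for that unit.
    """
    # Scatter: group keys by storage unit so each unit gets one request.
    by_unit = defaultdict(list)
    for key in keys:
        by_unit[locate(key)].append(key)

    # Issue the per-unit requests in parallel.
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(fetch_batch, unit, unit_keys)
                   for unit, unit_keys in by_unit.items()]
        # Gather: merge the partial results.
        for future in futures:
            results.update(future.result())

    return [results.get(key) for key in keys]


# Toy storage units standing in for real servers.
STORAGE = {"su1": {"a": 1, "b": 2}, "su2": {"x": 24}}


def locate(key):
    return "su1" if key < "m" else "su2"


def fetch_batch(unit, unit_keys):
    return {key: STORAGE[unit].get(key) for key in unit_keys}


print(scatter_gather(["a", "x", "b"], locate, fetch_batch))
```

The client makes one call and holds one connection; the fan-out happens inside the service.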
29. Notifications
- Service to notify external systems of updates to data.
Example: popular keyword search engine index.
- Clients subscribe to all topics (tablets) for a table
- Clients need no knowledge of tablet organization.
- Creation of new topic (tablet split) - automatic subscription
- Break subscription of slow notification clients.
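Table-level subscription with automatic handling of tablet splits can be sketched as follows. This is a minimal sketch, not the PNUTS notification service: the `NotificationService` class and its methods are assumptions for illustration, and the handling of slow clients is not modeled.

```python
class NotificationService:
    """Clients subscribe per table; tablets (topics) are tracked internally."""

    def __init__(self):
        self.topics = {}      # table name -> set of tablet topics
        self.callbacks = {}   # table name -> list of subscriber callbacks

    def create_table(self, table, tablets):
        self.topics[table] = set(tablets)
        self.callbacks[table] = []

    def subscribe(self, table, callback):
        # The client names only the table, never individual tablets.
        self.callbacks[table].append(callback)

    def split_tablet(self, table, old, new_tablets):
        # New topics are covered automatically by existing subscriptions.
        self.topics[table].remove(old)
        self.topics[table].update(new_tablets)

    def publish(self, table, tablet, update):
        if tablet not in self.topics[table]:
            raise KeyError(f"unknown tablet {tablet!r} for table {table!r}")
        for callback in self.callbacks[table]:
            callback(tablet, update)


service = NotificationService()
service.create_table("listings", ["tablet-1"])

seen = []
service.subscribe("listings", lambda tablet, update: seen.append((tablet, update)))

service.publish("listings", "tablet-1", "insert apple")
service.split_tablet("listings", "tablet-1", ["tablet-1a", "tablet-1b"])
service.publish("listings", "tablet-1b", "update apple")
print(seen)
```

The subscriber keeps receiving updates after the split without re-subscribing.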
32. Experiments
Inserting Data
■ One region (West 1) is the tablet master
■ Hash: 99 clients (33 per region); Ordered (MySQL): 60 clients
■ 1 million records, 1/3 per region
■ Result:
– Hash: West 1: 75.6 ms; West 2: 131.5 ms; East: 315.5 ms
– Ordered: West 1: 33 ms; West 2: 105.8 ms; East: 324.5 ms
■ Lesson: the MySQL-backed ordered table is faster than hash, although more vulnerable
to contention
■ More observations
35. Experiments
Varying Number of Storage Units
■ Storage units per region vary from 2 to 5
■ 10% writes, 1,200 requests/second
36. Experiments
Varying Size of Range Scans
■ Range scans covering 0.01% to 0.1% of the table size
■ Ordered table only
■ 30 clients vs. 300 clients
37. Bottlenecks
• Disk seek capacity on storage units
• Message Brokers
Different PNUTS customers are assigned different clusters
of storage units and message broker machines. Can share
routers and tablet controllers.
38. Future Work
• Consistency
– Referential integrity
– Bundled update
– Relaxed consistency
• Data Storage and Retrieval
– Fair sharing of storage units and message brokers
• Query Processing
– Query optimization: Maintain statistics
– Expansion of query language: join/aggregation
– Batch-query processing
• Indexes and Materialized views