PNUTS: Yahoo!’s Hosted Data Serving Platform
Outline:
• What does Yahoo! need?
• Consistency Levels
• What is PNUTS?
• Features
• Contributions
• FUNCTIONALITY
• SYSTEM ARCHITECTURE
• Experiments
• Future Work
What does Yahoo! need?
• Scalability - Flickr and del.icio.us.
• Response time and geographic scope
• High availability and fault tolerance
• Relaxed consistency guarantees
Characteristics of Web traffic
• Simple query needs
• Manipulate one record at a time
• Relaxed consistency
Consistency Levels
• Eventual consistency
o Transactions:
• Alice changes status from “Sleeping” to “Awake”
• Alice changes location from “Home” to “Work”
o Region 1: (Alice, Home, Sleeping) → (Alice, Home, Awake) → (Alice, Work, Awake)
o Region 2: (Alice, Home, Sleeping) → (Alice, Work, Sleeping) → (Alice, Work, Awake)
o Final state consistent, but an “invalid” state, (Alice, Work, Sleeping), is visible in Region 2
Consistency Levels
• Timeline consistency
o Transactions:
• Alice changes status from “Sleeping” to “Awake”
• Alice changes location from “Home” to “Work”
o Region 1: (Alice, Home, Sleeping) → (Alice, Home, Awake) → (Alice, Work, Awake)
o Region 2: (Alice, Home, Sleeping) → (Alice, Work, Awake)
o Both regions apply the updates in the same order (Awake, then Work)
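The difference between the two levels can be sketched in a few lines of Python. This is an illustrative simulation only, not PNUTS code; the `Record` class and `apply_in_order` helper are hypothetical names introduced for the example.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Record:
    name: str
    location: str
    status: str


def apply_in_order(initial, updates):
    """Apply field updates one by one; return every state a reader could see."""
    states = [initial]
    for field_name, value in updates:
        states.append(replace(states[-1], **{field_name: value}))
    return states


initial = Record("Alice", "Home", "Sleeping")
awake = ("status", "Awake")
work = ("location", "Work")

# Eventual consistency: each region may apply the two updates in its own order.
region1 = apply_in_order(initial, [awake, work])
region2 = apply_in_order(initial, [work, awake])

# Timeline consistency: every replica follows one order (Awake, then Work).
# A lagging replica may skip ahead, but it never leaves the timeline.
timeline = apply_in_order(initial, [awake, work])
region2_timeline = [timeline[0], timeline[2]]

invalid = Record("Alice", "Work", "Sleeping")
print(invalid in region2)           # visible under eventual consistency
print(invalid in region2_timeline)  # never visible under timeline consistency
```

Both orderings converge to (Alice, Work, Awake); only the intermediate states differ.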
What is PNUTS?
• PNUTS, a massively parallel and geographically distributed database system for Yahoo!’s web applications.
Features
1. Data model and features (scatter-gather, asynchronous notification, bulk loading)
2. Fault tolerance
3. Pub-sub message system protocol (for geographically distant replicas)
4. Asynchronous writes to multiple copies around the world
5. Record-level mastering
6. Flexible access: hashed or ordered, indexes, views; flexible schemas
7. Centrally managed
8. Delivery of data management as a hosted service
Contributions
1. An architecture based on record-level, asynchronous geographic replication
2. A consistency model
3. A careful choice of features to include or exclude
4. Delivery of data management as a hosted service
FUNCTIONALITY
• Data and Query Model
• Consistency Model: Hiding the Complexity of Replication
• Notification
• Bulk Load
Data and Query Model
Data representation
Table of records with attributes
Additional data types: Blob
- Flexible Schemas
- Point Access Vs Range Access
- Hash tables Vs Ordered tables
Consistency Model
PNUTS provides a consistency model that lies between the two extremes of general serializability and eventual consistency.
Web applications typically manipulate one record at a time, while different records may have activity with different geographic locality.
- PNUTS provides per-record timeline consistency: all replicas of a given record apply all updates to the record in the same order.
Consistency Model
Per-record Timeline Consistency
Consistency Model
API calls
Read-any: Returns a possibly stale version of the record.
[Figure: timeline of record versions v. 1 to v. 8 (Generation 1); v. 8 is the current version and earlier versions are stale. Read-any may return a stale version.]
Consistency Model
Read-latest: Returns the latest copy of the record that reflects all writes that have succeeded.
[Figure: timeline of record versions v. 1 to v. 8 (Generation 1); Read-latest returns the current version, v. 8.]
Consistency Model
Read-critical(required version): Returns a version of the record that is the same as, or strictly newer than, the required version.
[Figure: timeline of record versions v. 1 to v. 8 (Generation 1); Read ≥ v. 6 may return v. 6 or any newer version.]
Consistency Model
Test-and-set-write(required version): Performs the requested write to the record if and only if the present version of the record is the same as the required version.
[Figure: timeline of record versions v. 1 to v. 8 (Generation 1); “Write if = v. 7” returns ERROR because the current version is v. 8.]
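The four calls can be illustrated with a small in-memory model. This is a minimal sketch under stated assumptions, not the PNUTS client API: the class `ReplicatedRecord`, the exception `VersionMismatch`, and the `replicate` helper are hypothetical, and a single list stands in for the master copy while `replica_version` marks how far a local replica has caught up.

```python
class VersionMismatch(Exception):
    """Raised when test-and-set-write sees an unexpected current version."""


class ReplicatedRecord:
    def __init__(self, value):
        self.history = [value]    # history[i] holds version i + 1 (master copy)
        self.replica_version = 1  # newest version the local replica has applied

    @property
    def current_version(self):
        return len(self.history)

    def _read(self, version):
        return version, self.history[version - 1]

    def write(self, value):
        self.history.append(value)
        return self.current_version

    def replicate(self, up_to):
        # Asynchronous replication: the replica only moves forward in time.
        self.replica_version = max(self.replica_version,
                                   min(up_to, self.current_version))

    def read_any(self):
        # Cheapest read: whatever the local replica has, possibly stale.
        return self._read(self.replica_version)

    def read_latest(self):
        # Must consult the master copy.
        return self._read(self.current_version)

    def read_critical(self, required_version):
        if required_version > self.current_version:
            raise ValueError("required version does not exist yet")
        if self.replica_version >= required_version:
            return self._read(self.replica_version)
        return self._read(self.current_version)

    def test_and_set_write(self, required_version, value):
        if self.current_version != required_version:
            raise VersionMismatch(
                f"expected v.{required_version}, found v.{self.current_version}")
        return self.write(value)


record = ReplicatedRecord("v1")
for n in range(2, 9):          # the master now holds v.1 to v.8
    record.write(f"v{n}")
record.replicate(up_to=5)      # the local replica lags at v.5
```

With this state, `read_any()` returns the stale v.5, `read_critical(6)` falls back to the master copy, and `test_and_set_write(7, ...)` raises `VersionMismatch`, matching the “Write if = v.7” error case on the slide.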
Notification
- Trigger-like notifications are important for applications
such as ad serving, which must invalidate cached copies of
ads when the advertising contract expires.
- Allow the user to subscribe to the stream of updates on a
table.
Bulk Load
Necessary for applications such as comparison shopping,
which upload large blocks of new sale listings into the
database every day. Bulk inserts can be done in parallel to
multiple storage units for fast loading.
SYSTEM ARCHITECTURE
Data Storage and Retrieval
- Storage unit
Stores tablets
Responds to get(), scan() and set() requests.
- Tablet controller
Owns the mapping
Routers poll periodically to get mapping updates.
Performs load balancing and recovery
- Router
Determines which tablet contains the record
Determines which storage unit has the tablet
Interval mapping: binary search of B+ tree.
The tablet controller does not become a bottleneck.
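The router's lookup can be sketched as a binary search over sorted tablet boundaries. This is a hedged illustration: the `Router` class, the `hash_key` helper, the boundary values, and the storage unit names are all assumptions made for the example, and Python's `bisect` over a sorted list stands in for whatever structure the real router uses.

```python
import bisect
import hashlib


class Router:
    def __init__(self, lower_bounds, storage_units):
        # lower_bounds[i] is the inclusive lower bound of tablet i (sorted).
        if lower_bounds != sorted(lower_bounds):
            raise ValueError("tablet boundaries must be sorted")
        self.lower_bounds = lower_bounds
        self.storage_units = storage_units  # storage_units[i] holds tablet i

    def locate(self, key):
        # Binary search: the last tablet whose lower bound is <= key.
        tablet = bisect.bisect_right(self.lower_bounds, key) - 1
        if tablet < 0:
            raise KeyError(key)
        return tablet, self.storage_units[tablet]


def hash_key(key, bits=16):
    # For hash tables, the key is first mapped into the space [0, 2**bits).
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (1 << bits)


# Ordered table: intervals over the primary key itself.
ordered = Router(["", "banana", "grape", "peach"], ["su1", "su2", "su1", "su3"])
# Hash table: intervals over the hashed key space.
hashed = Router([0, 16384, 32768, 49152], ["su1", "su2", "su3", "su1"])

print(ordered.locate("kiwi"))
print(hashed.locate(hash_key("alice")))
```

Because the mapping is a small sorted array cached in the router, lookups stay off the tablet controller, which is only polled for mapping updates.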
Replication and Consistency
Asynchronous replication ensures low-latency updates. PNUTS uses the Yahoo! Message Broker (YMB), a publish/subscribe system developed at Yahoo!
Yahoo! Message Broker
- Topic-based publish/subscribe system
- Used for logging and replication
- PNUTS + YMB = Sherpa data services platform
- A data update is considered committed once it is published to YMB.
- After publishing, updates are propagated asynchronously to the other regions.
- A message is purged after it has been applied to all replicas.
- Per-record mastership mechanism.
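The commit, propagate, and purge behaviour listed above can be sketched with a toy broker. This is an assumption-laden illustration, not YMB: the `MessageBroker` class and its methods are hypothetical, replicas are plain dictionaries, and delivery is triggered by hand instead of running asynchronously.

```python
class MessageBroker:
    def __init__(self, regions):
        self.pending = []   # messages not yet applied in every region
        self.base = 0       # sequence number of pending[0]
        self.next_seq = {region: 0 for region in regions}

    def publish(self, key, value):
        # Once this returns, the update counts as committed.
        self.pending.append((key, value))
        return self.base + len(self.pending) - 1

    def deliver(self, region, replica):
        # Apply, in publish order, every message this region has not seen.
        start = self.next_seq[region] - self.base
        for key, value in self.pending[start:]:
            replica[key] = value
        self.next_seq[region] = self.base + len(self.pending)
        self._purge()

    def _purge(self):
        # A message is dropped only after all regions have applied it.
        applied_everywhere = min(self.next_seq.values())
        del self.pending[:applied_everywhere - self.base]
        self.base = applied_everywhere


broker = MessageBroker(["west", "east"])
west, east = {}, {}

broker.publish("alice", "Awake")   # committed here
broker.deliver("west", west)       # west applies it; east still lags
pending_after_west = len(broker.pending)
broker.deliver("east", east)       # east catches up; the message is purged
```

After the second delivery both replicas agree and the broker holds no pending messages.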
Consistency via YMB and mastership
- Mastership is assigned on a record-by-record basis.
- All update requests are directed to the record's master.
- Different records in the same table can be mastered in different clusters.
- Basis: locality of write requests
- Each record stores the identity of its master as metadata.
- A tablet master enforces primary key constraints, so that one primary key cannot receive multiple values from concurrent inserts.
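Record-level mastership can be sketched as follows. This is an illustrative model only: the `Cluster` class, its method names, and the dictionary layout are hypothetical, and propagation of the write to non-master regions (which would go through the message broker) is deliberately left out.

```python
class Cluster:
    def __init__(self, region_names):
        # One local store per region: key -> record with master metadata.
        self.regions = {name: {} for name in region_names}

    def insert(self, key, value, master):
        for store in self.regions.values():
            store[key] = {"value": value, "master": master, "version": 1}

    def write(self, key, value, origin):
        # The origin region reads the master's identity from its local copy.
        master = self.regions[origin][key]["master"]
        record = self.regions[master][key]
        record["value"] = value
        record["version"] += 1
        forwarded = master != origin
        return master, forwarded


cluster = Cluster(["west", "east"])
cluster.insert("alice", "Sleeping", master="west")
cluster.insert("bob", "Sleeping", master="east")

# A write arriving in "east" for a record mastered in "west" is forwarded.
alice_result = cluster.write("alice", "Awake", origin="east")
# A write for a locally mastered record is applied in place.
bob_result = cluster.write("bob", "Awake", origin="east")
```

The east copy of Alice's record stays at its old value until replication delivers the update, which is the window in which a read-any call returns stale data.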
Recovery
- Any committed update is recoverable from a remote replica.
Three-step recovery
1- Tablet controller requests a copy from a remote (source) replica.
2- A “checkpoint message” is published to YMB, so that in-flight updates are applied to the source.
3- The source tablet is copied to the destination region.
Support for recovery
Synchronized tablet boundaries
Tablets split at the same time in all regions (two-phase commit)
Backup regions within the same region.
Query Processing
Scatter-gather engine
- Receives a multi-record request
- Splits it into multiple individual requests for single records/tablet scans
- Initiates the requests in parallel.
- Gathers the results and passes them to the client.
Why a server-side design?
- Avoids multiple parallel requests from the client.
- Server-side optimization (group requests to the same storage unit)
Range scan
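The scatter-gather steps above can be sketched with a thread pool. This is a minimal sketch under assumptions: `STORAGE_UNITS`, `locate`, `fetch`, and `multi_get` are hypothetical names, the routing rule is invented for the example, and in-memory dictionaries stand in for storage units.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical storage units, each holding part of one table.
STORAGE_UNITS = {
    "su1": {"a": 1, "b": 2},
    "su2": {"m": 3},
    "su3": {"z": 4},
}


def locate(key):
    # Assumed routing rule for the example only.
    if key < "h":
        return "su1"
    if key < "t":
        return "su2"
    return "su3"


def fetch(unit, keys):
    # One request per storage unit, covering all keys that live there.
    store = STORAGE_UNITS[unit]
    return {key: store[key] for key in keys if key in store}


def multi_get(keys):
    # Scatter: group keys by storage unit so each unit gets one request.
    by_unit = defaultdict(list)
    for key in keys:
        by_unit[locate(key)].append(key)

    # Issue the per-unit requests in parallel, then gather the results.
    results = {}
    with ThreadPoolExecutor() as pool:
        for partial in pool.map(lambda item: fetch(*item), by_unit.items()):
            results.update(partial)
    return results


print(multi_get(["a", "z", "m", "q"]))
```

Grouping by storage unit is the server-side optimization named on the slide: the client sends one request, and each unit is contacted once.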
Notifications
- Service to notify external systems of updates to data.
Example: popular keyword search engine index.
- Clients subscribe to all topics (tablets) for a table
- Clients need no knowledge of tablet organization.
- Creation of a new topic (tablet split): automatic subscription
- Subscriptions of slow notification clients are broken.
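Table-level subscription with automatic re-subscription on a tablet split can be sketched like this. The `NotificationService` class and its methods are hypothetical names for the illustration; dropping slow clients is not modelled.

```python
class NotificationService:
    def __init__(self, tablets):
        self.topics = {tablet: set() for tablet in tablets}  # one topic per tablet
        self.callbacks = {}                                  # client -> callback

    def subscribe(self, client, callback):
        # The client asks for the whole table; topics are handled for it.
        self.callbacks[client] = callback
        for subscribers in self.topics.values():
            subscribers.add(client)

    def split(self, tablet, left, right):
        # New topics inherit the subscribers of the tablet that was split.
        subscribers = self.topics.pop(tablet)
        self.topics[left] = set(subscribers)
        self.topics[right] = set(subscribers)

    def publish(self, tablet, update):
        for client in self.topics[tablet]:
            self.callbacks[client](tablet, update)


seen = []
service = NotificationService(["t1"])
service.subscribe("indexer", lambda tablet, update: seen.append((tablet, update)))

service.publish("t1", "x")
service.split("t1", "t1a", "t1b")   # the client is not involved in the split
service.publish("t1b", "y")
```

The subscriber keeps receiving updates after the split without ever learning the tablet names.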
Experiments
Experimental setup
Metric: latency
Being compared: hash and ordered tables
Clusters: three-region PNUTS cluster
2 to the west, 1 to the east
Parameters
Experiments
Inserting Data
■ One region (West 1) is the tablet master
■ Hash table: 99 clients (33 per region); ordered table (MySQL): 60 clients
■ 1 million records, 1/3 per region
■ Result:
– Hash: West1: 75.6ms; West2: 131.5ms; East: 315.5ms
– Ordered: West1: 33ms; West2: 105.8ms; East: 324.5ms
■ Lesson: MySQL (ordered table) is faster than hash, although more vulnerable to contention
■ More observations
Experiments
Varying Load
■ Load varies from 1,200 to 3,600 requests/second, with 10% writes
■ Result:
Experiments
Varying Read/Write Ratio
■ Write ratio varies between 0% and 50%
■ Fixed 1,200 requests/second
Experiments
Varying Number of Storage Units
■ Storage units per region vary from 2-5
■ 10% writes, 1,200 requests/second
Experiments
Varying Size of Range Scans
■ Range scan size between 0.01% and 0.1%
■ Ordered table only
■ 30 clients vs. 300 clients
Bottlenecks
• Disk seek capacity on storage units
• Message brokers
Different PNUTS customers are assigned different clusters of storage units and message broker machines. They can share routers and tablet controllers.
Future Work
• Consistency
– Referential integrity
– Bundled update
– Relaxed consistency
• Data Storage and Retrieval
– Fair sharing of storage units and message brokers
• Query Processing
– Query optimization: Maintain statistics
– Expansion of query language: join/aggregation
– Batch-query processing
• Indexes and Materialized views
Thank you
Any questions?
