PNUTS: Yahoo!’s Hosted Data Serving Platform

VLDB ’08
Auckland, New Zealand
Brian F. Cooper, Raghu Ramakrishnan, Utkarsh Srivastava, Adam
Silberstein, Philip Bohannon, Hans-Arno Jacobsen, Nick Puz, Daniel
Weaver, and Ramana Yerneni
Presented by
Tarik Reza Toha
#1017052013

Outline
• Background and motivation
• Related work
• Proposed methodology
– Data storage and retrieval
– Asynchronous replication and consistency
• Experimental evaluation
• Conclusion and future work

Modern Web Applications
[Figure: Brian asks “What are my friends up to?” about his friends Sonja, Jimi, Brandon, and Kurt; the page lists entries for Sonja and Brandon]

Modern Web Applications (contd.)
16 Mike <ph..
6 Jimi <ph..
8 Mary <re..
12 Sonja <ph..
15 Brandon <po..
17 Bob <re..
<photo>
<title>Flower</title>
<url>www.flickr.com</url>
</photo>

Requirements of Modern Web Applications
• Scalability
– Architectural scalability: scale during periods of rapid
growth with minimal operational effort
• Response time and geographic scope
– Fast response time for geographically distributed users
• High availability and fault tolerance
– Read and even write data during failures
• Relaxed consistency guarantees
– Eventual consistency: update one replica first and
then update the others

DBMS for Modern Web Applications
• Traditional DBMS features are:
– Complicated queries
– Strong transactions
• Modern web applications need:
– Simplified queries
• No joins, aggregations
– Relaxed consistency
• Applications can tolerate stale or reordered data

Existing Database Management Systems
• Bigtable: A Distributed Storage System for
Structured Data [Google, Inc.]
– Chang et al., OSDI, 2006
– Provides record-oriented access to very large tables
– Lacks geographic replication
– Lacks rich database functionalities
• Secondary indexes
• Materialized views
• Creating multiple tables
• Hash-organized tables

Existing Database Management Systems (contd.)
• Dynamo: Amazon’s Highly Available Key-value
Store
– DeCandia et al., SOSP, 2007
– A highly available system
– Provides geographic replication via a gossip
mechanism
– Uses an eventual consistency model
• Creates temporary inconsistency
– Uses hash tables
• Some storage nodes become hot spots

Existing Database Management Systems (contd.)
• Distributed filesystems
– Ceph, Boxwood, Sinfonia
– Store objects
– Inappropriate for databases
– Unscalable
• Distributed hash tables (peer-to-peer)
– Chord, Pastry
– Provides object routing and database system
– Lacks ordered table abstraction
– Focuses on reliable routing and object replication in the
face of massive node turnover

Platform for Nimble Universal Table Storage
PNUTS is a massively parallel and geographically
distributed database system for Yahoo!’s web
applications. It provides data storage organized
as hashed or ordered tables, low latency for large
numbers of concurrent requests (both updates
and queries), and novel per-record consistency
guarantees.

Proposed Architecture of PNUTS
CREATE TABLE Parts (
ID VARCHAR,
StockNumber INT,
Status VARCHAR
…
)
[Figure: the Parts table below, drawn several times to illustrate partitioning and replication]
A 42342 E
B 42521 W
C 66354 W
D 12352 E
E 75656 C
F 15677 E
• Parallel database
• Geographic replication
• Indexes and views
• Structured, flexible schema
• Hosted, managed infrastructure

Detailed Architecture of PNUTS
Data-path components:
• Clients
• REST API
• Routers
• Tablet controller
• Storage units
• Message broker

Detailed Architecture of PNUTS (contd.)
[Figure: the local region (clients, REST API, routers, tablet controller, storage units) and the remote regions, with YMB between them]

Tablets in Hash Table
Name         Description                             Price
Grape        Grapes are good to eat                  $12
Lime         Limes are green                         $9
Apple        Apple is wisdom                         $1
Strawberry   Strawberry shortcake                    $900
Orange       Arrgh! Don’t get scurvy!                $2
Avocado      But at what price?                      $3
Lemon        How much did you pay for this lemon?    $1
Tomato       Is this a vegetable?                    $14
Banana       The perfect fruit                       $2
Kiwi         New Zealand                             $8
Rows are ordered by the hash of Name. The hash space 0x0000 to 0xFFFF is cut at 0x2AF3 and 0x911F into Tablet 1, Tablet 2, and Tablet 3.
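A minimal sketch of how a key could be mapped to one of the three tablets in the figure. Only the boundaries 0x2AF3 and 0x911F come from the slide. The hash function (MD5 truncated to 16 bits), the choice of which end of each interval is inclusive, and the names `key_hash`, `tablet_for_hash`, and `tablet_for` are assumptions made for illustration, not the PNUTS implementation.

```python
import bisect
import hashlib

# Cut points of the hash space shown on the slide (assumed half-open intervals):
# Tablet 1 = [0x0000, 0x2AF3), Tablet 2 = [0x2AF3, 0x911F), Tablet 3 = [0x911F, 0xFFFF]
BOUNDARIES = [0x2AF3, 0x911F]

def key_hash(key: str) -> int:
    """Hypothetical 16-bit hash: the first two bytes of the MD5 digest."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:2], "big")

def tablet_for_hash(h: int) -> int:
    """Return the 1-based number of the tablet whose interval contains h."""
    return bisect.bisect_right(BOUNDARIES, h) + 1

def tablet_for(key: str) -> int:
    """Hash the key, then look up its tablet."""
    return tablet_for_hash(key_hash(key))
```

Because keys are scattered by the hash, neighbouring names such as Lemon and Lime will usually land in unrelated tablets, which spreads load but rules out efficient range scans.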

Tablets in Ordered Table
Name         Description                             Price
Apple        Apple is wisdom                         $1
Avocado      But at what price?                      $3
Banana       The perfect fruit                       $2
Grape        Grapes are good to eat                  $12
Kiwi         New Zealand                             $8
Lemon        How much did you pay for this lemon?    $1
Lime         Limes are green                         $9
Orange       Arrgh! Don’t get scurvy!                $2
Strawberry   Strawberry shortcake                    $900
Tomato       Is this a vegetable?                    $14
Rows are ordered by Name. The key space A to Z is cut at H and Q into Tablet 1, Tablet 2, and Tablet 3.

Single Query in PNUTS
1. Client → router: get key k (get())
2. Router → storage unit: get key k
3. Storage unit → router: record for key k
4. Router → client: record for key k
[Figure: routers in front of storage units 1, 2, and 3]

Range Queries in PNUTS
Router (scatter-gather engine) interval map:
MIN-Canteloupe     SU1
Canteloupe-Lime    SU3
Lime-Strawberry    SU2
Strawberry-MAX     SU1
Keys in the table: Apple, Avocado, Banana, Blueberry, Canteloupe, Grape, Kiwi, Lemon, Lime, Mango, Orange, Pear, Strawberry, Tomato, Watermelon
Query: Grapefruit…Pear? (scan())
– Grapefruit…Lime?
– Lime…Pear?
[Figure: storage units 1, 2, and 3]
SU1 Strawberry-MAX
SU2 Lime-Strawberry
SU3 Canteloupe-Lime
SU4 MIN-Canteloupe
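The splitting step of the scatter-gather engine can be sketched as follows. The interval map is the router's map from the slide (MIN-Canteloupe on SU1, Canteloupe-Lime on SU3, Lime-Strawberry on SU2, Strawberry-MAX on SU1). The function name `split_scan`, the use of an empty string for MIN, and the half-open `[start, end)` semantics are assumptions for illustration.

```python
import bisect

# Router's interval map from the slide: tablet i covers [LOWER_BOUNDS[i], LOWER_BOUNDS[i+1]).
# "" stands in for MIN; the last tablet extends to MAX.
LOWER_BOUNDS = ["", "Canteloupe", "Lime", "Strawberry"]
STORAGE_UNITS = ["SU1", "SU3", "SU2", "SU1"]

def split_scan(start: str, end: str):
    """Split the scan [start, end) into (sub_start, sub_end, storage_unit) pieces."""
    pieces = []
    i = bisect.bisect_right(LOWER_BOUNDS, start) - 1    # tablet containing start
    while i < len(LOWER_BOUNDS) and LOWER_BOUNDS[i] < end:
        lo = max(start, LOWER_BOUNDS[i])
        hi = LOWER_BOUNDS[i + 1] if i + 1 < len(LOWER_BOUNDS) else None
        piece_end = end if hi is None or end <= hi else hi
        pieces.append((lo, piece_end, STORAGE_UNITS[i]))
        i += 1
    return pieces

print(split_scan("Grapefruit", "Pear"))
# [('Grapefruit', 'Lime', 'SU3'), ('Lime', 'Pear', 'SU2')]
```

The output matches the slide: the query Grapefruit…Pear is split at the tablet boundary Lime into Grapefruit…Lime and Lime…Pear, and each piece goes to the storage unit that holds that tablet.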

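Update Operation in PNUTS
1. Client → router: write key k (set(v))
2. Router → storage unit: write key k
3. Storage unit → message broker: write key k
4. (no label in the figure)
5. Message broker → storage unit: SUCCESS
6. Message broker → other storage units: write key k
7. Storage unit → router: sequence # for key k
8. Router → client: sequence # for key k
[Figure: routers, three storage units (SU), and message brokers]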
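The write path in the figure can be approximated with a toy model. This is a sketch under stated assumptions, not the PNUTS code: the classes `MessageBroker`, `StorageUnit`, and `MasterStorageUnit` are hypothetical, the router is omitted, and asynchronous delivery is simulated by calling `deliver_all()` explicitly.

```python
from collections import deque

class MessageBroker:
    """Toy stand-in for the message broker: an ordered log delivered to subscribers later."""
    def __init__(self):
        self.log = deque()
        self.subscribers = []

    def publish(self, update):
        self.log.append(update)       # the write counts as committed at this point
        return "SUCCESS"

    def deliver_all(self):
        """Simulates asynchronous propagation to the other replicas."""
        while self.log:
            update = self.log.popleft()
            for replica in self.subscribers:
                replica.apply(update)

class StorageUnit:
    def __init__(self):
        self.records = {}             # key -> (sequence number, value)

    def apply(self, update):
        key, seq, value = update
        self.records[key] = (seq, value)

class MasterStorageUnit(StorageUnit):
    def __init__(self, broker):
        super().__init__()
        self.broker = broker

    def write(self, key, value):
        seq = self.records.get(key, (0, None))[0] + 1   # next per-record sequence number
        update = (key, seq, value)
        if self.broker.publish(update) == "SUCCESS":
            self.apply(update)
        return seq                    # handed back to the client

broker = MessageBroker()
master = MasterStorageUnit(broker)
replica = StorageUnit()
broker.subscribers.append(replica)
first_seq = master.write("k", "v1")   # the replica has not seen the write yet
```

The client gets its sequence number as soon as the broker acknowledges the publish; the remote replica only catches up once `broker.deliver_all()` runs, which is where the low write latency comes from.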

Load Balancing via Tablet Splitting
• Each storage unit has many tablets (horizontal partitions of the table)
• Tablets may grow over time
• Overfull tablets split
• A storage unit may become a hotspot
• Shed load by moving tablets to other servers
[Figure: a storage unit and its tablets]
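A minimal sketch of the two operations named on this slide. Splitting at the median key, using a record-count threshold, and measuring load by the number of tablets are all assumptions for illustration; the slide does not say how PNUTS decides when or where to split. The names `split_tablet` and `shed_load` are hypothetical.

```python
def split_tablet(records: dict, max_records: int):
    """Split a tablet (key -> value) at its median key if it holds too many records."""
    if len(records) <= max_records:
        return [records]
    keys = sorted(records)
    mid = len(keys) // 2
    left = {k: records[k] for k in keys[:mid]}
    right = {k: records[k] for k in keys[mid:]}
    return [left, right]

def shed_load(units: dict):
    """Move one tablet from the storage unit with the most tablets to the one with the fewest."""
    busiest = max(units, key=lambda u: len(units[u]))
    idlest = min(units, key=lambda u: len(units[u]))
    if len(units[busiest]) - len(units[idlest]) > 1:
        units[idlest].append(units[busiest].pop())
    return units
```

For example, `shed_load({"SU1": ["t1", "t2", "t3"], "SU2": []})` moves `t3` to `SU2`.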

Consistency Levels
• Eventual consistency
– Transactions:
• Alice changes status from “Sleeping” to “Awake”
• Alice changes location from “Home” to “Work”
– Region 1: (Alice, Home, Sleeping) → (Alice, Home, Awake) → (Alice, Work, Awake)
– Region 2: (Alice, Home, Sleeping) → (Alice, Work, Sleeping) → (Alice, Work, Awake)
– An “invalid” state, (Alice, Work, Sleeping), is visible in Region 2
– The final state is consistent

Consistency Levels (contd.)
• Timeline consistency
– Transactions:
• Alice changes status from “Sleeping” to “Awake”
• Alice changes location from “Home” to “Work”
– Region 1: (Alice, Home, Sleeping) → (Alice, Home, Awake), then receives “Work”
– Region 2: (Alice, Home, Sleeping) → (Alice, Work, Awake), receiving “Awake” and then “Work” in that order
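One way to picture timeline consistency is a replica that applies a record's updates strictly in version order. The sketch below buffers an update that arrives early; that buffering, the `Replica` class, and the field names are assumptions for illustration (the slides attribute ordering to record mastership and the message broker, not to replica-side buffering).

```python
class Replica:
    """Applies per-record updates strictly in version order."""
    def __init__(self, record):
        self.record = dict(record)
        self.version = 0
        self.pending = {}             # version -> field changes that arrived early

    def receive(self, version, changes):
        self.pending[version] = changes
        # Apply every update whose predecessor has already been applied.
        while self.version + 1 in self.pending:
            self.version += 1
            self.record.update(self.pending.pop(self.version))

replica = Replica({"name": "Alice", "location": "Home", "status": "Sleeping"})
replica.receive(2, {"location": "Work"})    # arrives first, so it is held back
state_after_early_update = dict(replica.record)
replica.receive(1, {"status": "Awake"})     # unblocks both updates, applied in order
```

The replica never exposes (Alice, Work, Sleeping): it still shows (Alice, Home, Sleeping) after the early update and moves to (Alice, Work, Awake) once both updates can be applied in order.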

Consistency via Mastership
[Figure: copies of the table below in several regions; the last column (E, W, or C) marks each record’s master region. In some copies record B’s entry changes from W to E.]
A 42342 E
B 42521 W
C 66354 W
D 12352 E
E 75656 C
F 15677 E

Failover in PNUTS
[Figure: three copies of the table below; two are marked X, and mastership is overridden (OVERRIDE W → E)]
A 42342 E
B 42521 W
C 66354 W
D 12352 E
E 75656 C
F 15677 E

Consistency Models in PNUTS
• PNUTS supports both the eventual and the timeline
consistency models
– Applications can choose which kind of table to create
• What happens to a record with primary key
“Brian”?
[Figure: timeline of the record within Generation 1: the record is inserted, goes through a series of updates (v. 1 to v. 8), and is deleted]

Some APIs of Timeline Model in PNUTS
Read-any
[Figure: record timeline v. 1 to v. 8 in Generation 1, with the current version and stale versions marked; read-any may hit a stale version]
• Read-any returns a possibly stale version of the record
‒ Served using a local copy
• It can be used to display a friend’s status in a social
networking application, where getting the most up-to-date
value is not essential

Some APIs of Timeline Model in PNUTS (contd.)
Read-latest
[Figure: record timeline v. 1 to v. 8 in Generation 1; read-latest points at the current version]
Write
[Figure: record timeline v. 1 to v. 8 in Generation 1; write points at the current version]

Read-critical(required version)
[Figure: record timeline v. 1 to v. 8 in Generation 1, with the current version marked; the request is “Read ≥ v.6”]
• Read-critical returns a version of the record that is the same as,
or newer than, the required version
• It can be used when a user writes a record and then wants to
read a version that definitely reflects those changes

Test-and-set-write(required version)
[Figure: record timeline v. 1 to v. 8 in Generation 1, with the current version marked; the request “Write if = v.7” returns ERROR]
• Test-and-set-write performs the requested write if
and only if the present version of the record is the same as the
required version
‒ Acts as a row-level locking mechanism
• It can be used to write a record based on a previous
read, e.g., incrementing the value of a counter
Recovery via YMB
• Yahoo! Message Broker (YMB) [redo log]
– Topic-based publish/subscribe system
– An update is considered “committed” once it has been published to YMB
– At some point after being committed, the update is asynchronously
propagated to the other regions and applied to their replicas
• Recovery via YMB
– The tablet controller requests a copy from a particular remote replica (the
“source tablet”)
– A “checkpoint message” is published to YMB to ensure that any updates in flight
when the copy is initiated are applied to the source tablet
– The source tablet is copied to the destination region
– In practice, a backup is used

Other Features
• Notifications
– One pub-sub topic per tablet
– Clients know about tables, not tablets
– Clients are automatically subscribed to all tablets of a table, even as
tablets are added or removed
– Undelivered notifications are handled in the usual way
• Hosted Database Service
– Centrally managed database service shared by
multiple applications

Experimental Setup
• Three PNUTS regions
Region            Machine
West 1, West 2    2.8 GHz Xeon, 4GB RAM
East              Quad 2.13 GHz Xeon, 4GB RAM
Servers/region: 5 SU, 2 YMB, 1 Router, 1 Tablet controller
• Workload
– 1200-3600 requests/second
– 0-50% writes
– 80% locality
• Insert Operation
Region                 Latency (hash table)    Latency (ordered table)
West 1 (master)        75.6 ms                 33 ms
West 2 (non-master)    131.5 ms                105.8 ms
East (non-master)      315.5 ms                324.5 ms

Conclusion and Future Work
• Existing DBMSs fail to provide rich database functionality and low
latency at massive scale
• PNUTS uses asynchronous geographic replication to ensure low write
latency
– Per-record timeline consistency that provides useful guarantees to
applications without sacrificing scalability
– Message broker that serves both as the replication mechanism and redo log
of the database
– Flexible mapping of tablets to storage units to support automated failover
and load balancing
• Future work
– Indexes and materialized views
– Bundled updates
– Batch query processing (MapReduce)

Subsequent Advancements
• Asynchronous View Maintenance for VLSD Databases
– Agrawal et al., SIGMOD, 2009
– Indexes and views
• A Batch of PNUTS: Experiences Connecting Cloud Batch
and Serving Systems
– Silberstein et al., SIGMOD, 2011
– PNUTS-Hadoop
• Where in the World is My Data?
– Kadambi et al., VLDB, 2011
– Selective replication

Indexes and Views
• Remote view table
– A regular table, but updated by the view maintainer
instead of a client
[Figure: update path through the storage unit (SU), YMB, and the view maintainer (VM)]

PNUTS-Hadoop
Reading from PNUTS
1. Split PNUTS table into ranges
2. Each Hadoop task is assigned a range
3. Task uses PNUTS scan API to retrieve
records in range
4. Task feeds the scanned records to the map function
[Figure: Hadoop tasks issue scan(0x0-0x2), scan(0x2-0x4), scan(0x8-0xa), scan(0xa-0xc), and scan(0xc-0xe) against PNUTS; a record reader feeds each map task]
Writing to PNUTS
1. Call PNUTS set to write output
[Figure: map or reduce Hadoop tasks call set on PNUTS through the router]
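The reading steps can be sketched as two small functions. The names `split_ranges` and `run_map_task` are hypothetical, the `scan` argument is a placeholder callable rather than the real PNUTS scan API, and treating the key space as integers with half-open ranges is an assumption made for illustration.

```python
def split_ranges(lo: int, hi: int, num_tasks: int):
    """Divide [lo, hi) into num_tasks contiguous ranges, one per Hadoop task."""
    step, extra = divmod(hi - lo, num_tasks)
    ranges, start = [], lo
    for i in range(num_tasks):
        end = start + step + (1 if i < extra else 0)   # spread any remainder over the first tasks
        ranges.append((start, end))
        start = end
    return ranges

def run_map_task(scan, key_range, map_fn):
    """Feed every record returned by scan(start, end) to map_fn."""
    return [map_fn(record) for record in scan(*key_range)]

# Example: 8 tasks over a 4-bit hash space give ranges such as (0x0, 0x2) and (0x2, 0x4).
task_ranges = split_ranges(0x0, 0x10, 8)
```

Each task then only talks to the tablets that hold its own range, so the scan work is spread across the storage units.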

Selective Replication
• If a European user’s record is never accessed in Asia, it does
not make sense to pay the bandwidth and disk costs to maintain
an Asian replica
• Static placement
– Per-record constraints
– Client sets mandatory and disallowed regions
• Dynamic placement
– Create replicas in regions where the record is read
– Evict replicas from regions where the record is not read
– Lease-based
• When a replica is read, it is guaranteed to survive for a time period
• Eviction is lazy: when the lease expires, the replica is deleted on the next write
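The lease rule for dynamic placement can be sketched for a single record in a single region. The `ReplicaLease` class and its method names are hypothetical, and passing the current time in explicitly (rather than reading a clock) is a choice made to keep the example deterministic.

```python
class ReplicaLease:
    """Tracks whether one region's copy of one record is kept or evicted."""
    def __init__(self, lease_seconds: float):
        self.lease_seconds = lease_seconds
        self.expires_at = 0.0
        self.present = False

    def on_read(self, now: float):
        # A read creates the replica if needed and renews its lease.
        self.present = True
        self.expires_at = now + self.lease_seconds

    def on_write(self, now: float) -> bool:
        """Return True if the write should be applied to this replica."""
        if self.present and now >= self.expires_at:
            self.present = False      # lazy eviction: dropped on the first write after expiry
        return self.present

lease = ReplicaLease(lease_seconds=10)
lease.on_read(now=0)                  # replica created, lease runs until t=10
```

A write at t=5 is still applied to this replica, while a write at t=10 or later finds the lease expired and evicts it; a later read would recreate the replica.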

Thank you
Questions are welcome!
Email: 1017052013@grad.cse.buet.ac.bd