Espresso: LinkedIn's Distributed Data Serving Platform (Talk)


This talk was given by Swaroop Jagadish (Staff Software Engineer @ LinkedIn) at the ACM SIGMOD/PODS Conference (June 2013). For the paper written by the LinkedIn Espresso Team, go here:
http://www.slideshare.net/amywtang/espresso-20952131


  1. On Brewing Fresh Espresso: LinkedIn's Distributed Data Serving Platform – Swaroop Jagadish – http://www.linkedin.com/in/swaroopjagadish – LinkedIn Confidential ©2013 All Rights Reserved
  2. Outline
     – LinkedIn Data Ecosystem
     – Espresso: Design Points
     – Data Model and API
     – Architecture
     – Deep Dive: Fault Tolerance
     – Deep Dive: Secondary Indexing
     – Espresso in Production
     – Future Work
  3. The World's Largest Professional Network – Connecting Talent with Opportunity. At scale…
     – 225M+ members worldwide
     – 2 new members per second
     – 100M+ monthly unique visitors
     – 2M+ company pages
  4. LinkedIn Data Ecosystem
  5. Espresso: Key Design Points
     • Source-of-truth
       – Master-slave, timeline consistent
       – Query-after-write
       – Backup/restore
       – High availability
     • Horizontally scalable
     • Rich functionality
       – Hierarchical data model
       – Document oriented
       – Transactions within a hierarchy
       – Secondary indexes
  6. Espresso: Key Design Points
     • Agility – no "pause the world" operations
       – "On the fly" schema evolution
       – Elasticity
     • Integration with the data ecosystem
       – Change stream with freshness in O(seconds)
       – ETL to Hadoop
       – Bulk import
     • Modular and pluggable
       – Off-the-shelf components: MySQL, Lucene, Avro
  7. Data Model and API
  8. Application View – key/value access through the REST API: /mailbox/msg_meta/bob/2
  9. Partitioning – /mailbox/msg_meta/bob/2 – the member ID ("bob") is the partitioning key
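The slide does not spell out the hash function; the point is that the partitioning key alone decides placement. A minimal sketch, assuming a simple hash-mod scheme (the function and partition count are illustrative, not from the talk):

```python
# Minimal sketch: the partitioning key (member ID) alone determines placement.
# The hash function and partition count are illustrative, not from the talk.
import hashlib

NUM_PARTITIONS = 12  # matches the 12-partition example cluster later in the talk

def partition_for(member_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map the partitioning key to a partition number."""
    digest = hashlib.md5(member_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Every key under /mailbox/msg_meta/bob/... hashes on "bob", so Bob's whole
# mailbox lands in one partition (and therefore on one master node).
print(partition_for("bob"))
```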
  10. Document-based data model
      – Richer than a plain key-value store
      – Hierarchical keys
      – Values are rich documents and may contain nested types
      Example schema (Messages): mailboxID : String, messageID : long,
        from : { name : String, email : String }, subject : String,
        body : String, unread : boolean
      Example value: from : { name : "Chris", email : "chris@linkedin.com" },
        subject : "Go Giants!", body : "World Series 2012! w00t!", unread : true
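As a rough illustration of the nested document above: Espresso itself describes documents with Avro schemas (slide 6), so the Python classes below are only a sketch of the same shape, not Espresso's schema language.

```python
# Sketch only: Espresso defines document schemas in Avro; these dataclasses just
# mirror the MessageMeta example above to show the hierarchical key + nested value.
from dataclasses import dataclass

@dataclass
class Sender:
    name: str
    email: str

@dataclass
class Message:
    mailboxID: str   # first part of the hierarchical key
    messageID: int   # second part of the hierarchical key
    sender: Sender   # nested type (called "from" on the slide; renamed here
                     # because "from" is a Python keyword)
    subject: str
    body: str
    unread: bool

msg = Message("bob", 2, Sender("Chris", "chris@linkedin.com"),
              "Go Giants!", "World Series 2012! w00t!", True)
```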
  11. REST-based API
      • Secondary index query
        GET /MailboxDB/MessageMeta/bob/?query="+isUnread:true +isInbox:true"&start=0&count=15
      • Partial updates
        POST /MailboxDB/MessageMeta/bob/1
        Content-Type: application/json
        Content-Length: 21
        {"unread" : "false"}
      • Conditional operations – get a message only if recently updated
        GET /MailboxDB/MessageMeta/bob/1
        If-Modified-Since: Wed, 31 Oct 2012 02:54:12 GMT
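The three API calls on this slide can be exercised with any HTTP client; the sketch below uses Python's `requests` library against a placeholder host, with paths, query parameters, and headers taken from the slide.

```python
# Sketch of the three calls above using the `requests` library; the host is a
# placeholder, while paths, parameters, and headers follow the slide.
import requests

BASE = "http://espresso.example.com:12345"  # placeholder endpoint

# Secondary-index query: Bob's unread inbox messages, first 15 results
r = requests.get(
    f"{BASE}/MailboxDB/MessageMeta/bob/",
    params={"query": "+isUnread:true +isInbox:true", "start": 0, "count": 15},
)

# Partial update: only the changed field is sent
requests.post(f"{BASE}/MailboxDB/MessageMeta/bob/1", json={"unread": "false"})

# Conditional read: fetch message 1 only if it changed since the given time
r = requests.get(
    f"{BASE}/MailboxDB/MessageMeta/bob/1",
    headers={"If-Modified-Since": "Wed, 31 Oct 2012 02:54:12 GMT"},
)
if r.status_code == 304:
    print("local copy is still fresh")
```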
  12. Transactional writes within a hierarchy
      Example: marking message 0 in George's mailbox as unread and bumping the unread counter.
      Before – Message table (mboxId, msgId, value, etag):
        George, 0, {…, "unread": false, …}, 7abf8091
        George, 1, {…, "unread": true, …},  b648bc5f
        George, 2, {…, "unread": true, …},  4fde8701
      Before – MessageCounter: George → { "numUnread": 2 }
      1. Read the affected row and record its etag:
         /Message/George/0  {…, "unread": false, …}  etag 7abf8091
      2. Prepare the after-image:
         /Message/George/0       {…, "unread": true, …}
         /MessageCounter/George  {…, "numUnread": "+1", …}
      3. Update – both rows share the hierarchy key "George", so the write is applied as one transaction.
      After – MessageCounter: George → { "numUnread": 3 }
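A sketch of the read / prepare / update flow above. The slides do not show the wire format for submitting the whole after-image, so the group-write URI and the use of ETag/If-Match headers here are assumptions made for illustration.

```python
# Sketch of the read / prepare-after-image / update flow. Assumptions not shown
# in the talk: the etag is returned in an ETag header, and the whole after-image
# is submitted in one request (hypothetical URI) so the storage node that masters
# the "George" partition can apply it as a single transaction.
import requests

BASE = "http://espresso.example.com:12345"  # placeholder endpoint

# 1. Read the row being modified and record its etag
msg = requests.get(f"{BASE}/MailboxDB/Message/George/0")
etag = msg.headers.get("ETag")  # e.g. "7abf8091"

# 2. Prepare the after-image: flip the unread flag, bump the counter
after_image = {
    "/MailboxDB/Message/George/0":      {"unread": True},
    "/MailboxDB/MessageCounter/George": {"numUnread": "+1"},
}

# 3. Update, conditional on the etag recorded in step 1; a concurrent writer
#    would have changed the etag and this request would be rejected.
requests.post(
    f"{BASE}/MailboxDB/txn/George",  # hypothetical group-write endpoint
    json=after_image,
    headers={"If-Match": etag},
)
```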
  13. Espresso Architecture
  14.–19. (Espresso architecture diagrams; these slides have no transcript text.)
  20. Cluster Management and Fault Tolerance
  21. Generic Cluster Manager: Apache Helix
      • Generic cluster management
        – State model + constraints
        – Ideal state of the distribution of partitions across the cluster
        – Migrates the cluster from its current state to the ideal state
      • More info: SoCC 2012 paper, http://helix.incubator.apache.org
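Helix is a Java system; the snippet below is only a conceptual paraphrase of "state model + constraints" for Espresso's master/slave partitions, not the Helix API.

```python
# Conceptual paraphrase (not the Helix API) of the state model + constraints that
# Helix manages for Espresso's partitions; Helix drives each partition replica
# through these transitions until the cluster matches the ideal state.
MASTER_SLAVE_MODEL = {
    "states": ["MASTER", "SLAVE", "OFFLINE"],
    "transitions": {
        ("OFFLINE", "SLAVE"),   # bootstrap a replica / restore from backup
        ("SLAVE", "MASTER"),    # promotion (failover or rebalance)
        ("MASTER", "SLAVE"),    # demotion during rebalance
        ("SLAVE", "OFFLINE"),   # drop a replica
    },
    "constraints": {"MASTER": 1, "SLAVE": 1},  # per partition, with 2-way replication
}

def is_legal(from_state: str, to_state: str) -> bool:
    """Only transitions allowed by the state model are ever issued."""
    return (from_state, to_state) in MASTER_SLAVE_MODEL["transitions"]
```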
  22. Espresso Partition Layout: Master, Slave
      3 storage-engine nodes, 2-way replication, 12 partitions. Apache Helix holds the
      partition-to-node map (Partition P1 → Node 1, …, Partition P12 → Node 3) and each
      node's per-partition state (e.g. Node 1: master for P1 – active, slave for P5 – active, …).
        Node 1 – masters: P1 P2 P3 P4     slaves: P5 P6 P9 P10
        Node 2 – masters: P5 P6 P7 P8     slaves: P1 P2 P11 P12
        Node 3 – masters: P9 P10 P11 P12  slaves: P3 P4 P7 P8
  23. Cluster Management: Cluster Expansion, Node Failover
  24. Cluster Expansion – initial state with 3 storage nodes (layout as on slide 22) plus an empty Node 4. Step 1: Helix computes the new ideal state.
  25. Cluster Expansion – Step 2: bootstrap the new node's partitions (P1, P4, P7, P8, P9, P12) by restoring from snapshot backups.
  26. Cluster Expansion – Step 3: the new node catches up from the live replication stream.
  27. Cluster Expansion – Step 4: migrate masters and slaves to rebalance; Node 4 takes mastership of P4, P8, P12 and slave copies of P1, P7, P9.
  28. Cluster Expansion – partitions are now balanced; the router starts sending traffic to the new node.
        Node 1 – masters: P1 P2 P3    slaves: P5 P6 P10
        Node 2 – masters: P5 P6 P7    slaves: P2 P11 P12
        Node 3 – masters: P9 P10 P11  slaves: P3 P4 P8
        Node 4 – masters: P4 P8 P12   slaves: P1 P7 P9
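Helix computes the ideal state during expansion; the toy sketch below (not Helix's actual algorithm) captures the idea of moving just enough masters onto the new node, and happens to reproduce the P4/P8/P12 assignment shown above.

```python
# Toy sketch (NOT Helix's actual algorithm) of the rebalance idea: when a node
# joins, move just enough masters off the most-loaded nodes that every node ends
# up with roughly total/num_nodes masters, touching as few partitions as possible.
from collections import defaultdict

def rebalance_masters(masters: dict[str, str], new_node: str) -> dict[str, str]:
    """masters: partition -> node. Returns a new partition -> node map."""
    assignment = dict(masters)
    nodes = sorted(set(assignment.values()) | {new_node})
    target = len(assignment) // len(nodes)   # masters per node after rebalance

    by_node = defaultdict(list)
    for partition, node in assignment.items():
        by_node[node].append(partition)

    for node in nodes:
        while node != new_node and len(by_node[node]) > target:
            moved = by_node[node].pop()      # shed only the surplus partitions
            by_node[new_node].append(moved)
            assignment[moved] = new_node
    return assignment

before = {f"P{i}": f"Node{(i - 1) // 4 + 1}" for i in range(1, 13)}  # slide 22 masters
after = rebalance_masters(before, "Node4")
print(sorted(p for p, n in after.items() if n == "Node4"))  # ['P12', 'P4', 'P8']
```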
  29. Node Failover – needed both for unplanned failures and planned maintenance (starting from the balanced 4-node layout above).
  30. Node Failover – Step 1: detect the node failure.
  31. Node Failover – Step 2: compute a new ideal state that promotes slaves to master for the failed node's partitions (P9, P10, P11 in the diagram), as sketched below.
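Slide 44 later notes that Helix promotes the slave with the least replication lag. A toy sketch of that decision, with illustrative data structures:

```python
# Toy sketch of the failover decision (slide 44): for each partition mastered on
# the failed node, promote the surviving slave with the least replication lag.
# The data structures are illustrative, not Espresso's.
def promote_for_failed_node(masters, slaves, replication_lag_ms, failed_node):
    """
    masters: partition -> node                 (current master map)
    slaves:  partition -> list of nodes        (replica locations)
    replication_lag_ms: (node, partition) -> lag in milliseconds
    Returns {partition: new_master} for every partition that lost its master.
    """
    promotions = {}
    for partition, node in masters.items():
        if node != failed_node:
            continue
        candidates = [s for s in slaves[partition] if s != failed_node]
        # consistency over availability: pick the most caught-up slave
        promotions[partition] = min(
            candidates, key=lambda s: replication_lag_ms[(s, partition)]
        )
    return promotions
```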
  32. Failover Performance
  33. Secondary Indexing
  34. Espresso Secondary Indexing
      • Local secondary index requirements
        – Read-after-write
        – Consistent with primary data under failure
        – Rich query support: match, prefix, range, text search
        – Cost-to-serve proportional to working set
      • Pluggable index implementations
        – MySQL B-tree
        – Inverted index using Apache Lucene with a MySQL backing store
        – Inverted index using a prefix index
        – FastBit-based bitmap index
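A conceptual sketch (not Espresso code) of what a pluggable local-index interface might look like, so that the B-tree, Lucene, prefix-index, and FastBit implementations are interchangeable:

```python
# Conceptual sketch (not Espresso code) of a pluggable local-index interface; a
# MySQL B-tree, a Lucene inverted index, a prefix index, or a FastBit bitmap
# index could each implement it behind the same contract.
from abc import ABC, abstractmethod

class LocalSecondaryIndex(ABC):
    @abstractmethod
    def update(self, partition_key: str, doc_id: str, document: dict) -> None:
        """Index a document; applied together with the primary write so the
        index stays consistent with primary data under failure."""

    @abstractmethod
    def query(self, partition_key: str, query: str, start: int, count: int) -> list[str]:
        """Return matching document ids for match / prefix / range / text-search
        queries, scoped to a single partition key (the index is local, not global)."""
```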
  35. Lucene-based implementation
      • Requires the entire index to be memory-resident to support low-latency query response times
      • For the Mailbox application, we have two options
  36. Optimizations for the Lucene-based implementation
      • Concurrent transactions on the same Lucene index lead to inconsistency, so a lock must be acquired
      • Opening an index repeatedly is expensive, so writes are group-committed to amortize the index-opening cost
      (diagram: requests 1–5 batched into a single index write)
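A minimal sketch of the group-commit idea, assuming a hypothetical `open_index()` handle with an `update()` method standing in for the Lucene plumbing; the point is that the expensive index open happens once per batch rather than once per request.

```python
# Minimal sketch of group commit: writers enqueue index updates, and a single
# committer thread drains whatever has accumulated, opens the (expensive) index
# once per batch, and applies every queued update. `open_index` and the returned
# handle's `update` method are hypothetical stand-ins for the Lucene plumbing.
import queue
import threading

pending: "queue.Queue[tuple[str, dict]]" = queue.Queue()

def committer(open_index, stop: threading.Event) -> None:
    while not stop.is_set() or not pending.empty():
        try:
            first = pending.get(timeout=0.1)        # wait for at least one write
        except queue.Empty:
            continue
        batch = [first]
        while True:                                  # grab everything queued behind it
            try:
                batch.append(pending.get_nowait())
            except queue.Empty:
                break
        with open_index() as index:                  # one open/commit per batch
            for doc_id, fields in batch:
                index.update(doc_id, fields)
```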
  37. Optimizations for the Lucene-based implementation
      • High-value users of the site accumulate large mailboxes, and query performance degrades with a large index
      • Performance shouldn't get worse with more usage!
      • Time-partitioned indexes: partition each index into buckets based on message creation time
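A toy sketch of time-partitioned indexing: documents go into buckets by creation time, and queries walk the newest buckets first, so a huge old mailbox no longer slows down typical "recent messages" queries. Bucket width and the substring match are illustrative only.

```python
# Toy sketch of time-partitioned indexing: documents are bucketed by creation
# time and queries walk the newest buckets first, stopping as soon as enough
# results are found. Bucket width and the substring "search" are illustrative.
from collections import defaultdict
from datetime import datetime

BUCKET_DAYS = 30  # illustrative bucket width

def bucket_of(created: datetime) -> int:
    return created.toordinal() // BUCKET_DAYS

class TimePartitionedIndex:
    def __init__(self) -> None:
        self.buckets: dict[int, list[tuple[str, str]]] = defaultdict(list)

    def add(self, doc_id: str, text: str, created: datetime) -> None:
        self.buckets[bucket_of(created)].append((doc_id, text))

    def search(self, term: str, count: int) -> list[str]:
        hits: list[str] = []
        for b in sorted(self.buckets, reverse=True):     # newest buckets first
            hits += [d for d, text in self.buckets[b] if term in text]
            if len(hits) >= count:                       # old buckets never touched
                break
        return hits[:count]
```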
  38. Espresso in Production
  39. Espresso in Production
      • Unified Social Content Platform – social activity aggregation
      • High read:write ratio
  40. Espresso in Production
      • InMail – allows members to communicate with each other
      • Large storage footprint
      • Low-latency requirement for secondary-index queries involving text search and relational predicates
  41. Performance
      • Average failover latency with 1024 partitions is around 300 ms
      • Primary data reads and writes, single storage node on SSD, average row size = 1 KB:
          Reads:  ~3 ms average latency, ~40,000 ops/sec
          Writes: ~6 ms average latency, ~20,000 ops/sec
  42. Performance
      • Partition-key-level secondary index using Lucene, one index per mailbox use-case
      • Base data on SAS disks, indexes on SSDs
      • Average throughput per index ≈ 1,000 ops/sec (after the group-commit and partitioned-index optimizations)
          Queries (average of 5 indexed fields): ~20 ms average latency
          Writes (around 30 indexed fields):     ~20 ms average latency
  43. Durability and Consistency: within a data center, across data centers
  44. Durability and Consistency – within a data center
      • Write latency vs. durability
        – Asynchronous replication may lead to data loss; tooling can mitigate some of this
        – Semi-synchronous replication waits for at least one relay to acknowledge; during failover, slaves wait for catch-up
      • Consistency over availability
      • Helix selects the slave with the least replication lag to take over mastership
      • Failover time is ~300 ms in practice
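A toy sketch of the semi-synchronous option: the master acknowledges a write only after its local commit plus at least one relay acknowledgement. The relay client (`relay.send`) is hypothetical and the timeout is illustrative.

```python
# Toy sketch of the semi-synchronous option: the write is acknowledged only after
# the local commit plus at least one relay acknowledgement. `relay.send` is a
# hypothetical relay client; the timeout is illustrative.
import threading

def write_semi_sync(commit_locally, relays, event, timeout_s=0.5) -> bool:
    commit_locally(event)
    first_ack = threading.Event()

    def push(relay):
        relay.send(event)        # ship the change-stream event to one relay
        first_ack.set()          # any single ack is enough to acknowledge the write

    for relay in relays:
        threading.Thread(target=push, args=(relay,), daemon=True).start()
    return first_ack.wait(timeout_s)   # True iff some relay acked within the timeout
```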
  45. Durability and Consistency – across data centers
      • Asynchronous replication, so stale reads are possible
      • Active-active: conflict resolution via last-writer-wins
  46. Lessons Learned
      • Dealing with transient failures
      • Planned upgrades
      • Slave reads
      • Storage devices – SSDs vs. SAS disks
      • Scaling cluster management
  47. Future Work
      • Coprocessors – synchronous and asynchronous
      • Richer query processing – group-by, aggregation
  48. Key Takeaways
      • Espresso is a timeline-consistent, document-oriented distributed database
      • Feature rich: secondary indexing, transactions over related documents, seamless integration with the data ecosystem
      • In production since June 2012, serving several key use-cases
  49. Questions?
