Espresso: LinkedIn's Distributed Data Serving Platform (Talk)

This talk was given by Swaroop Jagadish (Staff Software Engineer @ LinkedIn) at the ACM SIGMOD/PODS Conference (June 2013). For the paper written by the LinkedIn Espresso Team, go here:
http://www.slideshare.net/amywtang/espresso-20952131

1. On Brewing Fresh Espresso: LinkedIn’s Distributed Data Serving Platform
   Swaroop Jagadish
   http://www.linkedin.com/in/swaroopjagadish
   LinkedIn Confidential ©2013 All Rights Reserved
2. Outline
   • LinkedIn Data Ecosystem
   • Espresso: Design Points
   • Data Model and API
   • Architecture
   • Deep Dive: Fault Tolerance
   • Deep Dive: Secondary Indexing
   • Espresso In Production
   • Future work
3. The World’s Largest Professional Network
   • 225M+ Members Worldwide
   • 2 new Members Per Second
   • 100M+ Monthly Unique Visitors
   • 2M+ Company Pages
   Connecting Talent with Opportunity. At scale…
4. LinkedIn Data Ecosystem
5. Espresso: Key Design Points
   • Source-of-truth
     – Master-Slave, Timeline consistent
     – Query-after-write
     – Backup/Restore
     – High Availability
   • Horizontally Scalable
   • Rich functionality
     – Hierarchical data model
     – Document oriented
     – Transactions within a hierarchy
     – Secondary Indexes
6. Espresso: Key Design Points
   • Agility – no “pause the world” operations
     – “On the fly” Schema Evolution
     – Elasticity
   • Integration with the data ecosystem
     – Change stream with freshness in O(seconds)
     – ETL to Hadoop
     – Bulk import
   • Modular and Pluggable
     – Off-the-shelf: MySQL, Lucene, Avro
7. Data Model and API
8. Application View
   REST API example (key → value): /mailbox/msg_meta/bob/2
9. Partitioning
   /mailbox/msg_meta/bob/2
   MemberId is the partitioning key
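To make the routing concrete, here is a minimal sketch of hash-based routing on the leading key element; the partition count and hash function are assumptions for illustration, not Espresso's actual partitioning code.

    # Minimal sketch: route a hierarchical Espresso-style key to a partition.
    # The partition count (12) and the MD5-based hash are illustrative assumptions.
    import hashlib

    NUM_PARTITIONS = 12

    def partition_for(uri: str) -> int:
        # URI layout from the talk: /<database>/<table>/<partitioning key>/<sub-key...>
        parts = uri.strip("/").split("/")
        partitioning_key = parts[2]        # e.g. "bob" in /mailbox/msg_meta/bob/2
        digest = hashlib.md5(partitioning_key.encode("utf-8")).hexdigest()
        return int(digest, 16) % NUM_PARTITIONS

    print(partition_for("/mailbox/msg_meta/bob/2"))   # all of bob's messages...
    print(partition_for("/mailbox/msg_meta/bob/7"))   # ...land on the same partition

Because the member id alone picks the partition, an entire mailbox lives on one node, which is what makes the hierarchy-local transactions and indexes on later slides possible.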
10. Document based data model
    • Richer than a plain key-value store
    • Hierarchical keys
    • Values are rich documents and may contain nested types
    Example document:
      from    : { name : "Chris", email : "chris@linkedin.com" }
      subject : "Go Giants!"
      body    : "World Series 2012! w00t!"
      unread  : true
    Messages schema:
      mailboxID : String
      messageID : long
      from      : { name : String, email : String }
      subject   : String
      body      : String
      unread    : boolean
11. REST based API
    • Secondary Index query
      – GET /MailboxDB/MessageMeta/bob/?query=“+isUnread:true +isInbox:true”&start=0&count=15
    • Partial updates
      – POST /MailboxDB/MessageMeta/bob/1
        Content-Type: application/json
        Content-Length: 21

        {“unread” : “false”}
    • Conditional operations – get a message only if recently updated
      – GET /MailboxDB/MessageMeta/bob/1
        If-Modified-Since: Wed, 31 Oct 2012 02:54:12 GMT
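For illustration, the same calls issued from a Python client; the base URL is a placeholder, and only the paths, query, and payloads come from the slide.

    # Illustrative client calls for the REST API shown above.
    # The base URL is a placeholder; only the paths and payloads come from the talk.
    import requests

    BASE = "http://espresso.example.com"

    # Secondary index query: unread inbox messages for mailbox "bob"
    resp = requests.get(
        f"{BASE}/MailboxDB/MessageMeta/bob/",
        params={"query": "+isUnread:true +isInbox:true", "start": 0, "count": 15},
    )
    messages = resp.json()

    # Partial update: mark message 1 as read
    requests.post(f"{BASE}/MailboxDB/MessageMeta/bob/1", json={"unread": "false"})

    # Conditional read: only fetch if modified since the given timestamp
    resp = requests.get(
        f"{BASE}/MailboxDB/MessageMeta/bob/1",
        headers={"If-Modified-Since": "Wed, 31 Oct 2012 02:54:12 GMT"},
    )
    if resp.status_code == 304:
        print("not modified")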
12. Transactional writes within a hierarchy
    MessageCounter table:
      mboxId | value
      George | { “numUnread”: 2 }
    Message table:
      mboxId | msgId | value                   | etag
      George | 0     | {…, “unread”: false, …} | 7abf8091
      George | 1     | {…, “unread”: true, …}  | b648bc5f
      George | 2     | {…, “unread”: true, …}  | 4fde8701
    Write flow:
      1. Read, record etags: /Message/George/0 → {…, “unread”: false, …} (etag 7abf8091)
      2. Prepare after-image: /Message/George/0 → {…, “unread”: true, …}
                              /MessageCounter/George → {…, “numUnread”: “+1”, …}
      3. Update. Resulting counter row: George → { “numUnread”: 3 }
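A rough client-side sketch of the three-step flow. In Espresso the message and counter writes commit as a single transaction on the server; here the If-Match header and the standalone counter update are assumptions used only to illustrate the etag-guarded read-modify-write.

    # Sketch of the 3-step write: read + record etag, prepare the after-image,
    # then apply it. Host name, If-Match usage and separate counter POST are
    # illustrative assumptions, not the exact Espresso protocol.
    import requests

    BASE = "http://espresso.example.com"

    def mark_unread(mbox: str, msg_id: int) -> None:
        # 1. Read the message and record its etag
        r = requests.get(f"{BASE}/MailboxDB/MessageMeta/{mbox}/{msg_id}")
        etag = r.headers.get("ETag")

        # 2. Prepare the after-image: message flips to unread, counter gets +1
        message_update = {"unread": "true"}
        counter_update = {"numUnread": "+1"}

        # 3. Apply; the recorded etag guards against a concurrent update
        requests.post(f"{BASE}/MailboxDB/MessageMeta/{mbox}/{msg_id}",
                      json=message_update,
                      headers={"If-Match": etag} if etag else {})
        requests.post(f"{BASE}/MailboxDB/MessageCounter/{mbox}", json=counter_update)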
13. Espresso Architecture
14.–19. (Architecture diagrams; no transcribable text)
20. Cluster Management and Fault Tolerance
21. Generic Cluster Manager: Apache Helix
    • Generic cluster management
      – State model + constraints
      – Ideal state of distribution of partitions across the cluster
      – Migrate cluster from current state to ideal state
    • More info
      – SoCC 2012
      – http://helix.incubator.apache.org
22. Espresso Partition Layout: Master, Slave
    • 3 Storage Engine nodes, 2-way replication, managed by Apache Helix
      Node 1: masters P1–P4;  slaves P5, P6, P9, P10
      Node 2: masters P5–P8;  slaves P1, P2, P11, P12
      Node 3: masters P9–P12; slaves P3, P4, P7, P8
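Helix maintains an ideal state that looks like the layout above. As a rough illustration, here is a sketch that produces a balanced master/slave assignment for 12 partitions over 3 nodes; the round-robin placement is an assumption, not Helix's actual rebalancing algorithm.

    # Sketch: balanced master/slave layout for 2-way replication.
    # Assumes num_partitions is divisible by the node count; purely illustrative.
    def ideal_state(num_partitions: int, nodes: list[str]) -> dict[str, dict[str, str]]:
        layout, n = {}, len(nodes)
        per_node = num_partitions // n
        for p in range(num_partitions):
            m = p // per_node                      # masters assigned in contiguous blocks
            s = (m + 1 + p % (n - 1)) % n          # slave spread over the remaining nodes
            layout[f"P{p + 1}"] = {"master": nodes[m], "slave": nodes[s]}
        return layout

    for part, roles in ideal_state(12, ["Node1", "Node2", "Node3"]).items():
        print(part, roles)

Every partition ends up with its master and slave on different nodes, and each node carries four masters and four slaves, mirroring the 3-node layout on the slide.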
23. Cluster Management
    • Cluster Expansion
    • Node Failover
24. Cluster Expansion
    • Initial state with 3 storage nodes; an empty Node 4 joins the cluster
    • Step 1: Compute new ideal state
25. Cluster Expansion
    • Step 2: Bootstrap the new node’s partitions by restoring from backups
      (snapshots of P1, P4, P7, P8, P9, P12 are restored onto Node 4)
26. Cluster Expansion
    • Step 3: Catch up from the live replication stream
27. Cluster Expansion
    • Step 4: Migrate masters and slaves to rebalance
28. Cluster Expansion
    • Partitions are balanced; the router starts sending traffic to the new node
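The expansion boils down to recomputing a balanced assignment and moving as few partitions as possible onto the new node. A minimal sketch of that rebalancing step follows; the greedy strategy is an illustrative assumption, not Espresso's or Helix's actual algorithm.

    # Sketch: rebalance the partition-to-node map when a new node joins, moving
    # as few partitions as possible (the idea behind Steps 1 and 4 above).
    from collections import defaultdict

    def rebalance(assignment: dict[str, str], nodes: list[str]) -> dict[str, str]:
        target = len(assignment) // len(nodes)        # partitions each node should own
        by_node = defaultdict(list)
        for part, node in assignment.items():
            by_node[node].append(part)
        new_assignment = dict(assignment)
        for node in nodes:
            while len(by_node[node]) > target:        # overloaded node gives up a partition
                part = by_node[node].pop()
                dest = min(nodes, key=lambda n: len(by_node[n]))
                by_node[dest].append(part)
                new_assignment[part] = dest
        return new_assignment

    # 12 masters on 3 nodes; Node4 joins and only 3 of the 12 partitions move
    before = {f"P{i + 1}": f"Node{i // 4 + 1}" for i in range(12)}
    after = rebalance(before, ["Node1", "Node2", "Node3", "Node4"])
    print(sum(1 for p in before if before[p] != after[p]), "partitions moved")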
29. Node Failover
    • During failure or planned maintenance
30. Node Failover
    • Step 1: Detect node failure
31. Node Failover
    • Step 2: Compute new ideal state for promoting slaves to master
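A small sketch of the promotion step: for each partition mastered by the failed node, pick a surviving slave to promote. Choosing the replica with the least replication lag follows the rule stated later in the deck (slide 44); the data structures are illustrative.

    # Sketch: promote the least-lagging surviving slave for each orphaned partition.
    def promote_slaves(masters: dict[str, str],
                       slaves: dict[str, dict[str, float]],
                       failed_node: str) -> dict[str, str]:
        """masters: partition -> master node; slaves: partition -> {slave node: lag (s)}."""
        new_masters = dict(masters)
        for partition, master in masters.items():
            if master != failed_node:
                continue
            candidates = {n: lag for n, lag in slaves[partition].items() if n != failed_node}
            new_masters[partition] = min(candidates, key=candidates.get)  # least lag wins
        return new_masters

    masters = {"P4": "Node4", "P8": "Node4", "P12": "Node4", "P1": "Node1"}
    slaves = {"P4": {"Node1": 0.12}, "P8": {"Node2": 0.05},
              "P12": {"Node3": 0.30}, "P1": {"Node2": 0.07}}
    print(promote_slaves(masters, slaves, "Node4"))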
32. Failover Performance
33. Secondary Indexing
34. Espresso Secondary Indexing
    • Local secondary index requirements
      – Read after write
      – Consistent with primary data under failure
      – Rich query support: match, prefix, range, text search
      – Cost-to-serve proportional to working set
    • Pluggable index implementations
      – MySQL B-Tree
      – Inverted index using Apache Lucene with MySQL backing store
      – Inverted index using Prefix Index
      – Fastbit based bitmap index
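One way to picture "pluggable" index implementations is a common interface that each engine implements. The class and method names below are hypothetical, purely to illustrate the idea; they are not Espresso's internal API.

    # Hypothetical sketch of a pluggable local secondary-index interface.
    from abc import ABC, abstractmethod

    class SecondaryIndex(ABC):
        @abstractmethod
        def index(self, key: str, document: dict) -> None:
            """Update index entries for a document alongside the primary write."""

        @abstractmethod
        def query(self, partition_key: str, query: str, start: int, count: int) -> list[str]:
            """Return matching document keys for a query like '+isUnread:true +isInbox:true'."""

    class MySQLBTreeIndex(SecondaryIndex):
        def __init__(self):
            self.rows: list[tuple[str, dict]] = []     # stand-in for a MySQL B-Tree table

        def index(self, key, document):
            self.rows.append((key, document))

        def query(self, partition_key, query, start, count):
            # toy parser: space-separated "+field:value" terms, all of which must match
            predicates = [term.lstrip("+").split(":", 1) for term in query.split()]
            hits = [k for k, d in self.rows
                    if k.startswith(partition_key)
                    and all(str(d.get(f)).lower() == v for f, v in predicates)]
            return hits[start:start + count]

    idx = MySQLBTreeIndex()
    idx.index("bob/1", {"isUnread": True, "isInbox": True})
    print(idx.query("bob", "+isUnread:true +isInbox:true", start=0, count=15))  # ['bob/1']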
35. Lucene based implementation
    • Requires the entire index to be memory-resident to support low-latency query
      response times
    • For the Mailbox application, we have two options
36. Optimizations for Lucene based implementation
    • Concurrent transactions on the same Lucene index lead to inconsistency
      – Need to acquire a lock
    • Opening an index repeatedly is expensive
      – Group commit to amortize index opening cost (several queued requests are
        applied in a single write)
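A minimal sketch of the group-commit idea, assuming a background committer thread that drains queued updates and applies them under a single lock; the threading structure is an assumption, not Espresso's implementation.

    # Sketch of group commit: writers enqueue index updates, and one committer
    # drains the queue and applies the whole batch while the index is opened once.
    import queue
    import threading
    import time

    updates: "queue.Queue[tuple[str, dict]]" = queue.Queue()
    index_lock = threading.Lock()

    def submit(doc_key: str, fields: dict) -> None:
        updates.put((doc_key, fields))

    def committer(apply_batch) -> None:
        while True:
            batch = [updates.get()]                   # block for the first update...
            while not updates.empty():
                batch.append(updates.get_nowait())    # ...then drain whatever else queued up
            with index_lock:                          # one open/commit cycle for the batch
                apply_batch(batch)

    threading.Thread(target=committer,
                     args=(lambda batch: print("commit", len(batch), "updates"),),
                     daemon=True).start()
    submit("/MailboxDB/MessageMeta/bob/1", {"unread": "false"})
    time.sleep(0.1)                                   # give the committer a moment to run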
37. Optimizations for Lucene based implementation
    • High value users of the site accumulate large mailboxes
      – Query performance degrades with a large index
    • Performance shouldn’t get worse with more usage!
    • Time Partitioned Indexes: partition the index into buckets based on created time
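A small sketch of time-partitioned indexes, assuming monthly buckets that are searched newest-first and abandoned once enough hits are found; the bucket granularity and early-exit rule are assumptions.

    # Sketch: index messages into per-month buckets, query newest buckets first.
    from collections import defaultdict
    from datetime import datetime

    buckets: dict[str, list[dict]] = defaultdict(list)     # "YYYY-MM" -> indexed messages

    def index_message(msg: dict) -> None:
        buckets[msg["created"].strftime("%Y-%m")].append(msg)

    def search_unread(limit: int) -> list[dict]:
        hits: list[dict] = []
        for key in sorted(buckets, reverse=True):           # newest buckets first
            hits += [m for m in buckets[key] if m["unread"]]
            if len(hits) >= limit:                           # stop before old, cold buckets
                break
        return hits[:limit]

    index_message({"created": datetime(2013, 5, 1), "unread": True, "subject": "hi"})
    index_message({"created": datetime(2012, 1, 3), "unread": True, "subject": "old"})
    print([m["subject"] for m in search_unread(limit=1)])    # -> ['hi']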
38. Espresso in Production
39. Espresso in Production
    • Unified Social Content Platform – social activity aggregation
    • High Read:Write ratio
40. Espresso in Production
    • InMail – allows members to communicate with each other
    • Large storage footprint
    • Low latency requirement for secondary index queries involving text search and
      relational predicates
41. Performance
    • Average failover latency with 1024 partitions is around 300ms
    • Primary data reads and writes
      – Single storage node on SSD
      – Average row size = 1KB

      Operation | Average Latency | Average Throughput
      Reads     | ~3ms            | 40,000 per second
      Writes    | ~6ms            | 20,000 per second
42. Performance
    • Partition-key level secondary index using Lucene
      – One index per mailbox use-case
      – Base data on SAS, indexes on SSDs
      – Average throughput per index = ~1,000 per second
        (after the group commit and partitioned index optimizations)

      Operation                             | Average Latency
      Queries (average of 5 indexed fields) | ~20ms
      Writes (around 30 indexed fields)     | ~20ms
43. Durability and Consistency
    • Within a Data Center
    • Across Data Centers
44. Durability and Consistency
    • Within a Data Center
      – Write latency vs Durability
    • Asynchronous replication
      – May lead to data loss
      – Tooling can mitigate some of this
    • Semi-synchronous replication
      – Wait for at least one relay to acknowledge
      – During failover, slaves wait for catchup
    • Consistency over availability
    • Helix selects the slave with the least replication lag to take over mastership
    • Failover time is ~300ms in practice
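To make the semi-synchronous trade-off concrete, a sketch in which the master acknowledges a write only after its local commit plus at least one relay acknowledgement; the relay interface is an assumption.

    # Sketch of semi-synchronous replication: commit locally, then require at
    # least one relay ack before acknowledging the client.
    def replicate_semi_sync(commit_local, relays, event, min_acks: int = 1) -> bool:
        commit_local(event)                      # durable on the master first
        acks = 0
        for relay in relays:                     # in practice these sends are concurrent
            try:
                if relay.send(event):
                    acks += 1
            except ConnectionError:
                continue
            if acks >= min_acks:
                return True                      # safe to acknowledge the client
        return False                             # write is local-only; durability at risk

    class FakeRelay:
        def send(self, event) -> bool:
            return True

    print(replicate_semi_sync(lambda e: None, [FakeRelay(), FakeRelay()], {"op": "put"}))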
45. Durability and Consistency
    • Across data centers
      – Asynchronous replication
      – Stale reads possible
      – Active-active: conflict resolution via last-writer-wins
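A minimal sketch of last-writer-wins resolution for active-active replication, assuming each update carries a timestamp and an origin data center used as a tie-breaker; both fields are illustrative assumptions.

    # Sketch of last-writer-wins: when the same document is updated in two data
    # centers, the version with the later (timestamp, origin) key is kept.
    def resolve_lww(local: dict, remote: dict) -> dict:
        local_key = (local["ts"], local["origin"])
        remote_key = (remote["ts"], remote["origin"])
        return local if local_key >= remote_key else remote

    a = {"ts": 1370000000.120, "origin": "DC1", "doc": {"unread": "false"}}
    b = {"ts": 1370000000.450, "origin": "DC2", "doc": {"unread": "true"}}
    print(resolve_lww(a, b)["doc"])   # DC2 wrote later, so its version wins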
46. Lessons learned
    • Dealing with transient failures
    • Planned upgrades
    • Slave reads
    • Storage devices
      – SSDs vs SAS disks
    • Scaling cluster management
47. Future work
    • Coprocessors
      – Synchronous, Asynchronous
    • Richer query processing
      – Group-by, Aggregation
48. Key Takeaways
    • Espresso is a timeline consistent, document-oriented distributed database
    • Feature rich: secondary indexing, transactions over related documents, seamless
      integration with the data ecosystem
    • In production since June 2012, serving several key use-cases
49. Questions?