Data Infrastructure at LinkedIn

This talk was given by Kapil Surlaker (Staff Software Engineer @ LinkedIn) at the 28th IEEE International Conference on Data Engineering (ICDE 2012).

  1. Data Infrastructure at LinkedIn
     Kapil Surlaker
     http://www.linkedin.com/in/kapilsurlaker
     @kapilsurlaker
  2. Outline: LinkedIn Products, Data Ecosystem, LinkedIn Data Infrastructure Solutions, Next Play
  3. LinkedIn By The Numbers
     • 150M+ users (as of February 9, 2012)
     • ~4.2B people searches in 2011
     • >2M companies with LinkedIn Company Pages (as of December 31, 2011)
     • 16 languages
     • 75% of Fortune 100 companies use LinkedIn to hire (as of September 30, 2011)
  4. Broad Range of Products & Services
  5. User Profiles: large dataset, medium writes, very high reads, freshness <1s
  6. Communications: large dataset, high writes, high reads, freshness <1s
  7. People You May Know: large dataset, compute-intensive, high reads, freshness ~hours
  8. LinkedIn Today: moving dataset, high writes, high reads, freshness ~minutes
  9. Outline: LinkedIn Products, Data Ecosystem, LinkedIn Data Infrastructure Solutions, Next Play
  10. Three Paradigms: Simplifying the Data Continuum
      • Online (activity that should be reflected immediately): Member Profiles, Company Profiles, Connections, Communications
      • Nearline (activity that should be reflected soon): LinkedIn Today, Profile Standardization, News, Recommendations, Search, Communications
      • Offline (activity that can be reflected later): People You May Know, Connection Strength, News, Recommendations, Next best idea
  11. LinkedIn Product Architecture
  12. LinkedIn Product Architecture
  13. LinkedIn Product Architecture
  14. LinkedIn Data Infrastructure Solutions: Databus (Timeline-Consistent Change Data Capture)
  15. Databus at LinkedIn (architecture diagram): a relay captures on-line changes from the source DB into an in-memory event window; clients use the Databus client library to deliver those changes to consumers; a bootstrap service serves a consistent snapshot (plus older changes) to consumers that fall behind the relay's window.
  16. Databus at LinkedIn (continued)
      • Transport independent of the data source: Oracle, MySQL, …
      • Transactional semantics; in-order, at-least-once delivery
      • Tens of relays, hundreds of sources
      • Low latency: milliseconds
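
The consumption pattern on these two slides (bootstrap from a consistent snapshot, then switch to the online change stream with in-order, at-least-once delivery) can be sketched as follows. This is a hypothetical Java illustration of the pattern only; the interface and class names are assumptions, not the actual Databus client library API.

```java
// Hypothetical sketch of a Databus-style change consumer; names are
// illustrative, not the real Databus client API.
interface ChangeEvent {
    String source();   // e.g. "member_profile"
    long scn();        // commit sequence number from the source DB
    byte[] payload();  // serialized after-image of the changed row
}

interface ChangeConsumer {
    // Invoked for every captured change, in commit order, at least once.
    void onEvent(ChangeEvent event);
}

class TimelineConsistentClient {
    private final ChangeConsumer consumer;
    private long lastScn = -1;

    TimelineConsistentClient(ChangeConsumer consumer) {
        this.consumer = consumer;
    }

    // Bootstrap from a consistent snapshot taken at some SCN, then catch up
    // from the relay's in-memory event window without re-applying old changes.
    void run(Iterable<ChangeEvent> snapshot, Iterable<ChangeEvent> onlineStream) {
        for (ChangeEvent e : snapshot) {
            consumer.onEvent(e);
            lastScn = Math.max(lastScn, e.scn());
        }
        for (ChangeEvent e : onlineStream) {
            if (e.scn() > lastScn) {
                consumer.onEvent(e);
                lastScn = e.scn();
            }
        }
    }
}
```
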
  17. LinkedIn Product Architecture
  18. LinkedIn Product Architecture
  19. LinkedIn Data Infrastructure Solutions: Voldemort (Highly-Available Distributed Key-Value Store)
  20. Voldemort: Architecture
      • Pluggable components; tunable consistency / availability
      • Key/value model, server-side "views"
      • 10 clusters, 100+ nodes; largest cluster: 10K+ qps
      • Average latency: 3ms
      • Hundreds of stores; largest store: 2.8TB+
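
The open-source Voldemort client exposes this key/value model through a small API with versioned values. A minimal sketch; the bootstrap URL, store name, and keys are placeholders:

```java
import voldemort.client.ClientConfig;
import voldemort.client.SocketStoreClientFactory;
import voldemort.client.StoreClient;
import voldemort.client.StoreClientFactory;
import voldemort.versioning.Versioned;

public class VoldemortExample {
    public static void main(String[] args) {
        // Bootstrap URL and store name are placeholders for this sketch.
        StoreClientFactory factory = new SocketStoreClientFactory(
                new ClientConfig().setBootstrapUrls("tcp://localhost:6666"));
        StoreClient<String, String> client = factory.getStoreClient("member_store");

        // Reads return a Versioned wrapper carrying the value's vector clock.
        Versioned<String> value = client.get("member:12345");
        if (value == null) {
            client.put("member:12345", "initial value");
        } else {
            value.setObject("updated value");
            client.put("member:12345", value); // write against the existing version
        }
        factory.close();
    }
}
```

The tunable consistency mentioned on the slide lives in the cluster and store configuration (replication factor and required reads/writes per operation) rather than in this client code.
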
  21. LinkedIn Product Architecture
  22. LinkedIn Data Infrastructure Solutions: Kafka (High-Volume, Low-Latency Messaging System)
  23. LinkedIn Product Architecture
  24. Kafka: Architecture (diagram): the web tier pushes events to the broker tier, which does sequential writes (~100 MB/sec in); consumers pull events through a client library exposing per-topic iterators, with brokers serving data via sendfile (~200 MB/sec out); topics are partitioned, consumers track (topic, partition) offsets, and ZooKeeper handles partition ownership and offset management.
  25. Kafka: Architecture (continued)
      • At-least-once delivery; very high throughput; low latency; durability
      • Billions of events, TBs per day; 50K+ events per second at peak
      • Inter- and intra-cluster replication
      • End-to-end latency: a few seconds
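
For a concrete feel of the publish side, here is a minimal producer using a recent Apache Kafka Java client (a newer API than the 0.7-era client in production when this talk was given); the broker address, topic, and payload are placeholders:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ActivityEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("acks", "all"); // trade a little latency for durability

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // The key picks the partition, so events for one member stay in order.
            producer.send(new ProducerRecord<>("page-view-events", "member:12345",
                    "{\"page\": \"/in/kapilsurlaker\"}"));
        }
    }
}
```
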
  26. LinkedIn Product Architecture
  27. LinkedIn Data Infrastructure Solutions: Espresso (Indexed, Timeline-Consistent Distributed Data Store)
  28. Application View
      • Hierarchical data model
      • Rich functionality on resources: conditional updates, partial updates, atomic counters
      • Rich functionality within resource groups: transactions, secondary index, text search
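
Espresso addresses resources hierarchically over HTTP, so a conditional update maps naturally onto entity-tag preconditions. The sketch below is purely illustrative: the host, port, URI layout, and header behavior are assumptions, not Espresso's published API.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ConditionalUpdateExample {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // Hypothetical hierarchical resource: /<database>/<table>/<member id>/<message id>
        URI message = URI.create("http://espresso.example.com:12345/MailboxDB/Message/176/42");

        // Read the resource and remember its version (ETag).
        HttpResponse<String> current = client.send(
                HttpRequest.newBuilder(message).GET().build(),
                HttpResponse.BodyHandlers.ofString());
        String etag = current.headers().firstValue("ETag").orElse("");

        // Conditional update: applied only if no one modified the row in between.
        HttpRequest update = HttpRequest.newBuilder(message)
                .header("If-Match", etag)
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString("{\"isRead\": true}"))
                .build();
        HttpResponse<String> result = client.send(update, HttpResponse.BodyHandlers.ofString());
        System.out.println("Status: " + result.statusCode()); // 412 would mean the precondition failed
    }
}
```
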
  29. Partitioning
  30. Espresso Partition Layout: Master, Slave (diagram): 3 storage-engine nodes with 2-way replication; the database is split into 12 partitions (P.1 through P.12), each with a master replica on one node and a slave replica on another (e.g., Node 1 is master for P.1 and slave for P.5); the cluster manager maintains the routing table mapping every partition to its master and slave nodes.
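
The layout on this slide reduces to two mappings: key to partition (hashing), and partition to {master node, slave node} (the routing table owned by the cluster manager). A toy Java sketch of that routing, assuming the slide's 12 partitions, 3 nodes, and 2-way replication; the assignment rule is invented for illustration:

```java
import java.util.HashMap;
import java.util.Map;

public class PartitionRouter {
    static final int NUM_PARTITIONS = 12; // P.1 .. P.12 as on the slide
    static final int NUM_NODES = 3;

    // Routing table: partition -> [master node, slave node]. In Espresso the
    // cluster manager computes and publishes this table; here it is hard-coded.
    static final Map<Integer, int[]> ROUTING = new HashMap<>();
    static {
        for (int p = 1; p <= NUM_PARTITIONS; p++) {
            int master = ((p - 1) % NUM_NODES) + 1; // spread masters over nodes 1..3
            int slave = (master % NUM_NODES) + 1;   // slave lives on the next node
            ROUTING.put(p, new int[] { master, slave });
        }
    }

    // Stable hash of the key space into partitions P.1 .. P.12.
    static int partitionFor(String key) {
        return Math.floorMod(key.hashCode(), NUM_PARTITIONS) + 1;
    }

    public static void main(String[] args) {
        String key = "member:12345";
        int partition = partitionFor(key);
        int[] nodes = ROUTING.get(partition);
        System.out.printf("key=%s -> partition P.%d, master=node %d, slave=node %d%n",
                key, partition, nodes[0], nodes[1]);
    }
}
```
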
  31. Espresso: System Components
  32. Generic Cluster Manager: Helix
      • Generic distributed state model
      • Centralized config management
      • Automatic load balancing
      • Fault tolerance
      • Health monitoring
      • Cluster expansion and rebalancing
      • Used by Espresso, Databus, and Search
      • Open source April 2012: https://github.com/linkedin/helix
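
Helix drives each partition replica through a generic state model (for Espresso, a master/slave model) and calls back into the node whenever the controller orders a transition. A compact sketch of such a callback class, assuming the Apache Helix participant API (Helix moved to Apache after the github.com/linkedin/helix release noted above); the factory and HelixManager wiring that registers this model are omitted:

```java
import org.apache.helix.NotificationContext;
import org.apache.helix.model.Message;
import org.apache.helix.participant.statemachine.StateModel;
import org.apache.helix.participant.statemachine.StateModelInfo;
import org.apache.helix.participant.statemachine.Transition;

// Illustrative master/slave state model for one partition replica.
@StateModelInfo(initialState = "OFFLINE", states = { "MASTER", "SLAVE", "OFFLINE" })
public class MasterSlaveStateModel extends StateModel {

    @Transition(from = "OFFLINE", to = "SLAVE")
    public void onBecomeSlaveFromOffline(Message message, NotificationContext context) {
        // Open the partition's storage and start replicating from the current master.
        System.out.println("Opening " + message.getPartitionName() + " as SLAVE");
    }

    @Transition(from = "SLAVE", to = "MASTER")
    public void onBecomeMasterFromSlave(Message message, NotificationContext context) {
        // Catch up to the end of the change log, then start accepting writes.
        System.out.println("Promoting " + message.getPartitionName() + " to MASTER");
    }

    @Transition(from = "MASTER", to = "SLAVE")
    public void onBecomeSlaveFromMaster(Message message, NotificationContext context) {
        // Stop accepting writes and resume replicating from the new master.
        System.out.println("Demoting " + message.getPartitionName() + " to SLAVE");
    }
}
```
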
  33. Espresso @ LinkedIn
      • Launched first application Oct 2011
      • Open source 2012
      • Future: multi-datacenter support, global secondary indexes, time-partitioned data
  34. LinkedIn Product Architecture
  35. Acknowledgments: Siddharth Anand, Aditya Auradkar, Chavdar Botev, Vinoth Chandar, Shirshanka Das, Dave DeMaagd, Alex Feinberg, John Fung, Phanindra Ganti, Mihir Gandhi, Lei Gao, Bhaskar Ghosh, Kishore Gopalakrishna, Brendan Harris, Rajappa Iyer, Swaroop Jagadish, Joel Koshy, Kevin Krawez, Jay Kreps, Shi Lu, Sunil Nagaraj, Neha Narkhede, Sasha Pachev, Igor Perisic, Lin Qiao, Tom Quiggle, Jun Rao, Bob Schulman, Abraham Sebastian, Oliver Seeliger, Adam Silberstein, Boris Shkolnik, Chinmay Soman, Subbu Subramaniam, Roshan Sumbaly, Kapil Surlaker, Sajid Topiwala, Cuong Tran, Balaji Varadarajan, Jemiah Westerman, Zach White, Victor Ye, David Zhang, and Jason Zhang
  36. Questions?
