Data Infrastructure at LinkedIn
Jun Rao and Sam Shah

LinkedIn Confidential ©2013 All Rights Reserved

LinkedIn Infrastructure (analytics@webscale, at fb 2013)


This presentation was given at analytics@webscale in 2013 (http://analyticswebscale.splashthat.com/?em=187&utm_campaign=website&utm_source=sg&utm_medium=em)

Published in: Technology

  1. Data Infrastructure at LinkedIn. Jun Rao and Sam Shah. LinkedIn Confidential ©2013 All Rights Reserved
  2. Outline
      1. LinkedIn introduction
      2. Online/nearline infrastructure
      3. Offline infrastructure
      4. Conclusion
  3. The World’s Largest Professional Network. Connecting Talent with Opportunity. At scale…
      • 200M+ Members Worldwide
      • 2 new Members Per Second
      • 100M+ Monthly Unique Visitors
      • 2M+ Company Pages
  4. Two Product Families
      For Members (Professionals):
      • People You May Know
      • Who’s Viewed My Profile
      • Jobs You May Be Interested In
      • News/Sharing
      • Today
      • Search
      • Subscriptions
      For Partners (Companies): Hire, Market, Sell
      Supporting layers: Science and Analytics; Data Infrastructure
      Data: Actions, Profiles, Connections, Content
  5. The Big-Data Feedback Loop (diagram; labels: Member, Product, Data, Science, Refinement, Engagement, Value, Insights, Virality, Signals, Analytics, Scale, Infrastructure)
  6. LinkedIn Data Infrastructure: Three-Phase Abstraction (diagram labels: Users, Application, Online Data Infra, Near-Line Infra, Offline Data Infra)
      Infrastructure, latency and freshness requirements, and products:
      • Online: activity that should be reflected immediately. Products: Member Profiles, Company Profiles, Connections, Messages, Endorsements, Skills
      • Near-Line: activity that should be reflected soon. Products: Activity Streams, Profile Standardization, News, Recommendations, Search, Messages
      • Offline: activity that can be reflected later. Products: People You May Know, Connection Strength, News, Recommendations, Next best idea…
  7. LinkedIn Data Infrastructure: Sample Stack. Infra challenges in 3-phase ecosystem are diverse, complex and specific. Some off-the-shelf. Significant investment in home-grown, deep and interesting platforms
  8. LinkedIn Data Infrastructure Solutions. Voldemort: Highly-Available Distributed KV Store
      • Key/value access at scale
  9. Voldemort: Architecture
      • Pluggable components
      • Tunable consistency / availability
      • Key/value model, server side “views”
      • 10 clusters, 100+ nodes
      • Largest cluster – 10K+ qps
      • Avg latency: 3ms
      • Hundreds of Stores
      • Largest store – 2.8TB+
  10. LinkedIn Data Infrastructure Solutions. Espresso: Indexed Timeline-Consistent Distributed Data Store
      • Fills the gap between Oracle and KV store
  11. Espresso: System Components
      • Hierarchical data model
      • Timeline consistency
      • Rich functionality
      • Transactions
      • Secondary index
      • Text search
      • Partitioning/replication
      • Change propagation
  12. Generic Cluster Manager: Helix
      • Generic Distributed State Model
      • Config Management
      • Automatic Load Balancing
      • Fault tolerance
      • Cluster expansion and rebalancing
      • Espresso, Databus and Search
      • Open Source Apr 2012
      • https://github.com/linkedin/helix
  13. LinkedIn Data Infrastructure Solutions. Databus: Timeline-Consistent Change Data Capture
      • Deliver data store changes to apps
  14. Databus at LinkedIn (diagram labels: DB, Capture Changes, Relay, Event Win, On-line Changes, Bootstrap, Consistent Snapshot at U, Databus Client Lib, Client, Consumer 1 … Consumer n)
      • Transport independent of data source: Oracle, MySQL, …
      • Transactional semantics
      • In order, at least once delivery
      • Tens of relays
      • Hundreds of sources
      • Low latency - milliseconds
  15. LinkedIn Data Infrastructure Solutions. Kafka: High-Volume Low-Latency Messaging System
      • Log aggregation and queuing
  16. Kafka Architecture (diagram: Producers; Brokers 1 to 4 hosting partitions topic1-part1, topic1-part2, topic2-part1 and topic2-part2, each shown on several brokers; Zookeeper; Consumers)
      Key features
      • Scale-out architecture
      • Automatic load balancing
      • High throughput/low latency
      • Rewindability
      • Intra-cluster replication
      Per day stats
      • writes: 10+ billion messages
      • reads: 50+ billion messages
  17. LinkedIn Data Infrastructure: A few take-aways
      1. Building infrastructure in a hyper-growth environment is challenging.
      2. Few vs Many: Balance over-specialized (agile) vs generic efforts (leverage-able) platforms (*)
      3. Balance open-source products with homegrown platforms (**)
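
The Kafka architecture slide lists topics split into partitions and names "rewindability" among the key features. A minimal sketch of that idea follows, assuming a toy in-memory log rather than Kafka's actual client API. The names `PartitionedLog`, `Consumer`, `poll`, and `rewind` are hypothetical and chosen only for illustration; replication, brokers, and Zookeeper coordination are left out.

```python
import zlib
from collections import defaultdict


class PartitionedLog:
    """Toy stand-in for one topic: a fixed set of append-only partitions."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, message):
        # Route by a stable hash of the key so that all messages for one
        # key land in the same partition and keep their relative order.
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[partition].append(message)
        offset = len(self.partitions[partition]) - 1
        return partition, offset

    def read(self, partition, offset, max_messages=10):
        # Reading never removes anything; messages stay addressable by offset.
        return self.partitions[partition][offset:offset + max_messages]


class Consumer:
    """Tracks its own read position per partition (hypothetical API)."""

    def __init__(self, log):
        self.log = log
        self.offsets = defaultdict(int)

    def poll(self, partition, max_messages=10):
        batch = self.log.read(partition, self.offsets[partition], max_messages)
        self.offsets[partition] += len(batch)
        return batch

    def rewind(self, partition, offset):
        # Rewinding is just moving the consumer's position backwards;
        # the log itself is untouched, so earlier messages can be re-read.
        self.offsets[partition] = offset


log = PartitionedLog(num_partitions=2)
partition, first_offset = log.append("member-42", "profile-view")
log.append("member-42", "connection-added")

consumer = Consumer(log)
print(consumer.poll(partition))     # both messages, in append order
consumer.rewind(partition, first_offset)
print(consumer.poll(partition, 1))  # the first message again
```

Because each consumer holds its own offset in this sketch, re-reading old data costs nothing on the log side, which is one plausible reason the same log can serve many more reads than writes.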
