LinkedIn Infrastructure (analytics@webscale, at fb 2013)

This presentation was given at analytics@webscale in 2013 (http://analyticswebscale.splashthat.com/?em=187&utm_campaign=website&utm_source=sg&utm_medium=em)

Published in: Technology
  • Enterprise-facing is all about segmentation and connections. Our base data leads to revenue-generating products. Enterprise application-building problems with deterministic life-cycles. Science is key for targeting and matching (e.g. CAP, Marketing Solutions). Key back-office play for hiring, sales and marketing for 85% of the Fortune 500.
  • The transition needs to be good. Products => data infrastructure requirements in the previous slide. Not all products place the same latency and freshness requirements on our data infrastructure. The way we bucketize this is… News and recommendations show up in both nearline and offline.
  • Data integration is hard: it requires sane, consistent metadata across systems. Have a schema that works across the 3 phases. We want rich, evolving schemas, and we want to push as much data cleaning and conforming as possible to the source and upstream, so that both near-line and off-line benefit. Sessionization logic is in the data warehouse (WH), which makes it hard for near-line systems to use. Build an extensible system where changing the schema in one phase does not break downstream systems. Don't build over-specialized systems: e.g. rather than a monitoring system for PYMK, build Azkaban.
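The note about changing a schema in one phase without breaking downstream systems can be illustrated with a minimal Python sketch. It assumes one common convention that the talk does not spell out: fields added later must carry a default, and a reader fills in defaults for fields missing from older records and ignores fields it does not know. The names `SCHEMA_V2`, `read_record`, and all field names are hypothetical.

```python
# Sentinel marking fields that every record must supply.
_REQUIRED = object()

# Hypothetical reader schema: field name -> default used when an older
# record lacks the field. "locale" stands in for a field added later.
SCHEMA_V2 = {
    "member_id": _REQUIRED,
    "page": _REQUIRED,
    "locale": "en_US",
}


def read_record(record, schema):
    """Project a raw record onto the reader's schema.

    Missing optional fields get their default, unknown fields are ignored,
    and a missing required field raises ValueError.
    """
    out = {}
    for field, default in schema.items():
        if field in record:
            out[field] = record[field]
        elif default is _REQUIRED:
            raise ValueError(f"missing required field: {field}")
        else:
            out[field] = default
    return out


# A record written before "locale" existed is still readable.
old_record = {"member_id": 1, "page": "home"}
upgraded = read_record(old_record, SCHEMA_V2)
```

Under this convention an upstream producer can add a defaulted field without forcing near-line or offline consumers to change in lockstep.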

    1. Data Infrastructure at LinkedIn. Jun Rao and Sam Shah. LinkedIn Confidential ©2013 All Rights Reserved
    2. Outline: (1) LinkedIn introduction; (2) Online/nearline infrastructure; (3) Offline infrastructure; (4) Conclusion
    3. The World’s Largest Professional Network: Connecting Talent with Opportunity. At scale… 200M+ members worldwide; 2 new members per second; 100M+ monthly unique visitors; 2M+ company pages
    4. Two Product Families. For Members (Professionals): People You May Know; Who’s Viewed My Profile; Jobs You May Be Interested In; News/Sharing; Today; Search; Subscriptions. For Partners (Companies): Hire; Market; Sell. Underlying layers: Science and Analytics; Data Infrastructure; Data (Actions, Profiles, Connections, Content)
    5. The Big-Data Feedback Loop. Diagram labels: Refinement, Engagement, Value, Member, Product, Insights, Virality, Data, Signals, Science, Analytics, Scale, Infrastructure
    6. LinkedIn Data Infrastructure: Three-Phase Abstraction. Diagram labels: Users, Application, Infrastructure (Online Data Infra, Near-Line Infra, Offline Data Infra). Latency & freshness requirements and products by phase:
       • Online: activity that should be reflected immediately. Products: Messages, Member Profiles, Endorsements, Company Profiles, Skills, Connections
       • Near-Line: activity that should be reflected soon. Products: Activity Streams, Profile Standardization, News, Recommendations, Search, Messages
       • Offline: activity that can be reflected later. Products: People You May Know, Connection Strength, News, Recommendations, Next best idea…
    7. LinkedIn Data Infrastructure: Sample Stack. Infra challenges in the 3-phase ecosystem are diverse, complex and specific. Some components are off-the-shelf, with significant investment in home-grown, deep and interesting platforms.
    8. LinkedIn Data Infrastructure Solutions. Voldemort: Highly-Available Distributed KV Store • Key/value access at scale
    9. Voldemort: Architecture • Pluggable components • Tunable consistency / availability • Key/value model, server-side “views”. Deployment stats: • 10 clusters, 100+ nodes • Largest cluster: 10K+ qps • Avg latency: 3 ms • Hundreds of stores • Largest store: 2.8 TB+
    10. LinkedIn Data Infrastructure Solutions. Espresso: Indexed Timeline-Consistent Distributed Data Store • Fills the gap between Oracle and a KV store
    11. Espresso: System Components • Hierarchical data model • Timeline consistency • Rich functionality • Transactions • Secondary index • Text search • Partitioning/replication • Change propagation
    12. Generic Cluster Manager: Helix • Generic distributed state model • Config management • Automatic load balancing • Fault tolerance • Cluster expansion and rebalancing • Used by Espresso, Databus and Search • Open source, Apr 2012 • https://github.com/linkedin/helix
    13. LinkedIn Data Infrastructure Solutions. Databus: Timeline-Consistent Change Data Capture • Deliver data store changes to apps
    14. Databus at LinkedIn. Diagram labels: DB, Capture Changes, Relay (Event Win), On-line Changes, Bootstrap, Consistent Snapshot at U, Client with Databus Client Lib, Consumer 1 … Consumer n. Properties: • Transport independent of data source: Oracle, MySQL, … • Transactional semantics • In-order, at-least-once delivery. Scale: • Tens of relays • Hundreds of sources • Low latency (milliseconds)
    15. LinkedIn Data Infrastructure Solutions. Kafka: High-Volume Low-Latency Messaging System • Log aggregation and queuing
    16. Kafka Architecture. Diagram labels: Producers; Brokers 1 to 4, each hosting topic partitions (topic1-part1, topic1-part2, topic2-part1, topic2-part2); Zookeeper; Consumers. Key features: • Scale-out architecture • Automatic load balancing • High throughput/low latency • Rewindability • Intra-cluster replication. Per-day stats: • Writes: 10+ billion messages • Reads: 50+ billion messages
    17. LinkedIn Data Infrastructure: A few take-aways. (1) Building infrastructure in a hyper-growth environment is challenging. (2) Few vs Many: balance over-specialized (agile) vs generic (leverage-able) platforms (*). (3) Balance open-source products with homegrown platforms (**).
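Two of the Kafka features listed on slide 16, partitioned topics and rewindability, can be sketched with a toy in-memory model in Python. This is not Kafka's API: `PartitionedLog`, `append`, and `read` are hypothetical names, and routing by a CRC32 hash of the key is an assumption chosen only to keep the example deterministic. The point it illustrates is that the log is append-only and each consumer tracks its own offset, so re-reading from an earlier offset is always possible.

```python
import zlib


class PartitionedLog:
    """Toy model of one topic split into append-only partitions."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, value):
        # A stable hash sends every message for a key to the same partition,
        # which preserves per-key ordering.
        p = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[p].append(value)
        return p, len(self.partitions[p]) - 1  # (partition, offset)

    def read(self, partition, offset, max_messages=10):
        # Reading never removes data; the caller supplies the offset.
        return self.partitions[partition][offset:offset + max_messages]


log = PartitionedLog(num_partitions=2)
p, first = log.append("member-42", "viewed profile")
log.append("member-42", "sent message")

offset = first
batch = log.read(p, offset)
offset += len(batch)          # the consumer advances its own position
replay = log.read(p, first)   # rewind: re-read from an earlier offset
```

Because a consumer may re-read after a rewind or a crash, it can see the same message more than once, which is why consumers in this style of system are usually written to tolerate duplicates.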
