Data Infrastructure at LinkedIn
Jun Rao and Sam Shah
LinkedIn Confidential ©2013 All Rights Reserved
Data Infrastructure at LinkedIn

This talk was given by Jun Rao (Staff Software Engineer at LinkedIn) and Sam Shah (Senior Engineering Manager at LinkedIn) at the Analytics@Webscale Technical Conference (June 2013).

Published in: Technology
Notes
  • Transition needs to be good. Products => data infrastructure requirements in previous slide. Products do not all place the same latency and freshness requirements on our data infrastructure. The way we bucketize this is… News and recommendations show up in both nearline and offline.
  • Not part of Kafka
  • Others: Oozie
  • Data integration is hard. Have sane, consistent metadata across systems. Have a schema that works across the three phases. We want rich, evolving schemas, and we want to push as much data cleaning as possible to the source and upstream so that both near-line and offline systems benefit. Sessionization logic lives in the warehouse (WH), which makes it hard for near-line systems to use. Build an extensible system where changing a schema in one phase does not break downstream systems. Don't build over-specialized systems (e.g. a monitoring system just for PYMK); build Azkaban instead.
  • Transcript of "Data Infrastructure at LinkedIn"

    1. Data Infrastructure at LinkedIn. Jun Rao and Sam Shah.
    2. Outline
        1. LinkedIn introduction
        2. Online/nearline infrastructure overview
        3. Infrastructure for data mining
        4. Conclusion
    3. The World's Largest Professional Network
        • 200M+ Members Worldwide
        • 2 new Members Per Second
        • 100M+ Monthly Unique Visitors
        • 2M+ Company Pages
        • Connecting Talent ↔ Opportunity. At scale…
    4. Member Profiles
        • Large dataset
        • Medium writes
        • Very high reads
        • Freshness <1s
    5. People You May Know
        • Large dataset
        • Compute intensive
        • High reads
        • Freshness ~hrs
    6. LinkedIn Today
        • Moving dataset
        • High writes
        • High reads
        • Freshness ~mins
    7. The Big-Data Feedback Loop
        [Diagram; labels: Value, Insights, Scale, Product, Science, Data, Member, Engagement, Virality, Signals, Refinement, Infrastructure, Analytics]
    8. LinkedIn Data Infrastructure: Three-Phase Abstraction
        [Diagram; labels: Users, Application, Online Data Infra, Near-Line Infra, Offline Data Infra]
        Infrastructure (latency and freshness requirements): products
        • Online (activity that should be reflected immediately): Member Profiles, Company Profiles, Connections, Messages, Endorsements, Skills
        • Near-Line (activity that should be reflected soon): Activity Streams, Profile Standardization, News, Recommendations, Search, Messages
        • Offline (activity that can be reflected later): People You May Know, Connection Strength, News, Recommendations, Next best idea…
    9. LinkedIn Data Infrastructure: Sample Stack
        • Infra challenges in the 3-phase ecosystem are diverse, complex and specific
        • Some off-the-shelf. Significant investment in home-grown, deep and interesting platforms
    10. Streaming Transactions
    11. Databus: Timeline-Consistent Change Data Capture (LinkedIn Data Infrastructure Solutions)
    12. Databus at LinkedIn
        [Diagram; labels: DB, Capture Changes, Relay, Event Win, Bootstrap, On-line Changes, Consistent Snapshot at U, Client, Databus ClientLib, Consumer 1 … Consumer n]
        • Transport independent of data source: Oracle, MySQL, …
        • Transactional semantics
        • In order, at least once delivery
        • Tens of relays
        • Hundreds of sources
        • Low latency (milliseconds)
    13. Scaling Core Databases
        [Diagram; labels: RO, RO, RO]
    14. Voldemort: Highly-Available Distributed KV Store (LinkedIn Data Infrastructure Solutions)
    15. Voldemort: Architecture
        • Pluggable components
        • Tunable consistency / availability
        • Key/value model, server side "views"
        • 10 clusters, 100+ nodes
        • Largest cluster: 10K+ qps
        • Avg latency: 3ms
        • Hundreds of stores
        • Largest store: 2.8TB+
    16. Streaming Non-transactional Events
        [Diagram; labels: Offline, Nearline Processing]
    17. Kafka: High-Volume Low-Latency Messaging System (LinkedIn Data Infrastructure Solutions)
    18. Kafka Architecture
        [Diagram; labels: Producer (×2), Consumer (×2), Zookeeper, Broker 1 to Broker 4; partitions topic1-part1, topic1-part2, topic2-part1 and topic2-part2 each appear three times across the brokers]
        Key features
        • Scale-out architecture
        • High throughput
        • Automatic load balancing
        • Intra-cluster replication
        Per day stats
        • Writes: 10+ billion messages
        • Reads: 50+ billion messages
    19. Filling in the Data Store Gap
        [Diagram; labels: Text Search]
    20. Espresso: Indexed Timeline-Consistent Distributed Data Store (LinkedIn Data Infrastructure Solutions)
    21. Application View
        • Hierarchical data model
        • Rich functionality on resources: conditional updates, partial updates, atomic counters
        • Rich functionality within resource groups: transactions, secondary index, text search
    22. Espresso: System Components
        • Partitioning/replication
        • Timeline consistency
        • Change propagation
    23. Generic Cluster Manager: Helix
        • Generic Distributed State Model
        • Config Management
        • Automatic Load Balancing
        • Fault tolerance
        • Cluster expansion and rebalancing
        • Espresso, Databus and Search
        • Open Source Apr 2012
        • https://github.com/linkedin/helix
    24. Infrastructure challenges in large-scale data mining. Putting it together.
    25. Top complaints from data scientists
        1. Getting the data in (Ingress ETL)
        2. Getting the data out (Egress)
        3. Workflow management
        4. Model of computation
        5. …
    26. Top complaints from data scientists
        1. Getting the data in (Ingress ETL)
        2. Getting the data out (Egress)
        3. Workflow management
        4. Model of computation
        5. …
    27. LinkedIn circa 2010
    28. O(n²) data integration complexity
    29. Infrastructure fragility
        • Can't get all data
        • Hard to operate
        • Multi-hour delay
        • Labor intensive
        • Slow
        • Does it work?
    30. Process fragility
        • Labor intensive
        • One man's cleaning…
        [Diagram; labels: FE, MT, BE, DT, FE Dev, BE Dev, ETL Team, ETL, DW/Hadoop]
    31. Data model
        {
          tracking_code=null,
          session_id=42,
          tracking_time=Tue Jul 31 07:27:25 PDT 2010,
          error_key=null,
          locale=en_us,
          browser_id=ddc61a81-5311-4859-be42-ca7dc7b941e3,
          member_id=1213,
          page_key=profile,
          tracking_info=Viewee=1214,lnl=f,nd=1,o=1214,^SP=pId- 'pro_stars',rslvd=t,vs=v,vid=1214,ps=EDU|EXP|SKIL|,
          error_id=null,
          page_type=FULL_PAGE,
          request_path=view
          ...
        }
    32. Data model (cont'd)
        {
          article_id=5560874437395353942,
          title=Five Good Reasons to Hire the Unemployed,
          language=en_US,
          article_source=bit.ly,
          url=aHR0cDovL3d3dy5vbmV0aGluZ25ldy5jb20vaW5kZXgucGhwL3dvcmsvMTAyLWZpdmUtZ29vZC1yZWFzb25zLXRvLWhpcmUtdGhlLXVuZW1wbG95ZWQK,
          ...
        }
    33. Problems
        1. Data integration across systems
        2. Fragile infrastructure
        3. Lack of proper data models (ad-hoc)
    34. LinkedIn 2013
    35. O(n) data integration
    36. Publish/subscribe commit log
    37. Data model
        • Hundreds of message types
        • Thousands of fields
        • What do they all mean?
        • What happens when they change?
    38. Data model
        1. Education
        2. Push data cleanliness upstream
        3. O(1) ETL
        4. Evidence-based correctness
    39. Data model
        • DDL for data definition and schema
        • Central versioned registry of all schemas
        • Schema review
        • Programmatic compatibility model: schema changes handled transparently
    40. Workflow
        1. Check in schema
        2. Code review
        3. Ship
        Seamless data load into downstream systems
    41. Audit trail
    42. Result: complete, verified copy of all data available
    43. Top complaints from data scientists
        1. Getting the data in (Ingress ETL)
        2. Getting the data out (Egress)
        3. Workflow management
        4. Model of computation
        5. …
    44. Egress
        store DATA into 'kafka://…' using Stream();
    45. Top complaints from data scientists
        1. Getting the data in (Ingress ETL)
        2. Getting the data out (Egress)
        3. Workflow management
        4. Model of computation
        5. …
    46. Workflows
        [Diagram; labels: Job A, Job B, Job C]
    47. Workflows
        [Diagram; labels: Job A, Job B, Job C, Push to Production]
    48. Workflows
        [Diagram; labels: Job A, Job B, Job C, Push to Production, Job X]
    49. Workflows
        [Diagram; labels: Job A, Job B, Job C, Push to Production, Job X, Push to QA]
    50. Real workflows are complicated
    51. Workflow management: Azkaban
        • Dependency management
        • Diverse job types (Pig, Hive, Java, …)
        • Scheduling
        • Monitoring
        • Configuration
        • Retry/restart on failure
        • Resource locking
        • Log collection
        • Historical information
    52. Workflow management: Azkaban
    53. Workflow management: Azkaban
    54. Top complaints from data scientists
        1. Getting the data in (Ingress ETL)
        2. Getting the data out (Egress)
        3. Workflow management
        4. Model of computation
        5. …
    55. Model of computation
        • Alternating Direction Method of Multipliers (ADMM)
        • Distributed Conjugate Gradient Descent (DCGD)
        • Distributed L-BFGS
        • Bayesian Distributed Learning (BDL)
        [Other labels on the slide: Graphs, Distributed learning, Near-line processing]
    56. LinkedIn Data Infrastructure: A few take-aways
        1. Building infrastructure in a hyper-growth environment is challenging.
        2. Few vs Many: Balance over-specialized (agile) vs generic efforts (leverage-able) platforms (*)
        3. Balance open-source products with home-grown platforms (**)
        4. Data Model and Integration e2e are key (*)
    57. Learning more: data.linkedin.com
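
Slide 12 lists "in order, at least once delivery" as a Databus property. The following is a minimal sketch of what that implies for a consumer, not Databus code: `ChangeEvent`, `Consumer`, `deliver` and the `scn` field are hypothetical names, and the sketch assumes every change carries a sequence number that increases in commit order. Because an event may be delivered more than once, the consumer keeps a checkpoint and ignores anything it has already applied.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ChangeEvent:
    scn: int    # hypothetical sequence number, increasing in commit order
    key: str
    value: str


@dataclass
class Consumer:
    """Applies change events idempotently by remembering the last SCN applied."""
    checkpoint: int = 0
    state: dict = field(default_factory=dict)

    def on_event(self, event: ChangeEvent) -> bool:
        # At-least-once delivery: the same event may arrive again after a retry.
        # Anything at or below the checkpoint has already been applied.
        if event.scn <= self.checkpoint:
            return False
        self.state[event.key] = event.value
        self.checkpoint = event.scn
        return True


def deliver(relay_log: list[ChangeEvent], consumer: Consumer, from_scn: int) -> int:
    """Replay events with scn > from_scn in commit order; return how many were applied."""
    applied = 0
    for event in sorted(relay_log, key=lambda e: e.scn):
        if event.scn > from_scn and consumer.on_event(event):
            applied += 1
    return applied


log = [
    ChangeEvent(1, "member:1213", "v1"),
    ChangeEvent(2, "member:1214", "v1"),
    ChangeEvent(3, "member:1213", "v2"),
]
consumer = Consumer()
first_pass = deliver(log, consumer, from_scn=0)
# Simulate redelivery after a failure: the replay starts from an older SCN.
second_pass = deliver(log, consumer, from_scn=1)
```

In this sketch the second pass applies nothing, because the checkpoint already covers the replayed events; ordering plus a checkpoint is what makes duplicates harmless.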
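
Slide 21 mentions conditional and partial updates on Espresso resources. One common way to offer both is a version check before a merge; the sketch below illustrates that idea with an in-memory dictionary. It is an assumption about the general technique, not Espresso's API: `ResourceStore`, `conditional_update` and `VersionConflict` are hypothetical names.

```python
class VersionConflict(Exception):
    """Raised when the caller's expected version is stale."""


class ResourceStore:
    def __init__(self):
        self._rows = {}  # key -> (version, document)

    def read(self, key):
        """Return (version, document); a missing key reads as version 0."""
        version, document = self._rows.get(key, (0, {}))
        return version, dict(document)

    def conditional_update(self, key, expected_version, changes):
        """Merge `changes` into the document only if the version still matches."""
        version, document = self.read(key)
        if version != expected_version:
            raise VersionConflict(f"{key}: expected v{expected_version}, found v{version}")
        document.update(changes)  # partial update: untouched fields are kept
        self._rows[key] = (version + 1, document)
        return version + 1


store = ResourceStore()
v1 = store.conditional_update("member/1213", 0, {"headline": "Engineer", "locale": "en_US"})
v2 = store.conditional_update("member/1213", v1, {"headline": "Staff Engineer"})

try:
    # A writer still holding version v1 must not overwrite the newer document.
    store.conditional_update("member/1213", v1, {"headline": "stale write"})
    stale_write_rejected = False
except VersionConflict:
    stale_write_rejected = True
```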
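
Slides 28 and 35 contrast O(n²) and O(n) data integration. A small worked count makes the difference concrete. The sketch assumes the worst case in which every system may need data from every other system, and it counts one connection per system to the central publish/subscribe log; both function names are hypothetical.

```python
def point_to_point_pipelines(n_systems: int) -> int:
    """Directed pipelines needed when every system feeds every other system."""
    return n_systems * (n_systems - 1)


def central_log_connections(n_systems: int) -> int:
    """Connections needed when every system talks only to one shared log."""
    return n_systems


for n in (5, 10, 50):
    print(n, point_to_point_pipelines(n), central_log_connections(n))
```

Under these assumptions, 10 systems need 90 point-to-point pipelines but only 10 connections to a shared log.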
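
Slide 39 refers to a "programmatic compatibility model" for schema changes but does not spell out its rules. The sketch below shows what such a check could look like under three assumed rules: existing fields may not be removed, field types may not change, and new fields must carry a default. The rules, the `compatibility_problems` function and the field types are illustrative assumptions; only the field names come from slide 31.

```python
def compatibility_problems(old_fields: dict, new_fields: dict) -> list[str]:
    """Compare two schema versions given as {field_name: {"type": ..., "default": ...}}."""
    problems = []
    for name, spec in old_fields.items():
        if name not in new_fields:
            problems.append(f"field removed: {name}")
        elif new_fields[name]["type"] != spec["type"]:
            problems.append(f"type changed: {name}")
    for name, spec in new_fields.items():
        # Old records lack the new field, so readers need a default to fall back on.
        if name not in old_fields and "default" not in spec:
            problems.append(f"new field without default: {name}")
    return problems


page_view_v1 = {
    "session_id": {"type": "long"},
    "member_id": {"type": "long"},
    "page_key": {"type": "string"},
}
# Adds an optional field: accepted under the assumed rules.
page_view_v2 = {**page_view_v1, "locale": {"type": "string", "default": "en_US"}}
# Drops a field and changes a type: rejected under the assumed rules.
page_view_bad = {
    "session_id": {"type": "string"},
    "member_id": {"type": "long"},
}

ok = compatibility_problems(page_view_v1, page_view_v2)
bad = compatibility_problems(page_view_v1, page_view_bad)
```

A check of this kind could run at schema check-in time (slide 40), so that incompatible changes are caught in review rather than in downstream loads.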
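
Slides 46 to 51 build up a workflow of jobs and name dependency management as the first Azkaban feature. The sketch below shows the core idea, running jobs only after their dependencies, using Python's standard `graphlib` module rather than Azkaban itself. The job names come from slide 49, but the transcript does not preserve the arrows of the diagram, so the dependency edges here are assumed.

```python
from graphlib import CycleError, TopologicalSorter

# job -> set of jobs it depends on (edges are assumptions, see the lead-in)
workflow = {
    "Job A": set(),
    "Job B": {"Job A"},
    "Job C": {"Job B"},
    "Push to Production": {"Job C"},
    "Job X": {"Job B"},
    "Push to QA": {"Job X"},
}


def execution_order(dependencies: dict) -> list:
    """Return one valid run order in which every job follows its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


order = execution_order(workflow)

# A cyclic workflow has no valid order and is rejected up front.
try:
    execution_order({"Job A": {"Job B"}, "Job B": {"Job A"}})
    cycle_detected = False
except CycleError:
    cycle_detected = True
```

Several orders can be valid for the same graph (for example, "Job C" and "Job X" are independent here), which is also what lets a scheduler run such jobs in parallel.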