Data Infrastructure at LinkedIn
Shirshanka Das
XLDB 2011




Me

• UCLA Ph.D. 2005 (distributed protocols in content delivery networks)
• PayPal (web frameworks and session stores)
• Yahoo! (serving infrastructure, graph indexing, real-time bidding in display ad exchanges)
• @ LinkedIn (Distributed Data Systems team): distributed data transport and storage technology (Kafka, Databus, Espresso, ...)




Outline


 LinkedIn Products
 Data Ecosystem
 LinkedIn Data Infrastructure Solutions
 Next Play




LinkedIn By The Numbers

   120,000,000+ users in August 2011
   2 new user registrations per second
   4 billion People Searches expected in 2011
   2+ million companies with LinkedIn Company Pages
   81+ million unique visitors monthly*
   150K domains feature the LinkedIn Share Button
   7.1 billion page views in Q2 2011
   1M LinkedIn Groups


* Based on comScore, Q2 2011


Member Profiles




Signal - faceted stream search




People You May Know




Outline


 LinkedIn Products
 Data Ecosystem
 LinkedIn Data Infrastructure Solutions
 Next Play




Three Paradigms: Simplifying the Data Continuum

Online (activity that should be reflected immediately):
• Member Profiles
• Company Profiles
• Connections
• Communications

Nearline (activity that should be reflected soon):
• Signal
• Profile Standardization
• News
• Recommendations
• Search
• Communications

Offline (activity that can be reflected later):
• People You May Know
• Connection Strength
• News
• Recommendations
• Next best idea
Data Infrastructure Toolbox (Online)

Capabilities:
• Key-value access
• Rich structures (e.g., indexes)
• Change capture capability
• Search platform
• Graph engine

(The Systems and Analysis columns on this and the next two slides appeared as product logos and graphics.)
Data Infrastructure Toolbox (Nearline)

Capabilities:
• Change capture streams
• Messaging for site events, monitoring
• Nearline processing
Data Infrastructure Toolbox (Offline)

Capabilities:
• Machine learning, ranking, relevance
• Analytics on social gestures
Laying out the tools




Outline


 LinkedIn Products
 Data Ecosystem
 LinkedIn Data Infrastructure Solutions
 Next Play




Focus on four systems in Online and Nearline

• Data Transport
  – Kafka
  – Databus
• Online Data Stores
  – Voldemort
  – Espresso




LinkedIn Data Infrastructure Solutions
Kafka: High-Volume Low-Latency Messaging System




Kafka: Architecture
[Architecture] Web-tier producers push events to a broker tier, which appends them to per-topic logs (Topic 1 … Topic N) with sequential writes; consumers pull through a client library that exposes a message iterator per topic, and brokers serve data using sendfile. The slide cites roughly 100 MB/sec inbound from producers and 200 MB/sec outbound to consumers. Zookeeper handles topic/partition ownership and consumer offset management.

Scale:
• Billions of events
• TBs per day
• Inter-colo: a few seconds
• Typical retention: weeks

Guarantees:
• At-least-once delivery
• Very high throughput
• Low latency
• Durability
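To make the produce/consume flow concrete, here is a minimal producer sketch in Java. It uses the present-day Apache Kafka client purely for illustration (the 2011-era API differed), and the broker address and topic name are assumptions, not values from the talk.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class ActivityEventProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092"); // illustrative broker address
            props.put("key.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // The broker appends the event sequentially to the topic's log;
                // consumers pull it at their own pace, which is what makes
                // weeks-long retention and replay cheap.
                producer.send(new ProducerRecord<>("page-views", "member123",
                        "{\"page\": \"profile\"}"));
            }
        }
    }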
LinkedIn Data Infrastructure Solutions
Databus: Timeline-Consistent Change Data Capture




Databus at LinkedIn
[Architecture] A relay captures changes from the source DB and buffers them in an event window; online consumers receive the change stream through the Databus client library. A consumer that falls behind the relay's event window switches to the bootstrap service, which serves a consistent snapshot of the database as of some point U plus the subsequent online changes, after which the consumer resumes from the relay stream.
Features:
• Transport independent of data source: Oracle, MySQL, …
• Portable change event serialization and versioning
• Start consumption from arbitrary point

Guarantees:
• Transactional semantics
• Timeline consistency with the data source
• Durability (by data source)
• At-least-once delivery
• Availability
• Low latency
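To illustrate the consumption model, here is a hypothetical consumer callback in Java; the interface and names are invented for this sketch and are not Databus's actual API. The contract it mimics: change events arrive in source-commit (timeline) order, tagged with a monotonically increasing sequence number (SCN), and durably checkpointing the SCN is what makes at-least-once delivery and resumption from an arbitrary point (via the bootstrap service when needed) possible.

    // Hypothetical interface, for illustration only.
    interface ChangeConsumer {
        // Invoked once per change event, in source-commit (timeline) order.
        void onEvent(long scn, byte[] avroPayload);

        // Invoked at transaction boundaries.
        void onCheckpoint(long scn);
    }

    class SecondaryIndexUpdater implements ChangeConsumer {
        private long lastCheckpoint;

        public void onEvent(long scn, byte[] avroPayload) {
            // Apply the change to a derived structure, e.g. a secondary index.
        }

        public void onCheckpoint(long scn) {
            // Persisting the SCN lets the consumer resume after a restart;
            // replaying from the last checkpoint yields at-least-once delivery.
            lastCheckpoint = scn;
        }
    }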
LinkedIn Data Infrastructure Solutions
Voldemort: Highly-Available Distributed Data Store




Voldemort: Architecture




Highlights:
• Open source
• Pluggable components
• Tunable consistency / availability
• Key/value model, server-side “views”

In production:
• Data products
• Network updates, sharing, page view tracking, rate-limiting, more…
• Future: SSDs, multi-tenancy
LinkedIn Data Infrastructure Solutions
Espresso: Indexed Timeline-Consistent Distributed Data Store




Espresso: Key Design Points

• Hierarchical data model
  – InMail, Forums, Groups, Companies
• Native Change Data Capture Stream
  – Timeline consistency
  – Read after write
• Rich functionality within a hierarchy
  – Local secondary indexes
  – Transactions
  – Full-text search
• Modular and pluggable
  – Off-the-shelf: MySQL, Lucene, Avro



Application View




Partitioning




Partition Layout: Master, Slave
3 storage engine nodes, 2-way replication

[Diagram] Twelve partitions, P.1–P.12, are spread across three nodes so that each node holds eight replicas: Node 1 holds P.1–P.6, P.9, and P.10; Node 2 holds P.1, P.2, P.5–P.8, P.11, and P.12; Node 3 holds P.3, P.4, and P.7–P.12. Each partition is mastered on exactly one node and kept as a slave replica on another (the slide's legend distinguishes master from slave replicas). The Database metadata records each partition's location ("Partition: P.1, Node: 1 … Partition: P.12, Node: 3"), and the Cluster Manager tracks per-node replica state ("Node: 1, M: P.1 – Active, …, S: P.5 – Active, …").
Espresso: API

• REST over HTTP

• Get Messages for bob
  – GET /MailboxDB/MessageMeta/bob

• Get MsgId 3 for bob
  – GET /MailboxDB/MessageMeta/bob/3

• Get first page of Messages for bob that are unread and in the inbox
  – GET /MailboxDB/MessageMeta/bob/?query="+isUnread:true +isInbox:true"&start=0&count=15




Espresso: API Transactions

• Add a message to bob's mailbox
  – transactionally update mailbox aggregates, insert into metadata and details:

    POST /MailboxDB/*/bob HTTP/1.1
    Content-Type: multipart/binary; boundary=1299799120
    Accept: application/json

    --1299799120
    Content-Type: application/json
    Content-Location: /MailboxDB/MessageStats/bob
    Content-Length: 50

    {"total":"+1", "unread":"+1"}

    --1299799120
    Content-Type: application/json
    Content-Location: /MailboxDB/MessageMeta/bob
    Content-Length: 332

    {"from":"…","subject":"…",…}

    --1299799120
    Content-Type: application/json
    Content-Location: /MailboxDB/MessageDetails/bob
    Content-Length: 542

    {"body":"…"}

    --1299799120--




Espresso: System Components




Espresso @ LinkedIn

• First applications
  – Company Profiles
  – InMail
• Next
  – Unified Social Content Platform
  – Member Profiles
  – Many more…




Espresso: Next steps

• Launched first application Oct 2011
• Open source 2012
• Multi-datacenter support
• Log-structured storage
• Time-partitioned data




Outline


 LinkedIn Products
 Data Ecosystem
 LinkedIn Data Infrastructure Solutions
 Next Play




The Specialization Paradox in Distributed Systems


• Good: build specialized systems so you can do each thing really well
• Bad: rebuild distributed routing, failover, cluster management, monitoring, tooling




Generic Cluster Manager: Helix

• Generic Distributed State Model
• Centralized Config Management
• Automatic Load Balancing
• Fault tolerance
• Health monitoring
• Cluster expansion and rebalancing
• Open source 2012
• Used by Espresso, Databus, and Search




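As a sketch of what a generic distributed state model looks like to a participant, here is a master/slave model following the callback pattern in Helix's documentation; the class and the callback bodies are approximations for illustration, not LinkedIn's actual code.

    import org.apache.helix.NotificationContext;
    import org.apache.helix.model.Message;
    import org.apache.helix.participant.statemachine.StateModel;
    import org.apache.helix.participant.statemachine.StateModelInfo;
    import org.apache.helix.participant.statemachine.Transition;

    // Each partition replica moves through OFFLINE -> SLAVE -> MASTER;
    // the cluster manager invokes these callbacks as it balances the cluster.
    @StateModelInfo(initialState = "OFFLINE", states = {"MASTER", "SLAVE", "OFFLINE"})
    public class MasterSlaveStateModel extends StateModel {

        @Transition(from = "OFFLINE", to = "SLAVE")
        public void onBecomeSlaveFromOffline(Message message, NotificationContext context) {
            // e.g., start replicating the partition and applying its change stream
        }

        @Transition(from = "SLAVE", to = "MASTER")
        public void onBecomeMasterFromSlave(Message message, NotificationContext context) {
            // e.g., begin accepting writes for the partition
        }
    }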
Stay tuned for

• Innovation
  – Nearline processing
  – Espresso ecosystem
  – Storage / indexing
  – Analytics engine
  – Search
• Convergence
  – Building blocks for distributed data management systems

Thanks!




Appendix




Espresso: Routing

• Router is a high-performance HTTP proxy
• Examines the URL, extracts the partition key
• Per-db routing strategy
  – Hash Based (see the sketch after this list)
  – Route To Any (for schema access)
  – Range (future)
• Routing function maps the partition key to a partition
• Cluster Manager maintains the mapping of partitions to hosts:
  – Single master
  – Multiple slaves




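A minimal sketch of the hash-based strategy, assuming a fixed partition count; the class and the cluster-manager-supplied map are invented for illustration.

    import java.util.Map;

    public class HashRouter {
        private final int numPartitions;
        // Partition -> master host, as maintained by the cluster manager.
        private final Map<Integer, String> partitionToMaster;

        public HashRouter(int numPartitions, Map<Integer, String> partitionToMaster) {
            this.numPartitions = numPartitions;
            this.partitionToMaster = partitionToMaster;
        }

        // The partition key extracted from the URL determines the partition,
        // and the partition determines the single master that serves the request.
        public String route(String partitionKey) {
            int partition = Math.floorMod(partitionKey.hashCode(), numPartitions);
            return partitionToMaster.get(partition);
        }
    }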
Espresso: Storage Node

• Data Store (MySQL)
  – Stores each document as an Avro-serialized blob
  – Blob indexed by (partition key {, sub-key})
  – Row also contains limited metadata: Etag, last-modified time, Avro schema version
• Document schema specifies per-field index constraints
• Lucene index per partition key / resource




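To show what "document as Avro-serialized blob" means in practice, here is a sketch using Avro's generic API; the schema is invented for illustration and is not Espresso's actual MessageMeta schema.

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.BinaryEncoder;
    import org.apache.avro.io.EncoderFactory;

    public class AvroBlob {
        // Invented document schema, for illustration only.
        private static final Schema SCHEMA = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"MessageMeta\",\"fields\":["
            + "{\"name\":\"from\",\"type\":\"string\"},"
            + "{\"name\":\"subject\",\"type\":\"string\"},"
            + "{\"name\":\"isUnread\",\"type\":\"boolean\"}]}");

        // Serialize one document to the compact binary blob stored in the MySQL row
        // (alongside metadata such as the Avro schema version used to write it).
        public static byte[] toBlob(String from, String subject, boolean unread)
                throws IOException {
            GenericRecord doc = new GenericData.Record(SCHEMA);
            doc.put("from", from);
            doc.put("subject", subject);
            doc.put("isUnread", unread);

            ByteArrayOutputStream out = new ByteArrayOutputStream();
            BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
            new GenericDatumWriter<GenericRecord>(SCHEMA).write(doc, encoder);
            encoder.flush();
            return out.toByteArray();
        }
    }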
Espresso: Replication

• MySQL replication of mastered partitions
• A MySQL “slave” is a MySQL instance with a custom storage engine
  – the custom storage engine just publishes to Databus
• Per-database commit sequence number
• Replication is Databus
  – supports existing downstream consumers
• Storage nodes consume from Databus to update secondary indexes and slave partitions




