SlideShare a Scribd company logo
Data Infrastructure @ LinkedIn
Sid Anand
QCon London 2012

@r39132




                                 1
About Me
Current Life…
  LinkedIn                        *

      Web / Software Engineering
           Search, Network, and Analytics (SNA)
               Distributed Data Systems (DDS)
                   Me

In a Previous Life…
  Netflix, Cloud Database Architect
  eBay, Web Development, Research Lab, & Search
    Engine                                              ***




And Many Years Prior…
  Studying Distributed Systems at Cornell University


                                       @r39132                2
Our mission
Connect the world’s professionals to make
  them more productive and successful




                  @r39132                   3
The world’s largest professional network
Over 60% of members are now international



                                                                    75%
                                                                                     **

                                     150M+          *



                                             90                     Fortune 100 Companies
                                                                    use LinkedIn to hire



                                                                    >2M
                                                                                      ***




                                                                    Company Pages
                                      55




                               32
                                                                    16
                                                                    Languages



                                                                  ~4.2B
                        17                                                                   ***

                  8
  2      4
                                                                  Professional searches in 2011
 2004   2005    2006   2007   2008   2009    2010                                    *as of February 9, 2012
                                                                                 **as of September 30, 2011
               LinkedIn Members (Millions)                                       ***as of December 31, 2011




                                                        @r39132                                          4
Other Company Facts

•  Headquartered in Mountain View, Calif., with offices around the world!
                                *




•  As of December 31, 2011, LinkedIn has 2,116 full-time employees located
   around the world.

    •  Currently around 650 people work in Engineering
        •  400 in Web/Software Engineering
            •  Plan to add another 200 in 2012                               ***

        •  250 in Operations



                                                                     *as of February 9, 2012
                                                                 **as of September 30, 2011
                                                                 ***as of December 31, 2011




                                    @r39132                                              5
Agenda

  Company Overview
  Architecture
   –  Data Infrastructure Overview
   –  Technology Spotlight
           Oracle
           Voldemort
           DataBus
           Kafka
  Q & A




                                @r39132   6
LinkedIn : Architecture	


Overview

•  Our site runs primarily on Java, with some use of Scala for specific
   infrastructure

•  What runs on Scala?
   •  Network Graph Service
   •  Kafka

•  Most of our services run on Apache + Jetty




                                 @r39132                             7
LinkedIn : Architecture	


                                                                                     A web page requests
                                                                                      information A and B



                                                             Presentation Tier       A thin layer focused on
                                                                                      building the UI. It assembles
                                                                                      the page by making parallel
                                                                                      requests to BST services


      A    A          B         B         C        C         Business Service Tier   Encapsulates business
                                                                                      logic. Can call other BST
                                                                                      clusters and its own DST
                                                                                      cluster.
      A    A          B         B         C        C         Data Service Tier       Encapsulates DAL logic and
                                                                                      concerned with one Oracle
                                                                                      Schema.

                                                             Data Infrastructure     Concerned with the
 Master   Slave               Master           Master
                                                                                      persistent storage of and
                                                                                      easy access to data
                                       Memcached
                  Voldemort



                                                   @r39132                                                  8
LinkedIn : Architecture	


                                                                                     A web page requests
                                                                                      information A and B



                                                             Presentation Tier       A thin layer focused on
                                                                                      building the UI. It assembles
                                                                                      the page by making parallel
                                                                                      requests to BST services


      A    A          B         B         C        C         Business Service Tier   Encapsulates business
                                                                                      logic. Can call other BST
                                                                                      clusters and its own DST
                                                                                      cluster.
      A    A          B         B         C        C         Data Service Tier       Encapsulates DAL logic and
                                                                                      concerned with one Oracle
                                                                                      Schema.

                                                             Data Infrastructure     Concerned with the
 Master   Slave               Master           Master
                                                                                      persistent storage of and
                                                                                      easy access to data
                                       Memcached
                  Voldemort



                                                   @r39132                                                  9
LinkedIn : Data Infrastructure Technologies	

•    Database Technologies
      •  Oracle
      •  Voldemort                    *

      •  Espresso

•    Data Replication Technologies
      •  Kafka
      •  DataBus

•    Search Technologies
      •  Zoie – real-time search and indexing with Lucene
      •  Bobo – faceted search library for Lucene                              ***

      •  SenseiDB – fast, real-time, faceted, KV and full-text Search Engine and more




                                          @r39132                                10
LinkedIn : Data Infrastructure Technologies	

This talk will focus on a few of the key technologies below!

•    Database Technologies           *
      •  Oracle
      •  Voldemort
      •  Espresso – A new K-V store under development

•    Data Replication Technologies
      •  Kafka
      •  DataBus

•    Search Technologies                                                       ***

      •  Zoie – real-time search and indexing with Lucene
      •  Bobo – faceted search library for Lucene
      •  SenseiDB – fast, real-time, faceted, KV and full-text Search Engine and more




                                         @r39132                                  11
LinkedIn Data Infrastructure Technologies

Oracle: Source of Truth for User-Provided Data




                              @r39132            12
Oracle : Overview	

Oracle
•  All user-provided data is stored in Oracle – our current source of truth
•  About 50 Schemas running on tens of physical instances
                                       *

•  With our user base and traffic growing at an accelerating pace, so how do we
   scale Oracle for user-provided data?

Scaling Reads
•  Oracle Slaves (c.f. DSC)
•  Memcached
•  Voldemort – for key-value lookups

                                                                            ***


Scaling Writes
•  Move to more expensive hardware or replace Oracle with something else




                                       @r39132                                    13
Oracle : Overview – Data Service Context	

 Scaling Oracle Reads using DSC



   DSC uses a token (e.g. cookie) to ensure that a reader always
    sees his or her own writes immediately

     –  If I update my own status, it is okay if you don’t see the
        change for a few minutes, but I have to see it immediately




                                @r39132                              14
Oracle : Overview – How DSC Works?	

    When a user writes data to the master, the DSC token (for that
     data domain) is updated with a timestamp

    When the user reads data, we first attempt to read from a
     replica (a.k.a. slave) database

    If the data in the slave is older than the data in the DSC token,
     we read from the Master instead




                                 @r39132                             15
LinkedIn Data Infrastructure Technologies
Voldemort: Highly-Available Distributed Data Store




                              @r39132                16
Voldemort : Overview	

•    A distributed, persistent key-value store influenced by the AWS Dynamo paper

•    Key Features of Dynamo
        Highly Scalable, Available, and Performant
        Achieves this via Tunable Consistency
          •  For higher consistency, the user accepts lower availability, scalability, and
              performance, and vice-versa

        Provides several self-healing mechanisms when data does become inconsistent
          •  Read Repair
                 Repairs value for a key when the key is looked up/read
          •  Hinted Handoff
                 Buffers value for a key that wasn’t successfully written, then writes it later
          •  Anti-Entropy Repair
                 Scans the entire data set on a node and fixes it

        Provides means to detect node failure and a means to recover from node failure
          •  Failure Detection
          •  Bootstrapping New Nodes

                                                @r39132                                            17
Voldemort : Overview	

                        API                                    Layered, Pluggable Architecture
     VectorClock<V> get (K key)
                                                                             Client
     put (K key, VectorClock<V> value)
     applyUpdate(UpdateAction action, int retries)                         Client API
                                                                      Conflict Resolution
Voldemort-specific Features                                               Serialization
  Implements a layered, pluggable                                     Repair Mechanism
   architecture
                                                                        Failure Detector
                                                                            Routing
  Each layer implements a common interface
   (c.f. API). This allows us to replace or
   remove implementations at any layer                                       Server

                                                                  Repair Mechanism
    •  Pluggable data storage layer                                Failure Detector         Admin
           BDB JE, Custom RO storage,
                                                                       Routing
            etc…
                                                                         Storage Engine

    •  Pluggable routing supports
           Single or Multi-datacenter routing



                                                     @r39132                                  18
Voldemort : Overview	

                                                 Layered, Pluggable Architecture
                                                               Client
Voldemort-specific Features
                                                             Client API
                                                        Conflict Resolution
•  Supports Fat client or Fat Server                        Serialization
                                                         Repair Mechanism
    •  Repair Mechanism + Failure                         Failure Detector

       Detector + Routing can run on                          Routing

       server or client
                                                               Server


•  LinkedIn currently runs Fat Client, but we       Repair Mechanism
   would like to move this to a Fat Server           Failure Detector         Admin
   Model                                                 Routing
                                                           Storage Engine




                                       @r39132                                  19
Where Does LinkedIn use
      Voldemort?




          @r39132         20
Voldemort : Usage Patterns @ LinkedIn	



 2 Usage-Patterns

   Read-Write Store
     –  Uses BDB JE for the storage engine
     –  50% of Voldemort Stores (aka Tables) are RW


   Read-only Store
     –  Uses a custom Read-only format
     –  50% of Voldemort Stores (aka Tables) are RO


   Let’s look at the RO Store


                                 @r39132              21
Voldemort : RO Store Usage at LinkedIn	

     People You May Know	

     Viewers of this profile also viewed	

   Related Searches	





Events you may be interested in	

          LinkedIn Skills	

          Jobs you may be
                                                                          interested in	





                                               @r39132                                       22
Voldemort : Usage Patterns @ LinkedIn	

RO Store Usage Pattern
1.    Use Hadoop to build a model

2.    Voldemort loads the output of Hadoop

3.    Voldemort serves fast key-value look-ups on the site
      –    e.g. For key=“Sid Anand”, get all the people that “Sid Anand” may know!
      –    e.g. For key=“Sid Anand”, get all the jobs that “Sid Anand” may be interested in!




                                               @r39132                                         23
How Do The Voldemort RO
     Stores Perform?




          @r39132         24
Voldemort : RO Store Performance : TP vs. Latency	


                                                             ●       MySQL ● Voldemort
                                                   ●   MySQL ● Voldemort ● Voldemort
                                                                      ●   MySQL
                                                       median MySQL ● Voldemort 99th percentile
                                                               ●

                                           median            median          ●     99th percentile ●percentile
                                                                                                     99th
                                                       median                                  99th percentile
                                   3.5                           ●          ●    ●
                                                                                250          ●                ●
                                                             ●              ●                            ●
                     3.5                3.5                     ●
                                                                     250 ●      ●                       ●
                                  3.5
                                   3.0          ●                                   250 ●       ●
                                                                                                   ● ●
                                                            ●
                                                                   ●           250     ●
                                                                                           ●               ●
                     3.0                                                      ● 200 ● ●             ● ●
                                                                                                       ● ●
                                        3.0                                ●       ●       ●
                                                                                               ●
                                                                                                  ● ●
                                  3.0                                                    ●
                                latency (ms)


                                   2.5                            ●
                                                                     200
                                                                        ●            ●         ●
                                                           ●                        200
                                                                               200
      latency (ms)




                     2.5
                             latency (ms)


                                        2.5         ●    ●             ●   ●
                            latency (ms)




                                  2.5
                                   2.0        ●
                                                   ●
                                                      ● ●       ●
                                                                 ●
                                                                       ●        150
                                                          ●
                     2.0                2.0● ●
                                         ●●               ● ●        150            150
                                  2.0 ●
                                   1.5            ●
                                                       ●
                                                     ● ●                       150                                  ●
                                                                                                                     ●
                            ●   ●              ● ● ●                            100
                                         ●● ●                                                                    ●
                     1.5                1.5
                                         ●      ●
                                                                                                            ●
                                                                                                           ●●            ●
                                                                                                                          ●
                                  1.5
                                   1.0               ●
                                                          ●          100            100                            ●
                                                                                                                    ●
                            ●
                                   ●
                                               ●
                                                     ●
                                                                               100                    ●●           ●
                     1.0                 ●
                                        1.0 ●                                    50 ● ● ● ● ●              ●
                                                                                                                ●
                                                                                                                ●
                                  1.0
                                   0.5                                               ●                 ●
                                                                                                       ●
                                                                                                           ●
                                                                       50 ●     ●    50 ●  ●      ●
                                                                                                  ●
                     0.5                0.5                                     50 ●
                                  0.5
                                   0.0                                             0
                     0.0           0.0                   0           0
                               0.0 100 200 300 400 500 600 700 0 100 200 300 400 500 600 700
                           100 200 300 400 500 600 700 throughput 300 400 500 600 700 500 600 700
                                                           100 200
                                       100 200 300 400 500 600 700 (qps) 200 300 400
                                                                       100
                                   100 200 300 400 500 600 700     100 200 300 400 500 600 700
                                                     throughput (qps) (qps)
                                                                throughput (qps)
                                                            throughput

                                                       100 GB data, 24 GB RAM 	


                                                                    @r39132                                                   25
LinkedIn Data Infrastructure Solutions
Databus : Timeline-Consistent Change Data Capture




                               @r39132              26
Where Does LinkedIn use
       DataBus?




          @r39132         27
DataBus : Use-Cases @ LinkedIn	




                  Updates
                                   Standard      Search       Graph        Read
                                    ization       Index       Index       Replicas




                Oracle
                                              Data Change Events



A user updates his profile with skills and position history. He also accepts a connection

•    The write is made to an Oracle Master and DataBus replicates:
•    the profile change to the Standardization service
         E.G. the many forms of IBM are canonicalized for search-friendliness and
          recommendation-friendliness
•    the profile change to the Search Index service
         Recruiters can find you immediately by new keywords
•    the connection change to the Graph Index service
         The user can now start receiving feed updates from his new connections immediately


                                              @r39132                                     28
DataBus : Architecture	

                                            DataBus consists of 2 services
        Capture          Relay              •  Relay Services
   DB   Changes
                      Event Win                 •  Sharded
                                                •  Maintain an in-memory buffer per
                     On-line                       shard
                    Changes
                                                •  Each shard polls Oracle and then
                                                   deserializes transactions into Avro
                     Bootstrap
                                            •  Bootstrap Service
                          DB                    •  Picks up online changes as they
                                                   appear in the Relay
                                                •  Supports 2 types of operations
                                                   from clients
                                                       If a client falls behind and
                                                        needs records older than what
                                                        the relay has, Bootstrap can
                                                        send consolidated deltas!
                                                       If a new client comes on line,
                                                        Bootstrap can send a
                                                        consistent snapshot


                                  @r39132                                        29
DataBus : Architecture	


                                                                   Client
        Capture          Relay                                     Consumer 1




                                                      Client Lib
                                            On-line




                                                      Databus
   DB   Changes                             Changes
                      Event Win                                    Consumer n
                     On-line
                                               ated
                    Changes              solid
                                     Con         eT
                                            Sinc
                                      Delta                        Client
                     Bootstrap
                                                                   Consumer 1




                                                      Client Lib
                                      Consistent




                                                      Databus
                                     Snapshot at U                 Consumer n
                          DB

                                     Guarantees
                                       Transactional semantics
                                       In-commit-order Delivery
                                       At-least-once delivery
                                       Durability (by data source)
                                       High-availability and reliability
                                       Low latency
                                  @r39132                                       30
DataBus : Architecture - Bootstrap	

    Generate consistent snapshots and consolidated deltas
     during continuous updates with long-running queries

                                                                     Client
                              Read
                                                                      Consumer 1




                                                        Client Lib
                                                        Databus
                              Online
            Relay             Changes
                                                                      Consumer n
          Event Win

                 Read
                 Changes     Bootstrap server                        Bootstrap

                                            Read
           Log Writer                                                Server
                                        recent events

                           Replay
          Log Storage      events                       Snapshot Storage

                               Log Applier


                                    @r39132                                        31
LinkedIn Data Infrastructure Solutions
Kafka: High-Volume Low-Latency Messaging System




                               @r39132            32
Kafka : Usage at LinkedIn	

Where as DataBus is used for Database change capture and
replication, Kafka is used for application-level data streams

Examples:
•  End-user Action Tracking (a.k.a. Web Tracking) of
    •  Emails opened
    •  Pages seen
    •  Links followed
    •  Executing Searches

•    Operational Metrics
     •  Network & System metrics such as
         •  TCP metrics (connection resets, message resends, etc…)
         •  System metrics (iops, CPU, load average, etc…)

                                 @r39132                         33
Kafka : Overview	

                                   Broker Tier
     WebTier                                                                          Consumers

            Push           Sequential write            sendfile         Pull
            Events                                                      Events                   Iterator 1




                                                                                    Client Lib
                                        Topic 1




                                                                                     Kafka
           100 MB/sec                                               200 MB/sec
                                        Topic 2
                                                                                                 Iterator n
                                        Topic N

                                                                                                 Topic  Offset


                                              Topic, Partition                        Offset
                                              Ownership
                                                                  Zookeeper           Management



Features                           Guarantees                      Scale
     Pub/Sub                          At least once delivery          Billions of Events
     Batch Send/Receive               Very high throughput            TBs per day
     System Decoupling                Low latency                     Inter-colo: few seconds
                                       Durability                      Typical retention: weeks
                                       Horizontally Scalable

                                                      @r39132                                                 34
Kafka : Overview	

Key Design Choices

•  When reading from a file and sending to network socket, we typically incur 4 buffer copies and
   2 OS system calls
       Kafka leverages a sendFile API to eliminate 2 of the buffer copies and 1 of the system
        calls

•  No double-buffering of messages - we rely on the OS page cache and do not store a copy of
   the message in the JVM
       Less pressure on memory and GC

       If the Kafka process is restarted on a machine, recently accessed messages are still in
        the page cache, so we get the benefit of a warm start

•  Kafka doesn't keep track of which messages have yet to be consumed -- i.e. no book keeping
   overhead
       Instead, messages have time-based SLA expiration -- after 7 days, messages are deleted




                                              @r39132                                        35
How Does Kafka Perform?




          @r39132         36
Kafka : Performance : Throughput vs. Latency	



                                     (100 topics, 1 producer, 1 broker)
                           250

                           200
  Consumer latency in ms




                           150

                           100

                            50

                             0
                                 0   20          40          60           80   100

                                          Producer throughput in MB/sec

                                                   @r39132                       37
Kafka : Performance : Linear Incremental Scalability	




                                    (10 topics, broker flush interval 100K)
                          400
                                                                                   381
                          350

                          300
     Throughput in MB/s




                                                                 293
                          250

                          200                      190
                          150

                          100        101

                           50

                            0
                                1 broker     2 brokers      3 brokers         4 brokers

                                                 @r39132                                  38
Kafka : Performance : Resilience as Messages Pile Up 	



                                                (1	
  topic,	
  broker	
  flush	
  interval	
  10K)	
  
                                       200000

                                       160000
       Throughput	
  in	
  msg/s	
  




                                       120000

                                       80000

                                       40000

                                           0
                                                 10
                                                      105
                                                            199
                                                                  294
                                                                        388
                                                                              473
                                                                                    567
                                                                                          662
                                                                                                756
                                                                                                      851
                                                                                                            945
                                                                                                                  1039
                                                            Unconsumed data in GB

                                                                         @r39132                                         39
Acknowledgments	

 Presentation & Content
   Chavdar Botev (DataBus)       @cbotev
   Roshan Sumbaly (Voldemort)    @rsumbaly
   Neha Narkhede (Kafka)         @nehanarkhede



 Development Team
     Aditya Auradkar, Chavdar Botev, Shirshanka Das, Dave DeMaagd, Alex
     Feinberg, Phanindra Ganti, Lei Gao, Bhaskar Ghosh, Kishore
     Gopalakrishna, Brendan Harris, Joel Koshy, Kevin Krawez, Jay Kreps, Shi
     Lu, Sunil Nagaraj, Neha Narkhede, Sasha Pachev, Igor Perisic, Lin Qiao,
     Tom Quiggle, Jun Rao, Bob Schulman, Abraham Sebastian, Oliver Seeliger,
     Adam Silberstein, Boris Skolnick, Chinmay Soman, Roshan Sumbaly, Kapil
     Surlaker, Sajid Topiwala, Balaji Varadarajan, Jemiah Westerman, Zach
     White, David Zhang, and Jason Zhang




                                   @r39132                                40
y Questions?




   @r39132     41

More Related Content

What's hot

Apache Atlas: Governance for your Data
Apache Atlas: Governance for your DataApache Atlas: Governance for your Data
Apache Atlas: Governance for your Data
DataWorks Summit/Hadoop Summit
 
Spark with Delta Lake
Spark with Delta LakeSpark with Delta Lake
Spark with Delta Lake
Knoldus Inc.
 
Streaming data for real time analysis
Streaming data for real time analysisStreaming data for real time analysis
Streaming data for real time analysis
Amazon Web Services
 
HBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseHBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBase
enissoz
 
[pgday.Seoul 2022] PostgreSQL with Google Cloud
[pgday.Seoul 2022] PostgreSQL with Google Cloud[pgday.Seoul 2022] PostgreSQL with Google Cloud
[pgday.Seoul 2022] PostgreSQL with Google Cloud
PgDay.Seoul
 
Apache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkApache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data Framework
Wes McKinney
 
Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...
Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...
Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...
Amazon Web Services
 
Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architecture
Adam Doyle
 
Introduction to Data Engineering
Introduction to Data EngineeringIntroduction to Data Engineering
Introduction to Data Engineering
Durga Gadiraju
 
Oracle Database Migration to Oracle Cloud Infrastructure
Oracle Database Migration to Oracle Cloud InfrastructureOracle Database Migration to Oracle Cloud Infrastructure
Oracle Database Migration to Oracle Cloud Infrastructure
SinanPetrusToma
 
API Management - Why it matters!
API Management - Why it matters!API Management - Why it matters!
API Management - Why it matters!
Sven Bernhardt
 
Real-Life Use Cases & Architectures for Event Streaming with Apache Kafka
Real-Life Use Cases & Architectures for Event Streaming with Apache KafkaReal-Life Use Cases & Architectures for Event Streaming with Apache Kafka
Real-Life Use Cases & Architectures for Event Streaming with Apache Kafka
Kai Wähner
 
How to govern and secure a Data Mesh?
How to govern and secure a Data Mesh?How to govern and secure a Data Mesh?
How to govern and secure a Data Mesh?
confluent
 
Serverless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
Serverless Kafka and Spark in a Multi-Cloud Lakehouse ArchitectureServerless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
Serverless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
Kai Wähner
 
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
StreamNative
 
Smartsheet’s Transition to Snowflake and Databricks: The Why and Immediate Im...
Smartsheet’s Transition to Snowflake and Databricks: The Why and Immediate Im...Smartsheet’s Transition to Snowflake and Databricks: The Why and Immediate Im...
Smartsheet’s Transition to Snowflake and Databricks: The Why and Immediate Im...
Databricks
 
(APP304) AWS CloudFormation Best Practices | AWS re:Invent 2014
(APP304) AWS CloudFormation Best Practices | AWS re:Invent 2014(APP304) AWS CloudFormation Best Practices | AWS re:Invent 2014
(APP304) AWS CloudFormation Best Practices | AWS re:Invent 2014
Amazon Web Services
 
RocksDB Performance and Reliability Practices
RocksDB Performance and Reliability PracticesRocksDB Performance and Reliability Practices
RocksDB Performance and Reliability Practices
Yoshinori Matsunobu
 
AWS Kinesis Streams
AWS Kinesis StreamsAWS Kinesis Streams
AWS Kinesis Streams
Fernando Rodriguez
 
HBase in Practice
HBase in Practice HBase in Practice
HBase in Practice
DataWorks Summit/Hadoop Summit
 

What's hot (20)

Apache Atlas: Governance for your Data
Apache Atlas: Governance for your DataApache Atlas: Governance for your Data
Apache Atlas: Governance for your Data
 
Spark with Delta Lake
Spark with Delta LakeSpark with Delta Lake
Spark with Delta Lake
 
Streaming data for real time analysis
Streaming data for real time analysisStreaming data for real time analysis
Streaming data for real time analysis
 
HBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseHBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBase
 
[pgday.Seoul 2022] PostgreSQL with Google Cloud
[pgday.Seoul 2022] PostgreSQL with Google Cloud[pgday.Seoul 2022] PostgreSQL with Google Cloud
[pgday.Seoul 2022] PostgreSQL with Google Cloud
 
Apache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkApache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data Framework
 
Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...
Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...
Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...
 
Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architecture
 
Introduction to Data Engineering
Introduction to Data EngineeringIntroduction to Data Engineering
Introduction to Data Engineering
 
Oracle Database Migration to Oracle Cloud Infrastructure
Oracle Database Migration to Oracle Cloud InfrastructureOracle Database Migration to Oracle Cloud Infrastructure
Oracle Database Migration to Oracle Cloud Infrastructure
 
API Management - Why it matters!
API Management - Why it matters!API Management - Why it matters!
API Management - Why it matters!
 
Real-Life Use Cases & Architectures for Event Streaming with Apache Kafka
Real-Life Use Cases & Architectures for Event Streaming with Apache KafkaReal-Life Use Cases & Architectures for Event Streaming with Apache Kafka
Real-Life Use Cases & Architectures for Event Streaming with Apache Kafka
 
How to govern and secure a Data Mesh?
How to govern and secure a Data Mesh?How to govern and secure a Data Mesh?
How to govern and secure a Data Mesh?
 
Serverless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
Serverless Kafka and Spark in a Multi-Cloud Lakehouse ArchitectureServerless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
Serverless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
 
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
 
Smartsheet’s Transition to Snowflake and Databricks: The Why and Immediate Im...
Smartsheet’s Transition to Snowflake and Databricks: The Why and Immediate Im...Smartsheet’s Transition to Snowflake and Databricks: The Why and Immediate Im...
Smartsheet’s Transition to Snowflake and Databricks: The Why and Immediate Im...
 
(APP304) AWS CloudFormation Best Practices | AWS re:Invent 2014
(APP304) AWS CloudFormation Best Practices | AWS re:Invent 2014(APP304) AWS CloudFormation Best Practices | AWS re:Invent 2014
(APP304) AWS CloudFormation Best Practices | AWS re:Invent 2014
 
RocksDB Performance and Reliability Practices
RocksDB Performance and Reliability PracticesRocksDB Performance and Reliability Practices
RocksDB Performance and Reliability Practices
 
AWS Kinesis Streams
AWS Kinesis StreamsAWS Kinesis Streams
AWS Kinesis Streams
 
HBase in Practice
HBase in Practice HBase in Practice
HBase in Practice
 

Viewers also liked

Databus - LinkedIn's Change Data Capture Pipeline
Databus - LinkedIn's Change Data Capture PipelineDatabus - LinkedIn's Change Data Capture Pipeline
Databus - LinkedIn's Change Data Capture Pipeline
Sunil Nagaraj
 
Redis: Lua scripts - a primer and use cases
Redis: Lua scripts - a primer and use casesRedis: Lua scripts - a primer and use cases
Redis: Lua scripts - a primer and use cases
Redis Labs
 
Scala Parallel Collections
Scala Parallel CollectionsScala Parallel Collections
Scala Parallel Collections
Aleksandar Prokopec
 
LinkedIn - A Professional Network built with Java Technologies and Agile Prac...
LinkedIn - A Professional Network built with Java Technologies and Agile Prac...LinkedIn - A Professional Network built with Java Technologies and Agile Prac...
LinkedIn - A Professional Network built with Java Technologies and Agile Prac...
LinkedIn
 
Shopzilla - Performance By Design
Shopzilla - Performance By DesignShopzilla - Performance By Design
Shopzilla - Performance By Design
Tim Morrow
 
Let's Go (golang)
Let's Go (golang)Let's Go (golang)
Let's Go (golang)
상욱 송
 
ScalaBlitz
ScalaBlitzScalaBlitz
Global Netflix Platform
Global Netflix PlatformGlobal Netflix Platform
Global Netflix Platform
Adrian Cockcroft
 
Openstack Rally - Benchmark as a Service. Openstack Meetup India. Ananth/Rahul.
Openstack Rally - Benchmark as a Service. Openstack Meetup India. Ananth/Rahul.Openstack Rally - Benchmark as a Service. Openstack Meetup India. Ananth/Rahul.
Openstack Rally - Benchmark as a Service. Openstack Meetup India. Ananth/Rahul.
Rahul Krishna Upadhyaya
 
Hancom MDS Conference - KAKAO DEVOPS Practice (카카오 스토리의 Devops 사례)
Hancom MDS Conference - KAKAO DEVOPS Practice (카카오 스토리의 Devops 사례)Hancom MDS Conference - KAKAO DEVOPS Practice (카카오 스토리의 Devops 사례)
Hancom MDS Conference - KAKAO DEVOPS Practice (카카오 스토리의 Devops 사례)
knight1128
 
Scalability, Availability & Stability Patterns
Scalability, Availability & Stability PatternsScalability, Availability & Stability Patterns
Scalability, Availability & Stability Patterns
Jonas Bonér
 
facebook architecture for 600M users
facebook architecture for 600M usersfacebook architecture for 600M users
facebook architecture for 600M users
Jongyoon Choi
 
Big Data in Real-Time at Twitter
Big Data in Real-Time at TwitterBig Data in Real-Time at Twitter
Big Data in Real-Time at Twitter
nkallen
 

Viewers also liked (13)

Databus - LinkedIn's Change Data Capture Pipeline
Databus - LinkedIn's Change Data Capture PipelineDatabus - LinkedIn's Change Data Capture Pipeline
Databus - LinkedIn's Change Data Capture Pipeline
 
Redis: Lua scripts - a primer and use cases
Redis: Lua scripts - a primer and use casesRedis: Lua scripts - a primer and use cases
Redis: Lua scripts - a primer and use cases
 
Scala Parallel Collections
Scala Parallel CollectionsScala Parallel Collections
Scala Parallel Collections
 
LinkedIn - A Professional Network built with Java Technologies and Agile Prac...
LinkedIn - A Professional Network built with Java Technologies and Agile Prac...LinkedIn - A Professional Network built with Java Technologies and Agile Prac...
LinkedIn - A Professional Network built with Java Technologies and Agile Prac...
 
Shopzilla - Performance By Design
Shopzilla - Performance By DesignShopzilla - Performance By Design
Shopzilla - Performance By Design
 
Let's Go (golang)
Let's Go (golang)Let's Go (golang)
Let's Go (golang)
 
ScalaBlitz
ScalaBlitzScalaBlitz
ScalaBlitz
 
Global Netflix Platform
Global Netflix PlatformGlobal Netflix Platform
Global Netflix Platform
 
Openstack Rally - Benchmark as a Service. Openstack Meetup India. Ananth/Rahul.
Openstack Rally - Benchmark as a Service. Openstack Meetup India. Ananth/Rahul.Openstack Rally - Benchmark as a Service. Openstack Meetup India. Ananth/Rahul.
Openstack Rally - Benchmark as a Service. Openstack Meetup India. Ananth/Rahul.
 
Hancom MDS Conference - KAKAO DEVOPS Practice (카카오 스토리의 Devops 사례)
Hancom MDS Conference - KAKAO DEVOPS Practice (카카오 스토리의 Devops 사례)Hancom MDS Conference - KAKAO DEVOPS Practice (카카오 스토리의 Devops 사례)
Hancom MDS Conference - KAKAO DEVOPS Practice (카카오 스토리의 Devops 사례)
 
Scalability, Availability & Stability Patterns
Scalability, Availability & Stability PatternsScalability, Availability & Stability Patterns
Scalability, Availability & Stability Patterns
 
facebook architecture for 600M users
facebook architecture for 600M usersfacebook architecture for 600M users
facebook architecture for 600M users
 
Big Data in Real-Time at Twitter
Big Data in Real-Time at TwitterBig Data in Real-Time at Twitter
Big Data in Real-Time at Twitter
 

Similar to LinkedIn Data Infrastructure (QCon London 2012)

LinkedIn Data Infrastructure Slides (Version 2)
LinkedIn Data Infrastructure Slides (Version 2)LinkedIn Data Infrastructure Slides (Version 2)
LinkedIn Data Infrastructure Slides (Version 2)
Sid Anand
 
Document Classification using DMX in SQL Server Analysis Services
Document Classification using DMX in SQL Server Analysis ServicesDocument Classification using DMX in SQL Server Analysis Services
Document Classification using DMX in SQL Server Analysis Services
Mark Tabladillo
 
Introduction inbox v2.0
Introduction inbox v2.0Introduction inbox v2.0
Introduction inbox v2.0
Ashutosh Mundra
 
Netflix in the cloud 2011
Netflix in the cloud 2011Netflix in the cloud 2011
Netflix in the cloud 2011
Adrian Cockcroft
 
Migrating Netflix from Datacenter Oracle to Global Cassandra
Migrating Netflix from Datacenter Oracle to Global CassandraMigrating Netflix from Datacenter Oracle to Global Cassandra
Migrating Netflix from Datacenter Oracle to Global Cassandra
Adrian Cockcroft
 
Sap business objects BI4.0 reporting presentation
Sap business objects BI4.0 reporting presentationSap business objects BI4.0 reporting presentation
Sap business objects BI4.0 reporting presentation
shaktell2
 
10g forms
10g forms10g forms
(.Net Portfolio) Td Rodda
(.Net Portfolio) Td Rodda(.Net Portfolio) Td Rodda
(.Net Portfolio) Td Rodda
tdrodda
 
Keat Resume 2012b
Keat Resume 2012bKeat Resume 2012b
Keat Resume 2012b
mkeating1
 
IBM THINK 2019 - A Sharing Economy for Analytics: SQL Query in IBM Cloud
IBM THINK 2019 - A Sharing Economy for Analytics: SQL Query in IBM CloudIBM THINK 2019 - A Sharing Economy for Analytics: SQL Query in IBM Cloud
IBM THINK 2019 - A Sharing Economy for Analytics: SQL Query in IBM Cloud
Torsten Steinbach
 
AWS Customer Presentation: Freie Univerisitat - Berlin Summit 2012
AWS Customer Presentation: Freie Univerisitat - Berlin Summit 2012AWS Customer Presentation: Freie Univerisitat - Berlin Summit 2012
AWS Customer Presentation: Freie Univerisitat - Berlin Summit 2012
Amazon Web Services
 
AWS Summit Berlin 2012 Talk on Web Data Commons
AWS Summit Berlin 2012 Talk on Web Data CommonsAWS Summit Berlin 2012 Talk on Web Data Commons
AWS Summit Berlin 2012 Talk on Web Data Commons
Hannes Mühleisen
 
The Perfect Storm: The Impact of Analytics, Big Data and Analytics
The Perfect Storm: The Impact of Analytics, Big Data and AnalyticsThe Perfect Storm: The Impact of Analytics, Big Data and Analytics
The Perfect Storm: The Impact of Analytics, Big Data and Analytics
Inside Analysis
 
Leveraging PowerPivot
Leveraging PowerPivotLeveraging PowerPivot
Leveraging PowerPivot
Dan English
 
Ofm msft-interop-v5c-132827
Ofm msft-interop-v5c-132827Ofm msft-interop-v5c-132827
Ofm msft-interop-v5c-132827
Jean-Marc Hui Bon Hoa
 
An overview of Microsoft data mining technology
An overview of Microsoft data mining technologyAn overview of Microsoft data mining technology
An overview of Microsoft data mining technology
Mark Tabladillo
 
Secrets of Enterprise Data Mining
Secrets of Enterprise Data Mining Secrets of Enterprise Data Mining
Secrets of Enterprise Data Mining
Mark Tabladillo
 
Domino app dev competitive advantage final
Domino app dev competitive advantage finalDomino app dev competitive advantage final
Domino app dev competitive advantage final
John Head
 
Agile NoSQL With XRX
Agile NoSQL With XRXAgile NoSQL With XRX
Agile NoSQL With XRX
DATAVERSITY
 
Transitioning a Full Enterprise to Cloud in 10 Months - Cloud Expo
Transitioning a Full Enterprise to Cloud in 10 Months - Cloud ExpoTransitioning a Full Enterprise to Cloud in 10 Months - Cloud Expo
Transitioning a Full Enterprise to Cloud in 10 Months - Cloud Expo
sjdeluca
 

Similar to LinkedIn Data Infrastructure (QCon London 2012) (20)

LinkedIn Data Infrastructure Slides (Version 2)
LinkedIn Data Infrastructure Slides (Version 2)LinkedIn Data Infrastructure Slides (Version 2)
LinkedIn Data Infrastructure Slides (Version 2)
 
Document Classification using DMX in SQL Server Analysis Services
Document Classification using DMX in SQL Server Analysis ServicesDocument Classification using DMX in SQL Server Analysis Services
Document Classification using DMX in SQL Server Analysis Services
 
Introduction inbox v2.0
Introduction inbox v2.0Introduction inbox v2.0
Introduction inbox v2.0
 
Netflix in the cloud 2011
Netflix in the cloud 2011Netflix in the cloud 2011
Netflix in the cloud 2011
 
Migrating Netflix from Datacenter Oracle to Global Cassandra
Migrating Netflix from Datacenter Oracle to Global CassandraMigrating Netflix from Datacenter Oracle to Global Cassandra
Migrating Netflix from Datacenter Oracle to Global Cassandra
 
Sap business objects BI4.0 reporting presentation
Sap business objects BI4.0 reporting presentationSap business objects BI4.0 reporting presentation
Sap business objects BI4.0 reporting presentation
 
10g forms
10g forms10g forms
10g forms
 
(.Net Portfolio) Td Rodda
(.Net Portfolio) Td Rodda(.Net Portfolio) Td Rodda
(.Net Portfolio) Td Rodda
 
Keat Resume 2012b
Keat Resume 2012bKeat Resume 2012b
Keat Resume 2012b
 
IBM THINK 2019 - A Sharing Economy for Analytics: SQL Query in IBM Cloud
IBM THINK 2019 - A Sharing Economy for Analytics: SQL Query in IBM CloudIBM THINK 2019 - A Sharing Economy for Analytics: SQL Query in IBM Cloud
IBM THINK 2019 - A Sharing Economy for Analytics: SQL Query in IBM Cloud
 
AWS Customer Presentation: Freie Univerisitat - Berlin Summit 2012
AWS Customer Presentation: Freie Univerisitat - Berlin Summit 2012AWS Customer Presentation: Freie Univerisitat - Berlin Summit 2012
AWS Customer Presentation: Freie Univerisitat - Berlin Summit 2012
 
AWS Summit Berlin 2012 Talk on Web Data Commons
AWS Summit Berlin 2012 Talk on Web Data CommonsAWS Summit Berlin 2012 Talk on Web Data Commons
AWS Summit Berlin 2012 Talk on Web Data Commons
 
The Perfect Storm: The Impact of Analytics, Big Data and Analytics
The Perfect Storm: The Impact of Analytics, Big Data and AnalyticsThe Perfect Storm: The Impact of Analytics, Big Data and Analytics
The Perfect Storm: The Impact of Analytics, Big Data and Analytics
 
Leveraging PowerPivot
Leveraging PowerPivotLeveraging PowerPivot
Leveraging PowerPivot
 
Ofm msft-interop-v5c-132827
Ofm msft-interop-v5c-132827Ofm msft-interop-v5c-132827
Ofm msft-interop-v5c-132827
 
An overview of Microsoft data mining technology
An overview of Microsoft data mining technologyAn overview of Microsoft data mining technology
An overview of Microsoft data mining technology
 
Secrets of Enterprise Data Mining
Secrets of Enterprise Data Mining Secrets of Enterprise Data Mining
Secrets of Enterprise Data Mining
 
Domino app dev competitive advantage final
Domino app dev competitive advantage finalDomino app dev competitive advantage final
Domino app dev competitive advantage final
 
Agile NoSQL With XRX
Agile NoSQL With XRXAgile NoSQL With XRX
Agile NoSQL With XRX
 
Transitioning a Full Enterprise to Cloud in 10 Months - Cloud Expo
Transitioning a Full Enterprise to Cloud in 10 Months - Cloud ExpoTransitioning a Full Enterprise to Cloud in 10 Months - Cloud Expo
Transitioning a Full Enterprise to Cloud in 10 Months - Cloud Expo
 

More from Sid Anand

Building High Fidelity Data Streams (QCon London 2023)
Building High Fidelity Data Streams (QCon London 2023)Building High Fidelity Data Streams (QCon London 2023)
Building High Fidelity Data Streams (QCon London 2023)
Sid Anand
 
Building & Operating High-Fidelity Data Streams - QCon Plus 2021
Building & Operating High-Fidelity Data Streams - QCon Plus 2021Building & Operating High-Fidelity Data Streams - QCon Plus 2021
Building & Operating High-Fidelity Data Streams - QCon Plus 2021
Sid Anand
 
Low Latency Fraud Detection & Prevention
Low Latency Fraud Detection & PreventionLow Latency Fraud Detection & Prevention
Low Latency Fraud Detection & Prevention
Sid Anand
 
YOW! Data Keynote (2021)
YOW! Data Keynote (2021)YOW! Data Keynote (2021)
YOW! Data Keynote (2021)
Sid Anand
 
Big Data, Fast Data @ PayPal (YOW 2018)
Big Data, Fast Data @ PayPal (YOW 2018)Big Data, Fast Data @ PayPal (YOW 2018)
Big Data, Fast Data @ PayPal (YOW 2018)
Sid Anand
 
Building Better Data Pipelines using Apache Airflow
Building Better Data Pipelines using Apache AirflowBuilding Better Data Pipelines using Apache Airflow
Building Better Data Pipelines using Apache Airflow
Sid Anand
 
Cloud Native Predictive Data Pipelines (micro talk)
Cloud Native Predictive Data Pipelines (micro talk)Cloud Native Predictive Data Pipelines (micro talk)
Cloud Native Predictive Data Pipelines (micro talk)
Sid Anand
 
Cloud Native Data Pipelines (GoTo Chicago 2017)
Cloud Native Data Pipelines (GoTo Chicago 2017)Cloud Native Data Pipelines (GoTo Chicago 2017)
Cloud Native Data Pipelines (GoTo Chicago 2017)
Sid Anand
 
Cloud Native Data Pipelines (DataEngConf SF 2017)
Cloud Native Data Pipelines (DataEngConf SF 2017)Cloud Native Data Pipelines (DataEngConf SF 2017)
Cloud Native Data Pipelines (DataEngConf SF 2017)
Sid Anand
 
Cloud Native Data Pipelines (in Eng & Japanese) - QCon Tokyo
Cloud Native Data Pipelines (in Eng & Japanese)  - QCon TokyoCloud Native Data Pipelines (in Eng & Japanese)  - QCon Tokyo
Cloud Native Data Pipelines (in Eng & Japanese) - QCon Tokyo
Sid Anand
 
Cloud Native Data Pipelines (QCon Shanghai & Tokyo 2016)
Cloud Native Data Pipelines (QCon Shanghai & Tokyo 2016)Cloud Native Data Pipelines (QCon Shanghai & Tokyo 2016)
Cloud Native Data Pipelines (QCon Shanghai & Tokyo 2016)
Sid Anand
 
Introduction to Apache Airflow - Data Day Seattle 2016
Introduction to Apache Airflow - Data Day Seattle 2016Introduction to Apache Airflow - Data Day Seattle 2016
Introduction to Apache Airflow - Data Day Seattle 2016
Sid Anand
 
Airflow @ Agari
Airflow @ Agari Airflow @ Agari
Airflow @ Agari
Sid Anand
 
Resilient Predictive Data Pipelines (GOTO Chicago 2016)
Resilient Predictive Data Pipelines (GOTO Chicago 2016)Resilient Predictive Data Pipelines (GOTO Chicago 2016)
Resilient Predictive Data Pipelines (GOTO Chicago 2016)
Sid Anand
 
Resilient Predictive Data Pipelines (QCon London 2016)
Resilient Predictive Data Pipelines (QCon London 2016)Resilient Predictive Data Pipelines (QCon London 2016)
Resilient Predictive Data Pipelines (QCon London 2016)
Sid Anand
 
Software Developer and Architecture @ LinkedIn (QCon SF 2014)
Software Developer and Architecture @ LinkedIn (QCon SF 2014)Software Developer and Architecture @ LinkedIn (QCon SF 2014)
Software Developer and Architecture @ LinkedIn (QCon SF 2014)
Sid Anand
 
LinkedIn's Segmentation & Targeting Platform (Hadoop Summit 2013)
LinkedIn's Segmentation & Targeting Platform (Hadoop Summit 2013)LinkedIn's Segmentation & Targeting Platform (Hadoop Summit 2013)
LinkedIn's Segmentation & Targeting Platform (Hadoop Summit 2013)
Sid Anand
 
Building a Modern Website for Scale (QCon NY 2013)
Building a Modern Website for Scale (QCon NY 2013)Building a Modern Website for Scale (QCon NY 2013)
Building a Modern Website for Scale (QCon NY 2013)
Sid Anand
 
Hands On with Maven
Hands On with MavenHands On with Maven
Hands On with Maven
Sid Anand
 
Learning git
Learning gitLearning git
Learning git
Sid Anand
 

More from Sid Anand (20)

Building High Fidelity Data Streams (QCon London 2023)
Building High Fidelity Data Streams (QCon London 2023)Building High Fidelity Data Streams (QCon London 2023)
Building High Fidelity Data Streams (QCon London 2023)
 
Building & Operating High-Fidelity Data Streams - QCon Plus 2021
Building & Operating High-Fidelity Data Streams - QCon Plus 2021Building & Operating High-Fidelity Data Streams - QCon Plus 2021
Building & Operating High-Fidelity Data Streams - QCon Plus 2021
 
Low Latency Fraud Detection & Prevention
Low Latency Fraud Detection & PreventionLow Latency Fraud Detection & Prevention
Low Latency Fraud Detection & Prevention
 
YOW! Data Keynote (2021)
YOW! Data Keynote (2021)YOW! Data Keynote (2021)
YOW! Data Keynote (2021)
 
Big Data, Fast Data @ PayPal (YOW 2018)
Big Data, Fast Data @ PayPal (YOW 2018)Big Data, Fast Data @ PayPal (YOW 2018)
Big Data, Fast Data @ PayPal (YOW 2018)
 
Building Better Data Pipelines using Apache Airflow
Building Better Data Pipelines using Apache AirflowBuilding Better Data Pipelines using Apache Airflow
Building Better Data Pipelines using Apache Airflow
 
Cloud Native Predictive Data Pipelines (micro talk)
Cloud Native Predictive Data Pipelines (micro talk)Cloud Native Predictive Data Pipelines (micro talk)
Cloud Native Predictive Data Pipelines (micro talk)
 
Cloud Native Data Pipelines (GoTo Chicago 2017)
Cloud Native Data Pipelines (GoTo Chicago 2017)Cloud Native Data Pipelines (GoTo Chicago 2017)
Cloud Native Data Pipelines (GoTo Chicago 2017)
 
Cloud Native Data Pipelines (DataEngConf SF 2017)
Cloud Native Data Pipelines (DataEngConf SF 2017)Cloud Native Data Pipelines (DataEngConf SF 2017)
Cloud Native Data Pipelines (DataEngConf SF 2017)
 
Cloud Native Data Pipelines (in Eng & Japanese) - QCon Tokyo
Cloud Native Data Pipelines (in Eng & Japanese)  - QCon TokyoCloud Native Data Pipelines (in Eng & Japanese)  - QCon Tokyo
Cloud Native Data Pipelines (in Eng & Japanese) - QCon Tokyo
 
Cloud Native Data Pipelines (QCon Shanghai & Tokyo 2016)
Cloud Native Data Pipelines (QCon Shanghai & Tokyo 2016)Cloud Native Data Pipelines (QCon Shanghai & Tokyo 2016)
Cloud Native Data Pipelines (QCon Shanghai & Tokyo 2016)
 
Introduction to Apache Airflow - Data Day Seattle 2016
Introduction to Apache Airflow - Data Day Seattle 2016Introduction to Apache Airflow - Data Day Seattle 2016
Introduction to Apache Airflow - Data Day Seattle 2016
 
Airflow @ Agari
Airflow @ Agari Airflow @ Agari
Airflow @ Agari
 
Resilient Predictive Data Pipelines (GOTO Chicago 2016)
Resilient Predictive Data Pipelines (GOTO Chicago 2016)Resilient Predictive Data Pipelines (GOTO Chicago 2016)
Resilient Predictive Data Pipelines (GOTO Chicago 2016)
 
Resilient Predictive Data Pipelines (QCon London 2016)
Resilient Predictive Data Pipelines (QCon London 2016)Resilient Predictive Data Pipelines (QCon London 2016)
Resilient Predictive Data Pipelines (QCon London 2016)
 
Software Developer and Architecture @ LinkedIn (QCon SF 2014)
Software Developer and Architecture @ LinkedIn (QCon SF 2014)Software Developer and Architecture @ LinkedIn (QCon SF 2014)
Software Developer and Architecture @ LinkedIn (QCon SF 2014)
 
LinkedIn's Segmentation & Targeting Platform (Hadoop Summit 2013)
LinkedIn's Segmentation & Targeting Platform (Hadoop Summit 2013)LinkedIn's Segmentation & Targeting Platform (Hadoop Summit 2013)
LinkedIn's Segmentation & Targeting Platform (Hadoop Summit 2013)
 
Building a Modern Website for Scale (QCon NY 2013)
Building a Modern Website for Scale (QCon NY 2013)Building a Modern Website for Scale (QCon NY 2013)
Building a Modern Website for Scale (QCon NY 2013)
 
Hands On with Maven
Hands On with MavenHands On with Maven
Hands On with Maven
 
Learning git
Learning gitLearning git
Learning git
 

Recently uploaded

Apps Break Data
Apps Break DataApps Break Data
Apps Break Data
Ivo Velitchkov
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
panagenda
 
Dandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity serverDandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity server
Antonios Katsarakis
 
Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)
Jakub Marek
 
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
Edge AI and Vision Alliance
 
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectorsConnector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
DianaGray10
 
Skybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoptionSkybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoption
Tatiana Kojar
 
Leveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and StandardsLeveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and Standards
Neo4j
 
Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides
Zilliz
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
Jason Packer
 
WeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation TechniquesWeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation Techniques
Postman
 
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-EfficiencyFreshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
ScyllaDB
 
9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...
9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...
9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...
saastr
 
JavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green MasterplanJavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green Masterplan
Miro Wengner
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
panagenda
 
Digital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
Digital Banking in the Cloud: How Citizens Bank Unlocked Their MainframeDigital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
Digital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
Precisely
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
innovationoecd
 
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development ProvidersYour One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
akankshawande
 
“How Axelera AI Uses Digital Compute-in-memory to Deliver Fast and Energy-eff...
“How Axelera AI Uses Digital Compute-in-memory to Deliver Fast and Energy-eff...“How Axelera AI Uses Digital Compute-in-memory to Deliver Fast and Energy-eff...
“How Axelera AI Uses Digital Compute-in-memory to Deliver Fast and Energy-eff...
Edge AI and Vision Alliance
 
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and BioinformaticiansBiomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Neo4j
 

Recently uploaded (20)

Apps Break Data
Apps Break DataApps Break Data
Apps Break Data
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
 
Dandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity serverDandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity server
 
Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)
 
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
 
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectorsConnector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
 
Skybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoptionSkybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoption
 
Leveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and StandardsLeveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and Standards
 
Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
 
WeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation TechniquesWeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation Techniques
 
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-EfficiencyFreshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
 
9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...
9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...
9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...
 
JavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green MasterplanJavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green Masterplan
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
 
Digital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
Digital Banking in the Cloud: How Citizens Bank Unlocked Their MainframeDigital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
Digital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
 
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development ProvidersYour One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
 
“How Axelera AI Uses Digital Compute-in-memory to Deliver Fast and Energy-eff...
“How Axelera AI Uses Digital Compute-in-memory to Deliver Fast and Energy-eff...“How Axelera AI Uses Digital Compute-in-memory to Deliver Fast and Energy-eff...
“How Axelera AI Uses Digital Compute-in-memory to Deliver Fast and Energy-eff...
 
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and BioinformaticiansBiomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
 

LinkedIn Data Infrastructure (QCon London 2012)

  • 1. Data Infrastructure @ LinkedIn Sid Anand QCon London 2012 @r39132 1
  • 2. About Me Current Life…   LinkedIn *   Web / Software Engineering   Search, Network, and Analytics (SNA)   Distributed Data Systems (DDS)   Me In a Previous Life…   Netflix, Cloud Database Architect   eBay, Web Development, Research Lab, & Search Engine *** And Many Years Prior…   Studying Distributed Systems at Cornell University @r39132 2
  • 3. Our mission Connect the world’s professionals to make them more productive and successful @r39132 3
  • 4. The world’s largest professional network Over 60% of members are now international 75% ** 150M+ * 90 Fortune 100 Companies use LinkedIn to hire >2M *** Company Pages 55 32 16 Languages ~4.2B 17 *** 8 2 4 Professional searches in 2011 2004 2005 2006 2007 2008 2009 2010 *as of February 9, 2012 **as of September 30, 2011 LinkedIn Members (Millions) ***as of December 31, 2011 @r39132 4
  • 5. Other Company Facts •  Headquartered in Mountain View, Calif., with offices around the world! * •  As of December 31, 2011, LinkedIn has 2,116 full-time employees located around the world. •  Currently around 650 people work in Engineering •  400 in Web/Software Engineering •  Plan to add another 200 in 2012 *** •  250 in Operations *as of February 9, 2012 **as of September 30, 2011 ***as of December 31, 2011 @r39132 5
  • 6. Agenda   Company Overview   Architecture –  Data Infrastructure Overview –  Technology Spotlight   Oracle   Voldemort   DataBus   Kafka   Q & A @r39132 6
  • 7. LinkedIn : Architecture Overview •  Our site runs primarily on Java, with some use of Scala for specific infrastructure •  What runs on Scala? •  Network Graph Service •  Kafka •  Most of our services run on Apache + Jetty @r39132 7
  • 8. LinkedIn : Architecture   A web page requests information A and B Presentation Tier   A thin layer focused on building the UI. It assembles the page by making parallel requests to BST services A A B B C C Business Service Tier   Encapsulates business logic. Can call other BST clusters and its own DST cluster. A A B B C C Data Service Tier   Encapsulates DAL logic and concerned with one Oracle Schema. Data Infrastructure   Concerned with the Master Slave Master Master persistent storage of and easy access to data Memcached Voldemort @r39132 8
  • 9. LinkedIn : Architecture   A web page requests information A and B Presentation Tier   A thin layer focused on building the UI. It assembles the page by making parallel requests to BST services A A B B C C Business Service Tier   Encapsulates business logic. Can call other BST clusters and its own DST cluster. A A B B C C Data Service Tier   Encapsulates DAL logic and concerned with one Oracle Schema. Data Infrastructure   Concerned with the Master Slave Master Master persistent storage of and easy access to data Memcached Voldemort @r39132 9
  • 10. LinkedIn : Data Infrastructure Technologies •  Database Technologies •  Oracle •  Voldemort * •  Espresso •  Data Replication Technologies •  Kafka •  DataBus •  Search Technologies •  Zoie – real-time search and indexing with Lucene •  Bobo – faceted search library for Lucene *** •  SenseiDB – fast, real-time, faceted, KV and full-text Search Engine and more @r39132 10
  • 11. LinkedIn : Data Infrastructure Technologies This talk will focus on a few of the key technologies below! •  Database Technologies * •  Oracle •  Voldemort •  Espresso – A new K-V store under development •  Data Replication Technologies •  Kafka •  DataBus •  Search Technologies *** •  Zoie – real-time search and indexing with Lucene •  Bobo – faceted search library for Lucene •  SenseiDB – fast, real-time, faceted, KV and full-text Search Engine and more @r39132 11
  • 12. LinkedIn Data Infrastructure Technologies Oracle: Source of Truth for User-Provided Data @r39132 12
  • 13. Oracle : Overview Oracle •  All user-provided data is stored in Oracle – our current source of truth •  About 50 Schemas running on tens of physical instances * •  With our user base and traffic growing at an accelerating pace, so how do we scale Oracle for user-provided data? Scaling Reads •  Oracle Slaves (c.f. DSC) •  Memcached •  Voldemort – for key-value lookups *** Scaling Writes •  Move to more expensive hardware or replace Oracle with something else @r39132 13
  • 14. Oracle : Overview – Data Service Context Scaling Oracle Reads using DSC   DSC uses a token (e.g. cookie) to ensure that a reader always sees his or her own writes immediately –  If I update my own status, it is okay if you don’t see the change for a few minutes, but I have to see it immediately @r39132 14
  • 15. Oracle : Overview – How DSC Works?   When a user writes data to the master, the DSC token (for that data domain) is updated with a timestamp   When the user reads data, we first attempt to read from a replica (a.k.a. slave) database   If the data in the slave is older than the data in the DSC token, we read from the Master instead @r39132 15
  • 16. LinkedIn Data Infrastructure Technologies Voldemort: Highly-Available Distributed Data Store @r39132 16
  • 17. Voldemort : Overview •  A distributed, persistent key-value store influenced by the AWS Dynamo paper •  Key Features of Dynamo   Highly Scalable, Available, and Performant   Achieves this via Tunable Consistency •  For higher consistency, the user accepts lower availability, scalability, and performance, and vice-versa   Provides several self-healing mechanisms when data does become inconsistent •  Read Repair   Repairs value for a key when the key is looked up/read •  Hinted Handoff   Buffers value for a key that wasn’t successfully written, then writes it later •  Anti-Entropy Repair   Scans the entire data set on a node and fixes it   Provides means to detect node failure and a means to recover from node failure •  Failure Detection •  Bootstrapping New Nodes @r39132 17
  • 18. Voldemort : Overview API Layered, Pluggable Architecture VectorClock<V> get (K key) Client put (K key, VectorClock<V> value) applyUpdate(UpdateAction action, int retries) Client API Conflict Resolution Voldemort-specific Features Serialization   Implements a layered, pluggable Repair Mechanism architecture Failure Detector Routing   Each layer implements a common interface (c.f. API). This allows us to replace or remove implementations at any layer Server Repair Mechanism •  Pluggable data storage layer Failure Detector Admin   BDB JE, Custom RO storage, Routing etc… Storage Engine •  Pluggable routing supports   Single or Multi-datacenter routing @r39132 18
  • 19. Voldemort : Overview Layered, Pluggable Architecture Client Voldemort-specific Features Client API Conflict Resolution •  Supports Fat client or Fat Server Serialization Repair Mechanism •  Repair Mechanism + Failure Failure Detector Detector + Routing can run on Routing server or client Server •  LinkedIn currently runs Fat Client, but we Repair Mechanism would like to move this to a Fat Server Failure Detector Admin Model Routing Storage Engine @r39132 19
  • 20. Where Does LinkedIn use Voldemort? @r39132 20
  • 21. Voldemort : Usage Patterns @ LinkedIn 2 Usage-Patterns   Read-Write Store –  Uses BDB JE for the storage engine –  50% of Voldemort Stores (aka Tables) are RW   Read-only Store –  Uses a custom Read-only format –  50% of Voldemort Stores (aka Tables) are RO   Let’s look at the RO Store @r39132 21
  • 22. Voldemort : RO Store Usage at LinkedIn People You May Know Viewers of this profile also viewed Related Searches Events you may be interested in LinkedIn Skills Jobs you may be interested in @r39132 22
  • 23. Voldemort : Usage Patterns @ LinkedIn RO Store Usage Pattern 1.  Use Hadoop to build a model 2.  Voldemort loads the output of Hadoop 3.  Voldemort serves fast key-value look-ups on the site –  e.g. For key=“Sid Anand”, get all the people that “Sid Anand” may know! –  e.g. For key=“Sid Anand”, get all the jobs that “Sid Anand” may be interested in! @r39132 23
  • 24. How Do The Voldemort RO Stores Perform? @r39132 24
  • 25. Voldemort : RO Store Performance : TP vs. Latency ● MySQL ● Voldemort ● MySQL ● Voldemort ● Voldemort ● MySQL median MySQL ● Voldemort 99th percentile ● median median ● 99th percentile ●percentile 99th median 99th percentile 3.5 ● ● ● 250 ● ● ● ● ● 3.5 3.5 ● 250 ● ● ● 3.5 3.0 ● 250 ● ● ● ● ● ● 250 ● ● ● 3.0 ● 200 ● ● ● ● ● ● 3.0 ● ● ● ● ● ● 3.0 ● latency (ms) 2.5 ● 200 ● ● ● ● 200 200 latency (ms) 2.5 latency (ms) 2.5 ● ● ● ● latency (ms) 2.5 2.0 ● ● ● ● ● ● ● 150 ● 2.0 2.0● ● ●● ● ● 150 150 2.0 ● 1.5 ● ● ● ● 150 ● ● ● ● ● ● ● 100 ●● ● ● 1.5 1.5 ● ● ● ●● ● ● 1.5 1.0 ● ● 100 100 ● ● ● ● ● ● 100 ●● ● 1.0 ● 1.0 ● 50 ● ● ● ● ● ● ● ● 1.0 0.5 ● ● ● ● 50 ● ● 50 ● ● ● ● 0.5 0.5 50 ● 0.5 0.0 0 0.0 0.0 0 0 0.0 100 200 300 400 500 600 700 0 100 200 300 400 500 600 700 100 200 300 400 500 600 700 throughput 300 400 500 600 700 500 600 700 100 200 100 200 300 400 500 600 700 (qps) 200 300 400 100 100 200 300 400 500 600 700 100 200 300 400 500 600 700 throughput (qps) (qps) throughput (qps) throughput 100 GB data, 24 GB RAM @r39132 25
  • 26. LinkedIn Data Infrastructure Solutions Databus : Timeline-Consistent Change Data Capture @r39132 26
  • 27. Where Does LinkedIn use DataBus? @r39132 27
  • 28. DataBus : Use-Cases @ LinkedIn Updates Standard Search Graph Read ization Index Index Replicas Oracle Data Change Events A user updates his profile with skills and position history. He also accepts a connection •  The write is made to an Oracle Master and DataBus replicates: •  the profile change to the Standardization service   E.G. the many forms of IBM are canonicalized for search-friendliness and recommendation-friendliness •  the profile change to the Search Index service   Recruiters can find you immediately by new keywords •  the connection change to the Graph Index service   The user can now start receiving feed updates from his new connections immediately @r39132 28
  • 29. DataBus : Architecture DataBus consists of 2 services Capture Relay •  Relay Services DB Changes Event Win •  Sharded •  Maintain an in-memory buffer per On-line shard Changes •  Each shard polls Oracle and then deserializes transactions into Avro Bootstrap •  Bootstrap Service DB •  Picks up online changes as they appear in the Relay •  Supports 2 types of operations from clients   If a client falls behind and needs records older than what the relay has, Bootstrap can send consolidated deltas!   If a new client comes on line, Bootstrap can send a consistent snapshot @r39132 29
  • 30. DataBus : Architecture Client Capture Relay Consumer 1 Client Lib On-line Databus DB Changes Changes Event Win Consumer n On-line ated Changes solid Con eT Sinc Delta Client Bootstrap Consumer 1 Client Lib Consistent Databus Snapshot at U Consumer n DB Guarantees   Transactional semantics   In-commit-order Delivery   At-least-once delivery   Durability (by data source)   High-availability and reliability   Low latency @r39132 30
  • 31. DataBus : Architecture - Bootstrap   Generate consistent snapshots and consolidated deltas during continuous updates with long-running queries Client Read Consumer 1 Client Lib Databus Online Relay Changes Consumer n Event Win Read Changes Bootstrap server Bootstrap Read Log Writer Server recent events Replay Log Storage events Snapshot Storage Log Applier @r39132 31
  • 32. LinkedIn Data Infrastructure Solutions Kafka: High-Volume Low-Latency Messaging System @r39132 32
  • 33. Kafka : Usage at LinkedIn Where as DataBus is used for Database change capture and replication, Kafka is used for application-level data streams Examples: •  End-user Action Tracking (a.k.a. Web Tracking) of •  Emails opened •  Pages seen •  Links followed •  Executing Searches •  Operational Metrics •  Network & System metrics such as •  TCP metrics (connection resets, message resends, etc…) •  System metrics (iops, CPU, load average, etc…) @r39132 33
  • 34. Kafka : Overview Broker Tier WebTier Consumers Push Sequential write sendfile Pull Events Events Iterator 1 Client Lib Topic 1 Kafka 100 MB/sec 200 MB/sec Topic 2 Iterator n Topic N Topic  Offset Topic, Partition Offset Ownership Zookeeper Management Features Guarantees Scale   Pub/Sub   At least once delivery   Billions of Events   Batch Send/Receive   Very high throughput   TBs per day   System Decoupling   Low latency   Inter-colo: few seconds   Durability   Typical retention: weeks   Horizontally Scalable @r39132 34
  • 35. Kafka : Overview Key Design Choices •  When reading from a file and sending to network socket, we typically incur 4 buffer copies and 2 OS system calls   Kafka leverages a sendFile API to eliminate 2 of the buffer copies and 1 of the system calls •  No double-buffering of messages - we rely on the OS page cache and do not store a copy of the message in the JVM   Less pressure on memory and GC   If the Kafka process is restarted on a machine, recently accessed messages are still in the page cache, so we get the benefit of a warm start •  Kafka doesn't keep track of which messages have yet to be consumed -- i.e. no book keeping overhead   Instead, messages have time-based SLA expiration -- after 7 days, messages are deleted @r39132 35
  • 36. How Does Kafka Perform? @r39132 36
  • 37. Kafka : Performance : Throughput vs. Latency (100 topics, 1 producer, 1 broker) 250 200 Consumer latency in ms 150 100 50 0 0 20 40 60 80 100 Producer throughput in MB/sec @r39132 37
  • 38. Kafka : Performance : Linear Incremental Scalability (10 topics, broker flush interval 100K) 400 381 350 300 Throughput in MB/s 293 250 200 190 150 100 101 50 0 1 broker 2 brokers 3 brokers 4 brokers @r39132 38
  • 39. Kafka : Performance : Resilience as Messages Pile Up (1  topic,  broker  flush  interval  10K)   200000 160000 Throughput  in  msg/s   120000 80000 40000 0 10 105 199 294 388 473 567 662 756 851 945 1039 Unconsumed data in GB @r39132 39
  • 40. Acknowledgments Presentation & Content   Chavdar Botev (DataBus) @cbotev   Roshan Sumbaly (Voldemort) @rsumbaly   Neha Narkhede (Kafka) @nehanarkhede Development Team Aditya Auradkar, Chavdar Botev, Shirshanka Das, Dave DeMaagd, Alex Feinberg, Phanindra Ganti, Lei Gao, Bhaskar Ghosh, Kishore Gopalakrishna, Brendan Harris, Joel Koshy, Kevin Krawez, Jay Kreps, Shi Lu, Sunil Nagaraj, Neha Narkhede, Sasha Pachev, Igor Perisic, Lin Qiao, Tom Quiggle, Jun Rao, Bob Schulman, Abraham Sebastian, Oliver Seeliger, Adam Silberstein, Boris Skolnick, Chinmay Soman, Roshan Sumbaly, Kapil Surlaker, Sajid Topiwala, Balaji Varadarajan, Jemiah Westerman, Zach White, David Zhang, and Jason Zhang @r39132 40
  • 41. y Questions? @r39132 41