SlideShare a Scribd company logo
Databus




1/29/2013   Recruiting Solutions
            `                        Databus   1
INTRODUCTION


  `            2
LinkedIn by Numbers
 World’s largest professional network
 187M+ members world-wide as of Q3 2012
   Growing at the rate of two per second
 85 of Fortune 100 companies use Talent Solutions
  to hire
 > 2.6M company pages
 > 4B search queries
 75K+ developers leveraging out APIs
 1.3M unique publishers


     `                    Databus                3
The Consequence of Specialization in
           Data Systems
Data Flow is essential
Data Consistency is critical !!!




        `
Solution: Databus



              Standardi
               Standardi    Standardi
                             Standardi   Standardi
                                          Standardi   Standardi
                                                       Standardi
                Standardi       Search       Graph         Read
  Updates




                zation
                 zation       zation
                               zation      zation
                                            zation      zation
                                                         zation
                  zation         Index       Index       Replicas




Primary
   DB                       Data Change Events

                               Databus

  `                                                                 5
Two Ways


    Application code dual    Extract changes from
    writes to database and   database commit log
    pub-sub system




    Easy on the surface      Tough but possible

    Consistent?              Consistent!!!



`
Key Design Decisions : Semantics
• Logical clocks attached to the source
  – Physical offsets could be used for internal
    transport
  – Simplifies data portability
• Pull model
  – Restarts are simple
  – Derived State = f (Source state, Clock)
  – + Idempotence = Timeline Consistent!

     `                                            7
Key Design Decisions : Systems
• Isolate fast consumers from slow consumers
  – Workload separation between online, catch-up,
    bootstrap
• Isolate sources from consumers
  – Schema changes
  – Physical layout changes
  – Speed mismatch
• Schema-aware
  – Filtering, Projections
  – Typically network-bound  can burn more CPU

     `                                              8
Requirements
•   Timeline consistency
•   Guaranteed, at least once delivery
•   Low latency
•   Schema evolution
•   Source independence
•   Scalable consumers
•   Handle for slow/new consumers without
    affecting happy ones (look-back requirements)

      `                                         9
ARCHITECTURE


  `            10
0
                          Initial Design (2007)                                   Happy
                                                                                 Consumer
         Source clock
            timer




                                                                                       …
             SCN

                                 Direct Pull              Relay                    Happy
                            DB                           In Memory                Consumer
 70000                                                     Buffer
                                  Proxied
             3 hrs
                                    Pull
100000       Relay
102400                                                                             Slow
                     DB
                                                                                 Consumer




Pros:
                                                  Cons:
1. Consumer Scaling
                                                  Slow consumers overwhelming the DB
2. Some isolation


                `                              Databus                                 11
Software Architecture
                 Four Logical Components

                  • Fetcher
                      – Fetch from db, relay…
                  • Log Store
                      – Store log snippet
                  • Snapshot Store
                      – Store moving data
                        snapshot
                  • Subscription Client
                      – Orchestrate pull
                        across these


`
0
        Source clock
           timer
            SCN
                                  The Databus System                                    Happy
                       Snapshot                                                        Consumer




                                                                                         …
                                  infinite
30000                  Log
                                                           Relay                        Happy
                       10 days                        In Memory                        Consumer
 70000   Relay
                                                        Buffer
 80000
 90000     3 hrs
100000
102400                                                                                   Slow
                       DB
                                                                                       Consumer


                                                                   Server


                                             Log Storage              Snapshot Store


                                                                   Bootstrap Service

                   `                                                                       13
The Relay
•   Change event buffering (~ 2 – 7 days)
•   Low latency (10-15 ms)
•   Filtering, Projection
•   Hundreds of consumers per relay
•   Scale-out, High-availability through
    redundancy



       `
Deployment Options




Option 1: Peered Deployment   Option 2: Clustered Deployment

     `
The Bootstrap Service
•   Catch-all for slow / new consumers
•   Isolate source OLTP instance from large scans
•   Log Store + Snapshot Store
•   Optimizations
    – Periodic merge
    – Predicate push-down
    – Catch-up versus full bootstrap
• Guaranteed progress for consumers via chunking
• Implementations
    – Database (MySQL)
    – Raw Files
• Bridges the continuum between stream and batch systems

       `
The Consumer Client Library
• Glue between Databus infra and business logic
  in the consumer
• Isolates the consumer from changes in the
  databus layer.
• Switches between relay and bootstrap as
  needed
• API
  – Callback with transactions
  – Iterators over windows

    `
Fetcher Implementations
• Oracle
  – Trigger-based
• MySQL
  – Custom-storage-engine based
• In Labs
  – Alternative implementations for Oracle
  – OpenReplicator integration for MySQL


     `
Meta-data Management
• Event definition, serialization and transport
  – Avro
• Oracle, MySQL
  – Avro definition generated from the table schema
• Schema evolution
  – Only backwards-compatible changes allowed
• Isolation between upgrades on producer and
  consumer

     `
Scaling the consumers
                (Partitioning)
• Server-side filtering
  – Range, mod, hash
  – Allows client to control partitioning function
• Consumer groups
  – Distribute partitions evenly across a group
  – Move partitions to available consumers on failure
  – Minimize re-processing



     `
A NEW CONSUMER


 `               21
Development with Databus – Client
                   Library
    Databus Client

                     Consumers
                      Consumers

                          implement



          Stream Event      Bootstrap Event
             Callback          Callback                        Client API
               API                API

                                   Databus Client Library

onDataEvent(DbusEvent, Decoder)                  register(consumers, sources , filter)
…                                                start() ,
…
                                                 shutdown(),

           `                           Databus                                      22
Databus Consumer Implementation
class MyConsumer
      extends AbstractDatabusStreamConsumer
{
   ConsumerCallbackResult onDataEvent(DbusEvent e,
                                       DbusEventDecoder d){
    //use map-like Avro GenericRecord
    GenericRecord g = d.getGenericRecord(e, null);
    //or use the auto-generated Java class
    MyEvent e = d.getTypedValue(e, null,
                                            MyEvent.class);
    …
    return ConsumerCallbackResult.SUCCESS;
  }
}

     `                     Databus                       23
Starting the client
public void main(String[]) {
  //configure
  DatabusHttpClientImpl.Config clientConfig =
                          new DatabusHttpClientImpl.Config();
  clientConfig.loadFromFile(“mydbus”, “mdbus.props”);
  DatabusHttpClientImpl client =
               new DatabusHttpClientImpl(clientConfig);
  //register callback
  MyConsumer callback = new MyConsumer();
  client.registerDatabusStreamListener(callback,
          null, "com.linkedin.events.member2.MemberProfile”);
  //start client library
  client.startAndBlock();
}

        `                    Databus                        24
Event Callback APIs
•




    `           Databus       25
PERFORMANCE


 `            26
Relay Throughput




`         Databus      27
Consumer Throughput




`           Databus       28
End-End Latency




`         Databus     29
Snapshot vs Catchup




`           Databus       30
Recruiting Solutions   31

More Related Content

What's hot

Realizing the promise of portable data processing with Apache Beam
Realizing the promise of portable data processing with Apache BeamRealizing the promise of portable data processing with Apache Beam
Realizing the promise of portable data processing with Apache Beam
DataWorks Summit
 
PayPal Big Data and MySQL Cluster
PayPal Big Data and MySQL ClusterPayPal Big Data and MySQL Cluster
PayPal Big Data and MySQL Cluster
Mat Keep
 
Rds data lake @ Robinhood
Rds data lake @ Robinhood Rds data lake @ Robinhood
Rds data lake @ Robinhood
BalajiVaradarajan13
 
Characteristics of no sql databases
Characteristics of no sql databasesCharacteristics of no sql databases
Characteristics of no sql databasesDipti Borkar
 
How LinkedIn uses memcached, a spoonful of SOA, and a sprinkle of SQL to scale
How LinkedIn uses memcached, a spoonful of SOA, and a sprinkle of SQL to scaleHow LinkedIn uses memcached, a spoonful of SOA, and a sprinkle of SQL to scale
How LinkedIn uses memcached, a spoonful of SOA, and a sprinkle of SQL to scale
LinkedIn
 
Couchbase and Apache Spark
Couchbase and Apache SparkCouchbase and Apache Spark
Couchbase and Apache Spark
Matt Ingenthron
 
Fine-Grained Scheduling with Helix (ApacheCon NA 2014)
Fine-Grained Scheduling with Helix (ApacheCon NA 2014)Fine-Grained Scheduling with Helix (ApacheCon NA 2014)
Fine-Grained Scheduling with Helix (ApacheCon NA 2014)
Kanak Biscuitwala
 
Aceleracion de aplicacione 2
Aceleracion de aplicacione 2Aceleracion de aplicacione 2
Aceleracion de aplicacione 2jfth
 
In Memory Data Grids, Demystified!
In Memory Data Grids, Demystified! In Memory Data Grids, Demystified!
In Memory Data Grids, Demystified!
Uri Cohen
 
More Data, More Problems: Scaling Kafka-Mirroring Pipelines at LinkedIn
More Data, More Problems: Scaling Kafka-Mirroring Pipelines at LinkedIn More Data, More Problems: Scaling Kafka-Mirroring Pipelines at LinkedIn
More Data, More Problems: Scaling Kafka-Mirroring Pipelines at LinkedIn
confluent
 
Improvements to Flink & it's Applications in Alibaba Search
Improvements to Flink & it's Applications in Alibaba SearchImprovements to Flink & it's Applications in Alibaba Search
Improvements to Flink & it's Applications in Alibaba Search
DataWorks Summit/Hadoop Summit
 
Active/Active Database Solutions with Log Based Replication in xDB 6.0
Active/Active Database Solutions with Log Based Replication in xDB 6.0Active/Active Database Solutions with Log Based Replication in xDB 6.0
Active/Active Database Solutions with Log Based Replication in xDB 6.0
EDB
 
Data Lake and the rise of the microservices
Data Lake and the rise of the microservicesData Lake and the rise of the microservices
Data Lake and the rise of the microservices
Bigstep
 
Deep Dive into GPU Support in Apache Spark 3.x
Deep Dive into GPU Support in Apache Spark 3.xDeep Dive into GPU Support in Apache Spark 3.x
Deep Dive into GPU Support in Apache Spark 3.x
Databricks
 
Interactive Visualization of Streaming Data Powered by Spark by Ruhollah Farc...
Interactive Visualization of Streaming Data Powered by Spark by Ruhollah Farc...Interactive Visualization of Streaming Data Powered by Spark by Ruhollah Farc...
Interactive Visualization of Streaming Data Powered by Spark by Ruhollah Farc...
Spark Summit
 
Building Apps with Distributed In-Memory Computing Using Apache Geode
Building Apps with Distributed In-Memory Computing Using Apache GeodeBuilding Apps with Distributed In-Memory Computing Using Apache Geode
Building Apps with Distributed In-Memory Computing Using Apache Geode
PivotalOpenSourceHub
 
Azure and cloud design patterns
Azure and cloud design patternsAzure and cloud design patterns
Azure and cloud design patterns
Venkatesh Narayanan
 
Spark and Couchbase– Augmenting the Operational Database with Spark
Spark and Couchbase– Augmenting the Operational Database with SparkSpark and Couchbase– Augmenting the Operational Database with Spark
Spark and Couchbase– Augmenting the Operational Database with Spark
Matt Ingenthron
 
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
Spark Summit
 
Speed up UDFs with GPUs using the RAPIDS Accelerator
Speed up UDFs with GPUs using the RAPIDS AcceleratorSpeed up UDFs with GPUs using the RAPIDS Accelerator
Speed up UDFs with GPUs using the RAPIDS Accelerator
Databricks
 

What's hot (20)

Realizing the promise of portable data processing with Apache Beam
Realizing the promise of portable data processing with Apache BeamRealizing the promise of portable data processing with Apache Beam
Realizing the promise of portable data processing with Apache Beam
 
PayPal Big Data and MySQL Cluster
PayPal Big Data and MySQL ClusterPayPal Big Data and MySQL Cluster
PayPal Big Data and MySQL Cluster
 
Rds data lake @ Robinhood
Rds data lake @ Robinhood Rds data lake @ Robinhood
Rds data lake @ Robinhood
 
Characteristics of no sql databases
Characteristics of no sql databasesCharacteristics of no sql databases
Characteristics of no sql databases
 
How LinkedIn uses memcached, a spoonful of SOA, and a sprinkle of SQL to scale
How LinkedIn uses memcached, a spoonful of SOA, and a sprinkle of SQL to scaleHow LinkedIn uses memcached, a spoonful of SOA, and a sprinkle of SQL to scale
How LinkedIn uses memcached, a spoonful of SOA, and a sprinkle of SQL to scale
 
Couchbase and Apache Spark
Couchbase and Apache SparkCouchbase and Apache Spark
Couchbase and Apache Spark
 
Fine-Grained Scheduling with Helix (ApacheCon NA 2014)
Fine-Grained Scheduling with Helix (ApacheCon NA 2014)Fine-Grained Scheduling with Helix (ApacheCon NA 2014)
Fine-Grained Scheduling with Helix (ApacheCon NA 2014)
 
Aceleracion de aplicacione 2
Aceleracion de aplicacione 2Aceleracion de aplicacione 2
Aceleracion de aplicacione 2
 
In Memory Data Grids, Demystified!
In Memory Data Grids, Demystified! In Memory Data Grids, Demystified!
In Memory Data Grids, Demystified!
 
More Data, More Problems: Scaling Kafka-Mirroring Pipelines at LinkedIn
More Data, More Problems: Scaling Kafka-Mirroring Pipelines at LinkedIn More Data, More Problems: Scaling Kafka-Mirroring Pipelines at LinkedIn
More Data, More Problems: Scaling Kafka-Mirroring Pipelines at LinkedIn
 
Improvements to Flink & it's Applications in Alibaba Search
Improvements to Flink & it's Applications in Alibaba SearchImprovements to Flink & it's Applications in Alibaba Search
Improvements to Flink & it's Applications in Alibaba Search
 
Active/Active Database Solutions with Log Based Replication in xDB 6.0
Active/Active Database Solutions with Log Based Replication in xDB 6.0Active/Active Database Solutions with Log Based Replication in xDB 6.0
Active/Active Database Solutions with Log Based Replication in xDB 6.0
 
Data Lake and the rise of the microservices
Data Lake and the rise of the microservicesData Lake and the rise of the microservices
Data Lake and the rise of the microservices
 
Deep Dive into GPU Support in Apache Spark 3.x
Deep Dive into GPU Support in Apache Spark 3.xDeep Dive into GPU Support in Apache Spark 3.x
Deep Dive into GPU Support in Apache Spark 3.x
 
Interactive Visualization of Streaming Data Powered by Spark by Ruhollah Farc...
Interactive Visualization of Streaming Data Powered by Spark by Ruhollah Farc...Interactive Visualization of Streaming Data Powered by Spark by Ruhollah Farc...
Interactive Visualization of Streaming Data Powered by Spark by Ruhollah Farc...
 
Building Apps with Distributed In-Memory Computing Using Apache Geode
Building Apps with Distributed In-Memory Computing Using Apache GeodeBuilding Apps with Distributed In-Memory Computing Using Apache Geode
Building Apps with Distributed In-Memory Computing Using Apache Geode
 
Azure and cloud design patterns
Azure and cloud design patternsAzure and cloud design patterns
Azure and cloud design patterns
 
Spark and Couchbase– Augmenting the Operational Database with Spark
Spark and Couchbase– Augmenting the Operational Database with SparkSpark and Couchbase– Augmenting the Operational Database with Spark
Spark and Couchbase– Augmenting the Operational Database with Spark
 
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
 
Speed up UDFs with GPUs using the RAPIDS Accelerator
Speed up UDFs with GPUs using the RAPIDS AcceleratorSpeed up UDFs with GPUs using the RAPIDS Accelerator
Speed up UDFs with GPUs using the RAPIDS Accelerator
 

Similar to Introduction to Databus

Overcoming the Top Four Challenges to Real‐Time Performance in Large‐Scale, D...
Overcoming the Top Four Challenges to Real‐Time Performance in Large‐Scale, D...Overcoming the Top Four Challenges to Real‐Time Performance in Large‐Scale, D...
Overcoming the Top Four Challenges to Real‐Time Performance in Large‐Scale, D...
SL Corporation
 
SQL Server 2008 Fast Track Data Warehouse
SQL Server 2008 Fast Track Data WarehouseSQL Server 2008 Fast Track Data Warehouse
SQL Server 2008 Fast Track Data Warehouse
Mark Ginnebaugh
 
Overcoming the Top Four Challenges to Real-Time Performance in Large-Scale, D...
Overcoming the Top Four Challenges to Real-Time Performance in Large-Scale, D...Overcoming the Top Four Challenges to Real-Time Performance in Large-Scale, D...
Overcoming the Top Four Challenges to Real-Time Performance in Large-Scale, D...
SL Corporation
 
Lug best practice_hpc_workflow
Lug best practice_hpc_workflowLug best practice_hpc_workflow
Lug best practice_hpc_workflowrjmurphyslideshare
 
Data Infrastructure at LinkedIn
Data Infrastructure at LinkedInData Infrastructure at LinkedIn
Data Infrastructure at LinkedIn
Amy W. Tang
 
How to Build a SaaS App With Twitter-like Throughput on Just 9 Servers
How to Build a SaaS App With Twitter-like Throughput on Just 9 ServersHow to Build a SaaS App With Twitter-like Throughput on Just 9 Servers
How to Build a SaaS App With Twitter-like Throughput on Just 9 Servers
New Relic
 
Times Ten in-memory database when time counts - Laszlo Ludas
Times Ten in-memory database when time counts - Laszlo LudasTimes Ten in-memory database when time counts - Laszlo Ludas
Times Ten in-memory database when time counts - Laszlo Ludas
ORACLE USER GROUP ESTONIA
 
Xldb2011 tue 1005_linked_in
Xldb2011 tue 1005_linked_inXldb2011 tue 1005_linked_in
Xldb2011 tue 1005_linked_inliqiang xu
 
Bloomreach - BloomStore Compute Cloud Infrastructure
Bloomreach - BloomStore Compute Cloud Infrastructure Bloomreach - BloomStore Compute Cloud Infrastructure
Bloomreach - BloomStore Compute Cloud Infrastructure
bloomreacheng
 
Lync 2010 High Availability
Lync 2010 High AvailabilityLync 2010 High Availability
Lync 2010 High Availability
Harold Wong
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to HadoopOvidiu Dimulescu
 
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
ScyllaDB
 
Using Distributed In-Memory Computing for Fast Data Analysis
Using Distributed In-Memory Computing for Fast Data AnalysisUsing Distributed In-Memory Computing for Fast Data Analysis
Using Distributed In-Memory Computing for Fast Data Analysis
ScaleOut Software
 
Databus - LinkedIn's Change Data Capture Pipeline
Databus - LinkedIn's Change Data Capture PipelineDatabus - LinkedIn's Change Data Capture Pipeline
Databus - LinkedIn's Change Data Capture Pipeline
Sunil Nagaraj
 
Amazon Kinesis
Amazon KinesisAmazon Kinesis
Amazon Kinesis
Amazon Web Services
 
Ultimate SharePoint Infrastructure Best Practices Session - Live360 Orlando 2012
Ultimate SharePoint Infrastructure Best Practices Session - Live360 Orlando 2012Ultimate SharePoint Infrastructure Best Practices Session - Live360 Orlando 2012
Ultimate SharePoint Infrastructure Best Practices Session - Live360 Orlando 2012Michael Noel
 
M6d cassandrapresentation
M6d cassandrapresentationM6d cassandrapresentation
M6d cassandrapresentationEdward Capriolo
 
Zeroth review presentation - eBay Turmeric / SMC
Zeroth review presentation - eBay Turmeric / SMCZeroth review presentation - eBay Turmeric / SMC
Zeroth review presentation - eBay Turmeric / SMCArvind Krishnaa
 
Top 6 Reasons to Use a Distributed Data Grid
Top 6 Reasons to Use a Distributed Data GridTop 6 Reasons to Use a Distributed Data Grid
Top 6 Reasons to Use a Distributed Data Grid
ScaleOut Software
 
The 5 Stages of Scale
The 5 Stages of ScaleThe 5 Stages of Scale
The 5 Stages of Scale
xcbsmith
 

Similar to Introduction to Databus (20)

Overcoming the Top Four Challenges to Real‐Time Performance in Large‐Scale, D...
Overcoming the Top Four Challenges to Real‐Time Performance in Large‐Scale, D...Overcoming the Top Four Challenges to Real‐Time Performance in Large‐Scale, D...
Overcoming the Top Four Challenges to Real‐Time Performance in Large‐Scale, D...
 
SQL Server 2008 Fast Track Data Warehouse
SQL Server 2008 Fast Track Data WarehouseSQL Server 2008 Fast Track Data Warehouse
SQL Server 2008 Fast Track Data Warehouse
 
Overcoming the Top Four Challenges to Real-Time Performance in Large-Scale, D...
Overcoming the Top Four Challenges to Real-Time Performance in Large-Scale, D...Overcoming the Top Four Challenges to Real-Time Performance in Large-Scale, D...
Overcoming the Top Four Challenges to Real-Time Performance in Large-Scale, D...
 
Lug best practice_hpc_workflow
Lug best practice_hpc_workflowLug best practice_hpc_workflow
Lug best practice_hpc_workflow
 
Data Infrastructure at LinkedIn
Data Infrastructure at LinkedInData Infrastructure at LinkedIn
Data Infrastructure at LinkedIn
 
How to Build a SaaS App With Twitter-like Throughput on Just 9 Servers
How to Build a SaaS App With Twitter-like Throughput on Just 9 ServersHow to Build a SaaS App With Twitter-like Throughput on Just 9 Servers
How to Build a SaaS App With Twitter-like Throughput on Just 9 Servers
 
Times Ten in-memory database when time counts - Laszlo Ludas
Times Ten in-memory database when time counts - Laszlo LudasTimes Ten in-memory database when time counts - Laszlo Ludas
Times Ten in-memory database when time counts - Laszlo Ludas
 
Xldb2011 tue 1005_linked_in
Xldb2011 tue 1005_linked_inXldb2011 tue 1005_linked_in
Xldb2011 tue 1005_linked_in
 
Bloomreach - BloomStore Compute Cloud Infrastructure
Bloomreach - BloomStore Compute Cloud Infrastructure Bloomreach - BloomStore Compute Cloud Infrastructure
Bloomreach - BloomStore Compute Cloud Infrastructure
 
Lync 2010 High Availability
Lync 2010 High AvailabilityLync 2010 High Availability
Lync 2010 High Availability
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
 
Using Distributed In-Memory Computing for Fast Data Analysis
Using Distributed In-Memory Computing for Fast Data AnalysisUsing Distributed In-Memory Computing for Fast Data Analysis
Using Distributed In-Memory Computing for Fast Data Analysis
 
Databus - LinkedIn's Change Data Capture Pipeline
Databus - LinkedIn's Change Data Capture PipelineDatabus - LinkedIn's Change Data Capture Pipeline
Databus - LinkedIn's Change Data Capture Pipeline
 
Amazon Kinesis
Amazon KinesisAmazon Kinesis
Amazon Kinesis
 
Ultimate SharePoint Infrastructure Best Practices Session - Live360 Orlando 2012
Ultimate SharePoint Infrastructure Best Practices Session - Live360 Orlando 2012Ultimate SharePoint Infrastructure Best Practices Session - Live360 Orlando 2012
Ultimate SharePoint Infrastructure Best Practices Session - Live360 Orlando 2012
 
M6d cassandrapresentation
M6d cassandrapresentationM6d cassandrapresentation
M6d cassandrapresentation
 
Zeroth review presentation - eBay Turmeric / SMC
Zeroth review presentation - eBay Turmeric / SMCZeroth review presentation - eBay Turmeric / SMC
Zeroth review presentation - eBay Turmeric / SMC
 
Top 6 Reasons to Use a Distributed Data Grid
Top 6 Reasons to Use a Distributed Data GridTop 6 Reasons to Use a Distributed Data Grid
Top 6 Reasons to Use a Distributed Data Grid
 
The 5 Stages of Scale
The 5 Stages of ScaleThe 5 Stages of Scale
The 5 Stages of Scale
 

More from Amy W. Tang

Data Infrastructure at LinkedIn
Data Infrastructure at LinkedInData Infrastructure at LinkedIn
Data Infrastructure at LinkedIn
Amy W. Tang
 
LinkedIn Segmentation & Targeting Platform: A Big Data Application
LinkedIn Segmentation & Targeting Platform: A Big Data ApplicationLinkedIn Segmentation & Targeting Platform: A Big Data Application
LinkedIn Segmentation & Targeting Platform: A Big Data Application
Amy W. Tang
 
Building a Real-Time Data Pipeline: Apache Kafka at LinkedIn
Building a Real-Time Data Pipeline: Apache Kafka at LinkedInBuilding a Real-Time Data Pipeline: Apache Kafka at LinkedIn
Building a Real-Time Data Pipeline: Apache Kafka at LinkedIn
Amy W. Tang
 
Espresso: LinkedIn's Distributed Data Serving Platform (Talk)
Espresso: LinkedIn's Distributed Data Serving Platform (Talk)Espresso: LinkedIn's Distributed Data Serving Platform (Talk)
Espresso: LinkedIn's Distributed Data Serving Platform (Talk)
Amy W. Tang
 
Espresso: LinkedIn's Distributed Data Serving Platform (Paper)
Espresso: LinkedIn's Distributed Data Serving Platform (Paper)Espresso: LinkedIn's Distributed Data Serving Platform (Paper)
Espresso: LinkedIn's Distributed Data Serving Platform (Paper)
Amy W. Tang
 
Building Distributed Systems Using Helix
Building Distributed Systems Using HelixBuilding Distributed Systems Using Helix
Building Distributed Systems Using Helix
Amy W. Tang
 
LinkedIn Graph Presentation
LinkedIn Graph PresentationLinkedIn Graph Presentation
LinkedIn Graph Presentation
Amy W. Tang
 
Data Infrastructure at LinkedIn
Data Infrastructure at LinkedIn Data Infrastructure at LinkedIn
Data Infrastructure at LinkedIn
Amy W. Tang
 
Voldemort on Solid State Drives
Voldemort on Solid State DrivesVoldemort on Solid State Drives
Voldemort on Solid State Drives
Amy W. Tang
 
Untangling Cluster Management with Helix
Untangling Cluster Management with HelixUntangling Cluster Management with Helix
Untangling Cluster Management with Helix
Amy W. Tang
 
A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedIn
A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedInA Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedIn
A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedIn
Amy W. Tang
 

More from Amy W. Tang (11)

Data Infrastructure at LinkedIn
Data Infrastructure at LinkedInData Infrastructure at LinkedIn
Data Infrastructure at LinkedIn
 
LinkedIn Segmentation & Targeting Platform: A Big Data Application
LinkedIn Segmentation & Targeting Platform: A Big Data ApplicationLinkedIn Segmentation & Targeting Platform: A Big Data Application
LinkedIn Segmentation & Targeting Platform: A Big Data Application
 
Building a Real-Time Data Pipeline: Apache Kafka at LinkedIn
Building a Real-Time Data Pipeline: Apache Kafka at LinkedInBuilding a Real-Time Data Pipeline: Apache Kafka at LinkedIn
Building a Real-Time Data Pipeline: Apache Kafka at LinkedIn
 
Espresso: LinkedIn's Distributed Data Serving Platform (Talk)
Espresso: LinkedIn's Distributed Data Serving Platform (Talk)Espresso: LinkedIn's Distributed Data Serving Platform (Talk)
Espresso: LinkedIn's Distributed Data Serving Platform (Talk)
 
Espresso: LinkedIn's Distributed Data Serving Platform (Paper)
Espresso: LinkedIn's Distributed Data Serving Platform (Paper)Espresso: LinkedIn's Distributed Data Serving Platform (Paper)
Espresso: LinkedIn's Distributed Data Serving Platform (Paper)
 
Building Distributed Systems Using Helix
Building Distributed Systems Using HelixBuilding Distributed Systems Using Helix
Building Distributed Systems Using Helix
 
LinkedIn Graph Presentation
LinkedIn Graph PresentationLinkedIn Graph Presentation
LinkedIn Graph Presentation
 
Data Infrastructure at LinkedIn
Data Infrastructure at LinkedIn Data Infrastructure at LinkedIn
Data Infrastructure at LinkedIn
 
Voldemort on Solid State Drives
Voldemort on Solid State DrivesVoldemort on Solid State Drives
Voldemort on Solid State Drives
 
Untangling Cluster Management with Helix
Untangling Cluster Management with HelixUntangling Cluster Management with Helix
Untangling Cluster Management with Helix
 
A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedIn
A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedInA Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedIn
A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedIn
 

Recently uploaded

When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
Elena Simperl
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
Elena Simperl
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
DianaGray10
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Product School
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Tobias Schneck
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
Product School
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
Dorra BARTAGUIZ
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
Paul Groth
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
Frank van Harmelen
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
DianaGray10
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Inflectra
 

Recently uploaded (20)

When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
 

Introduction to Databus

  • 1. Databus 1/29/2013 Recruiting Solutions ` Databus 1
  • 3. LinkedIn by Numbers  World’s largest professional network  187M+ members world-wide as of Q3 2012  Growing at the rate of two per second  85 of Fortune 100 companies use Talent Solutions to hire  > 2.6M company pages  > 4B search queries  75K+ developers leveraging out APIs  1.3M unique publishers ` Databus 3
  • 4. The Consequence of Specialization in Data Systems Data Flow is essential Data Consistency is critical !!! `
  • 5. Solution: Databus Standardi Standardi Standardi Standardi Standardi Standardi Standardi Standardi Standardi Search Graph Read Updates zation zation zation zation zation zation zation zation zation Index Index Replicas Primary DB Data Change Events Databus ` 5
  • 6. Two Ways Application code dual Extract changes from writes to database and database commit log pub-sub system Easy on the surface Tough but possible Consistent? Consistent!!! `
  • 7. Key Design Decisions : Semantics • Logical clocks attached to the source – Physical offsets could be used for internal transport – Simplifies data portability • Pull model – Restarts are simple – Derived State = f (Source state, Clock) – + Idempotence = Timeline Consistent! ` 7
  • 8. Key Design Decisions : Systems • Isolate fast consumers from slow consumers – Workload separation between online, catch-up, bootstrap • Isolate sources from consumers – Schema changes – Physical layout changes – Speed mismatch • Schema-aware – Filtering, Projections – Typically network-bound  can burn more CPU ` 8
  • 9. Requirements • Timeline consistency • Guaranteed, at least once delivery • Low latency • Schema evolution • Source independence • Scalable consumers • Handle for slow/new consumers without affecting happy ones (look-back requirements) ` 9
  • 11. 0 Initial Design (2007) Happy Consumer Source clock timer … SCN Direct Pull Relay Happy DB In Memory Consumer 70000 Buffer Proxied 3 hrs Pull 100000 Relay 102400 Slow DB Consumer Pros: Cons: 1. Consumer Scaling Slow consumers overwhelming the DB 2. Some isolation ` Databus 11
  • 12. Software Architecture Four Logical Components • Fetcher – Fetch from db, relay… • Log Store – Store log snippet • Snapshot Store – Store moving data snapshot • Subscription Client – Orchestrate pull across these `
  • 13. 0 Source clock timer SCN The Databus System Happy Snapshot Consumer … infinite 30000 Log Relay Happy 10 days In Memory Consumer 70000 Relay Buffer 80000 90000 3 hrs 100000 102400 Slow DB Consumer Server Log Storage Snapshot Store Bootstrap Service ` 13
  • 14. The Relay • Change event buffering (~ 2 – 7 days) • Low latency (10-15 ms) • Filtering, Projection • Hundreds of consumers per relay • Scale-out, High-availability through redundancy `
  • 15. Deployment Options Option 1: Peered Deployment Option 2: Clustered Deployment `
  • 16. The Bootstrap Service • Catch-all for slow / new consumers • Isolate source OLTP instance from large scans • Log Store + Snapshot Store • Optimizations – Periodic merge – Predicate push-down – Catch-up versus full bootstrap • Guaranteed progress for consumers via chunking • Implementations – Database (MySQL) – Raw Files • Bridges the continuum between stream and batch systems `
  • 17. The Consumer Client Library • Glue between Databus infra and business logic in the consumer • Isolates the consumer from changes in the databus layer. • Switches between relay and bootstrap as needed • API – Callback with transactions – Iterators over windows `
  • 18. Fetcher Implementations • Oracle – Trigger-based • MySQL – Custom-storage-engine based • In Labs – Alternative implementations for Oracle – OpenReplicator integration for MySQL `
  • 19. Meta-data Management • Event definition, serialization and transport – Avro • Oracle, MySQL – Avro definition generated from the table schema • Schema evolution – Only backwards-compatible changes allowed • Isolation between upgrades on producer and consumer `
  • 20. Scaling the consumers (Partitioning) • Server-side filtering – Range, mod, hash – Allows client to control partitioning function • Consumer groups – Distribute partitions evenly across a group – Move partitions to available consumers on failure – Minimize re-processing `
  • 22. Development with Databus – Client Library Databus Client Consumers Consumers implement Stream Event Bootstrap Event Callback Callback Client API API API Databus Client Library onDataEvent(DbusEvent, Decoder) register(consumers, sources , filter) … start() , … shutdown(), ` Databus 22
  • 23. Databus Consumer Implementation class MyConsumer extends AbstractDatabusStreamConsumer { ConsumerCallbackResult onDataEvent(DbusEvent e, DbusEventDecoder d){ //use map-like Avro GenericRecord GenericRecord g = d.getGenericRecord(e, null); //or use the auto-generated Java class MyEvent e = d.getTypedValue(e, null, MyEvent.class); … return ConsumerCallbackResult.SUCCESS; } } ` Databus 23
  • 24. Starting the client public void main(String[]) { //configure DatabusHttpClientImpl.Config clientConfig = new DatabusHttpClientImpl.Config(); clientConfig.loadFromFile(“mydbus”, “mdbus.props”); DatabusHttpClientImpl client = new DatabusHttpClientImpl(clientConfig); //register callback MyConsumer callback = new MyConsumer(); client.registerDatabusStreamListener(callback, null, "com.linkedin.events.member2.MemberProfile”); //start client library client.startAndBlock(); } ` Databus 24
  • 27. Relay Throughput ` Databus 27
  • 28. Consumer Throughput ` Databus 28
  • 29. End-End Latency ` Databus 29
  • 30. Snapshot vs Catchup ` Databus 30