Large-scale storage systems are distributed by nature: data is stored on multiple machines. Often the same content is stored in several places to support different access patterns, efficient retrieval, or quick lookup of derived data (as opposed to computing it during the lookup). For such distributed systems to work together to provide a service, two things are needed:
- Data flow between these systems: when data is updated on one system, the change should be reflected in the other parts that store the same content.
- Data consistency: the different parts of the system must converge at some point in time.
So, we need a change capture system that supports such a distributed system.
Basically, there are two ways changes can be captured:
- Applications can dual-write data into the database and the change stream.
- We capture the list of data changes from the commit logs that most databases have.
Dual writes appear really easy on the surface. But once we start considering transient failure scenarios, achieving consistency gets harder, sometimes impossible. You may need to get into two-phase commits and the like, compromising on performance or availability.
On the other hand, extracting changes from the database is almost like post-processing: there is minimal or no performance penalty, the application is unaware of the existence of a change capture system, and consistency is not an issue as long as you see the commit logs in their entirety. There is no such thing as a free lunch, of course. Commit log formats are proprietary, and extracting from them can be tough.
We chose to take this approach of change extraction.
Here are some of the use cases for Databus. In this picture, we have one source of truth for data: the primary database. Changes to the data are observed (or consumed) by consumers, which may then turn around and update derived data serving systems. Or, data may be extracted into Hadoop (for example), to be re-loaded into some derived systems.
At LinkedIn, our primary database is Oracle. For example, we have a database that holds member information. When rows in this database are altered, the change events need to be propagated to a search index. The search index is used to serve queries from recruiters looking for appropriate candidates. Similarly, there are other consumers of different databases, each having its own business logic to build derived data out of the primary database.
Databus is used to capture these changes and provide them as change events to consumers.
So, what would be the requirements of such a system?
… Let’s look at a brief biography of Databus
… Now for a close look at Databus
'Change' in 'change data' is used as a noun rather than a verb: 'data that has changed' rather than 'alter the data'. It is an industry-standard term.
Change capture logic (CDC) extracts changes in a consistent way (preserving consistency and ordering, e.g. ways to extract the order of commits, I D semantics).
Publisher and Subscriber APIs let the CDC transform the extracted changes and publish those events with the atomicity guarantees of the source.
Applications preserved consistency when they applied the changes they received in a timely manner… and there were realities
- Different types of applications
- Schemas evolving at the source
But:
- The source cannot be burdened (a typical problem with V1).
- Applications cannot be forced to move to the latest version (this resulted in a proliferation of different versions of change streams of the same source).
- An external clock is attached to the source. Ordering is defined by the source: e.g. commit ordering in Oracle gives an increasing SCN; in the MySQL binlog, increasing transactions could be the SCNs.
- No additional source of truth, no additional point of failure.
- The event stream can be recreated given the SCN and the source.
- For applications, the ordering of events is the same as that seen by the source, so eventually the source and the apps will converge.
- The SCN is used to track progress on the app. Because the SCN is an external clock, apps can reason about consistency with the source. The SCN is logical: it is not tied to any particular change stream node.
- Apps need to be idempotent, as they can see a change more than once.
- Derived stores can reason about consistency amongst each other as they have SCN visibility, a concept that is useful for comparing consistency across applications.
- Timeline consistency: an at-least-once guarantee; the order of change events is the same as in the source DB; no updates are missed. All apps listening to the change stream see the same order of change events.
- Pull model, as opposed to push. In a push model, producers keep track of their consumers' progress and call clients as long as they are available; the pull model assumes that the state required to serve a request lies with the consumer. Restartability is easier, as state can be computed from (source, SCN) on any machine. This is true at both the change event stream and the consumer.
- Separation of concerns between the 'online consumption' use case (recent changes) and the 'catch-up/bootstrap' case (where older changes are required); the two have different scalability properties.
- Isolate sources and consumers: the source can move, schemas can change, and of course producer speed and consumption speed can differ vastly.
- We are not just transport: we support metadata such as schemas. We ensure that consumers have a good experience, while the change stream also becomes more manageable, ultimately helping provisioning and consumer robustness. This also gives the option of adding more filtering at the change stream.
Point: Change capture is within the relay. Each relay is self-sufficient: since eventBufferState = fn(source, SCN), it has the change capture logic to pull in the changes. If change capture were outside, then the change capture logic or fan-out would have to take care of replication or write to a leader-follower relay cluster. The EventBuffer wraps around if it runs out of memory.
Point: Client library: fetches changes from the Databus stream (which is now the relay and the bootstrap service).
Point: Workload separation: the cases range from recent changes to older changes to snapshots; we cannot rely solely on all changes fitting into memory.
Point: Bootstrap consumer: a special application that listens to changes and updates its log store (persistent change events) and snapshot store (a persistent copy of the database; stores change events in user space).
Point: Remember, the client library automatically switches to the appropriate service, relay or bootstrap, depending on the SCN requested by the application.
Point: Metadata: used by the relay stream (schema awareness).
Point: DBs are saved from abusive scans from lagging consumers (isolation).
Counterpoints: a push model requires additional state about consumers, or the speed of consumption and the speed of production have to match; it is harder to maintain lossless guarantees.
Are all of these open-sourced? Oracle is.
A custom MySQL replication slave instance: specifically, the custom storage engine of the slave writes to a TCP channel instead of disk. The slave state has an SCN (offset, log number) that can be controlled, with up to 3 days' worth of rewindability (configurable).
The control flow is depicted here. Note the pull model:
- At the CDC end, it is easy to make data portable: (source, SCN) is sufficient to re-create the state in the EventBuffer, which makes restarts easier.
- No state needs to be maintained on the upstream system about its subscribers (as in the case of a push model).
- Publish does not require persistence/durability guarantees; these are obtained from the source of truth and the fact that the change stream is f(Source, SCN).
At one end, the CDC captures changes from the database and publishes them to an event buffer. At the other end, applications subscribe to the change stream and receive callbacks when change data from the sources they have subscribed to becomes available.
Point: The end points have APIs supporting transaction semantics (atomicity).
Point: 'Windows', or consistency windows, are points in the stream that are consistent with the source at the specified SCN.
Point: Consumers 'see' events one consistency window at a time, i.e. events are visible to the consumers after 'end of window' has been written.
Question: What if the CDC were outside the relay? The CDC can be pull-model, but can it push off-box to the event stream? Yes, but then the event stream isn't simple: cluster state (leader-follower) is shared amongst the CDC and the relay. Also, failure of the CDC needs to be treated and monitored separately. Packaging the CDC with the event stream has operability advantages.
Question: Is onRollback() triggered at the same time the rollback was shown in the buffer? No. This isn't about one-to-one correspondence in time, but in semantics: both have a notion of 'transactions'. Apps don't see events that are not committed. End users of the apps' output have the option of seeing the whole transaction in its entirety as well, which is very important, for example, in relay chaining.
- Relay (online changes) and Bootstrap (older changes) together constitute the change event stream.
- They do not share the exact same API, but semantically they say the same thing: get events since a point in the logical clock.
- Both can apply simple filters on the service side.
- Both have chunking/progress guarantees.
- HTTP-based implementation.
- Efficient communication with clients.
Database schemas are converted to a neutral format for the Databus events; we chose Avro.
- Tools are available to generate schemas from different source types and to publish them to the 'schema registry'.
- Schemas are generated and stored in a place accessible by the change stream.
- Backward compatibility is ensured, which is relevant for bootstrap.
- Schemas are available to consumers for deserialization.
The relay encapsulates the change capture logic, the event buffer (remember the publish API), implemented as a circular buffer, and the metadata. It constitutes the online, most frequently used part of the change stream, and addresses 98% of requests on a typical day.
Relays talk to database directly – since they contain the change capture. This has horizontal scalability limits.
- Bootstrap consumer: a special application that consumes events from relays and writes them to a persistence layer called the 'log store'.
- Another process applies the changes to the snapshot store, using the primary key (pk) that was supplied in the publish API. This separate thread is not shown here.
- Seeding: bootstrapping the bootstrap.
Databus Client Library:
- Orchestrates consumption of the change stream from the bootstrap service and the relay.
- Uses an HTTP fetch to get events from upstream and writes them to the EventBuffer using an efficient readEvents call.
- Currently uses a polling mechanism to get events from upstream.
- The dispatcher uses the iterator interface of the EventBuffer to read the events and then calls the user-specified consumer implementations.
- By default, the client library persists the SCN for lossless recovery.
- Consumers need to be thread-safe, and can take advantage of parallelism.
… let's look at a typical application
- Key: a single instance of the client library can handle multiple consumers subscribing to multiple change streams.
- Different logic and tuning are required for the bootstrap and online cases; facilities are provided for this.
- Schema-aware apps can force type conversion from one schema to another, as long as backward compatibility is preserved amongst the change data.
- Override of the persisted SCN is possible, for cases where flush() is not guaranteed by the application (e.g. an index): such apps store the SCN in the index and retrieve it on startup.
- Applications are typically distributed, so they have some notion of partitions or partition awareness. It can be tempting to consume the entire event stream of an unpartitioned upstream store and then drop (n-1)/n of it on the floor. That is inefficient and expensive (for relay and consumer, latency-wise, as we shall see). Instead…
- Here, a client node refers to one instance of the client library, so it can be an application instance.
- Applications themselves are partition aware: they write to partitioned indexes/stores and need to distribute the processing load.
- The partition function is applied at the source on the primary key. It is applied on the fly, and the source itself needn't be partitioned.
- Partitions can be changed as more nodes are added, if the application accounts for 'repartitioning'. Checkpoints need to be reset, and configuration needs to be changed.
But this is hardly operation-friendly…
- The clients are partition aware, but the partition assignment is dynamic.
- Cluster awareness is introduced: client apps form clusters.
- Operability advantages: the ability to add/remove nodes with dynamic redistribution.
- Helix is used to manage client clusters, and as the SCN store.
… Now, let's look at some aspects of the current implementation.
… And on to some code – let’s take a look at the application
Points to note:
- How sources are specified.
- A consumer, and how a Databus client subscribes to sources (register).
Key: show how the payload is extracted. We have visited these APIs earlier.
… Now to dwell on performance
Setup:
- Measure relay serving throughput and CPU utilization.
- Vary the number of consumers and the poll interval (tpt_10 means throughput with a poll interval of 10 ms, cpu_55 is CPU utilization with a 55 ms poll interval, etc.).
- Consumers pull at max speed (no additional processing).
- Event size is 2.5 KB.
- No write traffic: the relay buffer is pre-filled.
The hypothesis was that we can support more consumers if the poll interval is long, and that is confirmed by the observations.
Observations:
- The relay can easily saturate the network with minimal CPU utilization.
- Once the network is saturated, CPU increases with the number of consumers due to networking overhead (context switching).
- Even with 200 consumers, CPU utilization is less than 10%.
- Higher poll intervals generally lead to less CPU utilization.
Setup:
- Measure the read throughput of each consumer with update traffic on the relay.
- Vary the number of consumers and the update rate.
- Consumers pull at max speed (no additional processing).
- Poll interval is 10 ms.
- Event size is 2.5 KB.
Observations:
- A drop means the consumer is no longer able to keep up.
- The reason is network saturation on the relay side; e.g. 2000 updates/s * 20 consumers * 2.5 KB = 100 MBps < max network bandwidth < 200 MBps = 2000 updates/s * 40 consumers * 2.5 KB.
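The saturation arithmetic in the observation above can be reproduced with a short calculation. This is only a sketch: the function name is hypothetical, it assumes every consumer receives every event, and it takes 1 MB as 1000 KB to match the figures quoted in the notes.

```python
def relay_egress_mb_per_s(updates_per_s: int, consumers: int, event_kb: float) -> float:
    """Outbound relay traffic in MB/s, assuming each consumer pulls every event."""
    return updates_per_s * consumers * event_kb / 1000.0


# The two data points quoted in the notes: 20 and 40 consumers at 2000 updates/s.
low = relay_egress_mb_per_s(2000, 20, 2.5)
high = relay_egress_mb_per_s(2000, 40, 2.5)
print(low, high)
```

The relay's usable network bandwidth lies somewhere between these two values, which is why consumers start falling behind between 20 and 40 consumers at this update rate.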
Setup:
- Same as above, but measure the time in milliseconds for events to reach the consumer.
- Added partitioning through server-side filtering (SSF) to see what happens if the network is not a bottleneck.
Observations:
- Latency knees are due to relay network saturation, as before.
- Latency without SSF is around 10-20 ms (including an average 5 ms overhead due to the poll interval).
- With SSF, the network is no longer a bottleneck; latency goes up to 15-45 ms due to SSF computation overhead.
So, the relay can scale to hundreds of consumers if they can tolerate a little bit of latency.
E2E latency has no meaning for the bootstrap service, and it can easily saturate the network with multiple clients. So, we focused on comparing serving out of the log store vs the snapshot store.
Setup:
- Compare serving deltas vs serving all updates.
- Synthetic workload.
- Vary the number of updates to existing keys vs new keys (i.e. inserts).
Observations:
- Catch-up time is constant, as it does not distinguish updates from inserts.
- The break-even point is around 1:1 updates vs inserts, i.e. when half of the changes are updates.
- For a small number of inserts, the benefit of the snapshot is overwhelming.
We monitor the update rate in production and tune the bootstrap service.
Databus stream for Oracle: things that scale with memberId and things that scale with connections (multiplicative, only inserts); small sources such as advertiser data (where consistency is important).
Applications: search, with multiple instances, a large distributed deployment, a low latency requirement, and consistency.
Bootstrap is used in new ways: to automatically provision new index nodes, for new in-memory advertising data sets, and to fix legacy stores.
Espresso: the source of truth for 2013 and beyond. It is a partitioned primary data store (transactional) based on a MySQL storage engine; it is horizontally scalable; and the change stream is partitioned at the source of truth rather than at the change stream.
The change stream still requires trigger-based 'databusification' in Oracle. Relay provisioning is still manual, in the sense that there is no self-serve mechanism to specify a source, or automatic source discovery, that will trigger relays being provisioned in the 'cloud' depending on capacity estimates.
Let's look at some change capture implementations we have…
Overall: is external clock propagation a good idea? Is it necessary, or a nice-to-have? Does it become important in the case of bootstrap? Are checkpoints portable?
If a mapping exists between SCN and a CDC-generated unique number, or if an index on SCN exists at every layer (bootstrap and relay), then this can be handled as a system-level implementation detail, and the client needn't use the SCN explicitly. The SCN, as an external clock, is a convenient way of storing logical state across instances of the change stream.
Databus: LinkedIn's Change Data Capture Pipeline
Databus Team @ LinkedIn
Sunil Nagaraj
http://www.linkedin.com/in/sunilnagaraj
Eventbrite, May 07 2013
Talking Points
• Motivation and Use-Cases
• Design Decisions
• Architecture
• Sample Code
• Performance
• Databus at LinkedIn
• Review
The Consequence of Specialization in Data Systems
• Data flow is essential
• Data consistency is critical!
Two Ways
• Application code dual-writes to the database and a pub-sub system
  - Easy on the surface
  - Consistent?
• Extract changes from the database commit log
  - Tough but possible
  - Consistent!
A brief history of Databus
• 2006-2010: Databus became an established and vital piece of infrastructure for consistent data flow from Oracle
• 2011: Databus (V2) addressed scalability and operability issues
• 2012: Databus supported change capture from Espresso
• 2013: Open Source Databus
  - https://github.com/linkedin/databus
Databus Eco-System: Realities
[Diagram: source databases feed change data capture, which publishes change data events to the Databus change event stream. A fast consumer reads changes since the last 5 seconds, a slow consumer reads changes since last week, and a new consumer reads every change.]
• Schemas evolve
• Source cannot be burdened by 'long look back' extracts
• Applications cannot be forced to move to the latest version of the schema at once
Key Design Decisions: Semantics
• Change Data Capture uses logical clocks attached to the source (SCN)
  - Change data stream is ordered by SCN
  - Simplifies data portability: the change stream is f(SourceState, SCN)
• Applications are idempotent
  - At-least-once delivery
  - Track progress reliably (SCN)
  - Timeline consistency
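To make the combination of at-least-once delivery and SCN-based progress tracking concrete, here is a minimal sketch of an idempotent consumer. It is illustrative only, not Databus code: the class `IdempotentStore`, its fields, and the sample SCN values are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class IdempotentStore:
    """Hypothetical derived store that tolerates at-least-once delivery."""
    rows: dict = field(default_factory=dict)
    checkpoint_scn: int = 0  # highest SCN whose window has been fully applied

    def apply_window(self, scn: int, changes: list) -> bool:
        """Apply one consistency window; return False if it was already applied."""
        if scn <= self.checkpoint_scn:
            return False  # redelivered window: skip it, state is unchanged
        for pk, value in changes:
            self.rows[pk] = value  # upsert keyed by primary key
        self.checkpoint_scn = scn  # record progress only after the whole window
        return True


store = IdempotentStore()
stream = [
    (101, [("m1", "alice")]),
    (102, [("m1", "alice-v2"), ("m2", "bob")]),
]
for scn, changes in stream:
    store.apply_window(scn, changes)

# The stream redelivers the last window, e.g. after a consumer restart.
replayed = store.apply_window(102, [("m1", "alice-v2"), ("m2", "bob")])
```

Because the checkpoint is the source's logical clock rather than a position on a particular stream node, the same check works no matter which relay or bootstrap instance served the events.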
Key Design Decisions: Systems
• Isolate fast consumers from slow consumers
  - Workload separation between online (recent), catch-up (old), bootstrap (all)
• Isolate sources from consumers
  - Schema changes
  - Physical layout changes
  - Speed mismatch
• Schema-awareness
  - Compatibility checks
  - Filtering at change stream
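The Components of Databus
[Diagram: the DB feeds the relay (change capture, in-memory event buffer, metadata). The Databus client inside an application consumes online changes from the relay. The bootstrap consumer reads online changes from the relay into the bootstrap service (log store, snapshot store), which serves older changes to slow applications and a consistent snapshot to new applications.]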
Change Data Capture
• Contains logic to extract changes from the source from a specified SCN
• Implementations
  - Oracle: trigger-based; commit ordering; special instrumentation required
  - MySQL: custom-storage-engine based
EventProducer API:
  start(SCN)      // capture changes from the specified SCN
  SCN getSCN()    // return the latest SCN
[Diagram labels: Change Data Capture, SCN, Database Schemas]
MySQL: Change Data Capture
[Diagram: MySQL master -> MySQL replication -> MySQL slave -> TCP channel -> relay]
• MySQL replication takes care of
  - binlog parsing
  - the protocol between master and slave
  - handling restarts
• Relay
  - Provides a TCP protocol interface to push events
  - Controls and manages the MySQL slave
Publish-Subscribe API
[Diagram: DB -> change data capture, extract(src, SCN) -> publish -> event buffer (in memory) -> subscribe(src, SCN) -> consumer]
EventBuffer (publish side):
  startEvents()                                  // e.g. new txn
  DbusEvent(enc(schema, changeData), src, pk)
  appendEvent(DbusEvent, ...)
  endEvents(SCN)                                 // e.g. end of txn; commit
  rollbackEvents()                               // abort this window
Consumer (subscribe side):
  register(source, 'Callback')
  onStartConsumption()                           // once
  onStartDataEventSequence(SCN)
  onStartSource(src, Schema)
  onDataEvent(DbusEvent e, …)
  onEndSource(src, Schema)
  onEndDataEventSequence(SCN)
  onRollback(SCN)
  onStopConsumption()                            // once
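The window semantics of the publish API can be illustrated with a toy buffer in which events only become visible once the window is closed with an SCN. This is a sketch under stated assumptions, not the Databus implementation: the Python method names merely mirror the slide's startEvents/appendEvent/endEvents/rollbackEvents, and `events_since` is a hypothetical read method added for the demonstration.

```python
class ToyEventBuffer:
    """Toy buffer: events become visible only after end_events(scn)."""

    def __init__(self):
        self._committed = []   # visible events as (scn, src, pk, payload)
        self._pending = None   # events of the currently open window, if any

    def start_events(self):
        self._pending = []     # open a new window (e.g. a new transaction)

    def append_event(self, src, pk, payload):
        if self._pending is None:
            raise RuntimeError("append_event called outside a window")
        self._pending.append((src, pk, payload))

    def end_events(self, scn):
        if self._pending is None:
            raise RuntimeError("end_events called outside a window")
        # Stamp every event of the window with the commit SCN and publish it.
        self._committed.extend((scn, s, k, p) for s, k, p in self._pending)
        self._pending = None

    def rollback_events(self):
        self._pending = None   # abort the window; nothing becomes visible

    def events_since(self, scn):
        return [e for e in self._committed if e[0] > scn]


buf = ToyEventBuffer()
buf.start_events()
buf.append_event("member", 1, "a")
visible_before_commit = buf.events_since(0)   # window still open
buf.end_events(10)

buf.start_events()
buf.append_event("member", 2, "b")
buf.rollback_events()                         # aborted window
visible_after = buf.events_since(0)
```

A consumer reading from such a buffer never observes a partially written or aborted transaction, which is the atomicity property the notes describe.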
The Databus Change Event Stream
[Diagram: the change event stream consists of the relay (in-memory event buffer) and the bootstrap service (log store, snapshot store), fed by online changes]
• Provides APIs to obtain change events
  - Query API specifies the logical clock (SCN) and the source
  - 'Get change events greater than SCN'
  - Filtering at source possible: MOD and RANGE filter functions applied to the primary key of the event
  - Batching/chunking to guarantee progress
• Does not contain state of consumers
• Contains references to metadata and schemas
• Implementation
  - HTTP server
  - Persistent connection to clients
  - REST API
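A rough sketch of 'get change events greater than SCN' with server-side key filters and chunking follows. All names here are hypothetical, and the exact RANGE semantics are an assumption (keys are bucketed into fixed-size ranges); the slide names the MOD and RANGE filters but does not define them.

```python
def mod_filter(num_partitions, wanted):
    """Keep events whose primary key falls in one of the wanted MOD buckets."""
    wanted = frozenset(wanted)
    return lambda pk: pk % num_partitions in wanted


def range_filter(range_size, wanted):
    """Keep events whose key falls in a wanted fixed-size range (assumed semantics)."""
    wanted = frozenset(wanted)
    return lambda pk: pk // range_size in wanted


def get_events(events, since_scn, key_filter, max_events):
    """Return up to max_events events with SCN > since_scn that pass the filter."""
    chunk = []
    for scn, pk in events:          # events are assumed to be ordered by SCN
        if scn > since_scn and key_filter(pk):
            chunk.append((scn, pk))
            if len(chunk) == max_events:
                break               # bounded chunk: the caller polls again
    return chunk


events = [(1, 10), (2, 11), (3, 12), (4, 13), (5, 14)]   # (scn, primary key)
even_keys = get_events(events, 1, mod_filter(2, {0}), 10)
first_chunk = get_events(events, 0, range_filter(5, {2}), 2)
```

The consumer passes the SCN of the last event it received as `since_scn` on the next call, so the server keeps no per-consumer state.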
Meta-data Management
• Event definition, serialization and transport
  - Avro
• Oracle, MySQL
  - Table schema generates Avro definition
• Schema evolution
  - Only backwards-compatible changes allowed
• Isolation of applications from changes in source schema
• Many versions of a source used by applications, but only one version (the latest) of the change stream exists
The Databus Relay
[Diagram: the relay contains change capture, the in-memory event buffer, source metadata (database schemas), an SCN store, and an API]
• Encapsulates change capture logic and change event stream
• Source aware, schema aware
• Multi-tenant: multiple event buffers representing change events of different databases
• Optimizations
  - An index on SCN exists to quickly locate the physical offset in the EventBuffer
  - Locally stores the SCN per source for efficient restarts
  - Large event buffers possible (> 2G)
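The idea behind an SCN index, mapping a logical clock value to a physical buffer offset, can be sketched with a sorted list and binary search. This is an assumption-laden illustration: the class `ScnIndex` and its methods are hypothetical, and the actual relay index may be organised quite differently.

```python
import bisect


class ScnIndex:
    """Hypothetical index from window SCN to the byte offset where the window starts."""

    def __init__(self):
        self._scns = []      # strictly increasing SCNs
        self._offsets = []   # offsets[i] is where the window for scns[i] begins

    def add(self, scn, offset):
        if self._scns and scn <= self._scns[-1]:
            raise ValueError("SCNs must be added in increasing order")
        self._scns.append(scn)
        self._offsets.append(offset)

    def offset_after(self, scn):
        """Offset of the first window with SCN greater than scn, or None."""
        i = bisect.bisect_right(self._scns, scn)
        return self._offsets[i] if i < len(self._offsets) else None


index = ScnIndex()
index.add(10, 0)
index.add(20, 128)
index.add(30, 300)
start = index.offset_after(15)   # consumer has applied everything up to SCN 15
```

With such an index, serving 'events greater than SCN' starts with a logarithmic lookup rather than a scan of the whole buffer.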
Scaling Databus Relay
Option 1 [diagram: one DB read by several relays]
• Peer relays, independent
• Increased load on the source DB with each additional relay instance
Option 2 [diagram: one DB read by a leader relay, which feeds follower relays]
• Relays in a leader-follower cluster
• Only the leader reads from the DB; followers read from the leader
• Leadership assigned dynamically
• Small period of stream unavailability during leadership transfer
The Bootstrap Service
[Diagram: the bootstrap consumer reads online changes from the relay into the log store and snapshot store; the snapshot store is seeded from the database]
• Bridges the continuum between stream and batch systems
• Catch-all for slow / new consumers
• Isolates the source instance from large scans
• Snapshot store has to be seeded once
• Optimizations
  - Periodic merge
  - Filtering pushed down to store
  - Catch-up versus full bootstrap
• Guaranteed progress for consumers via chunking
• Multi-tenant: can contain data from many different databases
• Implementations
  - Database (MySQL)
  - Raw files
The Databus Client Library
[Diagram: inside the client library, the change stream client reads from the Databus change stream and writes to the EventBuffer; the dispatcher iterates over the EventBuffer and calls back the stream consumer and bootstrap consumer; an SCN store and API sit alongside]
• Glue between the Databus change stream and the business logic in the consumer
• Switches between relay and bootstrap as needed
• Optimizations
  - Change events use a batch write API without deserialization
• Periodically persists the SCN for lossless recovery
• Built-in support for parallelism
  - Consumers need to be thread-safe
  - Useful for scaling large batch processing (bootstrap)
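The 'switches between relay and bootstrap as needed' behaviour can be pictured as a decision on the consumer's checkpoint SCN. The rule below is a simplified assumption for illustration, not the client library's actual logic; the function name, the service labels, and the `relay_min_scn` parameter (the oldest SCN still held in the relay's buffer) are hypothetical.

```python
def choose_service(checkpoint_scn, relay_min_scn):
    """Pick an upstream service for a consumer positioned at checkpoint_scn.

    checkpoint_scn: last SCN the consumer fully applied, or None for a new consumer.
    relay_min_scn:  oldest SCN still retained in the relay's in-memory buffer.
    """
    if checkpoint_scn is None:
        return "bootstrap-snapshot"   # new consumer: needs a consistent snapshot
    if checkpoint_scn < relay_min_scn:
        return "bootstrap-catchup"    # fell behind: relay no longer holds its position
    return "relay"                    # recent enough: serve online changes


new_consumer = choose_service(None, 500)
slow_consumer = choose_service(100, 500)
fast_consumer = choose_service(700, 500)
```

The application itself only implements callbacks; which service the events came from is hidden behind the library.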
Databus Applications
[Diagram: change streams S1, S2, … Sn feed the Databus client inside the application, which dispatches to consumers S1, S2, … Sn]
• Applications can process multiple independent change streams
• Failure of one won't affect the others
• Different logic and configuration settings for bootstrap and online consumption possible
• Processing can be tied to a particular version of the schema
• Able to override the client library's persisted SCN
Scaling Applications - I
[Diagram: the change stream is partitioned by i = pk MOD N; one client application consumes partitions i = 0..k-1 and another consumes i = k..N-1. Client nodes are uniform and can process any partition(s); clients distribute the processing load.]
• Databus clients consume partitioned streams
• Partitioning strategy: Range or Hash
• Partitioning function applied at source
• Number of partitions (N) and list of partitions (i) specified statically in configuration
• Not easy to add/remove nodes
  - Needs configuration change on all nodes
Scaling Applications - II
[Diagram: the Databus stream is partitioned by i = pk mod N; the N partitions are dynamically allocated and distributed evenly amongst 'm' nodes, N/m partitions per application node; the SCN is written to a central location.]
• Databus clients consume partitioned streams
• Partitioning strategy: MOD
• Partition function applied at source
• Number of partitions (N) and cluster name specified statically in configuration
• Easy to add or remove nodes
  - Dynamic redistribution of partitions
  - Fault tolerance for client nodes
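The effect of dynamic redistribution can be sketched with a simple round-robin assignment of N partitions over m nodes. This only illustrates the idea: it does not use or imitate the Helix API, and the function and node names are hypothetical.

```python
def partition_of(pk, num_partitions):
    """MOD partition function applied to the primary key."""
    return pk % num_partitions


def assign_partitions(num_partitions, nodes):
    """Spread partitions 0..N-1 evenly over the nodes (round-robin, sorted by name)."""
    ordered = sorted(nodes)
    assignment = {node: [] for node in ordered}
    for p in range(num_partitions):
        assignment[ordered[p % len(ordered)]].append(p)
    return assignment


before = assign_partitions(8, ["node-a", "node-b"])
after = assign_partitions(8, ["node-a", "node-b", "node-c"])  # a node joins
```

In a real deployment a cluster manager would also hand over the per-partition SCN checkpoints, which is why the slide notes that the SCN is written to a central location.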
Databus: Current Implementation
• OS: Linux; written in Java; runs on Java 6
• All components have HTTP interfaces
• Databus client: Java
  - Other language bindings possible
  - All communication with the change stream via HTTP
• Libraries
  - Netty, for HTTP client-servers
  - Avro, for serialization of change events
  - Helix, for cluster awareness
Databus Performance: Relay
• Relay
  - Saturates network with low CPU utilization
    - CPU utilization increases with more clients
    - Increased poll interval (increased consumer latency) reduces CPU utilization
  - Scales to 100's of consumers (client instances)
Databus Performance: Consumer
• Consumer
  - Latency primarily governed by 'poll interval'
  - Low overhead of library in event fetch
    - Spike in latency due to network saturation at relay
• Scaling number of consumers
  - Use partitioned consumption (filtering at relay): reduces network utilization, but some increase in latency due to filtering
  - Increase 'poll interval', tolerate higher latencies
Databus Bootstrap: Performance
• Bootstrap
  - Should we serve from the 'catch-up store' or the 'snapshot store'?
  - Depends: traffic patterns in the spectrum from 'all updates' to 'all inserts'
  - Tune the service depending on the fraction of updates and inserts
    - Favour snapshot-based serving for update-heavy traffic
Bootstrap Performance: Snapshot vs Catch-up
[Chart; slide footer dated 5/13/2013]
Databus at LinkedIn
[Diagram: an Oracle change event stream and an Espresso change event stream within the Databus service]
• Databus change stream is a managed service
• Applications discover/look up coordinates of sources
• Multi-tenant, chained relays
• Many sources can be bootstrapped from SCN 0 (beginning of time)
• Automated change stream provisioning is a work in progress
Databus at LinkedIn: Monitoring
• Available out of the box as JMX MBeans
• Metrics for health
  - Lag between the update time at the DB and the time at which it was received by the application
  - Time of last contact with the change event stream and the source
• Metrics for capacity planning
  - Event rate / size
  - Request rate
  - Threads / connections
Databus at LinkedIn: The Good
• Source isolation: bootstrap benefits
  - Typically, data is extracted from sources just once (seeding)
  - Bootstrap service used during launch of new applications
  - Primary data store not subject to unpredictable high loads due to lagging applications
• Common data format
  - Avro offers ease of use, flexibility and performance improvements (larger retention periods of change events in the relay)
• Partitioned stream consumption
  - Applications horizontally scaled to 100's of instances
Databus at LinkedIn: Operational Niggles
• Oracle change capture performance bottlenecks
  - Complex joins
  - BLOBs and CLOBs
  - High-update-rate-driven contention on the trigger table
• Bootstrap: snapshot store seeding
  - Consistent snapshot extraction from large sources
• Semi-automated change stream provisioning
Quick Review
• Specialization in data systems
  - The CDC pipeline is a first-class infrastructure citizen, up there with stores and indexes
• Source independent
  - Change capture logic can be plugged in
• Use of SCN, an external clock attached to the source
  - Makes the change stream more 'portable'
  - Easy for applications to reason about consistency with the source
• Pub-sub APIs support the atomicity semantics of transactions
• Bootstrap service
  - Isolates the source from abusive scans
  - Serves both streaming and batch use-cases