Although LinkedIn's data has continued to grow rapidly over the years, scaling up to handle the increasing data volume has not been the only challenge in streaming data in near real-time. Supporting the proliferation of new data systems has become yet another huge endeavor for data streaming infrastructure at LinkedIn. Building separate, specialized solutions to move data across heterogeneous systems is not sustainable, as it slows down development and makes the infrastructure unmanageable. This called for a centralized, managed, and extensible solution that can continuously deliver data to nearline applications.
We built Brooklin as a managed data streaming service that supports multiple pluggable sources and destinations, which can be data stores or messaging systems. Since 2016, Brooklin has been running in production as a critical piece of LinkedIn's streaming infrastructure, supporting a variety of data movement use cases, such as change data capture (CDC) and data propagation between different systems and environments. We have also leveraged Brooklin for mirroring Kafka data, replacing Kafka MirrorMaker at LinkedIn. In this talk, we will dive deeper into Brooklin's architecture and use cases, as well as our future plans.
7. Nearline Applications
• Need an easy way to move data to applications
○ App devs should focus on event processing and not on data access
8. Building the Right Infrastructure
• Build separate, specialized solutions to stream data from and to each different system?
○ Slows down development
○ Hard to manage!
[Diagram: nearline applications wired point-to-point to many systems, including Microsoft EventHubs and Streaming Systems A through D]
9. Need a centralized, managed, and extensible service to continuously deliver data in near real-time
11. Brooklin
• Streaming data pipeline service
• Propagates data from many source types to many destination types
• Streams are dynamically provisioned and individually configured
• Multitenant: Can run several thousand streams simultaneously
• Extensible: Plug-in support for additional sources/destinations
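To make the "extensible, pluggable sources and destinations" idea concrete, here is a minimal sketch of a connector registry in which a datastream is provisioned against named source and destination plug-ins. All names (`Datastream`, `ConnectorRegistry`, the "db" and "topic" connectors) are hypothetical illustrations, not Brooklin's actual API.

```python
from typing import Callable, Dict, Iterable, List

class ConnectorRegistry:
    """Maps connector type names to reader/writer callables (illustrative)."""
    def __init__(self):
        self.sources: Dict[str, Callable[[], Iterable[dict]]] = {}
        self.destinations: Dict[str, Callable[[dict], None]] = {}

    def add_source(self, name: str, reader: Callable[[], Iterable[dict]]):
        self.sources[name] = reader

    def add_destination(self, name: str, writer: Callable[[dict], None]):
        self.destinations[name] = writer

class Datastream:
    """One dynamically provisioned pipeline from a source type to a destination type."""
    def __init__(self, name: str, source: str, destination: str,
                 registry: ConnectorRegistry):
        self.name = name
        self._read = registry.sources[source]       # pluggable source
        self._write = registry.destinations[destination]  # pluggable destination

    def run_once(self) -> int:
        """Move all currently available events; return how many moved."""
        events = list(self._read())
        for event in events:
            self._write(event)
        return len(events)

# Wire up a toy "db" source and an in-memory "topic" destination.
registry = ConnectorRegistry()
registry.add_source("db", lambda: [{"member": 1, "field": "title"}])
topic: List[dict] = []
registry.add_destination("topic", topic.append)

stream = Datastream("member-updates", "db", "topic", registry)
moved = stream.run_once()
```

New source or destination types are added by registering another reader or writer, without touching existing streams, which is the essence of the plug-in model described above.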
19. Capturing Live Updates
[Diagram: Member DB updates flowing to the News Feed, Notifications, Standardization, and Search Indices services]
22. Change Data Capture (CDC)
• Brooklin can stream database updates to a change stream
• Data processing applications consume from change streams
• Isolation: Applications are decoupled from the sources and don't compete for resources with online queries
• Applications can be at different points in change timelines
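The point that "applications can be at different points in change timelines" follows from each consumer tracking its own offset into an append-only change log. A toy sketch (purely illustrative, not Brooklin's implementation):

```python
class ChangeStream:
    """Append-only log of database change events (toy model)."""
    def __init__(self):
        self._log = []

    def append(self, change: dict):
        self._log.append(change)

    def read_from(self, offset: int):
        """Return every change at or after the given offset."""
        return self._log[offset:]

stream = ChangeStream()
stream.append({"member_id": 42, "column": "headline", "value": "Engineer"})
stream.append({"member_id": 42, "column": "location", "value": "NYC"})

# Two applications consume the same stream independently of the source
# database and of each other; each keeps its own position.
news_feed_offset = 0   # news feed has processed nothing yet
search_offset = 1      # search indexer is one event behind the head

news_feed_events = stream.read_from(news_feed_offset)
search_events = stream.read_from(search_offset)
```

Because consumers read from the log rather than querying the database, a slow indexer never competes with online queries for database resources, which is the isolation property listed above.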
23. Change Data Capture (CDC)
[Diagram: Member DB updates flowing through a messaging system to the News Feed, Notifications, Standardization, and Search Indices services]
25. Stream Data from X to Y
● Across…
○ cloud services
○ clusters
○ data centers
26. Streaming Bridge
• Data pipe to move data between different environments
• Enforce policy: encryption, obfuscation, data formats
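The policy-enforcement idea can be sketched as a pipeline of transforms applied to each record before it leaves the bridge. The policy names and hooks below are hypothetical, and base64 encoding merely stands in for real encryption so the example stays runnable:

```python
import base64

def obfuscate(record: dict, fields=("email",)) -> dict:
    """Mask sensitive fields before the record crosses environments."""
    return {k: ("***" if k in fields else v) for k, v in record.items()}

def encode(record: dict) -> bytes:
    """Stand-in for encryption: deterministic, reversible encoding."""
    return base64.b64encode(repr(sorted(record.items())).encode())

def bridge(records, policies):
    """Apply each configured policy, in order, to every record."""
    out = []
    for record in records:
        for policy in policies:
            record = policy(record)
        out.append(record)
    return out

records = [{"member": 7, "email": "a@b.com"}]
shipped = bridge(records, [obfuscate])
encoded = [encode(r) for r in shipped]
```

Centralizing these transforms in the bridge means every pipeline crossing an environment boundary gets the same policy treatment, instead of each application reimplementing it.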
27. Mirroring Kafka Data
● Aggregating data from all data centers into a centralized place
● Moving data between LinkedIn and external cloud services (e.g. Azure)
● Brooklin has replaced Kafka MirrorMaker (KMM) at LinkedIn
○ Issues with KMM: didn't scale well, difficult to operate and manage, poor failure isolation
28. Use Brooklin to Mirror Kafka Data
[Diagram: sources (messaging systems, Microsoft EventHubs, databases) mirrored to destinations (messaging systems, Microsoft EventHubs, databases)]
31. Brooklin Kafka Mirroring
● Optimized for stability and operability
● Manually pause and resume mirroring at every level
○ Entire pipeline, topic, topic-partition
● Can auto-pause partitions facing mirroring issues
○ Auto-resumes the partitions after a configurable duration
● Flow of messages from other partitions is unaffected
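The auto-pause/auto-resume behavior can be sketched as a per-partition error counter plus a resume deadline; the thresholds, names, and clock handling below are illustrative, not Brooklin's actual code:

```python
import time

class PartitionController:
    """Pause a failing partition, auto-resume it after a configurable duration."""
    def __init__(self, error_threshold=3, pause_seconds=300.0,
                 clock=time.monotonic):
        self.error_threshold = error_threshold
        self.pause_seconds = pause_seconds
        self.clock = clock
        self.errors = {}        # partition -> consecutive error count
        self.paused_until = {}  # partition -> resume deadline

    def record_error(self, partition: str):
        self.errors[partition] = self.errors.get(partition, 0) + 1
        if self.errors[partition] >= self.error_threshold:
            # Too many consecutive failures: pause just this partition.
            self.paused_until[partition] = self.clock() + self.pause_seconds
            self.errors[partition] = 0

    def is_paused(self, partition: str) -> bool:
        deadline = self.paused_until.get(partition)
        if deadline is None:
            return False
        if self.clock() >= deadline:   # pause window elapsed: auto-resume
            del self.paused_until[partition]
            return False
        return True

# Simulate with a fake clock so the behavior is deterministic.
now = [0.0]
ctl = PartitionController(error_threshold=2, pause_seconds=10.0,
                          clock=lambda: now[0])
ctl.record_error("topicA-0")
ctl.record_error("topicA-0")          # threshold hit: partition paused
paused_now = ctl.is_paused("topicA-0")
other_flowing = not ctl.is_paused("topicA-1")  # siblings unaffected
now[0] = 11.0                         # past the pause window
resumed = not ctl.is_paused("topicA-0")
```

Because the pause state is keyed by partition, one misbehaving topic-partition stalls only itself, matching the "flow of messages from other partitions is unaffected" property above.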
62. Current
Sources & Destinations
• Consumers:
○ Espresso
○ Oracle
○ Kafka
○ EventHubs
○ Kinesis
• Producers:
○ Kafka
○ EventHubs
Features
• APIs are standardized to support additional sources and destinations
• Multitenant: Can power thousands of datastreams across several source and destination types
• Guarantees: At-least-once delivery; order is maintained at partition level
• Kafka mirroring improvements: finer control of pipelines (pause/auto-pause partitions), improved latency with flushless-produce mode
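The at-least-once guarantee typically comes from checkpointing the consumer offset only after the produce to the destination succeeds: a crash between produce and checkpoint causes redelivery (duplicates) rather than data loss. A deterministic toy sketch of that ordering, not Brooklin's implementation:

```python
def mirror(source, destination, checkpoint, fail_before_checkpoint=False):
    """Copy records from the last checkpoint onward; checkpoint after produce."""
    offset = checkpoint["offset"]
    for record in source[offset:]:
        destination.append(record)          # produce first...
        if fail_before_checkpoint:
            raise RuntimeError("crash after produce, before checkpoint")
        checkpoint["offset"] += 1           # ...checkpoint second

source = ["m1", "m2"]
dest = []
ckpt = {"offset": 0}

try:
    # First run crashes after producing "m1" but before checkpointing it.
    mirror(source, dest, ckpt, fail_before_checkpoint=True)
except RuntimeError:
    pass

# Restart: re-reads from the stale checkpoint, duplicating "m1" but losing nothing.
mirror(source, dest, ckpt)
```

Checkpointing in the opposite order (before the produce) would give at-most-once semantics, where a crash loses the in-flight record instead of duplicating it.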