Brooklin Mirror Maker
How and Why we moved away from Kafka Mirror Maker
Shun-ping Chiu
Software engineer @ LinkedIn Data Pipelines
Agenda
Kafka Mirroring Use Cases
Limitations of Kafka Mirror Maker
Brooklin Mirror Maker
Future Work
Mirroring Use Cases
● Aggregating data from all data centers
● Moving data between LinkedIn and external cloud services
Tremendous Kafka Data
● Kafka data at LinkedIn continues to grow rapidly
● We are at 5 trillion messages and 1.4 PB every day
Big Scale to Operate
● 40+ Kafka source clusters in different DCs
● 100+ pipelines
● 2 trillion messages/day
KMM Limitations
Kafka Mirror Maker (KMM) Topology
[Diagram: Datacenters A and B, each with tracking and aggregate-tracking clusters and two KMM clusters]
● Each KMM pipeline
○ mirrors data from 1 source cluster to 1 destination cluster
○ constitutes its own KMM cluster
[Diagram: Datacenters A, B, C, ..., each with tracking, metrics and aggregate clusters, connected by many separate KMM clusters]
KMM Setup
● # of KMM clusters = # of data centers x # of Kafka source clusters
● Need to operate 100+ KMM clusters
● Each KMM cluster has a static configuration file, so every change has to be deployed
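Example - Add a Topic in KMM
● Let's say we have a pipeline (a KMM cluster) with 100+ hosts: adding a topic means editing its configuration file and redeploying every one of those hosts
● Now do that for 100+ pipelines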
KMM Pain Points
● Hard to operate
○ hard to add a new topic
○ difficult to split the pipeline
● One bad partition brings down the pipeline
○ deleted topic
○ ACL issue
● Performance issues
○ Unable to catch up with traffic
○ Increased lag
: (
Your Kafka Mirror Maker ran into a problem and needs to restart. We're just collecting some error info, and then we'll restart for you. (0% complete)
Brooklin Mirror Maker
Brooklin - Stream Ingestion Service
[Diagram: Sources (data stores, messaging systems, Microsoft EventHubs) → Brooklin → Destinations (data stores, messaging systems, Microsoft EventHubs)]
BMM is built on Brooklin
Brooklin Mirror Maker
● Built on top of our stream ingestion service, Brooklin
○ Better operability
○ Fault isolation
○ Performance optimizations
● BMM has fully replaced KMM at LinkedIn today
Better Operability
KMM vs BMM
[Diagram: Datacenters A and B with tracking and aggregate-tracking clusters, mirrored by one BMM cluster per datacenter versus two KMM clusters per datacenter]
● BMM is one cluster per data center
BMM Topology
[Diagram: Datacenters A, B, C, ..., each with tracking, metrics and aggregate clusters and a single BMM cluster]
● 100+ KMM clusters vs. ~10 BMM clusters
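Dynamic Management API
[Diagram: the Brooklin engine, with a Kafka source connector and a Kafka destination connector, exposes a Management REST API and a Diagnostics REST API and is backed by ZooKeeper; the management/monitoring portal and SRE/ops dashboards call the REST APIs]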
RESTful API - Creating a Pipeline
[Diagram: Management REST API, Brooklin engine, ZooKeeper]
create: POST /datastream
  name: mm_DC1-tracking_DC2-aggregate-tracking
  connectorName: KafkaMirrorMaker
  source:
    connectionString: kafkassl://DC1-tracking-vip:12345/topicA|topicB
  destination:
    connectionString: kafkassl://DC2-aggregate-tracking-vip:12345
  metadata:
    taskNums: 5
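A minimal Java sketch of issuing the create call above with the JDK's java.net.http client. The base URL (brooklin.example.com:32311), the JSON field layout and the helper names (DatastreamRequests, createBody, create) are assumptions modelled on the slide's fields, not Brooklin's documented wire format; the request is only built here, not sent.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical client helper: the JSON layout and endpoint address are assumptions
// based on the slide, not Brooklin's documented REST schema.
class DatastreamRequests {

    // Serializes the slide's datastream fields; inputs are assumed to need no JSON escaping.
    static String createBody(String name, String source, String destination, int taskNums) {
        return String.format(
                "{\"name\":\"%s\",\"connectorName\":\"KafkaMirrorMaker\","
                        + "\"source\":{\"connectionString\":\"%s\"},"
                        + "\"destination\":{\"connectionString\":\"%s\"},"
                        + "\"metadata\":{\"taskNums\":\"%d\"}}",
                name, source, destination, taskNums);
    }

    // Builds, but does not send, the POST /datastream request.
    static HttpRequest create(URI base, String json) {
        return HttpRequest.newBuilder(base.resolve("/datastream"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(json))
                .build();
    }
}
```

Submitting it would be a matter of HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()); the update call on the next slide follows the same pattern with .PUT(...) against /datastream/<name>.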
RESTful API - Updating a Pipeline
[Diagram: Management REST API, Brooklin engine, ZooKeeper]
update: PUT /datastream/mm_DC1-tracking_DC2-aggregate-tracking
  name: mm_DC1-tracking_DC2-aggregate-tracking
  connectorName: KafkaMirrorMaker
  source:
    connectionString: kafkassl://DC1-tracking-vip:12345/topicA|topicB|topicC|topicD
  destination:
    connectionString: kafkassl://DC2-aggregate-tracking-vip:12345
  metadata:
    taskNums: 10
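● The topic list in the source connection string is a regular expression, so it can also be a pattern such as ^topic.* instead of an explicit list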
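The source connection string ends in a topic list written as a regular-expression alternation (topicA|topicB|topicC|topicD). The Java sketch below shows which topics such a pattern would select, assuming the pattern is applied as a full match in the style of java.util.regex.Matcher.matches(); that matching rule and the TopicFilter helper are assumptions, not BMM's documented behavior.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper: assumes the topic part of the connection string is a
// full-match Java regex, which is an assumption about BMM, not a documented rule.
class TopicFilter {

    // Takes everything after the first '/' that follows "scheme://host:port".
    static Pattern topicPattern(String connectionString) {
        String rest = connectionString.substring(connectionString.indexOf("://") + 3);
        return Pattern.compile(rest.substring(rest.indexOf('/') + 1));
    }

    // Source topics the datastream would mirror.
    static List<String> selected(Pattern pattern, List<String> topics) {
        return topics.stream()
                .filter(t -> pattern.matcher(t).matches())
                .collect(Collectors.toList());
    }
}
```

With an explicit list, mirroring a new topic needs another PUT; with a prefix pattern such as ^topic.*, a newly created topicE would match as well, provided BMM re-evaluates the pattern as topics appear.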
Pause a Pipeline
● Manually pause and resume mirroring for each pipeline
● For fault isolation, BMM can automatically pause mirroring for bad partitions
○ Messages from healthy partitions continue to flow
○ Paused partitions auto-resume after a configurable duration
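A minimal Java sketch of the pause bookkeeping described above: failing partitions are auto-paused with a timestamp and come back once the configurable duration has elapsed, while manual pauses stay until an operator resumes them. The class and method names are hypothetical; against a real Kafka consumer, the resulting sets would most likely drive KafkaConsumer.pause(...) and resume(...), which stop and restart fetching for specific partitions.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.*;

// Hypothetical per-task pause bookkeeping; names and policy details are illustrative only.
class PartitionPauser {
    private final Duration autoResumeAfter;                          // the "configurable duration"
    private final Map<String, Instant> autoPaused = new HashMap<>(); // partition -> time it was paused
    private final Set<String> manuallyPaused = new HashSet<>();

    PartitionPauser(Duration autoResumeAfter) {
        this.autoResumeAfter = autoResumeAfter;
    }

    // Fault isolation: called when mirroring `partition` fails (e.g. deleted topic, ACL error).
    void autoPause(String partition, Instant now) {
        autoPaused.put(partition, now);
    }

    // Operator-driven pause/resume; these never expire on their own.
    void manualPause(String partition) { manuallyPaused.add(partition); }
    void manualResume(String partition) { manuallyPaused.remove(partition); }

    // Partitions the consumer should keep fetching from at time `now`.
    Set<String> active(Collection<String> assigned, Instant now) {
        // Auto-resume every partition whose pause has lasted at least autoResumeAfter.
        autoPaused.entrySet().removeIf(e -> !now.isBefore(e.getValue().plus(autoResumeAfter)));
        Set<String> result = new TreeSet<>(assigned);
        result.removeAll(autoPaused.keySet());
        result.removeAll(manuallyPaused);
        return result;
    }
}
```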
Diagnostic API
[Diagram: the Brooklin engine, with Kafka source and destination connectors, Management and Diagnostics REST APIs, and ZooKeeper; the management/monitoring portal and SRE/ops dashboards query the Diagnostics REST API]
RESTful API - On-demand Diagnostics
[Diagram: Diagnostics REST API, Brooklin engine, ZooKeeper]
getAllStatus: GET /diag?datastream=mm_DC1-tracking_DC2-aggregate-tracking
host1.prod.linkedin.com:
  datastream: mm_DC1-tracking_DC2-aggregate-tracking
  assignedTopicPartitions: [topicA-0, topicA-3, topicB-0, topicB-2]
  autoPausedPartitions: [{topicA-3: {reason: SEND_ERROR, description: failed to produce messages from this partition}}]
  manuallyPausedPartitions: []
host2.prod.linkedin.com:
  datastream: mm_DC1-tracking_DC2-aggregate-tracking
  assignedTopicPartitions: [topicA-1, topicA-2, topicB-1, topicB-3]
  autoPausedPartitions: []
  manuallyPausedPartitions: []
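Because the diagnostics answer arrives per host, a dashboard has to merge it into one cluster-wide view. The Java sketch below models each host's entry with the slide's field names and summarizes which partitions are paused, where, and why. The HostStatus and DiagSummary classes and the base URL are hypothetical; parsing the real response format is left out.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory model of one host's entry in the /diag response shown on the slide.
class HostStatus {
    final List<String> assignedTopicPartitions;
    final Map<String, String> autoPausedPartitions; // partition -> reason, e.g. SEND_ERROR
    final List<String> manuallyPausedPartitions;

    HostStatus(List<String> assigned, Map<String, String> autoPaused, List<String> manuallyPaused) {
        this.assignedTopicPartitions = assigned;
        this.autoPausedPartitions = autoPaused;
        this.manuallyPausedPartitions = manuallyPaused;
    }
}

class DiagSummary {
    // Builds the on-demand diagnostics URL; the base address is an assumption.
    static URI diagUri(String base, String datastream) {
        return URI.create(base + "/diag?datastream="
                + URLEncoder.encode(datastream, StandardCharsets.UTF_8));
    }

    // Cluster-wide view: every paused partition, the host it lives on, and why it is paused.
    static Map<String, String> pausedPartitions(Map<String, HostStatus> byHost) {
        Map<String, String> paused = new TreeMap<>();
        byHost.forEach((host, s) -> {
            s.autoPausedPartitions.forEach((p, reason) -> paused.put(p, host + " (auto: " + reason + ")"));
            s.manuallyPausedPartitions.forEach(p -> paused.put(p, host + " (manual)"));
        });
        return paused;
    }

    // Fraction of all assigned partitions that are currently paused.
    static double pausedFraction(Map<String, HostStatus> byHost) {
        int assigned = byHost.values().stream().mapToInt(s -> s.assignedTopicPartitions.size()).sum();
        return assigned == 0 ? 0.0 : (double) pausedPartitions(byHost).size() / assigned;
    }
}
```

For the two hosts on the slide, this yields a single paused partition, topicA-3 on host1 with reason SEND_ERROR, out of eight assigned.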
Performance Improvements
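Brooklin Mirroring Pseudocode
while (!shutdown) {
    records = consumer.poll();
    producer.send(records);
    if (timeToCommit) {
        producer.flush();   // blocks until every outstanding send is acknowledged
        consumer.commit();  // commit only once the destination has the data
    }
}
Producer flush can be expensive: the whole loop stalls until every in-flight send has completed.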
Flushless Produce
Only commit "safe", acknowledged checkpoints:
● Flushless: consumer.poll() → producer.send(records) → consumer.commit(offsets)
● Instead of: consumer.poll() → producer.send(records) → producer.flush() → consumer.commit()
Flushless Produce
[Diagram: source partition sp0 → consumer → producer → destination partitions dp0 and dp1; offsets o1 and o2 have been sent, and the producer reports ack(sp0, o2) to the checkpoint manager]
● Checkpoint manager maintains producer-acknowledged offsets for each source partition
Source partition sp0
in-flight: [o1]
acked: [o2]
safe checkpoint: --
Flushless Produce
[Diagram: offsets o3 and o4 are now in flight from sp0 to dp0 and dp1; the producer reports ack(sp0, o1) to the checkpoint manager]
● Update the safe checkpoint to the largest acknowledged offset that is less than the oldest in-flight offset (if any)
Source partition sp0
in-flight: [o3, o4]
acked: [o1, o2]
safe checkpoint: o2
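The checkpoint rule above fits in a small per-partition tracker. This Java sketch is an illustration with hypothetical names (CheckpointTracker, onSend, onAck), not Brooklin's class: sends register an offset as in flight, producer callbacks move it to acked, and the safe checkpoint becomes the largest acked offset below the oldest in-flight one. One detail to keep in mind: Kafka expects the committed offset to be the next record to consume, so a consumer would commit safe checkpoint + 1.

```java
import java.util.OptionalLong;
import java.util.TreeSet;

// Hypothetical per-source-partition tracker; the names are illustrative, not Brooklin's classes.
class CheckpointTracker {
    private final TreeSet<Long> inFlight = new TreeSet<>(); // sent, not yet acknowledged
    private final TreeSet<Long> acked = new TreeSet<>();    // acknowledged, not yet covered by the checkpoint
    private long safe = -1;                                  // -1 means "no safe checkpoint yet"

    // The record consumed at `offset` has been handed to producer.send().
    void onSend(long offset) {
        inFlight.add(offset);
    }

    // Producer callback: the destination acknowledged the record consumed at `offset`.
    void onAck(long offset) {
        inFlight.remove(offset);
        acked.add(offset);
        // Largest acked offset below the oldest in-flight one, or the largest acked if nothing is in flight.
        Long candidate = inFlight.isEmpty() ? acked.last() : acked.lower(inFlight.first());
        if (candidate != null && candidate > safe) {
            safe = candidate;
            acked.headSet(candidate).clear(); // older acks are implied by the checkpoint
        }
    }

    // Offset that can be checkpointed without a flush (commit safe + 1 to Kafka).
    OptionalLong safeCheckpoint() {
        return safe < 0 ? OptionalLong.empty() : OptionalLong.of(safe);
    }
}
```

Replaying the two slides: after sending o1 and o2 and receiving ack(sp0, o2), there is no safe checkpoint yet; after sending o3 and o4 and receiving ack(sp0, o1), the safe checkpoint is o2, with no flush anywhere in the loop.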
Managing Performance through Tasks
● Datastream task
○ Consists of a dedicated Kafka consumer and uses a shared producer pool to produce the data
○ Performance is controlled by the number of tasks
○ Tasks are assigned to hosts within the BMM cluster
● BMM uses sticky assignment to speed up task allocation
Sticky Task Assignment
[Diagram: four BMM hosts coordinated through ZooKeeper run Tasks 1-6; after a fifth BMM host joins, the leader moves only Task 6 to the new host and Tasks 1-5 stay where they were]
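A Java sketch of one way sticky assignment can work: keep each task on its previous host, give orphaned or new tasks to the least-loaded host, then shift tasks one at a time from the busiest to the idlest host until loads differ by at most one, so only the excess moves. The StickyAssigner class is a hypothetical illustration, not Brooklin's actual assignment strategy; on the slide a leader host computes the assignment, coordinated through ZooKeeper.

```java
import java.util.*;

// Hypothetical sketch of sticky task assignment; not Brooklin's actual strategy.
class StickyAssigner {
    // previous: host -> tasks from the last assignment; hosts: live hosts; tasks: all current tasks.
    static Map<String, List<String>> assign(Map<String, List<String>> previous,
                                            List<String> hosts, List<String> tasks) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (String h : hosts) result.put(h, new ArrayList<>());
        Set<String> placed = new HashSet<>();

        // 1. Stickiness: a task stays on its previous host if both still exist.
        for (Map.Entry<String, List<String>> e : previous.entrySet()) {
            List<String> bucket = result.get(e.getKey());
            if (bucket == null) continue; // that host left the cluster
            for (String t : e.getValue()) {
                if (tasks.contains(t) && placed.add(t)) bucket.add(t);
            }
        }

        Comparator<String> byLoad = Comparator.comparingInt(h -> result.get(h).size());

        // 2. Orphaned or brand-new tasks go to the currently least-loaded host.
        for (String t : tasks) {
            if (!placed.contains(t)) result.get(Collections.min(result.keySet(), byLoad)).add(t);
        }

        // 3. Rebalance one task at a time until host loads differ by at most one.
        while (true) {
            String busiest = Collections.max(result.keySet(), byLoad);
            String idlest = Collections.min(result.keySet(), byLoad);
            List<String> from = result.get(busiest);
            if (from.size() - result.get(idlest).size() <= 1) break;
            result.get(idlest).add(from.remove(from.size() - 1));
        }
        return result;
    }
}
```

In the slide's scenario (Tasks 1 to 6 on four hosts, then a fifth host joins), this sketch moves exactly one task to the new host; it picks Task 5 rather than Task 6 only because it takes from the first busiest host it finds. Every other task keeps its host, which is what makes reassignment cheap.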
BMM Performance Numbers
● Testing environment
○ Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz, 12 cores, 64GB RAM
● Performance Metrics with 20 datastream tasks:
○ Throughput: up to 28 MB/s of compressed bytes
○ Memory utilization: 70%
○ CPU utilization: ~100%
Passthrough Compression
● BMM is CPU-bound: 70%+ of CPU time is spent in decompression and recompression
○ GZIPInputStream.read(): ~10%
○ GZIPOutputStream.write(): ~61%
● "Passthrough" mirroring skips the decompression and recompression
○ Throughput: ~100 MB/s
○ CPU utilization drops to 50%
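To make the cost difference concrete, the Java sketch below contrasts the two paths on a single gzip-compressed payload using java.util.zip: the default path inflates with GZIPInputStream and deflates again with GZIPOutputStream (the two methods in the profile above), while passthrough forwards the compressed bytes unchanged. MirrorPaths and its methods are hypothetical, and the sketch is a simplification: a real Kafka record batch also carries a header (base offset, CRC and more), so a production passthrough would need the producer and brokers to accept an already-compressed batch.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration of recompression vs. passthrough on one gzip payload.
class MirrorPaths {
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bos)) {
            out.write(raw); // closing the stream writes the gzip trailer
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    static byte[] gunzip(byte[] gz) {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(gz))) {
            return in.readAllBytes();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Default path: inflate the batch, then deflate it again before producing (CPU-heavy).
    static byte[] recompress(byte[] gzBatch) {
        return gzip(gunzip(gzBatch));
    }

    // Passthrough path: forward the compressed bytes untouched.
    static byte[] passthrough(byte[] gzBatch) {
        return gzBatch;
    }
}
```

Both paths deliver the same decompressed content to the consumer at the destination; only the first spends CPU on inflating and deflating every batch in the mirror.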
Future Work
● Better workload distribution: workload-based assignment
● Auto-scaling: adjust the number of tasks based on throughput
Performance & Stability
Open Source
Expected by the end of April 2019
Questions
Thank you
