Mirus
Reliable, high performance replication for Apache Kafka
pdavidson@salesforce.com
Paul Davidson, Principal Engineer
Seattle Apache Kafka Meetup
April 18, 2019
Kafka Data Replication at Salesforce
Must send data from multiple global data centres to aggregate clusters
• Securely
• No data loss
• Minimal latency
• Minimal data duplication
Challenges:
• Rapidly increasing load, multiple DCs
• Dynamic environment: topics frequently added and removed
• WAN connections
Kafka Data Replication at Salesforce
Apache Mirror Maker
• Simple tool provided with Apache Kafka
• A Kafka Consumer sending data to a Kafka Producer
• Static configuration
• Regex whitelist for topics
How was it done?
Mirror Maker
Mirror Maker at Scale
● Mirror Maker works well for small static clusters, but ...
● At large scale:
○ Re-balance loops
■ Fix: Careful configuration
● Increase session.timeout.ms, request.timeout.ms
○ Unhandled exceptions (especially in Kafka < 0.11)
■ Fix: Apply upstream patches and stay up-to-date!
○ Poor handling of missing / offline destination partitions
■ Fix: Custom patch for the Consumer Group Coordinator
● Filter any topics currently unavailable in the destination
● Reliable, but definitely a work-around
Mirror Maker at Scale
● Poor control of partition assignment impacts performance
○ Stops consuming while committing a batch
○ Large batches required for throughput across WAN
○ Must run many instances per node to achieve good throughput
● Not suitable for API-driven cluster management
○ Static configuration: release and restart to update!
○ Limited control of topic white-list
There must be a better way ...
Mirus
Dynamic Replication service based on Kafka Connect
Introducing...
Mirus?
Latin root of Mirror: “wonderful, marvelous”
Aside: Kafka Connect
● Built-in Kafka framework
○ For reliably streaming data to and from Kafka
● Dynamic configuration and status with REST API
● Handles cluster management
○ “Distributed Herder” built on Kafka cluster management
○ Backed by compacted Kafka topics
● Continuous ingestion
○ Keeps consuming while committing offsets
○ Supports multiple consumers and producers
Introducing Mirus
● Based on the Kafka Connect framework
● Dynamic configuration
○ REST API for configuration updates
○ Precise control of replication
● Configurable parallelism
● Support for task-level consumer and producer monitoring
● Improved resiliency: automated restart on Kafka Connect task failure
Kafka Connect Overview
● Kafka Connect cluster
○ A distributed set of Worker processes
○ One or more Workers per host
○ Managed by Distributed Herder
● Worker Processes
○ Tasks
○ Connectors
■ Source - read from X, write to Kafka
■ Sink - read from Kafka, write to X
Mirus Internals
● Mirus includes:
○ Custom Source Connector implementation
○ Custom Source Task implementation
○ Customized Worker entry point
○ Custom monitor threads
Mirus “Kafka Monitor”
● The heart of Mirus
○ Thread managed by the Mirus Source Connector
○ Monitors the state of source and destination Kafka clusters
● Handles partition assignment
○ Applies white-list to current state
○ Missing destination topics are counted but not mirrored
○ Triggers rebalances when partition assignments change:
■ Source configuration changes
■ Source partition changes
■ Destination partition creation / deletion
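The assignment logic on this slide could be sketched roughly as follows. This is a minimal illustration in Python, not the Mirus implementation: the function names `partitions_to_mirror` and `needs_rebalance` and the plain dict/set representation of cluster state are hypothetical, and the counting of missing destination topics is left out for brevity.

```python
import re


def partitions_to_mirror(source_partitions, destination_topics, whitelist_regex):
    """Return the sorted (topic, partition) pairs that should be mirrored.

    source_partitions: dict mapping source topic name -> partition count
    destination_topics: set of topic names that currently exist in the destination
    whitelist_regex: regular expression that a topic name must match
    """
    pattern = re.compile(whitelist_regex)
    selected = []
    for topic, partition_count in source_partitions.items():
        if not pattern.match(topic):
            continue  # not on the whitelist
        if topic not in destination_topics:
            continue  # missing in the destination: skip instead of failing
        selected.extend((topic, p) for p in range(partition_count))
    return sorted(selected)


def needs_rebalance(previous, current):
    """A rebalance is needed whenever the set of partitions to mirror changes."""
    return previous != current


source = {"topic-name-a": 2, "topic-name-b": 1, "other": 3}
before = partitions_to_mirror(source, {"topic-name-a"}, r"^topic-name.*$")
# topic-name-b is then created in the destination, so the assignment changes
after = partitions_to_mirror(
    source, {"topic-name-a", "topic-name-b"}, r"^topic-name.*$"
)
print(before)
print(after)
print(needs_rebalance(before, after))
```

In this sketch the monitor would recompute the partition list periodically and request a rebalance only when the result differs from the previous one.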
Worker Process
Dynamic Configuration
● REST API to POST connector configuration updates
○ Connector config dynamic, worker config static
● Config updates trigger rebalance
● Configuration topic location is configurable: it can be in the Source or Destination Kafka Cluster
○ Use the cluster closest to the Mirus Workers
Partition and Task Assignment
● Happens on every rebalance
● Mirus partition algorithm pluggable
○ Round-robin by default
○ Could use metadata, e.g. high-throughput.
● Task assignment
○ KC framework: round-robin only
○ Not pluggable (we could patch this).
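As an illustration of the default round-robin strategy, partitions can be dealt out to tasks one at a time. The sketch below is a simplified Python stand-in, not the Mirus code: `assign_round_robin` is a hypothetical name, and partitions are modelled as (topic, partition) tuples.

```python
def assign_round_robin(partitions, num_tasks):
    """Distribute partitions across tasks in round-robin order."""
    if num_tasks < 1:
        raise ValueError("num_tasks must be at least 1")
    tasks = [[] for _ in range(num_tasks)]
    # Sort first so every rebalance produces the same assignment for the same input
    for index, partition in enumerate(sorted(partitions)):
        tasks[index % num_tasks].append(partition)
    return tasks


partitions = [
    ("topic-a", 0), ("topic-a", 1), ("topic-a", 2),
    ("topic-b", 0), ("topic-b", 1),
]
assignment = assign_round_robin(partitions, 2)
for task_id, assigned in enumerate(assignment):
    print(task_id, assigned)
```

A metadata-aware strategy, such as one that spreads high-throughput topics evenly, would replace the simple index-based placement with a weighting of its own.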
Open Sourced
Released September 2018
BSD License
https://github.com/salesforce/mirus
What’s Next?
● Integration with our Kafka orchestration tooling
○ Rapidly provision and mirror new topics across multiple clusters
● Topic creation, topic metadata replication
○ Has been requested by users
● Mirus Sink Connector
○ Improved support for multiple destination clusters in push configurations
Q&A
Appendix
REST API Example
● PUT connector configuration update
○ E.g. increase number of tasks.
● Handler writes to configuration topic
● Distributed Herder triggers rebalance
Increase the number of parallel tasks
bash-4.1$ curl localhost:8093/connectors/source-name/config \
-X PUT \
-H "Content-Type: application/json" \
--data \
'{
"connector.class":
"com.salesforce.mirus.kafka.connect.MirusSourceConnector",
"consumer.bootstrap.servers": "source-hostname:9093",
"destination.bootstrap.servers": "dest-hostname:9093",
"name": "source-name",
"tasks.max": "90",
"topics.regex": "^topic-name.*$",
...
}'
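The same request can be built from a script. The following is a minimal Python sketch using only the standard library; the helper name `build_config_update` is hypothetical, the host, port and connector name are the ones from the slide, and only the settings shown on the slide are included. The request is constructed but not sent, so the snippet runs without a Kafka Connect worker.

```python
import json
import urllib.request


def build_config_update(base_url, connector_name, config):
    """Build (but do not send) a PUT request that updates a connector's config."""
    return urllib.request.Request(
        url=f"{base_url}/connectors/{connector_name}/config",
        data=json.dumps(config).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


config = {
    "connector.class": "com.salesforce.mirus.kafka.connect.MirusSourceConnector",
    "consumer.bootstrap.servers": "source-hostname:9093",
    "destination.bootstrap.servers": "dest-hostname:9093",
    "name": "source-name",
    "tasks.max": "90",
    "topics.regex": "^topic-name.*$",
    # settings elided on the slide are omitted here as well
}
request = build_config_update("http://localhost:8093", "source-name", config)
print(request.get_method(), request.full_url)

# To send it to a running worker:
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```

Because a PUT to this endpoint supplies the whole connector configuration, a script changing only `tasks.max` would normally fetch the current configuration first and resubmit it with the one value changed.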
Push Mode Replication
[Diagram labels: Leaf DCs (Kafka Cluster, Mirus Cluster), WAN, Aggregate DCs (Kafka Cluster)]
Pull Mode Replication
[Diagram labels: Leaf DCs (Kafka Cluster), WAN, Aggregate DCs (Mirus Cluster, Kafka Cluster)]
