Processing a lot of data with Kafka means knowing how and when to scale horizontally and vertically. When you’ve exhausted the boundaries of scaling inside a single cluster, replication becomes critical, but sometimes standard replication is not enough.
New Relic once earned the dubious title of “World’s Largest Kafka Cluster”, and in our journey to break this cluster into dozens of smaller clusters, we needed to route events between clusters and topics based on headers.
At the time, this meant we had to do it ourselves. Starting out, our goal was fan-out (one-to-many) replication. Since then, our needs have expanded to include many-to-one and many-to-many replication.
In this talk we'll discuss which bottlenecks we hit as we scaled out, and what measures we took to remove them, such as:
- Replicating data based on Kafka headers
- Connecting to many source and destination Kafka clusters
- Managing the replication of Kafka topics of varying traffic
- The use of an intermediary Kafka cluster
At the end of this talk you will understand how we have scaled replication and routing to support New Relic's ever-growing data ingestion, and all the mitigations it took to get us there.
Go Big or Go Home: Approaching Kafka Replication at Scale
1. Go Big or Go Home
Approaching Kafka Replication at Scale
Julia Holgado
2. Agenda
● New Relic’s cloud migration, focusing on replication between Kafka clusters
○ Discovering the need for one to many routing
○ What we did to fulfill that need
■ Problems + mitigations
○ Discovering the need for other types of routing
○ Extending our one to many solution to fulfill many to many routing
■ Problems
○ Ongoing improvements
7-11. One To Many Example
[Diagram repeated across slides 7 to 11; labels: HTTP Endpoints, Pipeline Services, Ingest Tier, Insert Workers, APIs & UIs, Kafka, New Relic DB, Datacenter]
12. One to Many Routing: Requirements
● Isolate the partial or total failure of a destination cell so that it does not impact other cells
● React to changes in routing without a deploy
● Route based on Kafka headers
● Support multiple routing strategies
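The routing requirements above can be illustrated with a minimal sketch. It is not the actual router: the header key `cell`, the `RoutingTable` class, and the two strategy names are hypothetical, and headers are modelled as a plain list of `(key, bytes)` pairs rather than through any particular Kafka client. It shows routing on a header value, swapping the routing table at runtime (no deploy), and selecting among more than one strategy.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# Headers modelled as (key, value-bytes) pairs; an assumption for this sketch.
Headers = List[Tuple[str, bytes]]


@dataclass
class RoutingTable:
    """Hypothetical table mapping a header value to destination cells."""
    routes: Dict[str, List[str]] = field(default_factory=dict)
    default: List[str] = field(default_factory=list)

    def update(self, new_routes: Dict[str, List[str]]) -> None:
        # Replace the routes at runtime, so routing can change without a deploy.
        self.routes = {k: list(v) for k, v in new_routes.items()}


def header_value(headers: Headers, key: str) -> Optional[str]:
    """Return the first header value for `key`, decoded, or None if absent."""
    for k, v in headers:
        if k == key:
            return v.decode("utf-8")
    return None


def route_by_header(table: RoutingTable, headers: Headers, key: str = "cell") -> List[str]:
    """Strategy 1: pick destinations from the header value, else the default."""
    value = header_value(headers, key)
    if value is None:
        return list(table.default)
    return list(table.routes.get(value, table.default))


def route_to_all(table: RoutingTable, headers: Headers, key: str = "cell") -> List[str]:
    """Strategy 2: send to every destination known to the table."""
    every = {d for dests in table.routes.values() for d in dests}
    return sorted(every | set(table.default))


# Strategies are looked up by name, so a topic could be configured with either.
STRATEGIES: Dict[str, Callable[..., List[str]]] = {
    "by_header": route_by_header,
    "to_all": route_to_all,
}
```

For example, with `RoutingTable(routes={"b": ["destB", "destC"]}, default=["destA"])`, a record carrying the header `("cell", b"b")` is routed to `destB` and `destC` under `by_header`, while a record without that header falls back to `destA`.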
22. Problem: Topics of Varying Size and Traffic
[Diagram labels: Router, Mirror; topic partitions: destA.topic_name-0…N, dest#.topic_name-0…N, destB.high_traffic_topic-0…2N, destC.high_traffic_topic-0…2N]
23. One to Many Problems: Summary
● Partition explosion
○ More strain on Kafka brokers
○ Rebalance storms when managing Mirror instances
● Handling topics of varying size and throughput
○ Cannot steer more resources towards a certain topic
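A rough way to see the partition explosion is to count the topics created when every destination gets its own copy of every topic, following the `dest.topic_name` naming shown on slide 22. The sketch below assumes, for illustration only, that each copy keeps the partition count of its source topic; the function names and the example counts are hypothetical.

```python
from typing import Dict, List


def intermediate_topic(dest: str, topic: str) -> str:
    """Name a per-destination copy of a topic, in the style `destA.topic_name`."""
    return f"{dest}.{topic}"


def intermediate_partitions(source_partitions: Dict[str, int],
                            destinations: List[str]) -> Dict[str, int]:
    """Map every per-destination topic to its partition count.

    Assumption: a copy has the same number of partitions as its source topic.
    """
    return {
        intermediate_topic(dest, topic): count
        for dest in destinations
        for topic, count in source_partitions.items()
    }


# Hypothetical numbers: one ordinary topic and one with twice the partitions.
source = {"topic_name": 12, "high_traffic_topic": 24}
layout = intermediate_partitions(source, ["destA", "destB", "destC"])
total = sum(layout.values())  # grows as destinations x source partitions
```

Under these assumptions the total grows with the product of destination count and source partition count, which is why adding cells puts more strain on the brokers and widens every consumer group rebalance.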
25. Sharding Outcomes
● Designate a set of Kynapses instances for particular topics
○ Lessen rebalances on restarts and deploys
○ Scale each shard independently
● Downsides
○ Shards are organized manually
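Manual sharding of this kind can be pictured as a hand-maintained mapping from shard to topics, which each instance reads to learn what it should handle. The shard names, topic names, and helper functions below are hypothetical; the sketch only illustrates the idea and the bookkeeping it pushes onto operators.

```python
from typing import Dict, List

# Hand-maintained shard layout (hypothetical names). Each shard would be run
# by its own set of instances and scaled independently of the others.
SHARDS: Dict[str, List[str]] = {
    "shard-general": ["topic_name", "other_topic"],
    "shard-high-traffic": ["high_traffic_topic"],
}


def validate(shards: Dict[str, List[str]]) -> Dict[str, str]:
    """Check that every topic belongs to exactly one shard.

    Returns a topic -> shard index; raises ValueError on a duplicate.
    """
    owner: Dict[str, str] = {}
    for shard, topics in shards.items():
        for topic in topics:
            if topic in owner:
                raise ValueError(
                    f"{topic} is in both {owner[topic]} and {shard}"
                )
            owner[topic] = shard
    return owner


def topics_for(shard: str, shards: Dict[str, List[str]] = SHARDS) -> List[str]:
    """Topics an instance of the given shard should handle."""
    return list(shards[shard])
```

Because the mapping is edited by hand, a check like `validate` is the kind of guard needed to catch a topic placed in two shards, which is part of the operational cost listed under the downsides.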
44. Improvements: WorkAssignment
● Goal: Let Kynapses instances assign themselves to a topic, in order to
○ Improve resource distribution; be able to steer more instances to large topics
○ Reduce the number of consumers each instance spins up
○ Remove operational toil of shards
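The slides do not describe how WorkAssignment chooses a topic, so the following is only one possible sketch of self-assignment. It assumes each instance can see a relative traffic figure per topic and how many instances already serve each topic (in practice that shared state would live in some coordination store, not shown here), and then greedily joins the topic with the most traffic per assigned instance. All names are hypothetical.

```python
from typing import Dict, List


def pick_topic(traffic: Dict[str, float], assigned: Dict[str, int]) -> str:
    """Choose the topic a newly started instance should serve.

    Topics with no instance yet are taken first; otherwise the topic with the
    highest traffic per already-assigned instance wins. Iterating in sorted
    order makes ties deterministic.
    """
    def need(topic: str) -> float:
        count = assigned.get(topic, 0)
        return float("inf") if count == 0 else traffic[topic] / count

    return max(sorted(traffic), key=need)


def assign_all(instances: List[str], traffic: Dict[str, float]) -> Dict[str, str]:
    """Simulate instances starting one after another and assigning themselves."""
    assigned = {topic: 0 for topic in traffic}
    result: Dict[str, str] = {}
    for instance in instances:
        topic = pick_topic(traffic, assigned)
        assigned[topic] += 1
        result[instance] = topic
    return result


# Hypothetical traffic weights: one topic carries six times the load.
plan = assign_all(
    ["i1", "i2", "i3", "i4"],
    {"high_traffic_topic": 6.0, "topic_name": 1.0},
)
```

With these weights, three of the four instances end up on `high_traffic_topic` and one on `topic_name`. Since an instance serves a single topic in this sketch, it only needs consumers for that topic, and no shard layout has to be maintained by hand.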
54. Summary
● How New Relic has handled replicating data between many Kafka clusters
○ Redundancy, failure isolation, and our pipeline architecture led us to develop our own tool
○ We ran into several difficulties with our chosen implementation, particularly
■ Using an intermediate Kafka cluster to separate the responsibilities of consuming from the source cluster and producing to the destination cluster can result in a large, difficult-to-manage cluster
■ Managing the routing of many topics requires more efficient use of service resources
■ Highlighting the weaknesses of NR’s cellular architecture
○ Our plans moving forward
Julia Holgado
jholgado@newrelic.com