7. Trends – Incoming Data
• May 2015 – Legacy pipeline received ~3.5 billion events
per day
• Today – Our pipelines receive over 20 billion
events per day
• Each product release adds more messages, so
our system has to be ready to scale
• Rarely do old messages go away (we’re working on that)
8. Data Gnomes vs. Talon & The Legion
In 2016, our army of Rabbits came under attack
on several fronts:
• Overwatch launched in May of 2016
• The Legion would return to Azeroth in World
of Warcraft’s latest expansion three months
later
9. Flafka – Flume & Kafka
• We needed something to augment our existing
pipeline
• In 2015, many people suggested we could do Flume’s job
“with 4 lines of Spark”
• These people may have been working for the Legion.
• Comfort with Flume deployments made “Flafka” the
natural choice
• Retrofitted existing pipeline with no message loss
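At its core, “Flafka” is a Flume agent with the Kafka Channel sitting between the existing source and the HDFS sink. A rough sketch of that wiring is below; hosts, topic, and group are illustrative, and the property keys are the Flume 1.7-style names (1.6 used brokerList/zookeeperConnect instead):

    agent.sources = legacy
    agent.channels = kc
    agent.sinks = hdfs-sink

    # Kafka Channel: events are buffered in a Kafka topic instead of memory or local files
    agent.channels.kc.type = org.apache.flume.channel.kafka.KafkaChannel
    agent.channels.kc.kafka.bootstrap.servers = kafka01:9092,kafka02:9092
    agent.channels.kc.kafka.topic = flume-channel-events
    agent.channels.kc.kafka.consumer.group.id = flume-hdfs-writers
    agent.channels.kc.parseAsFlumeEvent = true

    # Existing source and HDFS sink are simply re-pointed at the new channel
    # agent.sources.legacy.type = <existing source definition, unchanged>
    agent.sources.legacy.channels = kc
    agent.sinks.hdfs-sink.type = hdfs
    agent.sinks.hdfs-sink.channel = kc
    agent.sinks.hdfs-sink.hdfs.path = hdfs://cluster-a/data/events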
11. Flafka (Continued)
• Flafka enabled us to write to multiple HDFS clusters for
the first time, simplifying cluster migrations.
• Customized version of the Kafka Channel
Next Bottleneck:
• Too many open files
• The 0.9 consumer doesn’t have a dedicated heartbeat thread
• Creating this many files led to missed heartbeats, causing
frequent rebalances (sketch below)
• 25% of Flume’s time was spent creating files on HDFS
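Without a dedicated heartbeat thread, the 0.9 consumer can only heartbeat from inside poll(), so time spent creating HDFS files looks like a dead consumer. One common mitigation is to raise the session timeout; a minimal sketch with illustrative values (not our production settings):

    import java.util.Properties;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class PatientConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "kafka01:9092");   // illustrative broker
            props.put("group.id", "flume-hdfs-writers");      // illustrative group
            props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.ByteArrayDeserializer");
            props.put("value.deserializer",
                    "org.apache.kafka.common.serialization.ByteArrayDeserializer");
            // 0.9 heartbeats only happen inside poll(), so give slow HDFS file
            // creation more room before the coordinator declares the consumer dead.
            props.put("session.timeout.ms", "60000");
            // Keep the request timeout above the session timeout.
            props.put("request.timeout.ms", "70000");
            // The broker-side cap, group.max.session.timeout.ms, may also need raising.
            KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props);
            consumer.close();
        }
    }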
14. Meanwhile, Back in Gnomeregan…
• We were asked to build a new “Telemetry” pipeline to
support operational data for the launch of Overwatch
• Telemetry combines ideas from our legacy pipelines to
offer the best of both:
• Schema Registry Service
• Lambda architecture
• Able to collect both client and server data
• Flume is still used to write to HDFS
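To illustrate the Schema Registry Service idea: each event is serialized against a registered schema and framed with that schema’s id, so consumers can look the schema up instead of guessing. The Avro format, field names, and id framing below are illustrative assumptions, not the actual Telemetry wire format:

    import java.io.ByteArrayOutputStream;
    import java.nio.ByteBuffer;
    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.io.BinaryEncoder;
    import org.apache.avro.io.EncoderFactory;

    public class SchemaTaggedEvent {
        public static void main(String[] args) throws Exception {
            // Hypothetical event schema; in the real pipeline this would come
            // from the registry service rather than being inlined here.
            Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
                + "{\"name\":\"name\",\"type\":\"string\"},"
                + "{\"name\":\"ts\",\"type\":\"long\"}]}");

            GenericData.Record event = new GenericData.Record(schema);
            event.put("name", "match_started");
            event.put("ts", System.currentTimeMillis());

            ByteArrayOutputStream out = new ByteArrayOutputStream();
            out.write(ByteBuffer.allocate(4).putInt(42).array());  // made-up schema id
            BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
            new GenericDatumWriter<GenericData.Record>(schema).write(event, enc);
            enc.flush();
            // out.toByteArray() is the payload that would be produced to Kafka.
        }
    }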
16. Data Platform Remastered
• Multiple Datastores
• Elasticsearch for short-term (7 days), near-real-time
storage/visualization
• Cassandra for time series (metrics and events)
• HDFS for long-term, indefinite storage
• More Datasources
• REST API (metrics, events, alerts)
• Syslog RFC-5424
• Legacy AMQP still supported
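For the syslog source, an RFC 5424 message carries a priority, version, timestamp, hostname, app-name, procid, msgid, structured data, and a free-form message. A minimal sketch of emitting one over UDP; the host, port, and field values are illustrative:

    import java.net.DatagramPacket;
    import java.net.DatagramSocket;
    import java.net.InetAddress;
    import java.nio.charset.StandardCharsets;

    public class SyslogExample {
        public static void main(String[] args) throws Exception {
            // PRI 14 = facility 1 (user-level) * 8 + severity 6 (informational)
            String msg = "<14>1 2017-05-01T12:00:00.000Z web01.example.com "
                       + "telemetry 4242 - - {\"metric\":\"server.tick_ms\",\"value\":16}";
            byte[] bytes = msg.getBytes(StandardCharsets.UTF_8);
            try (DatagramSocket socket = new DatagramSocket()) {
                socket.send(new DatagramPacket(bytes, bytes.length,
                        InetAddress.getByName("syslog.example.com"), 514));
            }
        }
    }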
18. Bigger isn’t always better
• We built our brokers like Hadoop datanodes:
• 15 × 4 TB drives (RAID 10)
• 256 GB RAM
• 40 logical cores
• Even after tweaking, it can take 5-10 minutes to bring
up a broker that didn’t shut down cleanly.
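With that much data per broker, recovery after an unclean shutdown has to scan a lot of log segments. One knob that kind of tweaking typically involves (the value here is illustrative, not our setting) is letting Kafka recover log directories with more threads:

    # server.properties – parallelize log recovery after an unclean shutdown
    # (the default is 1 thread per data directory)
    num.recovery.threads.per.data.dir=8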
19. Topic Naming is Hard
• datatype-version-product-source-datacenter?
• datatype-version-pipeline-product-source-destination?
• Record headers (KAFKA-4208) may simplify this for us
Partitioning is (also) Hard
• Minimum partitions per topic: 2
• Average partitions per topic: 20
• Most partitions on a single topic: 48 (previously 256)
• Replication factor: 3 (min.insync.replicas = 2)
• Central Kafka: ~13000 partitions over 12 brokers
• Regional Kafka: ~3000 partitions over 6 brokers
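Whatever naming scheme wins, the partitioning rules above translate into topic creation that looks roughly like this. The topic name follows one of the candidate schemes, the broker address is illustrative, and the Java AdminClient itself only arrived in Kafka 0.11, after the versions discussed here:

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewTopic;

    public class CreateTelemetryTopic {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka01:9092");
            try (AdminClient admin = AdminClient.create(props)) {
                // ~20 partitions is the average cited above; RF 3 with min.isr 2
                NewTopic topic = new NewTopic("telemetry-v1-overwatch-client-us", 20, (short) 3)
                        .configs(Collections.singletonMap("min.insync.replicas", "2"));
                admin.createTopics(Collections.singleton(topic)).all().get();
            }
        }
    }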
20. Monitoring & Alerting
• LinkedIn’s Burrow monitors offsets.
• Biggest limitation: Calculates lag at request time,
not commit time (Burrow Issue #127)
• End-to-end metrics tell us how long a message
took to process.
• Kafka & ZooKeeper JMX values are piped into
our metrics system with jmxterm (JMX sketch below).
• An isolated version of our pipeline is used to
monitor the health of the customer pipeline.
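The jmxterm scraping above boils down to reading broker MBeans; the equivalent in plain Java JMX looks roughly like this. The JMX port and the chosen MBean are just examples, and the broker has to be started with remote JMX enabled (e.g. JMX_PORT set):

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class BrokerJmxProbe {
        public static void main(String[] args) throws Exception {
            // Assumes the broker exposes JMX on port 9999 (illustrative).
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://kafka01:9999/jmxrmi");
            JMXConnector connector = JMXConnectorFactory.connect(url);
            try {
                MBeanServerConnection mbsc = connector.getMBeanServerConnection();
                Object rate = mbsc.getAttribute(
                        new ObjectName("kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec"),
                        "OneMinuteRate");
                System.out.println("MessagesInPerSec (1-min rate): " + rate);
            } finally {
                connector.close();
            }
        }
    }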
21. The Future Soon™
• Replacing Flume
• Flume has served us well, but we’ve pushed it as far
as we feel we could (should?)
• We’re considering Gobblin, Kafka Connect, or a
custom Spark application to write data to HDFS
(a sample Connect configuration follows this slide)
• We’ll still use Flume to send from our legacy pipeline,
as those RabbitMQ servers will probably stick around
for years.
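As one illustration of the Kafka Connect option, Confluent’s HDFS sink connector is driven by a handful of properties. This is only a sketch of one candidate, not a decision, and the topic, URL, and sizes are illustrative:

    name=hdfs-sink
    connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
    tasks.max=4
    topics=telemetry-events
    hdfs.url=hdfs://cluster-a:8020
    flush.size=10000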
22. Isolation and the Cloud
• Through our TDK (Telemetry Development Kit), users can
spin up an isolated, virtual version of our pipeline to
develop against.
• What if we deployed the entire pipeline this way?
• Smaller, dynamically provisioned Kafkas in the cloud
would serve to isolate teams from each other
• Requires far more automation
• Reduces cross-product impacts.
• Requirements:
• Service discovery
• Automation
• Routing service
23. Links
• https://github.com/Blizzard/node-rdkafka
• Patches welcome!
• https://github.com/linkedin/Burrow
• For monitoring consumer offsets
• https://github.com/jiaqi/jmxterm
• For getting metrics out of Kafka & ZooKeeper
• https://github.com/Yelp/kafka-utils
• For cloning/managing consumer groups/offsets