7. Trends – Incoming Data
• May 2015 – Legacy pipeline received ~3.5 billion events
per day
• Today – Our pipelines receive over 20 billion
events per day
• Each product release adds more messages, so
our system has to be ready to scale
• Rarely do old messages go away (we’re working on that)
8. Data Gnomes vs. Talon & The Legion
In 2016, our army of Rabbits came under attack
on several fronts:
• Overwatch launched in May of 2016
• The Legion would return to Azeroth in World
of Warcraft’s latest expansion three months
later
9. Flafka – Flume & Kafka
• We needed something to augment our existing
pipeline
• In 2015, many people suggested we could do Flume’s job
“with 4 lines of Spark”
• These people may have been working for the Legion.
• Comfort with Flume deployments made “Flafka” the
natural choice
• Retrofitted existing pipeline with no message loss
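At its core, “Flafka” is a Flume agent with the Kafka Channel sitting between the existing source and the HDFS sink. A rough sketch of that wiring is below; hosts, topic, and group are illustrative, and the property keys are the Flume 1.7-style names (1.6 used brokerList/zookeeperConnect instead):

    agent.sources = legacy
    agent.channels = kc
    agent.sinks = hdfs-sink

    # Kafka Channel: events are buffered in a Kafka topic instead of memory or local files
    agent.channels.kc.type = org.apache.flume.channel.kafka.KafkaChannel
    agent.channels.kc.kafka.bootstrap.servers = kafka01:9092,kafka02:9092
    agent.channels.kc.kafka.topic = flume-channel-events
    agent.channels.kc.kafka.consumer.group.id = flume-hdfs-writers
    agent.channels.kc.parseAsFlumeEvent = true

    # Existing source and HDFS sink are simply re-pointed at the new channel
    # agent.sources.legacy.type = <existing source definition, unchanged>
    agent.sources.legacy.channels = kc
    agent.sinks.hdfs-sink.type = hdfs
    agent.sinks.hdfs-sink.channel = kc
    agent.sinks.hdfs-sink.hdfs.path = hdfs://cluster-a/data/events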
11. Flafka (Continued)
• Flafka enabled us to write to multiple HDFS clusters for
the first time, simplifying cluster migrations.
• Customized version of the Kafka Channel
Next Bottleneck:
• Too many open files
• The 0.9 consumer doesn’t have a dedicated heartbeat thread
• Creating this many files led to missed heartbeats, causing
frequent rebalances (sketch below)
• 25% of Flume’s time was spent creating files on HDFS
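Without a dedicated heartbeat thread, the 0.9 consumer can only heartbeat from inside poll(), so time spent creating HDFS files looks like a dead consumer. One common mitigation is to raise the session timeout; a minimal sketch with illustrative values (not our production settings):

    import java.util.Properties;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class PatientConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "kafka01:9092");   // illustrative broker
            props.put("group.id", "flume-hdfs-writers");      // illustrative group
            props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.ByteArrayDeserializer");
            props.put("value.deserializer",
                    "org.apache.kafka.common.serialization.ByteArrayDeserializer");
            // 0.9 heartbeats only happen inside poll(), so give slow HDFS file
            // creation more room before the coordinator declares the consumer dead.
            props.put("session.timeout.ms", "60000");
            // Keep the request timeout above the session timeout.
            props.put("request.timeout.ms", "70000");
            // The broker-side cap, group.max.session.timeout.ms, may also need raising.
            KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props);
            consumer.close();
        }
    }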
14. Meanwhile, Back in Gnomeregan…
• We were asked to build a new “Telemetry” pipeline to
support operational data for the launch of Overwatch
• Telemetry combines ideas from our legacy pipelines to
offer the best of both:
• Schema Registry Service
• Lambda architecture
• Able to collect both client and server data
• Flume is still used to write to HDFS
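To illustrate the Schema Registry Service idea: each event is serialized against a registered schema and framed with that schema’s id, so consumers can look the schema up instead of guessing. The Avro format, field names, and id framing below are illustrative assumptions, not the actual Telemetry wire format:

    import java.io.ByteArrayOutputStream;
    import java.nio.ByteBuffer;
    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.io.BinaryEncoder;
    import org.apache.avro.io.EncoderFactory;

    public class SchemaTaggedEvent {
        public static void main(String[] args) throws Exception {
            // Hypothetical event schema; in the real pipeline this would come
            // from the registry service rather than being inlined here.
            Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
                + "{\"name\":\"name\",\"type\":\"string\"},"
                + "{\"name\":\"ts\",\"type\":\"long\"}]}");

            GenericData.Record event = new GenericData.Record(schema);
            event.put("name", "match_started");
            event.put("ts", System.currentTimeMillis());

            ByteArrayOutputStream out = new ByteArrayOutputStream();
            out.write(ByteBuffer.allocate(4).putInt(42).array());  // made-up schema id
            BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
            new GenericDatumWriter<GenericData.Record>(schema).write(event, enc);
            enc.flush();
            // out.toByteArray() is the payload that would be produced to Kafka.
        }
    }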
16. Data Platform Remastered
• Multiple Datastores
• Elasticsearch for short-term (7 days), near-real-time
storage/visualization
• Cassandra for time series (metrics and events)
• HDFS for long-term, indefinite storage
• More Datasources
• REST API (metrics, events, alerts)
• Syslog RFC-5424
• Legacy AMQP still supported
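For the syslog source, an RFC 5424 message carries a priority, version, timestamp, hostname, app-name, procid, msgid, structured data, and a free-form message. A minimal sketch of emitting one over UDP; the host, port, and field values are illustrative:

    import java.net.DatagramPacket;
    import java.net.DatagramSocket;
    import java.net.InetAddress;
    import java.nio.charset.StandardCharsets;

    public class SyslogExample {
        public static void main(String[] args) throws Exception {
            // PRI 14 = facility 1 (user-level) * 8 + severity 6 (informational)
            String msg = "<14>1 2017-05-01T12:00:00.000Z web01.example.com "
                       + "telemetry 4242 - - {\"metric\":\"server.tick_ms\",\"value\":16}";
            byte[] bytes = msg.getBytes(StandardCharsets.UTF_8);
            try (DatagramSocket socket = new DatagramSocket()) {
                socket.send(new DatagramPacket(bytes, bytes.length,
                        InetAddress.getByName("syslog.example.com"), 514));
            }
        }
    }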
18. Bigger isn’t always better
• We built our brokers like Hadoop datanodes:
• 15 × 4 TB drives (RAID 10)
• 256 GB RAM
• 40 logical cores
• Even after tweaking, it can take 5-10 minutes to bring
up a broker that didn’t shut down cleanly.
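With that much data per broker, recovery after an unclean shutdown has to scan a lot of log segments. One knob that kind of tweaking typically involves (the value here is illustrative, not our setting) is letting Kafka recover log directories with more threads:

    # server.properties – parallelize log recovery after an unclean shutdown
    # (the default is 1 thread per data directory)
    num.recovery.threads.per.data.dir=8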
19. Topic Naming is Hard
• datatype-version-product-source-datacenter?
• datatype-version-pipeline-product-source-destination?
• Record headers (KAFKA-4208) may simplify this for us
Partitioning is (also) Hard
• Minimum partitions per topic: 2
• Average partitions per topic: 20
• Most partitions on a single topic: 48 (previously 256)
• Replication factor: 3 (min.insync.replicas = 2)
• Central Kafka: ~13000 partitions over 12 brokers
• Regional Kafka: ~3000 partitions over 6 brokers
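Whatever naming scheme wins, the partitioning rules above translate into topic creation that looks roughly like this. The topic name follows one of the candidate schemes, the broker address is illustrative, and the Java AdminClient itself only arrived in Kafka 0.11, after the versions discussed here:

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewTopic;

    public class CreateTelemetryTopic {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka01:9092");
            try (AdminClient admin = AdminClient.create(props)) {
                // ~20 partitions is the average cited above; RF 3 with min.isr 2
                NewTopic topic = new NewTopic("telemetry-v1-overwatch-client-us", 20, (short) 3)
                        .configs(Collections.singletonMap("min.insync.replicas", "2"));
                admin.createTopics(Collections.singleton(topic)).all().get();
            }
        }
    }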
20. Monitoring & Alerting
• LinkedIn’s Burrow monitors offsets.
• Biggest limitation: Calculates lag at request time,
not commit time (Burrow Issue #127)
• End-to-end metrics tell us how long a message
took to process.
• Kafka & ZooKeeper JMX values are piped into
our metrics system with jmxterm (JMX sketch below).
• An isolated version of our pipeline is used to
monitor the health of the customer pipeline.
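The jmxterm scraping above boils down to reading broker MBeans; the equivalent in plain Java JMX looks roughly like this. The JMX port and the chosen MBean are just examples, and the broker has to be started with remote JMX enabled (e.g. JMX_PORT set):

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class BrokerJmxProbe {
        public static void main(String[] args) throws Exception {
            // Assumes the broker exposes JMX on port 9999 (illustrative).
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://kafka01:9999/jmxrmi");
            JMXConnector connector = JMXConnectorFactory.connect(url);
            try {
                MBeanServerConnection mbsc = connector.getMBeanServerConnection();
                Object rate = mbsc.getAttribute(
                        new ObjectName("kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec"),
                        "OneMinuteRate");
                System.out.println("MessagesInPerSec (1-min rate): " + rate);
            } finally {
                connector.close();
            }
        }
    }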
21. The Future Soon™
• Replacing Flume
• Flume has served us well, but we’ve pushed it as far
as we feel we could (should?)
• We’re considering Gobblin, Kafka Connect, or a
custom Spark application to write data to HDFS
(a sample Connect configuration follows this slide)
• We’ll still use Flume to send from our legacy pipeline,
as those RabbitMQ servers will probably stick around
for years.
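As one illustration of the Kafka Connect option, Confluent’s HDFS sink connector is driven by a handful of properties. This is only a sketch of one candidate, not a decision, and the topic, URL, and sizes are illustrative:

    name=hdfs-sink
    connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
    tasks.max=4
    topics=telemetry-events
    hdfs.url=hdfs://cluster-a:8020
    flush.size=10000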
22. Isolation and the Cloud
• Through our TDK (Telemetry Development Kit), users can
spin up an isolated, virtual version of our pipeline to
develop against.
• What if we deployed the entire pipeline this way?
• Smaller, dynamically provisioned Kafkas in the cloud
would serve to isolate teams from each other
• Requires far more automation
• Reduces cross-product impacts.
• Requirements:
• Service discovery
• Automation
• Routing service
23. Links
• https://github.com/Blizzard/node-rdkafka
• Patches welcome!
• https://github.com/linkedin/Burrow
• For monitoring consumer offsets
• https://github.com/jiaqi/jmxterm
• For getting metrics out of Kafka & ZooKeeper
• https://github.com/Yelp/kafka-utils
• For cloning/managing consumer groups/offsets