Kafka	at	Blizzard
How	Blizzard	Used	Kafka	to	Save	Our	Pipeline
(And	Azeroth)
The Data Gnomes:
Jeff Field - @jfield – Systems Engineer, Big Data
Dustin Koupal - @cooper6581 – Senior Systems Engineer, Big Data
In the Beginning…
Rabbits.
• First	pipeline	was	built	on	RabbitMQ and	Flume
• Dozens	of	collectors	fanned	into	6	central	Flumes
• Everyday	I’m	Buffering
• 100+	incidents	were	caused	by	overflowing	queues
[Diagram labels: TEL1, ICC1, ORG1, VEB1, STW1, LRD1; Flume; HDFS]
Why protobuf?
• Our	developers	were	already	using	protobuf for	
interoperability
• Using	established	tech	helped	all	game	teams	implement	
our	first	pipeline	in	months	instead	of	years.
For	Blizzard,	this	was	unimaginably	quick.
Trends – Incoming Data
• May	2015	– Legacy	pipeline	received	~3.5	billion	events	
per	day
• Today – Our pipelines receive over 20 billion events per day.
• Each	product	release	sees	more	messages	added,	so	
our	system	has	to	be	ready	to	scale.
• Rarely	do	old	messages	go	away	(we’re	working	on	that)
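For a sense of scale, the daily totals above can be converted to average per-second rates. This is back-of-the-envelope arithmetic only: it assumes events are spread evenly over a 24-hour day, which real traffic presumably is not.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

def events_per_second(events_per_day: float) -> float:
    """Average rate, assuming events are spread evenly over the day."""
    return events_per_day / SECONDS_PER_DAY

legacy_2015 = events_per_second(3.5e9)  # May 2015 figure from the slide
today = events_per_second(20e9)         # "over 20 billion" figure from the slide

print(f"May 2015: ~{legacy_2015:,.0f} events/s")
print(f"Today:    ~{today:,.0f} events/s")
print(f"Growth:   ~{20e9 / 3.5e9:.1f}x")
```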
Data Gnomes vs. Talon & The Legion
In	2016,	our	army	of	Rabbits	came	under	attack	
on	several	fronts:
• Overwatch launched	in	May	of	2016
• The	Legion	would	return	to	Azeroth in	World	
of	Warcraft’s	latest	expansion	three	months	
later
Flafka – Flume & Kafka
• We	needed	something	to	augment	our	existing	
pipeline
• In 2015, many people suggested we could do Flume’s job “with 4 lines of Spark”.
• These	people	may	have	been	working	for	the	Legion.
• Comfort	with	Flume	deployments	made	“Flafka”	the	
natural	choice
• Retrofitted	existing	pipeline	with	no	message	loss
[Diagram labels: ORG1, TEL1, LRD1, STW1, ICC1, VEB1; Flume; Central Kafka; Flume; HDFS; HDFS]
Flafka (Continued)
• Flafka enabled	us	to	write	to	multiple	HDFS	clusters	for	
the	first	time,	simplifying	cluster	migrations.
• Customized	version	of	the	Kafka	Channel
Next	Bottleneck:
• Too	many	open	files
• 0.9	Consumer	doesn’t	have	a	dedicated	heartbeat	thread
• Creating this many files led to missed heartbeats, causing frequent rebalances.
• 25%	of	Flume’s	time	was	creating	files	on	HDFS
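The heartbeat problem can be illustrated with a small simulation. In the 0.9 consumer, heartbeats are only sent from inside poll(), so any work done between polls (here, creating files on HDFS) delays them. The sketch below is not Flume or Kafka code; the function name, the batch timings, and the 30-second session timeout are assumptions chosen for illustration.

```python
def polls_that_miss_heartbeat(work_seconds, session_timeout_s=30.0):
    """Return indexes of poll-loop iterations whose work ran past the session timeout.

    Models a consumer with no background heartbeat thread: a heartbeat goes out
    only when the loop returns to poll(), so the gap between heartbeats equals
    the time spent working on the previous batch.
    """
    return [i for i, spent in enumerate(work_seconds) if spent > session_timeout_s]

# Hypothetical per-batch processing times in seconds; the spikes stand in for
# batches that had to create many files.
batch_times = [2.0, 3.5, 41.0, 2.2, 65.0]
missed = polls_that_miss_heartbeat(batch_times)
print(missed)  # iterations after which the coordinator would likely expire the session
```

Each iteration listed is one where the group coordinator would likely consider the consumer dead and start a rebalance.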
FeedSplitter
• Removing	Flume	immediately	was	impractical
• Flume’s	only	job	now	is	to	read	from	Kafka	and	
write	to	HDFS
• It	is	pretty	good	at	this.
• Lowered	latency	for	HDFS	and	increased	the	
reliability	of	the	pipeline.
Meanwhile, Back in Gnomeregan…
• We	were	asked	to	build	a	new	“Telemetry”	pipeline	to	
support	operational	data	for	the	launch	of	Overwatch
• Telemetry	combines	ideas	from	our	legacy	pipelines	to	
offer	the	best	of	both:
• Schema	Registry	Service	
• Lambda	architecture
• Able	to	collect	both	client	and	server	data
• Flume	is	still	used	to	write	to	HDFS
node-rdkafka
• Initial version of the pipeline was written in Node.js
• We dedicated a developer to writing Node.js bindings for librdkafka.
• Since we released it on GitHub in 2016, the project has seen 340+ stars and 190+ commits, with contributions from IBM, Wikimedia, and others in addition to ourselves
Data Platform Remastered
• Multiple	Datastores
• Elasticsearch for short-term (7 days), near-real-time storage/visualization
• Cassandra	for	time	series	(metrics	and	events)
• HDFS	for	long	term,	indefinite	storage.
• More	Datasources
• REST	API	(metrics,	events,	alerts)
• Syslog (RFC 5424)
• Legacy	AMQP	still	supported
Lessons Learned
• If	you	send	the	string	“null”	as	your	key	to	a	
certain	Kafka	library,	it	maps	to	partition	9
• As	with	Hadoop,	there	are	always	new	edge	
cases.
• Don’t	have	many	MirrorMakers producing	into	
the	same	topics.
• Don’t	send	messages	over	message.max.bytes
in	a	while	loop
• Don’t	ignore	consumer	rebalances
• But	the	biggest	lesson…
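The first lesson above follows from how keyed partitioning generally works: a key is hashed and reduced modulo the partition count, so the same key always lands on the same partition. The four-character string "null" is an ordinary key, and every message carrying it piles onto one partition. A minimal sketch, assuming CRC32 as the hash and 20 partitions; the talk does not name the library or its hash function, so the partition this prints will not necessarily be 9.

```python
import zlib
from typing import Optional

def partition_for(key: Optional[bytes], num_partitions: int) -> Optional[int]:
    """Pick a partition for a key, or None when there is no key.

    Returning None stands for "let the client spread the message" (round-robin
    or random), which is what a true null key usually gets.
    """
    if key is None:
        return None
    return zlib.crc32(key) % num_partitions

# A serializer that stringifies a missing value produces b"null", not None.
print(partition_for(b"null", 20))  # always the same partition
print(partition_for(None, 20))     # None: no key-based placement
```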
Bigger isn’t always better
• We	built	our	brokers	like	Hadoop	datanodes:
• 15	4TB	drives	(RAID	10)
• 256	GB	RAM
• 40	logical	cores
• Even	after	tweaking,	it	can	take	5-10	minutes	to	bring	
up	a	broker	that	didn’t	shut	down	cleanly.
Topic Naming is Hard
• datatype-version-product-source-datacenter?
• datatype-version-pipeline-product-source-destination?
• Record	headers	(KAFKA-4208)	may	simplify	this	for	us
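One reason naming schemes like these are hard is that the topic name has to carry several fields in a single delimited string. The sketch below builds and parses names using the first scheme on the slide; the field values are made up and the helper names are hypothetical. Note that a hyphen inside any field value would make the name ambiguous, so the builder rejects it.

```python
FIELDS = ("datatype", "version", "product", "source", "datacenter")

def build_topic(**parts: str) -> str:
    """Join the fields in a fixed order, rejecting values that contain the delimiter."""
    values = [parts[name] for name in FIELDS]
    for name, value in zip(FIELDS, values):
        if "-" in value or not value:
            raise ValueError(f"{name}={value!r} is empty or contains '-'")
    return "-".join(values)

def parse_topic(topic: str) -> dict:
    """Split a topic name back into its fields."""
    values = topic.split("-")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    return dict(zip(FIELDS, values))

# Made-up example values, not Blizzard's actual topic names.
topic = build_topic(datatype="metrics", version="v1", product="gameA",
                    source="server", datacenter="dc1")
print(topic)
print(parse_topic(topic))
```

Carrying these fields as message metadata instead of in the name would avoid the delimiter problem entirely, which is presumably part of the appeal of record headers.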
Partitioning is (also) Hard
• Minimum partitions per topic: 2
• Average partitions per topic: 20
• Largest partition count for a topic: 48 (previously 256)
• Replication factor: 3 (with min.insync.replicas of 2)
• Central	Kafka:	~13000	partitions	over	12	brokers
• Regional	Kafka:	~3000	partitions	over	6	brokers
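The cluster figures above imply the following averages. This is simple arithmetic on the slide's numbers; the slide does not say whether the partition totals include replicas, so the per-broker values are only a rough guide. The failure-tolerance line relies on standard Kafka behavior: with acks=all, writes continue as long as the number of in-sync replicas stays at or above the configured minimum (min.insync.replicas).

```python
def partitions_per_broker(total_partitions: int, brokers: int) -> float:
    """Average partitions hosted per broker, assuming an even spread."""
    return total_partitions / brokers

def tolerated_replica_losses(replication_factor: int, min_isr: int) -> int:
    """Replicas a partition can lose while still accepting acks=all writes."""
    return replication_factor - min_isr

central = partitions_per_broker(13_000, 12)  # "~13000 partitions over 12 brokers"
regional = partitions_per_broker(3_000, 6)   # "~3000 partitions over 6 brokers"

print(f"Central:  ~{central:.0f} partitions per broker")
print(f"Regional: ~{regional:.0f} partitions per broker")
print(f"Replica losses tolerated for writes: {tolerated_replica_losses(3, 2)}")
```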
Monitoring & Alerting
• LinkedIn’s	Burrow	monitors	offsets.
• Biggest	limitation:	Calculates	lag	at	request	time,	
not	commit	time	(Burrow	Issue	#127)
• End	to	end	metrics	tell	us	how	long	a	message	
took	to	process.
• Kafka	&	ZooKeeper JMX	values	are	piped	into	
our	metrics	system	with	jmxterm.
• An	isolated	version	of	our	pipeline	is	used	to	
monitor	the	health	of	the	customer	pipeline.
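Consumer lag, the quantity an offset monitor tracks, is the difference between a partition's latest offset and the consumer group's committed offset. A minimal sketch of that calculation on plain dictionaries follows; it is not Burrow's implementation, and the topic name and offsets are invented.

```python
def consumer_lag(log_end_offsets: dict, committed_offsets: dict) -> dict:
    """Per-partition lag: newest offset in the log minus the group's committed offset.

    Partitions with no committed offset are reported as lagging by the whole log
    end offset (an assumption; a real monitor might treat them differently).
    """
    return {
        tp: max(end - committed_offsets.get(tp, 0), 0)
        for tp, end in log_end_offsets.items()
    }

# Invented offsets keyed by (topic, partition).
ends = {("telemetry", 0): 1_500, ("telemetry", 1): 980, ("telemetry", 2): 40}
committed = {("telemetry", 0): 1_450, ("telemetry", 1): 980}

lag = consumer_lag(ends, committed)
print(lag)
print("total lag:", sum(lag.values()))
```

Because the two offsets are sampled at different moments, the value depends on when each was read, which is why the timing of the lag calculation matters.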
The Future Soon™
• Replacing	Flume
• Flume has served us well, but we’ve pushed it as far as we feel we can (should?)
• We’re	considering	Gobblin,	Kafka	Connect	or	a	
custom	Spark	application	to	write	data	to	HDFS
• We’ll	still	use	Flume	to	send	from	our	legacy	pipeline,	
as	those	RabbitMQ servers	will	probably	stick	around	
for	years.
Isolation and the Cloud
• Through	our	TDK	(Telemetry	Development	Kit)	users	can	
spin	up	an	isolated,	virtual	version	of	our	pipeline	to	
develop	against.
• What	if	we	deployed	the	entire	pipeline	this	way?
• Smaller,	dynamically	provisioned	Kafkas in	the	cloud	
would	serve	to	isolate	teams	from	each	other	
• Requires	far	more	automation
• Reduces	cross-product	impacts.
• Requirements:
• Service	discovery
• Automation
• Routing	service
Links
• https://github.com/Blizzard/node-rdkafka
• Patches	welcome!
• https://github.com/linkedin/Burrow
• For	monitoring	consumer	offsets
• https://github.com/jiaqi/jmxterm
• For	getting	metrics	out	of	Kafka	&	ZooKeeper
• https://github.com/Yelp/kafka-utils
• For	cloning/managing	consumer	groups/offsets
We’re Hiring!
• https://careers.blizzard.com/en-us/
• We	have	a	giant	Orc

Kafka Summit SF 2017 - How Blizzard Used Kafka to Save Our Pipeline (and Azeroth)

  • 2. The Data Gnomes:
      • Jeff Field - @jfield – Systems Engineer, Big Data
      • Dustin Koupal - @cooper6581 – Senior Systems Engineer, Big Data
  • 4. Rabbits.
      • First pipeline was built on RabbitMQ and Flume
      • Dozens of collectors fanned into 6 central Flumes
      • Everyday I’m Buffering
      • 100+ incidents were caused by overflowing queues
  • 6. Why protobuf?
      • Our developers were already using protobuf for interoperability
      • Using established tech helped all game teams implement our first pipeline in months instead of years. For Blizzard, this was unimaginably quick.
  • 7. Trends – Incoming Data
      • May 2015 – Legacy pipeline received ~3.5 billion events per day
      • Today – Our pipelines currently receive over 20 billion events per day.
      • Each product release sees more messages added, so our system has to be ready to scale.
      • Rarely do old messages go away (we’re working on that)
  • 8. Data Gnomes vs. Talon & The Legion
      In 2016, our army of Rabbits came under attack on several fronts:
      • Overwatch launched in May of 2016
      • The Legion would return to Azeroth in World of Warcraft’s latest expansion three months later
  • 9. Flafka – Flume & Kafka
      • We needed something to augment our existing pipeline
      • Many people suggested we could do Flume’s job “with 4 lines of Spark” in 2015.
      • These people may have been working for the Legion.
      • Comfort with Flume deployments made “Flafka” the natural choice
      • Retrofitted existing pipeline with no message loss
  • 11. Flafka (Continued)
      • Flafka enabled us to write to multiple HDFS clusters for the first time, simplifying cluster migrations.
      • Customized version of the Kafka Channel
      Next Bottleneck:
      • Too many open files
      • 0.9 Consumer doesn’t have a dedicated heartbeat thread
      • Creating this many files led to missed heartbeats, causing frequent rebalances.
      • 25% of Flume’s time was creating files on HDFS
  • 12. FeedSplitter
      • Removing Flume immediately was impractical
      • Flume’s only job now is to read from Kafka and write to HDFS
      • It is pretty good at this.
      • Lowered latency for HDFS and increased the reliability of the pipeline.
  • 14. Meanwhile, Back in Gnomeregan…
      • We were asked to build a new “Telemetry” pipeline to support operational data for the launch of Overwatch
      • Telemetry combines ideas from our legacy pipelines to offer the best of both:
      • Schema Registry Service
      • Lambda architecture
      • Able to collect both client and server data
      • Flume is still used to write to HDFS
  • 15. node-rdkafka
      • Initial version of the pipeline was written in node.js
      • We dedicated a developer to writing node.js bindings for librdkafka.
      • Since releasing it on GitHub in 2016, the project has seen 340+ stars and 190+ commits with contributions from IBM, Wikimedia and others in addition to ourselves
  • 16. Data Platform Remastered
      • Multiple Datastores
      • Elasticsearch for Short term (7 days), Near Real-time storage/visualization
      • Cassandra for time series (metrics and events)
      • HDFS for long term, indefinite storage.
      • More Datasources
      • REST API (metrics, events, alerts)
      • Syslog RFC-5424
      • Legacy AMQP still supported
  • 17. Lessons Learned
      • If you send the string “null” as your key to a certain Kafka library, it maps to partition 9
      • As with Hadoop, there are always new edge cases.
      • Don’t have many MirrorMakers producing into the same topics.
      • Don’t send messages over message.max.bytes in a while loop
      • Don’t ignore consumer rebalances
      • But the biggest lesson…
  • 18. Bigger isn’t always better
      • We built our brokers like Hadoop datanodes:
      • 15 4TB drives (RAID 10)
      • 256 GB RAM
      • 40 logical cores
      • Even after tweaking, it can take 5-10 minutes to bring up a broker that didn’t shut down cleanly.
  • 19. Topic Naming is Hard
      • datatype-version-product-source-datacenter?
      • datatype-version-pipeline-product-source-destination?
      • Record headers (KAFKA-4208) may simplify this for us
      Partitioning is (also) Hard
      • Minimum partitions: 2
      • Average partition size: 20
      • Largest partition size: 48 (previously 256)
      • Replication factor: 3 (with min.isr of 2)
      • Central Kafka: ~13000 partitions over 12 brokers
      • Regional Kafka: ~3000 partitions over 6 brokers
  • 20. Monitoring & Alerting
      • LinkedIn’s Burrow monitors offsets.
      • Biggest limitation: Calculates lag at request time, not commit time (Burrow Issue #127)
      • End to end metrics tell us how long a message took to process.
      • Kafka & ZooKeeper JMX values are piped into our metrics system with jmxterm.
      • An isolated version of our pipeline is used to monitor the health of the customer pipeline.
  • 21. The Future Soon™
      • Replacing Flume
      • Flume has served us well, but we’ve pushed it as far as we feel we could (should?)
      • We’re considering Gobblin, Kafka Connect or a custom Spark application to write data to HDFS
      • We’ll still use Flume to send from our legacy pipeline, as those RabbitMQ servers will probably stick around for years.
  • 22. Isolation and the Cloud
      • Through our TDK (Telemetry Development Kit) users can spin up an isolated, virtual version of our pipeline to develop against.
      • What if we deployed the entire pipeline this way?
      • Smaller, dynamically provisioned Kafkas in the cloud would serve to isolate teams from each other
      • Requires far more automation
      • Reduces cross-product impacts.
      • Requirements:
      • Service discovery
      • Automation
      • Routing service
  • 23. Links
      • https://github.com/Blizzard/node-rdkafka
      • Patches welcome!
      • https://github.com/linkedin/Burrow
      • For monitoring consumer offsets
      • https://github.com/jiaqi/jmxterm
      • For getting metrics out of Kafka & ZooKeeper
      • https://github.com/Yelp/kafka-utils
      • For cloning/managing consumer groups/offsets
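
The “null” key lesson from slide 17 follows from how keyed partitioning generally works: a producer hashes the key bytes and takes the result modulo the partition count, so a client that serializes a missing key as the literal string “null” sends every such message to one partition. The following is a minimal sketch of that effect, not the behavior of any specific library. The slides do not say which client or hash was involved, so CRC32 is an assumption chosen for illustration, and `partition_for`, `partition_counts`, `NUM_PARTITIONS`, and the sample keys are all hypothetical.

```python
import zlib
from collections import Counter

NUM_PARTITIONS = 20  # hypothetical partition count for one topic


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition by hashing its bytes (CRC32 is an assumption)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_counts(keys, num_partitions: int = NUM_PARTITIONS) -> Counter:
    """Count how many messages land on each partition."""
    return Counter(partition_for(k, num_partitions) for k in keys)


# 1000 messages whose missing key was stringified as "null",
# next to 1000 messages that carry distinct, real keys.
bad_keys = ["null"] * 1000
good_keys = [f"player-{i}" for i in range(1000)]

hot = partition_counts(bad_keys)      # every message on a single partition
spread = partition_counts(good_keys)  # messages distributed across partitions
```

Here `hot` contains a single partition holding all 1000 messages, while `spread` covers many partitions. The actual partition number depends on the hash function and the partition count, so this sketch will not necessarily reproduce partition 9.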