Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
Building Scalable and
Extendable Data Pipeline
for Call of Duty Games:
Lessons Learned
Yaroslav Tkachenko
Senior Data Engi...
1+
PB
Data lake size
(AWS S3)
Number of topics in the
biggest cluster
(Apache Kafka) 500+
10k - 100k+
Messages per second
(Apache Kafka)
Scaling the data pipeline even further
Volume
Industry best practices
Games
Using previous experience
Use-cases
Completely...
0 1 2 3 4 5 6 7 8
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
Kafka topic
Consumer
or
Producer
Partition 1
Partition 2
Partiti...
Scaling the pipeline
in terms of Volume
Producers Consumers
Scaling producers
• Asynchronous / non-blocking writes (default)
• Compression and batching
• Sampling
• Throttling
• Acks...
Proxy
Each approach has pros and cons
• Simple
• Low-latency connection
• Number of TCP connections per
broker starts to look sc...
Simple rule for
high-performant
producers? Just write
to Kafka, nothing
else1
.
1. Not even auth?
Scaling Kafka clusters
• Just add more nodes!
• Disk IO is extremely important
• Tuning io.threads and network.threads
• R...
It’s not always about
tuning. Sometimes we need
more than one cluster.
Different workloads require
different topologies.
● Ingestion (HTTP Proxy)
● Long retention
● High SLA
● Lots of consumers
● Medium retention
● ACL
● Stream processing
● Sh...
Scaling consumers is
usually pretty trivial -
just increase the number of
partitions.
Unless… you can’t. What
then?
Metadata Message Queue
Archiver
Block
Storage
Work Queue
Populator
MetadataMicrobatch
Even if you can add more
partitions
• Still can have bottlenecks within a partition (large messages)
• In case of reproces...
You can’t be sure about
any improvements without
load testing.
Not only for a cluster,
but producers and
consumers too.
Scaling and extending
the pipeline in terms of
Games and Use-cases
We need to keep the number
of topics and partitions low
• More topics means more operational burden
• Number of partitions...
Topic naming convention
$env.$source.$title.$category-$version
prod.glutton.1234.telemetry_match_event-v1
Unique game id
“...
A proper solution has
been invented decades
ago.
Think about databases.
Messaging system IS a
form of a database
Data topic = Database + Table.
Data topic = Namespace + Data type.
telemetry.matches
user.logins
marketplace.purchases
prod.glutton.1234.telemetry_match_event-v1
dev.user_login_records.4321...
Each approach has pros and cons
• Topics that use metadata for their
names are obviously easier to track
and monitor (and ...
After removing
necessary metadata
from the topic names
stream processing
becomes mandatory.
Stream processing becomes mandatory
Measuring → Validating → Enriching → Filtering & routing
Having a single
message schema for a
topic is more than
just a nice-to-have.
Number of supported
message formats 8
Stream processor
JSON Protobuf
Custom Avro
? ?
? ?
// Application.java
props.put("value.deserializer", "com.example.CustomDeserializer");
// CustomDeserializer.java
public c...
Message envelope anatomy
ID, env, timestamp, source, game, ...
Event
Header / Metadata
Body / Payload
Message
Unified message envelope
syntax = "proto2";
message MessageEnvelope {
optional bytes message_id = 1;
optional uint64 creat...
Schema Registry
• API to manage message schemas
• Single source of truth for all producers and consumers
• It should be im...
Summary
• Kafka tuning and best practices matter
• Invest in good SDKs for producing and consuming data
• Unified message ...
Thanks!
@sap1ens
Building Scalable and Extendable Data Pipeline for Call of Duty Games: Lessons Learned
Building Scalable and Extendable Data Pipeline for Call of Duty Games: Lessons Learned
Building Scalable and Extendable Data Pipeline for Call of Duty Games: Lessons Learned
Building Scalable and Extendable Data Pipeline for Call of Duty Games: Lessons Learned
Building Scalable and Extendable Data Pipeline for Call of Duty Games: Lessons Learned
Building Scalable and Extendable Data Pipeline for Call of Duty Games: Lessons Learned
Building Scalable and Extendable Data Pipeline for Call of Duty Games: Lessons Learned
Upcoming SlideShare
Loading in …5
×

Building Scalable and Extendable Data Pipeline for Call of Duty Games: Lessons Learned

1,194 views

Published on

What can be easier than building a data pipeline nowadays? You add a few Apache Kafka clusters, some way to ingest data (probably over HTTP), design a way to route your data streams, add a few stream processors and consumers, integrate with a data warehouse... wait, it does start to look like A LOT of things, doesn't it? And you probably want to make it highly scalable and available in the end, correct?
We've been developing a data pipeline in Demonware/Activision for a while. We learned how to scale it not only in terms of messages/sec it can handle, but also in terms of supporting more games and more use-cases.

In this presentation you'll hear about the lessons we learned, including (but not limited to):
- Message schemas
- Apache Kafka organization and tuning
- Topics naming conventions, structure and routing
- Reliable and scalable producers and ingestion layer
- Stream processing

Published in: Software

Building Scalable and Extendable Data Pipeline for Call of Duty Games: Lessons Learned

  1. 1. Building Scalable and Extendable Data Pipeline for Call of Duty Games: Lessons Learned Yaroslav Tkachenko Senior Data Engineer at Activision
  2. 2. 1+ PB Data lake size (AWS S3)
  3. 3. Number of topics in the biggest cluster (Apache Kafka) 500+
  4. 4. 10k - 100k+ Messages per second (Apache Kafka)
  5. 5. Scaling the data pipeline even further Volume Industry best practices Games Using previous experience Use-cases Completely unpredictable Complexity
  6. 6. 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 Kafka topic Consumer or Producer Partition 1 Partition 2 Partition 3 Kafka topics are partitioned and replicated
  7. 7. Scaling the pipeline in terms of Volume
  8. 8. Producers Consumers
  9. 9. Scaling producers • Asynchronous / non-blocking writes (default) • Compression and batching • Sampling • Throttling • Acks? 0, 1, -1 • Standard Kafka producer tuning: batch.size, linger.ms, buffer.memory, etc.
  10. 10. Proxy
  11. 11. Each approach has pros and cons • Simple • Low-latency connection • Number of TCP connections per broker starts to look scary • Really hard to do maintenance on Kafka clusters • Flexible • Possible to do basic enrichment • Easier to manage Kafka clusters
  12. 12. Simple rule for high-performant producers? Just write to Kafka, nothing else1 . 1. Not even auth?
  13. 13. Scaling Kafka clusters • Just add more nodes! • Disk IO is extremely important • Tuning io.threads and network.threads • Retention • For more: “Optimizing Your Apache Kafka Deployment” whitepaper from Confluent
  14. 14. It’s not always about tuning. Sometimes we need more than one cluster. Different workloads require different topologies.
  15. 15. ● Ingestion (HTTP Proxy) ● Long retention ● High SLA ● Lots of consumers ● Medium retention ● ACL ● Stream processing ● Short retention ● More partitions
  16. 16. Scaling consumers is usually pretty trivial - just increase the number of partitions. Unless… you can’t. What then?
  17. 17. Metadata Message Queue Archiver Block Storage Work Queue Populator MetadataMicrobatch
  18. 18. Even if you can add more partitions • Still can have bottlenecks within a partition (large messages) • In case of reprocessing, it’s really hard to quickly add A LOT of new partitions AND remove them after • Also, number of partitions is not infinite
  19. 19. You can’t be sure about any improvements without load testing. Not only for a cluster, but producers and consumers too.
  20. 20. Scaling and extending the pipeline in terms of Games and Use-cases
  21. 21. We need to keep the number of topics and partitions low • More topics means more operational burden • Number of partitions in a fixed cluster is not infinite • Autoscaling Kafka is impossible, scaling is hard
  22. 22. Topic naming convention $env.$source.$title.$category-$version prod.glutton.1234.telemetry_match_event-v1 Unique game id “CoD WW2 on PSN”Producer
  23. 23. A proper solution has been invented decades ago. Think about databases.
  24. 24. Messaging system IS a form of a database Data topic = Database + Table. Data topic = Namespace + Data type.
  25. 25. telemetry.matches user.logins marketplace.purchases prod.glutton.1234.telemetry_match_event-v1 dev.user_login_records.4321.all-v1 prod.marketplace.5678.purchase_event-v1 Compare this
  26. 26. Each approach has pros and cons • Topics that use metadata for their names are obviously easier to track and monitor (and even consume). • As a consumer, I can consume exactly what I want, instead of consuming a single large topic and extracting required values. • These dynamic fields can and will change. Producers (sources) and consumers will change. • Very efficient utilization of topics and partitions. • Finally, it’s impossible to enforce any constraints with a topic name. And you can always end up with dev data in prod topic and vice versa.
  27. 27. After removing necessary metadata from the topic names stream processing becomes mandatory.
  28. 28. Stream processing becomes mandatory Measuring → Validating → Enriching → Filtering & routing
  29. 29. Having a single message schema for a topic is more than just a nice-to-have.
  30. 30. Number of supported message formats 8
  31. 31. Stream processor JSON Protobuf Custom Avro ? ? ? ?
  32. 32. // Application.java props.put("value.deserializer", "com.example.CustomDeserializer"); // CustomDeserializer.java public class CustomDeserializer implements Deserializer<???> { @Override public ??? deserialize(String topic, byte[] data) { ??? } } Custom deserialization
  33. 33. Message envelope anatomy ID, env, timestamp, source, game, ... Event Header / Metadata Body / Payload Message
  34. 34. Unified message envelope syntax = "proto2"; message MessageEnvelope { optional bytes message_id = 1; optional uint64 created_at = 2; optional uint64 ingested_at = 3; optional string source = 4; optional uint64 title_id = 5; optional string env = 6; optional UserInfo resource_owner = 7; optional SchemaInfo schema_info = 8; optional string message_name = 9; optional bytes message = 100; }
  35. 35. Schema Registry • API to manage message schemas • Single source of truth for all producers and consumers • It should be impossible to send a message to the pipeline without registering its schema in the Schema Registry! • Good Schema Registry supports immutability, versioning and basic validation • Activision uses custom Schema Registry implemented with Python and Cassandra
  36. 36. Summary • Kafka tuning and best practices matter • Invest in good SDKs for producing and consuming data • Unified message envelope and topic names make adding a new game almost effortless • “Operational” stream processing makes it possible. Make sure you can support adhoc filtering and routing of data • Topic names should express data types, not producer or consumer metadata • Schema Registry is a must-have
  37. 37. Thanks! @sap1ens

×