
Designing Scalable and Extendable Data Pipeline for Call Of Duty Games



Published in: Software


  1. Title slide: Designing Scalable and Extendable Data Pipeline for Call of Duty Games. Yaroslav Tkachenko, Senior Data Engineer at Activision.
  2. Data lake size (AWS S3): 1+ PB
  3. Number of topics in the biggest cluster (Apache Kafka): 600+
  4. Messages per second (Apache Kafka): 10k+
  5. Scaling the data pipeline even further. Complexity of each dimension: Volume (well-known industry techniques), Games (using previous experience), Use-cases (completely unpredictable).
  6. Kafka topics are partitioned and replicated. [Diagram: a Kafka topic split into Partition 1, 2 and 3, each an ordered, numbered sequence of offsets, read or written by a consumer or producer.]
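Because a topic is split into partitions, each keyed message has to be mapped deterministically to exactly one partition. A minimal sketch of that idea, assuming a simplified hash-mod scheme (Kafka's actual default partitioner hashes the key with murmur2; the md5-based hash here is only illustrative):

```python
import hashlib

def assign_partition(key: bytes, num_partitions: int) -> int:
    """Map a message key to one of the topic's partitions deterministically.

    Simplified stand-in for Kafka's default partitioner: hash the key,
    then take the result modulo the partition count.
    """
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# The same key always lands in the same partition, which is what
# preserves per-key ordering within a partitioned topic.
p1 = assign_partition(b"player-42", 3)
p2 = assign_partition(b"player-42", 3)
```

Note that this scheme ties a key's placement to the partition count, which is one reason repartitioning an existing topic is disruptive.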
  7. We need to keep the number of topics and partitions low. More topics mean more operational burden, the number of partitions in a fixed-size cluster is not infinite, and autoscaling Kafka is impossible, so scaling is hard.
  8. Topic naming convention: $env.$source.$title.$category-$version, e.g. prod.glutton.1234.telemetry_match_event-v1 ("glutton" is the producer, 1234 is the unique game id for "CoD WW2 on PSN").
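The convention above can be validated and decomposed mechanically. A small sketch, assuming plausible character classes for each field (the talk does not specify them, so the regex is an assumption):

```python
import re

# Hypothetical grammar for $env.$source.$title.$category-$version;
# the exact allowed characters per field are assumed, not from the talk.
TOPIC_RE = re.compile(
    r"^(?P<env>[a-z]+)\.(?P<source>[a-z_]+)\.(?P<title>\d+)\."
    r"(?P<category>[a-z_]+)-v(?P<version>\d+)$"
)

def parse_topic(name: str) -> dict:
    """Split a topic name into its metadata fields, or fail loudly."""
    m = TOPIC_RE.match(name)
    if m is None:
        raise ValueError(f"topic name does not follow the convention: {name}")
    return m.groupdict()

parts = parse_topic("prod.glutton.1234.telemetry_match_event-v1")
```

A parser like this is useful for monitoring and tooling, but as the next slide notes, naming alone cannot actually enforce these constraints at produce time.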
  9. A proper solution was invented decades ago: think about databases.
  10. A messaging system IS a form of a database: data topic = database + table = namespace + data type.
  11. Compare this: telemetry.matches, user.logins, marketplace.purchases versus prod.glutton.1234.telemetry_match_event-v1, dev.user_login_records.4321.all-v1, prod.marketplace.5678.purchase_event-v1.
  12. Each approach has pros and cons:
     • Topics that use metadata in their names are obviously easier to track and monitor (and even consume).
     • As a consumer, I can consume exactly what I want, instead of consuming a single large topic and extracting the required values.
     • But these dynamic fields can and will change; producers (sources) and consumers will change.
     • Short, generic names give very efficient utilization of topics and partitions.
     • Finally, it is impossible to enforce any constraints with a topic name, and you can always end up with dev data in a prod topic and vice versa.
  13. After removing that metadata from the topic names, stream processing becomes mandatory.
  14. Stream processing becomes mandatory: Measuring → Validating → Enriching → Filtering & routing.
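The four stages above form a simple chain where any stage may drop a message. A minimal sketch of that shape, with hypothetical field names ("env", "type", "payload") standing in for the real message layout:

```python
from typing import Callable, Optional

Message = dict
Stage = Callable[[Message], Optional[Message]]  # returning None drops the message

def measure(msg: Message) -> Message:
    # Measuring: record that the message was seen (a real system would emit metrics).
    msg.setdefault("_metrics", {})["seen"] = True
    return msg

def validate(msg: Message) -> Optional[Message]:
    # Validating: drop messages missing the fields the pipeline requires.
    return msg if "env" in msg and "payload" in msg else None

def enrich(msg: Message) -> Message:
    # Enriching: attach metadata downstream consumers need.
    msg["ingested_at"] = 1234567890  # placeholder; normally a real timestamp
    return msg

def route(msg: Message) -> Message:
    # Filtering & routing: derive the output topic from the data type, not the producer.
    msg["output_topic"] = f"{msg['env']}.{msg.get('type', 'unknown')}"
    return msg

def process(msg: Message, stages=(measure, validate, enrich, route)) -> Optional[Message]:
    for stage in stages:
        msg = stage(msg)
        if msg is None:
            return None  # filtered out by some stage
    return msg

out = process({"env": "prod", "type": "telemetry.matches", "payload": b"..."})
```

The chain-of-stages structure is the point here: each concern stays isolated, and new routing rules or validations slot in without touching the others.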
  15. Refinery
  16. Having a single message schema for a topic is more than just a nice-to-have.
  17. Number of supported message formats: 8
  18. Stream processor input formats: JSON, Protobuf, Custom, Avro, ?, ?, ?, ?
  19. Custom deserialization:

     props.put("value.deserializer", "com.example.CustomDeserializer");

     public class CustomDeserializer implements Deserializer<???> {
         @Override
         public ??? deserialize(String topic, byte[] data) { ??? }
     }
  20. Message envelope anatomy: Message = Header / Metadata (ID, env, timestamp, source, game, ...) + Body / Payload (the event itself).
  21. Unified message envelope:

     syntax = "proto2";

     message MessageEnvelope {
         optional bytes message_id = 1;
         optional uint64 created_at = 2;
         optional uint64 ingested_at = 3;
         optional string source = 4;
         optional uint64 title_id = 5;
         optional string env = 6;
         optional UserInfo resource_owner = 7;
         optional SchemaInfo schema_info = 8;
         optional string message_name = 9;
         optional bytes message = 100;
     }
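The key property of the envelope is that routing reads only the header fields while the payload stays opaque bytes. A plain-Python stand-in for the MessageEnvelope above makes that explicit (a real pipeline would use the generated protobuf classes; the subset of fields here mirrors the slide):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MessageEnvelope:
    # Header / metadata: everything the "operational" stages need to route.
    message_id: bytes
    created_at: int
    source: str
    title_id: int
    env: str
    message_name: str
    # Body / payload: opaque bytes whose format each game chooses per its
    # registered schema; the router never deserializes it.
    message: bytes
    ingested_at: Optional[int] = None  # stamped by the pipeline, not the producer

def open_envelope(envelope: MessageEnvelope) -> bytes:
    """Hand the still-serialized payload to a consumer that knows its schema."""
    return envelope.message

e = MessageEnvelope(
    message_id=b"\x01",
    created_at=1234567890,
    source="glutton",
    title_id=1234,
    env="prod",
    message_name="telemetry_match_event",
    message=b"payload",
)
```

Keeping the payload opaque is what lets eight different message formats coexist behind one envelope: only the final consumer pays the cost of knowing the format.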
  22. Schema Registry:
     • API to manage message schemas.
     • Single source of truth for all producers and consumers.
     • It should be impossible to send a message to the pipeline without registering its schema in the Schema Registry!
     • A good Schema Registry supports immutability, versioning and basic validation.
     • Activision uses a custom Schema Registry implemented with Python and Cassandra.
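The immutability and versioning contract from the slide can be sketched in a few lines. This is a hypothetical in-memory toy, not Activision's Python/Cassandra service; the method names are assumptions:

```python
class SchemaRegistry:
    """Toy registry: schemas are write-once per version, never mutated."""

    def __init__(self) -> None:
        self._schemas: dict[tuple[str, int], str] = {}  # (name, version) -> schema

    def register(self, name: str, version: int, schema: str) -> None:
        key = (name, version)
        if key in self._schemas:
            # Immutability: an existing version can never be overwritten.
            raise ValueError(f"{name} v{version} is immutable; bump the version instead")
        self._schemas[key] = schema

    def get(self, name: str, version: int) -> str:
        return self._schemas[(name, version)]

    def latest_version(self, name: str) -> int:
        return max(v for (n, v) in self._schemas if n == name)

reg = SchemaRegistry()
reg.register("telemetry_match_event", 1, '{"type": "record"}')
```

Write-once versions are what make the registry a safe single source of truth: a consumer that cached "telemetry_match_event v1" can trust it will never silently change underneath it.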
  23. Summary: scaling and extending the data pipeline.
     Games:
     • With a unified message envelope and unified topic names, adding a new game becomes almost effortless.
     • "Operational" stream processing makes it possible.
     • Still flexible enough: each game can use its own message payload format via the Schema Registry.
     Use-cases:
     • Topic names express data types, not producers or consumers.
     • Stream filtering & routing allows low-cost experiments.
     • A data catalog built on top of the Schema Registry promotes data discovery.
  24. Thanks! @sap1ens