
Bravo Six, Going Realtime. Transitioning Activision Data Pipeline to Streaming


The Activision Data team has been running a data pipeline for a variety of Activision games for many years. Historically, we used a mix of micro-batch microservices coupled with classic Big Data tools like Hadoop and Hive for ETL. As a result, it could take 4-6 hours for data to become available to end customers.

In the last few years, the adoption of data in the organization skyrocketed. We needed to de-legacy our data pipeline and provide near-realtime access to data in order to improve reporting, gather insights faster, and power web and mobile applications. I want to tell a story about heavily leveraging Kafka Streams and Kafka Connect to reduce end-to-end latency to minutes while making the pipeline easier and cheaper to run. We validated the new data pipeline by launching two massive games just 4 weeks apart.

Published in: Data & Analytics


  1. Bravo Six, Going Realtime. Transitioning Activision Data Pipeline to Streaming © 2020 Activision Publishing, Inc.
  2. Hello! I am Yaroslav Tkachenko, Software Architect at Activision Data. You can find me at @sap1ens (pretty much everywhere).
  3. Activision Data Pipeline
      ● Ingesting, processing and storing game telemetry data
      ● Providing tabular, API and streaming access to data
  4. [Diagram] HTTP API, Schema Registry, Magic
  5. Pipeline in numbers
      ● 200k+ msg/s: ingestion rate
      ● 9 years: age of the oldest game
      ● 5+ PB: data lake size (AWS S3)
  6. Challenges
      ● Complex client-side & server-side game telemetry
      ● Long-living titles, hard to update or deprecate
      ● Various data formats, message schemas and envelopes
      ● Development data == production data
      ● Scalability, elasticity & cost
  7. Established standards
      ● Kafka topic name conventions must be followed
      ● Payload schema must be uploaded to the Schema Registry
      ● Message envelope has a schema too (Protobuf), with a set of required fields
  8. Old pipeline: quick overview
  9. [Diagram] aggregate, transform, transform, dev data, prod data
  10. [Diagram] Batch job* (MR, Hive, Spark), ETL API, transformed data, ETL’ed data, prod data
      * every X hours
  11. Old pipeline
      Architecture flaws
      ● Scalability solution as a workaround
      ● Painful to switch between dev & prod
      ● No streaming capabilities
      ● Ad hoc integration
      Bottlenecks
      ● Latency limitations
      ● MR glob length, memory is not infinite (ETL API), etc.
      ● Lots of manual configuration
      ● Lots of manual ETL
  12. New pipeline: it gets better from here
  13. Apache Kafka
      ● The Streams API allows an application to act as a stream processor, consuming an input stream from one or more topics and producing an output stream to one or more output topics, effectively transforming the input streams to output streams.
      ● The Connector API allows building and running reusable producers or consumers that connect Kafka topics to existing applications or data systems. For example, a connector to a relational database might capture every change to a table.
  14. Results
      ● ~10 seconds: end-to-end streaming latency
      ● 90% cheaper: per user/byte
      ● 6-24 hours → 5-10 mins: tabular data available for querying
  15. Guiding principles
      Kafka Streams
      ● One transformation step = one service*
        ○ Not entirely true anymore, we’ve combined some steps to optimize cost and reduce unnecessary IO
      ● Stateless if possible
      ● Rich routing
      ● Auto-scaling & self-healing
      ● LOTS of tooling
      Kafka Connect
      ● Handle integration: AWS S3, Cassandra, Elasticsearch, etc.
      ● Only sink connectors
      ● Invest in configuration, deployments, monitoring
  16. [Diagram] transform, transform, Connect
  17. Why Kafka Streams?
      ● Simple Java library
      ● Industry standard features
      ● Separation of concerns that makes sense
      ● Kafka first
  18. Our internal protocol
      ● Kafka message value: serialized Avro
      ● Kafka message key: null (99%)
      ● Kafka message headers: schema guid; other metadata, mostly for routing
  19. Schema management
      ● Schemas are generated & uploaded automatically if needed. Schema hash is used as id
      ● Make schemas immutable and cache them aggressively. You have to use them for every single record!
      [Diagram] Schema Registry API, Distributed Cache, In-memory Cache
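The schema-management slide above (hash as id, immutable schemas, aggressive caching) could be sketched roughly as follows. This is a minimal illustration, not the team's implementation: the `SchemaCache` class, the choice of SHA-256, and the `remoteLookup` function (standing in for the distributed cache and Schema Registry API) are all assumptions, since the slides do not name them.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical sketch: class and method names are illustrative, not from the talk.
final class SchemaCache {
    // In-memory layer: schemas are immutable, so entries never need invalidation.
    private final Map<String, String> local = new ConcurrentHashMap<>();
    // Stand-in for the slower layers (distributed cache, Schema Registry API).
    private final Function<String, String> remoteLookup;

    SchemaCache(Function<String, String> remoteLookup) {
        this.remoteLookup = remoteLookup;
    }

    // Derives a stable id from the schema text. SHA-256 is an assumption;
    // the slides only say that a schema hash is used as the id.
    static String schemaId(String schemaJson) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(schemaJson.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    // Returns the schema for an id, calling the remote layer only on a local miss.
    String get(String id) {
        return local.computeIfAbsent(id, remoteLookup);
    }

    public static void main(String[] args) {
        String schema = "{\"type\":\"string\"}";
        String id = schemaId(schema);
        SchemaCache cache = new SchemaCache(key -> schema);
        System.out.println(id + " -> " + cache.get(id));
    }
}
```

Because the id is derived from the schema content, the same schema always maps to the same id, which is what makes caching without invalidation safe in this sketch.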
  20. Typical Kafka Streams service topology
      [Diagram] consume, process, enrich, produce, DLQ
  21. [Code]
      KStream[] streams = builder
          .stream(Pattern.compile(applicationConfig.getTopics()))
          .transform(MetadataEnricher::new)
          .transform(() -> new InputMetricsHandler(applicationMetrics))
          .transform(ResultExtractor::new)
          .transform(() -> new OutputMetricsHandler(applicationMetrics))
          .branch(
              (key, value) -> value instanceof RecordSucceeded,
              (key, value) -> value instanceof RecordFailed,
              (key, value) -> value instanceof RecordSkipped
          );

      // RecordSucceeded
      streams[0].map((key, value) -> KeyValue.pair(key, ((RecordSucceeded) value).getGenericRecord()))
          .transform(SchemaGuidEnricher<String, GenericRecord>::new)
          .to(new SinkTopicNameExtractor());

      // RecordFailed
      streams[1].process(dlqFailureResultHandlerSupplier);
  22. Routing & configuration
      Before: <env>.<producer>.<title>.<category>-<protocol>
      e.g. prod.service-a.1234.match_summary-v1
      “raw” data, no transformations
  23. Routing & configuration
      Now: <env>.rdp.<game>.<stage1> ↓ (microservice) <env>.rdp.<game>.<stage2> ↓ (microservice) <env>.rdp.<game>.<stageN>
  24. Routing & configuration
      prod.rdp.mw.ingested ↓ (microservice) prod.rdp.mw.parsed
      prodMwServiceA:
        stream:
          headers:
            env: prod
            game: mw
            source: service-a
          exclude: <thingX>
        action:
          type: parse
          protocol: proto2
  25. Routing & configuration (same example as the previous slide)
      Streams can be skipped, split, merged, sampled, etc.
  26. Dynamic Routing*
      ● Centralized, declarative configuration
      ● Self-serve APIs and UIs
      ● Every change is automatically applied to all running services within seconds
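The header-based routing shown in the `prodMwServiceA` example above might be modeled along these lines. This is a sketch under stated assumptions: `RoutingRule` and its methods are hypothetical names, the rule simply requires every configured header to match exactly, and the `exclude` and `action` parts of the config are left out because the slides do not describe their semantics.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of a declarative routing rule; names are illustrative.
final class RoutingRule {
    private final Map<String, String> requiredHeaders;
    private final String targetStage;

    RoutingRule(Map<String, String> requiredHeaders, String targetStage) {
        this.requiredHeaders = new LinkedHashMap<>(requiredHeaders);
        this.targetStage = targetStage;
    }

    // A message matches when every configured header is present with the same value.
    boolean matches(Map<String, String> messageHeaders) {
        for (Map.Entry<String, String> required : requiredHeaders.entrySet()) {
            if (!required.getValue().equals(messageHeaders.get(required.getKey()))) {
                return false;
            }
        }
        return true;
    }

    // Builds the output topic following the <env>.rdp.<game>.<stage> convention.
    String targetTopic(Map<String, String> messageHeaders) {
        return messageHeaders.get("env") + ".rdp." + messageHeaders.get("game") + "." + targetStage;
    }

    public static void main(String[] args) {
        RoutingRule rule = new RoutingRule(
            Map.of("env", "prod", "game", "mw", "source", "service-a"), "parsed");
        Map<String, String> headers = Map.of("env", "prod", "game", "mw", "source", "service-a");
        if (rule.matches(headers)) {
            System.out.println(rule.targetTopic(headers));
        }
    }
}
```

Keeping rules as plain data like this is what would let a central configuration service push changes to running services, as the Dynamic Routing slide describes.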
  27. Infra & Tools
      ● One-click Kafka deployment (Jenkins, Ansible)
      ● Kafka broker EBS auto-scaling
      ● Versioned & deployable Kafka topic configuration
      ● Built tooling for:
        ○ Data reprocessing and DLQ resubmission
        ○ Offset migration between consumer groups
        ○ Message inspection
        ○ ...
  28. Auto-scaling & self-healing
      Scaling
      ● Every application submits <app_name>.lag metric in milliseconds
      ● ECS Step Scaling: add/remove X more instances every Y minutes
      ● Add an extra policy for rapid scaling
      Healing
      ● Heartbeat endpoint monitors streams.state() result
      ● ECS healthcheck replaces unhealthy instances
      ● Stateful applications need more time to bootstrap
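The lag-driven step scaling described above can be illustrated with a small decision function. This is only a sketch of the idea, not the ECS API: the `LagStepScaling` class and every threshold and step size below are made-up values, since the slides leave X and Y unspecified.

```java
// Hypothetical sketch of lag-based step scaling; all thresholds are assumptions.
final class LagStepScaling {
    // Maps the reported lag (milliseconds) to an instance-count adjustment.
    static int adjustment(long lagMillis) {
        if (lagMillis >= 600_000) {
            return 4;   // "rapid scaling" policy for very high lag
        }
        if (lagMillis >= 60_000) {
            return 1;   // regular scale-out step
        }
        if (lagMillis < 5_000) {
            return -1;  // scale in when the application keeps up easily
        }
        return 0;       // lag is acceptable, keep the current size
    }

    // Applies the adjustment while staying within the configured bounds.
    static int desiredCount(int currentCount, long lagMillis, int minCount, int maxCount) {
        int proposed = currentCount + adjustment(lagMillis);
        return Math.max(minCount, Math.min(maxCount, proposed));
    }

    public static void main(String[] args) {
        System.out.println(desiredCount(3, 700_000, 1, 10));
    }
}
```

In a real deployment the thresholds would live in the scaling policy itself, and the application would only publish the lag metric.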
  29. Why Kafka Connect?
      ● Powerful framework
      ● Built-in connectors
      ● Separation of concerns that makes sense
      ● Kafka first
  30. Kafka Connect
      ● Multiple smaller clusters > one big cluster
      ● Connector configuration lives in git and uses Jsonnet. The deployment script leverages the REST API
      ● Custom Converter, thanks to KIP-440
      ● ❤ lensesio/kafka-connect-ui
      ● Collecting & using tons of metrics available over JMX
  31. C* Connector
      ● Implemented from scratch, inspired by JDBC connector
      ● Started with porting over existing C* integration code
      ● Took us a few days (!) to wrap it up
      ● Generalizing is hard
      ● Very performant, usually just a few tasks are running
  32. ES Connector
      ● Using open-source kafka-connect-elasticsearch
      ● Leveraging SMTs to:
        ○ Partition single topic into multiple indexes
        ○ Enrich with a timestamp
      ● Currently very low-volume
  33. S3 Connector
      ● Started with forking open-source kafka-connect-s3
      ● Added custom Avro and Parquet formats
      ● Added a new flexible partitioner
      ● Optimized connector for at-least-once delivery
        ○ Generate fewer files on S3, reduce TPS
        ○ Avoid file overrides with non-deterministic upload triggers
      ● Running hundreds of tasks
  34. Dev data is prod data
      ● Scale is different, but the pipeline is the same
      ● Running as a separate set of services to reduce latency; low latency is a requirement
      ● Different approach to alerting
      Otherwise, it’s the same!
  35. Use Case: RADS. Flatten my data!
  36. Input message:
      {
        "headers": { "field1": "value1" },
        "data": {
          "match": { "field2": "value2" },
          "players": [
            {"field3": "value3", "field4": "value4"},
            {"field3": "value3", "field4": "value4"}
          ]
        }
      }
      Table fact_data: message_id | context_headers_field1_s | data_match_field2_s | ...
      Table fact_data_players: message_id | index | context_headers_field1_s | data_players_field3_i | ...
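The flattening step illustrated above (one parent row per message, one child row per array element) could look roughly like this. It is a simplified sketch with hypothetical names: `Flattener`, the `_s`/`_i` suffix rule, and the `fact_` table naming are inferred from the column names on the slide. It does not reproduce the `context_` prefix or copy header columns into child rows, because the slides do not spell out those rules.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of nested-record flattening; not the RADS implementation.
final class Flattener {
    // Returns table name -> rows. The root table name "fact_data" is taken from the slide.
    static Map<String, List<Map<String, Object>>> flatten(String messageId, Map<String, Object> record) {
        Map<String, List<Map<String, Object>>> tables = new LinkedHashMap<>();
        Map<String, Object> root = newRow(messageId);
        tables.computeIfAbsent("fact_data", t -> new ArrayList<>()).add(root);
        walk(messageId, "", record, root, tables);
        return tables;
    }

    private static Map<String, Object> newRow(String messageId) {
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("message_id", messageId);
        return row;
    }

    @SuppressWarnings("unchecked")
    private static void walk(String messageId, String prefix, Map<String, Object> node,
                             Map<String, Object> row, Map<String, List<Map<String, Object>>> tables) {
        for (Map.Entry<String, Object> entry : node.entrySet()) {
            String path = prefix.isEmpty() ? entry.getKey() : prefix + "_" + entry.getKey();
            Object value = entry.getValue();
            if (value instanceof Map) {
                // Nested objects stay in the same row; their keys extend the column path.
                walk(messageId, path, (Map<String, Object>) value, row, tables);
            } else if (value instanceof List) {
                // Each array element becomes its own row in a child table (the 1:M case).
                List<Object> items = (List<Object>) value;
                for (int i = 0; i < items.size(); i++) {
                    Map<String, Object> child = newRow(messageId);
                    child.put("index", i);
                    tables.computeIfAbsent("fact_" + path, t -> new ArrayList<>()).add(child);
                    Object item = items.get(i);
                    if (item instanceof Map) {
                        walk(messageId, path, (Map<String, Object>) item, child, tables);
                    } else {
                        child.put(path + suffix(item), item);
                    }
                }
            } else {
                row.put(path + suffix(value), value);
            }
        }
    }

    // Assumed convention: "_i" for integers, "_s" for everything else.
    private static String suffix(Object value) {
        return (value instanceof Integer || value instanceof Long) ? "_i" : "_s";
    }

    public static void main(String[] args) {
        Map<String, Object> record = Map.of(
            "headers", Map.of("field1", "value1"),
            "data", Map.of(
                "match", Map.of("field2", "value2"),
                "players", List.of(Map.of("field3", "value3"), Map.of("field3", "value3"))));
        System.out.println(flatten("msg-1", record));
    }
}
```

One limitation of this sketch: arrays nested inside array elements get their own child table, but rows there carry only `message_id` and their own `index`, not the parent element's index.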
  37. RADS
      [Diagram] ingest, transform, flatten, table-generator, consolidator, S3 connector (Avro), S3 connector (Avro, Parquet), DDL, Schema Registry API, Project API, Metastore DB; step cardinalities 1:1, 1:1, 1:M
  38. Why is RADS rad?
      ● Has enough automation and generic configuration to automatically create Hive databases, tables, add new columns and partitions for a brand new game with no* human intervention.
      ● As a data producer you just need to start sending data in the right format to the right Kafka topic, that’s it!
      ● We get realtime (“hot”) and historical (“cold”) data in the same place!
  39. Thanks! Any questions? @sap1ens
