This document describes the evolution of RTB House's real-time data processing architecture over five iterations. The first iteration used mutable data structures and single-datacenter processing. The second introduced Apache Kafka and data flow processing. The third moved to immutable event streams. The fourth added a multi-datacenter architecture with merged event streams. The current fifth iteration uses custom Kafka workers for higher distribution and tighter offset control. The changes have improved stability, scalability, data quality, and multi-region support for RTB House's growing real-time bidding platform.
Real-Time Data Processing Evolution at RTB House
1. REAL-TIME DATA PROCESSING AT RTB HOUSE
BIG DATA TECHNOLOGY MOSCOW 2018
OCTOBER 10-11, 2018
ARCHITECTURE & LESSONS LEARNED
BARTOSZ ŁOŚ
REAL-TIME DATA PROCESSING AT RTB HOUSE
2. TABLE OF CONTENTS
Agenda:
- our rtb platform
- the first iteration: mutable structures
- the second iteration: data-flow
- the third iteration: immutable streams of events
- the fourth iteration: multi-dc architecture
- the current iteration: kafka workers
- summary
4. OUR RTB PLATFORM: THE CONTEXT
Bid requests:
2M/s (peak)
~30 SSP networks
response within 50-100 ms
User events:
1.5B tags/day
350M impressions/day
3.5M clicks/day
1.5M conversions/day
Other events:
bidlogs, accesslogs,
domain events etc.
5. OUR RTB PLATFORM: DATA PROCESSING NUMBERS
Kafka:
- up to 250K+ messages per second
- 50TB+ processed data every day
- 6 clusters in 4 datacenters
- 26 Kafka brokers
- 85 topics, 5000+ partitions
Docker (processing components only):
- 44 engines
- 1408 CPU cores, 5.5TB RAM
- 800+ containers
HDFS:
- 2PB+ data, up to 10GB/s
BigQuery:
- 1PB+ data, up to 10GB/min
Elasticsearch:
- 40TB data, up to 50K events/s
Aerospike (processing only):
- 80TB data, up to 8K events/s
22. THE 4TH ITERATION: NEW REQUIREMENTS
Main changes:
- 5-6x larger scale:
> from 350K to 2M bid requests/s within 1.5 years
- full multi-dc architecture:
> merging streams of events
> synchronization of user profiles
- end-to-end exactly-once processing:
> at-least-once output semantics + deduplication
- several improved components:
> merger
> new stats-counter, new data-flow
> dispatcher & loader
> logstash
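The "at-least-once output semantics + deduplication" approach above can be sketched as follows. This is a minimal illustration of the principle, not RTB House's implementation; the class and field names (`EventDeduplicator`, `id`) are hypothetical.

```python
# Sketch: at-least-once delivery may replay an event after a failure;
# remembering processed event IDs and skipping duplicates makes the
# *effect* of processing exactly-once.

class EventDeduplicator:
    """Applies each event's side effect at most once, keyed by event ID."""

    def __init__(self):
        self._seen = set()   # IDs of already-processed events
        self.results = []    # stands in for real output side effects

    def process(self, event):
        event_id = event["id"]
        if event_id in self._seen:
            return False     # redelivered duplicate: skip side effects
        self._seen.add(event_id)
        self.results.append(event["value"])
        return True

dedup = EventDeduplicator()
# The stream redelivers event id=1, as at-least-once semantics permit:
for ev in [{"id": 1, "value": 10}, {"id": 2, "value": 20},
           {"id": 1, "value": 10}]:
    dedup.process(ev)

assert dedup.results == [10, 20]  # each event applied exactly once
```

In a real system the seen-ID set must be bounded, e.g. windowed by time or kept in a compacted store, since it cannot grow forever.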
24. THE 4TH ITERATION: NEW DATA-FLOW ON KAFKA STREAMS
(picture from kafka.apache.org)
Why Kafka Streams:
- fully embedded library with no stream processing cluster
- no external dependencies
- Kafka's parallelism model and group membership mechanism
- event-at-a-time processing (not microbatch)
- exactly-once processing semantics (but at-least-once was good enough)
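The event-at-a-time vs. microbatch distinction can be shown with a toy contrast. This is not Kafka Streams code, just an illustration of why per-event processing gives per-event latency; both functions are hypothetical.

```python
# Event-at-a-time: each record is handled the moment it arrives, so
# latency is per event. Microbatch: every record waits until its batch
# closes before any handling happens, adding up to a full batch of delay.

def event_at_a_time(stream, handle):
    for event in stream:
        handle(event)          # processed immediately on arrival

def microbatch(stream, handle, batch_size):
    batch = []
    for event in stream:
        batch.append(event)
        if len(batch) == batch_size:
            for e in batch:    # the first event waited for the whole batch
                handle(e)
            batch.clear()
    for e in batch:            # flush the final partial batch
        handle(e)

out_a, out_b = [], []
events = [0, 1, 2, 3, 4]
event_at_a_time(events, out_a.append)
microbatch(events, out_b.append, batch_size=2)
assert out_a == out_b == [0, 1, 2, 3, 4]  # same results, different latency
```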
27. THE 5TH ITERATION: KAFKA WORKERS
Main features:
- higher level of distribution
- possibility to pause and resume processing for a given partition
- asynchronous processing
- tighter control of offset commits
- backpressure
- at-least-once semantics
- processing timeouts
- handling failures
- multiple consumers (in progress)
- kafka-to-kafka, hdfs, bigquery, elasticsearch connectors (in progress)
(github.com/RTBHOUSE/kafka-workers)
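The interaction of asynchronous processing, at-least-once semantics, and tight offset control can be sketched as below. This is an illustration of the underlying idea, not the kafka-workers API: when records finish out of order, it is only safe to commit up to the highest contiguous processed offset, otherwise a crash could skip an in-flight record. The `OffsetTracker` name is hypothetical.

```python
# Sketch: track out-of-order completions and expose the largest offset
# below which *everything* is processed; committing that offset preserves
# at-least-once semantics even if the consumer crashes and restarts.

class OffsetTracker:
    def __init__(self, start_offset=0):
        self._next = start_offset  # lowest offset not yet processed
        self._done = set()         # processed offsets beyond _next

    def mark_processed(self, offset):
        self._done.add(offset)
        # Advance past any now-contiguous run of processed offsets.
        while self._next in self._done:
            self._done.remove(self._next)
            self._next += 1

    def committable(self):
        """Offset safe to commit: every offset below it is processed."""
        return self._next

tracker = OffsetTracker()
for off in [0, 2, 3]:              # offset 1 is still being processed
    tracker.mark_processed(off)
assert tracker.committable() == 1  # cannot commit past the gap at 1

tracker.mark_processed(1)          # the gap is filled
assert tracker.committable() == 4  # offsets 0..3 are all safe to commit
```

Pausing a partition when too many offsets pile up behind a gap is one simple way to get the backpressure listed above.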
29. SUMMARY
What we have achieved:
- platform monitoring
- much more stable platform
- higher quality of data processing
- HDFS & BigQuery & Elasticsearch streaming
- multi-DC architecture and data synchronization
- high scalability
- better data-flow monitoring, deployment & maintenance
30. REAL-TIME DATA PROCESSING AT RTB HOUSE
THANK YOU FOR YOUR ATTENTION