This document describes the evolution of RTB House's real-time data processing architecture over five iterations. The first iteration used mutable data structures and single-datacenter processing. The second introduced Apache Kafka and data flow processing. The third moved to immutable event streams. The fourth added a multi-datacenter architecture with merged event streams. The current fifth iteration uses custom Kafka workers for higher distribution and tighter offset control. The changes have improved stability, scalability, data quality, and multi-region support for RTB House's growing real-time bidding platform.
Real-Time Data Processing Evolution at RTB House
1. REAL-TIME DATA PROCESSING AT RTB HOUSE
BIG DATA TECHNOLOGY MOSCOW 2018
OCTOBER 10-11, 2018
ARCHITECTURE & LESSONS LEARNED
BARTOSZ ŁOŚ
REAL-TIME DATA PROCESSING AT RTB HOUSE
2. TABLE OF CONTENTS
Agenda:
- our rtb platform
- the first iteration: mutable structures
- the second iteration: data-flow
- the third iteration: immutable streams of events
- the fourth iteration: multi-dc architecture
- the current iteration: kafka workers
- summary
4. OUR RTB PLATFORM: THE CONTEXT
Bid requests:
2M/s (peak)
~30 SSP networks
response within 50-100 ms
User events:
1.5B tags/day
350M impressions/day
3.5M clicks/day
1.5M conversions/day
Other events:
bidlogs, accesslogs,
domain events etc.
5. OUR RTB PLATFORM: DATA PROCESSING NUMBERS
Kafka:
- up to 250K+ messages per second
- 50TB+ processed data every day
- 6 clusters in 4 datacenters
- 26 Kafka brokers
- 85 topics, 5000+ partitions
Docker (processing components only):
- 44 engines
- 1408 CPU cores, 5.5TB RAM
- 800+ containers
HDFS:
- 2PB+ data, up to 10GB/s
BigQuery:
- 1PB+ data, up to 10GB/min
Elasticsearch:
- 40TB data, up to 50K events/s
Aerospike (processing only):
- 80TB data, up to 8K events/s
22. THE 4TH ITERATION: NEW REQUIREMENTS
Main changes:
- 5-6x larger scale:
> from 350K to 2M bid requests/s within 1.5 years
- full multi-dc architecture:
> merging streams of events
> synchronization of user profiles
- end-to-end exactly-once processing:
> at-least-once output semantics + deduplication
- several improved components:
> merger
> new stats-counter, new data-flow
> dispatcher & loader
> logstash
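The "at-least-once output semantics + deduplication" approach above can be sketched as follows. This is a minimal illustration of the principle, not RTB House's implementation; the class and field names (`EventDeduplicator`, `id`) are hypothetical.

```python
# Sketch: at-least-once delivery may replay an event after a failure;
# remembering processed event IDs and skipping duplicates makes the
# *effect* of processing exactly-once.

class EventDeduplicator:
    """Applies each event's side effect at most once, keyed by event ID."""

    def __init__(self):
        self._seen = set()   # IDs of already-processed events
        self.results = []    # stands in for real output side effects

    def process(self, event):
        event_id = event["id"]
        if event_id in self._seen:
            return False     # redelivered duplicate: skip side effects
        self._seen.add(event_id)
        self.results.append(event["value"])
        return True

dedup = EventDeduplicator()
# The stream redelivers event id=1, as at-least-once semantics permit:
for ev in [{"id": 1, "value": 10}, {"id": 2, "value": 20},
           {"id": 1, "value": 10}]:
    dedup.process(ev)

assert dedup.results == [10, 20]  # each event applied exactly once
```

In a real system the seen-ID set must be bounded, e.g. windowed by time or kept in a compacted store, since it cannot grow forever.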
24. THE 4TH ITERATION: NEW DATA-FLOW ON KAFKA STREAMS
(picture from kafka.apache.org)
Why Kafka Streams:
- fully embedded library with no stream processing cluster
- no external dependencies
- Kafka's parallelism model and group membership mechanism
- event-at-a-time processing (not microbatch)
- exactly-once processing semantics (but at-least-once was good enough)
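The event-at-a-time vs. microbatch distinction can be shown with a toy contrast. This is not Kafka Streams code, just an illustration of why per-event processing gives per-event latency; both functions are hypothetical.

```python
# Event-at-a-time: each record is handled the moment it arrives, so
# latency is per event. Microbatch: every record waits until its batch
# closes before any handling happens, adding up to a full batch of delay.

def event_at_a_time(stream, handle):
    for event in stream:
        handle(event)          # processed immediately on arrival

def microbatch(stream, handle, batch_size):
    batch = []
    for event in stream:
        batch.append(event)
        if len(batch) == batch_size:
            for e in batch:    # the first event waited for the whole batch
                handle(e)
            batch.clear()
    for e in batch:            # flush the final partial batch
        handle(e)

out_a, out_b = [], []
events = [0, 1, 2, 3, 4]
event_at_a_time(events, out_a.append)
microbatch(events, out_b.append, batch_size=2)
assert out_a == out_b == [0, 1, 2, 3, 4]  # same results, different latency
```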
27. THE 5TH ITERATION: KAFKA WORKERS
Main features:
- higher level of distribution
- possibility to pause and resume processing for a given partition
- asynchronous processing
- tighter control of offset commits
- backpressure
- at-least-once semantics
- processing timeouts
- handling failures
- multiple consumers (in progress)
- kafka-to-kafka, hdfs, bigquery, elasticsearch connectors (in progress)
(github.com/RTBHOUSE/kafka-workers)
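The interaction of asynchronous processing, at-least-once semantics, and tight offset control can be sketched as below. This is an illustration of the underlying idea, not the kafka-workers API: when records finish out of order, it is only safe to commit up to the highest contiguous processed offset, otherwise a crash could skip an in-flight record. The `OffsetTracker` name is hypothetical.

```python
# Sketch: track out-of-order completions and expose the largest offset
# below which *everything* is processed; committing that offset preserves
# at-least-once semantics even if the consumer crashes and restarts.

class OffsetTracker:
    def __init__(self, start_offset=0):
        self._next = start_offset  # lowest offset not yet processed
        self._done = set()         # processed offsets beyond _next

    def mark_processed(self, offset):
        self._done.add(offset)
        # Advance past any now-contiguous run of processed offsets.
        while self._next in self._done:
            self._done.remove(self._next)
            self._next += 1

    def committable(self):
        """Offset safe to commit: every offset below it is processed."""
        return self._next

tracker = OffsetTracker()
for off in [0, 2, 3]:              # offset 1 is still being processed
    tracker.mark_processed(off)
assert tracker.committable() == 1  # cannot commit past the gap at 1

tracker.mark_processed(1)          # the gap is filled
assert tracker.committable() == 4  # offsets 0..3 are all safe to commit
```

Pausing a partition when too many offsets pile up behind a gap is one simple way to get the backpressure listed above.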
29. SUMMARY
What we have achieved:
- platform monitoring
- much more stable platform
- higher quality of data processing
- HDFS & BigQuery & Elasticsearch streaming
- multi-DC architecture and data synchronization
- high scalability
- better data-flow monitoring, deployment & maintenance
30. REAL-TIME DATA PROCESSING AT RTB HOUSE
THANK YOU FOR YOUR ATTENTION