Real Time Data Processing at RTB House - Bartosz Łoś

Our platform, which purchases and runs advertisements in the Real-Time Bidding model, processes 250K bid requests and generates 20K events every second, which amounts to 3TB of data every day. For machine learning, system monitoring and financial settlements we need to filter, store, aggregate and join these events. As a result, processed events and aggregated statistics are available in Hadoop, Google BigQuery and Postgres. The most demanding business requirements are these: events that should be joined together can appear as much as 30 days apart, we are not allowed to create any duplicates, we have to minimize possible data loss, and there must be no differences between the generated data outputs. We have designed and implemented a solution that has reduced the delay in the availability of this data from 1 day to 15 seconds. We will present: 1. our first approach to the problem (end-of-day batch jobs) and the final solution (real-time stream processing); 2. a detailed description of the current architecture; 3. how we tested the new data flow before it was deployed and how it is monitored now; 4. our one-click deployment process; 5. the decisions we made, with their advantages and disadvantages, and our future plans to improve the current solution. We would like to share our experience of scaling the solution over clusters of computers in several data centers. We will focus on the current architecture, but also on testing and monitoring issues and on our deployment process. Finally, we would like to provide an overview of the projects involved, such as Kafka, MirrorMaker, Storm, Aerospike, Flume and Docker. We will describe what we have gained from these open-source projects and some of the problems we have come across.

Data & Analytics

ARCHITECTURE & LESSONS LEARNED
BARTOSZ ŁOŚ
REAL-TIME DATA PROCESSING AT RTB HOUSE
BIG DATA TECHNOLOGY WARSAW SUMMIT 2017
FEBRUARY 9, 2017

TABLE OF CONTENTS
Agenda:
- real-time bidding
- the first iteration: mutable structures
- the second iteration: data-flow
- the third iteration: immutable streams of events

REAL-TIME BIDDING: RTB PLATFORM
Processing bid requests
(350K/s, ~30 SSP networks, <50-100ms)

REAL-TIME BIDDING: DATA & MACHINE LEARNING
Impressions:
~ 150M events / day
~ 4TB data / day
Clicks:
~ 1M events / day
~ 35GB data / day
Conversions:
~ 450K events / day
~ 25GB data / day
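As a rough sanity check on these figures, the average payload per event can be derived by dividing the daily volume by the daily event count. The sketch below treats the slide's approximate numbers as exact and assumes decimal units (1 TB = 10^12 bytes); the names `DAILY`, `avg_event_kb` and `events_per_second` are illustrative only.

```python
# Approximate daily figures from the slide; decimal units are an assumption.
GB = 10**9
TB = 10**12

DAILY = {
    "impressions": {"events": 150_000_000, "bytes": 4 * TB},
    "clicks": {"events": 1_000_000, "bytes": 35 * GB},
    "conversions": {"events": 450_000, "bytes": 25 * GB},
}


def avg_event_kb(kind):
    """Average size of one event of the given kind, in kilobytes."""
    stats = DAILY[kind]
    return stats["bytes"] / stats["events"] / 1000


def events_per_second(kind):
    """Average event rate over a 24-hour day."""
    return DAILY[kind]["events"] / 86_400


for kind in DAILY:
    print(f"{kind}: ~{avg_event_kb(kind):.1f} KB/event, "
          f"~{events_per_second(kind):.1f} events/s")
```

Under these assumptions a single event carries tens of kilobytes on average, which suggests that each event embeds a substantial amount of context rather than a bare identifier.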

THE 1ST ITERATION: MUTABLE IMPRESSIONS

THE 1ST ITERATION: DRAWBACKS
Issues:
- long, overloading data migrations (30 days back)
- complex servlets' logic, inability to reprocess
- inflexible, various schemas
- single-DC
- inconsistencies

THE 2ND ITERATION: THE 1ST DATA-FLOW ARCHITECTURE

THE 2ND ITERATION: DISTRIBUTED LOG
Why Apache Kafka:
- distributed log
- topics partitioning
- partition replication
- log retention
- stateless
- efficient data consuming
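The properties listed above can be illustrated with a toy in-memory model of a partitioned log. This is a sketch only and not the Kafka client API: the class `PartitionedLog`, its methods, and the use of `crc32` as the partitioning hash are all assumptions made for illustration.

```python
import zlib


class PartitionedLog:
    """Toy in-memory model of one topic; not the Kafka client API."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, value):
        # crc32 is a stand-in for whatever hash the real partitioner uses.
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[partition].append(value)
        offset = len(self.partitions[partition]) - 1
        return partition, offset

    def read(self, partition, offset, max_records=100):
        # The log keeps no per-consumer state: the caller supplies the offset.
        return self.partitions[partition][offset:offset + max_records]


log = PartitionedLog(num_partitions=4)
p1, o1 = log.append("user-1", "impression")
p2, o2 = log.append("user-1", "click")

# Records with the same key land in the same partition, in append order.
assert p1 == p2 and o2 == o1 + 1

# A consumer tracks its own position and can re-read from any retained offset.
consumer_offset = 0
batch = log.read(p1, consumer_offset)
consumer_offset += len(batch)
print(batch, consumer_offset)
```

Because the position lives with the consumer, several independent consumers can read the same retained data at their own pace, which is what makes reprocessing possible.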

THE 2ND ITERATION: BATCH LOADING
Why Apache Camus:
- "Kafka to HDFS" pipeline
- map-reduce jobs, batches
- storing offsets in log files
- data partitioning
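The batch-loading idea can be sketched in plain Python: each run starts from the offsets stored by the previous run, writes the new records into time-based output buckets, and then records the new offsets. This is not Camus code. The function `run_batch`, the dictionary-based log, and the hourly bucket layout are hypothetical stand-ins for the map-reduce job, the Kafka topic and the HDFS directories.

```python
from collections import defaultdict
from datetime import datetime, timezone


def run_batch(log, committed_offsets, output):
    """One batch run: read each partition from its stored offset, bucket by hour."""
    new_offsets = dict(committed_offsets)
    for partition, records in log.items():
        start = committed_offsets.get(partition, 0)
        for timestamp, payload in records[start:]:
            hour = datetime.fromtimestamp(timestamp, tz=timezone.utc)
            output[hour.strftime("%Y/%m/%d/%H")].append(payload)
        new_offsets[partition] = len(records)
    # The new offsets are handed back only after the whole batch was written.
    return new_offsets


# partition -> list of (unix timestamp, payload); sample data is made up.
log = {
    0: [(1486630800, "imp-a"), (1486634400, "imp-b")],
    1: [(1486630860, "imp-c")],
}
output = defaultdict(list)

offsets = run_batch(log, {}, output)        # first batch loads everything
log[0].append((1486634460, "imp-d"))
offsets = run_batch(log, offsets, output)   # second batch loads only imp-d
print(dict(output), offsets)
```

Storing the offsets together with the batch output is what lets a failed run be repeated without loading the same records twice.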

THE 2ND ITERATION: AVRO & SCHEMA VERSIONING
Why Apache Avro:
- data serialization framework
- rich data structures
- self-describing container files
- reader & writer schemas
- binary data format
- schema registry
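The reader and writer schema idea can be sketched without the Avro library. The simplified function below covers only one rule of Avro schema resolution for records: a field the reader expects but the writer did not write takes the reader's default, and a field the reader does not know is dropped. The function `resolve` and the sample field names are illustrative assumptions, not the schemas used in the talk.

```python
def resolve(record, writer_fields, reader_fields):
    """Project a record written with writer_fields onto reader_fields."""
    resolved = {}
    for field in reader_fields:
        name = field["name"]
        if name in writer_fields:
            resolved[name] = record[name]
        elif "default" in field:
            # Field added in a newer schema: fall back to the reader's default.
            resolved[name] = field["default"]
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return resolved


writer_v1 = {"url", "time"}
reader_v2 = [
    {"name": "url"},
    {"name": "time"},
    {"name": "creative", "default": None},
]

old_record = {"url": "https://example.com", "time": 1486630800}
print(resolve(old_record, writer_v1, reader_v2))
```

With this rule, data written under an older schema version stays readable after a field is added, provided the new field declares a default.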

THE 2ND ITERATION: ACCURATE STATISTICS
Why Apache Storm:
- real-time processing
- streams of tuples, topologies
- fault-tolerance
Why Trident:
- transactions, exactly-once processing
- microbatches (latency & throughput)
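A minimal sketch of how transactional micro-batches can give exactly-once counting: the state keeps the id of the last applied batch next to each value, so a replayed batch is recognised and skipped. This is plain Python rather than Storm or Trident code, and it assumes that batches are applied in order and that a replayed batch carries the same id and the same contents. The class `TransactionalCounter` and the campaign keys are hypothetical.

```python
class TransactionalCounter:
    """Keeps (last_txid, count) per key so a replayed batch is applied once."""

    def __init__(self):
        self.state = {}  # key -> (last applied txid, count)

    def apply_batch(self, txid, batch):
        # Pre-aggregate the micro-batch, then update each key at most once.
        partial = {}
        for key in batch:
            partial[key] = partial.get(key, 0) + 1
        for key, delta in partial.items():
            last_txid, count = self.state.get(key, (None, 0))
            if last_txid == txid:
                continue  # this batch was already applied to this key
            self.state[key] = (txid, count + delta)

    def count(self, key):
        return self.state.get(key, (None, 0))[1]


counter = TransactionalCounter()
counter.apply_batch(1, ["campaign-a", "campaign-a", "campaign-b"])
counter.apply_batch(2, ["campaign-a"])
counter.apply_batch(2, ["campaign-a"])  # replay after a simulated failure
print(counter.count("campaign-a"), counter.count("campaign-b"))
```

Batching also explains the latency and throughput trade-off: larger batches mean fewer state updates per event, at the cost of a longer wait before a count becomes visible.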

THE 2ND ITERATION: STATS-COUNTER TOPOLOGY

THE 2ND ITERATION: DRAWBACKS
Hybrid architecture:
- aggregates (real-time)
- raw events (2-hour batches)
- joined events (end-of-day batch jobs)
Other issues:
- Hive joins
- mutable events
- servlets' complex logic

THE 3RD ITERATION: NEW APPROACH
{ "IMPRESSION": "URL", "TIME", "CREATIVE", ... "CLICKS", "CONVERSIONS" }
{ "CLICK": "TIME", "IMPRESSION_ID", ... "IMPRESSION" }
{ "CONVERSION": "TIME", "CLICK_ID", ... "CLICK" }
New approach:
- real-time processing
- publishing light events
- immutable streams of events
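One way to read the light-event shapes above is that a click carries only a reference to its impression, and a separate step produces a new, joined event without modifying either input. The sketch below assumes an in-memory dictionary as the lookup store, an `id` field on every event, and a 30-day join window (the figure given in the talk abstract). `EventMerger` and its methods are hypothetical names, not the author's implementation.

```python
from datetime import timedelta

JOIN_WINDOW = timedelta(days=30).total_seconds()


class EventMerger:
    """Joins light click events with previously seen impressions."""

    def __init__(self):
        self.impressions = {}    # impression id -> impression event
        self.seen_clicks = set()

    def on_impression(self, impression):
        # First write wins; a repeated delivery does not overwrite the event.
        self.impressions.setdefault(impression["id"], impression)

    def on_click(self, click):
        """Return a new joined click event, or None if it cannot be emitted."""
        if click["id"] in self.seen_clicks:
            return None  # duplicate delivery
        impression = self.impressions.get(click["impression_id"])
        if impression is None:
            return None  # no matching impression known
        if click["time"] - impression["time"] > JOIN_WINDOW:
            return None  # outside the join window
        self.seen_clicks.add(click["id"])
        # Build a new event instead of mutating either input.
        return {**click, "impression": impression}


merger = EventMerger()
merger.on_impression({"id": "imp-1", "time": 1486630800,
                      "url": "https://example.com", "creative": "banner-7"})
click = {"id": "clk-1", "time": 1486634400, "impression_id": "imp-1"}
joined = merger.on_click(click)
duplicate = merger.on_click(click)
print(joined, duplicate)
```

The same pattern would apply one level up, with a conversion referring to a click through its click id.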

THE 3RD ITERATION: HIGH-LEVEL ARCHITECTURE

THE 3RD ITERATION: DATA-FLOW TOPOLOGY

THE 3RD ITERATION: EVENTS MERGE

SUMMARY
What we have achieved:
- multi-DC architecture
- HDFS & BigQuery streaming
- platform monitoring
- much more stable platform
- higher quality of data processing
- better data-flow monitoring, deployment & maintenance
