Lambda architecture for real-time analytics: risks and benefits / Nikolay Golov (Avito)

Lambda architecture for real-time analytics: risks and benefits
Nikolay Golov

• Redshift
• Vertica
• MongoDB
• Hadoop
• HDFS
• MapReduce
• Hive/Pig

Streaming vs. batching
• Batching: analysis of the full set of collected data
• e.g., SQL queries against the collected database;
• e.g., a MapReduce job over data in Hadoop;
• Streaming: live aggregation on the stream
• real-time counters on the stream
[Diagram labels: query -> result; aggregation -> result]
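The difference between the two modes can be sketched in a few lines. This is an illustration only: the event tuples and the names `batch_count` and `StreamingCounter` are hypothetical and do not come from the talk.

```python
from collections import Counter

# Hypothetical events: (event_type, item_id).
EVENTS = [
    ("item_view", 1),
    ("phone_view", 1),
    ("item_view", 2),
    ("item_view", 1),
]


def batch_count(events):
    """Batching: scan the full collected data set and aggregate it in one pass."""
    return Counter(events)


class StreamingCounter:
    """Streaming: keep only running aggregates; raw events are not stored."""

    def __init__(self):
        self.counts = Counter()

    def on_event(self, event):
        # Update the aggregate the moment the event arrives.
        self.counts[event] += 1


stream = StreamingCounter()
for event in EVENTS:
    stream.on_event(event)

# On the same input both modes agree; they differ in when the work happens
# and in whether the raw data is still available afterwards.
print(batch_count(EVENTS) == stream.counts)
```

The batch function can be re-run with different logic at any time because it still has the raw events; the streaming counter cannot, which is the trade-off the following slides build on.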

Advantages and disadvantages
• Streaming
• + aggregation on the fly, very fast;
• + the volume of processed data is not limited by disk space ("boundless data");
• - hard to implement complex logic;
• - very hard to fix an error retroactively.
• Batching
• + complex logic is possible;
• + easy to recompute;
• - significant latency;
• - the volume of analyzed data is limited by disk space.

Task: a counter service
• Count the daily number of certain events:
• Listing views;
• Views of a listing's phone number;
• Views of a user's listings;
• ...
• Filtering: count only humans, no bots/parsers;
• Variability: new counters can be added or the filtering changed at any time;
• Speed: ideally real time, with a lag of seconds;
• Stability: figures for past periods must not "jump" much.

Primitive approach
[Diagram, processing-time axis from 8:00 to 14:00. Labels:]
Web servers
Stream of website actions: clicks, views, searches
Streaming aggregation of item 0X views
Streaming aggregation of item 0X phone views
In-memory storage (Redis, Tarantool)

Problem 01 – new aggregates
[Diagram, processing-time axis from 8:00 to 14:00. Labels:]
Web servers
Stream of website actions: clicks, views, searches
Streaming aggregation of user 0Y profile views
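One way to read this slide: a streaming aggregate only sees events that arrive after it is deployed, so a counter added in the middle of the day has no history. The sketch below illustrates that reading under assumed data; the stream contents, the hours, and the `run_stream` helper are all hypothetical.

```python
from collections import Counter

# Hypothetical stream: (hour, event_type, entity_id).
STREAM = [
    (8, "item_view", "item-1"),
    (9, "profile_view", "user-7"),
    (10, "profile_view", "user-7"),
    (12, "profile_view", "user-7"),
    (13, "item_view", "item-1"),
]


def run_stream(stream, aggregates):
    """Apply streaming aggregates to the stream.

    aggregates maps a counter name to (event_type, start_hour), where
    start_hour is the hour the aggregate was deployed.
    """
    counts = Counter()
    for hour, event_type, entity_id in stream:
        for name, (wanted_type, start_hour) in aggregates.items():
            # An aggregate cannot count events that passed before it existed.
            if event_type == wanted_type and hour >= start_hour:
                counts[(name, entity_id)] += 1
    return counts


aggregates = {
    "item_views": ("item_view", 8),         # running since the morning
    "profile_views": ("profile_view", 11),  # new aggregate, added at 11:00
}
counts = run_stream(STREAM, aggregates)

true_profile_views = sum(1 for _, t, e in STREAM if t == "profile_view" and e == "user-7")
print(counts[("profile_views", "user-7")], "counted of", true_profile_views)
```

In this toy run the new counter reports 1 profile view although 3 occurred, because two of them passed before the aggregate was added.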

Problem 02 – filtering of non-human activity
[Diagram, processing-time axis from 8:00 to 14:00. Labels:]
Web servers
Stream of website actions: clicks, views, searches

Problem 03 – errors in the past
[Diagram, processing-time axis from 8:00 to 14:00. Labels:]
Web servers
Stream of website actions: clicks, views, searches
Wrong filter / wrong code / ... some failure
Wrong/empty aggregates

Lambda as a solution
[Diagram, processing-time axis from 8:00 to 14:00. Labels:]
Web servers
Streaming aggregation of phone views
Batch aggregation

Problem 01 – new aggregates
[Diagram, processing-time axis from 8:00 to 14:00. Labels:]
Web servers
Batch aggregation
Streaming aggregation of user 0Y profile views

Problem 02 – filtering of non-human activity
[Diagram, processing-time axis from 8:00 to 14:00. Labels:]
Web servers
Batch aggregation with filtering

Problem 03 – errors in the past
[Diagram, processing-time axis from 8:00 to 14:00. Labels:]
Web servers
Batch re-aggregation
Wrong filter / wrong code / ... some failure
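A minimal sketch of the re-aggregation idea, assuming the batch layer keeps the raw event log and can overwrite the speed layer's counters for a given day. The log contents, the `bot-` cookie prefix, and all function names are hypothetical; a plain dict stands in for Redis or Tarantool.

```python
from collections import Counter

# Hypothetical raw log kept by the batch layer: (day, item_id, cookie).
RAW_LOG = [
    ("day-1", "item-1", "human-a"),
    ("day-1", "item-1", "bot-x"),
    ("day-1", "item-1", "human-b"),
    ("day-1", "item-2", "bot-x"),
]


def is_human_buggy(cookie):
    return True  # the "wrong filter": every cookie passes, bots included


def is_human_fixed(cookie):
    return not cookie.startswith("bot-")  # hypothetical corrected rule


def aggregate(log, is_human):
    """Batch aggregation: recompute daily counters from the full raw log."""
    counts = Counter()
    for day, item_id, cookie in log:
        if is_human(cookie):
            counts[(day, item_id)] += 1
    return counts


def overwrite_from_batch(speed_layer, batch_counts, days):
    """Replace the speed layer's counters for the given days with batch results."""
    for key in [k for k in speed_layer if k[0] in days]:
        del speed_layer[key]  # drop stale values, including keys that vanish
    for key, value in batch_counts.items():
        if key[0] in days:
            speed_layer[key] = value


# Speed layer counts events as they arrive, using the buggy filter.
speed_layer = {}
for day, item_id, cookie in RAW_LOG:
    if is_human_buggy(cookie):
        key = (day, item_id)
        speed_layer[key] = speed_layer.get(key, 0) + 1
wrong = dict(speed_layer)

# Later: the filter is fixed and the batch layer re-aggregates the past.
overwrite_from_batch(speed_layer, aggregate(RAW_LOG, is_human_fixed), {"day-1"})
print(wrong, "->", speed_layer)
```

Deleting the day's keys before writing matters: a counter that existed only because of the bug (here `item-2`) must disappear, not keep its old value.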

Problem 04 – logic duplication
[Diagram, processing-time axis from 8:00 to 14:00. Labels:]
Web servers
Batch aggregation

Software/hardware of stat counter
• Speed layer:
• Redis (could also be Tarantool)
• 64 master nodes <-> 64 slave nodes
• Up to 1.3 TB of RAM for masters (2.6 TB total)
• From 2 up to 8 servers
• Batch layer:
• Vertica (could also be ClickHouse)
• Cluster of 14 servers, 512 GB of RAM each
• ~50 TB of data

One more thing: how to do filtering?
Filter non-human cookies/devices
In-memory storage of human cookies/devices
(Redis, Tarantool)

Lambda again! (lambda 2)
[Diagram, processing-time axis from 8:00 to 14:00. Labels:]
Web servers
Streaming checks of cookies/devices
Batch checks of cookies/devices
In-memory storage of human cookies/devices (Redis, Tarantool)
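The same lambda pattern applied to filtering might look like the sketch below: a cheap streaming check admits cookies to an in-memory whitelist right away, and a heavier batch check later rebuilds the whitelist from history. Both rules (the user-agent substring and the daily-views threshold) and all names are hypothetical stand-ins, not the checks used in the talk; a Python set stands in for Redis or Tarantool.

```python
def streaming_check(user_agent):
    """Cheap per-event check (hypothetical rule): reject obvious bot user agents."""
    return "bot" not in user_agent.lower()


def batch_check(candidates, daily_views, max_views_per_day=1000):
    """Heavier check over collected history (hypothetical rule):
    keep only cookies whose busiest day stays under a threshold."""
    return {
        cookie
        for cookie in candidates
        if max(daily_views.get(cookie, [0])) <= max_views_per_day
    }


def count_view(counters, whitelist, item_id, cookie):
    """Count a view only if the cookie is currently whitelisted as human."""
    if cookie in whitelist:
        counters[item_id] = counters.get(item_id, 0) + 1


# Speed layer: whitelist filled on the stream.
whitelist = set()
for cookie, user_agent in [
    ("c1", "Mozilla/5.0"),
    ("c2", "ExampleBot/1.0"),
    ("c3", "Mozilla/5.0"),
]:
    if streaming_check(user_agent):
        whitelist.add(cookie)
streamed = set(whitelist)

# Batch layer: history shows that c3 behaves like a parser; rebuild the whitelist.
daily_views = {"c1": [12, 30], "c3": [50000, 42000]}
whitelist = batch_check(whitelist, daily_views)

counters = {}
for cookie in ["c1", "c2", "c3", "c1"]:
    count_view(counters, whitelist, "item-1", cookie)
print(streamed, "->", whitelist, counters)
```

The counter code only ever asks "is this cookie in the whitelist?", so the expensive logic stays in the batch layer and the lookup on the stream remains a single key check.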

Software/hardware of white cookies storage
• Speed layer:
• Tarantool (hash index, much more efficient than Redis)
• 25 GB of RAM for the master (50 GB total)
• Batch layer:
• Vertica (heavy joins are needed, so ClickHouse is not an option)
• Cluster of 14 servers, 512 GB of RAM each
• ~50 TB of data
500 million white cookies/devices

Conclusions: drawbacks of lambda architecture
• Logic duplication:
• -> the logic duplicated in the speed layer must be as simple as possible
• -> it is better to extend the logic as microservices rather than as a monolith; then the databases can be swapped freely
• -> speed layer data must support being fully overwritten from the batch layer
• -> the batch layer must support complex logic (ideally full SQL) to compensate for the limitations of the speed layer

Thank you all!
谢谢! Tack!
