Our experience with real-time analyses on a never-ending stream of user events. Discusses the Lambda architecture, the Kappa architecture, Apache Kafka, and our own approach.
1. How to analyze billions of events in real-time?
Jozo.Kovac@infinario.com
Co-Founder & Product Manager
Lambda architecture
for real-time streaming analytics
2. Agenda
• Goals & requirements
• Design patterns for streaming analytics
– General idea
– Lambda
– Kappa
• INFINARIO backend
• Discussion
4. Requirements
• VELOCITY
– Process a never-ending stream of “events” in real time
• VARIETY AT SPEED
– Analyses! Not just predefined reports
• VOLUME
– Be able to reprocess a stream; retain data
• RELIABILITY
– Never lose an event
• AVAILABILITY
– Avoid downtime
6. Real-Time Streaming Architecture
• Sources: syslog, machine data, external streams, other data
• Data collection: Flume / custom agents (Agent A … Agent N)
• Messaging system: Kafka (Topic A … Topic N)
• Real-time processing: Storm (Topology A … Topology N)
• Storage
– Search: Elasticsearch / Solr
– Low-latency NoSQL: HBase
– Historic: Hive / HDFS
• Access: web services (REST API), web apps, analytic tools (R / Python), BI tools, alerting systems
7. Apache Kafka
• publish-subscribe messaging for real-time feeds
• retains data for a configurable period of time
• immutable message queue (events)
• high-throughput, low-latency
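Retention is what makes Kafka usable for reprocessing. A minimal sketch of the relevant broker settings (example values, not the deck's actual configuration):

```properties
# server.properties – keep the event log long enough to replay it
log.retention.hours=720      # retain messages for 30 days
log.retention.bytes=-1       # no size-based limit
log.cleanup.policy=delete    # drop segments older than the retention window
```

Retention can also be overridden per topic with the topic-level `retention.ms` config, so the event stream can be kept longer than ordinary topics.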
8. Lambda Architecture
• New data arrives as a single data stream, dispatched to both layers:
– Batch layer: stores all data; pre-computes views
– Speed layer: stream processing; maintains real-time views
• Serving layer: holds the batch views
• Data access: queries merge the batch and real-time views
http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/38774
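The query-time merge of the two layers can be sketched as follows. This is a toy illustration with hypothetical counts, not a real serving layer (which would typically combine a batch store such as HBase with an in-memory real-time view):

```python
# Sketch: answering a query by merging batch and real-time views.
batch_view = {"signup": 10_000, "purchase": 1_200}   # pre-computed up to the last batch run
realtime_view = {"signup": 42, "purchase": 7}        # events seen since that run

def query(event_type):
    # The batch view covers everything up to the last batch job;
    # the speed layer compensates for its latency with recent data.
    return batch_view.get(event_type, 0) + realtime_view.get(event_type, 0)

print(query("signup"))    # 10042
```

When the next batch job finishes, its output replaces `batch_view` and the real-time view is reset, so the same query stays correct over time.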
10. Lambda pros & cons
• Pros
– Combines real-time & batch processing
– Retains input data unchanged
– Allows reprocessing the data
– Stores intermediate stages
• Cons
– 2 apps in 2 languages that do the same thing
– 2x implement, maintain & debug the code
– Say goodbye to system-specific features
http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html
11. Kappa Architecture
• Data source → data stream (retained log)
• Stream processing system: job version n and, during reprocessing, job version n + 1
• Serving DB: output table n and output table n + 1
• Data access: query the current output table
1. Use Kafka, which retains the full log of data for reprocessing and allows multiple subscribers.
2. Reprocessing: a new instance of the processing job processes the stream from the start and outputs to a new table.
3. When the second job has caught up, switch the application to read from the new table.
4. Stop the old version of the job, and delete the old output table.
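The four steps above can be simulated in a few lines. This is a toy model, with the retained Kafka log as a Python list and output tables as dicts, not a real stream processor:

```python
# Toy simulation of Kappa reprocessing (steps 1–4 above).
retained_log = [{"user": u, "event": "click"} for u in ("a", "b", "a")]

def run_job(log):
    """One job version: aggregates click counts per user from offset 0."""
    table = {}
    for msg in log:
        table[msg["user"]] = table.get(msg["user"], 0) + 1
    return table

# Step 2: the new job version replays the retained log into a new table.
output_table_n = run_job(retained_log)    # old job's results
output_table_n1 = run_job(retained_log)   # reprocessed results (new code would go here)

# Step 3: once caught up, the application switches to the new table.
serving_table = output_table_n1

# Step 4: stop the old job and drop its output table.
del output_table_n

print(serving_table)   # {'a': 2, 'b': 1}
```

In a real deployment the switch in step 3 is just repointing the application at a new table or index, which is why Kappa needs 2x storage only during the cutover.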
http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html
12. Kappa pros & cons
• Pros
– Allows people to develop, test, debug, and operate their systems on top of a single processing framework
• Cons
– Needs 2x total storage (2 versions of results)
– Requires a DB with high-volume writes
14. IMF™
• “In-Memory (event processing) Framework”
• Collect, store and analyze events and players
• Distributed & scalable
– Built on NodeJS and C++
– Nodes per CPU core & proportion of RAM
– Provides API for analyses
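IMF itself is proprietary, but the kind of analysis it serves, such as the funnel calculation benchmarked on the next slide, can be sketched over in-memory events. This is a hypothetical illustration, not Infinario's actual code:

```python
# Sketch: a funnel over in-memory events, counting how many players
# complete each step in order.
from collections import defaultdict

events = [  # (player_id, timestamp, event_type)
    ("p1", 1, "install"), ("p1", 2, "signup"), ("p1", 3, "purchase"),
    ("p2", 1, "install"), ("p2", 5, "signup"),
    ("p3", 2, "install"),
]

def funnel(events, steps):
    """Return, per funnel step, the number of players who reached it."""
    by_player = defaultdict(list)
    for player, ts, etype in sorted(events, key=lambda e: e[1]):
        by_player[player].append(etype)
    counts = [0] * len(steps)
    for seq in by_player.values():
        i = 0
        for etype in seq:
            if i < len(steps) and etype == steps[i]:
                counts[i] += 1
                i += 1
    return counts

print(funnel(events, ["install", "signup", "purchase"]))  # [3, 2, 1]
```

Keeping events in RAM, sharded per CPU core, is what lets this kind of scan finish in milliseconds instead of the seconds a disk-backed database needs.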
15. IMF Benchmarking
[Chart: time to calculate funnel(s), in seconds, vs. number of events in the database (100,000 / 1,000,000 / 10,000,000 / 100,000,000) for BlinkBytes, Mongo, TokuMX, Postgres, MySQL and IMF. The fastest engine stays between 0.004 s and 0.349 s at every size, while the slower engines grow to hundreds or thousands of seconds at 100,000,000 events (262.784 s, 284.803 s, 522.518 s, 1273.985 s).]
https://infinario.com/speedtest
16. Our experience
• Pros
– It’s lightning fast
– Cheap reprocessing; easy life
– Can process an already-processed stream (“streaming”)
• Cons
x No immediate results: a code change or a new node means reloading IMF
x Reloads can take too long
x A PB of RAM in 2015 is a joke
17. Reloads
• NoSQL consumes too many resources (CPU time)
• Can potentially lose some events
• Reload time (NoSQL to IMF) grows fast
• Analyses are unavailable during reload
18. INFINARIO is like this
• Sources: SDKs, BULK, frontend
• Data collection: custom API (Agent A … Agent N)
• Messaging system
• Real-time processing: IMF (Topology A … Topology N)
• Storage: historic NoSQL
• Access: web services (REST API), web apps, analytic tools (R / Python), BI tools, alerting systems
20. INFINARIO Architecture Updated
• Kafka: retains the raw event stream
• Reload: event stream replayed into in-memory processing
• In-memory processing: serves low-latency access
• Persistent storage: raw data, history views
• Access: ad hoc queries, DM, APP
21. AngularJS developer wanted
Our designers work much faster than our frontend team.
Could you help? Email us: jobs@infinario.com
Editor's Notes
All data is dispatched to both the batch layer and the speed layer.
The batch layer (i) manages the master dataset and (ii) pre-computes the batch views.
The serving layer indexes the batch views for low-latency, ad-hoc queries.
The speed layer compensates for the batch layer’s high latency and deals with recent data only.
An incoming query can be answered by merging the batch and real-time views.
Use Kafka or some other system that will let you retain the full log of the data you want to be able to reprocess and that allows for multiple subscribers. For example, if you want to reprocess up to 30 days of data, set your retention in Kafka to 30 days.
When you want to do the reprocessing, start a second instance of your stream processing job that starts processing from the beginning of the retained data, but direct this output data to a new output table.
When the second job has caught up, switch the application to read from the new table.
Stop the old version of the job, and delete the old output table.