Solution for events logging with Akka Streams and Kafka
How Akka Streams, coupled with Kafka, makes it easier to manage data flows

Sementsov A., 2020
Architecture: Step 1
• Clients communicate with the server through a Kafka broker
• The consumer writes each record to both storages: PostgreSQL and Hadoop (possibly via Hive)
• PostgreSQL is used for operational data with a maximum storage period of about 1-3 months; it is the storage with fast search capability
• Hadoop and related components are used as cheaper storage, but with slower access
[Figure: bird's-eye view of the architecture]
Challenges
1. It is difficult to maintain consistency when writing to multiple repositories simultaneously
2. The consumer must use the existing access-rights provider, monitoring, and logging systems
3. How to reduce the amount of code to develop and speed up the process of adding new repositories
4. How to choose the optimal storage format from all the options offered by Hadoop and related products
5. No vendor lock-in is allowed
Architecture: Step 2
Final solution
1. Consistency – Kafka lets us create several consumer groups, and each group processes all messages independently, so every storage gets its own full copy of the stream (see the sketch after this list)
2. There are many tools, such as Fluentd, Logstash, and Flume, that provide "get -> process -> put" functionality. However, for two main reasons (avoiding vendor lock-in and integrating with our proprietary systems) we decided to develop our own
3. We chose the Akka Streams library to reduce the amount of code to develop and speed up the process of adding new storages
4. Hadoop write stages
   1. The consumer service writes files directly to HDFS in Apache Avro format. Pros of Avro: fast writes and compression. Cons: it has no indexes, so search is very slow
   2. Apache Oozie schedules a task that converts the Avro files into an ORC table using Hive. Apache ORC files have several advantages:
      • They have indexes
      • They allow columnar encryption and/or masking
      Cons:
      • A file can be written only once, because indexes must be added at the end of the file
5. Both PostgreSQL and Hive provide JDBC drivers to access the data
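
A minimal sketch of point 1 above, using the Alpakka Kafka connector: two consumer groups subscribe to the same topic, so each storage pipeline receives every message independently. The group ids, topic, and broker address here are illustrative, not from the slides.

import akka.actor.ActorSystem
import akka.kafka.scaladsl.Consumer
import akka.kafka.{ConsumerSettings, Subscriptions}
import org.apache.kafka.common.serialization.StringDeserializer

object TwoConsumerGroups extends App {
  implicit val system: ActorSystem = ActorSystem("events-logging")

  // One ConsumerSettings per storage; only the group id differs
  def settings(groupId: String): ConsumerSettings[String, String] =
    ConsumerSettings(system, new StringDeserializer, new StringDeserializer)
      .withBootstrapServers("kafka:9092")
      .withGroupId(groupId)

  // Each group gets its own full copy of the "events" topic
  val postgresSource = Consumer.committableSource(settings("postgres-writers"), Subscriptions.topics("events"))
  val hdfsSource     = Consumer.committableSource(settings("hdfs-writers"), Subscriptions.topics("events"))
}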
Akka Streams: Consumer application architecture
Consumer application
• HTTP endpoints are protected by the desired access provider. The endpoints are developed using Akka HTTP (see the sketch after this list)
• Each HTTP endpoint is backed by a KafkaProcessor actor. The actor can process two types of messages: Start and Stop
• KafkaProcessor can start and stop several kinds of streams:
  • PostgresFlow
  • HdfsAvroFlow
• The trait BaseKafkaFlow provides common functions:
  • Logging and monitoring functionality
  • Getting messages from Kafka and committing offsets back to it
  • Extension points to parse and store messages
• Each final stream extends the BaseKafkaFlow trait and implements the flow stages for:
  • Parser
  • Store
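
A minimal sketch of such an endpoint, assuming hypothetical Start and Stop messages and a KafkaProcessor actor reference; the access-provider protection is elided and would wrap these routes with the appropriate authentication directive.

import akka.actor.ActorRef
import akka.http.scaladsl.server.Directives._
import akka.http.scaladsl.server.Route

object ConsumerHttpApi {
  // Hypothetical control messages understood by the KafkaProcessor actor
  case object Start
  case object Stop

  // Forwards control commands to the KafkaProcessor actor
  def controlRoutes(processor: ActorRef): Route =
    pathPrefix("consumer") {
      post {
        path("start") { processor ! Start; complete("starting") } ~
        path("stop")  { processor ! Stop;  complete("stopping") }
      }
    }
}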
Akka Streams coding: Base classes
These two traits, BaseProcessorFlow and BaseKafkaFlow, provide functions that cover the entire processing procedure.

The type parameters In and P allow different sources to be used and different messages to be processed. For BaseKafkaFlow the source type is KafkaMessage (its definition was shown as code on the slide).

BaseProcessorFlow has two abstract values:
• parse – used to convert an incoming value into PassThrough[In, P]
• saveBatch – used to store a batch of data and pass all processed data downstream. This stage can be a source of backpressure

BaseKafkaFlow extends the base processor by adding the ability to receive and send messages. BaseKafkaFlow is an actor, and this class also has state. BaseKafkaFlow can process the following messages:
• StartKafkaConsumer(startConfig) – the message contains the parameters needed to create and start a stream connected to the Kafka brokers. The stream reads messages from Kafka, processes them, and commits the offsets to the Kafka partitions once the messages are processed successfully. startConfig also contains information about the consumer group
• StopKafkaConsumer – the message is used to stop all streams gracefully

The main application actor can start many actors of the different classes derived from BaseKafkaFlow. A sketch of these base types follows.
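
The slides' code itself is not in this transcript, so the following is a hypothetical reconstruction of the base types from the description above; PassThrough, ConsumerStartConfig, and the KafkaMessage alias are assumptions, not the original definitions.

import akka.NotUsed
import akka.stream.scaladsl.Flow

// Carries the original element alongside its parsed form, so the Kafka
// offset can still be committed after the parsed data has been stored
final case class PassThrough[In, P](original: In, parsed: P)

trait BaseProcessorFlow[In, P] {
  // Converts an incoming value into PassThrough[In, P]
  def parse: Flow[In, PassThrough[In, P], NotUsed]
  // Stores a batch and emits the processed elements downstream;
  // slow storage makes this stage the source of backpressure
  def saveBatch: Flow[Seq[PassThrough[In, P]], Seq[PassThrough[In, P]], NotUsed]
}

// Assumed binding for the source type mentioned above, e.g.:
// type KafkaMessage = akka.kafka.ConsumerMessage.CommittableMessage[String, Array[Byte]]

// Placeholder for the stream and consumer-group parameters carried by Start
final case class ConsumerStartConfig(bootstrapServers: String, groupId: String, topic: String)

// Control messages handled by the BaseKafkaFlow actor
final case class StartKafkaConsumer(startConfig: ConsumerStartConfig)
case object StopKafkaConsumer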
Akka Streams coding: Storages
PostgresFlow implements a parser that constructs a SomeEvent object, stored as JSON. If JSON parsing is not successful, it constructs a SomeEvent object that contains the error instead.

saveBatch stores the data in the PostgreSQL database. When the data cannot be saved due to a database error, this function repeatedly retries the save. To avoid CPU overload, it uses an exponential backoff to control the time between attempts. The whole batch is saved in one transaction in the saveEvents function (see the sketch below).
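
A minimal sketch of that retry loop, using the backoff overload of akka.pattern.retry from Akka 2.6; SomeEvent and saveEvents stand in for the project's own definitions, and the attempt count and delays are illustrative.

import akka.actor.{ActorSystem, Scheduler}
import akka.pattern.retry
import scala.concurrent.Future
import scala.concurrent.duration._

// Illustrative stand-in for the project's event type
final case class SomeEvent(json: String, error: Option[String] = None)

object PostgresRetry {
  // Stand-in for the real DAO call; saves the whole batch in one transaction
  def saveEvents(batch: Seq[SomeEvent]): Future[Seq[SomeEvent]] = ???

  def saveBatchWithBackoff(batch: Seq[SomeEvent])(implicit system: ActorSystem): Future[Seq[SomeEvent]] = {
    import system.dispatcher
    implicit val scheduler: Scheduler = system.scheduler
    retry(
      attempt      = () => saveEvents(batch),
      attempts     = 10,           // give up after ten failed saves
      minBackoff   = 100.millis,   // initial delay between attempts
      maxBackoff   = 30.seconds,   // cap on the exponentially growing delay
      randomFactor = 0.2           // jitter so retries do not synchronize
    )
  }
}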
Akka Streams coding: Storages
Storing Avro files requires more complex code, so custom flow processing is used: HdfsAvroFileFlow defines the custom flow stage, and HdfsAvroFileFlowLogic defines its processing logic. The separate logic class is necessary because we need state to hold the output Avro stream; one stream corresponds to one file in HDFS. The logic itself is very simple: open the file if it is not open, and write to the file if it is open. From time to time the file is forcibly closed, and rotation occurs.

The writeAvroEvents function writes the data and passes only the successfully written data downstream. Thanks to this, only data that was actually written to HDFS is committed to Kafka.

However, there is one point to consider: the Kafka stream will not reread uncommitted records. To make it do so, the stream has to be restarted, either by sending a special message to the actor or by throwing an exception during this stage. A sketch of such a stage follows.
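
A hypothetical reconstruction of such a stage as an Akka Streams GraphStage; AvroWriter hides the actual HDFS/Avro I/O, and SomeEvent and the rotation threshold are illustrative, not the original code.

import akka.stream.{Attributes, FlowShape, Inlet, Outlet}
import akka.stream.stage.{GraphStage, GraphStageLogic, InHandler, OutHandler}

final case class SomeEvent(json: String)            // illustrative event type
trait AvroWriter {                                  // hides the real HDFS/Avro output stream
  def write(batch: Seq[SomeEvent]): Seq[SomeEvent]  // returns the events actually written
  def close(): Unit
}

class HdfsAvroFileFlow(openWriter: () => AvroWriter, rotateAfter: Int)
    extends GraphStage[FlowShape[Seq[SomeEvent], Seq[SomeEvent]]] {

  val in: Inlet[Seq[SomeEvent]]   = Inlet("HdfsAvroFileFlow.in")
  val out: Outlet[Seq[SomeEvent]] = Outlet("HdfsAvroFileFlow.out")
  override val shape: FlowShape[Seq[SomeEvent], Seq[SomeEvent]] = FlowShape(in, out)

  // The logic object holds the mutable state: the currently open file, if any
  override def createLogic(attrs: Attributes): GraphStageLogic =
    new GraphStageLogic(shape) with InHandler with OutHandler {
      private var writer: Option[AvroWriter] = None
      private var written = 0

      override def onPush(): Unit = {
        val batch = grab(in)
        if (writer.isEmpty) writer = Some(openWriter())   // open the file if it is not open
        val ok = writer.get.write(batch)                  // write to the file if it is open
        written += ok.size
        if (written >= rotateAfter) {                     // forced close from time to time: rotation
          writer.foreach(_.close()); writer = None; written = 0
        }
        push(out, ok)  // only data actually written to HDFS flows on and gets committed to Kafka
      }

      override def onPull(): Unit = pull(in)
      override def postStop(): Unit = writer.foreach(_.close())

      setHandlers(in, out, this)
    }
}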
Links
• Akka main site – https://akka.io/
• Akka Streams – https://doc.akka.io/docs/akka/current/stream/index.html
• Author email – anatolse@gmail.com