This document discusses Kappa Architecture, an alternative to Lambda Architecture for event processing. Kappa Architecture uses a single stream of events from Apache Kafka as the input, rather than separating batch and stream processing. It reads all events from Kafka and runs analytics on the full data set to enable both learning from historical events and reacting to new events. The document outlines how Kappa Architecture provides benefits like avoiding duplicate processing logic and making actionable analytics easier. It also describes how to read bounded batches of events from Kafka for analytics using tools like Apache Spark.
3. What is an event
A user performed an
action in the application
A customer just ordered a
product
An event is something that just happened
and requires a quick reaction
Information (data)
received from an external
partner
A frequent customer
ordered another product
4. The need for Event Streaming
High-frequency events → Qualification → Reaction
• A lot of events happen
• Some of them are valuable
• Some require our reaction
• We have little time to act
5. How events are processed
Action
Gather events → Store & forward → Process → React
Valuable events require a reaction
6. Complex events
• The same event happened again
• An event connected with external
data is a different event
External
data
Complex events are high level events based on multiple data points
→ Complex events have a real business value
7. Complex events identification
Reaction
Simple events:
A customer
logged in
A customer
dropped a
shopping basket
Convert to a
complex event
A complex event may be identified and added to the event stream
External data
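The identification step on this slide can be sketched in plain Java. This is only an illustration under stated assumptions, not the presenter's implementation: the event type names (`LOGGED_IN`, `BASKET_DROPPED`, `ABANDONED_SESSION`), the `ComplexEventDetector` class and the rule "a login followed by a dropped basket for the same customer forms a complex event" are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: derive a complex event from two simple events.
final class ComplexEventDetector {

    // A simple event: which customer did what (names are illustrative).
    static final class Event {
        final String customerId;
        final String type;

        Event(String customerId, String type) {
            this.customerId = customerId;
            this.type = type;
        }
    }

    // Returns the input stream with each complex event added right after
    // the simple event that completed the pattern.
    static List<Event> withComplexEvents(List<Event> stream) {
        Set<String> loggedIn = new HashSet<>();
        List<Event> out = new ArrayList<>();
        for (Event e : stream) {
            out.add(e);
            if ("LOGGED_IN".equals(e.type)) {
                loggedIn.add(e.customerId);
            } else if ("BASKET_DROPPED".equals(e.type) && loggedIn.contains(e.customerId)) {
                // Both simple events were seen for this customer: emit the complex event.
                out.add(new Event(e.customerId, "ABANDONED_SESSION"));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Event> stream = new ArrayList<>();
        stream.add(new Event("c1", "LOGGED_IN"));
        stream.add(new Event("c2", "BASKET_DROPPED"));
        stream.add(new Event("c1", "BASKET_DROPPED"));
        for (Event e : withComplexEvents(stream)) {
            System.out.println(e.customerId + " " + e.type);
        }
    }
}
```

Only customer `c1` produces the complex event, because `c2` dropped a basket without a preceding login in the stream.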
8. Source of events
Actions performed by users in applications
Messages from a corporate event bus (EAI)
Complex events identified by correlation of
multiple events
Row changes in databases (CDC)
9. Analytical advancement
Analytical advancement ladder
Business value grows with each step:
• Descriptive analytics: What has happened?
• Diagnostic analytics: Why did it happen?
• Predictive analytics: What will happen?
• Prescriptive analytics: What can we do to make it happen?
10. Event processing value proposition
Predictive analytics
Prescriptive analytics
Learn what we can get from events
Identify and act on events
Event processing requires two processes: learning and acting
11. Event consumers
Data scientists & data analysts
identify valuable events
Events are consumed for learning and for performing actions
Reaction to events
Reaction to new events in the future
Events need re-reading many times
14. Limitations of the Lambda Architecture
• The batch layer and the speed layer require double processing
• Changes to the processing logic must be reimplemented in both
processing pipelines
• The whole view of all data is possible only by a virtual query that is
a union of the batch and the speed layer
But do we need a speed layer that is always up to date?
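The "virtual query" from the last bullet can be pictured as a merge of two views. The sketch below is a simplified illustration, assuming both layers expose per-key counts; the class name `LambdaMergedView` and the sample keys are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a Lambda-style serving query: the complete view
// only exists as a merge of the batch view and the speed view.
final class LambdaMergedView {

    // Adds the per-key counts of the speed layer on top of the batch layer.
    static Map<String, Long> merge(Map<String, Long> batchView, Map<String, Long> speedView) {
        Map<String, Long> merged = new HashMap<>(batchView);
        for (Map.Entry<String, Long> e : speedView.entrySet()) {
            merged.merge(e.getKey(), e.getValue(), Long::sum);
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Long> batch = new HashMap<>();
        batch.put("orders", 100L);   // computed by the last batch run
        Map<String, Long> speed = new HashMap<>();
        speed.put("orders", 7L);     // events that arrived since that run
        speed.put("logins", 3L);
        System.out.println(merge(batch, speed).get("orders")); // prints 107
    }
}
```

Both views have to be produced by equivalent logic for the sum to be meaningful, which is where the double implementation effort comes from.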
15. Lambda Architecture for log monitoring
Lambda Architecture is good for log monitoring, not for business events
16. Lambda Architecture for CDC data synchronization
Lambda Architecture is good for keeping a copy of rows
from an OLTP database
Insert
Delete
Update
DB
Key-value database
(HBase, Cassandra, etc.)
18. Kappa architecture data lag
As long as the reaction time to the event is longer than
the processing time, we can work with the data lag
Output table N
Output table N+1
15 min batch human reaction
lags
19. Kappa Architecture benefits
• Kafka is the only source
• Only one processing logic
• Multiple types of analyses possible
• New results available in a new table
Predictive analytics
Prescriptive analytics
Actionable analytics (learning + reacting) much easier
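The "one processing logic, new results in a new table" idea can be sketched as follows. This is a toy model under assumptions: the event log is an in-memory list standing in for a Kafka topic, and `KappaReprocessing`, `buildTable` and the sample events are hypothetical names.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: every run re-reads the whole event log and writes a fresh table.
final class KappaReprocessing {

    // Counts events per key; keyLogic is the single processing logic of this run.
    static Map<String, Integer> buildTable(List<String> eventLog, Function<String, String> keyLogic) {
        Map<String, Integer> table = new HashMap<>();
        for (String event : eventLog) {
            table.merge(keyLogic.apply(event), 1, Integer::sum);
        }
        return table;
    }

    public static void main(String[] args) {
        List<String> eventLog = Arrays.asList("order:book", "order:pen", "login:web");

        // Output table N: count events per event type.
        Map<String, Integer> tableN = buildTable(eventLog, e -> e.split(":")[0]);
        // Output table N+1: the logic changed, so the full log is simply processed again.
        Map<String, Integer> tableN1 = buildTable(eventLog, e -> e);

        System.out.println(tableN.get("order"));       // prints 2
        System.out.println(tableN1.get("order:book")); // prints 1
    }
}
```

Consumers keep querying table N until table N+1 is complete, so a change of logic never requires maintaining two pipelines in parallel.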
21. What is Apache Kafka
Consumer 1
Consumer 2
Apache Kafka is a high throughput publish-subscribe event bus
Event publishers
System 1
System 2
Event consumers
Kafka topic
22. Apache Kafka partitioning
Kafka rules:
• Topics are partitioned
• Partitions are append-only files
• Partitions are distributed across nodes
• Write speed: 1 million events / sec / partition
• Read speed: 2 million events / sec / partition
Kafka topic
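The partitioning rules can be modelled with a small in-memory structure. This is a sketch, not Kafka itself: `PartitionedLog` is a hypothetical class, and the partition choice by `String.hashCode()` is a simplification (Kafka's default partitioner uses a different hash function).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory model of a partitioned, append-only topic.
final class PartitionedLog {
    private final List<List<String>> partitions = new ArrayList<>();

    PartitionedLog(int partitionCount) {
        for (int i = 0; i < partitionCount; i++) {
            partitions.add(new ArrayList<String>());
        }
    }

    // Simplification: Kafka's default partitioner uses a different hash function.
    int partitionFor(String key) {
        return Math.floorMod(key.hashCode(), partitions.size());
    }

    // Appends to the end of the key's partition and returns the new record's offset.
    long append(String key, String value) {
        List<String> partition = partitions.get(partitionFor(key));
        partition.add(value);
        return partition.size() - 1;
    }

    // The end offset is the offset the next appended record will get.
    long endOffset(int partition) {
        return partitions.get(partition).size();
    }

    public static void main(String[] args) {
        PartitionedLog log = new PartitionedLog(4);
        System.out.println(log.append("customer-1", "logged in"));       // prints 0
        System.out.println(log.append("customer-1", "ordered product")); // prints 1
    }
}
```

Records with the same key land in the same partition and keep their order there; offsets are only meaningful within one partition.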
23. Apache Kafka consumer groups
Consumer 1
Consumer 2
A consumer group
All consumers in a group share a group.id
Offset
24. Apache Kafka offset storage for a group.id
But in Kappa Architecture we do not care about offset,
we read everything again
• Event streaming consumer must
keep the last read offset for each
partition
• Offset storage is specified by
offset.storage.[topic]
• Offset stores: Zookeeper, Kafka,
custom
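The difference between a streaming consumer that tracks offsets and a Kappa-style reader that starts over can be sketched with an in-memory store. `OffsetStore` is a hypothetical class, and starting at offset 0 for an unknown group is an assumption of this sketch; in Kafka the starting point of a group without stored offsets is controlled by the `auto.offset.reset` consumer setting.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical in-memory offset store keyed by group.id and partition.
final class OffsetStore {
    private final Map<String, Long> committed = new HashMap<>();

    private static String key(String groupId, int partition) {
        return groupId + "/" + partition;
    }

    // Remembers the next offset the group should read from this partition.
    void commit(String groupId, int partition, long nextOffset) {
        committed.put(key(groupId, partition), nextOffset);
    }

    // A group with no stored offset starts from the beginning (offset 0) in this sketch.
    long position(String groupId, int partition) {
        Long offset = committed.get(key(groupId, partition));
        return offset == null ? 0L : offset;
    }

    public static void main(String[] args) {
        OffsetStore store = new OffsetStore();
        store.commit("streaming-app", 0, 42L);
        System.out.println(store.position("streaming-app", 0)); // prints 42

        // A fresh, random group.id has no history, so it re-reads everything.
        String kappaGroup = "kappa-" + UUID.randomUUID();
        System.out.println(store.position(kappaGroup, 0));       // prints 0
    }
}
```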
25. Waiting for new events on Apache Kafka
The consumer can still
read from partition 0
The consumer has
reached the end of all
partitions and is waiting
A consumer that has reached the end of an assigned partition
waits for new events for the duration of the "poll" timeout period
Partition 0
Partition 1
Partition 2
Partition 3
26. Reading events without waiting at the end of a partition
KafkaConsumer<~> consumer =...
ConsumerRecords<~> records = consumer.poll(10000);
We must stop listening to a partition when we reach the last event or
the reader will wait or consume events forever
Partition 1
27. Reading events the easy way (1/3): setup
Properties consumerProps = new Properties();
consumerProps.put("bootstrap.servers", "localhost:9092");
consumerProps.put("group.id", "consumer group here");
consumerProps.put("key.deserializer",
"org.apache.kafka.common.serialization.StringDeserializer");
consumerProps.put("value.deserializer",
"org.apache.kafka.common.serialization.StringDeserializer");
A random group.id must be used
28. Reading events the easy way (2/3): partition offset seek
KafkaConsumer<String, String> consumer =
new KafkaConsumer<String, String>(consumerProps);
List<PartitionInfo> partitionInfos =
consumer.partitionsFor("topic name here");
List<TopicPartition> topicPartitions =
partitionInfos.stream()
.map(pi -> new TopicPartition(pi.topic(), pi.partition()))
.collect(Collectors.toList());
consumer.assign(topicPartitions);
consumer.seekToBeginning(topicPartitions);
But we can also find offsets by a timestamp and "rewind" to it
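In the Kafka consumer API the timestamp lookup is done with `KafkaConsumer.offsetsForTimes`, which returns, per partition, the earliest offset whose timestamp is greater than or equal to the requested one. The sketch below only models that lookup rule on an in-memory array so it runs without a broker; `TimestampSeek` is a hypothetical class, and the assumption that timestamps never decrease within a partition belongs to the sketch.

```java
// Hypothetical sketch: find the earliest offset whose timestamp is >= the target.
final class TimestampSeek {

    // timestamps[i] is the timestamp of the record at offset i, in non-decreasing
    // order (an assumption of this sketch). Returns -1 if no such record exists.
    static long offsetForTime(long[] timestamps, long target) {
        int lo = 0;
        int hi = timestamps.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (timestamps[mid] < target) {
                lo = mid + 1;   // record is too old, look to the right
            } else {
                hi = mid;       // candidate found, look for an earlier one
            }
        }
        return lo == timestamps.length ? -1 : lo;
    }

    public static void main(String[] args) {
        long[] timestamps = {100L, 200L, 200L, 300L};
        System.out.println(offsetForTime(timestamps, 200L)); // prints 1
        System.out.println(offsetForTime(timestamps, 250L)); // prints 3
        System.out.println(offsetForTime(timestamps, 301L)); // prints -1
    }
}
```

With a real consumer, the offset found this way would be passed to `consumer.seek(partition, offset)` in place of `seekToBeginning`.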
29. Reading events the easy way (3/3): reading loop
// Snapshot of the end offset (offset of the next record to be written) per partition
Map<TopicPartition, Long> endOffsets = consumer.endOffsets(topicPartitions);
int remainingPartitionsCount = endOffsets.size();
while (remainingPartitionsCount > 0) {
    ConsumerRecords<String, String> consumerRecords = consumer.poll(10000);
    for (ConsumerRecord<String, String> record : consumerRecords) {
        TopicPartition recordPartition = new TopicPartition(record.topic(), record.partition());
        long endOffset = endOffsets.get(recordPartition);
        // Last record of the snapshot: stop fetching from this partition
        if (record.offset() == endOffset - 1) {
            remainingPartitionsCount--;
            consumer.pause(Arrays.asList(recordPartition));
        }
        // Skip records that arrived after the snapshot was taken
        if (record.offset() < endOffset)
            processRecord(record);
    }
    // Nothing arrived within the timeout (e.g. only empty partitions remain)
    if (consumerRecords.isEmpty())
        break;
}
30. Bounded event reading on Apache Spark
1. Create a custom RDD or Dataframe that
reads from Apache Kafka
2. Register your RDD in the context
3. Just run SQL on the DataFrame
31. Bounded Spark RDD (1/6): RDD declaration
public static class KafkaTopicRDD extends org.apache.spark.rdd.RDD<String> {
    private static final ClassTag<String> STRING_TAG =
        ClassManifestFactory$.MODULE$.fromClass(String.class);
    private static final long serialVersionUID = 1L;
    private String kafkaServer;
    private String groupId;
    private String topic;
    private long timeout;
    public KafkaTopicRDD(SparkContext sc, String kafkaServer, String groupId, String topic,
            long timeout) {
        super(sc, new ArrayBuffer<Dependency<?>>(), STRING_TAG);
        this.kafkaServer = kafkaServer;
        this.groupId = groupId;
        this.topic = topic;
        this.timeout = timeout;
    }
32. Bounded Spark RDD (2/6): RDD’s compute
    @Override
    public Iterator<String> compute(Partition arg0, TaskContext arg1) {
        KafkaTopicPartition p = (KafkaTopicPartition)arg0;
        KafkaConsumer<String, String> kafkaConsumer = createKafkaConsumer();
        // Read exactly one Kafka partition, starting at its recorded start offset
        TopicPartition partition = new TopicPartition(topic, p.partition);
        kafkaConsumer.assign(Arrays.asList(partition));
        kafkaConsumer.seek(partition, p.startOffset);
        // The iterator stops at the recorded end offset, which makes the read bounded
        return new KafkaTopicIterator(kafkaConsumer, p.endOffset, this.timeout);
    }
    private KafkaConsumer<String, String> createKafkaConsumer() {
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", this.kafkaServer);
        consumerProps.put("group.id", this.groupId);
        consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        KafkaConsumer<String, String> consumer = new KafkaConsumer<String, String>(consumerProps);
        return consumer;
    }
Each Kafka partition is processed as a separate task
33. Bounded Spark RDD (3/6): Kafka → Spark partition
public static class KafkaTopicPartition implements Partition {
    private static final long serialVersionUID = 1L;
    private int partition;
    private long startOffset;
    private long endOffset;
    public KafkaTopicPartition(int partition, long startOffset, long endOffset) {
        this.partition = partition;
        this.startOffset = startOffset;
        this.endOffset = endOffset;
    }
    @Override
    public int index() { return partition; }
    @Override
    public boolean equals(Object obj) { return ... }
    @Override
    public int hashCode() { return index(); }
}
35. Bounded Spark RDD (5/6): events iterator
public static class KafkaTopicIterator extends AbstractIterator<String> {
    private KafkaConsumer<String, String> kafkaConsumer;
    private long endOffset, timeout;
    private ConsumerRecords<String, String> recordsBatch;
    private java.util.Iterator<ConsumerRecord<String, String>> recordIterator;
    private ConsumerRecord<String, String> currentRecord;
    private boolean lastRecordReached;
    public KafkaTopicIterator(KafkaConsumer<String, String> kafkaConsumer, long endOffset, long timeout) {
        this.kafkaConsumer = kafkaConsumer; this.endOffset = endOffset; this.timeout = timeout;
    }
    @Override
    public String next() {
        // Fail fast instead of dereferencing null when no record is left
        if (!hasNext())
            throw new java.util.NoSuchElementException();
        String value = currentRecord.value();
        currentRecord = null;
        return value;
    }
36. Bounded Spark RDD (6/6): iterator’s hasNext
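    @Override
    public boolean hasNext() {
        if (currentRecord != null) return true;
        if (lastRecordReached) return false;
        // Fetch the next batch whenever the current one is used up
        while (recordIterator == null || !recordIterator.hasNext()) {
            recordsBatch = this.kafkaConsumer.poll(this.timeout);
            // No records within the timeout: treat the partition as drained
            if (recordsBatch.isEmpty()) return false;
            recordIterator = recordsBatch.iterator();
        }
        currentRecord = recordIterator.next();
        if (currentRecord.offset() >= endOffset) {
            // Record arrived after the end offset was captured: stop here
            currentRecord = null;
            lastRecordReached = true;
            return false;
        }
        if (currentRecord.offset() >= endOffset - 1)
            lastRecordReached = true;
        return true;
    }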
37. The anatomy of a Kafka event
Key Value
• Records in Kafka have a key and a value
• Both the key and the value are binary and serialized
by a serializer of choice
• JSON, String or Avro serializers are useful
38. Apache Kafka log compaction
The default (delete) log cleanup
policy removes old entries
The "compact" cleanup policy keeps the
newest version of a record for each key
1 week
log.cleanup.policy=delete
Compact cleanup policy is required for Kappa Architecture
1 2 53 4 2 3 6 2 7
log.cleanup.policy=compact
1 2 53 4 2 3 6 2 7
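The effect of the compact policy can be sketched on an in-memory log. This models only the end state (the newest value per key survives); real compaction runs in the background on the broker and the surviving records keep their original offsets. `CompactionSketch` and the sample keys are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the end state produced by log.cleanup.policy=compact.
final class CompactionSketch {

    // Each record is {key, value}; the result keeps only the newest value per key,
    // ordered by the position of the surviving record in the log.
    static Map<String, String> compact(List<String[]> log) {
        Map<String, String> latest = new LinkedHashMap<>();
        for (String[] record : log) {
            latest.remove(record[0]);          // drop the older version, if any
            latest.put(record[0], record[1]);  // newest version goes to the end
        }
        return latest;
    }

    public static void main(String[] args) {
        List<String[]> log = new ArrayList<>();
        log.add(new String[] {"k1", "a"});
        log.add(new String[] {"k2", "b"});
        log.add(new String[] {"k1", "c"});   // newer version of k1
        System.out.println(compact(log));    // prints {k2=b, k1=c}
    }
}
```

Because the latest state of every key is retained, the topic can still be re-read from the beginning at any time, unlike with the time-based delete policy.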
40. Your existing data sources
ETL-free virtual database,
Apache Spark powered
User-friendly front-ends
keep it as it is to reduce risk
set up to facilitate &
accelerate BI
use what users know for years
Querona – Data Virtualization engine
Complete Logical Data Warehouse: ETL-free, self-service, Big Data ready, utilizing Apache Spark.
41. Data sources
CRM
ERP
OLTP
Client tools
Connects all data sources (~100)
Simple data loading (3 clicks)
Joins data from many sources (instant)
Real-time data access
Enables GDPR/RODO compliance
QUERONA – Logical Data Warehouse
42. Why Querona
Data Virtualization (DV) is not a new idea, but since 2016 Gartner has
considered DV a key trend in Data Warehousing and Data Analytics
• Self-service → more people can use data
• SQL Server wire compatibility → compatible with any client tool
• Apache Spark bundled → "Big Data Ready" in 5 minutes
• Competitive licensing model → DV available for all companies
44. What do we
have here?
Data preview in one place
Maybe we can
correlate that
with events?
45. The data source
not capable of
real-time access?
Caching – just a few clicks
Let’s cache it on
Apache Spark or
in the cloud
46. More information
about an event
is in a CRM?
Joining data
Let’s build a 360°
customer profile
as an SQL view!
47. Augmented events
Original events (Kafka)
Augmented events visible as SQL Server compatible views
V_EVENTS_CUSTOMER_INFO
V_EVENTS_PRODUCT_INFO
V_EVENT_SALES
CRM
Product database
ERP
V_EVENT_CAMPAIGN_GOALS
Marketing platform
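What a view such as V_EVENTS_CUSTOMER_INFO computes can be sketched as a join between events and customer data. In the slides this is done with SQL views; the Java below is only an illustration of the join semantics, and `EventAugmentation`, the record layout and the sample data are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: each event is joined with customer data from another source.
final class EventAugmentation {

    // events: {customerId, action}; crm: customerId -> customer name.
    // Events without a CRM match are kept, like a LEFT JOIN.
    static List<String> augment(List<String[]> events, Map<String, String> crm) {
        List<String> augmented = new ArrayList<>();
        for (String[] event : events) {
            String name = crm.containsKey(event[0]) ? crm.get(event[0]) : "unknown";
            augmented.add(event[0] + "," + event[1] + "," + name);
        }
        return augmented;
    }

    public static void main(String[] args) {
        List<String[]> events = Arrays.asList(
            new String[] {"c1", "ordered"},
            new String[] {"c9", "logged in"});
        Map<String, String> crm = new HashMap<>();
        crm.put("c1", "Alice");
        for (String row : augment(events, crm)) {
            System.out.println(row);
        }
    }
}
```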
48. External data sources for event augmentation
Social media
SaaS
Business
partner’s
database
Partners Public data
49. Kappa Architecture full data lifecycle rules
• Treat Apache Kafka as a persistent event source
• Get ready for both event analysis (learn) and reacting to
events (act)
• Identify all additional data that may augment events
• Make sure that you can reprocess events at any time
• Expose complex events for consumption (dashboards,
activities created in CRM, etc.)