4. With Apache Kafka
(Diagram: multiple source systems publish to Kafka, and Kafka feeds multiple target systems)
5. Taxonomy
Producer – An application that sends data to Apache Kafka
Consumer – An application that receives data from Apache Kafka
Consumer Group – A group of consumers acting as a single logical unit
Broker – A Kafka server
Cluster – A group of Kafka brokers
Topic – All Kafka messages are organized into topics
Partition – A part of a topic
Offset – Unique id for a message within a partition
9. Brokers
A Kafka cluster is composed of brokers
Each broker is identified by an id
Each broker contains certain topic partitions
(Diagram: Broker 101, Broker 102, Broker 103)
10. Brokers & Topics
(Diagram: Topic A's partitions 0, 1 and 2 and Topic B's partitions 0 and 1 spread across Brokers 101, 102 and 103)
Topic A with 3 partitions and Topic B with 2
11. Topic replication factor
(Diagram: Topic A's Partition 0 on Brokers 101 and 102, Partition 1 on Brokers 102 and 103)
Topics should have a replication factor > 1 (usually 2 or 3)
This way, if a broker is down, another broker can serve the data
E.g. Topic A with 2 partitions and a replication factor of 2
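The replicated layout above can be sketched as a small simulation (not real Kafka code; broker ids and placement mirror the diagram):

```python
# Sketch: how a replication factor of 2 keeps Topic A readable when one
# broker dies. Each partition maps to the list of brokers holding a replica.
replicas = {
    ("topic-a", 0): [101, 102],  # partition 0 lives on brokers 101 and 102
    ("topic-a", 1): [102, 103],  # partition 1 lives on brokers 102 and 103
}

def available(partition, live_brokers):
    """A partition can still be served if at least one replica is alive."""
    return any(b in live_brokers for b in replicas[partition])

# Broker 102 goes down: both partitions survive on the remaining replicas.
live = {101, 103}
print(available(("topic-a", 0), live))  # True, served by 101
print(available(("topic-a", 1), live))  # True, served by 103
```

With a replication factor of 1, losing the single hosting broker would make the partition unavailable.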
12. Topic replication factor
(Diagram: the same layout as above, with Broker 102 lost)
If we lose Broker 102, we can still serve the data from Brokers 101 and 103
13. Leader for a partition
• At a time only ONE broker can be a leader for a given partition
• Only that leader can receive and serve data for a partition
• The other brokers will synchronize the data
• Each partition has one leader and multiple ISR (In Sync Relplica)
Topic A
Partition 0
Topic A
Partition 1
Topic A
Partition
1(ISR)
Broker 101 Broker 102 Broker 103
Topic A
Partition
0(ISR)
14. Producers can choose to receive acknowledgement of data writes
acks=0 : Producer does not wait for acknowledgment (possible data loss)
acks=1 : Producer waits for the leader's acknowledgment (limited data loss)
acks=all : Producer waits for leader + replica acknowledgment (no data loss)
(Diagram: producers writing to Topic A's partitions 0, 1 and 2 on Brokers 101, 102 and 103)
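As a concrete sketch, the three acks modes map to a producer config setting. The dicts below use the config style of the confluent-kafka Python client (an assumption; the broker address is a placeholder):

```python
# Hedged sketch: producer configurations for the three acks modes.
# "localhost:9092" is a placeholder broker address.
fire_and_forget = {"bootstrap.servers": "localhost:9092", "acks": 0}      # possible data loss
leader_only     = {"bootstrap.servers": "localhost:9092", "acks": 1}      # limited data loss
safest          = {"bootstrap.servers": "localhost:9092", "acks": "all"}  # leader + in-sync replicas

# e.g. Producer(safest) would only report a delivery as successful once
# the leader AND all in-sync replicas have the write.
```

The trade-off is throughput/latency versus durability: acks=0 is fastest, acks=all is safest.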
16. Producers can choose to send a key with each message (string, number, …)
If key = null, data is sent in a round-robin manner
If a key is sent, all messages for that key will go to the same partition
(Diagram: a producer writing to Topic A's partitions 0, 1 and 2)
Key = cc_payment_cc_123 : data will always go to partition 0
Key = cc_payment_cc_345 : data will always go to partition 1
Key = cc_payment_cc_456 : data will always go to partition 1
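The key-to-partition behaviour can be sketched as below. Note the hash here is a toy stand-in: real Kafka hashes the key bytes with murmur2, but the property being illustrated (same key, same partition; null key, round robin) is the same:

```python
import itertools

NUM_PARTITIONS = 3
_round_robin = itertools.cycle(range(NUM_PARTITIONS))

def pick_partition(key):
    """Simplified sketch of the default partitioner.
    Real Kafka uses murmur2 over the key bytes; we use a toy hash."""
    if key is None:
        return next(_round_robin)   # no key: spread messages round-robin
    digest = sum(key.encode())      # toy stand-in for murmur2
    return digest % NUM_PARTITIONS

# The same key always lands on the same partition...
assert pick_partition("cc_payment_cc_123") == pick_partition("cc_payment_cc_123")
# ...while null keys rotate across all partitions.
print([pick_partition(None) for _ in range(4)])  # [0, 1, 2, 0]
```

This is why keyed messages (e.g. all events for one credit card) keep their per-key ordering: they always end up in the same partition, and ordering is guaranteed within a partition.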
17. Producers write data to topics; the load is balanced across many brokers
(Diagram: consumers reading Topic A's partitions 0, 1 and 2, each read in order)
Within each partition, messages are read in order
18. Consumers read data in consumer groups
Each consumer within a group reads from exclusive partitions
If you have more consumers than partitions, some consumers will be inactive
(Diagram: consumer group app 1 with 2 consumers and consumer group app 2 with 3 consumers, both reading Topic A's partitions 0, 1 and 2)
19. What if there are too many consumers?
(Diagram: consumer group app 2 with 4 consumers on Topic A's 3 partitions; Consumer 4 is inactive)
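The two group scenarios above can be sketched with a toy assignment function (a simplification: real Kafka uses pluggable assignor strategies such as range or round-robin, but the "surplus consumers sit idle" outcome is the same):

```python
def assign(partitions, consumers):
    """Toy assignment: each partition goes to exactly one consumer in the
    group; consumers beyond the partition count get nothing (stay idle)."""
    plan = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        plan[consumers[i % len(consumers)]].append(p)
    return plan

# 3 partitions, 4 consumers -> consumer-4 is assigned nothing and is inactive.
plan = assign(["A-0", "A-1", "A-2"],
              ["consumer-1", "consumer-2", "consumer-3", "consumer-4"])
print(plan)
# {'consumer-1': ['A-0'], 'consumer-2': ['A-1'], 'consumer-3': ['A-2'], 'consumer-4': []}
```

With 2 consumers on 3 partitions, one consumer would instead receive two partitions, which is why partition count caps a group's useful parallelism.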
20. Kafka stores the offsets at which a consumer group has been reading
The committed offsets live in a Kafka topic named __consumer_offsets
When a consumer in a group has processed data received from Kafka, it should commit the offsets
If a consumer dies, it will be able to read back from where it left off, thanks to the committed offsets
(Diagram: a consumer from a consumer group reading offsets 1001–1008, with the committed offset marked)
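The resume-after-crash behaviour can be sketched with a plain list standing in for a partition log (a simulation, not real Kafka client code):

```python
# Sketch: why committed offsets let a restarted consumer resume where it
# left off. The "partition" is a list; offsets index into it.
log = ["m0", "m1", "m2", "m3", "m4"]
committed = 0  # conceptually, the offset stored in __consumer_offsets

def consume(from_offset, crash_after=None):
    """Process messages, committing after each; optionally 'crash' early."""
    global committed
    processed = []
    for offset in range(from_offset, len(log)):
        processed.append(log[offset])
        committed = offset + 1          # commit: the next offset to read
        if crash_after is not None and len(processed) == crash_after:
            break                       # simulate the consumer dying
    return processed

first_run = consume(committed, crash_after=3)   # dies after m2
second_run = consume(committed)                 # resumes at offset 3
print(first_run, second_run)  # ['m0', 'm1', 'm2'] ['m3', 'm4']
```

Nothing is re-read on restart because the committed offset already points past every processed message.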
21. Delivery semantics for consumers: consumers choose when to commit offsets
There are 3 delivery mechanisms:
At most once
Offsets are committed as soon as the message is received.
If the processing goes wrong, the message will be lost (it won't be read again)
At least once
Offsets are committed after the message is processed.
If the processing goes wrong, the message will be read again.
This can result in duplicate processing of messages. Make sure your processing is idempotent.
Exactly once
Each message is processed exactly once; in Kafka this is achievable for Kafka-to-Kafka workflows via the transactional APIs (e.g. Kafka Streams).
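The difference between the first two semantics comes down to where the commit sits relative to processing, which a small simulation makes concrete (not real client code; the "crash" happens once, while handling the message at the given offset):

```python
def at_most_once(log, fail_at):
    """Commit BEFORE processing: a crash mid-processing loses the message."""
    committed, out, crashed = 0, [], False
    while committed < len(log):
        offset = committed
        committed = offset + 1              # commit first
        if offset == fail_at and not crashed:
            crashed = True                  # died mid-processing...
            continue                        # ...restart: message is LOST
        out.append(log[offset])             # "process" the message
    return out

def at_least_once(log, fail_at):
    """Commit AFTER processing: a crash before the commit replays the message."""
    committed, out, crashed = 0, [], False
    while committed < len(log):
        offset = committed
        out.append(log[offset])             # processing happens first
        if offset == fail_at and not crashed:
            crashed = True                  # died BEFORE committing...
            continue                        # ...restart: message REPLAYED
        committed = offset + 1              # commit after
    return out

print(at_most_once(["a", "b", "c"], fail_at=1))   # ['a', 'c']           'b' lost
print(at_least_once(["a", "b", "c"], fail_at=1))  # ['a', 'b', 'b', 'c'] 'b' duplicated
```

The duplicate in the at-least-once run is exactly why the processing side should be idempotent.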
22. Kafka Connectors
• You can use connectors to copy data between Apache Kafka and other systems that you want to pull data from or push data to.
• Source connectors import data from another system; sink connectors export data.
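As a concrete sketch, a source connector is typically just configuration, no code. The properties below follow the style of the file source connector that ships with Kafka; the file path and topic name are placeholders:

```properties
# Hedged sketch of a Kafka Connect source-connector config.
# Tails a local file and publishes each line to a topic.
name=local-file-source
connector.class=FileStreamSource
tasks.max=1
file=/tmp/input.txt
topic=file-lines
```

A sink connector looks the same shape-wise, with a sink `connector.class` and the topic(s) to export from.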
23. Streaming SQL for Apache Kafka
• Confluent KSQL is the streaming SQL engine that enables real-time data processing against Apache Kafka®. It provides an easy-to-use yet powerful interactive SQL interface for stream processing on Kafka, without the need to write code in a programming language such as Java or Python. KSQL is scalable, elastic and fault-tolerant, and it supports a wide range of streaming operations, including data filtering, transformations, aggregations, joins, windowing and sessionization.
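For flavour, a KSQL session might declare a stream over a topic and run a windowed aggregation like the sketch below (the topic and column names are made up for illustration):

```sql
-- Hedged sketch: declare a stream over an existing Kafka topic...
CREATE STREAM payments (user_id VARCHAR, amount DOUBLE)
  WITH (KAFKA_TOPIC = 'payments', VALUE_FORMAT = 'JSON');

-- ...then aggregate it continuously, one count per user per minute.
SELECT user_id, COUNT(*)
FROM payments
WINDOW TUMBLING (SIZE 1 MINUTE)
GROUP BY user_id;
```

The query is continuous: it keeps emitting updated counts as new records arrive on the topic.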
Editor's Notes
This is a common data-integration requirement in any large enterprise.
Here you have source systems and target systems, and they want to exchange data with one another.
A target system could be another API, a database or a utility.
There are 16 possible integrations here, which means managing URIs, connection details and other configs specific to each target system.
It means that all the apps in the source systems must be aware of all the APIs in the target systems that they need to call.
It also means that the target systems must be available at the time the source system makes the call.
This causes two major problems:
Over a period of time this becomes highly unmaintainable. The load on the target systems keeps increasing as more source systems get added.
Source systems need to implement ways of dealing with failed calls to the target systems.
Kafka provides solutions to both of our problems.
This can be solved by decoupling the source systems and target systems.
Kafka is a highly scalable and fault-tolerant enterprise messaging system.
It can be used as:
1. An enterprise messaging system
2. A stream-processing platform
3. A way to import or export bulk data between databases and other systems
A Kafka cluster consists of one or more servers (Kafka brokers), which are running Kafka.
Producers are processes that publish data (push messages) into Kafka topics within the broker. A consumer of topics pulls messages off a Kafka topic.
All Kafka messages are organized into topics.
Producer applications write data to topics and consumer applications read from topics.
Messages published to the cluster stay in the cluster until a configurable retention period has passed; Kafka retains all messages for a set amount of time.
Kafka topics are divided into a number of partitions, each containing messages in an unchangeable sequence. Each message in a partition is assigned and identified by its unique offset. A topic can have multiple partition logs, which allows multiple consumers to read from a topic in parallel.
In Kafka, replication is implemented at the partition level. Details to follow.
The failover and replication configurations are usually managed by the platform provider/team.
But as developers it's important to know them, because there are attributes we set while consuming the Java/Node client libraries which impact the delivery mechanism and performance.