Introduction to Apache Kafka, Confluent and why they matter

Introduction to Apache Kafka and Confluent
... and why they matter!
Kafka Meetup - Johannesburg
Tuesday, March 20th 2018
18:00 – 20:00
SSA - Maxwell Office Park, Magwa Cres, Waterfall City, Midrand, 2090 · Midrand
https://www.meetup.com/Johannesburg-Kafka-Meetup/events/248465767/

How Organizations Handle Data Flows: a Giant Mess
[Diagram labels: Data Warehouse, Hadoop, NoSQL, Oracle, SFDC, Logging, Bloomberg, …any sink/source; Web, Custom Apps, Microservices, Monitoring, Analytics, …and more; OLTP, ActiveMQ, Apps, Caches]

Apache Kafka™: A Distributed Streaming Platform
Offline Batch (+1 Hour) | Near-Real Time (>100s ms) | Real Time (0-100 ms)
[Diagram labels: Apache Kafka; Data Warehouse, Hadoop, NoSQL, Oracle, SFDC, Twitter, Bloomberg, …any sink/source; Web, Custom Apps, Microservices, Monitoring, Analytics, …and more]

More than 1 petabyte of data in Kafka
Over 1.2 trillion messages per day
Thousands of data streams
Source of all data warehouse & Hadoop data
Over 300 billion user-related events per day

Over 35% of Fortune 500 companies are using Apache Kafka™
6 of the top 10: Travel
7 of the top 10: Global banks
8 of the top 10: Insurance
9 of the top 10: Telecom

Industry Trends… and why Apache Kafka matters!
1. From ‘big data’ (batch) to ‘fast data’ (stream processing)
2. Internet of Things (IoT) and sensor data
3. Microservices and asynchronous communication (coordination messages and data streams) between loosely coupled and fine-grained services

Apache Kafka APIs – A UNIX Analogy
$ cat < in.txt | grep "apache" | tr a-z A-Z > out.txt
Connect APIs
Streams APIs
Producer / Consumer APIs

Apache Kafka API – ETL Analogy
Extract: Connect API (from the source)
Transform: Streams API
Load: Connect API (to the sink)

Apache Kafka 101
Internals and Core Concepts

Apache Kafka Concepts: Persistent Log
Data Producer
0 1 2 3 4 5 6 7 8 9 10 11 12
writes
Data Consumer
(offset = 7)
Data Consumer
(offset = 11)
reads reads
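
The persistent log above can be sketched in a few lines of plain Java. This is a minimal in-memory illustration, not Kafka's implementation: the class name `AppendOnlyLog` and its methods are hypothetical. It shows the two ideas on the slide: a producer appends at the end and gets back an offset, and each consumer reads from its own offset without affecting the others.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory sketch of an append-only log (not Kafka code).
final class AppendOnlyLog {
    private final List<String> records = new ArrayList<>();

    // Appends a record at the end and returns the offset assigned to it.
    long append(String record) {
        records.add(record);
        return records.size() - 1;
    }

    // Returns a copy of all records from the given offset (inclusive) to the end.
    List<String> readFrom(long offset) {
        return new ArrayList<>(records.subList((int) offset, records.size()));
    }

    public static void main(String[] args) {
        AppendOnlyLog log = new AppendOnlyLog();
        for (int i = 0; i <= 12; i++) {
            log.append("record-" + i);
        }
        // Two independent consumers, each tracking its own position as on the slide.
        System.out.println("consumer at offset 7 still has " + log.readFrom(7).size() + " records to read");
        System.out.println("consumer at offset 11 still has " + log.readFrom(11).size() + " records to read");
    }
}
```

Reading never removes anything from the log, which is why the two consumers can sit at different offsets.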

Apache Kafka Concepts: Anatomy of a Topic
partition 0: 0 1 2 3 4 5 6 7 8 9 10 11 12
partition 1: 0 1 2 3 4 5 6 7
partition 2: 0 1 2 3 4 5
writes (appended to the end of each partition)
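
A rough sketch of how a topic splits into partitions, each with its own offset sequence. The class `PartitionedTopic` is hypothetical, and `String.hashCode()` is only a stand-in for the hash a real Kafka producer applies to record keys; the point is that records with the same key land in the same partition and offsets are counted per partition.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a topic made of several independent partition logs.
final class PartitionedTopic {
    private final List<List<String>> partitions = new ArrayList<>();

    PartitionedTopic(int numPartitions) {
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
    }

    // Assumption: String.hashCode() stands in for the producer's key hash.
    int partitionFor(String key) {
        return Math.floorMod(key.hashCode(), partitions.size());
    }

    // Appends the value to the partition chosen by the key.
    // Returns {partition, offset within that partition}.
    long[] append(String key, String value) {
        int partition = partitionFor(key);
        List<String> log = partitions.get(partition);
        log.add(value);
        return new long[] {partition, log.size() - 1};
    }

    public static void main(String[] args) {
        PartitionedTopic topic = new PartitionedTopic(3);
        for (String key : new String[] {"user-1", "user-2", "user-1", "user-3"}) {
            long[] position = topic.append(key, "event for " + key);
            System.out.println(key + " -> partition " + position[0] + ", offset " + position[1]);
        }
    }
}
```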

Apache Kafka Concepts: Log Storage
offset index
timestamp index
offsets: 0 - 10000
offset index
timestamp index
offsets: 10001 - 20000
offset index
timestamp index
offsets: 20001 - 30000
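
One way to picture the segment layout above: keep segments sorted by their first (base) offset and find the segment for any offset with a floor lookup. The sketch below is an assumption-laden illustration, not broker code: `SegmentLookup` is a hypothetical class, it covers only the segment lookup (not the offset or timestamp indexes inside a segment), and it names segments after their zero-padded base offset.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: locate the log segment that contains a given offset.
final class SegmentLookup {
    // Segment file name keyed by the first offset stored in that segment.
    private final TreeMap<Long, String> segmentsByBaseOffset = new TreeMap<>();

    void addSegment(long baseOffset) {
        // Name the segment after its zero-padded base offset.
        segmentsByBaseOffset.put(baseOffset, String.format("%020d.log", baseOffset));
    }

    // The owning segment is the one with the greatest base offset <= the requested offset.
    String segmentFor(long offset) {
        Map.Entry<Long, String> entry = segmentsByBaseOffset.floorEntry(offset);
        if (entry == null) {
            throw new IllegalArgumentException("offset " + offset + " is before the first segment");
        }
        return entry.getValue();
    }

    public static void main(String[] args) {
        SegmentLookup lookup = new SegmentLookup();
        // Base offsets taken from the three segments on the slide.
        lookup.addSegment(0L);
        lookup.addSegment(10001L);
        lookup.addSegment(20001L);
        System.out.println("offset 15000 lives in " + lookup.segmentFor(15000L));
    }
}
```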

Apache Kafka Concepts: Message Format
offset: 8 bytes
length: 4 bytes
CRC: 4 bytes
magic byte: 1 byte
attribute: 1 byte
timestamp: 8 bytes
key length: 4 bytes
key content: varies
value length: 4 bytes
value content: varies
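
To make the field sizes concrete, here is a hedged sketch that lays the fields out with `java.nio.ByteBuffer`. The class `MessageLayoutSketch` is hypothetical and is not Kafka's serializer. Two details are assumptions not stated on the slide: the length field counts everything after itself, and the CRC covers the bytes from the magic byte to the end of the value. Null keys and compression are ignored.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical sketch of the on-disk field layout shown on the slide.
final class MessageLayoutSketch {

    static byte[] encode(long offset, long timestamp, String key, String value) {
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        byte[] valueBytes = value.getBytes(StandardCharsets.UTF_8);

        // Part assumed to be covered by the CRC: magic, attribute, timestamp, key, value.
        ByteBuffer body = ByteBuffer.allocate(1 + 1 + 8 + 4 + keyBytes.length + 4 + valueBytes.length);
        body.put((byte) 1);            // magic byte (format version)
        body.put((byte) 0);            // attribute byte (no flags set)
        body.putLong(timestamp);       // 8-byte timestamp
        body.putInt(keyBytes.length);  // 4-byte key length
        body.put(keyBytes);            // key content
        body.putInt(valueBytes.length);// 4-byte value length
        body.put(valueBytes);          // value content

        CRC32 crc = new CRC32();
        crc.update(body.array());

        // Assumption: length = CRC field plus the body that follows it.
        int length = 4 + body.capacity();
        ByteBuffer out = ByteBuffer.allocate(8 + 4 + length);
        out.putLong(offset);               // 8-byte offset
        out.putInt(length);                // 4-byte length
        out.putInt((int) crc.getValue());  // 4-byte CRC
        out.put(body.array());
        return out.array();
    }

    public static void main(String[] args) {
        byte[] bytes = encode(7L, System.currentTimeMillis(), "k", "val");
        System.out.println("encoded message size in bytes: " + bytes.length);
    }
}
```

The fixed overhead per message in this layout is 34 bytes; everything else is key and value content.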

Apache Kafka Concepts: Producers and Consumers
Producer
Producer
Producer
Consumer
Consumer
Broker
Broker
Broker

Apache Kafka Concepts: Topics and Partitions
Producer
Producer
Producer
Consumer
Consumer
Broker
Broker
Broker
T0: P0
T0: P2
T0: P1
T0: P3
T1: P0
T1: P1

Apache Kafka Concepts: Fault Tolerance and Replication
Producer
Producer
Producer
Consumer
Consumer
Broker
Broker
Broker
T0: P0
T0: P0 (Replica 1)
T1: P0
T1: P0 (Replica 1)

Apache Kafka Concepts: Consumer Groups
Producer
Producer
Producer
Consumer
Broker
Broker
Broker
T0: P0
T0: P2
T0: P1
T0: P3
T1: P0
T1: P1
Consumer
Consumer
Consumer
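
Within a consumer group, each partition is read by exactly one member, so adding consumers spreads the partitions across them. The sketch below is a simplified round-robin assignment under that rule; `GroupAssignmentSketch` and the partition labels are hypothetical, and Kafka's own assignors are more involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: spread partitions over the members of one consumer group.
final class GroupAssignmentSketch {

    // Simplified round-robin: partition i goes to consumer (i mod number of consumers).
    static Map<String, List<String>> assign(List<String> consumers, List<String> partitions) {
        Map<String, List<String>> assignment = new LinkedHashMap<>();
        for (String consumer : consumers) {
            assignment.put(consumer, new ArrayList<>());
        }
        for (int i = 0; i < partitions.size(); i++) {
            String owner = consumers.get(i % consumers.size());
            assignment.get(owner).add(partitions.get(i));
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Partition labels follow the slide: topic T0 has four partitions, T1 has two.
        List<String> partitions = Arrays.asList("T0-P0", "T0-P1", "T0-P2", "T0-P3", "T1-P0", "T1-P1");
        System.out.println(assign(Arrays.asList("consumer-1", "consumer-2", "consumer-3"), partitions));
    }
}
```

With more consumers than partitions, the extra consumers in this sketch simply receive an empty list.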

The Connect API of Apache Kafka®
• Centralized management and configuration
• Support for hundreds of technologies including RDBMS, Elasticsearch, HDFS, S3
• Supports CDC ingest of events from RDBMS
• Preserves data schema
• Fault tolerant and automatically load balanced
• Extensible API
• Single Message Transforms
• Part of Apache Kafka, included in Confluent Open Source
Reliable and scalable integration of Kafka with other systems – no coding required.
{
"connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
"connection.url": "jdbc:mysql://localhost:3306/demo?user=rmoff&password=foo",
"table.whitelist": "sales,orders,customers"
}
https://docs.confluent.io/current/connect/

Build Applications, not Clusters
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka-streams</artifactId>
<version>1.0.0</version>
</dependency>

How do I run in production?

Uncool Cool

http://docs.confluent.io/current/streams/introduction.html

Elastic and Scalable
http://docs.confluent.io/current/streams/developer-guide.html#elastic-scaling-of-your-application

Typical High Level Architecture
Real-time Data Ingestion

Stream Processing
Real-time Data Ingestion

Stream Processing
Storage
Real-time Data Ingestion

Data Publishing / Visualization
Stream Processing
Storage
Real-time Data Ingestion

How many clusters do you count?
NoSQL (Cassandra, HBase, Couchbase, MongoDB, …) or Elasticsearch, Solr, …
Storm, Flink, Spark Streaming, Ignite, Akka Streams, Apex, …
HDFS, NFS, Ceph, GlusterFS, Lustre, …
Apache Kafka

Simplicity is the Ultimate Sophistication
Node.js
Apache Kafka
Distributed Streaming Platform
Publish & Subscribe to streams of data like a messaging system
Store streams of data safely in a distributed replicated cluster
Process streams of data efficiently and in real-time

Duality of Streams and Tables
http://docs.confluent.io/current/streams/concepts.html#duality-of-streams-and-tables

Duality of Streams and Tables
http://docs.confluent.io/current/streams/concepts.html#duality-of-streams-and-tables
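
The duality can be illustrated without Kafka at all. In the sketch below (hypothetical class `StreamTableDuality`, plain Java collections), replaying a stream of updates and keeping the latest value per key yields a table, and walking the table's rows yields a stream again. It is a simplified picture of the idea, not how Kafka Streams implements KStream and KTable.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the stream/table duality using plain collections.
final class StreamTableDuality {

    // Convenience factory for one (key, value) update record.
    static Map.Entry<String, Long> update(String key, long value) {
        return new AbstractMap.SimpleEntry<>(key, value);
    }

    // Stream -> table: replay the updates in order, keeping the latest value per key.
    static Map<String, Long> toTable(List<Map.Entry<String, Long>> stream) {
        Map<String, Long> table = new LinkedHashMap<>();
        for (Map.Entry<String, Long> record : stream) {
            table.put(record.getKey(), record.getValue());
        }
        return table;
    }

    // Table -> stream: emit one update record per row of the table.
    static List<Map.Entry<String, Long>> toStream(Map<String, Long> table) {
        List<Map.Entry<String, Long>> stream = new ArrayList<>();
        for (Map.Entry<String, Long> row : table.entrySet()) {
            stream.add(update(row.getKey(), row.getValue()));
        }
        return stream;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> stream = new ArrayList<>();
        stream.add(update("alice", 1));
        stream.add(update("bob", 1));
        stream.add(update("alice", 2)); // a later update for the same key
        Map<String, Long> table = toTable(stream);
        System.out.println("table: " + table);
        System.out.println("table rebuilt from its own stream: " + toTable(toStream(table)));
    }
}
```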

Interactive Queries
http://docs.confluent.io/current/streams/developer-guide.html#streams-developer-guide-interactive-queries

Interactive Queries
http://docs.confluent.io/current/streams/developer-guide.html#streams-developer-guide-interactive-queries

Kafka Streams DSL
http://docs.confluent.io/current/streams/developer-guide.html#kafka-streams-dsl

WordCount (and Java 8+)
WordCountLambdaExample.java
final Properties streamsConfiguration = new Properties();
streamsConfiguration.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-lambda-example");
streamsConfiguration.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
...
final Serde<String> stringSerde = Serdes.String();
final Serde<Long> longSerde = Serdes.Long();
final KStreamBuilder builder = new KStreamBuilder();
// Read the input topic as a stream of text lines.
final KStream<String, String> textLines =
    builder.stream(stringSerde, stringSerde, "TextLinesTopic");
// Split on runs of non-word characters; the backslash in \W must be escaped in Java.
final Pattern pattern = Pattern.compile("\\W+", Pattern.UNICODE_CHARACTER_CLASS);
final KTable<String, Long> wordCounts = textLines
    .flatMapValues(value -> Arrays.asList(pattern.split(value.toLowerCase())))
    .groupBy((key, word) -> word)
    .count("Counts");
// Write the running counts to the output topic.
wordCounts.to(stringSerde, longSerde, "WordsWithCountsTopic");
final KafkaStreams streams = new KafkaStreams(builder, streamsConfiguration);
streams.cleanUp();
streams.start();
Runtime.getRuntime().addShutdownHook(new Thread(streams::close));

Easy to Develop with, Easy to Test
WordCountLambdaIntegrationTest.java
EmbeddedSingleNodeKafkaCluster CLUSTER = new EmbeddedSingleNodeKafkaCluster();
...
CLUSTER.createTopic(inputTopic);
...
Properties producerConfig = new Properties();
producerConfig.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG,
CLUSTER.bootstrapServers());

The Streams API of Apache Kafka®
• No separate processing cluster required
• Develop on Mac, Linux, Windows
• Deploy to containers, VMs, bare metal, cloud
• Powered by Kafka: elastic, scalable, distributed, battle-tested
• Perfect for small, medium, large use cases
• Fully integrated with Kafka security
• Exactly-once processing semantics
• Part of Apache Kafka, included in Confluent Open Source
Write standard Java applications and microservices to process your data in real-time
KStream<User, PageViewEvent> pageViews = builder.stream("pageviews-topic");
KTable<Windowed<User>, Long> viewsPerUserSession = pageViews
.groupByKey()
.count(SessionWindows.with(TimeUnit.MINUTES.toMillis(5)), "session-views");
https://docs.confluent.io/current/streams/

KSQL: a Streaming SQL Engine for Apache Kafka® from Confluent
• No coding required, all you need is SQL
• No separate processing cluster required
• Powered by Kafka: elastic, scalable, distributed, battle-tested
CREATE TABLE possible_fraud AS
SELECT card_number, count(*)
FROM authorization_attempts
WINDOW TUMBLING (SIZE 5 SECONDS)
GROUP BY card_number
HAVING count(*) > 3;
CREATE STREAM vip_actions AS
SELECT userid, page, action
FROM clickstream c
LEFT JOIN users u
ON c.userid = u.userid
WHERE u.level = 'Platinum';
KSQL is the simplest way to process streams of data in real-time
• Perfect for streaming ETL, anomaly detection, event monitoring, and more
• Part of Confluent Open Source
https://github.com/confluentinc/ksql
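
For readers who want to see what the tumbling-window query on this slide computes, here is a batch-style sketch in plain Java. It is an illustration under stated assumptions, not how KSQL executes: the class `TumblingWindowCount`, the `Attempt` type and the `card@windowStart` key format are all hypothetical. Each event falls into exactly one fixed-size window, counts are kept per card and window, and only counts above the threshold are reported.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of a tumbling-window count with a HAVING-style filter.
final class TumblingWindowCount {

    // One authorization attempt: card number plus event time in milliseconds.
    static final class Attempt {
        final String cardNumber;
        final long timestampMs;

        Attempt(String cardNumber, long timestampMs) {
            this.cardNumber = cardNumber;
            this.timestampMs = timestampMs;
        }
    }

    // Counts attempts per (card, window) and keeps only counts greater than the threshold.
    // Result keys use the made-up format "cardNumber@windowStartMs".
    static Map<String, Long> possibleFraud(List<Attempt> attempts, long windowSizeMs, long threshold) {
        Map<String, Long> counts = new TreeMap<>();
        for (Attempt attempt : attempts) {
            // Tumbling windows do not overlap: round the timestamp down to the window start.
            long windowStart = attempt.timestampMs - (attempt.timestampMs % windowSizeMs);
            counts.merge(attempt.cardNumber + "@" + windowStart, 1L, Long::sum);
        }
        counts.values().removeIf(count -> count <= threshold);
        return counts;
    }

    public static void main(String[] args) {
        List<Attempt> attempts = new ArrayList<>();
        for (long ts : new long[] {0, 1000, 2000, 4999}) {
            attempts.add(new Attempt("1111", ts)); // four attempts inside one 5-second window
        }
        for (long ts : new long[] {4000, 4500, 5000, 5500}) {
            attempts.add(new Attempt("2222", ts)); // four attempts split across two windows
        }
        System.out.println(possibleFraud(attempts, 5000L, 3L));
    }
}
```

The second card shows why window boundaries matter: the same number of attempts stays under the threshold when it straddles two windows.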

Do you think that’s a table you are querying?

KSQL in less than 5 minutes
https://www.youtube.com/watch?v=A45uRzJiv7I

Confluent Enterprise: Logical Architecture
Kafka Cluster
Mainframe
Kafka Connect Servers
Kafka Connect
RDBMS
Hadoop
Cassandra
Elasticsearch
Kafka Connect Servers
Kafka Connect
Files
Producer
Application
Consumer
Application
Zookeeper
Kafka Broker
REST Proxy Servers
REST Proxy
REST Client
Control Center Servers
Control Center
Schema Registry Servers
Schema Registry
Kafka Producer APIs Kafka Consumer APIs
Stream Processing Application 1
Stream Client
Stream Processing Application 2
Stream Client
REST Proxy Servers
REST Proxy
REST Client

Confluent Enterprise: Physical Architecture
Rack 1
Kafka Broker #1
ToR Switch
ToR Switch
Schema Registry #1
Kafka Connect #1
Zookeeper #1
REST Proxy #1
Kafka Broker #4
Zookeeper #4
Rack 2
Kafka Broker #2
ToR Switch
ToR Switch
Schema Registry #2
Kafka Connect #2
Zookeeper #2
Kafka Broker #5
Zookeeper #5
Rack 3
Kafka Broker #3
ToR Switch
ToR Switch
Kafka Connect #3
Zookeeper #3
Core Switch Core Switch
REST Proxy #2
Load Balancer Load Balancer
Control Center #1 Control Center #2

Confluent Completes Kafka
Feature: Benefit (editions compared: Apache Kafka, Confluent Open Source, Confluent Enterprise)
Apache Kafka: High throughput, low latency, high availability, secure distributed streaming platform
Kafka Connect API: Advanced API for connecting external sources/destinations into Kafka
Kafka Streams API: Simple library that enables streaming application development within the Kafka framework
Additional Clients: Supports non-Java clients; C, C++, Python, .NET and several others
REST Proxy: Provides universal access to Kafka from any network connected device via HTTP
Schema Registry: Central registry for the format of Kafka data – guarantees all data is always consumable
Pre-Built Connectors: HDFS, JDBC, Elasticsearch, Amazon S3 and other connectors fully certified and supported by Confluent
JMS Client: Support for legacy Java Message Service (JMS) applications consuming and producing directly from Kafka
Confluent Control Center: Enables easy connector management, monitoring and alerting for a Kafka cluster
Auto Data Balancer: Rebalancing data across cluster to remove bottlenecks
Replicator: Multi-datacenter replication simplifies and automates MDC Kafka clusters
Support: Enterprise class support to keep your Kafka environment running at top performance (Apache Kafka: Community; Confluent Open Source: Community; Confluent Enterprise: 24x7x365)

Big Data and Fast Data Ecosystems
Synchronous Req/Response: 0 – 100s ms
Near Real Time: > 100s ms
Offline Batch: > 1 hour
Apache Kafka: Stream Data Platform
Search
RDBMS
Apps
Monitoring
Real-time Analytics
NoSQL
Stream Processing
Apache Hadoop: Data Lake
Impala
DWH
Hive
Spark
Map-Reduce
Confluent HDFS Connector (exactly once semantics)
https://www.confluent.io/blog/the-value-of-apache-kafka-in-big-data-ecosystem/

Building a Microservices Ecosystem with Kafka Streams and KSQL
https://www.confluent.io/blog/building-a-microservices-ecosystem-with-kafka-streams-and-ksql/
https://github.com/confluentinc/kafka-streams-examples/tree/3.3.0-post/src/main/java/io/confluent/examples/streams/microservices

Microservices: References
Blog post series:
Part 1: The Data Dichotomy: Rethinking the Way We Treat Data and Services
https://www.confluent.io/blog/data-dichotomy-rethinking-the-way-we-treat-data-and-services/
Part 2: Build Services on a Backbone of Events
https://www.confluent.io/blog/build-services-backbone-events/
Part 3: Using Apache Kafka as a Scalable, Event-Driven Backbone for Service Architectures
https://www.confluent.io/blog/apache-kafka-for-service-architectures/
Part 4: Chain Services with Exactly Once Guarantees
https://www.confluent.io/blog/chain-services-exactly-guarantees/
Part 5: Messaging as the Single Source of Truth
https://www.confluent.io/blog/messaging-single-source-truth/
Part 6: Leveraging the Power of a Database Unbundled
https://www.confluent.io/blog/leveraging-power-database-unbundled/
Part 7: Building a Microservices Ecosystem with Kafka Streams and KSQL
https://www.confluent.io/blog/building-a-microservices-ecosystem-with-kafka-streams-and-ksql/
Whitepaper:
Microservices in the Apache Kafka™ Ecosystem
https://www.confluent.io/resources/microservices-in-the-apache-kafka-ecosystem/

Apache Kafka Security
Security
• Processes customer data
• Regulatory requirements
• Legal compliance
• Internal security policies
• Need is not limited to industries such as finance, healthcare, or governmental services
Helps meet security requirements by supporting:
Authentication
• Scenario example: “Only certain applications may talk to the production Kafka cluster”
• Client authentication via SASL – e.g. Kerberos, Active Directory
Authorization
• Scenario example: “Only certain applications may read data from sensitive Kafka topics”
• Restrict who can create, write to, read from topics, and more
Encryption
• Scenario example: “Data-in-transit between apps and Kafka clusters must be encrypted”
• SSL supported
• Encrypts data exchanged between Kafka brokers, and between Kafka brokers and Kafka clients/apps
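
On the client side, authentication and encryption come down to a handful of configuration properties. The sketch below builds them with `java.util.Properties` only. The property names are standard Kafka client settings; the class `SecureClientConfigSketch`, the host name, the truststore path and the password are placeholders, and Kerberos (GSSAPI) is assumed as the SASL mechanism. Authorization is enforced on the brokers and is not shown here.

```java
import java.util.Properties;

// Hypothetical sketch: client properties for SASL authentication over an encrypted connection.
final class SecureClientConfigSketch {

    static Properties secureClientProperties(String bootstrapServers,
                                             String truststorePath,
                                             String truststorePassword) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        // SASL for authentication, SSL/TLS for encryption of data in transit.
        props.put("security.protocol", "SASL_SSL");
        // Assumption: Kerberos is the SASL mechanism in use.
        props.put("sasl.mechanism", "GSSAPI");
        props.put("sasl.kerberos.service.name", "kafka");
        // Truststore holding the certificates used to verify the brokers.
        props.put("ssl.truststore.location", truststorePath);
        props.put("ssl.truststore.password", truststorePassword);
        return props;
    }

    public static void main(String[] args) {
        // All values below are placeholders for illustration only.
        Properties props = secureClientProperties(
                "broker1.example.com:9093", "/path/to/client.truststore.jks", "changeit");
        System.out.println("security.protocol = " + props.getProperty("security.protocol"));
    }
}
```

The resulting `Properties` object would be passed to a producer, consumer or Streams application alongside its other settings.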

Enterprise Ready Multi-Datacenter Replication for Kafka
Data Center in USA
Kafka Cluster (USA): Kafka Broker 1, Kafka Broker 2, Kafka Broker 3; ZooKeeper 1, ZooKeeper 2, ZooKeeper 3
Control Center
Kafka Connect Cluster: Replicator 1, Replicator 2
Data Center in EMEA
Kafka Cluster (EU): Kafka Broker 1, Kafka Broker 2, Kafka Broker 3; ZooKeeper 1, ZooKeeper 2, ZooKeeper 3
Control Center
Kafka Connect Cluster: Replicator 1, Replicator 2
Legend: Available only with Confluent Enterprise | Apache Kafka and Confluent Open Source

Cloud Synchronization and Migrations with Confluent Enterprise: Before
DC1
DB2
DB1
DWH
App2
App3
App4
KV2
KV3
DB3
App2-v2
App5
App7
App1-v2
AWS
App8
DWH
App1
Challenges
• Each team/department must execute its own cloud migration
• May be moving the same data multiple times
• Each box represented here requires development, testing, deployment, monitoring and maintenance
KV

Cloud Synchronization and Migrations with Confluent Enterprise: After
DC1
DB2
DB1
KV
DWH
App2
App4
KV2
KV3
App2-v2
App5 App7
App1-v2
AWS
App8
DWH
App1
Kafka
Kafka
App3
Benefits
• Continuous low-latency synchronization
• Centralized manageability and monitoring
– Track at event level data produced in all data centers
• Security and governance
– Track and control where data comes from and who is accessing it
• Cost Savings
– Move Data Once
DB3

About Confluent and Apache Kafka™
70% of active Kafka Committers
Founded September 2014
Technology developed while at LinkedIn
Founded by the creators of Apache Kafka

Apache Kafka: PMC members and committers
https://kafka.apache.org/committers

Download Confluent Platform: the easiest way to get started
https://www.confluent.io/download/

Books: get all three in PDF format from the Confluent website!
https://www.confluent.io/apache-kafka-stream-processing-book-bundle

Discount code: kacom17
Presented by
https://kafka-summit.org/
