Apache Kafka

An introduction to the Kafka stream processing platform. The presentation gives a short introduction to stream processing and then explains how Kafka Streams and Kafka Connect are used together to implement real-time stream processing flows.

Published in: Data & Analytics

  1. Apache Kafka: Stream processing made easy
  2. Streams 101: An introduction to stream processing
  3. Transformation of a stream of data fragments into a continuous flow of information
  4. Stream Processing
     + Get real-time insights
     + Lower processing latency
     + Easier to test
     + Easier to maintain
     + Easier to scale
     - Different way of thinking
     - At-least-once vs exactly-once
     - Time
  5. Every company is already doing stream processing (more or less ...)
  6. A Stream
     Key 1 -> value 1
     Key 2 -> value 2
     Key 1 -> value 3
     ...
  7. A Table
     +-------+---------+
     | Key 1 | value 3 |
     | Key 2 | value 2 |
     +-------+---------+
  8. A Table through time
     +-------+---------+
     | Key 1 | value 1 |
     +-------+---------+
     Timestamp 1
     +-------+---------+
     | Key 1 | value 1 |
     | Key 2 | value 2 |
     +-------+---------+
     Timestamp 2
     +-------+---------+
     | Key 1 | value 3 |
     | Key 2 | value 2 |
     +-------+---------+
     Timestamp 3
     Timestamp ...
  9. Let’s remove the redundancy
  10. A Table through time as SETs
      SET(key1 -> value1)    Timestamp 1
      SET(key2 -> value2)    Timestamp 2
      SET(key1 -> value3)    Timestamp 3
                             Timestamp ...
  11. Changelog              Stream
      SET(key1 -> value1)    key1 -> value1
      SET(key2 -> value2)    key2 -> value2
      SET(key1 -> value3)    key1 -> value3
  12. Tables are materialized views of streams
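The stream/table duality of slides 6 to 12 can be sketched in a few lines of plain Python. This is only an illustrative model, not Kafka code: the function names `materialize` and `snapshots` are made up for this example, and the records are the three key/value pairs from the slides.

```python
from typing import Dict, Iterable, List, Tuple

Record = Tuple[str, str]  # (key, value), as on the "A Stream" slide


def materialize(changelog: Iterable[Record]) -> Dict[str, str]:
    """Fold a changelog stream into a table: the latest value per key wins."""
    table: Dict[str, str] = {}
    for key, value in changelog:
        table[key] = value  # every record behaves like SET(key -> value)
    return table


def snapshots(changelog: Iterable[Record]) -> List[Dict[str, str]]:
    """Return the table as it looked after each record ("a table through time")."""
    table: Dict[str, str] = {}
    history: List[Dict[str, str]] = []
    for key, value in changelog:
        table[key] = value
        history.append(dict(table))  # copy, so later updates do not alter it
    return history


stream = [("key1", "value1"), ("key2", "value2"), ("key1", "value3")]
print(materialize(stream))  # {'key1': 'value3', 'key2': 'value2'}
```

Replaying the same stream always rebuilds the same table, which is the sense in which a table is a materialized view of a stream.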
  13. Why is this important?
  14. "Events used to manipulate core data. Today events are our core data." (Daan Gerits, 2012)
  15. Every stream processing app is a combination of state and streams
  16. Streaming vs batch is like agile vs waterfall, but for data.
  17. Kafka: A Stream Processing Platform
  18. (diagram) Kafka Proxy, Kafka Streams, Kafka Connect, Kafka Security, Schema Repo, Kafka
  19. Kafka Platform
      Streams and Connect apps are just (Java) apps
      Streams and Connect are libraries
      Can be deployed like any other (Java) app
      Multiple instances of the same app can be launched
      Use tools like Mesos, Kubernetes, Docker Swarm, ...
  20. (diagram) Batch, Microbatch, Event: Flink, Kafka, Spark, Storm / Heron
  21. Build apps, not jobs
  22. (diagram) Kafka Proxy, Kafka Streams, Kafka Connect, Kafka Security, Schema Repo, Kafka
  23. Kafka Engine: A message broker with a twist
  24. Kafka Engine (diagram): three Producers send messages to a Topic; three Consumers receive messages from it
  25. Kafka Engine (diagram): three Producers send messages to a Topic; three Consumers receive messages from it
  26. Messages
      Contain byte arrays
      Have a
        Timestamp
        Key
        Value
  27. Topics
      Are more like datastores
      Use disk instead of memory
      Retain the messages
      Are partitioned and replicated
  28. Wait?? … Disk??
  29. Sequential disk access is fast*
      * Don’t believe me? Read http://kafka.apache.org/documentation#persistence
  30. Producer
      Puts messages onto Kafka
      Determines the partition to write to
      Can be implemented in many, many languages
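Slide 30 says the producer determines the partition to write to. A common way to do that is to hash the message key, so that all messages with the same key land in the same partition. The sketch below assumes that approach; `choose_partition` is a hypothetical helper, and `zlib.crc32` is only a stand-in hash (Kafka's own partitioner uses a different hash function).

```python
import zlib


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a message key to a partition index in [0, num_partitions)."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    # Stand-in hash: deterministic, so equal keys always map to the same partition.
    return zlib.crc32(key) % num_partitions


for key in (b"key1", b"key2", b"key1"):
    print(key, "-> partition", choose_partition(key, 3))
```

Because the mapping depends on the number of partitions, adding partitions to a topic later changes where a given key is written.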
  31. Consumer
      Gets messages from Kafka
      Can be grouped into Consumer Groups
        Allows for round robin message delivery
        Enables scaling of consumers
      Has a persisted offset per Consumer Group
        Stored in Zookeeper
        Or in Kafka
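The consumer-group ideas on slide 31 can be modelled roughly as follows. This is a simplified in-memory sketch under stated assumptions, not the Kafka client API: `TopicPartition`, `poll` and `assign_round_robin` are names invented for this example, and the offsets live in a dictionary rather than in Zookeeper or Kafka.

```python
from collections import defaultdict
from typing import Dict, List


class TopicPartition:
    """One partition modelled as an append-only list that retains its messages."""

    def __init__(self) -> None:
        self.log: List[str] = []
        # Next offset to read, tracked separately for each consumer group.
        self.offsets: Dict[str, int] = defaultdict(int)

    def append(self, message: str) -> int:
        """Store a message and return its offset."""
        self.log.append(message)
        return len(self.log) - 1

    def poll(self, group: str, max_messages: int = 10) -> List[str]:
        """Return unread messages for a group and advance that group's offset."""
        start = self.offsets[group]
        batch = self.log[start:start + max_messages]
        self.offsets[group] = start + len(batch)
        return batch


def assign_round_robin(partitions: List[int], consumers: List[str]) -> Dict[str, List[int]]:
    """Spread partitions over the consumers of one group, one at a time."""
    assignment: Dict[str, List[int]] = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        assignment[consumers[index % len(consumers)]].append(partition)
    return assignment


partition = TopicPartition()
for message in ("m0", "m1", "m2"):
    partition.append(message)

print(partition.poll("emailers", max_messages=2))  # ['m0', 'm1']
print(partition.poll("emailers"))                  # ['m2']
print(partition.poll("auditors"))                  # ['m0', 'm1', 'm2']
print(assign_round_robin([0, 1, 2, 3], ["c1", "c2"]))  # {'c1': [0, 2], 'c2': [1, 3]}
```

Because messages are retained and each group keeps its own offset, a second group can read the full log independently of the first; adding consumers to a group spreads the partitions over more workers.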
  32. Kafka Engine (diagram): a Topic split into Partition A and Partition B, each with its own Producer and Consumer
  33. 100 000 msg/sec
      On a barely tweaked, 3 node cluster
  34. 2 000 000 msg/sec
      On a heavily tweaked cluster
      https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines
  35. Kafka Connect: Getting data in and out
  36. A simple and scalable way to get data in and out of topics
  37. Kafka Connect (diagram): Datasource -> Kafka Connect -> Topic, or Topic -> Kafka Connect -> Datasink
  38. Kafka Connect (diagram): Datasource, Topic, three Kafka Connect instances
  39. Kafka Connect
      MySQL ⬢ Salesforce ⬢ Redis ⬢ MQTT ⬢ InfluxDB ⬢ RethinkDB ⬢ HBase ⬢ Solr ⬢
      Couchbase ⬢ Elasticsearch ⬢ Hazelcast ⬢ Google PubSub ⬢ HDFS ⬢ S3 ⬢ Splunk ⬢
      Spooldir ⬢ JDBC ⬢ Syslog ⬢ Cassandra ⬢ Vertica ⬢ DB2 ⬢ Goldengate ⬢ Jenkins ⬢
      PredictionIO ⬢ JMS ⬢ Twitter ⬢ Attunity ⬢ MSSQL ⬢ Postgres ⬢ DynamoDB ⬢ IRC ⬢
      Kudu ⬢ Ignite ⬢ MongoDB ⬢ Bloomberg Ticker ⬢ FTP
  40. Kafka Streams: Processing streaming data
  41. Kafka Streams (diagram): Topics -> Kafka Streams -> Topics
  42. Kafka Streams
      KStream for a stream of data
      KTable to keep the latest value for each key
      KTable state is distributed across app instances
      Transform from streams to tables and tables to streams
      Choose which field to use as “timestamp”
  43. (diagram) Topics A, B and C connected by two Kafka Connect apps and two Kafka Streams apps
  44. So how do you build solutions with this?
  45. (diagram) Kafka, Kafka Connect, Kafka Streams
  46. (diagram) Topics A, B and C with example apps: Sales JDBC (Kafka Connect), Top Products Ranker, Emailer, Low Stock Notifier, Kafka Connect App, Slack Poster
  47. Proposal
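As a rough illustration of the example on slide 46, the two stream apps named there ("Top Products Ranker" and "Low Stock Notifier") could each be seen as a fold over a stream of sales events. The slide only gives the app names, so everything else here is an assumption: the event shape `(product, quantity)`, the stock table, the threshold, and the function names. It is plain Python rather than the Kafka Streams API.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Sale = Tuple[str, int]  # hypothetical event shape: (product, quantity sold)


def top_products(sales: Iterable[Sale], n: int = 3) -> List[Tuple[str, int]]:
    """Aggregate the sales stream into per-product totals and rank them."""
    totals: Counter = Counter()
    for product, quantity in sales:
        totals[product] += quantity  # stream -> table (running total per key)
    return totals.most_common(n)


def low_stock(initial_stock: Dict[str, int], sales: Iterable[Sale], threshold: int) -> List[str]:
    """Emit each product once, when its remaining stock first drops below the threshold."""
    stock = dict(initial_stock)  # local state, like a table kept next to the stream
    alerts: List[str] = []
    for product, quantity in sales:
        stock[product] = stock.get(product, 0) - quantity
        if stock[product] < threshold and product not in alerts:
            alerts.append(product)
    return alerts


sales = [("apple", 2), ("pear", 5), ("apple", 4)]
print(top_products(sales, n=2))                        # [('apple', 6), ('pear', 5)]
print(low_stock({"apple": 7, "pear": 10}, sales, 2))   # ['apple']
```

Both functions combine a stream (the sales events) with state (totals or stock levels), which matches the earlier point that every stream processing app is a combination of state and streams.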