Apache Kafka

An introduction to the Kafka stream processing platform. The presentation gives a brief introduction to stream processing and explains how Kafka Streams and Kafka Connect are used together to implement real-time stream processing flows.

Slide notes:
  • 1-2 KB per message; 3-node cluster with 32 GB RAM and dual quad-core CPUs, tested at a customer
  • https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines

Transcript:

    1. Apache Kafka: Stream processing made easy
    2. Streams 101: An introduction to stream processing
    3. Transformation of a stream of data fragments into a continuous flow of information
    4. Stream Processing
       + Get real-time insights
       + Lower processing latency
       + Easier to test
       + Easier to maintain
       + Easier to scale
       - Different way of thinking
       - At-least-once vs exactly-once
       - Time
    5. Every company is already doing stream processing (more or less ...)
    6. A Stream
       Key 1 -> value 1
       Key 2 -> value 2
       Key 1 -> value 3
       ...
    7. A Table
       +-------+---------+
       | Key 1 | value 3 |
       | Key 2 | value 2 |
       +-------+---------+
    8. A Table through time
       Timestamp 1:            Timestamp 2:            Timestamp 3:
       +-------+---------+     +-------+---------+     +-------+---------+
       | Key 1 | value 1 |     | Key 1 | value 1 |     | Key 1 | value 3 |
       +-------+---------+     | Key 2 | value 2 |     | Key 2 | value 2 |
                               +-------+---------+     +-------+---------+
       Timestamp ...
    9. Let’s remove the redundancy
    10. A Table through time as SETs
        Timestamp 1: SET(key1 -> value1)
        Timestamp 2: SET(key2 -> value2)
        Timestamp 3: SET(key1 -> value3)
        Timestamp ...
    11. Changelog:              Stream:
        SET(key1 -> value1)     key1 -> value1
        SET(key2 -> value2)     key2 -> value2
        SET(key1 -> value3)     key1 -> value3
    12. Tables are materialized views of streams
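To make the duality concrete, here is a minimal Kafka Streams sketch (the topic name "user-updates" is made up for illustration) that reads the same topic both as a KStream, i.e. the changelog, and as a KTable, i.e. the materialized view keeping only the latest value per key:

```java
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

public class StreamTableDuality {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Read as a stream: every record is an event.
        // key1 -> value1, key2 -> value2, key1 -> value3, ...
        KStream<String, String> changelog = builder.stream("user-updates");

        // Read as a table: only the latest value per key is kept, so after
        // the three events above it holds key1 -> value3, key2 -> value2.
        KTable<String, String> latest = builder.table("user-updates");
    }
}
```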
    13. Why is this important?
    14. “Events used to manipulate core data. Today, events are our core data.” (Daan Gerits, 2012)
    15. Every stream processing app is a combination of state and streams
    16. Streaming vs batch is like agile vs waterfall, but for data.
    17. Kafka: A Stream Processing Platform
    18. [Diagram: the Kafka platform: the Kafka engine surrounded by Kafka Proxy, Kafka Streams, Kafka Connect, Kafka Security, and a Schema Repo]
    19. Kafka Platform: Streams and Connect apps are just (Java) apps; Streams and Connect are libraries; they can be deployed like any other (Java) app; multiple instances of the same app can be launched; use tools like Mesos, Kubernetes, Docker Swarm, ...
    20. [Diagram: processing-model spectrum from Batch through Microbatch (Spark) to Event-at-a-time (Flink, Kafka, Storm / Heron)]
    21. Build apps, not jobs
    22. [Diagram: the platform overview again, zooming in on the Kafka engine]
    23. Kafka Engine: A message broker with a twist
    24. [Diagram: several producers write messages to a topic; several consumers read them]
    25. [Diagram: the same producer/topic/consumer flow, repeated]
    26. Messages: contain byte arrays; have a timestamp, a key, and a value
    27. Topics: are more like datastores; use disk instead of memory; retain messages; are partitioned and replicated
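Partition and replica counts are fixed when a topic is created. A hedged sketch using the Java AdminClient (broker address, topic name, and counts are made up; the kafka-topics.sh CLI does the same job):

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // "orders": 6 partitions, each replicated to 3 brokers.
            NewTopic topic = new NewTopic("orders", 6, (short) 3);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```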
    28. Wait?? … Disk??
    29. Sequential disk access is fast* (* Don’t believe me? Read http://kafka.apache.org/documentation#persistence)
    30. Producer: puts messages onto Kafka; determines the partition to write to; can be implemented in many, many languages
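A minimal Java producer sketch (broker address and topic name are assumptions). By default the producer derives the partition from a hash of the key, so records with the same key land on the same partition:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The key ("key1") picks the partition; the value is the payload.
            producer.send(new ProducerRecord<>("orders", "key1", "value1"));
        }
    }
}
```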
    31. Consumer: gets messages from Kafka; can be grouped into consumer groups, which allow round-robin message delivery and enable scaling of consumers; has a persisted offset per consumer group, stored in Zookeeper or in Kafka
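And a matching consumer sketch: every consumer started with the same group.id joins one consumer group, partitions are shared across the group’s members, and the group’s offsets are persisted for it (in Zookeeper in older releases, in an internal Kafka topic in newer ones). The group name is hypothetical:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class SimpleConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        props.put("group.id", "order-processors");          // hypothetical consumer group
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("orders"));
            while (true) {
                // Offsets are committed automatically per group by default.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("%s -> %s%n", record.key(), record.value());
                }
            }
        }
    }
}
```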
    32. [Diagram: producers and consumers working against topic partitions A and B]
    33. 100 000 msg/sec on a barely tweaked, 3-node cluster
    34. 2 000 000 msg/sec on a heavily tweaked cluster (https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines)
    35. Kafka Connect: Getting data in and out
    36. A simple and scalable way to get data in and out of topics
    37. [Diagram: Kafka Connect reads from a datasource into a topic (source), or writes from a topic to a datasink (sink)]
    38. [Diagram: multiple Kafka Connect instances pull from the same datasource into a topic in parallel]
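Connectors are configuration, not code. As a sketch, a standalone Connect worker can be driven by one properties file per connector; the file source connector below ships with Kafka itself, while the file path and topic name are made up:

```properties
name=local-file-source
connector.class=org.apache.kafka.connect.file.FileStreamSourceConnector
tasks.max=1
file=/tmp/input.txt
topic=lines
```

Pointing bin/connect-standalone.sh at this file tails /tmp/input.txt into the "lines" topic; swapping in a sink connector config runs the topic-to-datasink direction instead.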
    39. Kafka Connect connectors: MySQL ⬢ Salesforce ⬢ Redis ⬢ MQTT ⬢ InfluxDB ⬢ RethinkDB ⬢ HBase ⬢ Solr ⬢ Couchbase ⬢ Elasticsearch ⬢ Hazelcast ⬢ Google PubSub ⬢ HDFS ⬢ S3 ⬢ Splunk ⬢ Spooldir ⬢ JDBC ⬢ Syslog ⬢ Cassandra ⬢ Vertica ⬢ DB2 ⬢ Goldengate ⬢ Jenkins ⬢ PredictionIO ⬢ JMS ⬢ Twitter ⬢ Attunity ⬢ MSSQL ⬢ Postgres ⬢ DynamoDB ⬢ IRC ⬢ Kudu ⬢ Ignite ⬢ MongoDB ⬢ Bloomberg Ticker ⬢ FTP
    40. Kafka Streams: Processing streaming data
    41. [Diagram: a Kafka Streams app reads from input topics and writes to output topics]
    42. Kafka Streams: KStream for a stream of data; KTable to keep the latest value for each key (KTable state is distributed across app instances); transform from streams to tables and tables to streams; choose which field to use as the “timestamp”
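A minimal Kafka Streams app sketch tying these pieces together (application id, broker address, and topic names are made up): it reads a KStream of click events, aggregates it into a KTable of counts per key, and streams the table’s changes back out to a topic:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class ClickCounter {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "click-counter");   // hypothetical app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Stream of click events, keyed by user id.
        KStream<String, String> clicks = builder.stream("clicks");

        // KTable holding the latest count per key; its state is
        // sharded across all running instances of this app.
        KTable<String, Long> counts = clicks.groupByKey().count();

        // Table back to stream: emit every update as a new record.
        counts.toStream().to("click-counts", Produced.with(Serdes.String(), Serdes.Long()));

        new KafkaStreams(builder.build(), props).start();
    }
}
```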
    43. [Diagram: Kafka Connect apps bring data into topics, Kafka Streams apps transform it from topic to topic, and Kafka Connect apps write the results back out]
    44. So how do you build solutions with this?
    45. [Diagram: a pipeline chaining Kafka Connect, Kafka topics, and Kafka Streams stages]
    46. [Diagram: example solution: a JDBC Kafka Connect source pulls Sales data into topics; Kafka Streams apps (Top Products Ranker, Low Stock Notifier) process them; Kafka Connect sinks (Emailer, Slack Poster) deliver the results]
    47. Proposal
