
GNW03: Stream Processing with Apache Kafka by Gwen Shapira


Gwen Shapira of Confluent presented episode #03 of the Gluent New World series and talked about stream processing in modern enterprises using Apache Kafka.

The video recording for this presentation is at: http://vimeo.com/gluent/


  1. Apache Kafka and Real-Time Stream Processing. Gwen Shapira, System Architect, Confluent. @gwenshap
  2. I’ll tell you about: • What is stream processing and why it matters • What is Apache Kafka • How Kafka helps stream processing (stay awake for this part)
  3. What is Stream Processing?
  4. Data processing paradigms: Request/Response • Batch • Stream Processing
  5. Stream Processing Paradigm • Data is generated at its own rate, as “streams” • We can process as much or as little of it as we want, continuously • Results are available in real time, but nothing waits for a specific result • Time to data availability: more than a few milliseconds, less than hours
  6. This is the world-changing bit • Most of the business is not urgent enough to require an immediate response, but can’t wait until the next day • “Streams of events” represent something fundamental, the same way relational tables are fundamental
  7. Ok, got the streams part. But what about Apache Kafka?
  8. Kafka is a cross between a messaging system and a file system.
  9. Kafka is all about LOGS
  10. If you understand logs, you understand Kafka.
  11. Redo Log: “The most crucial structure for recovery operations … store all changes made to the database as they occur.”
  12. Important point: the redo log is the only reliable source of information about the current state of the database.
  13. But logs are also a STREAM of events. And Kafka stores those logs, allowing us to read the past and keep getting updates about the future.
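
This “read the past, then keep getting the future” point can be made concrete with the plain Kafka consumer API. A minimal sketch, assuming a broker on localhost:9092 and a hypothetical topic named "events": the consumer rewinds to the beginning of the log to replay history, and the very same poll loop then keeps delivering new records as producers append them.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ReplayConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumption: local broker
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Reading the past: pin a partition and rewind to the start of the log.
            TopicPartition tp = new TopicPartition("events", 0); // "events" is a made-up topic
            consumer.assign(List.of(tp));
            consumer.seekToBeginning(List.of(tp));

            // Getting updates about the future: the same poll loop keeps
            // returning new records as they are appended to the log.
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
                }
            }
        }
    }
}
```
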
  14. Stream processing: read a stream, modify it, output another stream.
  15. Example: CDC-based ETL
  16. If we use Kafka for CDC, does that mean it is ACID?
  17. Stream processing is important, and Kafka is a collection of logs. So how does Kafka help with stream processing?
  18. First, how do we actually do stream processing?
  19. Method 1: Do it yourself (“hipster” stream processing)
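
To give the do-it-yourself approach some flavor: the whole processor can be a consume-transform-produce loop over the plain Kafka clients. A sketch under assumed names (a local broker, hypothetical topics "input-topic" and "output-topic", a trivial uppercase transform); note everything it does not handle for you: state, windowing, reprocessing, rebalancing edge cases, exactly-once delivery.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class HipsterStreamProcessor {
    public static void main(String[] args) {
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092"); // assumption: local broker
        consumerProps.put("group.id", "diy-processor");
        consumerProps.put("key.deserializer", StringDeserializer.class.getName());
        consumerProps.put("value.deserializer", StringDeserializer.class.getName());

        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092");
        producerProps.put("key.serializer", StringSerializer.class.getName());
        producerProps.put("value.serializer", StringSerializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
             KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            consumer.subscribe(List.of("input-topic")); // hypothetical topic
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                    // The "processing": a trivial transformation of each event.
                    String transformed = record.value().toUpperCase();
                    producer.send(new ProducerRecord<>("output-topic", record.key(), transformed));
                }
            }
        }
    }
}
```
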
  20. Method 2: The stream processing frameworks • Storm • Spark • Flink • Samza • Apex • NiFi • StreamBase • InfoSphere Streams • Google Dataflow (a.k.a. Beam) • I can go on for 5 more pages…
  21. A few of those are really popular! • Pro: they handle some hard problems • Con: they can be too complex
  22. What do I mean by too complex? [Architecture diagram: clients (“swipe here!”) and a web app feed events through Kafka and Flume agents into two Hadoop clusters; HBase with a local cache serves profile fetches and updates, Spark Streaming adjusts NRT stats, HDFS with Hive/Impala, MapReduce, and Spark handle batch-time adjustments and analytical pattern detection, SolR handles search, plus automated and manual review of NRT changes and counters.]
  23. Why so many moving parts? We needed… • HBase to handle complex state • HDFS, because Spark requires it • An ingest layer • A batch layer to handle re-calculations
  24. What we really wanted was… Inputs -> Kafka -> Processor -> Output
  25. Enter KafkaStreams. Three simplifications: 1. Uses Kafka 2. No framework 3. Unifies tables and streams
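
A minimal KafkaStreams application, as a sketch with assumed names (a local broker, hypothetical topics "input-topic" and "output-topic"): inputs and outputs are Kafka topics, and the whole thing is an ordinary Java program with a library dependency, with no cluster to submit jobs to.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class StreamsApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "demo-app"); // made-up app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        // Read a stream, modify it, output another stream.
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> input = builder.stream("input-topic");
        input.mapValues(value -> value.toUpperCase())
             .to("output-topic");

        // Just a library: run it like any Java program.
        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

Scaling it out is just running more copies of the same program: Kafka's consumer groups divide the input partitions among the instances.
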
  26. Doesn’t all stream processing use Kafka?
  27. We use Kafka for partitioning, scalability, and fault tolerance. [Diagram: one Kafka topic feeding two consumer groups; group A’s three instances split the partitions among themselves, while group B’s two instances independently receive the full stream.]
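
A sketch of what the diagram means in code, with made-up topic and group names: every instance of group A runs the same program with the same group.id, Kafka divides the topic's partitions among the live instances, and reassigns them if an instance dies.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class GroupAWorker {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumption: local broker
        // Same group.id = share the partitions (scalability, fault tolerance).
        // A different group.id (e.g. "group-B") independently gets every record.
        props.put("group.id", "group-A");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("events")); // hypothetical topic
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                    System.out.printf("partition=%d offset=%d%n", record.partition(), record.offset());
                }
            }
        }
    }
}
```

Run three copies of this and Kafka splits the partitions three ways; kill one and its partitions are reassigned to the survivors.
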
  28. Handling Time
  29. No framework • It is just a library that does transformations • We can add languages on top • Kafka does everything we needed the framework to do • You don’t need a framework to run queries, so why do you need one to run queries continuously?
  30. The really important bit: streams meet tables
  31. Streams: things that happen. Events. Tables: the state of things as they are.
  32. Databases: only state. Streams: only events.
  33. We can convert tables to streams and back: Stream -> Apply -> Table. Table -> Change Capture -> Stream. This is called table-stream duality.
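
The duality maps directly onto the Kafka Streams API. A sketch with hypothetical topic names, assuming String default serdes are configured: groupByKey().count() is the “apply” direction, folding a stream of events into a table of current state, and toStream() is the “change capture” direction, emitting every table update as an event.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class DualityExample {
    static Topology build() {
        StreamsBuilder builder = new StreamsBuilder();

        // Stream -> Apply -> Table: fold a stream of click events,
        // keyed by user, into the current count per user.
        KStream<String, String> clicks = builder.stream("clicks"); // hypothetical topic
        KTable<String, Long> clicksPerUser = clicks.groupByKey().count();

        // Table -> Change Capture -> Stream: every update to the table
        // becomes an event on a changelog stream.
        clicksPerUser.toStream()
                     .to("clicks-per-user-updates",
                         Produced.with(Serdes.String(), Serdes.Long()));
        return builder.build();
    }
}
```
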
  34. Streams and tables sometimes work the same way, and sometimes they are very different. KafkaStreams handles both.
  35. But… where do streams come from?
  36. We really like streams, so we created a Stream Data Platform.
  37. Where can we learn more? • http://www.confluent.io/blog • http://kafka.apache.org/documentation.html • http://docs.confluent.io/current
