
Kafka Streams: The Stream Processing Engine of Apache Kafka


This talk on Kafka Streams was presented at Big Data London 2016.

Published in: Technology

  1. Kafka Streams: The New Smart Kid On The Block. The Stream Processing Engine of Apache Kafka. Eno Thereska, eno@confluent.io, enothereska. Big Data London 2016. Slide contributions: Michael Noll
  2. Apache Kafka and Kafka Streams API
  3. What is Kafka Streams: Unix analogy
     $ cat < in.txt | grep "apache" | tr a-z A-Z > out.txt
     Diagram labels: Kafka Core, Kafka Connect, Kafka Streams
  4. When to use Kafka Streams
     • Mainstream Application Development
     • When running a cluster would suck
     • Microservices
     • Fast Data apps for small and big data
     • Large-scale continuous queries and transformations
     • Event-triggered processes
     • Reactive applications
     • The “T” in ETL
     • <and more>
     • Use case examples
       • Real-time monitoring and intelligence
       • Customer 360-degree view
       • Fraud detection
       • Location-based marketing
       • Fleet management
       • <and more>
  5. Some use cases in the wild & external articles
     • Applying Kafka Streams for internal message delivery pipeline at LINE Corp.
       • http://developers.linecorp.com/blog/?p=3960
       • Kafka Streams in production at LINE, a social platform based in Japan with 220+ million users
     • Microservices and reactive applications
       • https://speakerdeck.com/bobbycalderwood/commander-decoupled-immutable-rest-apis-with-kafka-streams
     • User behavior analysis
       • https://timothyrenner.github.io/engineering/2016/08/11/kafka-streams-not-looking-at-facebook.html
     • Containerized Kafka Streams applications in Scala
       • https://www.madewithtea.com/processing-tweets-with-kafka-streams.html
     • Geo-spatial data analysis
       • http://www.infolace.com/blog/2016/07/14/simple-spatial-windowing-with-kafka-streams/
     • Language classification with machine learning
       • https://dzone.com/articles/machine-learning-with-kafka-streams
  6. Architecture comparison: use case example
     Real-time dashboard for security monitoring: “Which of my data centers are under attack?”
  7. Architecture comparison: use case example
     Before: undue complexity, heavy footprint, many technologies, split ownership with conflicting priorities
       1 Capture business events in Kafka
       2 Must process events with separate cluster (e.g. Spark)
       3 Must share latest results through separate systems (e.g. MySQL)
       4 Other apps access latest results by querying these DBs
       Diagram labels: Your “Job”, Other App, Dashboard Frontend App, Other App
       • Conflicting priorities: infrastructure teams vs. product teams
       • Complexity: a lot of moving pieces that are also complex individually
       • Is all this a part of the solution or part of your problem?
     With Kafka Streams: simplified, app-centric architecture, puts app owners in control
       1 Capture business events in Kafka
       2 Process events with standard Java apps that use Kafka Streams
       3 Now other apps can directly query the latest results
       Diagram labels: Your App (with Kafka Streams), Other App, Dashboard Frontend App, Other App
  8. How do I install Kafka Streams?
     • There is and there should be no “installation”: Build Apps, Not Clusters!
     • It’s a library. Add it to your app like any other library.
     <dependency>
       <groupId>org.apache.kafka</groupId>
       <artifactId>kafka-streams</artifactId>
       <version>0.10.0.1</version>
     </dependency>
  9. How do I package and deploy my apps? How do I …?
     • Whatever works for you. Stick to what you/your company think is the best way.
     • Kafka Streams integrates well with what you already have.
     • Why? Because an app that uses Kafka Streams is…a normal Java app.
  10. Available APIs
  11. API option 1: Kafka Streams DSL (declarative)
      KStream<Integer, Integer> input = builder.stream("numbers-topic");

      // Stateless computation
      KStream<Integer, Integer> doubled = input.mapValues(v -> v * 2);

      // Stateful computation
      KTable<Integer, Integer> sumOfOdds = input
          .filter((k, v) -> v % 2 != 0)
          .selectKey((k, v) -> 1)
          .groupByKey()
          .reduce((v1, v2) -> v1 + v2, "sum-of-odds");

      The preferred API for most use cases. The DSL particularly appeals to users:
      • When familiar with Spark, Flink
      • When fans of Scala or functional programming
  12. API option 2: Processor API (imperative)
      class PrintToConsoleProcessor<K, V> implements Processor<K, V> {

        @Override
        public void init(ProcessorContext context) {}

        // Interface methods must be declared public in the implementing class.
        @Override
        public void process(K key, V value) {
          System.out.println("Received record with " +
              "key=" + key + " and value=" + value);
        }

        @Override
        public void punctuate(long timestamp) {}

        @Override
        public void close() {}
      }

      Full flexibility but more manual work. The Processor API appeals to users:
      • When familiar with Storm, Samza
        • Still, check out the DSL!
      • When requiring functionality that is not yet available in the DSL
  13. “My WordCount is better than your WordCount” (?)
      Code images: Kafka, Spark
      These isolated code snippets are nice (and actually quite similar) but they are not very meaningful. In practice, we also need to read data from somewhere, write data back to somewhere, etc., but we can see none of this here.
  14. WordCount in Kafka (code image: Word Count)
  15. Compared to: WordCount in Spark 2.0 (code image, callouts 1 to 3)
      Runtime model leaks into processing logic (here: interfacing from Spark with Kafka)
  16. Compared to: WordCount in Spark 2.0 (code image, callouts 4 and 5)
      Runtime model leaks into processing logic (driver vs. executors)
  17.
  18. Kafka Streams key concepts
  19. Key concepts
  20. Key concepts
  21. Key concepts (diagram labels: Kafka Core, Kafka Streams)
  22. Streams meet Tables
  23. Streams meet Tables
      http://www.confluent.io/blog/introducing-kafka-streams-stream-processing-made-simple
      http://docs.confluent.io/current/streams/concepts.html#duality-of-streams-and-tables
  24. Motivating example: continuously compute current users per geo-region
      Real-time dashboard: “How many users younger than 30y, per region?” (region counts shown: 4, 7, 5, 3, 2, 8)
      Inputs: user-locations (mobile team), user-prefs (web team)
      Table rows: alice: Asia, 25y, …; bob: Europe, 46y, …; …
  25. Motivating example: continuously compute current users per geo-region
      Real-time dashboard: “How many users younger than 30y, per region?” (region counts shown: 4, 7, 5, 3, 2, 8)
      Inputs: user-locations (mobile team), user-prefs (web team)
      New record on user-locations: alice: Europe
      Table rows: alice: Asia, 25y, …; bob: Europe, 46y, …; …
  26. Motivating example: continuously compute current users per geo-region
      Real-time dashboard: “How many users younger than 30y, per region?” (region counts shown: 4, 7, 5, 3, 2, 8)
      Inputs: user-locations (mobile team), user-prefs (web team)
      New record on user-locations: alice: Europe
      Table rows before: alice: Asia, 25y, …; bob: Europe, 46y, …; …
      Table rows after: alice: Europe, 25y, …; bob: Europe, 46y, …; …
  27. Motivating example: continuously compute current users per geo-region
      Real-time dashboard: “How many users younger than 30y, per region?”
      Region counts before: 4, 7, 5, 3, 2, 8; after: 4, 7, 6, 3, 2, 7 (Alice: -1, +1)
      Inputs: user-locations (mobile team), user-prefs (web team)
      New record on user-locations: alice: Europe
      Table rows before: alice: Asia, 25y, …; bob: Europe, 46y, …; …
      Table rows after: alice: Europe, 25y, …; bob: Europe, 46y, …; …
  28. Same data, but different use cases require different interpretations
      Records: alice: San Francisco; alice: New York City; alice: Rio de Janeiro; alice: Sydney; alice: Beijing; alice: Paris; alice: Berlin
  29. Same data, but different use cases require different interpretations
      Records: alice: San Francisco; alice: New York City; alice: Rio de Janeiro; alice: Sydney; alice: Beijing; alice: Paris; alice: Berlin
      Use case 1: Frequent traveler status?
      Use case 2: Current location?
  30. Same data, but different use cases require different interpretations
      “Alice has been to SFO, NYC, Rio, Sydney, Beijing, Paris, and finally Berlin.”
      “Alice is in SFO, NYC, Rio, Sydney, Beijing, Paris, Berlin right now.”
      Use case 1: Frequent traveler status?
      Use case 2: Current location?
  31. Streams meet Tables
      When you need… | then you’d read the Kafka topic into a | so that the topic is interpreted as a | with messages interpreted as | Example
      All the values of a key | KStream | record stream | INSERT (append) | All the places Alice has ever been to
  32. Streams meet Tables
      When you need… | then you’d read the Kafka topic into a | so that the topic is interpreted as a | with messages interpreted as | Example
      All the values of a key | KStream | record stream | INSERT (append) | All the places Alice has ever been to
      Latest value of a key | KTable | changelog stream | UPDATE (overwrite existing) | Where Alice is right now
  33. Motivating example: continuously compute current users per geo-region
      KTable<UserId, Location> userLocations = builder.table("user-locations-topic");
      KTable<UserId, Prefs> userPrefs = builder.table("user-preferences-topic");
  34. Motivating example: continuously compute current users per geo-region
      KTable<UserId, Location> userLocations = builder.table("user-locations-topic");
      KTable<UserId, Prefs> userPrefs = builder.table("user-preferences-topic");

      // Merge into detailed user profiles (continuously updated)
      KTable<UserId, UserProfile> userProfiles =
          userLocations.join(userPrefs, (loc, prefs) -> new UserProfile(loc, prefs));

      New record on user-locations: alice: Europe
      KTable userProfiles before: alice: Asia, 25y, …; bob: Europe, 46y, …; …
      KTable userProfiles after: alice: Europe, 25y, …; bob: Europe, 46y, …; …
  35. Motivating example: continuously compute current users per geo-region
      KTable<UserId, Location> userLocations = builder.table("user-locations-topic");
      KTable<UserId, Prefs> userPrefs = builder.table("user-preferences-topic");

      // Merge into detailed user profiles (continuously updated)
      KTable<UserId, UserProfile> userProfiles =
          userLocations.join(userPrefs, (loc, prefs) -> new UserProfile(loc, prefs));

      // Compute per-region statistics (continuously updated).
      // The result is keyed by region, so the key type is Location, not UserId.
      KTable<Location, Long> usersPerRegion = userProfiles
          .filter((userId, profile) -> profile.age < 30)
          // KTable.groupBy re-keys each row, so the mapper returns a KeyValue pair
          .groupBy((userId, profile) -> KeyValue.pair(profile.location, profile))
          .count();

      New record on user-locations: alice: Europe
      KTable usersPerRegion before: Africa 3; …; Asia 8; Europe 5
      KTable usersPerRegion after: Africa 3; …; Asia 7; Europe 6
  36. Motivating example: continuously compute current users per geo-region
      Real-time dashboard: “How many users younger than 30y, per region?”
      Region counts before: 4, 7, 5, 3, 2, 8; after: 4, 7, 6, 3, 2, 7 (Alice: -1, +1)
      Inputs: user-locations (mobile team), user-prefs (web team)
      New record on user-locations: alice: Europe
      Table rows before: alice: Asia, 25y, …; bob: Europe, 46y, …; …
      Table rows after: alice: Europe, 25y, …; bob: Europe, 46y, …; …
  37. Streams meet Tables: in the Kafka Streams DSL
  38. Kafka Streams key features
  39. Key features in 0.10
      • Native, 100%-compatible Kafka integration
  40. Native, 100% compatible Kafka integration (diagram labels: Read from Kafka, Write to Kafka)
  41. Key features in 0.10
      • Native, 100%-compatible Kafka integration
      • Secure stream processing using Kafka’s security features
      • Elastic and highly scalable
      • Fault-tolerant
  42. Scalability, fault tolerance, elasticity
  43. Scalability, fault tolerance, elasticity
  44. Scalability, fault tolerance, elasticity
  45. Scalability, fault tolerance, elasticity
  46. Key features in 0.10
      • Native, 100%-compatible Kafka integration
      • Secure stream processing using Kafka’s security features
      • Elastic and highly scalable
      • Fault-tolerant
      • Stateful and stateless computations
  47. Stateful computations
      • Stateful computations like aggregations or joins require state
        • We already showed a join example in the previous slides.
        • Windowing a stream is stateful, too, but let’s ignore this for now.
        • Example: count() will cause the creation of a state store to keep track of counts
      • State stores in Kafka Streams
        • … are per stream task for isolation (think: share-nothing)
        • … are local for best performance
        • … are replicated to Kafka for elasticity and for fault-tolerance
      • Pluggable storage engines
        • Default: RocksDB (key-value store) to allow for local state that is larger than available RAM
        • Further built-in options available: in-memory store
        • You can also use your own, custom storage engine
  48. State management with built-in fault-tolerance
      State stores (This is a bit simplified.)
  49. State management with built-in fault-tolerance
      State stores (This is a bit simplified.)
      Records shown: charlie 3; bob 1; alice 1; alice 2
  50. State management with built-in fault-tolerance
      State stores (This is a bit simplified.)
  51. State management with built-in fault-tolerance
      State stores (This is a bit simplified.)
      Records shown: alice 1; alice 2
  52. Key features in 0.10
      • Native, 100%-compatible Kafka integration
      • Secure stream processing using Kafka’s security features
      • Elastic and highly scalable
      • Fault-tolerant
      • Stateful and stateless computations
      • Interactive queries
  53. Interactive Queries
      Before (0.10.0)
        1 Capture business events in Kafka
        2 Process the events with Kafka Streams
        ! Must use external systems to share latest results
        4 Other apps query external systems for latest results
      After (0.10.1): simplified, more app-centric architecture
        1 Capture business events in Kafka
        2 Process the events with Kafka Streams
        3 Now other apps can directly query the latest results
  54. Key features in 0.10
      • Native, 100%-compatible Kafka integration
      • Secure stream processing using Kafka’s security features
      • Elastic and highly scalable
      • Fault-tolerant
      • Stateful and stateless computations
      • Interactive queries
      • Time model
      • Windowing
      • Supports late-arriving and out-of-order data
      • Millisecond processing latency, no micro-batching
      • At-least-once processing guarantees (exactly-once is in the works as we speak)
  55. Wrapping Up
  56. Where to go from here
      • Kafka Streams is available in Confluent Platform 3.0 and in Apache Kafka 0.10
        • http://www.confluent.io/download
      • Kafka Streams demos: https://github.com/confluentinc/examples
        • Java 7, Java 8+ with lambdas, and Scala
        • WordCount, Interactive Queries, Joins, Security, Windowing, Avro integration, …
      • Confluent documentation: http://docs.confluent.io/current/streams/
        • Quickstart, Concepts, Architecture, Developer Guide, FAQ
      • Recorded talks
        • Introduction to Kafka Streams: http://www.youtube.com/watch?v=o7zSLNiTZbA
        • Application Development and Data in the Emerging World of Stream Processing (higher level talk): https://www.youtube.com/watch?v=JQnNHO5506w
  57. Thank You
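
The two readings of a topic from slides 31 and 32 (record stream vs. changelog stream) can be sketched in plain Java. This is only an illustration of the semantics and does not use the Kafka Streams API: the class StreamTableDuality, the Rec type, and the methods asStream and asTable are hypothetical names introduced here, and the "topic" is assumed to be an in-memory list of keyed records.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only; none of these names come from Kafka Streams.
class StreamTableDuality {

    // One keyed record, e.g. ("alice", "Berlin"). Hypothetical helper type.
    static final class Rec {
        final String key;
        final String value;

        Rec(String key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    // Record-stream reading: each record is an INSERT, so every value of a key is kept.
    static Map<String, List<String>> asStream(List<Rec> topic) {
        Map<String, List<String>> all = new LinkedHashMap<>();
        for (Rec r : topic) {
            all.computeIfAbsent(r.key, k -> new ArrayList<>()).add(r.value);
        }
        return all;
    }

    // Changelog reading: each record is an UPDATE, so only the latest value of a key is kept.
    static Map<String, String> asTable(List<Rec> topic) {
        Map<String, String> latest = new LinkedHashMap<>();
        for (Rec r : topic) {
            latest.put(r.key, r.value);
        }
        return latest;
    }

    public static void main(String[] args) {
        List<Rec> topic = Arrays.asList(
            new Rec("alice", "San Francisco"),
            new Rec("alice", "Paris"),
            new Rec("alice", "Berlin"));

        // Use case 1: all the places Alice has ever been to.
        System.out.println(asStream(topic));
        // Use case 2: where Alice is right now.
        System.out.println(asTable(topic));
    }
}
```

The same list of records feeds both methods; only the interpretation differs, which is the point the slides make about choosing KStream or KTable for a given use case.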

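
The "-1 / +1" dashboard update from slides 27 and 36 can also be sketched in plain Java. When a user's row changes region, the old region's count is decremented and the new region's count is incremented. This is an assumption-laden simplification rather than how Kafka Streams is implemented: UsersPerRegion, update, and count are hypothetical names, the age filter from the slides is left out, and all state lives in in-memory maps.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration only; none of these names come from Kafka Streams.
class UsersPerRegion {

    // "Table" of the latest region per user (the changelog reading of user-locations).
    private final Map<String, String> regionByUser = new HashMap<>();
    // Derived "table" of user counts per region.
    private final Map<String, Long> countByRegion = new TreeMap<>();

    // Apply one location update: retract the old row (-1), then add the new row (+1).
    void update(String user, String newRegion) {
        String oldRegion = regionByUser.put(user, newRegion);
        if (oldRegion != null) {
            countByRegion.merge(oldRegion, -1L, Long::sum);
        }
        countByRegion.merge(newRegion, 1L, Long::sum);
    }

    // Current count for a region; 0 if the region has never been seen.
    long count(String region) {
        return countByRegion.getOrDefault(region, 0L);
    }

    public static void main(String[] args) {
        UsersPerRegion dashboard = new UsersPerRegion();
        dashboard.update("alice", "Asia");
        dashboard.update("bob", "Europe");
        System.out.println("Asia=" + dashboard.count("Asia") + " Europe=" + dashboard.count("Europe"));

        // Alice moves: Asia gets -1, Europe gets +1.
        dashboard.update("alice", "Europe");
        System.out.println("Asia=" + dashboard.count("Asia") + " Europe=" + dashboard.count("Europe"));
    }
}
```

Keeping the previous value per user is what makes the retraction possible; without it, a move would only ever add to the new region and the counts would drift upward.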