User Behavior Analysis with Session Windows and Apache Kafka's Streams API

For many industries, the need to group related events based on a period of activity or inactivity is key. Advertising businesses and content producers are just a few examples of where session windows can be used to better understand user behavior.

While such sessionization has been possible in Apache Kafka up to this point, implementing it has been rather complex and required low-level APIs. In the most recent release of Kafka, however, new capabilities have been added that make session windows much easier to implement.

In this online talk, we’ll introduce the concept of a session window, talk about common use cases, and walk through how Apache Kafka can be used for session-oriented use cases.

Published in: Software

  1. 1. 1 User behavior analysis with Session Windows and Apache Kafka’s Streams API Michael G. Noll Product Manager
  2. 2. 2 Attend the whole series! • Simplify Governance for Streaming Data in Apache Kafka Date: Thursday, April 6, 2017 Time: 9:30 am - 10:00 am PT | 12:30 pm - 1:00 pm ET Speaker: Gwen Shapira, Product Manager, Confluent • Using Apache Kafka to Analyze Session Windows Date: Thursday, March 30, 2017 Time: 9:30 am - 10:00 am PT | 12:30 pm - 1:00 pm ET Speaker: Michael Noll, Product Manager, Confluent • Monitoring and Alerting Apache Kafka with Confluent Control Center Date: Thursday, March 16, 2017 Time: 9:30 am - 10:00 am PT | 12:30 pm - 1:00 pm ET Speaker: Nick Dearden, Director, Engineering and Product • Data Pipelines Made Simple with Apache Kafka Date: Thursday, March 23, 2017 Time: 9:30 am - 10:00 am PT | 12:30 pm - 1:00 pm ET Speaker: Ewen Cheslack-Postava, Engineer, Confluent • What’s New in Apache Kafka 0.10.2 and Confluent 3.2 Date: Thursday, March 9, 2017 Time: 9:30 am - 10:00 am PT | 12:30 pm - 1:00 pm ET Speaker: Clarke Patterson, Senior Director, Product Marketing • https://www.confluent.io/online-talk/online-talk-series-five-steps-to-production-with-apache-kafka/
  3. 3. 3 Kafka Streams API: to build real-time apps that power your core business Key benefits • Makes your Java apps highly scalable, elastic, fault-tolerant, stateful, distributed • No additional cluster • Easy to run as a service • Supports large aggregations and joins • Security and permissions fully integrated from Kafka Example use cases • Microservices • Reactive applications • Continuous queries • Continuous transformations • Event-triggered processes [Diagram: your app runs as Streams API app instances 1 ... N alongside the Kafka cluster]
  4. 4. 4 Use case examples, by industry • Travel: Build applications with the Kafka Streams API to make real-time decisions to find the most suitable pricing for individual customers, to cross-sell additional services, and to process bookings and reservations • Finance: Build applications to aggregate data sources for real-time views of potential exposures and for detecting and minimizing fraudulent transactions • Logistics: Build applications to track shipments fast, reliably, and in real-time • Retail: Build applications to decide in real-time on next best offers, personalized promotions, pricing, and inventory management • Automotive, Manufacturing: Build applications to ensure production lines perform optimally, to gain real-time insights into supply chains, and to monitor telemetry data from connected cars to decide if an inspection is needed • And many more …
  5. 5. 5 Some public use cases in the wild • Why Kafka Streams: towards a real-time streaming architecture, by Sky Betting and Gaming • http://engineering.skybettingandgaming.com/2017/01/23/streaming-architectures/ • Applying Kafka’s Streams API for social messaging at LINE Corp. • http://developers.linecorp.com/blog/?p=3960 • Production pipeline at LINE, a social platform based in Japan with 220+ million users • Microservices and Reactive Applications at Capital One • https://speakerdeck.com/bobbycalderwood/commander-decoupled-immutable-rest-apis-with-kafka-streams • Containerized Kafka Streams applications in Scala, by Hive Streaming • https://www.madewithtea.com/processing-tweets-with-kafka-streams.html • Geo-spatial data analysis • http://www.infolace.com/blog/2016/07/14/simple-spatial-windowing-with-kafka-streams/ • Language classification with machine learning • https://dzone.com/articles/machine-learning-with-kafka-streams
  6. 6. 6 Kafka Summit NYC, May 09 Here, the community will share the latest Kafka Streams use cases. http://kafka-summit.org/
  7. 7. 7 Agenda • Why are session windows so important? • Recap: What is windowing? • Session windows – example use case • Session windows – how they work • Session windows – API
  8. 8. 8 Why are session windows so important? • We want to analyze user behavior, which is a very common use case area • To analyze user behavior on newspapers, social platforms, video sharing sites, booking sites, etc. • AND tailor the analysis to the individual user • Specifically, analyses of the type “how many X in one go?” – how many movies watched in one go? • Achieved through a per-user sessionization step on the input data. • AND this tailoring must be convenient and scalable • Achieved through automating the sessionization step, i.e. auto-discovery of sessions • Session-based analyses can range from simple metrics (e.g. count of user visits on a news website or social platform) to more complex metrics (e.g. customer conversion funnel and event flows).
  9. 9. 9 What is windowing? • Aggregations such as “counting things” are key-based operations • Before you can aggregate your input data, it must first be grouped by key [Figure: events on an event-time axis (6 AM to 8 AM), first ungrouped, then grouped by user: Alice, Bob, Dave]
  10. 10. 10 What is windowing? • Aggregations such as “counting things” are key-based operations • “Let me COUNT how many movies each user has watched (IN TOTAL)” • Alice: 10 movies, Bob: 11 movies, Dave: 8 movies [Figure: events per user (Alice, Bob, Dave) on an event-time axis, Feb 5 to Feb 7]
  11. 11. 11 What is windowing? • Windowing allows you to further “sub-group” the input data for each user • “Let me COUNT how many movies each user has watched PER DAY” • Feb 5: Alice: 4 movies, Bob: 3 movies, Dave: 2 movies [Figure: events per user on an event-time axis, Feb 5 to Feb 7]
  12. 12. 12 What is windowing? • Windowing allows you to further “sub-group” the input data for each user • “Let me COUNT how many movies each user has watched PER DAY” • Feb 6: Alice: 1 movie, Bob: 2 movies, Dave: 4 movies [Figure: events per user on an event-time axis, Feb 5 to Feb 7]
  13. 13. 13 What is windowing? • Windowing allows you to further “sub-group” the input data for each user • “Let me COUNT how many movies each user has watched PER DAY” • Feb 7: Alice: 4 movies, Bob: 4 movies, Dave: 1 movie [Figure: events per user on an event-time axis, Feb 5 to Feb 7]
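The difference between counting IN TOTAL and counting PER DAY can be sketched in plain Java. This is only an illustration of the grouping idea, not the Kafka Streams API: the class `WindowedCountSketch`, its methods, and the sample data are hypothetical, and the sketch assumes one-day windows aligned to the epoch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch only: plain Java, not the Kafka Streams API.
class WindowedCountSketch {
    static final long DAY_MS = 24L * 60 * 60 * 1000;

    // A movie view event: who watched, and when (event time).
    static final class View {
        final String user;
        final long timestampMs;

        View(String user, long timestampMs) {
            this.user = user;
            this.timestampMs = timestampMs;
        }
    }

    // "COUNT ... IN TOTAL": group by key (user) only.
    static Map<String, Long> countInTotal(List<View> views) {
        Map<String, Long> counts = new TreeMap<>();
        for (View v : views) {
            counts.merge(v.user, 1L, Long::sum);
        }
        return counts;
    }

    // "COUNT ... PER DAY": group by key, then sub-group by the fixed-size
    // window that contains the event time (assumption: windows start at
    // multiples of windowSizeMs).
    static Map<String, Map<Long, Long>> countPerWindow(List<View> views, long windowSizeMs) {
        Map<String, Map<Long, Long>> counts = new TreeMap<>();
        for (View v : views) {
            long windowStart = v.timestampMs - (v.timestampMs % windowSizeMs);
            counts.computeIfAbsent(v.user, u -> new TreeMap<>())
                  .merge(windowStart, 1L, Long::sum);
        }
        return counts;
    }

    // Hypothetical sample data, not the numbers from the slides.
    static List<View> sampleViews() {
        List<View> views = new ArrayList<>();
        views.add(new View("alice", 0 * DAY_MS + 1000));
        views.add(new View("alice", 0 * DAY_MS + 5000));
        views.add(new View("alice", 1 * DAY_MS + 1000));
        views.add(new View("bob", 0 * DAY_MS + 2000));
        views.add(new View("bob", 2 * DAY_MS + 3000));
        return views;
    }

    public static void main(String[] args) {
        System.out.println("In total: " + countInTotal(sampleViews()));
        System.out.println("Per day:  " + countPerWindow(sampleViews(), DAY_MS));
    }
}
```

The per-day result is keyed by user and then by window start, which mirrors the idea that a windowed aggregation has a composite key of (user, window).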
  14. 14. 14 Session windows: use case • Session windows allow for “how many X in one go?” analyses, tailored to each key • Sessions are auto-discovered from the input data (we see how later) • “Let me COUNT how many movies each user has watched PER SESSION” • Alice: 1, 4, 1, 4 movies (4 sessions), Bob: 4, 6 movies (2 sessions), Dave: 3, 5 movies (2 sessions) [Figure: events per user on an event-time axis, Feb 5 to Feb 7]
  15. 15. 15 Comparing results • Let’s compare how results differ • IN TOTAL (no windows): Alice 10, Bob 11, Dave 8 • PER DAY (time windows): Alice 3.0 (avg), Bob 3.0 (avg), Dave 2.3 (avg) • PER SESSION (session windows): Alice 2.5 (avg), Bob 5.0 (avg), Dave 4.0 (avg)
  16. 16. 16 Comparing results • Let’s compare how results differ if our task were to rank the top users • IN TOTAL (no windows): Alice #2, Bob #1, Dave #3 • PER DAY (time windows): Alice #1, Bob #1, Dave #3 • PER SESSION (session windows): Alice #3, Bob #1, Dave #2
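The PER SESSION averages and ranks can be reproduced from the per-session movie counts on the use-case slide (Alice: 1, 4, 1, 4; Bob: 4, 6; Dave: 3, 5). The following is a minimal worked example in plain Java; the class and method names are hypothetical and chosen for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: recomputes the PER SESSION figures from the slides.
class SessionRankingSketch {

    // Average number of movies per session for one user.
    // Assumes at least one session; an empty array would yield NaN.
    static double average(int[] moviesPerSession) {
        double sum = 0;
        for (int count : moviesPerSession) {
            sum += count;
        }
        return sum / moviesPerSession.length;
    }

    // Users ordered from highest to lowest per-session average.
    static List<String> rankByAverage(Map<String, int[]> sessionsPerUser) {
        List<String> users = new ArrayList<>(sessionsPerUser.keySet());
        users.sort((a, b) -> Double.compare(
                average(sessionsPerUser.get(b)), average(sessionsPerUser.get(a))));
        return users;
    }

    // Per-session movie counts as listed on the use-case slide.
    static Map<String, int[]> slideData() {
        Map<String, int[]> data = new LinkedHashMap<>();
        data.put("Alice", new int[] {1, 4, 1, 4});
        data.put("Bob", new int[] {4, 6});
        data.put("Dave", new int[] {3, 5});
        return data;
    }

    public static void main(String[] args) {
        Map<String, int[]> data = slideData();
        for (Map.Entry<String, int[]> entry : data.entrySet()) {
            System.out.println(entry.getKey() + ": " + average(entry.getValue()));
        }
        System.out.println("Ranking: " + rankByAverage(data));
    }
}
```

Alice watches the second-most movies in total, yet ranks last per session because her views are spread over four short sessions.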
  17. 17. 17 Session windows: how they work
  18. 18. 18 Session windows: how they work • The definition of a session in the Kafka Streams API is based on a configurable period of inactivity • Example: “If Alice hasn’t watched another movie in the past 3 hours, then next movie = new session!” [Figure label: inactivity period]
  19. 19. 19 Auto-discovering sessions, per user [Figure: events per user (Alice, Bob, Dave) on an event-time axis]
  20. 20. 20 Auto-discovering sessions, per user • Example: “How many movies does Alice watch on average per session?” [Figure: events per user on an event-time axis, with the inactivity period (e.g. 3 hours) marked]
  21. 21. 21 Auto-discovering sessions, per user • Example: “How many movies does Alice watch on average per session?” [Figure: events per user on an event-time axis]
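The inactivity rule above can be sketched for a single user in plain Java. This is not how Kafka Streams implements session windows internally; it is a minimal illustration that assumes all of the user's event times are available up front, and that a new session starts only when the distance to the previous event is strictly greater than the inactivity gap. The class `SessionizeSketch` and the sample times are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: plain Java, not the Kafka Streams implementation.
class SessionizeSketch {

    // One discovered session: first and last event time, plus the number of events.
    static final class Session {
        final long startMs;
        long endMs;
        int count;

        Session(long firstEventMs) {
            this.startMs = firstEventMs;
            this.endMs = firstEventMs;
            this.count = 1;
        }
    }

    // Splits one user's event times into sessions separated by inactivity.
    static List<Session> sessionize(long[] eventTimesMs, long inactivityGapMs) {
        long[] sorted = eventTimesMs.clone();
        Arrays.sort(sorted);
        List<Session> sessions = new ArrayList<>();
        Session current = null;
        for (long ts : sorted) {
            if (current == null || ts - current.endMs > inactivityGapMs) {
                // Inactive for longer than the gap: this event opens a new session.
                current = new Session(ts);
                sessions.add(current);
            } else {
                // Still within the gap: extend the current session.
                current.endMs = ts;
                current.count++;
            }
        }
        return sessions;
    }

    // Average number of events per session (assumes at least one session).
    static double averageEventsPerSession(List<Session> sessions) {
        int total = 0;
        for (Session s : sessions) {
            total += s.count;
        }
        return (double) total / sessions.size();
    }

    public static void main(String[] args) {
        long hour = 60L * 60 * 1000;
        // Hypothetical view times for Alice, with a 3-hour inactivity gap.
        long[] aliceViews = {0 * hour, 1 * hour, 2 * hour, 8 * hour, 9 * hour};
        List<Session> sessions = sessionize(aliceViews, 3 * hour);
        System.out.println(sessions.size() + " sessions, average "
                + averageEventsPerSession(sessions) + " movies per session");
    }
}
```

With these sample times, the six-hour pause between the third and fourth view splits Alice's activity into two sessions of 3 and 2 movies.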
  22. 22. 22 Late-arriving data is handled transparently • Handling of late-arriving data is important because, in practice, a lot of data arrives late
  23. 23. 23 Late-arriving data: example • Users with mobile phones enter an airplane and lose Internet connectivity • Emails are being written during the 8h flight • Internet connectivity is restored; phones will send queued emails now, though with an 8h delay • Bob writes Alice an email at 2 P.M. • Bob’s email is finally being sent at 10 P.M.
  24. 24. 24 Late-arriving data is handled transparently • Handling of late-arriving data is important because, in practice, a lot of data arrives late • Good news: late-arriving data is handled transparently and efficiently for you • Also, in your applications, you can define a grace period after which late-arriving data will be discarded (default: 1 day), and you can define this granularly per windowed operation • Example: “I want to sessionize the input data based on 15-min inactivity periods, and late-arriving data should be discarded if it is more than 12 hours late”
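The idea of discarding data that is too late can be sketched as follows. This is a deliberate simplification and not the exact rule Kafka Streams applies: the sketch assumes that "stream time" is the highest event time seen so far and that an event is dropped when it lags stream time by more than the grace period. The class `GracePeriodSketch` is hypothetical.

```java
// Illustrative sketch only: a simplified lateness check, not Kafka Streams' exact rule.
class GracePeriodSketch {
    private final long gracePeriodMs;
    // Highest event time observed so far (assumption: this stands in for "stream time").
    private long streamTimeMs = Long.MIN_VALUE;

    GracePeriodSketch(long gracePeriodMs) {
        this.gracePeriodMs = gracePeriodMs;
    }

    // Returns true if the event should be processed, false if it is too late.
    boolean accept(long eventTimeMs) {
        streamTimeMs = Math.max(streamTimeMs, eventTimeMs);
        long latenessMs = streamTimeMs - eventTimeMs;
        return latenessMs <= gracePeriodMs;
    }

    public static void main(String[] args) {
        long hour = 60L * 60 * 1000;
        GracePeriodSketch filter = new GracePeriodSketch(12 * hour);
        // An on-time event at 10 P.M. advances stream time.
        System.out.println(filter.accept(22 * hour));
        // Bob's email, written at 2 P.M., arrives 8 hours late: still accepted.
        System.out.println(filter.accept(14 * hour));
        // An event from 9 A.M. is 13 hours late: discarded.
        System.out.println(filter.accept(9 * hour));
    }
}
```

A longer grace period tolerates more lateness but also means results for a given session can keep changing for longer.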
  25. 25. 25 Late-arriving data is handled transparently • Late-arriving data may (1) create new sessions or (2) merge existing sessions [Figure: events per user (Alice, Bob, Dave) on an event-time axis]
  26. 26. 26 Sessions potentially merge as new events arrive Session Window
  27. 27. 27 Sessions potentially merge as new events arrive Session Window
  28. 28. 28 Late-arriving data is handled transparently [Figure: events per user (Alice, Bob, Dave) on an event-time axis]
  29. 29. 29 Late-arriving data is handled transparently [Figure: events per user (Alice, Bob, Dave) on an event-time axis]
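Both effects of late-arriving data, creating a new session or merging existing ones, can be sketched with a small in-memory session store for one user. This is an illustration under stated assumptions, not the Kafka Streams implementation: the class `SessionMergeSketch` is hypothetical, events may arrive in any order, and an event joins every session that lies within the inactivity gap of its timestamp.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;

// Illustrative sketch only: plain Java, not the Kafka Streams implementation.
class SessionMergeSketch {

    static final class Session {
        long startMs;
        long endMs;
        int count;

        Session(long startMs, long endMs, int count) {
            this.startMs = startMs;
            this.endMs = endMs;
            this.count = count;
        }
    }

    private final long gapMs;
    private final List<Session> sessions = new ArrayList<>();

    SessionMergeSketch(long gapMs) {
        this.gapMs = gapMs;
    }

    // Adds one event, which may arrive out of order.
    void add(long eventTimeMs) {
        // Start with a session that contains only the new event.
        Session merged = new Session(eventTimeMs, eventTimeMs, 1);
        Iterator<Session> it = sessions.iterator();
        while (it.hasNext()) {
            Session s = it.next();
            boolean withinGap = eventTimeMs >= s.startMs - gapMs
                    && eventTimeMs <= s.endMs + gapMs;
            if (withinGap) {
                // The event bridges into this session: absorb it.
                merged.startMs = Math.min(merged.startMs, s.startMs);
                merged.endMs = Math.max(merged.endMs, s.endMs);
                merged.count += s.count;
                it.remove();
            }
        }
        // If nothing was absorbed, this is a brand-new session.
        sessions.add(merged);
    }

    // Current sessions, ordered by start time.
    List<Session> sessions() {
        List<Session> copy = new ArrayList<>(sessions);
        copy.sort(Comparator.comparingLong((Session s) -> s.startMs));
        return copy;
    }

    public static void main(String[] args) {
        long hour = 60L * 60 * 1000;
        SessionMergeSketch store = new SessionMergeSketch(3 * hour);
        store.add(1 * hour);
        store.add(2 * hour);
        store.add(8 * hour);
        System.out.println("Before late event: " + store.sessions().size() + " sessions");
        store.add(5 * hour); // late event that lies within the gap of both sessions
        System.out.println("After late event:  " + store.sessions().size() + " sessions");
    }
}
```

In the `main` example, the late event at hour 5 is within three hours of both existing sessions, so the two sessions collapse into one.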
  30. 30. 30 Session windows: API
  31. 31. 31 Session windows: API in Confluent 3.2 / Apache Kafka 0.10.2
      Defining a session window:
          // A session window with an inactivity gap of 3h; discard data that is 12h late
          SessionWindows.with(TimeUnit.HOURS.toMillis(3)).until(TimeUnit.HOURS.toMillis(12));
      Full example: aggregating with session windows:
          // Key (String) is user, value (Avro record) is the movie view event for that user.
          KStream<String, GenericRecord> movieViews = ...;
          // Count movie views per session, per user
          KTable<Windowed<String>, Long> sessionizedMovieCounts = movieViews
              .groupByKey(Serdes.String(), genericAvroSerde)
              .count(SessionWindows.with(TimeUnit.HOURS.toMillis(3)), "views-per-session");
      More details with documentation and examples at: http://docs.confluent.io/current/streams/developer-guide.html#session-windows https://github.com/confluentinc/examples
  33. 33. 33 Why Confluent? More than just enterprise software • Confluent Platform: The only enterprise open source streaming platform based entirely on Apache Kafka • Professional Services: Best-practice consultation for future Kafka deployments and optimization of performance and scalability for existing ones • Enterprise Support: 24x7 support for the entire Apache Kafka project, not just a portion of it • Complete support across the entire adoption lifecycle • Kafka Training: Comprehensive hands-on courses for developers and operators from the Apache Kafka experts
  34. 34. 34 Get Started with Apache Kafka Today! https://www.confluent.io/downloads/ THE place to start with Apache Kafka! • Thoroughly tested and quality assured • More extensible developer experience • Easy upgrade path to Confluent Enterprise
  35. 35. 35 Discount code: kafcom17 • Use the Apache Kafka community discount code to get $50 off • www.kafka-summit.org • Kafka Summit New York: May 8 • Kafka Summit San Francisco: August 28
