For many industries, grouping related events by periods of activity or inactivity is key. Advertising businesses and content producers are just two examples of where session windows can be used to better understand user behavior.
While such sessionization has been possible in Apache Kafka up to this point, implementing it has been rather complex and required low-level APIs. The most recent release of Kafka, however, adds new capabilities that make session windows much easier to implement.
In this online talk, we’ll introduce the concept of a session window, talk about common use cases, and walk through how Apache Kafka can be used for session-oriented use cases.
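To make the idea of a session window concrete, here is a minimal sketch in plain Python. It is not the Kafka Streams API: it groups a small batch of events per user and starts a new session whenever the pause between two consecutive events exceeds an inactivity gap. The function name `sessionize`, the `(user_id, timestamp)` event shape, and the 30-second gap are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Event = Tuple[str, int]         # (user_id, timestamp in seconds) - hypothetical shape
Session = Tuple[int, int, int]  # (start, end, event_count)


def sessionize(events: Iterable[Event], gap: int) -> Dict[str, List[Session]]:
    """Group each user's events into sessions separated by more than `gap` seconds."""
    by_user: Dict[str, List[int]] = defaultdict(list)
    for user, ts in events:
        by_user[user].append(ts)

    sessions: Dict[str, List[Session]] = {}
    for user, stamps in by_user.items():
        stamps.sort()  # tolerate out-of-order input within this batch
        user_sessions: List[Session] = []
        start = end = stamps[0]
        count = 1
        for ts in stamps[1:]:
            if ts - end > gap:
                # Inactivity exceeded the gap: close the current session.
                user_sessions.append((start, end, count))
                start, count = ts, 0
            end = ts
            count += 1
        user_sessions.append((start, end, count))
        sessions[user] = user_sessions
    return sessions


events = [("alice", 0), ("alice", 20), ("bob", 5),
          ("alice", 100), ("bob", 400), ("alice", 110)]
result = sessionize(events, gap=30)
```

With these sample events, each user ends up with two sessions. A streaming engine has to do the same grouping incrementally and merge sessions when late events arrive, which is part of what makes a hand-rolled implementation complex.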
Kafka Streams Windowing Behind the Curtain, confluent
Kafka Streams Windowing Behind the Curtain, Neil Buesing, Principal Solutions Architect, Rill
https://www.meetup.com/TwinCities-Apache-Kafka/events/279316299/
The Rise of Data in Motion in the Healthcare Industry - Use Cases, Architectures and Examples powered by Apache Kafka.
Use Cases for Data in Motion in the Healthcare Industry:
- Know Your Patient (= “Customer 360”)
- Operations (Healthcare 4.0 including Drug R&D, Patient Care, etc.)
- IT Perspective (Cybersecurity, Mainframe Offload, Hybrid Cloud, Streaming ETL, etc.)
Real-world examples include Covid-19 Electronic Lab Reporting, Cerner, Optum, Centene, Humana, Invitae, Bayer, Celmatix, Care.com.
Kafka for Real-Time Replication between Edge and Hybrid Cloud, Kai Wähner
Not all workloads allow cloud computing. Low latency, cybersecurity, and cost-efficiency require a suitable combination of edge computing and cloud integration.
This session explores architectures and design patterns for software and hardware considerations to deploy hybrid data streaming with Apache Kafka anywhere. A live demo shows data synchronization from the edge to the public cloud across continents with Kafka on Hivecell and Confluent Cloud.
Real-Life Use Cases & Architectures for Event Streaming with Apache Kafka, Kai Wähner
Streaming all over the World: Real-Life Use Cases & Architectures for Event Streaming with Apache Kafka.
Learn about various case studies for event streaming with Apache Kafka across industries. The talk explores architectures for real-world deployments from Audi, BMW, Disney, Generali, Paypal, Tesla, Unity, Walmart, William Hill, and more. Use cases include fraud detection, mainframe offloading, predictive maintenance, cybersecurity, edge computing, track&trace, live betting, and much more.
Apache Spark on Kubernetes, Anirudh Ramanathan and Tim Chen, Databricks
Kubernetes is a fast-growing open-source platform which provides container-centric infrastructure. Conceived by Google in 2014, and leveraging over a decade of experience running containers at scale internally, it is one of the fastest moving projects on GitHub with 1000+ contributors and 40,000+ commits. Kubernetes has first-class support on Google Cloud Platform, Amazon Web Services, and Microsoft Azure.
Unlike YARN, Kubernetes started as a general-purpose orchestration framework with a focus on serving jobs. Support for long-running, data-intensive batch workloads required some careful design decisions. Engineers across several organizations have been working on Kubernetes support as a cluster scheduler backend within Spark. During this process, we encountered several challenges in translating Spark considerations into idiomatic Kubernetes constructs. In this talk, we describe the challenges and the ways in which we solved them. This talk will be technical and is aimed at people who are looking to run Spark effectively on their clusters. The talk assumes basic familiarity with cluster orchestration and containers.
Flink Forward San Francisco 2022.
This talk will take you on the long journey of Apache Flink into the cloud-native era. It started all the way from where Hadoop and YARN were the standard way of deploying and operating data applications.
We're going to deep dive into the cloud-native set of principles and how they map to the Apache Flink internals and recent improvements. We'll cover fast checkpointing, fault tolerance, resource elasticity, minimal infrastructure dependencies, industry-standard tooling, ease of deployment and declarative APIs.
After this talk you'll get a broader understanding of the operational requirements for a modern streaming application and where the current limits are.
by David Moravek
0-60: Tesla's Streaming Data Platform (Jesse Yates, Tesla) Kafka Summit SF 2019, confluent
Tesla ingests trillions of events every day from hundreds of unique data sources through our streaming data platform. Find out how we developed a set of high-throughput, non-blocking primitives that allow us to transform and ingest data into a variety of data stores with minimal development time. Additionally, we will discuss how these primitives allowed us to completely migrate the streaming platform in just a few months. Finally, we will talk about how we scale team size sub-linearly to data volumes, while continuing to onboard new use cases.
Flink Forward San Francisco 2022.
Resource elasticity is a frequently requested feature in Apache Flink: users want to be able to easily adjust their clusters to changing workloads for resource-efficiency and cost-saving reasons. Flink 1.13 introduced the initial implementation of Reactive Mode, and later releases added more improvements to make the feature production ready. In this talk, we’ll explain scenarios to deploy Reactive Mode to various environments to achieve autoscaling and resource elasticity. We’ll discuss the constraints to consider when planning to use this feature, and also potential improvements from the Flink roadmap. For those interested in the internals of Flink, we’ll also briefly explain how the feature is implemented, and if time permits, conclude with a short demo.
by Robert Metzger
Mario Molina, Software Engineer
CDC systems are usually used to identify changes in data sources and to capture and replicate those changes to other systems. Companies use CDC to sync data across systems, migrate to the cloud, or even apply stream processing, among other things.
In this presentation we’ll see CDC patterns, how to use it in Apache Kafka, and do a live demo!
https://www.meetup.com/Mexico-Kafka/events/277309497/
Best Practices for Middleware and Integration Architecture Modernization with..., Claus Ibsen
What are important considerations when modernizing middleware and moving towards serverless and/or cloud native integration architectures? How can we make the most of flexible technologies such as Camel K, Kafka, Quarkus and OpenShift? Claus is working as project lead on Apache Camel and has extensive experience from open source product development.
The talk was recorded, runs for 30 minutes, and is published on YouTube at: https://www.youtube.com/watch?v=d1Hr78a7Lww
Protect your private data with ORC column encryption, Owen O'Malley
Fine-grained data protection at a column level in data lake environments has become a mandatory requirement to demonstrate compliance with multiple local and international regulations across many industries today. ORC is a self-describing type-aware columnar file format designed for Hadoop workloads that provides optimized streaming reads but with integrated support for finding required rows quickly.
Owen O’Malley dives into the progress the Apache community made for adding fine-grained column-level encryption natively into ORC format, which also provides capabilities to mask or redact data on write while protecting sensitive column metadata such as statistics to avoid information leakage. The column encryption capabilities will be fully compatible with Hadoop Key Management Server (KMS) and use the KMS to manage master keys, providing the additional flexibility to use and manage keys per column centrally.
Using Apache Kafka to Analyze Session Windows, confluent
Speaker: Michael Noll, Product Manager, Confluent
Building a Streaming Microservice Architecture: with Apache Spark Structured ..., Databricks
As we continue to push the boundaries of what is possible with respect to pipeline throughput and data serving tiers, new methodologies and techniques continue to emerge to handle larger and larger workloads.
Grant Allen, CTO Chief Product Officer at Dow Jones, explains how to deploy Flowable at scale in AWS.
It was presented at Flowfest 2018 in Barcelona, Spain.
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303..., Amazon Web Services
"This is a technical architect's case study of how Loggly has employed the latest social-media-scale technologies as the backbone ingestion processing for our multi-tenant, geo-distributed, and real-time log management system. This presentation describes design details of how we built a second-generation system fully leveraging AWS services including Amazon Route 53 DNS with heartbeat and latency-based routing, multi-region VPCs, Elastic Load Balancing, Amazon Relational Database Service, and a number of pro-active and re-active approaches to scaling computational and indexing capacity.
The talk includes lessons learned in our first generation release, validated by thousands of customers; speed bumps and the mistakes we made along the way; various data models and architectures previously considered; and success at scale: speeds, feeds, and an unmeltable log processing engine."
Real-time Analytics with Trino and Apache Pinot, Xiang Fu
Trino summit 2021:
Overview of Trino Pinot Connector, which bridges the flexibility of Trino's full SQL support to the power of Apache Pinot's realtime analytics, giving you the best of both worlds.
Ansible Tower is a web-based GUI tool used for managing infrastructure configurations. It is Ansible at a more enterprise level. It is useful for centralizing infrastructure from a user interface with role-based access control (RBAC), job scheduling, and graphical inventory management.
Traditional virtualization technologies have been used by cloud infrastructure providers for many years to provide isolated environments for hosting applications. These technologies make use of full-blown operating system images for creating virtual machines (VMs). According to this architecture, each VM needs its own guest operating system to run application processes. More recently, with the introduction of the Docker project, the Linux Container (LXC) virtualization technology became popular and attracted attention. Unlike VMs, containers do not need a dedicated guest operating system for providing OS-level isolation; rather, they can provide the same level of isolation on top of a single operating system instance.
An enterprise application may need to run a server cluster to handle high request volumes. Running an entire server cluster on Docker containers on a single Docker host could introduce the risk of a single point of failure. Google started a project called Kubernetes to solve this problem. Kubernetes provides a cluster of Docker hosts for managing Docker containers in a clustered environment. It provides an API on top of the Docker API for managing Docker containers on multiple Docker hosts, with many more features.
Building distributed systems is challenging. Luckily, Apache Kafka provides a powerful toolkit for putting together big services as a set of scalable, decoupled components. In this talk, I'll describe some of the design tradeoffs when building microservices, and how Kafka's powerful abstractions can help. I'll also talk a little bit about what the community has been up to with Kafka Streams, Kafka Connect, and exactly-once semantics.
Presentation by Colin McCabe, Confluent, Big Data Day LA
Learn how Apache Atlas is being enhanced to provide a universal open metadata and governance platform for all data processing across the enterprise. With open metadata, multiple metadata repositories, potentially from different vendors, can operate collaboratively to create an enterprise catalog of data that can be located, understood, used and governed. In this talk we will provide a detailed description of the extensions to the type system, new APIs, the connector framework, metadata discovery framework, governance action framework and the interoperability that we are adding to Apache Atlas. We will show examples of these features in operation. For example, (1) how metadata is discovered and gathered into Apache Atlas, (2) how applications and tools access metadata, (3) how enforcement engines such as Apache Ranger keep synchronized with the latest governance requirements and (4) how to build an adapter that allows other vendors' metadata repositories to exchange metadata with Apache Atlas repositories. We will also explain how these features can be deployed together to support the Hadoop platform, and the enterprise beyond. This session will be presented by Nigel Jones (IBM) and Ferd Schapers (ING Chief Information Architect).
Speaker:
Nigel Jones, Software Architect, IBM Analytics Group, IBM
Introducing Exactly Once Semantics To Apache Kafka, Apurva Mehta
Here are slides from my talk on introducing exactly once semantics to Apache Kafka. The talk was given at the Kafka Summit NYC, 8 May 2017.
The slides dive into the design of transactions in Apache Kafka.
Avro Tutorial - Records with Schema for Kafka and Hadoop, Jean-Paul Azar
Covers how to use Avro to save records to disk. This can be used later to use Avro with Kafka Schema Registry. This provides background on Avro which gets used with Hadoop and Kafka.
Pinot is a realtime distributed OLAP datastore, which is used at LinkedIn to deliver scalable real time analytics with low latency. It can ingest data from offline data sources (such as Hadoop and flat files) as well as online sources (such as Kafka). Pinot is designed to scale horizontally.
Siphon - Near Real Time Databus Using Kafka, Eric Boyd, Nitin Kumar, confluent
Siphon is a highly available and reliable distributed pub/sub system built using Apache Kafka. It is used to publish, discover and subscribe to near real-time data streams for operational and product intelligence. Siphon is used as a “Databus” by a variety of producers and subscribers in Microsoft, and is compliant with security and privacy requirements. It has built-in auditing and quality control. This session will provide an overview of the use of Kafka at Microsoft, and then deep dive into Siphon. We will describe an important business scenario and talk about the technical details of the system in the context of that scenario. We will also cover the design and implementation of the service, the scale, and real world production experiences from operating the service in the Microsoft cloud environment.
What's new in Confluent 3.2 and Apache Kafka 0.10.2, confluent
With the introduction of the Connect and Streams APIs in 2016, Apache Kafka is becoming the de facto solution for anyone looking to build a streaming platform. The community continues to add capabilities to make it the complete solution for streaming data.
Join us as we review the latest additions in Apache Kafka 0.10.2. In addition, we’ll cover what’s new in Confluent Enterprise 3.2 that makes it possible to run Kafka at scale.
Asynchronous micro-services and the unified log, Alexander Dean
On Friday October 7th 2016 at Crunch Conference in Budapest I gave a talk entitled "Asynchronous micro-services and the unified log".
The unified log enabled by Apache Kafka and Amazon Kinesis has been mostly understood as a better data processing architecture, replacing traditional data warehousing techniques. But the unified log also enables a new way of building transactional software, by enabling asynchronous micro-services. In this talk, I showed how event-driven micro-services designed around Kafka or Kinesis resolve many of the issues associated with traditional monolithic and synchronous micro-service based architectures.
Data Streaming with Apache Kafka & MongoDB - EMEA, Andrew Morgan
A new generation of technologies is needed to consume and exploit today's real time, fast moving data sources. Apache Kafka, originally developed at LinkedIn, has emerged as one of these key new technologies.
This webinar explores the use-cases and architecture for Kafka, and how it integrates with MongoDB to build sophisticated data-driven applications that exploit new sources of data.
Webinar: Data Streaming with Apache Kafka & MongoDB, MongoDB
A new generation of technologies is needed to consume and exploit today's real time, fast moving data sources. Apache Kafka, originally developed at LinkedIn, has emerged as one of these key new technologies.
Webinar: Five Problems Facing Business-Critical NFS Deployments, Storage Switzerland
VMware, NetApp and even EMC are proponents of using NFS based storage systems to support mission critical workloads like virtual machines, databases and performance sensitive unstructured data. But in comparison to mission critical fibre channel, the tools to monitor and optimize your NFS infrastructure are lacking. In this webinar Storage Switzerland and Virtual Instruments will discuss the five challenges facing IT professionals that depend on NFS-based storage infrastructure for performance-intensive workloads. You will learn how to detect and overcome:
* Metadata Bottlenecks
* Rogue Clients & Noisy Neighbor issues
* Server/VM Latency issues
* Poor Write Performance
* Cluster Node Bottlenecks
Event-Driven Applications Done Right - Pulsar Summit SF 2022, StreamNative
Pulsar Summit San Francisco is the event dedicated to Apache Pulsar. This one-day, action-packed event will include 5 keynotes, 12 breakout sessions, and 1 amazing happy hour. Speakers are from top companies, including Google, AWS, Databricks, Onehouse, StarTree, Intel, ScyllaDB, and more! It’s the perfect opportunity to network with Pulsar thought leaders in person.
Join developers, architects, data engineers, DevOps professionals, and anyone who wants to learn about messaging and event streaming for this one-day, in-person event. Pulsar Summit San Francisco brings the Apache Pulsar Community together to share best practices and discuss the future of streaming technologies.
Creating a Single Source of Truth: Leverage all of your data with powerful an..., Looker
With a centralized data store, the entire spectrum of analytics is at your fingertips. Using Looker & Segment, you can collect, store and analyze everything from click-stream and event data to transactional and behavioral data in your data warehouse.
Some of the topics this webinar will include:
- The advantages of a centralized data warehouse with Segment Warehouses
- Creating a data model to get your company on the same page with Looker Blocks
- Putting it all together: Best practices for making your data accessible to your end users
Data Streaming with Apache Kafka & MongoDB, confluent
Explore the use-cases and architecture for Apache Kafka, and how it integrates with MongoDB to build sophisticated data-driven applications that exploit new sources of data.
Data-Driven Transformation: Leveraging Big Data at Showtime with Apache Spark, Databricks
Interested in learning how Showtime is leveraging the power of Spark to transform a traditional premium cable network into a data-savvy analytical competitor? The growth in our over-the-top (OTT) streaming subscription business has led to an abundance of user-level data not previously available. To capitalize on this opportunity, we have been building and evolving our unified platform which allows data scientists and business analysts to tap into this rich behavioral data to support our business goals. We will share how our small team of data scientists is creating meaningful features which capture the nuanced relationships between users and content; productionizing machine learning models; and leveraging MLflow to optimize the runtime of our pipelines, track the accuracy of our models, and log the quality of our data over time. From data wrangling and exploration to machine learning and automation, we are augmenting our data supply chain by constantly rolling out new capabilities and analytical products to help the organization better understand our subscribers, our content, and our path forward to a data-driven future.
Authors: Josh McNutt, Keria Bermudez-Hernandez
Open Blueprint for Real-Time Analytics in Retail: Strata Hadoop World 2017 S...Grid Dynamics
This presentation outlines key business drivers for real-time analytics applications in retail and describes the emerging architectures based on In-Stream Processing (ISP) technologies. The slides present a complete open blueprint for an ISP platform - including a demo application for real-time Twitter Sentiment Analytics - designed with 100% open source components and deployable to any cloud.
To learn more, read an adjoining blog series on this topic here : https://blog.griddynamics.com/in-stream-processing-service-blueprint
Most data visualisation solutions today still work on data sources which are stored persistently in a data store, using the so called “data at rest” paradigms. More and more data sources today provide a constant stream of data, from IoT devices to Social Media streams. These data stream publish with high velocity and messages often have to be processed as quick as possible. For the processing and analytics on the data, so called stream processing solutions are available. But these only provide minimal or no visualisation capabilities. One was is to first persist the data into a data store and then use a traditional data visualisation solution to present the data.
If latency is not an issue, such a solution might be good enough. An other question is which data store solution is necessary to keep up with the high load on write and read. If it is not an RDBMS but an NoSQL database, then not all traditional visualisation tools might already integrate with the specific data store. An other option is to use a Streaming Visualisation solution. They are specially built for streaming data and often do not support batch data. A much better solution would be to have one tool capable of handling both, batch and streaming data. This talk presents different architecture blueprints for integrating data visualisation into a fast data solution and highlights some of the products available to implement these blueprints.
Marco Pozzan
Power BI consultant & Trainer
Scenario di utilizzo del real-time di Power BI. In questa sessione verrà introdotta la teoria sul real-time dashboarding offerto da Power BI. Poi ci si focalizzerà sun un caso pratico di real-time dataset in modalità ibrida per la realizzazione di una dashboard di controllo con la possibilità di effettuare il write back e permettere all’utente di effettuare analisi what-if.
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...confluent
Tinder’s Quickfire Pipeline powers all things data at Tinder. It was originally built using AWS Kinesis Firehoses and has since been extended to use both Kafka and other event buses. It is the core of Tinder’s data infrastructure. This rich data flow of both client and backend data has been extended to service a variety of needs at Tinder, including Experimentation, ML, CRM, and Observability, allowing backend developers easier access to shared client side data. We perform this using many systems, including Kafka, Spark, Flink, Kubernetes, and Prometheus. Many of Tinder’s systems were natively designed in an RPC first architecture.
Things we’ll discuss decoupling your system at scale via event-driven architectures include:
– Powering ML, backend, observability, and analytical applications at scale, including an end to end walk through of our processes that allow non-programmers to write and deploy event-driven data flows.
– Show end to end the usage of dynamic event processing that creates other stream processes, via a dynamic control plane topology pattern and broadcasted state pattern
– How to manage the unavailability of cached data that would normally come from repeated API calls for data that’s being backfilled into Kafka, all online! (and why this is not necessarily a “good” idea)
– Integrating common OSS frameworks and libraries like Kafka Streams, Flink, Spark and friends to encourage the best design patterns for developers coming from traditional service oriented architectures, including pitfalls and lessons learned along the way.
– Why and how to avoid overloading microservices with excessive RPC calls from event-driven streaming systems
– Best practices in common data flow patterns, such as shared state via RocksDB + Kafka Streams as well as the complementary tools in the Apache Ecosystem.
– The simplicity and power of streaming SQL with microservices
User Behavior Analysis with Session Windows and Apache Kafka's Streams API
1. 1
User behavior analysis with Session Windows and Apache Kafka’s Streams API
Michael G. Noll
Product Manager
2. 2
Attend the whole series!
https://www.confluent.io/online-talk/online-talk-series-five-steps-to-production-with-apache-kafka/
What’s New in Apache Kafka 0.10.2 and Confluent 3.2
Date: Thursday, March 9, 2017
Time: 9:30 am - 10:00 am PT | 12:30 pm - 1:00 pm ET
Speaker: Clarke Patterson, Senior Director, Product Marketing
Monitoring and Alerting Apache Kafka with Confluent Control Center
Date: Thursday, March 16, 2017
Time: 9:30 am - 10:00 am PT | 12:30 pm - 1:00 pm ET
Speaker: Nick Dearden, Director, Engineering and Product
Data Pipelines Made Simple with Apache Kafka
Date: Thursday, March 23, 2017
Time: 9:30 am - 10:00 am PT | 12:30 pm - 1:00 pm ET
Speaker: Ewen Cheslack-Postava, Engineer, Confluent
Using Apache Kafka to Analyze Session Windows
Date: Thursday, March 30, 2017
Time: 9:30 am - 10:00 am PT | 12:30 pm - 1:00 pm ET
Speaker: Michael Noll, Product Manager, Confluent
Simplify Governance for Streaming Data in Apache Kafka
Date: Thursday, April 6, 2017
Time: 9:30 am - 10:00 am PT | 12:30 pm - 1:00 pm ET
Speaker: Gwen Shapira, Product Manager, Confluent
3. 3
Kafka Streams API: to build real-time apps that power your core business
Key benefits
• Makes your Java apps highly scalable, elastic, fault-tolerant, stateful, distributed
• No additional cluster
• Easy to run as a service
• Supports large aggregations and joins
• Security and permissions fully integrated from Kafka
Example Use Cases
• Microservices
• Reactive applications
• Continuous queries
• Continuous transformations
• Event-triggered processes
[Diagram: "Your App" runs as App Instance 1 ... App Instance N; each instance embeds the Streams API and talks to the Kafka Cluster]
4. 4
Use case examples
Industry | Use case examples
Travel | Build applications with the Kafka Streams API to make real-time decisions to find the most suitable pricing for individual customers, to cross-sell additional services, and to process bookings and reservations
Finance | Build applications to aggregate data sources for real-time views of potential exposures and for detecting and minimizing fraudulent transactions
Logistics | Build applications to track shipments fast, reliably, and in real-time
Retail | Build applications to decide in real-time on next best offers, personalized promotions, pricing, and inventory management
Automotive, Manufacturing | Build applications to ensure production lines perform optimally, to gain real-time insights into supply chains, and to monitor telemetry data from connected cars to decide if an inspection is needed
And many more …
5. 5
Some public use cases in the wild
• Why Kafka Streams: towards a real-time streaming architecture, by Sky Betting and Gaming
• http://engineering.skybettingandgaming.com/2017/01/23/streaming-architectures/
• Applying Kafka’s Streams API for social messaging at LINE Corp.
• http://developers.linecorp.com/blog/?p=3960
• Production pipeline at LINE, a social platform based in Japan with 220+ million users
• Microservices and Reactive Applications at Capital One
• https://speakerdeck.com/bobbycalderwood/commander-decoupled-immutable-rest-apis-with-kafka-streams
• Containerized Kafka Streams applications in Scala, by Hive Streaming
• https://www.madewithtea.com/processing-tweets-with-kafka-streams.html
• Geo-spatial data analysis
• http://www.infolace.com/blog/2016/07/14/simple-spatial-windowing-with-kafka-streams/
• Language classification with machine learning
• https://dzone.com/articles/machine-learning-with-kafka-streams
6. 6
Kafka Summit NYC, May 09
Here, the community will share the latest Kafka Streams use cases.
http://kafka-summit.org/
7. 7
Agenda
• Why are session windows so important?
• Recap: What is windowing?
• Session windows – example use case
• Session windows – how they work
• Session windows – API
8. 8
Why are session windows so important?
• We want to analyze user behavior, which is a very common use case area
• To analyze user behavior on newspapers, social platforms, video sharing sites, booking sites, etc.
• AND tailor the analysis to the individual user
• Specifically, analyses of the type “how many X in one go?” – how many movies watched in one go?
• Achieved through a per-user sessionization step on the input data.
• AND this tailoring must be convenient and scalable
• Achieved through automating the sessionization step, i.e. auto-discovery of sessions
• Session-based analyses can range from simple metrics (e.g. count of user visits on a news website or social platform) to more complex metrics (e.g. customer conversion funnel and event flows).
9. 9
What is windowing?
• Aggregations such as “counting things” are key-based operations
• Before you can aggregate your input data, it must first be grouped by key
[Figure: two event-time axes (6 AM, 7 AM, 8 AM); events shown in one row per key: Alice, Bob, Dave]
10. 10
What is windowing?
• Aggregations such as “counting things” are key-based operations
“Let me COUNT how many movies each user has watched (IN TOTAL)”
Alice: 10 movies
Bob: 11 movies
Dave: 8 movies
[Figure: event-time axis (Feb 5, Feb 6, Feb 7) with one row of events per user: Alice, Bob, Dave]
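To make the "group by key, then count" idea concrete, here is a minimal plain-Java sketch. It does not use the Kafka Streams API; the class name `TotalCounts` and the sample events are made up for illustration, and each list element simply stands for the key (user) of one movie-view event.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Kafka Streams.
final class TotalCounts {
    // Groups the events by key (user) and counts the events in each group.
    static Map<String, Long> countByUser(List<String> viewEventUsers) {
        return viewEventUsers.stream()
            .collect(Collectors.groupingBy(user -> user, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        // Illustrative input: one entry per movie view, holding the user who watched.
        List<String> events = Arrays.asList("Alice", "Bob", "Alice", "Dave", "Bob", "Alice");
        System.out.println(countByUser(events)); // prints {Alice=3, Bob=2, Dave=1}
    }
}
```

Without windowing, every event for a key lands in the same group, which is why the result is a single total per user.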
11. 11
What is windowing?
• Windowing allows you to further “sub-group” the input data for each user
“Let me COUNT how many movies each user has watched PER DAY”
Feb 5:
Alice: 4 movies
Bob: 3 movies
Dave: 2 movies
[Figure: event-time axis (Feb 5, Feb 6, Feb 7) with one row of events per user; the Feb 5 window is highlighted]
12. 12
What is windowing?
• Windowing allows you to further “sub-group” the input data for each user
“Let me COUNT how many movies each user has watched PER DAY”
Feb 6:
Alice: 1 movie
Bob: 2 movies
Dave: 4 movies
[Figure: event-time axis (Feb 5, Feb 6, Feb 7) with one row of events per user; the Feb 6 window is highlighted]
13. 13
What is windowing?
• Windowing allows you to further “sub-group” the input data for each user
“Let me COUNT how many movies each user has watched PER DAY”
Feb 7:
Alice: 4 movies
Bob: 4 movies
Dave: 1 movie
[Figure: event-time axis (Feb 5, Feb 6, Feb 7) with one row of events per user; the Feb 7 window is highlighted]
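The per-day "sub-grouping" can be sketched in plain Java as follows. This is only an illustration of fixed, day-sized time windows, not how Kafka Streams implements them; the class `DailyWindows`, the choice of UTC calendar days as window boundaries, and the sample timestamps are all assumptions.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Kafka Streams.
final class DailyWindows {
    // Maps an event timestamp (epoch millis) to the UTC calendar day that serves as its window.
    static LocalDate windowOf(long eventTimeMs) {
        return Instant.ofEpochMilli(eventTimeMs).atZone(ZoneOffset.UTC).toLocalDate();
    }

    // users[i] watched a movie at eventTimesMs[i]; result is user -> (day -> count).
    static Map<String, Map<LocalDate, Long>> countPerUserPerDay(String[] users, long[] eventTimesMs) {
        Map<String, Map<LocalDate, Long>> counts = new TreeMap<>();
        for (int i = 0; i < users.length; i++) {
            counts.computeIfAbsent(users[i], u -> new TreeMap<>())
                  .merge(windowOf(eventTimesMs[i]), 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] users = {"Alice", "Alice", "Alice", "Bob"};
        long[] times = {
            Instant.parse("2017-02-05T08:00:00Z").toEpochMilli(),
            Instant.parse("2017-02-05T20:00:00Z").toEpochMilli(),
            Instant.parse("2017-02-06T09:00:00Z").toEpochMilli(),
            Instant.parse("2017-02-05T10:00:00Z").toEpochMilli()
        };
        // prints {Alice={2017-02-05=2, 2017-02-06=1}, Bob={2017-02-05=1}}
        System.out.println(countPerUserPerDay(users, times));
    }
}
```

Note that the window boundaries here are the same for every user, which is the key difference from the session windows introduced next.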
14. 14
Session windows: use case
• Session windows allow for “how many X in one go?” analyses, tailored to each key
• Sessions are auto-discovered from the input data (we see how later)
“Let me COUNT how many movies each user has watched PER SESSION”
Alice: 1, 4, 1, 4 movies (4 sessions)
Bob: 4, 6 movies (2 sessions)
Dave: 3, 5 movies (2 sessions)
[Figure: event-time axis (Feb 5, Feb 6, Feb 7) with one row of events per user; sessions are marked per user]
15. 15
Comparing results
• Let’s compare how results differ
User | IN TOTAL (no windows) | PER DAY (time windows) | PER SESSION (session windows)
Alice | 10 | 3.0 (avg) | 2.5 (avg)
Bob | 11 | 3.0 (avg) | 5.0 (avg)
Dave | 8 | 2.3 (avg) | 4.0 (avg)
16. 16
Comparing results
• Let’s compare how results differ if our task were to rank the top users
User | IN TOTAL (no windows) | PER DAY (time windows) | PER SESSION (session windows)
Alice | #2 | #1 | #3
Bob | #1 | #1 | #1
Dave | #3 | #3 | #2
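The averages and ranks on the two "Comparing results" slides can be reproduced from the per-day and per-session counts shown on the earlier slides. The sketch below is plain Java; the class `CompareResults` and the tie-handling rule (users with equal values share a rank) are assumptions chosen to match the slide, where Alice and Bob are both #1 per day.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for the arithmetic behind the comparison slides.
final class CompareResults {
    static double average(int... counts) {
        return Arrays.stream(counts).average().orElse(0.0);
    }

    // Rank = 1 + number of users with a strictly larger value, so ties share a rank.
    static Map<String, Integer> rank(Map<String, Double> valueByUser) {
        Map<String, Integer> ranks = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : valueByUser.entrySet()) {
            long larger = valueByUser.values().stream().filter(v -> v > e.getValue()).count();
            ranks.put(e.getKey(), (int) larger + 1);
        }
        return ranks;
    }

    public static void main(String[] args) {
        Map<String, Double> perDay = new LinkedHashMap<>();
        perDay.put("Alice", average(4, 1, 4)); // 3.0
        perDay.put("Bob", average(3, 2, 4));   // 3.0
        perDay.put("Dave", average(2, 4, 1));  // about 2.3

        Map<String, Double> perSession = new LinkedHashMap<>();
        perSession.put("Alice", average(1, 4, 1, 4)); // 2.5
        perSession.put("Bob", average(4, 6));         // 5.0
        perSession.put("Dave", average(3, 5));        // 4.0

        System.out.println(rank(perDay));     // prints {Alice=1, Bob=1, Dave=3}
        System.out.println(rank(perSession)); // prints {Alice=3, Bob=1, Dave=2}
    }
}
```

The same users end up in a different order depending on the window type, which is the point of the comparison.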
18. 18
Session windows: how they work
• In the Kafka Streams API, a session is defined by a configurable period of inactivity
• Example: “If Alice hasn’t watched another movie in the past 3 hours, then next movie = new session!”
[Figure: the inactivity period is shown as the gap between two consecutive events on the event-time axis]
20. 20
Auto-discovering sessions, per user
Example: “How many movies does Alice watch on average per session?”
[Figure: event-time axis with one row of events per user (Alice, Bob, Dave); an inactivity period (e.g. 3 hours) is marked between events]
21. 21
Auto-discovering sessions, per user
Example: “How many movies does Alice watch on average per session?”
[Figure: event-time axis with one row of events per user (Alice, Bob, Dave); the discovered sessions are marked per user]
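The auto-discovery of sessions for a single user can be sketched in plain Java as below. This is a simplified batch version under stated assumptions, not the Kafka Streams implementation: the class `Sessionizer` is hypothetical, all events are available up front, and two events belong to the same session when they are at most one inactivity gap apart.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Kafka Streams.
final class Sessionizer {
    // Splits one user's event timestamps (millis) into sessions. A new session starts
    // whenever the gap to the previous event is larger than the inactivity gap.
    static List<List<Long>> sessionize(long[] eventTimesMs, long inactivityGapMs) {
        long[] sorted = eventTimesMs.clone();
        Arrays.sort(sorted);
        List<List<Long>> sessions = new ArrayList<>();
        List<Long> current = null;
        long previous = 0L;
        for (long t : sorted) {
            if (current == null || t - previous > inactivityGapMs) {
                current = new ArrayList<>();
                sessions.add(current);
            }
            current.add(t);
            previous = t;
        }
        return sessions;
    }

    // Average number of events (e.g. movies watched) per discovered session.
    static double averagePerSession(long[] eventTimesMs, long inactivityGapMs) {
        List<List<Long>> sessions = sessionize(eventTimesMs, inactivityGapMs);
        return sessions.isEmpty() ? 0.0 : (double) eventTimesMs.length / sessions.size();
    }

    public static void main(String[] args) {
        long hour = 3_600_000L;
        // Illustrative data: Alice watches movies at 8h, 9h, 10h, then again at 15h and 16h.
        long[] alice = {8 * hour, 9 * hour, 10 * hour, 15 * hour, 16 * hour};
        System.out.println(sessionize(alice, 3 * hour).size());  // prints 2
        System.out.println(averagePerSession(alice, 3 * hour));  // prints 2.5
    }
}
```

Because the boundaries are derived from each user's own activity, the sessions differ per key, unlike the fixed per-day windows.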
22. 22
Late-arriving data is handled transparently
• Handling of late-arriving data is important because, in practice, a lot of data arrives late
23. 23
Late-arriving data: example
• Users with mobile phones board an airplane and lose Internet connectivity
• Emails are written during the 8h flight: Bob writes Alice an email at 2 P.M.
• Internet connectivity is restored and the phones send the queued emails, though with an 8h delay: Bob’s email is finally sent at 10 P.M.
24. 24
Late-arriving data is handled transparently
• Handling of late-arriving data is important because, in practice, a lot of data arrives late
• Good news: late-arriving data is handled transparently and efficiently for you
• Also, in your applications, you can define a grace period after which late-arriving data will be discarded (default: 1 day), and you can define this granularly per windowed operation
• Example: “I want to sessionize the input data based on 15-min inactivity periods, and late-arriving data should be discarded if it is more than 12 hours late”
25. 25
Late-arriving data is handled transparently
• Late-arriving data may (1) create new sessions or (2) merge existing sessions
[Figure: event-time axis with one row of events per user (Alice, Bob, Dave); late-arriving events are placed into the timeline]
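How a late-arriving record can create a new session, merge existing ones, or be discarded can be sketched in plain Java as follows. This is an illustrative model only, not the Kafka Streams implementation: the class `SessionStore`, the use of the highest event time seen so far as "now", and the rule that an event joins every session within one inactivity gap of it are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory model of one user's sessions.
final class SessionStore {
    private static final class Session {
        final long start, end;
        final int count;
        Session(long start, long end, int count) { this.start = start; this.end = end; this.count = count; }
    }

    private final long gapMs;
    private final long graceMs;
    private final List<Session> sessions = new ArrayList<>();
    private long streamTimeMs = Long.MIN_VALUE; // highest event time seen so far

    SessionStore(long gapMs, long graceMs) { this.gapMs = gapMs; this.graceMs = graceMs; }

    // Returns false if the event is later than the grace period allows and was discarded.
    boolean add(long eventTimeMs) {
        streamTimeMs = Math.max(streamTimeMs, eventTimeMs);
        if (streamTimeMs - eventTimeMs > graceMs) {
            return false;
        }
        long start = eventTimeMs, end = eventTimeMs;
        int count = 1;
        List<Session> kept = new ArrayList<>();
        for (Session s : sessions) {
            boolean withinGap = s.end >= eventTimeMs - gapMs && s.start <= eventTimeMs + gapMs;
            if (withinGap) { // absorb this session into the one being built
                start = Math.min(start, s.start);
                end = Math.max(end, s.end);
                count += s.count;
            } else {
                kept.add(s);
            }
        }
        kept.add(new Session(start, end, count));
        sessions.clear();
        sessions.addAll(kept);
        return true;
    }

    int sessionCount() { return sessions.size(); }

    public static void main(String[] args) {
        long hour = 3_600_000L;
        SessionStore store = new SessionStore(3 * hour, 12 * hour);
        store.add(8 * hour);
        store.add(9 * hour);
        store.add(15 * hour);                      // 6h after the previous event: second session
        System.out.println(store.sessionCount());  // prints 2
        store.add(12 * hour);                      // late event bridges both sessions
        System.out.println(store.sessionCount());  // prints 1
        store.add(40 * hour);
        System.out.println(store.add(20 * hour));  // prints false (20h late, grace is 12h)
    }
}
```

In this model a merge can only ever join the two neighbouring sessions, because existing sessions are always more than one inactivity gap apart.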
31. 31
Session windows: API in Confluent 3.2 / Apache Kafka 0.10.2
Defining a session window
// A session window with an inactivity gap of 3h; discard data that is 12h late
SessionWindows.with(TimeUnit.HOURS.toMillis(3)).until(TimeUnit.HOURS.toMillis(12));
Full example: aggregating with session windows
import java.util.concurrent.TimeUnit;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.SessionWindows;
import org.apache.kafka.streams.kstream.Windowed;

// Key (String) is user, value (Avro record) is the movie view event for that user.
KStream<String, GenericRecord> movieViews = ...;
// Count movie views per session, per user.
// The result is keyed by (user, session window); "views-per-session" names the state store.
KTable<Windowed<String>, Long> sessionizedMovieCounts =
    movieViews
        .groupByKey(Serdes.String(), genericAvroSerde)
        .count(SessionWindows.with(TimeUnit.HOURS.toMillis(3)), "views-per-session");
More details with documentation and examples at:
http://docs.confluent.io/current/streams/developer-guide.html#session-windows
https://github.com/confluentinc/examples
33. 33
Why Confluent? More than just enterprise software
Complete support across the entire adoption lifecycle
• Confluent Platform: The only enterprise open source streaming platform based entirely on Apache Kafka
• Professional Services: Best practice consultation for future Kafka deployments, and optimization of performance and scalability for existing ones
• Enterprise Support: 24x7 support for the entire Apache Kafka project, not just a portion of it
• Kafka Training: Comprehensive hands-on courses for developers and operators from the Apache Kafka experts
34. 34
Get Started with Apache Kafka Today!
https://www.confluent.io/downloads/
THE place to start with Apache Kafka!
• Thoroughly tested and quality assured
• More extensible developer experience
• Easy upgrade path to Confluent Enterprise
35. 35
Discount code: kafcom17
Use the Apache Kafka community discount code to get $50 off
www.kafka-summit.org
Kafka Summit New York: May 8
Kafka Summit San Francisco: August 28