Flink Forward Berlin 2018: Ravi Suhag & Sumanth Nakshatrithaya - "Managing Fl..." (Flink Forward)
At GO-JEK, we build products that help millions of Indonesians commute, shop, eat and pay, daily. The Data Engineering team is responsible for creating a reliable data infrastructure across all of GO-JEK's 18+ products. We use Flink extensively to provide real-time streaming aggregation and analytics for billions of data points generated on a daily basis. Working at such a large scale makes it really important to automate operations, from infrastructure to failover and monitoring. This way we can push features faster without causing chaos and disruption to the production environment.
1. Provisioning and deployment: Given the nature of business at GO-JEK, we find ourselves provisioning Flink clusters quite often. Currently we run around 1,000 jobs across 10 clusters for different data streams, with the number of requests increasing day by day. We also provision on-the-fly clusters with custom configuration for load testing, experimentation and chaos engineering. Provisioning this many clusters from the ground up required a lot of man-hours and involved setting up virtual machines, monitoring agents, access management, configuration management, load testing and data stream integration. Our current setup runs Flink over YARN clusters as well as Kubernetes. We use our in-house provisioning tool Odin, built on top of Terraform and Chef, for YARN clusters, and Kubernetes controllers for Kubernetes-based deployments. It enables us to safely and predictably create and modify Flink infrastructure. Odin has helped us reduce provisioning time by 99% despite the increasing number of requests.
2. Isolation and access control: Given the real-time and distributed nature of GO-JEK's services, events are classified into different streams depending on their nature, time and transactional criticality, sensitivity and volume of data. This requires setting up separate clusters based on security concerns, team segregation, job loads and criticality, which comes at the cost of handling large-volume data replication and maintenance.
3. Data quality control: The quality of ingestion events is controlled by a Protobuf-based, version-controlled, strict event type schema with a fully automated deployment pipeline. Deployed jobs are locked to a certain data schema and version, which helps us prevent accidental breaking schema changes and preserve backward compatibility during migration and failover.
4. Monitoring and alerting: All clusters are monitored using a dedicated TICK setup. We monitor clusters for resource utilization, job stats and business impact per job.
5. Failover and upgrading: Failover and upgrade operations are fully automated for YARN cluster failover and input stream failovers, e.g. Kafka failover with stateless job strategies. This lets us move jobs from one cluster to another without any data loss or broken metric flow.
6. Chaos engineering and load testing: Loki is our disaster simulation tool that helps ensure the Flink infrastructure can tolerate failures.
Flink Forward Berlin 2018: Aljoscha Krettek & Till Rohrmann - Keynote: "A Yea..." (Flink Forward)
Stream processing still evolves and changes at a speed that can make it hard to keep up with the developments. Being at the forefront of stream processing technology, the evolution of Apache Flink has mirrored many of these developments and continues to do so.
We will take you on a journey through the major milestones of stream processing technology in past years, diving into the latest additions that Apache Flink and other communities introduced to the stream processing landscape, such as Streaming SQL, time-versioned tables, cluster-library duality, language portability, etc.
We will take a sneak peek into our crystal ball and present what the Flink community is working on next.
Flink Forward Berlin 2018: Oleksandr Nitavskyi - "Data lossless event time st..." (Flink Forward)
One of the main characteristics of a good streaming pipeline is correctness of event-time processing. The real challenges arise when such a pipeline must be resilient to different types of failures. In this talk, we describe how Criteo runs Flink on one of the biggest YARN clusters in Europe and processes 100k messages per second to account for the revenue of our platform within a delay of 5 minutes. The real-time revenue monitoring system keeps discrepancies under 1% and minimizes business impact in case of revenue anomalies.
Maximize the Business Value of Machine Learning and Data Science with Kafka (... (confluent)
Today, many companies that have lots of data are still struggling to derive value from machine learning (ML) and data science investments. Why? Accessing the data may be difficult. Or maybe it’s poorly labeled. Or vital context is missing. Or there are questions around data integrity. Or standing up an ML service can be cumbersome and complex.
At Nuuly, we offer an innovative clothing rental subscription model and are continually evolving our ML solutions to gain insight into the behaviors of our unique customer base as well as provide personalized services. In this session, I’ll share how we used event streaming with Apache Kafka® and Confluent Cloud to address many of the challenges that may be keeping your organization from maximizing the business value of machine learning and data science. First, you’ll see how we ensure that every customer interaction and its business context is collected. Next, I’ll explain how we can replay entire interaction histories using Kafka as a transport layer as well as a persistence layer and a business application processing layer. Order management, inventory management, logistics, subscription management – all of it integrates with Kafka as the common backbone. These data streams enable Nuuly to rapidly prototype and deploy dynamic ML models to support various domains, including pricing, recommendations, product similarity, and warehouse optimization. Join us and learn how Kafka can help improve machine learning and data science initiatives that may not yet be delivering their full potential.
This is a talk that I gave at the Data Council Berlin Meetup on May 16th, 2019
Abstract:
Stream processing is being rapidly adopted by the enterprise. While in the past, stream processing frameworks mostly provided Java- or Scala-based APIs, stream processing with SQL is growing increasingly popular because it makes stream processing accessible to non-programmers and significantly reduces the effort to solve common tasks.
About three years ago, the Apache Flink community started adding SQL support to process static and streaming data in a unified fashion. Today, Flink SQL powers production systems at Alibaba, Huawei, Lyft, and Uber. Fabian Hueske discusses the current state of Flink’s SQL support and explains the importance of Flink’s unified approach to process static and streaming data. After covering the basics, he shares common real-world use cases ranging from low-latency ETL to pattern detection and demonstrates how easily they can be addressed with Flink SQL.
Flink Forward Berlin 2018: Timo Walther - "Flink SQL in Action" (Flink Forward)
SQL is the lingua franca of data processing, and everybody working with data knows SQL. Apache Flink provides SQL support for querying and processing batch and streaming data. Flink's SQL support powers large-scale production systems at Alibaba, Huawei, and Uber. Based on Flink SQL, these companies have built systems for their internal users as well as publicly offered services for paying customers. In my talk I will show how to leverage the simplicity and power of SQL on Flink. I’ll explain why unified batch and stream processing is important and what it means to run SQL queries on streams of data. Once we’ve covered the basics, I will spend the remainder of the talk demonstrating the capabilities of Flink SQL. We will explore different use cases that Flink SQL was designed for by running queries on Flink’s SQL shell. In particular, I will demonstrate the unified batch and streaming engine by running the same query on batch and streaming data and show how to build a real-time dashboard that is powered by a streaming SQL query, which continuously updates an external result table.
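A continuously updating dashboard query of that kind can be sketched as follows. This is my illustration, not the talk's SQL-shell demo: it uses the Table API method names of more recent Flink releases, a hypothetical orders table, and Flink's built-in datagen connector standing in for a real Kafka topic.

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class FlinkSqlDashboardSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

        // Hypothetical source table; 'datagen' produces an unbounded stream of
        // random rows, so the same query text also works on a bounded source.
        tableEnv.executeSql(
            "CREATE TABLE orders (" +
            " product STRING," +
            " amount DOUBLE," +
            " ts AS PROCTIME()" +        // processing-time attribute for windowing
            ") WITH (" +
            " 'connector' = 'datagen'," +
            " 'rows-per-second' = '5'," +
            " 'fields.product.length' = '3'" +
            ")");

        // A continuous query: the result updates as new rows stream in,
        // which is exactly what powers a live dashboard.
        tableEnv.executeSql(
            "SELECT product," +
            " TUMBLE_START(ts, INTERVAL '10' SECOND) AS window_start," +
            " SUM(amount) AS revenue" +
            " FROM orders" +
            " GROUP BY product, TUMBLE(ts, INTERVAL '10' SECOND)")
            .print();
    }
}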
Building Event Streaming Applications with Pac-Man (Ricardo Ferreira, Conflue... (HostedbyConfluent)
Since Pac-Man was originally released in the '80s, it has been a beacon of fun and joy for people of all ages. What few people know is that this game can also be used to inspire developers on how to build event streaming applications. In this near-zero-slides talk, attendees will get to play the game to generate events. As they play, the presenter will write from scratch a scoreboard using ksqlDB -- an open-source event streaming database built for Apache Kafka.
After building the scoreboard, we will discuss different strategies to make the data available elsewhere so that any interested service can leverage it with ease. Example services will be provided that monitor the scoreboard in near real time, revealing who is the most proficient Pac-Man player in the room.
Stream Processing Live Traffic Data with Kafka Streams (Tom Van den Bulck)
In this workshop we will set up a streaming framework which will process real-time data from traffic sensors installed within the Belgian road system.
Starting with the intake of the data, you will learn best practices and the recommended approach to split the information into events in a way that won't come back to haunt you.
With some basic stream operations (count, filter, ... ) you will get to know the data and experience how easy it is to get things done with Spring Boot & Spring Cloud Stream.
But since simple data processing is not enough to fulfill all your streaming needs, we will also let you experience the power of windows.
After this workshop, tumbling, sliding and session windows hold no more mysteries and you will be a true streaming wizard.
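To give a flavor of those windowed operations, here is a minimal sketch in the plain Kafka Streams DSL (the workshop itself uses Spring Boot and Spring Cloud Stream); the topic names and the 5-minute tumbling window are hypothetical, and the API names are from recent Kafka releases.

import java.time.Duration;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.TimeWindows;

public class TrafficWindowCounts {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Sensor readings keyed by sensor id; the value is the raw measurement payload.
        KStream<String, String> readings =
            builder.stream("traffic-sensors", Consumed.with(Serdes.String(), Serdes.String()));

        // Count readings per sensor in 5-minute tumbling windows and emit
        // "<sensorId>@<windowStart> -> count" records to an output topic.
        readings
            .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
            .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5)))
            .count()
            .toStream()
            .map((window, count) -> KeyValue.pair(
                window.key() + "@" + window.window().startTime(), count.toString()))
            .to("traffic-counts", Produced.with(Serdes.String(), Serdes.String()));

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "traffic-window-demo");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        new KafkaStreams(builder.build(), props).start();
    }
}

Sliding and session windows follow the same shape: swap TimeWindows for SlidingWindows or SessionWindows in the windowedBy call.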
Presented at Stream Processing Meetup (7/19/2018)(https://www.meetup.com/Stream-Processing-Meetup-LinkedIn/events/251481797/).
At Uber, we operate 20+ Kafka clusters to collect system and application logs as well as event data from rider and driver apps. We need a Kafka replication solution to replicate data between Kafka clusters across multiple data centers for different purposes. This talk will introduce the history behind uReplicator and its high-level architecture. As the original uReplicator ran into scalability challenges and operational overhead when the scale of the Kafka clusters increased, we built the Federated uReplicator, which addresses the above issues and provides an extensible architecture for further scaling.
On Track with Apache Kafka®: Building a Streaming ETL Solution with Rail Data (confluent)
Watch this talk here: https://www.confluent.io/online-talks/building-a-streaming-etl-solution-with-apache-kafka-rail-data-on-demand
As data engineers, we frequently need to build scalable systems working with data from a variety of sources and with various ingest rates, sizes, and formats. This talk takes an in-depth look at how Apache Kafka can be used to provide a common platform on which to build data infrastructure driving both real-time analytics as well as event-driven applications.
Using a public feed of railway data it will show how to ingest data from message queues such as ActiveMQ with Kafka Connect, as well as from static sources such as S3 and REST endpoints. We'll then see how to use stream processing to transform the data into a form useful for streaming to analytics in tools such as Elasticsearch and Neo4j. The same data will be used to drive a real-time notifications service through Telegram.
If you're wondering how to build your next scalable data platform, how to reconcile the impedance mismatch between stream and batch, and how to wrangle streams of data—this talk is for you!
Kafka, Killer of Point-to-Point Integrations, Lucian Lita (confluent)
With 60+ products and over 24% of the US GDP flowing through it, system integration is a tough problem for Intuit. Seasonality, scale, and massive peaks in products like TurboTax, QuickBooks, and Mint.com add extra layers of difficulty when building shared data services around transaction and user graphs, clickstream processing, a/b testing, and personalization. To reduce complexity and latency, we’ve implemented Kafka as the backbone across these data services. This allows us to asynchronously trigger relevant processing, elegantly scaling up and down as needed around peaks, all without the need for point-to-point integrations.
In this talk, we share what we’ve learned about Kafka at Intuit and describe our data services architecture. We found that Kafka is invaluable in achieving a scalable, clean architecture, allowing engineering teams to focus less on integration and more on product development.
(Krunal Vora, Tinder) Kafka Summit San Francisco 2018
At Tinder, we have been using Kafka for streaming and processing events, data science processes and many other integral jobs. Forming the core of the pipeline at Tinder, Kafka has been accepted as the pragmatic solution to match the ever-increasing scale of users, events and backend jobs. We, at Tinder, are investing time and effort to optimize the usage of Kafka to solve the problems we face in the dating-app context. Kafka forms the backbone for the company's plans to sustain performance through the envisioned scale as the company starts to grow in unexplored markets. Come learn about the implementation of Kafka at Tinder and how Kafka has helped solve the use cases for dating apps. Engage in the success story behind the business case of Kafka at Tinder.
The Future of Streaming: Global Apps, Event Stores and Serverless (Ben Stopford)
Stream processing affects a wide range of industries today: capturing sensor data, connecting microservices, processing the workloads of internet giants and giving us a real-time alternative to batch analytics.
While these use cases are exciting and valuable they are only a taste of what is to come. In this talk we look at three areas that are likely to become more prominent: Global Apps, Event Stores and Serverless Stream Processing
Flink Forward Berlin 2018: Xiaowei Jiang - Keynote: "Unified Engine for Data ..." (Flink Forward)
Flink started with the mission to unify batch and stream processing. We believe that Flink’s architecture is uniquely positioned to be a great engine for streaming, batch and AI workloads at the same time. We will talk about the work we did in this direction.
Building data pipelines is pretty hard! Building a multi-datacenter active-active real time data pipeline for multiple classes of data with different durability, latency and availability guarantees is much harder. Real time infrastructure powers critical pieces of Uber (think Surge) and in this talk we will discuss our architecture, technical challenges, learnings and how a blend of open source infrastructure (Apache Kafka and Flink) and in-house technologies have helped Uber scale.
Flink Forward Berlin 2018: Raj Subramani - "A streaming Quantitative Analytic..." (Flink Forward)
The application of Quantitative Analytics to trades for the generation of Risk and P&L metrics has traditionally followed a batch-based approach. Regulatory changes impose increasing compute demands on financial institutions, along with a growing demand for real-time analytics due to increased volumes in eTrading across all asset classes.
The talk is based on a use case for pricing Interest Rate Swaps, using Apache Beam, with a call to an external C++ analytics process. It describes the performance characteristics when operating in a non-cloud environment using Apache Flink as opposed to Google Cloud Dataflow.
The talk will touch upon the subtle differences when operating across multiple runners. It will make suggestions on approaches to portability when architecting for a multi-runner operational environment.
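As an illustration of that multi-runner portability (my sketch, not the speaker's code), a Beam pipeline is written once against the portable SDK and the runner is selected at launch time, e.g. --runner=FlinkRunner on-premises or --runner=DataflowRunner on Google Cloud. The file paths and the pricing step below are hypothetical stand-ins.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptors;

public class SwapPricingPipeline {
    public static void main(String[] args) {
        // The runner (Flink, Dataflow, ...) is picked from the command line,
        // e.g. --runner=FlinkRunner; the pipeline code itself stays unchanged.
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        Pipeline pipeline = Pipeline.create(options);

        pipeline
            .apply("ReadTrades", TextIO.read().from("trades/*.csv"))
            // Stand-in for the call out to the external C++ analytics process.
            .apply("PriceSwap", MapElements
                .into(TypeDescriptors.strings())
                .via((String trade) -> trade + ",priced"))
            .apply("WriteResults", TextIO.write().to("priced-trades"));

        pipeline.run().waitUntilFinish();
    }
}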
Hadoop Summit - Scaling Uber’s Real-Time Infra for Trillion Events per Day (Ankur Bansal)
Building data pipelines is pretty hard! Building a multi-datacenter active-active real time data pipeline for multiple classes of data with different durability, latency and availability guarantees is much harder.
Real time infrastructure powers critical pieces of Uber (think Surge) and in this talk we will discuss our architecture, technical challenges, learnings and how a blend of open source infrastructure (Apache Kafka and Samza) and in-house technologies have helped Uber scale.
Observability for developers (Inny So & Andrew Jones, ThoughtWorks) - Kafka Su... (confluent)
Have you ever tried to debug a production outage, when your system comprises apps your team has written, third-party apps your team runs, with logs going into one system, application performance metrics going into another system, and cloud platform metrics going somewhere else? Did you find yourself switching tabs, trying to correlate metrics with logs and alerts and finding yourself in a huge tangle? It is a nightmare. In the data world, we talk about aggregating all our data so we can derive new insights quickly, but what about our operational data? Observability is your ability to ask questions of your system without having to write new code or grab new data. When you've got an observable system, it feels like you have debugging superpowers, but it can be challenging to even know where to start. And even once you've convinced your colleagues to start, finding the right tools can be challenging. In this talk Inny and Andrew will discuss why monitoring and logging are not sufficient anymore (if they ever were), cover observability basics, and demo an observability platform that you can use to start your observability journey today.
This talk aims to present some data-related microservices concepts: the duality between streams and tables, stream processing concepts and patterns, and code examples using Kafka Streams.
Apache Kafka as Event-Driven Open Source Streaming Platform (Prague Meetup) (Kai Wähner)
From Prague Kafka Meetup in November 2018.
This session introduces Apache Kafka as an event-driven open source streaming platform. Apache Kafka goes far beyond scalable, high-volume messaging. In addition, you can leverage Kafka Connect for integration and the Kafka Streams API for building lightweight stream processing microservices in autonomous teams. The open source Confluent Platform adds further components such as KSQL, Schema Registry, REST Proxy, clients for different programming languages and connectors for different technologies and databases.
Apache Kafka vs. Integration Middleware (MQ, ETL, ESB) (Kai Wähner)
Learn the differences between an event-driven streaming platform and middleware like MQ, ETL and ESBs – including best practices and anti-patterns, but also how these concepts and tools complement each other in an enterprise architecture.
Extract-Transform-Load (ETL) is still a widely-used pattern to move data between different systems via batch processing. Due to its challenges in today’s world where real time is the new standard, an Enterprise Service Bus (ESB) is used in many enterprises as integration backbone between any kind of microservice, legacy application or cloud service to move data via SOAP / REST Web Services or other technologies. Stream Processing is often added as its own component in the enterprise architecture for correlation of different events to implement contextual rules and stateful analytics. Using all these components introduces challenges and complexities in development and operations.
This session discusses how teams in different industries solve these challenges by building a native streaming platform from the ground up instead of using ETL and ESB tools in their architecture. This makes it possible to build and deploy independent, mission-critical real-time streaming applications and microservices. The architecture leverages distributed processing and fault tolerance with fast failover, no-downtime rolling deployments and the ability to reprocess events, so you can recalculate output when your code changes. Integration and stream processing remain key functionality but can be realized natively in real time instead of using additional ETL, ESB or stream processing tools.
MongoDB World 2019: Streaming ETL on the Shoulders of Giants (MongoDB)
Life doesn't happen in batch mode which is why application engineers and data architects need to closely cooperate to get the best out of streaming platforms like Apache Kafka and NoSQL data stores such as MongoDB. This session explores ways and means to integrate both worlds in a streaming fashion.
FLiP Into Trino
FLiP into Trino: Flink + Pulsar + Trino
Pulsar SQL (Trino/Presto)
Remember the days when you could wait until your batch data load was done and then run some simple queries or build stale dashboards? Those days are over; today you need instant analytics as the data streams in, in real time. You need universal analytics wherever that data is. I will show you how to do this utilizing the latest cloud-native open source tools. In this talk we will utilize Trino, Apache Pulsar, Pulsar SQL and Apache Flink to instantly analyze data from IoT sensors, transportation systems, logs, REST endpoints, XML, images, PDFs, documents, text, semi-structured data, unstructured data, structured data and a hundred data sources you could never dream of streaming before. I will teach you how to use Pulsar SQL to run analytics on live data.
Tim Spann, Developer Advocate, StreamNative
David Kjerrumgaard, Developer Advocate, StreamNative
https://www.starburst.io/info/trinosummit/
https://github.com/tspannhw/FLiP-Into-Trino/blob/main/README.md
https://github.com/tspannhw/StreamingAnalyticsUsingFlinkSQL/tree/main/src/main/java
-- Pulsar SQL (Trino) query over a live Pulsar topic; 'weather' is a topic in the public/default namespace:
select * from pulsar."public/default"."weather";
Apache Pulsar plus Trino = fast analytics at scale
Apache Kafka vs. Integration Middleware (MQ, ETL, ESB) - Friends, Enemies or ... (confluent)
MQ, ETL and ESB middleware are often used as an integration backbone between legacy applications, modern microservices and cloud services. This introduces several challenges and complexities like point-to-point integrations or non-scalable architectures. This session discusses how to build a completely event-driven streaming platform leveraging Apache Kafka’s open source messaging, integration and streaming components to leverage distributed processing, fault tolerance, rolling upgrades and the ability to reprocess events. Learn the differences between an event-driven streaming platform leveraging Apache Kafka and middleware like MQ, ETL and ESBs – including best practices and anti-patterns, but also how these concepts and tools complement each other in an enterprise architecture.
Kafka Streams vs. KSQL for Stream Processing on top of Apache Kafka (Kai Wähner)
Spoilt for Choice – Kafka Streams vs. KSQL for Stream Processing on top of Apache Kafka:
Apache Kafka is a de facto standard streaming data processing platform. It is widely deployed as an event streaming platform. Part of Kafka is its stream processing API, “Kafka Streams”. In addition, the Kafka ecosystem now offers KSQL, a declarative, SQL-like stream processing language that lets you define powerful stream-processing applications easily. What once took some moderately sophisticated Java code can now be done at the command line with a familiar and eminently approachable syntax.
This session discusses and demos the pros and cons of Kafka Streams and KSQL to understand when to use which stream processing alternative for continuous stream processing natively on Apache Kafka infrastructures. The end of the session compares the trade-offs of Kafka Streams and KSQL to separate stream processing frameworks such as Apache Flink or Spark Streaming.
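To make the trade-off concrete, here is a hedged sketch (not from the session) of the same job in both styles; the topic names and the amount field are hypothetical. The KSQL version is a single declarative statement, shown as a comment above its Kafka Streams equivalent.

// The KSQL version, typed at the KSQL command line (assuming a 'payments'
// stream has already been declared over the underlying topic):
//
//   CREATE STREAM fraud_payments AS
//     SELECT * FROM payments WHERE amount > 10000;
//
// The equivalent Kafka Streams application:
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;

public class FraudFilter {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // For brevity the payment amount is the whole record value;
        // a real job would deserialize JSON/Avro and filter on a field.
        builder.stream("payments", Consumed.with(Serdes.String(), Serdes.Double()))
               .filter((key, amount) -> amount != null && amount > 10_000)
               .to("fraud_payments", Produced.with(Serdes.String(), Serdes.Double()));

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "fraud-filter");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        new KafkaStreams(builder.build(), props).start();
    }
}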
Now You See Me, Now You Compute: Building Event-Driven Architectures with Apa... (Michael Noll)
Talk URL: https://conferences.oreilly.com/strata/strata-ny/public/schedule/detail/77360
Abstract: Would you cross the street with traffic information that’s a minute old? Certainly not. Modern businesses have the same needs nowadays, whether it’s due to competitive pressure or because their customers have much higher expectations of how they want to interact with a product or service. At the heart of this movement are events: in today’s digital age, events are everywhere. Every digital action—across online purchases to ride-sharing requests to bank deposits—creates a set of events around transaction amount, transaction time, user location, account balance, and much more. The technology that allows businesses to read, write, store, and compute and process these events in real-time are event-streaming platforms, and tens of thousands of companies like Netflix, Audi, PayPal, Airbnb, Uber, and Pinterest have picked Apache Kafka as the de facto choice to implement event-driven architectures and reshape their industries.
Michael Noll explores why and how you can use Apache Kafka and its growing ecosystem to build event-driven architectures that are elastic, scalable, robust, and fault tolerant, whether it’s on-premises, in the cloud, on bare metal machines, or in Kubernetes with Docker containers. Specifically, you’ll look at Kafka as the storage and publish and subscribe layer; Kafka’s Connect framework for integrating external data systems such as MySQL, Elastic, or S3 with Kafka; and Kafka’s Streams API and KSQL as the compute layer to implement event-driven applications and microservices in Java and Scala and streaming SQL, respectively, that process the events flowing through Kafka in real time. Michael provides an overview of the most relevant functionality, both current and upcoming, and shares best practices and typical use cases so you can tie it all together for your own needs.
How to Build Streaming Apps with Confluent II (confluent)
In this interactive session, you’ll access a lab environment that shows you how to build Streaming Applications on top of Kafka, leveraging Confluent's modern tooling.
This is your exclusive opportunity to hear from the thought leaders of Apache Kafka on how event streaming enables you to leverage real-time data processing, with an easy-to-use, yet powerful interactive interface for stream processing, without the need to write code.
We have seen tremendous growth in near real-time ("nearline") processing at LinkedIn in recent years. LinkedIn now uses Apache Samza to process well over a trillion messages every day across thousands of applications. Apache Samza serves as the foundation for several application platforms at LinkedIn, spanning a wide variety of use cases like security, notifications, machine learning, monitoring, search, and more. In this talk we will explore various features of Apache Samza that provide the flexibility and scalability we need to power stream processing at massive scale.
Experiences in Architecting & Implementing Platforms using Serverless (Srushith Repakula)
In this talk, I share our experiences in architecting and implementing platforms such as KonfHub.com using a completely serverless approach from the ground up. We will discuss the benefits and disadvantages of adopting serverless and provide pointers and best practices for those who are planning to adopt serverless architectures.
Event Sourcing, Stream Processing and Serverless (Benjamin Stopford, Confluen... (confluent)
In this talk we’ll look at the relationship between three of the most disruptive software engineering paradigms: event sourcing, stream processing and serverless. We’ll debunk some of the myths around event sourcing. We’ll look at the inevitability of event-driven programming in the serverless space and we’ll see how stream processing links these two concepts together with a single ‘database for events’. As the story unfolds we’ll dive into some use cases, examine the practicalities of each approach, particularly the stateful elements, and finally extrapolate how their future relationship is likely to unfold. Key takeaways include: the different flavors of event sourcing and where their value lies; the difference between stream processing at application and infrastructure levels; the relationship between stream processors and serverless functions; and the practical limits of storing data in Kafka and stream processors like KSQL.
apidays LIVE India - Asynchronous and Broadcasting APIs using Kafka by Rohit ... (apidays)
apidays LIVE India 2021 - Connecting 1.3 billion digital innovators
May 20, 2021
Asynchronous and Broadcasting APIs using Kafka
Rohit Saxena, Software Development Consultant at Guardian Life
Streaming SQL to unify batch and stream processing: Theory and practice with ... (Fabian Hueske)
SQL is the lingua franca for querying and processing data. To this day, it provides non-programmers with a powerful tool for analyzing and manipulating data. But with the emergence of stream processing as a core technology for data infrastructures, can you still use SQL and bring real-time data analysis to a broader audience?
The answer is yes, you can. SQL fits into the streaming world very well and forms an intuitive and powerful abstraction for streaming analytics. More importantly, you can use SQL as an abstraction to unify batch and streaming data processing. Viewing streams as dynamic tables, you can obtain consistent results from SQL evaluated over static tables and streams alike and use SQL to build materialized views as a data integration tool.
Fabian Hueske and Shuyi Chen explore SQL’s role in the world of streaming data and its implementation in Apache Flink and cover fundamental concepts, such as streaming semantics, event time, and incremental results. They also share their experience using Flink SQL in production at Uber, explaining how Uber leverages Flink SQL to solve its unique business challenges and how the unified stream and batch processing platform enables both technical or nontechnical users to process real-time and batch data reliably using the same SQL at Uber scale.
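A small sketch of the dynamic-table idea (my illustration, not from the talk, with hypothetical table and field names and method names from recent Flink releases): the identical GROUP BY query can be evaluated once over a bounded table or maintained incrementally over a stream, depending only on the environment mode.

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class DynamicTablesSketch {
    public static void main(String[] args) {
        // Switch to EnvironmentSettings.inBatchMode() (with a bounded source,
        // e.g. datagen's 'number-of-rows' option) and the identical SQL below
        // computes one final result instead of a continuously updating one.
        TableEnvironment tableEnv =
            TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        tableEnv.executeSql(
            "CREATE TABLE clicks (user_name STRING, url STRING) " +
            "WITH ('connector' = 'datagen', 'rows-per-second' = '5')");

        // Viewed as a dynamic table, this GROUP BY maintains a materialized,
        // incrementally updated view: each new click may update a user's count.
        tableEnv.executeSql(
            "SELECT user_name, COUNT(url) AS cnt FROM clicks GROUP BY user_name")
            .print();
    }
}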
14. What is Stream Processing?
“[Stream processing] is some kind of computation over a data stream. First and foremost, a data stream is an abstraction representing an unbounded dataset. Unbounded means infinite and ever growing.”
Kafka: The Definitive Guide
16. Stream processing is a programming paradigm...
[Slide diagram: a latency/throughput spectrum placing streaming processing between request-response (low latency) and batch processing (high throughput).]
17. The world always changes, and sometimes we are interested in the events that caused those changes, whereas other times we are interested in the current state of the world…
Stream-Table Duality
20. Systems that allow you to transition back and forth between the two ways of looking at data are more powerful than systems that support just one.
- Neha Narkhede (Kafka: The Definitive Guide)
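In the Kafka Streams DSL that back-and-forth transition looks roughly like this (a hedged sketch with hypothetical topic names): a stream of change events is aggregated into a table, and the table's changelog is turned back into a stream.

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class StreamTableDuality {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Events: every individual page view (the changes to the world).
        KStream<String, String> pageViews =
            builder.stream("page-views", Consumed.with(Serdes.String(), Serdes.String()));

        // State: the current view count per user (the state of the world),
        // built by aggregating the stream into a table.
        KTable<String, Long> viewsPerUser = pageViews
            .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
            .count();

        // And back again: the table's changelog is itself a stream of updates.
        viewsPerUser.toStream()
            .to("views-per-user", Produced.with(Serdes.String(), Serdes.Long()));

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "duality-demo");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        new KafkaStreams(builder.build(), props).start();
    }
}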
21. Kafka vs. Kafka Streams
Kafka:
● Distributed log
● Highly available
● Used by ⅓ of the Fortune 500
● APIs:
○ Producer
○ Consumer
○ Connect
○ Streams
Kafka Streams:
● Part of the Kafka ecosystem
● Just a library
● Simple API
● DSL