Kafka is a high-throughput, fault-tolerant, scalable platform for building high-volume, near-real-time data pipelines. This presentation is about tuning Kafka pipelines for high performance.
It discusses select configuration parameters and deployment topologies essential to achieving higher throughput and lower latency across the pipeline, along with lessons learned in troubleshooting and optimizing a truly global data pipeline that replicates 100 GB of data in under 25 minutes.
This talk gives a brief introduction to Apache Kafka and describes its usage as a platform for streaming data. It will introduce some of the newer components of Kafka that help make this possible, including Kafka Connect, a framework for capturing continuous data streams, and Kafka Streams, a lightweight stream processing library.
Kafka Streams is a new stream processing library natively integrated with Kafka. It has a very low barrier to entry, easy operationalization, and a natural DSL for writing stream processing applications. As such it is the most convenient yet scalable option to analyze, transform, or otherwise process data that is backed by Kafka. We will provide the audience with an overview of Kafka Streams including its design and API, typical use cases, code examples, and an outlook on its upcoming roadmap. We will also compare Kafka Streams' lightweight library approach with heavier, framework-based tools such as Spark Streaming or Storm, which require you to understand and operate a whole different infrastructure for processing real-time data in Kafka.
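To make the "low barrier to entry" claim concrete, here is a minimal sketch of a Kafka Streams application in Java using the current StreamsBuilder API; the broker address and topic names are illustrative placeholders.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import java.util.Properties;

public class StreamsSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "demo-app");          // instances sharing this id share the work
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> source = builder.stream("input-topic");
        source.filter((key, value) -> value != null && !value.isEmpty())
              .mapValues(value -> value.toUpperCase())
              .to("output-topic");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

Note that this is a plain Java application rather than a cluster job: scaling out is simply running more instances with the same application.id.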
Integrating Apache Kafka Into Your Environment (confluent)
Watch this talk here: https://www.confluent.io/online-talks/integrating-apache-kafka-into-your-environment-on-demand
Integrating Apache Kafka with other systems in a reliable and scalable way is a key part of an event streaming platform. This session will show you how to get streams of data into and out of Kafka with Kafka Connect and REST Proxy, maintain data formats and ensure compatibility with Schema Registry and Avro, and build real-time stream processing applications with Confluent KSQL and Kafka Streams.
This session is part 4 of 4 in our Fundamentals for Apache Kafka series.
Kafka Tutorial - Introduction to Apache Kafka (Part 1) (Jean-Paul Azar)
Why is Kafka so fast? Why is Kafka so popular? Why Kafka? This slide deck is a tutorial for the Kafka streaming platform. It covers Kafka architecture with some small examples from the command line, then expands on this with a multi-server example to demonstrate failover of brokers as well as consumers. It then goes through some simple Java client examples for a Kafka producer and a Kafka consumer. We have also expanded the Kafka design section and added references. The tutorial covers Avro and the Schema Registry as well as advanced Kafka producers.
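For flavor, here is the shape of the simple Java producer example such a tutorial typically walks through; a minimal sketch in which the broker address and topic name are placeholders.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import java.util.Properties;

public class ProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "all"); // wait for the full in-sync replica set to acknowledge

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("demo-topic", "key-1", "hello kafka"),
                (metadata, exception) -> {
                    if (exception != null) exception.printStackTrace();
                    else System.out.printf("wrote to %s-%d@%d%n",
                            metadata.topic(), metadata.partition(), metadata.offset());
                });
        } // close() flushes any buffered records
    }
}
```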
Watch this talk here: https://www.confluent.io/online-talks/apache-kafka-architecture-and-fundamentals-explained-on-demand
This session explains Apache Kafka’s internal design and architecture. Companies like LinkedIn are now sending more than 1 trillion messages per day to Apache Kafka. Learn about the underlying design in Kafka that leads to such high throughput.
This talk provides a comprehensive overview of Kafka architecture and internal functions, including:
-Topics, partitions and segments
-The commit log and streams
-Brokers and broker replication
-Producer basics
-Consumers, consumer groups and offsets
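To make the last two bullets concrete, here is a minimal Java consumer sketch: consumers that share a group.id divide the topic's partitions among themselves, and committed offsets record the group's position. Broker address, group, and topic are placeholders.

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;
import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class ConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "demo-group"); // members of this group share the partitions
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false"); // commit offsets explicitly below

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("demo-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> r : records)
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            r.partition(), r.offset(), r.value());
                consumer.commitSync(); // record the group's progress (its offsets)
            }
        }
    }
}
```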
This session is part 2 of 4 in our Fundamentals for Apache Kafka series.
Increasingly, organizations are relying on Kafka for mission-critical use cases where high availability and fast recovery times are essential. In particular, enterprise operators need the ability to quickly migrate applications between clusters in order to maintain business continuity during outages. In many cases, out-of-order or missing records are entirely unacceptable. MirrorMaker is a popular tool for replicating topics between clusters, but it has proven inadequate for these enterprise multi-cluster environments. Here we present MirrorMaker 2.0, an upcoming all-new replication engine designed specifically to provide disaster recovery and high availability for Kafka. We describe various replication topologies and recovery strategies using MirrorMaker 2.0 and associated tooling.
Kafka's basic terminologies, its architecture, its protocol and how it works.
Kafka at scale, its caveats, guarantees and use cases offered by it.
How we use it @ZaprMediaLabs.
ksqlDB: A Stream-Relational Database System (confluent)
Speaker: Matthias J. Sax, Software Engineer, Confluent
ksqlDB is a distributed event streaming database system that allows users to express SQL queries over relational tables and event streams. The project was released by Confluent in 2017 and is hosted on GitHub and developed with an open-source spirit. ksqlDB is built on top of Apache Kafka®, a distributed event streaming platform. In this talk, we discuss ksqlDB's architecture, which is influenced by Apache Kafka and its stream processing library, Kafka Streams. We explain how ksqlDB executes continuous queries while achieving fault tolerance and high availability. Furthermore, we explore ksqlDB's streaming SQL dialect and the different types of supported queries.
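As a taste of that SQL dialect, the sketch below submits a statement to a ksqlDB server from Java over its REST endpoint. This is a hedged sketch: the stream definition and topic are illustrative, and it assumes a local ksqlDB server on its default port 8088.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class KsqlSketch {
    public static void main(String[] args) throws Exception {
        // A continuous "stream over a topic" definition, expressed in ksqlDB's SQL dialect
        String statement = "CREATE STREAM pageviews (user VARCHAR, page VARCHAR) "
                + "WITH (KAFKA_TOPIC='pageviews', VALUE_FORMAT='JSON');";
        String body = "{\"ksql\": \"" + statement + "\", \"streamsProperties\": {}}";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8088/ksql")) // assumed local server, default port
                .header("Content-Type", "application/vnd.ksql.v1+json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // the server returns the statement's result as JSON
    }
}
```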
Matthias J. Sax is a software engineer at Confluent working on ksqlDB. He mainly contributes to Kafka Streams, Apache Kafka's stream processing library, which serves as ksqlDB's execution engine. Furthermore, he helps evolve ksqlDB's "streaming SQL" language. In the past, Matthias also contributed to Apache Flink and Apache Storm and he is an Apache committer and PMC member. Matthias holds a Ph.D. from Humboldt University of Berlin, where he studied distributed data stream processing systems.
https://db.cs.cmu.edu/events/quarantine-db-talk-2020-confluent-ksqldb-a-stream-relational-database-system/
Apache Kafka is becoming the message bus for transferring huge volumes of data from various sources into Hadoop.
It is also enabling many real-time system frameworks and use cases.
Managing and building clients around Apache Kafka can be challenging. In this talk, we will go through best practices for deploying Apache Kafka
in production: how to secure a Kafka cluster, how to pick topic partitions, how to upgrade to newer versions, and how to migrate to the new Kafka producer and consumer APIs.
We will also cover best practices for running producers and consumers.
In the Kafka 0.9 release, we added SSL wire encryption, SASL/Kerberos for user authentication, and pluggable authorization. Kafka now allows authentication of users and access control over who can read from and write to a Kafka topic. Apache Ranger also uses a pluggable authorization mechanism to centralize security for Kafka and other Hadoop ecosystem projects.
We will showcase an open-sourced Kafka REST API and an Admin UI that help users create topics, reassign partitions, issue
Kafka ACLs, and monitor consumer offsets.
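The same administrative operations (creating topics, issuing ACLs) are also available programmatically through Kafka's Admin client, which is what such REST APIs and UIs typically wrap. A minimal sketch; the broker address, topic, and principal are placeholders.

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.acl.AccessControlEntry;
import org.apache.kafka.common.acl.AclBinding;
import org.apache.kafka.common.acl.AclOperation;
import org.apache.kafka.common.acl.AclPermissionType;
import org.apache.kafka.common.resource.PatternType;
import org.apache.kafka.common.resource.ResourcePattern;
import org.apache.kafka.common.resource.ResourceType;
import java.util.List;
import java.util.Properties;

public class AdminSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            // Create a topic with 6 partitions and replication factor 3
            admin.createTopics(List.of(new NewTopic("payments", 6, (short) 3))).all().get();

            // Issue an ACL: allow user "alice" to read the topic from any host
            AclBinding acl = new AclBinding(
                new ResourcePattern(ResourceType.TOPIC, "payments", PatternType.LITERAL),
                new AccessControlEntry("User:alice", "*", AclOperation.READ, AclPermissionType.ALLOW));
            admin.createAcls(List.of(acl)).all().get();
        }
    }
}
```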
Introduction to memcached, a caching service designed for optimizing performance and scaling in the web stack, seen from the perspective of MySQL/PHP users. Given for 2nd-year students of the professional bachelor in ICT at Kaho St. Lieven, Gent.
Jay Kreps is a Principal Staff Engineer at LinkedIn where he is the lead architect for online data infrastructure. He is among the original authors of several open source projects including a distributed key-value store called Project Voldemort, a messaging system called Kafka, and a stream processing system called Samza. This talk gives an introduction to Apache Kafka, a distributed messaging system. It will cover both how Kafka works, as well as how it is used at LinkedIn for log aggregation, messaging, ETL, and real-time stream processing.
Mario Molina, Software Engineer
CDC systems are usually used to identify changes in data sources and to capture and replicate those changes to other systems. Companies are using CDC to sync data across systems, to support cloud migrations, or even to apply stream processing, among other uses.
In this presentation we'll see CDC patterns, how to use them with Apache Kafka, and do a live demo!
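For a feel of what consuming change events looks like, here is a sketch of a Java consumer reading a Debezium-style change envelope (before/after/op) from a CDC topic. The topic name and envelope layout are illustrative assumptions, not tied to the talk's demo.

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;
import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class CdcConsumerSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "cdc-demo");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        ObjectMapper mapper = new ObjectMapper();
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("dbserver1.inventory.customers")); // hypothetical CDC topic
            while (true) {
                for (ConsumerRecord<String, String> r : consumer.poll(Duration.ofSeconds(1))) {
                    JsonNode envelope = mapper.readTree(r.value());
                    String op = envelope.path("op").asText(); // e.g. c=create, u=update, d=delete
                    JsonNode after = envelope.path("after");  // row state after the change
                    System.out.printf("op=%s after=%s%n", op, after);
                }
            }
        }
    }
}
```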
https://www.meetup.com/Mexico-Kafka/events/277309497/
Getting Started with Confluent Schema Registry (confluent)
Getting started with Confluent Schema Registry, Patrick Druley, Senior Solutions Engineer, Confluent
Meetup link: https://www.meetup.com/Cleveland-Kafka/events/272787313/
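To give a feel for what getting started looks like in code, here is a hedged Java sketch that produces Avro records through Confluent's KafkaAvroSerializer; with default settings the serializer registers the schema with Schema Registry on first use. Broker address, registry URL, topic, and schema are placeholders.

```java
import io.confluent.kafka.serializers.KafkaAvroSerializer;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import java.util.Properties;

public class AvroProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, KafkaAvroSerializer.class.getName());
        props.put("schema.registry.url", "http://localhost:8081"); // assumed local registry, default port

        // A tiny record schema; Schema Registry tracks its versions for compatibility checks
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":[{\"name\":\"name\",\"type\":\"string\"}]}");
        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "alice");

        try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("users-avro", "user-1", user));
        }
    }
}
```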
Kafka Tiered Storage separates compute and data storage into two independently scalable layers. Uber's Kafka Improvement Proposal (KIP) #405 describes two-tiered storage, which is a major step towards cloud-native Kafka. It stores the most recent data locally and offloads older data to a remote storage service. Operationally, the benefit is faster routine cluster maintenance activities. At LinkedIn, Kafka tiered storage is strongly desired to reduce the cost of running Kafka in the Azure cloud environment. As KIP-405 does not dictate the implementation of the remote storage substrate, LinkedIn's choice for tiering Kafka in Azure deployments is the Azure Blob Service. This presentation will begin with the motivation behind LinkedIn's efforts to adopt Kafka Tiered Storage. Next, the architecture of KIP-405 will be discussed. Finally, the Remote Storage Manager for Azure Blobs, which is a work in progress, will be presented.
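KIP-405 keeps the remote tier pluggable behind a broker-side interface, which is what makes an Azure Blob implementation possible. The sketch below is purely illustrative of that shape; the names and signatures are hypothetical and deliberately simplified, not the actual KIP-405 API.

```java
import java.io.InputStream;

/**
 * Hypothetical, simplified sketch of a pluggable remote-segment store in the
 * spirit of KIP-405. Not the real Apache Kafka interface.
 */
public interface RemoteSegmentStore {

    /** Upload a closed log segment (plus indexes) to remote storage, e.g. Azure Blobs. */
    void copySegment(String topic, int partition, long baseOffset, InputStream segmentData)
            throws Exception;

    /** Stream a byte range of a previously uploaded segment back to a fetching broker. */
    InputStream fetchSegment(String topic, int partition, long baseOffset, int startPos, int endPos)
            throws Exception;

    /** Remove remote data once retention expires. */
    void deleteSegment(String topic, int partition, long baseOffset) throws Exception;
}
```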
Video: https://youtu.be/V5gaBE5CMwg?t=1387
Slides from #PromCon2018 Munich.
https://promcon.io/2018-munich/talks/thanos-prometheus-at-scale/
Bartłomiej Płotka
Fabian Reinartz
The Prometheus monitoring system has been thriving for several years. Along with its powerful data model, operational simplicity and reliability have been key factors in its success. However, some questions remain largely unaddressed to this day. How can we store historical data on the order of petabytes in a reliable and cost-efficient way? Can we do so without sacrificing responsive query times? And what about a global view of all our metrics and transparent handling of HA setups?
Thanos takes Prometheus' strong foundations and extends them into a clustered, yet coordination-free, globally scalable metric system. It retains Prometheus's simple operational model and even simplifies deployments further. Under the hood, Thanos uses highly cost-efficient object storage that's available in virtually all environments today. By building directly on top of the storage format introduced with Prometheus 2.0, Thanos achieves near real-time responsiveness even for cold queries against historical data. All while having virtually no cost overhead beyond that of the underlying object storage.
We will show the theoretical concepts behind Thanos and demonstrate how it seamlessly integrates into existing Prometheus setups.
Presentation at Strata Data Conference 2018, New York
The controller is the brain of Apache Kafka. A big part of what the controller does is to maintain the consistency of the replicas and determine which replica can be used to serve the clients, especially during individual broker failure.
Jun Rao outlines the main data flow in the controller—in particular, when a broker fails, how the controller automatically promotes another replica as the leader to serve the clients, and when a broker is started, how the controller resumes the replication pipeline in the restarted broker.
Jun then describes recent improvements to the controller that allow it to handle certain edge cases correctly and increase its performance, which allows for more partitions in a Kafka cluster.
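One way to observe the controller's work from the outside is to describe a topic before and after a broker is stopped: the leader and ISR of the affected partitions change as the controller promotes replicas. A minimal sketch using Kafka's Admin client; broker address and topic name are placeholders.

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;
import java.util.List;
import java.util.Properties;

public class LeaderWatchSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            TopicDescription desc = admin.describeTopics(List.of("demo-topic"))
                    .all().get().get("demo-topic");
            for (TopicPartitionInfo p : desc.partitions()) {
                // After a broker failure, leader() moves to a surviving replica
                // and the failed broker drops out of isr()
                System.out.printf("partition=%d leader=%s isr=%s%n",
                        p.partition(), p.leader(), p.isr());
            }
        }
    }
}
```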
Kafka on Kubernetes: Keeping It Simple (Nikki Thean, Etsy) Kafka Summit SF 2019 (confluent)
Cloud migration: it's practically a rite of passage for anyone who's built infrastructure on bare metal. When we migrated our 5-year-old Kafka deployment from the datacenter to GCP, we were faced with the task of making our highly mutable server infrastructure more cloud-friendly. This led to a surprising decision: we chose to run our Kafka cluster on Kubernetes. I'll share war stories from our Kafka migration journey, explain why we chose Kubernetes over arguably simpler options like GCP VMs, and present the lessons we learned while making our way toward a stable and self-healing Kubernetes deployment. I'll also go through some improvements in the more recent Kafka releases that make upgrades crucial for any Kafka deployment on immutable and ephemeral infrastructure. You'll learn what happens when you try to run one complex distributed system on top of another, and come away with some handy tricks for automating cloud cluster management, plus some migration pitfalls to avoid. And if you're not sure whether running Kafka on Kubernetes is right for you, our experiences should provide some extra data points that you can use as you make that decision.
Spend some time working with OpenAPI and gRPC and you’ll notice that these two technologies have a lot in common. Both are open source efforts, both describe APIs, and both promise better experiences for API producers and consumers. So why do we need both? If we do, what value does each provide? What can each project learn from the other? We’ll bring the two together for a side-by-side comparison and pose answers to these and other questions about two API methodologies that will do much to influence the future of networked APIs.
The mission to μServices, should anyone choose to accept it, typically starts with a set of approaches and patterns around system design or deconstruction. The objective of these methods being to enable better isolation and autonomy for teams, data and processes. As the journey progresses, events typically appear as both a goal and an approach to enable even looser coupling and better scalability.
In this talk, Kingsley will share a rough overview of the eventing landscape and why events, immutable data, functions, and processes are key to developing scalable services. Concretely, pub/sub, event sourcing, and event storming will be covered, as well as experiences from building event-based services and frameworks. There will also be suggestions on where microservices are heading, and the close proximity between μServices and blockchain technology.
The talk will be both introductory and interactive, with life vests and support provided.
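As a tiny, self-contained illustration of the event-sourcing idea mentioned above, the Java sketch below rebuilds state by folding over an immutable event log. The domain is hypothetical, and the sealed-interface pattern switch needs a recent JDK (21+).

```java
import java.util.List;

public class EventSourcingSketch {
    // A sum type of immutable domain events
    sealed interface AccountEvent permits Deposited, Withdrawn {}
    record Deposited(long cents) implements AccountEvent {}
    record Withdrawn(long cents) implements AccountEvent {}

    public static void main(String[] args) {
        // The event log is the source of truth; state is derived from it
        List<AccountEvent> log = List.of(
                new Deposited(10_000), new Withdrawn(2_500), new Deposited(500));

        long balance = 0;
        for (AccountEvent e : log) { // current state = fold over the events
            balance += switch (e) {
                case Deposited d -> d.cents();
                case Withdrawn w -> -w.cents();
            };
        }
        System.out.println("balance = " + balance); // 8000
    }
}
```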
See: https://skillsmatter.com/skillscasts/10730-looking-forward-to-kingsley-davies-talk
Rethinking Stream Processing with Apache Kafka: Applications vs. Clusters, St... (Michael Noll)
My talk at Google DevFest Switzerland, Fribourg, Oct 2017.
https://devfest.ch/schedule/day1?sessionId=118
Abstract:
Modern businesses have data at their core, and this data is changing continuously. How can we harness this torrent of information in real-time? The answer is stream processing, and the technology that has since become the core platform for streaming data is Apache Kafka.
Among the thousands of companies that use Kafka to transform and reshape their industries are the likes of Netflix, Uber, PayPal, and AirBnB, but also established players such as Goldman Sachs, Cisco, and Oracle. Unfortunately, today’s common architectures for real-time data processing at scale suffer from complexity: there are many technologies that need to be stitched and operated together, and each individual technology is often complex by itself. This has led to a strong discrepancy between how we, as engineers, would like to work vs. how we actually end up working in practice.
In this session we talk about how Apache Kafka helps you to radically simplify your data architectures. We cover how you can now build normal applications to serve your real-time processing needs — rather than building clusters or similar special-purpose infrastructure — and still benefit from properties such as high scalability, distributed computing, and fault-tolerance, which are typically associated exclusively with cluster technologies. We discuss common use cases to realize that stream processing in practice often requires database-like functionality, and how Kafka allows you to bridge the worlds of streams and databases when implementing your own core business applications (inventory management for large retailers, patient monitoring in healthcare, fleet tracking in logistics, etc), for example in the form of event-driven, containerized microservices. We will also give a brief shout-out to the recently launched KSQL, a streaming SQL engine for Apache Kafka.
Being Ready for Apache Kafka - Apache: Big Data Europe 2015 (Michael Noll)
These are the slides of my Kafka talk at Apache: Big Data Europe in Budapest, Hungary. Enjoy! --Michael
Apache Kafka is a high-throughput distributed messaging system that has become a mission-critical infrastructure component for modern data platforms. Kafka is used across a wide range of industries by thousands of companies such as Twitter, Netflix, Cisco, PayPal, and many others.
After a brief introduction to Kafka, this talk will provide an update on the growth and status of the Kafka project community. The rest of the talk will focus on walking the audience through what's required to put Kafka into production. We'll give an overview of the current ecosystem of Kafka, including: client libraries for creating your own apps; operational tools; peripheral components required for running Kafka in production and for integration with other systems like Hadoop. We will cover the upcoming project roadmap, which adds key features to make Kafka even more convenient to use and more robust in production.
Apache Storm 0.9 basic training - Verisign (Michael Noll)
Apache Storm 0.9 basic training (130 slides) covering:
1. Introducing Storm: history, Storm adoption in the industry, why Storm
2. Storm core concepts: topology, data model, spouts and bolts, groupings, parallelism
3. Operating Storm: architecture, hardware specs, deploying, monitoring
4. Developing Storm apps: Hello World, creating a bolt, creating a topology, running a topology, integrating Storm and Kafka, testing, data serialization in Storm, example apps, performance and scalability tuning
5. Playing with Storm using Wirbelsturm
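As a taste of item 4 (creating a bolt and a topology), here is a minimal sketch. It uses the modern org.apache.storm package names; 0.9-era code used the backtype.storm namespace instead. TestWordSpout ships with Storm's testing utilities.

```java
import org.apache.storm.testing.TestWordSpout;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class ExclaimBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        // Append "!" to each incoming word and emit it downstream
        collector.emit(new Values(input.getString(0) + "!"));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }

    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("words", new TestWordSpout(), 1);                         // demo source
        builder.setBolt("exclaim", new ExclaimBolt(), 2).shuffleGrouping("words"); // parallelism hint 2
        // builder.createTopology() would then be submitted via LocalCluster or StormSubmitter
    }
}
```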
Audience: developers, operations, architects
Created by Michael G. Noll, Data Architect, Verisign, https://www.verisigninc.com/
Verisign is a global leader in domain names and internet security.
Tools mentioned:
- Wirbelsturm (https://github.com/miguno/wirbelsturm)
- kafka-storm-starter (https://github.com/miguno/kafka-storm-starter)
Blog post at:
http://www.michael-noll.com/blog/2014/09/15/apache-storm-training-deck-and-tutorial/
Many thanks to the Twitter Engineering team (the creators of Storm) and the Apache Storm open source community!
Apache Kafka 0.8 basic training - Verisign (Michael Noll)
Apache Kafka 0.8 basic training (120 slides) covering:
1. Introducing Kafka: history, Kafka at LinkedIn, Kafka adoption in the industry, why Kafka
2. Kafka core concepts: topics, partitions, replicas, producers, consumers, brokers
3. Operating Kafka: architecture, hardware specs, deploying, monitoring, P&S tuning
4. Developing Kafka apps: writing to Kafka, reading from Kafka, testing, serialization, compression, example apps
5. Playing with Kafka using Wirbelsturm
Audience: developers, operations, architects
Created by Michael G. Noll, Data Architect, Verisign, https://www.verisigninc.com/
Verisign is a global leader in domain names and internet security.
Tools mentioned:
- Wirbelsturm (https://github.com/miguno/wirbelsturm)
- kafka-storm-starter (https://github.com/miguno/kafka-storm-starter)
Blog post at:
http://www.michael-noll.com/blog/2014/08/18/apache-kafka-training-deck-and-tutorial/
Many thanks to the LinkedIn Engineering team (the creators of Kafka) and the Apache Kafka open source community!
Capital One Delivers Risk Insights in Real Time with Stream Processing (confluent)
Speakers: Ravi Dubey, Senior Manager, Software Engineering, Capital One + Jeff Sharpe, Software Engineer, Capital One
Capital One supports interactions with real-time streaming transactional data using Apache Kafka®. Kafka helps deliver information to internal operation teams and bank tellers to assist with assessing risk and protecting customers in a myriad of ways.
Inside the bank, Kafka allows Capital One to build a real-time system that takes advantage of modern data and cloud technologies without exposing customers to unnecessary data breaches, or violating privacy regulations. These examples demonstrate how a streaming platform enables Capital One to act on their visions faster and in a more scalable way through the Kafka solution, helping establish Capital One as an innovator in the banking space.
Join us for this online talk on lessons learned, best practices and technical patterns of Capital One’s deployment of Apache Kafka.
-Find out how Kafka delivers on a 5-second service-level agreement (SLA) for inside branch tellers.
-Learn how to combine and host data in-memory and prevent personally identifiable information (PII) violations of in-flight transactions.
-Understand how Capital One manages Kafka Docker containers using Kubernetes.
Watch the recording: https://videos.confluent.io/watch/6e6ukQNnmASwkf9Gkdhh69?.
Reducing Microservice Complexity with Kafka and Reactive Streams (jimriecken)
My talk from ScalaDays 2016 in New York on May 11, 2016:
Transitioning from a monolithic application to a set of microservices can help increase performance and scalability, but it can also drastically increase complexity. Layers of inter-service network calls add latency and an increasing risk of failure where previously only local function calls existed. In this talk, I'll speak about how to tame this complexity using Apache Kafka and Reactive Streams to:
- Extract non-critical processing from the critical path of your application to reduce request latency
- Provide back-pressure to handle both slow and fast producers/consumers
- Maintain high availability, high performance, and reliable messaging
- Evolve message payloads while maintaining backwards and forwards compatibility.
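To make the back-pressure point concrete, here is a self-contained Java sketch using the JDK's built-in Reactive Streams interfaces (java.util.concurrent.Flow): the subscriber signals demand one item at a time, so a fast producer cannot overrun a slow consumer. This illustrates the underlying protocol only, not the specific library stack used in the talk.

```java
import java.util.concurrent.Flow;
import java.util.concurrent.SubmissionPublisher;
import java.util.concurrent.TimeUnit;

public class BackpressureSketch {
    public static void main(String[] args) throws InterruptedException {
        try (SubmissionPublisher<Integer> publisher = new SubmissionPublisher<>()) {
            publisher.subscribe(new Flow.Subscriber<Integer>() {
                private Flow.Subscription subscription;

                @Override public void onSubscribe(Flow.Subscription s) {
                    subscription = s;
                    s.request(1); // pull-based demand: ask for one item
                }
                @Override public void onNext(Integer item) {
                    System.out.println("processing " + item);
                    subscription.request(1); // signal more demand only when ready
                }
                @Override public void onError(Throwable t) { t.printStackTrace(); }
                @Override public void onComplete() { System.out.println("done"); }
            });
            for (int i = 0; i < 5; i++) publisher.submit(i); // blocks if the subscriber's buffer is full
        }
        TimeUnit.SECONDS.sleep(1); // allow async delivery to finish before the JVM exits
    }
}
```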
Keystone processes over 1 trillion events per day with at-least-once processing semantics in the cloud. We will explore in detail how we have modified and leveraged Kafka, Samza, Docker, and Linux at scale to implement a multi-tenant pipeline in the Amazon AWS cloud within a year.
Azure Event Hubs - Behind the Scenes With Kasun Indrasiri | Current 2022 (HostedbyConfluent)
Azure Event Hubs is a hyperscale PaaS event stream broker with protocol support for HTTP, AMQP, and Apache Kafka RPC that accepts and forwards several trillion (!) events per day and is available in all global Azure regions. This session is a look behind the curtain where we dive deep into the architecture of Event Hubs and look at the Event Hubs cluster model, resource isolation, and storage strategies and also review some performance figures.
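Because Event Hubs exposes a Kafka-compatible endpoint, a stock Apache Kafka client can talk to it with configuration changes only. A hedged sketch; the namespace, event hub (topic) name, and connection string are placeholders.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import java.util.Properties;

public class EventHubsKafkaSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Event Hubs' Kafka endpoint listens on port 9093 of the namespace host
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "mynamespace.servicebus.windows.net:9093");
        props.put("security.protocol", "SASL_SSL");
        props.put("sasl.mechanism", "PLAIN");
        props.put("sasl.jaas.config",
            "org.apache.kafka.common.security.plain.PlainLoginModule required "
            + "username=\"$ConnectionString\" password=\"<event-hubs-connection-string>\";");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The event hub plays the role of the Kafka topic
            producer.send(new ProducerRecord<>("my-event-hub", "hello from a Kafka client"));
        }
    }
}
```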
Building Large-Scale Stream Infrastructures Across Multiple Data Centers with... (confluent)
By Jun Rao
From the Bay Area Apache Kafka September 2016 Meetup.
Abstract: To manage the ever-increasing volume and velocity of data within your company you have successfully made the transition from single machines and one-off solutions to large, distributed stream infrastructures in your data center powered by Apache Kafka. But what needs to be done if one data center is not enough? In this session we describe building resilient data pipelines with Apache Kafka that span multiple data centers and points of presence. We provide an overview of best practices and common patterns while covering key areas such as architecture guidelines, data replication and mirroring as well as disaster scenarios and failure handling.
This tutorial gives a brief and interesting introduction to modern stream computing technologies. Participants can learn the essential concepts and methodologies for designing and building an advanced stream processing system. The tutorial unveils the key fundamentals behind various kinds of design choices. Some forecast of technology developments in this domain is also introduced in the last section of the tutorial.
World of Tanks Experience of Using Kafka (Levon Avakyan)
In this paper I speak about BigWorld technology, the WoT server, Apache Kafka, and how we started to use them together: what difficulties we had and how we solved them.
Right-Sizing your SQL Server Virtual Machine (heraflux)
Virtualizing your top-tier production SQL Servers is not as easy as P2V'ing them. Sometimes allocating more resources to the VM is the wrong approach, and getting it wrong will silently hurt performance. What is the most effective method for determining the 'right' amount of resources to allocate? What happens if the workload changes a month from now?
The methods for understanding the performance of your mission-critical SQL Servers gathered over the past ten years of SQL Server virtualization will be addressed, and valuable processes for performance statistic collection and analysis will be displayed. Come learn how to properly ‘right-size’ the resources allocated to a VM, improve the performance of your SQL Servers, and keep it maximized well into the future.
Systematic Generation Data and Types in C++ (Sumant Tambe)
This presentation will discuss two classic techniques from the functional domain — composable data generators and property-based testing — implemented in C++14 for testing a generic serialization and deserialization library. We will look at a systematic technique of constructing data generators from a mere random number generator and random type generation using compile-time meta-programming. Along the way, we will discuss monoids, functors, and monads as we encounter them.
Variants have been around in C++ for a long time and C++17 now has std::variant. We will compare inheritance and std::variant for their ability to model sum-types (a fancy name for tagged unions). We will visit std::visit and discuss how it helps us model the pattern matching idiom. Immutability is one of the core pillars of Functional Programming (FP). C++ now allows you to model deep immutability; we'll see a way to do that using the standard library. We'll explore if `return std::move(*this)` makes any sense in C++. Immutability may be a reason for that.
Remote Log Analytics Using DDS, ELK, and RxJS (Sumant Tambe)
Autonomous Probing and Diagnostics for remote IT log data using RTI Connext Data Distribution Service (DDS), Elasticsearch-Logstash-Kibana (ELK), and Reactive Extensions for JavaScript (RxJS). Github: https://github.com/rticommunity/rticonnextdds-reactive/tree/master/javascript
Reactive Stream Processing Using DDS and Rx (Sumant Tambe)
In this presentation you will see why Reactive Extensions (Rx) is a powerful technology for asynchronous stream processing. RTI Data Distribution Service (DDS) will be used as the source of data and as a communication channel for asynchronous data streams. On top of DDS, we'll use Rx to subscribe, observe, project, filter, aggregate, merge, zip, and correlate one or more data streams (Observables). The live demo will be very visual as bouncing shapes of different colors will be transformed in front of you using C# lambdas, Rx.NET, and Visual Studio. You will also learn about the new Rx4DDS.NET library that integrates RTI DDS with Rx.NET. Rx and DDS are a great match because both are reactive. Rx is based on the subject-observer pattern, which is quite analogous to the publish-subscribe pattern of DDS. When used together they support distributed dataflows seamlessly. If time permits, we will touch upon advanced Rx concepts such as stream of streams (IGroupedObservable) and how it captures DDS "keyed topics". The DDS applications using Rx4DDS.NET dramatically simplify concurrency to the extent that it can be simply configured.
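For the flavor of the operators mentioned (observe, project, filter, merge), here is a minimal sketch in RxJava 3. Note this is Java rather than the C#/Rx.NET used in the demo, and the integer streams are stand-ins for DDS-fed Observables.

```java
import io.reactivex.rxjava3.core.Observable;

public class RxSketch {
    public static void main(String[] args) {
        // Two stand-in streams, e.g. readings arriving from two DDS topics
        Observable<Integer> sensorA = Observable.just(21, 35, 18);
        Observable<Integer> sensorB = Observable.just(40, 22, 31);

        Observable.merge(sensorA, sensorB)       // merge the streams
                .filter(t -> t > 30)             // keep only out-of-range readings
                .map(t -> "alert: " + t)         // project into alert messages
                .subscribe(System.out::println); // observe the results
    }
}
```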
Fun with Lambdas: C++14 Style (part 1) (Sumant Tambe)
If virtual functions in C++ imply design patterns, then C++ lambdas imply what? What does it really mean to have lambdas in C++? Frankly, I don't know but I've a hunch: It's BIG.
Just like virtual functions open doors to the OO paradigm, lambdas open doors to a different paradigm--the functional paradigm. This talk is not a praise of functional programming or some elusive lambda-based library. (Although, I'll mention one briefly that tops my list these days.) Instead, the goal is to have fun while working our way through some mind-bending examples of C++14 lambdas. Beware, your brain will hurt! Bring your laptop and code the examples right along because that may be the fastest way to answer the quiz.
An Extensible Architecture for Avionics Sensor Health Assessment Using DDS (Sumant Tambe)
Avionics Sensor Health Assessment is a sub-discipline of Integrated Vehicle Health Management (IVHM), which relates to the collection of sensor data, distributing it to diagnostics/prognostics algorithms, detecting run-time anomalies, and scheduling maintenance procedures. Real-time availability of sensor health diagnostics for aircraft (manned or unmanned) subsystems allows pilots and operators to improve operational decisions. Therefore, avionics sensor health assessments are used extensively in the mil-aero domain. As avionics platforms consist of a variety of hardware and software components, standards such as Open System Architecture for Condition-Based Maintenance (OSA-CBM) have emerged to facilitate integration and interoperability. However, OSA-CBM is a platform-independent standard that provides little guidance for avionics sensor health monitoring, which requires onboard health assessment of airborne sensors in real time. In this paper, we present a distributed architecture for avionics sensor health assessment using the Data Distribution Service (DDS), an Object Management Group (OMG) standard for developing loosely coupled high-performance real-time distributed systems. We use the data-centric publish/subscribe model supported by DDS for data acquisition, distribution, health monitoring, and presentation of diagnostics. We developed a normalized data model for exchanging the sensor and diagnostics information in a global data space in the system. Moreover, the Extensible and Dynamic Topic Types (XTypes) specification allows incremental evolution of any subset of system components without disrupting the overall health monitoring system. We believe the DDS standard, and in particular RTI Connext DDS, is a viable technology for implementing OSA-CBM for avionics systems due to its real-time characteristics and extremely low resource requirements. RTI Connext DDS is being used in other major avionics programs, such as FACE™ and UCS. We evaluated our approach to sensor health assessment in a hardware-in-the-loop simulation of an Inertial Measurement Unit (IMU) onboard a simulated General Atomics MQ-9 Reaper UAV. Our proof of concept effectively demonstrates real-time health monitoring of avionics sensors using a Bayesian Network-based analysis running on an extremely low-power and lightweight processing unit.
Overloading in Overdrive: A Generic Data-Centric Messaging Library for DDS (Sumant Tambe)
When it comes to sending data across a network, applications send either binary or self-describing data (XML). Both approaches have merits. Data Distribution Service (DDS) combines the best of both in what’s called “data-centric messaging”. DDS shares the type description once, upfront, and later on sends binary data that meets the type description. You typically use IDL or XSD to specify the types and run them through a code generator for type-safe wrapper APIs for your application in your programming language. Simple and fast! As it turns out, however, C++11 bends the rules once again. In this presentation you will learn about a template-based C++11 messaging library that gives the DDS code generator a run for its money. The types and objects in your C++11 application are mapped to standard DDS X-Types type descriptions and serialized format, respectively, using template meta-programming. If you have never heard about SFINAE you won’t stop talking about it after you see "overloading in overdrive" in this presentation. What’s more? I will share my newfound hatred for std::vector of bool/enums. This presentation will cover DDS-XTypes, DDS_TypeCode, DDS_DynamicData, STL, type_traits, Boost Fusion, and overloading with enable_if (lots and lots of it!).
Communication Patterns Using Data-Centric Publish/Subscribe (Sumant Tambe)
Fundamental to any distributed system are communication patterns: point-to-point, request-reply, transactional queues, and publish-subscribe. Large distributed systems often employ two or more communication patterns. Using a single middleware that supports multiple communication patterns is a very cost-effective way of developing and maintaining large distributed systems. This talk will begin with an introduction of Data Distribution Service (DDS) – an OMG standard – that supports data-centric publish-subscribe communication for real-time distributed systems. DDS separates state management and distribution from application logic and supports discoverable data models. The talk will then describe how RTI Connext Messaging goes beyond vanilla DDS and implements various communication patterns including request-reply, command-response, and guaranteed delivery. You will also learn how these patterns can be combined to create interesting variations when the underlying substrate is as powerful as DDS. We’ll also discuss APIs for creating high-performance applications using the request-reply communication pattern.
C++11 Idioms @ Silicon Valley Code Camp 2012 (Sumant Tambe)
C++11 feels like a new language. Compared to its previous standards, C++11 packs more language features and libraries designed to make C++ programs easier to understand and faster. As the community is building up experience with the new features, new stylistic ways of using them are emerging. These styles (a.k.a. idioms) give the new language its unique flavor. This talk will present emerging idioms of using rvalue references -- a marquee feature of C++11 as many renowned experts call it. You will see how C++11 opens new possibilities to design class interfaces. Finally, you will learn some advanced use-cases of rvalue references which will likely make you feel something amiss in this flagship feature of C++11.
Retargeting Embedded Software Stack for Many-Core Systems (Sumant Tambe)
Novel techniques are needed for high-performance applications to exploit massive local concurrency in many-core systems. Getting software applications to run faster on machines with more cores requires substantial restructuring of embedded software stacks, including applications, middleware, and the operating system (OS). Contemporary software stacks are not designed to exploit hundreds or thousands of cores. New OS and middleware mechanisms must be developed to handle scheduling, resource sharing, and communication in many-core systems. The solution must also provide high-level API to simplify development of concurrent software. In this session, we describe new mechanisms for scheduling and communication for many-core embedded platforms.
Native XML processing in C++ (BoostCon'11) (Sumant Tambe)
XML programming has emerged as a powerful data processing paradigm with its own rules for abstracting, partitioning, programming styles, and idioms. Seasoned XML programmers expect, and their productivity depends on the availability of languages and tools that allow usage of the patterns and practices native to the domain of XML programming. The object-oriented community, however, prefers XML data binding tools over dedicated XML languages because these tools automatically generate a statically-typed, vocabulary-specific object model from a given XML schema. Unfortunately, these tools often sidestep the expectations of seasoned XML programmers because of the difficulties in synthesizing abstractions of XML programming using purely object-oriented principles. This talk demonstrates how this prevailing gap can be significantly narrowed by a novel application of multi-paradigm programming capabilities of C++. In particular, how generic programming, meta-programming, generative programming, strategic programming, and operator overloading supported by C++ together enable native and typed XML programming.
How to Position Your Globus Data Portal for Success: Ten Good Practices (Globus)
Science gateways allow science and engineering communities to access shared data, software, computing services, and instruments. Science gateways have gained a lot of traction in the last twenty years, as evidenced by projects such as the Science Gateways Community Institute (SGCI) and the Center of Excellence on Science Gateways (SGX3) in the US, The Australian Research Data Commons (ARDC) and its platforms in Australia, and the projects around Virtual Research Environments in Europe. A few mature frameworks have evolved with their different strengths and foci and have been taken up by a larger community such as the Globus Data Portal, Hubzero, Tapis, and Galaxy. However, even when gateways are built on successful frameworks, they continue to face the challenges of ongoing maintenance costs and how to meet the ever-expanding needs of the community they serve with enhanced features. It is not uncommon that gateways with compelling use cases are nonetheless unable to get past the prototype phase and become a full production service, or if they do, they don't survive more than a couple of years. While there is no guaranteed pathway to success, it seems likely that for any gateway there is a need for a strong community and/or solid funding streams to create and sustain its success. With over twenty years of examples to draw from, this presentation goes into detail for ten factors common to successful and enduring gateways that effectively serve as best practices for any new or developing gateway.
May Marketo Masterclass, London MUG May 22 2024.pdf (Adele Miller)
Can't make Adobe Summit in Vegas? No sweat because the EMEA Marketo Engage Champions are coming to London to share their Summit sessions, insights and more!
This is a MUG with a twist you don't want to miss.
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx (rickgrimesss22)
Discover the essential features to incorporate in your Winzo clone app to boost business growth, enhance user engagement, and drive revenue. Learn how to create a compelling gaming experience that stands out in the competitive market.
top nidhi software solution free download (vrstrong314)
This presentation emphasizes the importance of data security and legal compliance for Nidhi companies in India. It highlights how online Nidhi software solutions, like Vector Nidhi Software, offer advanced features tailored to these needs. Key aspects include encryption, access controls, and audit trails to ensure data security. The software complies with regulatory guidelines from the MCA and RBI and adheres to Nidhi Rules, 2014. With customizable, user-friendly interfaces and real-time features, these Nidhi software solutions enhance efficiency, support growth, and provide exceptional member services. The presentation concludes with contact information for further inquiries.
Globus Connect Server Deep Dive - GlobusWorld 2024 (Globus)
We explore the Globus Connect Server (GCS) architecture and experiment with advanced configuration options and use cases. This content is targeted at system administrators who are familiar with GCS and currently operate—or are planning to operate—broader deployments at their institution.
Globus Compute with IRI Workflows - GlobusWorld 2024 (Globus)
As part of the DOE Integrated Research Infrastructure (IRI) program, NERSC at Lawrence Berkeley National Lab and ALCF at Argonne National Lab are working closely with General Atomics on accelerating the computing requirements of the DIII-D experiment. As part of this work, the team is investigating ways to speed up the time to solution for many different parts of the DIII-D workflow, including how they run jobs on HPC systems. One of these routes is looking at Globus Compute as a way to replace the current method for managing tasks, and we describe a brief proof of concept showing how Globus Compute could help to schedule jobs and be a tool to connect compute at different facilities.
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus... (Globus)
Large Language Models (LLMs) are currently the center of attention in the tech world, particularly for their potential to advance research. In this presentation, we'll explore a straightforward and effective method for quickly initiating inference runs on supercomputers using the vLLM tool with Globus Compute, specifically on the Polaris system at ALCF. We'll begin by briefly discussing the popularity and applications of LLMs in various fields. Following this, we will introduce the vLLM tool, and explain how it integrates with Globus Compute to efficiently manage LLM operations on Polaris. Attendees will learn the practical aspects of setting up and remotely triggering LLMs from local machines, focusing on ease of use and efficiency. This talk is ideal for researchers and practitioners looking to leverage the power of LLMs in their work, offering a clear guide to harnessing supercomputing resources for quick and effective LLM inference.
Paketo Buildpacks: the best way to build OCI images? DevopsDa... (Anthony Dahanne)
Buildpacks have been around for more than 10 years! At first, they were used to detect and build an application before deploying it to certain PaaS platforms. Then, with their latest generation, Cloud Native Buildpacks (a CNCF incubating project), we became able to create Docker (OCI) images with them. Are they a good alternative to the Dockerfile? What are the Paketo buildpacks? Which communities support them, and how?
Come find out during this ignite session.
Navigating the Metaverse: A Journey into Virtual Evolution (Donna Lenk)
Join us for an exploration of the Metaverse's evolution, where innovation meets imagination. Discover new dimensions of virtual events, engage with thought-provoking discussions, and witness the transformative power of digital realms.
An Enterprise Resource Planning (ERP) system includes various modules that reduce any business's workload. Additionally, it organizes workflows, which drives enhanced productivity. Here is a detailed explanation of the ERP modules; going through the points will help you understand how the software is changing work dynamics.
To know more, see: https://blogs.nyggs.com/nyggs/enterprise-resource-planning-erp-system-modules/
Cyaniclab: Software Development Agency Portfolio.pdf (Cyanic lab)
CyanicLab, an offshore custom software development company based in Sweden, India, and Finland, is your go-to partner for startup development and innovative web design solutions. Our expert team specializes in crafting cutting-edge software tailored to meet the unique needs of startups and established enterprises alike. From conceptualization to execution, we offer comprehensive services including web and mobile app development, UI/UX design, and ongoing software maintenance. Ready to elevate your business? Contact CyanicLab today and let us propel your vision to success with our top-notch IT solutions.
Into the Box Keynote Day 2: Unveiling amazing updates and announcements for modern CFML developers! Get ready for exciting releases and updates on Ortus tools and products. Stay tuned for cutting-edge innovations designed to boost your productivity.
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...Globus
The Earth System Grid Federation (ESGF) is a global network of data servers that archives and distributes the planet’s largest collection of Earth system model output for thousands of climate and environmental scientists worldwide. Many of these petabyte-scale data archives are located in proximity to large high-performance computing (HPC) or cloud computing resources, but the primary workflow for data users consists of transferring data, and applying computations on a different system. As a part of the ESGF 2.0 US project (funded by the United States Department of Energy Office of Science), we developed pre-defined data workflows, which can be run on-demand, capable of applying many data reduction and data analysis to the large ESGF data archives, transferring only the resultant analysis (ex. visualizations, smaller data files). In this talk, we will showcase a few of these workflows, highlighting how Globus Flows can be used for petabyte-scale climate analysis.
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...Juraj Vysvader
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I didn't get rich from it but it did have 63K downloads (powered possible tens of thousands of websites).
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...informapgpstrackings
Keep tabs on your field staff effortlessly with Informap Technology Centre LLC. Real-time tracking, task assignment, and smart features for efficient management. Request a live demo today!
For more details, visit us : https://informapuae.com/field-staff-tracking/
A Comprehensive Look at Generative AI in Retail App Testing.pdfkalichargn70th171
Traditional software testing methods are being challenged in retail, where customer expectations and technological advancements continually shape the landscape. Enter generative AI—a transformative subset of artificial intelligence technologies poised to revolutionize software testing.
Accelerate Enterprise Software Engineering with PlatformlessWSO2
Key takeaways:
Challenges of building platforms and the benefits of platformless.
Key principles of platformless, including API-first, cloud-native middleware, platform engineering, and developer experience.
How Choreo enables the platformless experience.
How key concepts like application architecture, domain-driven design, zero trust, and cell-based architecture are inherently a part of Choreo.
Demo of an end-to-end app built and deployed on Choreo.
Prosigns: Transforming Business with Tailored Technology SolutionsProsigns
Unlocking Business Potential: Tailored Technology Solutions by Prosigns
Discover how Prosigns, a leading technology solutions provider, partners with businesses to drive innovation and success. Our presentation showcases our comprehensive range of services, including custom software development, web and mobile app development, AI & ML solutions, blockchain integration, DevOps services, and Microsoft Dynamics 365 support.
Custom Software Development: Prosigns specializes in creating bespoke software solutions that cater to your unique business needs. Our team of experts works closely with you to understand your requirements and deliver tailor-made software that enhances efficiency and drives growth.
Web and Mobile App Development: From responsive websites to intuitive mobile applications, Prosigns develops cutting-edge solutions that engage users and deliver seamless experiences across devices.
AI & ML Solutions: Harnessing the power of Artificial Intelligence and Machine Learning, Prosigns provides smart solutions that automate processes, provide valuable insights, and drive informed decision-making.
Blockchain Integration: Prosigns offers comprehensive blockchain solutions, including development, integration, and consulting services, enabling businesses to leverage blockchain technology for enhanced security, transparency, and efficiency.
DevOps Services: Prosigns' DevOps services streamline development and operations processes, ensuring faster and more reliable software delivery through automation and continuous integration.
Microsoft Dynamics 365 Support: Prosigns provides comprehensive support and maintenance services for Microsoft Dynamics 365, ensuring your system is always up-to-date, secure, and running smoothly.
Learn how our collaborative approach and dedication to excellence help businesses achieve their goals and stay ahead in today's digital landscape. From concept to deployment, Prosigns is your trusted partner for transforming ideas into reality and unlocking the full potential of your business.
Join us on a journey of innovation and growth. Let's partner for success with Prosigns.
Check out the webinar slides to learn more about how XfilesPro transforms Salesforce document management by leveraging its world-class applications. For more details, please connect with sales@xfilespro.com
If you want to watch the on-demand webinar, please click here: https://www.xfilespro.com/webinars/salesforce-document-management-2-0-smarter-faster-better/
Quarkus Hidden and Forbidden ExtensionsMax Andersen
Quarkus has a vast extension ecosystem and is known for its subsonic and subatomic feature set. Some of these features are not as well known, and some extensions are less talked about, but that does not make them less interesting - quite the opposite.
Come join this talk to see some tips and tricks for using Quarkus and some of the lesser known features, extensions and development techniques.
3. Tuning Truly Global Production Kafka Pipelines
(Diagram: a Hadoop data source pushes the Venice feed into the East Coast Kafka cluster; Mirror-Maker clusters replicate it to the West Coast, Asia, the East Coast, and the Gulf Coast, where each region's Kafka + Venice cluster serves its local Venice consumers.)
4. But first, some basics…
• Kafka: a distributed messaging system rethought as a distributed commit log
(Diagram: Producers 1 and 2 write to Topic T on a two-broker Kafka cluster; Consumer Groups A (A1, A2) and B (B1) read from it. Topic T has 2 partitions, P0 and P1; P0' and P1' are replicas of P0 and P1, maintained via log replication.)
5. Moving Data Is Critical in Internet Companies
(Image Credit: Kafka Online Documentation)
6. Kafka Pipeline
• Why Kafka-based Pipelines
• Producer/Consumer Throughput and Time Decoupling
• Large, Reliable, Durable buffer
• Data replication for high availability of data
(Diagram: Producer → Source Kafka Cluster → Kafka Mirror-Maker Cluster → Destination Kafka Cluster → Consumer, with a commit log at each cluster.)
The main value Kafka provides to data pipelines is its ability to serve as a very large, reliable buffer between various stages in the pipeline, effectively decoupling producers and consumers of data within the pipeline.
7. Anatomy of a Kafka Pipeline
(Image Credit: Kafka Definitive Guide, O’Reilly)
8. Aspects of Kafka Pipelines
• Reliability and Availability
• Replication Topologies (Structure)
• Time Decoupling
• Durability
• Throughput
• Latency
• Data Integration and Schemas
• Transformations
• Fair Load Distribution
• Migration/Upgrades
• Topic Lifecycle Management
• DDoS Prevention and Quotas
• Auditing
9. Reliability and Availability
• Must avoid single points of failure
• Allow fast and automatic recovery
• Most systems need an at-least-once delivery guarantee
• Do not lose data
• But, be ready for duplicates
10. Replication Topologies
Hub and Spoke Architecture
(Image Credit: Kafka Definitive Guide, O’Reilly)
(Diagram: several Kafka clusters, each serving its own local apps, joined by replication links; each arrow is a Mirror-Maker cluster.)
Crossbar Architecture (LinkedIn)
There are many more replication topologies.
11. Kafka Pipelines in Industrial IoT
Coditation [link]
(Diagram: telemetry flowing through Kafka pipelines; dotted lines and shaded shapes denote passive replication.)
12. Durability (no-loss data pipeline)
• Durability interacts with throughput and latency
• Durability levels change depending upon producer configurations
Producer Configuration   Throughput   Latency   Durability          Ordered
acks=0                   High         Low       No guarantee        Yes
acks=1                   Medium       Medium    Leader              Yes
acks=all (-1)            Low          High      In-sync replicas    Yes
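To make the table concrete, here is a minimal, hedged sketch of setting acks on a Java producer; the broker address and topic are hypothetical placeholders, not values from this deck.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AcksDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // hypothetical broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // Durability/latency trade-off from the table above:
        //   "0"   -> no ack: highest throughput, lowest latency, no guarantee
        //   "1"   -> leader ack: medium throughput/latency, leader-only durability
        //   "all" -> (-1) all in-sync replicas: lowest throughput, strongest durability
        props.put("acks", "all");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("example-topic", "key", "value")); // hypothetical topic
        }
    }
}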
13. Throughput
• Producer and consumer throughputs are decoupled
• Add/Remove producers and consumers independently
• Throughput scales with cluster size
• Increase parallelization by increasing partitions
• Throughput also depends on co-location
• Remote consume throughput is much greater than remote produce
• A consumer fetch response can batch much more data than a produce request
(Diagram: a source Kafka cluster in Datacenter 1 and a destination Kafka cluster in Datacenter 2, each with its commit log; a Kafka Mirror-Maker cluster placed in the source datacenter does a "remote produce", while one placed in the destination datacenter does a "remote consume".)
14. Configurations For Tuning Throughput [link]
(Diagram: Producer → Source Kafka Cluster → Kafka Mirror-Maker Cluster → Destination Kafka Cluster → Consumer.)
Producer configurations: batch.size, linger.ms, compression.type, acks, max.in.flight.requests.per.connection, send.buffer.bytes (also TCP buffers)
Kafka broker configurations: num.replica.fetchers, replica.fetch.max.bytes, disable inter-broker SSL, socket.receive.buffer.bytes
KMM configurations: all producer and consumer configs are applicable; consumer-to-producer ratio
Consumer configurations: increase # of topic partitions, fetch.message.max.bytes, fetch.min.bytes
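A hedged sketch of the producer-side knobs listed above. The specific values here are illustrative assumptions, except where the deck states them later (slides 31 and 32 settle on batch.size = 1 MB and send.buffer.bytes = 10 MB for this pipeline); the compression codec is not stated in the deck.

import java.util.Properties;

public class ThroughputTuning {
    static Properties producerProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");          // hypothetical broker
        props.put("batch.size", "1048576");                      // 1 MB batches: fewer, fuller requests
        props.put("linger.ms", "10");                            // wait briefly so batches can fill
        props.put("compression.type", "gzip");                   // assumption: codec not stated in the deck
        props.put("acks", "all");                                // matches the setup slide (ack = -1)
        props.put("max.in.flight.requests.per.connection", "1"); // 1 preserves ordering
        props.put("send.buffer.bytes", "10485760");               // 10 MB TCP send buffer for WAN links
        return props;
    }
}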
15. Latency
• Typical latency is a few hundred milliseconds
• Latency SLA depends on availability SLA
• One 60-minute downtime in a week is 99.4% availability (assuming a weekly report)
• One 1-minute downtime in a week is 99.99% availability (assuming a weekly report)
• But the SLA can be fragile
• Large Mirror-Maker clusters can take minutes to rebalance
• Maintenance of Mirror-Maker clusters can take several minutes
• Bounce the Mirror-Maker cluster with 100% concurrency (to avoid repeated rebalances)
• Configurations that affect pipeline latency
• Producer linger.ms and acks
• Topic replication factor
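As a quick sanity check on the availability figures above (one week is 7 x 24 x 60 = 10,080 minutes):

\[
1 - \frac{60}{10080} \approx 0.994 \;(99.4\%),
\qquad
1 - \frac{1}{10080} \approx 0.9999 \;(99.99\%)
\]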
16. Data Integration and Schemas
• Kafka is schema agnostic
• But applications must be protected from backwards-incompatible changes to schemas
• Schema-registry
• Data Integration should support schema evolution
• Only backwards compatible schema evolution
• But bend the rules if/when needed
• Single topic with multiple schemas
• Propagate schema changes automatically through the pipeline
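One way to enforce "only backwards-compatible schema evolution" is to check a new reader schema against the old writer schema before deploying it. A sketch using Avro's SchemaCompatibility helper; the Click schema is a hypothetical example, not from the deck.

import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;

public class CompatCheck {
    public static void main(String[] args) {
        // Writer schema: the original record (hypothetical example).
        Schema v1 = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Click\",\"fields\":[" +
            "{\"name\":\"url\",\"type\":\"string\"}]}");
        // Reader schema: adds a field WITH a default, so old records still decode.
        Schema v2 = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Click\",\"fields\":[" +
            "{\"name\":\"url\",\"type\":\"string\"}," +
            "{\"name\":\"referrer\",\"type\":\"string\",\"default\":\"\"}]}");
        // Backward compatible: a v2 reader can read data written with v1.
        System.out.println(SchemaCompatibility
            .checkReaderWriterCompatibility(v2, v1).getType()); // COMPATIBLE
    }
}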
18. Fair Load Distribution
• Ideal: each Kafka Mirror-Maker should share the burden equally
• But
• When brokers go up/down, partition imbalance can happen because Preferred Leader Election is not run
• Imbalance in partitions and changes in partition leadership may cause KMM to exceed quotas
• Remedy: move partitions manually (see the sketch below)
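At the time, manual partition movement was done with Kafka's reassignment tooling; newer brokers also expose it through the Java Admin API. A sketch under that assumption, where the broker address, topic, and target broker ids are hypothetical:

import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewPartitionReassignment;
import org.apache.kafka.common.TopicPartition;

public class MovePartition {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // hypothetical
        try (AdminClient admin = AdminClient.create(props)) {
            // Move partition 0 of "venice-feed" onto brokers 2, 3, 4 to even out load.
            Map<TopicPartition, Optional<NewPartitionReassignment>> moves = Map.of(
                new TopicPartition("venice-feed", 0),
                Optional.of(new NewPartitionReassignment(List.of(2, 3, 4))));
            admin.alterPartitionReassignments(moves).all().get();
        }
    }
}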
19. Migration/Upgrades
• Upgrading hardware for brokers
• More cores
• More memory
• Faster NIC
• If you reduce # of brokers
• Must increase quotas
• Increase num.replica.fetchers
• Increase replica.fetch.response.max.bytes
20. Topic Lifecycle Management
• Topic creation
• The topic should be created in the destination cluster first
• If not, Mirror-Maker will start replicating the topic and may fail to produce (or a topic with default configs gets created)
• Topic deletion
• The topic should be deleted in the source cluster first
• But only when no one is producing or consuming
• If the topic is deleted in the source cluster while clients are still active, Mirror-Maker's metadata refresh will cause it to be recreated with default configs
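A minimal sketch of "create the topic in the destination cluster first" with the Java Admin client. The broker address and retention config are hypothetical; the partition count (200) and replication factor (3) match the deck's setup slide.

import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateDestTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "dest-broker1:9092"); // hypothetical destination cluster
        try (AdminClient admin = AdminClient.create(props)) {
            // Create the topic on the DESTINATION cluster with explicit configs
            // before Mirror-Maker starts replicating, so it never falls back to defaults.
            NewTopic topic = new NewTopic("venice-feed", 200, (short) 3)
                .configs(Map.of("retention.ms", "604800000")); // example config
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}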
21. DDoS Prevention and Quotas
• A Hadoop-to-Kafka pipeline gets DDoSed easily
• 800+ mappers in some cases
• Should use reducers instead
• Quotas on incoming byte rate (see the sketch below)
• The byte rate may be low, but request rate also matters
• Request-rate throttling is available in Kafka 0.11
• Mirror-Makers batch very well, so request-rate throttling is not necessarily needed
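At the time, byte-rate quotas were set with Kafka's config tooling; newer brokers expose them via the Java Admin API as well. A sketch under that assumption, where the broker address, client id, and the ~10 MB/s cap are all hypothetical:

import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.common.quota.ClientQuotaAlteration;
import org.apache.kafka.common.quota.ClientQuotaEntity;

public class SetQuota {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // hypothetical
        try (AdminClient admin = AdminClient.create(props)) {
            // Cap the Hadoop push job's incoming byte rate (illustrative value).
            ClientQuotaEntity entity = new ClientQuotaEntity(
                Map.of(ClientQuotaEntity.CLIENT_ID, "hadoop-push")); // hypothetical client id
            ClientQuotaAlteration alteration = new ClientQuotaAlteration(entity,
                List.of(new ClientQuotaAlteration.Op("producer_byte_rate", 10_485_760.0)));
            admin.alterClientQuotas(List.of(alteration)).all().get();
        }
    }
}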
23. Global PROD Kafka Pipelines for Venice
(Diagram: the same global topology as slide 3, with Hadoop pushing the Venice feed into the East Coast Kafka cluster and Kafka MM clusters replicating to the West Coast, Asia, the East Coast, and the Gulf Coast, each feeding a regional Kafka + Venice cluster and its Venice consumers; the links to the West Coast and Asia are annotated "Low throughput".)
24. The Slow Throughput Problem (One Topic Experiment)
(Chart: one-topic push timeline, with annotations at 22 min and 38 min.)
Replication to West Coast = 54 min
Replication to Asia = 180 min
25. CPU Utilization On Slow Mirror-Makers
(Chart callouts: "To West Coast" was slower; "To Asia" was the slowest.)

Mirror-Maker     Average CPU Util (aggregate)   Max CPU Util (aggregate)
To Gulf Coast    96%                            165%
To East Coast    104%                           165%
To West Coast    40%                            90%
To Asia          16%                            60%
27. Setup
• Producer Setup
• 100 GB data in each push from Hadoop
• 840 mappers producing data
• Kafka Broker Setup
• 4 large brokers, 32 cores each, 256 GB RAM each
• Broker replication over SSL
• Topic Replication Factor = 3
• Producer ACK = -1 (all)
• Partitions = 200
• Mirror Maker Setup
• 4 independent groups
• 10 processes in each cluster
• 8 consumers in each process
• 80 consumers in each pipeline
• It's CPU bound (due to decompression)
28. High Ping Latency
• From East Coast
East Coast   Gulf Coast   West Coast   Asia
0.025 ms     29 ms        67 ms        236 ms
29. Text Book Solution
• Don't remote produce. Prefer remote consume and local produce.
• Increase max.in.flight.requests.per.connection > 1
(Diagram: the proposed topology moves each Kafka MM cluster next to its destination region, so it consumes remotely from the East Coast Venice feed and produces locally into the East Coast, Gulf Coast, West Coast, and Asia Kafka + Venice clusters serving their Venice consumers.)
30. Text Book Solution Was Not Practical (at the moment)
• Must guarantee order (max.in.flight.requests.per.connection must be 1)
• Must open ACLs (firewall ports) for incoming remote connections; takes time
• Must have hardware capacity in the destination datacenter
31. Key Observations and Remedies
• High ping latency from the East Coast (table below)
• Four source brokers
• 150+ Under-Replicated Partitions (URP)
• 840 mappers (producers) is simply way too many → replaced by reducers
• SSL has overhead → disable inter-broker SSL
• Imbalanced response times
• Unequal workload on the brokers → should do manual replica movement to spread load evenly
• Kafka Mirror Maker
• Under-provisioned machines (4 cores only) → must change to 8 cores
• 200 partitions and 80 consumers → 2 or 3 partitions per consumer → each consumer talks to at most 3 brokers → inefficient fetch → must increase the # of partitions
• Producer batch.size = 100K → must increase batch size (1 MB max is allowed)
• Producer send.buffer.bytes = 128K → must increase send.buffer.bytes (to 10 MB)
• Just 1 producer per process → at most one request in flight at a time → can't change that because order must be preserved

East Coast   Gulf Coast   West Coast   Asia
0.025 ms     29 ms        67 ms        236 ms
32. The Solution That Saved the Week
• Remote produce
• max.in.flight = 1
• Increased batch.size to 1 MB and send.buffer.bytes to 10 MB
• But there was a bug: the producer estimated batch sizes incorrectly
• It sent larger-than-1 MB batches to the broker
• Sporadic REQUEST_TOO_LARGE exceptions shut down KMM
• Remedy: disabled compression estimation
• Pack a batch up to 1 MB, compress, and send
• Resulting compressed batch size is up to 650K (30% unutilized); see the sketch below
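Pulling the remedies together, a hedged sketch of the producer settings this solution lands on. The batch.size, send.buffer.bytes, max-in-flight, and acks values come from the slides; the broker address and codec are placeholders, and the compression-estimation behavior was an internal producer detail rather than a public config.

import java.util.Properties;

public class FinalProducerConfig {
    static Properties props() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "east-broker1:9092");     // hypothetical East Coast broker
        props.put("acks", "all");                                 // ack = -1 per the setup slide
        props.put("batch.size", "1048576");                       // raised from 100K to 1 MB
        props.put("send.buffer.bytes", "10485760");               // raised from 128K to 10 MB
        props.put("max.in.flight.requests.per.connection", "1");  // ordering must be preserved
        props.put("compression.type", "gzip");                    // assumption: codec not stated
        // With compression estimation disabled, the producer packs up to 1 MB of
        // raw records, compresses, then sends, so the compressed batch stays <= 1 MB.
        return props;
    }
}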
This decoupling, combined with reliability, security, and efficiency, makes Kafka a good fit for most data pipelines.
A fetch response sent to consumers can batch much more data than a produce request can.
Performance of compression types differs a lot.
KMM: set messageBatchSize to a high value (200K); run 1 consumer and 4 producers per process; use a small linger because the batches fill fast due to the CPU optimization.
Another way to increase throughput without increasing the partition count is to bump fetch.min.bytes to something like 20 MB, which allows more data to be fetched from a single partition. The downside is that there may be long GC pauses due to such big memory allocations.
When end-to-end latency requirements are in seconds, even availability % starts to matter