This document discusses how ChatWork migrated from the native blocking HBase client to the asynchronous, non-blocking asynchbase client. It describes problems they faced with the blocking client, such as long-running queries tying up threads. It then explains how asynchbase and Akka Streams were used to build an asynchronous, non-blocking interface. Performance tests showed the asynchronous approach improved throughput by 30% and reduced 95th-percentile latency from 1000ms to 200ms. Migrating to a non-blocking client made the system more resilient to partial failures.
Developing Secure Scala Applications With Fortify For Scala (Lightbend)
From banks to airlines to credit rating agencies, security continues to be a major focus for organizations across various industries. As the newspapers show, it’s heavily damaging to enterprises when security vulnerabilities in their code, infrastructure, or open source frameworks/libraries get exploited.
The good news is that your Scala development team now has a powerful ally for securing their applications. Co-developed by the Fortify team along with Lightbend, the upcoming Fortify for Scala Plugin is the only Static Application Security Testing (SAST) solution to use the official Scala compiler. This plugin automatically identifies code-level security vulnerabilities early in the SDLC, so you can confidently and reliably secure your mission-critical Scala-based applications.
In this webinar by Seth Tisue, Scala Committer and Senior Scala Engineer at Lightbend, and Poonam Yadav, Product Manager for Fortify at Micro Focus, you will learn about:
* Some of the more than 200 vulnerabilities that the Fortify plugin for Scala can catch and help you resolve,
* How the plugin works to analyze, identify and provide actionable recommendations,
* How to integrate it into your modern DevOps environment,
* Why this plugin was co-developed by Lightbend and the Fortify team, and how it benefits your organization’s security professionals / CISO office.
What's new in Confluent 3.2 and Apache Kafka 0.10.2 (Confluent)
With the introduction of the Connect and Streams APIs in 2016, Apache Kafka is becoming the de facto solution for anyone looking to build a streaming platform. The community continues to add capabilities to make it the complete solution for streaming data.
Join us as we review the latest additions in Apache Kafka 0.10.2. In addition, we’ll cover what’s new in Confluent Enterprise 3.2 that makes it possible to run Kafka at scale.
Putting Kafka In Jail – Best Practices To Run Kafka On Kubernetes & DC/OS (Lightbend)
Apache Kafka–part of Lightbend Fast Data Platform–is a distributed streaming platform that is best suited to run close to the metal on dedicated machines in statically defined clusters. For most enterprises, however, these fixed clusters are quickly becoming extinct in favor of mixed-use clusters that take advantage of all infrastructure resources available.
In this webinar by Sean Glover, Fast Data Engineer at Lightbend, we will review leading Kafka implementations on DC/OS and Kubernetes to see how they reliably run Kafka in container orchestrated clusters and reduce the overhead for a number of common operational tasks with standard cluster resource manager features. You will learn specifically about concerns like:
* The need for greater operational know-how to do common tasks with Kafka in static clusters, such as applying broker configuration updates, upgrading to a new version, and adding or decommissioning brokers.
* The best way to provide resources to stateful technologies while in a mixed-use cluster, noting the importance of disk space as one of Kafka’s most important resource requirements.
* How to address the particular needs of stateful services in a model that natively favors stateless, transient services.
Akka at Enterprise Scale: Performance Tuning Distributed Applications (Lightbend)
Organizations like Starbucks, HPE, and PayPal (see our customers) have selected the Akka toolkit for their enterprise scale distributed applications; and when it comes to squeezing out the best possible performance, the secret is using two particular modules in tandem: Akka Cluster and Akka Streams.
In this webinar by Nolan Grace, Senior Solution Architect at Lightbend, we look at these two Akka modules and discuss the features that will push your application architecture to the next tier of performance.
For the full blog post, including the video, visit: https://www.lightbend.com/blog/akka-at-enterprise-scale-performance-tuning-distributed-applications
Revitalizing Enterprise Integration with Reactive Streams (Lightbend)
With Viktor Klang, Deputy CTO, Lightbend, Inc.
As software grows more and more interconnected, and with several generations of software having to interoperate, a new take on the integration of systems is needed—ad hoc, unversioned, and unreplicated scripts just won’t suffice, and the traditional Enterprise Service Bus (ESB) concept has experienced stability, reliability, performance, and scalability problems.
In this webinar, Viktor explores a new take on Enterprise Integration Patterns:
First, he will explore the Reactive Streams standard, an orchestration layer where transformations are standalone, composable, reusable, and—most importantly—use asynchronous flow control (back pressure) to maintain predictable, stable behavior over time.
Furthermore, he will go through how one-off workloads relate to continuous and batch workloads, and how they can be addressed by that very same orchestration layer.
Finally, he will review how this type of design achieves resilience, scalability, and ultimately—responsiveness.
Building High-Throughput, Low-Latency Pipelines in Kafka (Confluent)
William Hill is one of the UK’s largest, most well-established gaming companies, with a global presence across 9 countries and over 16,000 employees. In recent years the gaming industry, and in particular sports betting, has been revolutionised by technology. Customers now demand a wide range of events and markets to bet on, both pre-game and in-play, 24/7. This has driven a business need to process more data, provide more updates, and offer more markets and prices in real time.
At William Hill, we have invested in a completely new trading platform using Apache Kafka. We process vast quantities of data from a variety of feeds; this data is fed through a variety of odds compilation models before being piped out to UI apps used by our trading teams to provide events, markets and pricing data to various endpoints across the whole of William Hill. We deal with thousands of sporting events, each with sometimes hundreds of betting markets, and each market receiving hundreds of updates. This adds up to vast numbers of messages flowing through our system, which we have to process, transform and route in real time. Using Apache Kafka, we have built a high-throughput, low-latency pipeline based on cloud-hosted microservices. When we started, we were on a steep learning curve with Kafka, microservices and associated technologies. This led to fast learnings and fast failings.
In this session, we will tell the story of what we built, what went well, what didn’t go so well and what we learnt. This is a story of how a team of developers learnt (and are still learning) how to use Kafka. We hope that you will be able to take away lessons and learnings of how to build a data processing pipeline with Apache Kafka.
Apache Kafka is an open-source message broker project, written in Scala, developed by the Apache Software Foundation. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds.
This is the first part of the presentation.
Here is the second part of this presentation:
http://www.slideshare.net/knoldus/introduction-to-apache-kafka-part-2
Building Stream Processing Applications with Apache Kafka's Exactly-Once Proc... (Matthias J. Sax)
This talk was given at the "Big Data Applications" Meetup group (https://www.meetup.com/BigDataApps/).
Abstract:
Kafka 0.11 added a new feature called "exactly-once guarantees". In this talk, we will explain what "exactly-once" means in the context of Kafka and data stream processing and how it affects application development. The talk will go into some details about exactly-once, namely the new idempotent producer and transactions, and how both can be exploited to simplify application code: for example, you don't need complex deduplication code in your input path, as you can rely on Kafka to deduplicate messages when data is produced by an upstream application. Transactions can be used to write multiple messages into different topics and/or partitions and commit all writes in an atomic manner (or abort all writes so none will be read by a downstream consumer in read-committed mode). Thus, transactions allow for applications with strong consistency guarantees, as in the financial sector (e.g., either send both a withdrawal and a deposit message to transfer money, or neither). Finally, we talk about Kafka's Streams API, which makes exactly-once stream processing as simple as it can get.
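To make the withdrawal/deposit example concrete, here is a minimal Scala sketch of Kafka's transactional producer API (the broker address, topic names and keys are placeholder assumptions, and production code would also handle fatal errors such as ProducerFencedException by closing the producer rather than aborting):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("transactional.id", "transfer-1") // enables idempotence and transactions

val producer = new KafkaProducer[String, String](props)
producer.initTransactions()
try {
  producer.beginTransaction()
  producer.send(new ProducerRecord("withdrawals", "account-A", "-100"))
  producer.send(new ProducerRecord("deposits", "account-B", "+100"))
  producer.commitTransaction() // both writes become visible atomically
} catch {
  case e: Exception =>
    producer.abortTransaction() // read_committed consumers will see neither write
    throw e
} finally {
  producer.close()
}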
Kafka for Microservices – You absolutely need Avro Schemas! | Gerardo Gutierr... (HostedbyConfluent)
Whether you are deploying a new application as microservices or transitioning from a monolithic database application to a cloud-ready architecture, you will inevitably face the decision of either creating a service mesh of APIs – or – using an event bus for better durability, reliability and extensibility of your application. If you choose to go the event bus route, Kafka is an excellent choice for several reasons. One key technology not to overlook is Avro schemas. They provide a definition for your event payload, just like an API, to ensure all of the event consumers can reliably consume the events. They also handle schema evolution as requirements change, and much, much more.
In this talk we will discuss all the nuances and considerations around using Avro Schemas for your JSON event payloads. From developer tools, to DevOps approaches, versioning, governance and some “gotchas” we found when working with Avro Schemas and the Confluent Schema Registry.
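As a small illustration of that contract idea, here is a hypothetical Avro schema and a record that conforms to it, sketched in Scala with the org.apache.avro library (the OrderPlaced schema is an invented example, not one from the talk); a nullable field with a default is the typical starting point for backward-compatible schema evolution:

import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericRecord}

val schemaJson =
  """{
    |  "type": "record",
    |  "name": "OrderPlaced",
    |  "namespace": "com.example.events",
    |  "fields": [
    |    {"name": "orderId", "type": "string"},
    |    {"name": "amount",  "type": "double"},
    |    {"name": "note",    "type": ["null", "string"], "default": null}
    |  ]
    |}""".stripMargin

val schema: Schema = new Schema.Parser().parse(schemaJson)

// build an event that conforms to the contract
val event: GenericRecord = new GenericData.Record(schema)
event.put("orderId", "o-123")
event.put("amount", 42.0)
// "note" keeps its null default, so consumers on the older schema stay compatible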
Big data event streaming is a very common part of any big data architecture. Of the available open source big data streaming technologies, Apache Kafka stands out because of its real-time, distributed, and reliable characteristics. This is possible because of the Kafka architecture. This talk highlights those features.
Watch this talk here: https://www.confluent.io/online-talks/how-apache-kafka-works-on-demand
Pick up best practices for developing applications that use Apache Kafka, beginning with a high level code overview for a basic producer and consumer. From there we’ll cover strategies for building powerful stream processing applications, including high availability through replication, data retention policies, producer design and producer guarantees.
We’ll delve into the details of delivery guarantees, including exactly-once semantics, partition strategies and consumer group rebalances. The talk will finish with a discussion of compacted topics, troubleshooting strategies and a security overview.
This session is part 3 of 4 in our Fundamentals for Apache Kafka series.
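To anchor the producer/consumer overview above, here is a minimal Scala consumer poll loop (broker address, group id and topic are placeholder assumptions; asScala assumes Scala 2.13's scala.jdk.CollectionConverters). The group.id is what ties the instance into a consumer group and its rebalances, and isolation.level connects to the delivery-guarantee discussion:

import java.time.Duration
import java.util.{Collections, Properties}
import scala.jdk.CollectionConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("group.id", "example-group") // membership in this group drives rebalances
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("isolation.level", "read_committed") // only read committed transactional writes

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(Collections.singletonList("example-topic"))
try {
  while (true) {
    val records = consumer.poll(Duration.ofMillis(500))
    records.asScala.foreach { r =>
      println(s"${r.topic}/${r.partition}@${r.offset}: ${r.key} -> ${r.value}")
    }
  }
} finally {
  consumer.close()
}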
Steps to Building a Streaming ETL Pipeline with Apache Kafka® and KSQL (Confluent)
Speaker: Robin Moffatt, Developer Advocate, Confluent
In this talk, we'll build a streaming data pipeline using nothing but our bare hands, the Kafka Connect API and KSQL. We'll stream data in from MySQL, transform it with KSQL and stream it out to Elasticsearch. Options for integrating databases with Kafka using CDC and Kafka Connect will be covered as well.
This is part 2 of 3 in Streaming ETL - The New Data Integration series.
Watch the recording: https://videos.confluent.io/watch/4cVXUQ2jCLgJNmg4kjCRqo?.
ksqlDB: A Stream-Relational Database System (Confluent)
Speaker: Matthias J. Sax, Software Engineer, Confluent
ksqlDB is a distributed event streaming database system that allows users to express SQL queries over relational tables and event streams. The project was released by Confluent in 2017 and is hosted on GitHub and developed with an open-source spirit. ksqlDB is built on top of Apache Kafka®, a distributed event streaming platform. In this talk, we discuss ksqlDB’s architecture, which is influenced by Apache Kafka and its stream processing library, Kafka Streams. We explain how ksqlDB executes continuous queries while achieving fault tolerance and high availability. Furthermore, we explore ksqlDB’s streaming SQL dialect and the different types of supported queries.
Matthias J. Sax is a software engineer at Confluent working on ksqlDB. He mainly contributes to Kafka Streams, Apache Kafka's stream processing library, which serves as ksqlDB's execution engine. Furthermore, he helps evolve ksqlDB's "streaming SQL" language. In the past, Matthias also contributed to Apache Flink and Apache Storm and he is an Apache committer and PMC member. Matthias holds a Ph.D. from Humboldt University of Berlin, where he studied distributed data stream processing systems.
https://db.cs.cmu.edu/events/quarantine-db-talk-2020-confluent-ksqldb-a-stream-relational-database-system/
Kafka Connect is a framework that connects Kafka with external systems. It helps to move data in and out of Kafka. Connect makes it simple to use existing connector configurations for common source and sink connectors.
Apache Kafka evolved from an enterprise messaging system to a fully distributed streaming data platform (Kafka Core + Kafka Connect + Kafka Streams) for building streaming data pipelines and streaming data applications.
This talk, which I gave at the Chicago Java Users Group (CJUG) on June 8th, 2017, focuses mainly on Kafka Streams, a lightweight open source Java library for building stream processing applications on top of Kafka, using Kafka topics as input/output.
You will learn more about the following:
1. Apache Kafka: a Streaming Data Platform
2. Overview of Kafka Streams: Before Kafka Streams? What is Kafka Streams? Why Kafka Streams? What are Kafka Streams key concepts? Kafka Streams APIs and code examples? (a minimal sketch follows this outline)
3. Writing, deploying and running your first Kafka Streams application
4. Code and Demo of an end-to-end Kafka-based Streaming Data Application
5. Where to go from here?
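As a flavor of what such a Kafka Streams application looks like (a minimal sketch with invented topic names, not code from the talk), the following Scala topology reads a topic, transforms each value and writes the result to another topic:

import java.util.Properties
import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.streams.{KafkaStreams, StreamsBuilder, StreamsConfig}
import org.apache.kafka.streams.kstream.{KStream, ValueMapper}

val props = new Properties()
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-app") // also used as the consumer group id
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass)
props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass)

val builder = new StreamsBuilder()
val input: KStream[String, String] = builder.stream("input-topic")
input
  .mapValues(new ValueMapper[String, String] {
    override def apply(value: String): String = value.toUpperCase // per-record transformation
  })
  .to("output-topic")

val streams = new KafkaStreams(builder.build(), props)
streams.start()
sys.addShutdownHook(streams.close())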
Kafka Connect: Real-time Data Integration at Scale with Apache Kafka, Ewen Ch... (Confluent)
Many companies are adopting Apache Kafka to power their data pipelines, including LinkedIn, Netflix, and Airbnb. Kafka’s ability to handle high throughput real-time data makes it a perfect fit for solving the data integration problem, acting as the common buffer for all your data and bridging the gap between streaming and batch systems.
However, building a data pipeline around Kafka today can be challenging because it requires combining a wide variety of tools to collect data from disparate data systems. One tool streams updates from your database to Kafka, another imports logs, and yet another exports to HDFS. As a result, building a data pipeline can take significant engineering effort and has high operational overhead, because all these different tools require ongoing monitoring and maintenance. Additionally, some of the tools are simply a poor fit for the job: the fragmented nature of the data integration tools ecosystem has led to creative but misguided solutions, such as misusing stream processing frameworks for data integration purposes.
We describe the design and implementation of Kafka Connect, Kafka’s new tool for scalable, fault-tolerant data import and export. First we’ll discuss some existing tools in the space and why they fall short when applied to data integration at large scale. Next, we will explore Kafka Connect’s design and how it compares to systems with similar goals, discussing key design decisions that trade off between ease of use for connector developers, operational complexity, and reuse of existing connectors. Finally, we’ll discuss how standardizing on Kafka Connect can ultimately lead to simplifying your entire data pipeline, making ETL into your data warehouse and enabling stream processing applications as simple as adding another Kafka connector.
Kinesis to Kafka Bridge is a Samza job that replicates AWS Kinesis to a configurable set of Kafka topics and vice versa. It enables integration between AWS and the rest of LinkedIn. It supports replicating streams in any LinkedIn fabric, any AWS account, and any AWS region. DynamoDB Stream to Kafka Bridge is built on top of Kinesis to Kafka Bridge. It enables data replication from AWS DynamoDB to LinkedIn. In this presentation we will talk about how we designed the system and how we use it at LinkedIn.
What happened when our biggest and most important Kafka cluster went rogue all of a sudden, and while trying to recover it, a single, crucial misconfiguration made things even worse?
At a company like Taboola, where service availability and latency are our top priority, this was a disaster.
With 300K messages/sec and 250TB of messages produced each day to our on-premise Kafka clusters, and mirrored to our central Kafka cluster, we always try to ensure Kafka behaves well under high loads of traffic and unexpected cluster failures. So when our main Kafka cluster went crazy we had a serious issue on our hands.
This session is the story of how we learned the hard way about mitigating cluster failures with the proper configurations in place.
Hagen Toennies from Gaikai Inc. presented this deck at the 2017 HPC Advisory Council Stanford Conference.
"In this talk we will present how we enable distributed, Unix style programming using Docker and Apache Kafka. We will show how we can take the famous Unix Pipe Pattern and apply it to a Distributed Computing System. We will demonstrate the development of two simple applications with the focus on "Do One Thing and Do It Well." Afterwards we demonstrate how we make these two programs work to together using Apache Kafka. By encapsulating our applications in containers we will also show how that enables us to go from the limited resources of a development machine to cluster of computers in a data center without changing our applications or containers."
Watch the video: http://wp.me/p3RLHQ-goG
Learn more: http://www.hpcadvisorycouncil.com/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Modern businesses have data at their core, and this data is changing continuously. How can we harness this torrent of information in real time? The answer is stream processing, and the technology that has become the core platform for streaming data is Apache Kafka. Among the thousands of companies that use Kafka to transform and reshape their industries are the likes of Netflix, Uber, PayPal, and Airbnb, but also established players such as Goldman Sachs, Cisco, and Oracle.
Unfortunately, today’s common architectures for real-time data processing at scale suffer from complexity: there are many technologies that need to be stitched and operated together, and each individual technology is often complex by itself. This has led to a strong discrepancy between how we, as engineers, would like to work vs. how we actually end up working in practice.
In this session we talk about how Apache Kafka helps you to radically simplify your data processing architectures. We cover how you can now build normal applications to serve your real-time processing needs — rather than building clusters or similar special-purpose infrastructure — and still benefit from properties such as high scalability, distributed computing, and fault-tolerance, which are typically associated exclusively with cluster technologies. Notably, we introduce Kafka’s Streams API, its abstractions for streams and tables, and its recently introduced Interactive Queries functionality. As we will see, Kafka makes such architectures equally viable for small, medium, and large scale use cases.
Lessons Learned From PayPal: Implementing Back-Pressure With Akka Streams And... (Lightbend)
Akka Streams and its amazing handling of streaming with back-pressure should be no surprise to anyone. But it takes a couple of use cases to really see it in action - especially in use cases where the amount of work continues to increase as you’re processing it. This is where back-pressure really shines.
In this talk for Architects and Dev Managers by Akara Sucharitakul, Principal MTS for Global Platform Frameworks at PayPal, Inc., we look at how back-pressure based on Akka Streams and Kafka is being used at PayPal to handle very bursty workloads.
In addition, Akara will share experiences in creating a platform based on Akka and Akka Streams that currently processes over 1 billion transactions per day (on just 8 VMs), with the aim of helping teams adopt these technologies. In this webinar, you will:
* Start with a sample web crawler use case to examine what happens when each processing pass expands to a larger and larger workload to process.
* Review how we use the buffering capabilities in Kafka and the back-pressure with asynchronous processing in Akka Streams to handle such bursts.
* Look at lessons learned, plus some constructive “rants” about the architectural components, the maturity or immaturity you’ll expect, and tidbits and open source goodies like memory-mapped stream buffers that can be helpful in other Akka Streams and/or Kafka use cases.
Back-Pressure in Action: Handling High-Burst Workloads with Akka Streams & Ka... (Reactive Summit)
Akka Streams and its amazing handling of stream back-pressure should be no surprise to anyone. But it takes a couple of use cases to really see it in action - especially use cases where the amount of work increases as you process it, which makes you really value the back-pressure.
This talk takes a sample web crawler use case where each processing pass expands to a larger and larger workload to process, and discusses how we use the buffering capabilities in Kafka and the back-pressure with asynchronous processing in Akka Streams to handle such bursts.
In addition, we will also provide some constructive “rants” about the architectural components, the maturity, or immaturity you’ll expect, and tidbits and open source goodies like memory-mapped stream buffers that can be helpful in other Akka Streams and/or Kafka use cases.
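A tiny Scala sketch of the core mechanism in that crawler scenario (illustrative names, assuming Akka 2.6, where the ActorSystem provides the materializer): the bounded buffer absorbs bursts, mapAsync caps the in-flight work, and when downstream saturates, demand stops flowing upstream so the source is paced instead of the application being overwhelmed; Kafka acts as the durable buffer in front of the stream.

import akka.actor.ActorSystem
import akka.stream.OverflowStrategy
import akka.stream.scaladsl.{Sink, Source}
import scala.concurrent.Future

implicit val system: ActorSystem = ActorSystem("burst-demo")
import system.dispatcher

def crawl(url: String): Future[String] = Future(s"fetched $url") // placeholder for real IO

Source(1 to 100000)
  .map(n => s"https://example.com/page/$n")   // each pass can fan out to more URLs
  .buffer(256, OverflowStrategy.backpressure) // bounded buffer, never unbounded memory
  .mapAsync(parallelism = 8)(crawl)           // at most 8 requests in flight
  .runWith(Sink.ignore)                       // downstream demand paces everything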
BigDataSpain 2016: Stream Processing Applications with Apache Apex (Thomas Weise)
Stream processing applications built on Apache Apex run on Hadoop clusters and typically power analytics use cases where availability, flexible scaling, high throughput, low latency and correctness are essential. These applications consume data from a variety of sources, including streaming sources like Apache Kafka, Kinesis or JMS, file based sources or databases. Processing results often need to be stored in external systems (sinks) for downstream consumers (pub-sub messaging, real-time visualization, Hive and other SQL databases etc.). Apex has the Malhar library with a wide range of connectors and other operators that are readily available to build applications. We will cover key characteristics like partitioning and processing guarantees, generic building blocks for new operators (write-ahead-log, incremental state saving, windowing etc.) and APIs for application specification.
Exploring Reactive Integrations With Akka Streams, Alpakka And Apache Kafka (Lightbend)
Since its stable release in 2016, Akka Streams has quickly become the de facto standard integration layer between various streaming systems and products. Enterprises like PayPal, Intel, Samsung and Norwegian Cruise Lines see this as a game changer in terms of designing Reactive streaming applications by connecting pipelines of back-pressured asynchronous processing stages.
This comes from the Reactive Streams initiative in part, which has been long led by Lightbend and others, allowing multiple streaming libraries to inter-operate between each other in a performant and resilient fashion, providing back-pressure all the way. But perhaps even more so thanks to the various integration drivers that have sprung up in the community and the Akka team—including drivers for Apache Kafka, Apache Cassandra, Streaming HTTP, Websockets and much more.
In this webinar for JVM Architects, Konrad Malawski explores the what and why of Reactive integrations, with examples featuring technologies like Akka Streams, Apache Kafka, and Alpakka, a new community project for building Streaming connectors that seeks to “back-pressurize” traditional Apache Camel endpoints.
* An overview of Reactive Streams and what it will look like in JDK 9, and the Akka Streams API implementation for Java and Scala.
* Introduction to Alpakka, a modern, Reactive version of Apache Camel, and its growing community of Streams connectors (e.g. Akka Streams Kafka, MQTT, AMQP, Streaming HTTP/TCP/FileIO and more); a consumer sketch follows this list.
* How Akka Streams and Akka HTTP work with Websockets, HTTP and TCP, with examples in both in Java and Scala.
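As a taste of such a connector in practice, here is a minimal Akka Streams Kafka (alpakka-kafka) consumer sketch in Scala (broker address, group id and topic are placeholder assumptions; assumes Akka 2.6, where the ActorSystem provides the materializer). Because the source is back-pressured, polling slows down automatically whenever downstream processing slows down:

import akka.actor.ActorSystem
import akka.kafka.{ConsumerSettings, Subscriptions}
import akka.kafka.scaladsl.Consumer
import akka.stream.scaladsl.Sink
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.common.serialization.StringDeserializer

implicit val system: ActorSystem = ActorSystem("alpakka-demo")

val settings =
  ConsumerSettings(system, new StringDeserializer, new StringDeserializer)
    .withBootstrapServers("localhost:9092")
    .withGroupId("example-group")
    .withProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")

// the Kafka poll loop is driven by downstream demand (back-pressure)
Consumer
  .plainSource(settings, Subscriptions.topics("example-topic"))
  .runWith(Sink.foreach(record => println(record.value)))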
Building Continuous Application with Structured Streaming and Real-Time Data ... (Databricks)
One of the biggest challenges in data science is to build a continuous data application which delivers results rapidly and reliably. Spark Streaming offers a powerful solution for real-time data processing. However, the challenge remains in how to connect them with various continuous and real-time data sources, guaranteeing the responsiveness and reliability of data applications.
In this talk, Nan and Arijit will summarize lessons learned from serving real-time Spark-based data analytics solutions on Azure HDInsight. Their solution seamlessly integrates Spark and Azure EventHubs, a hyper-scale telemetry ingestion service enabling users to ingest massive amounts of telemetry into the cloud and read the data from multiple applications using publish-subscribe semantics.
They’ll cover three topics: bridging the gap between the data communication models of Spark and the data source, adapting Spark to the rate control and message addressing of the data source, and the co-design of fault tolerance mechanisms. This talk will share insights on how to build continuous data applications with Spark and improve the availability of connectors for Spark and different real-time data sources.
Introduction to Apache Beam & No Shard Left Behind: APIs for Massive Parallel... (Dan Halperin)
Apache Beam (incubating) is a unified batch and streaming data processing programming model that is efficient and portable. Beam evolved from a decade of system-building at Google, and Beam pipelines run today on both open source (Apache Flink, Apache Spark) and proprietary (Google Cloud Dataflow) runners. This talk will focus on I/O and connectors in Apache Beam, specifically its APIs for efficient, parallel, adaptive I/O. Google will discuss how these APIs enable a Beam data processing pipeline runner to dynamically rebalance work at runtime, to work around stragglers, and to automatically scale up and down cluster size as a job’s workload changes. Together these APIs and techniques enable Apache Beam runners to efficiently use computing resources without compromising on performance or correctness. Practical examples and a demonstration of Beam will be included.
HBaseCon2017: Efficient and portable data processing with Apache Beam and HBase (HBaseCon)
In this talk we introduce Apache Beam, a unified model to create efficient and portable data processing pipelines. Beam uses a single set of abstractions to implement both batch and streaming computations that can be executed in different environments, e.g. Apache Spark, Apache Flink and Google Dataflow. Beam not only does data processing, but can be used as a tool to ingest/extract data to/from different data stores including HBase. We will present interaction scenarios between HBase and Beam and explore Beam's Input/Output (IO) model and how we leverage it to provide support for HBase.
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...” (Provectus)
Apache Beam is an open source, unified model and set of language-specific SDKs for defining and executing data processing pipelines, as well as data ingestion and integration flows, supporting both batch and streaming use cases. In this presentation I will provide a general overview of Apache Beam and a comparison of the programming models of Apache Beam and Apache Spark.
Student information management system project report ii.pdfKamal Acharya
Our project explains about the student management. This project mainly explains the various actions related to student details. This project shows some ease in adding, editing and deleting the student details. It also provides a less time consuming process for viewing, adding, editing and deleting the marks of the students.
Explore the innovative world of trenchless pipe repair with our comprehensive guide, "The Benefits and Techniques of Trenchless Pipe Repair." This document delves into the modern methods of repairing underground pipes without the need for extensive excavation, highlighting the numerous advantages and the latest techniques used in the industry.
Learn about the cost savings, reduced environmental impact, and minimal disruption associated with trenchless technology. Discover detailed explanations of popular techniques such as pipe bursting, cured-in-place pipe (CIPP) lining, and directional drilling. Understand how these methods can be applied to various types of infrastructure, from residential plumbing to large-scale municipal systems.
Ideal for homeowners, contractors, engineers, and anyone interested in modern plumbing solutions, this guide provides valuable insights into why trenchless pipe repair is becoming the preferred choice for pipe rehabilitation. Stay informed about the latest advancements and best practices in the field.
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)MdTanvirMahtab2
This presentation is about the working procedure of Shahjalal Fertilizer Company Limited (SFCL). A Govt. owned Company of Bangladesh Chemical Industries Corporation under Ministry of Industries.
Hierarchical Digital Twin of a Naval Power SystemKerry Sado
A hierarchical digital twin of a Naval DC power system has been developed and experimentally verified. Similar to other state-of-the-art digital twins, this technology creates a digital replica of the physical system executed in real-time or faster, which can modify hardware controls. However, its advantage stems from distributing computational efforts by utilizing a hierarchical structure composed of lower-level digital twin blocks and a higher-level system digital twin. Each digital twin block is associated with a physical subsystem of the hardware and communicates with a singular system digital twin, which creates a system-level response. By extracting information from each level of the hierarchy, power system controls of the hardware were reconfigured autonomously. This hierarchical digital twin development offers several advantages over other digital twins, particularly in the field of naval power systems. The hierarchical structure allows for greater computational efficiency and scalability while the ability to autonomously reconfigure hardware controls offers increased flexibility and responsiveness. The hierarchical decomposition and models utilized were well aligned with the physical twin, as indicated by the maximum deviations between the developed digital twin hierarchy and the hardware.
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...Amil Baba Dawood bangali
Contact with Dawood Bhai Just call on +92322-6382012 and we'll help you. We'll solve all your problems within 12 to 24 hours and with 101% guarantee and with astrology systematic. If you want to take any personal or professional advice then also you can call us on +92322-6382012 , ONLINE LOVE PROBLEM & Other all types of Daily Life Problem's.Then CALL or WHATSAPP us on +92322-6382012 and Get all these problems solutions here by Amil Baba DAWOOD BANGALI
#vashikaranspecialist #astrologer #palmistry #amliyaat #taweez #manpasandshadi #horoscope #spiritual #lovelife #lovespell #marriagespell#aamilbabainpakistan #amilbabainkarachi #powerfullblackmagicspell #kalajadumantarspecialist #realamilbaba #AmilbabainPakistan #astrologerincanada #astrologerindubai #lovespellsmaster #kalajaduspecialist #lovespellsthatwork #aamilbabainlahore#blackmagicformarriage #aamilbaba #kalajadu #kalailam #taweez #wazifaexpert #jadumantar #vashikaranspecialist #astrologer #palmistry #amliyaat #taweez #manpasandshadi #horoscope #spiritual #lovelife #lovespell #marriagespell#aamilbabainpakistan #amilbabainkarachi #powerfullblackmagicspell #kalajadumantarspecialist #realamilbaba #AmilbabainPakistan #astrologerincanada #astrologerindubai #lovespellsmaster #kalajaduspecialist #lovespellsthatwork #aamilbabainlahore #blackmagicforlove #blackmagicformarriage #aamilbaba #kalajadu #kalailam #taweez #wazifaexpert #jadumantar #vashikaranspecialist #astrologer #palmistry #amliyaat #taweez #manpasandshadi #horoscope #spiritual #lovelife #lovespell #marriagespell#aamilbabainpakistan #amilbabainkarachi #powerfullblackmagicspell #kalajadumantarspecialist #realamilbaba #AmilbabainPakistan #astrologerincanada #astrologerindubai #lovespellsmaster #kalajaduspecialist #lovespellsthatwork #aamilbabainlahore #Amilbabainuk #amilbabainspain #amilbabaindubai #Amilbabainnorway #amilbabainkrachi #amilbabainlahore #amilbabaingujranwalan #amilbabainislamabad
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptxR&R Consult
CFD analysis is incredibly effective at solving mysteries and improving the performance of complex systems!
Here's a great example: At a large natural gas-fired power plant, where they use waste heat to generate steam and energy, they were puzzled that their boiler wasn't producing as much steam as expected.
R&R Consult and Tetra Engineering Group Inc. were asked to investigate the reduced steam production.
An inspection had shown that a significant amount of hot flue gas was bypassing the boiler tubes, where the heat was supposed to be transferred.
R&R Consult conducted a CFD analysis, which revealed that 6.3% of the flue gas was bypassing the boiler tubes without transferring heat. The analysis also showed that the flue gas was instead being directed along the sides of the boiler and between the modules that were supposed to capture the heat. This was the cause of the reduced performance.
Based on our results, Tetra Engineering installed covering plates to reduce the bypass flow. This improved the boiler's performance and increased electricity production.
It is always satisfying when we can help solve complex challenges like this. Do your systems also need a check-up or optimization? Give us a call!
Work done in cooperation with James Malloy and David Moelling from Tetra Engineering.
More examples of our work https://www.r-r-consult.dk/en/cases-en/
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...Dr.Costas Sachpazis
Terzaghi's soil bearing capacity theory, developed by Karl Terzaghi, is a fundamental principle in geotechnical engineering used to determine the bearing capacity of shallow foundations. This theory provides a method to calculate the ultimate bearing capacity of soil, which is the maximum load per unit area that the soil can support without undergoing shear failure. The calculation HTML code is included.
2. Agenda
● How we used a native HBase client
● Problems we faced with a native HBase client
● Migration to asynchbase
● Blocking IO vs Non-blocking IO: performance test results
3. About me
● Yusuke Yasuda / 安田裕介
● @TanUkkii007
● Working at ChatWork for 2 years
● Scala developer
6. Messaging system architecture overview
You can find more information about our architecture in our Kafka Summit 2017 talk.
[Architecture diagram with today's topic highlighted]
7. HBase
● Key-value storage to enable random access on HDFS
● HBase is used as a query-side storage in our system
○ version: 1.2.0
● Provides a streaming API called "Scan" to query a sequence of rows iteratively
● Scan is the most used HBase API in ChatWork
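The later examples assume a Scan has already been built. As a hedged sketch (the table layout and row keys are hypothetical, not ChatWork's), constructing one with the HBase 1.2 client looks like this:

import org.apache.hadoop.hbase.client.Scan
import org.apache.hadoop.hbase.util.Bytes

// Scan a contiguous range of row keys; keys are made up for illustration.
val scan = new Scan()
scan.setStartRow(Bytes.toBytes("room-42#msg-00000")) // first row, inclusive
scan.setStopRow(Bytes.toBytes("room-42#msg-99999"))  // end row, exclusive
scan.setCaching(100)                                 // rows fetched per RPC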
8. Synchronous scan with native HBase client
A bad example
import scala.annotation.tailrec
import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.{Connection, Result, ResultScanner, Scan, Table}

def scanHBase(connection: Connection, tableName: TableName, scan: Scan): Vector[Result] = {
  val table: Table = connection.getTable(tableName)
  val scanner: ResultScanner = table.getScanner(scan)
  // Blocks the calling thread on every next() until the whole scan is drained.
  @tailrec
  def loop(results: Vector[Result]): Vector[Result] = {
    val result = scanner.next()
    if (result == null)
      results
    else
      loop(results :+ result)
  }
  try {
    loop(Vector.empty)
  } finally {
    scanner.close() // close the scanner before the table it came from
    table.close()
  }
}
Cons:
● a thread is not released until the whole scan is finished
● throughput is bounded by the number of threads in the pool
● long-running blocking calls cause serious performance problems in event-loop style applications like Akka HTTP
Gist
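To serve such a blocking call from an event-loop server at all, it has to be offloaded to a dedicated thread pool. A hedged sketch (the pool size and names are ours) makes the throughput bound explicit:

import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}

// With N pool threads, at most N scans run concurrently, each pinning a
// thread for its whole duration; further requests simply queue up.
val blockingEc: ExecutionContext =
  ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(600))

def scanHBaseAsync(connection: Connection, tableName: TableName, scan: Scan): Future[Vector[Result]] =
  Future(scanHBase(connection, tableName, scan))(blockingEc)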
9. Throughput and latency trade-off in asynchronous and synchronous settings
asynchronous: throughput=8, latency=2
synchronous: throughput=4, latency=1
The asynchronous setting is more flexible and fair!

|                        | synchronous                     | asynchronous                                   |
|------------------------|---------------------------------|------------------------------------------------|
| Optimized for          | latency                         | throughput                                     |
| Under high workload    | throughput is bounded           | throughput increases while sacrificing latency |
| Under low workload     | both settings have equal latency and throughput                                  |
| Requests for many rows | are executed exclusively        | are evenly scheduled as small requests         |
10. Asynchronous streaming of the Scan operation with Akka Stream
import akka.stream.{Attributes, Outlet, SourceShape}
import akka.stream.stage.{GraphStage, GraphStageLogic, OutHandler}
import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.{Connection, Result, ResultScanner, Scan, Table}

class HBaseScanStage(connection: Connection, tableName: TableName, scan: Scan)
    extends GraphStage[SourceShape[Result]] {
  val out: Outlet[Result] = Outlet("HBaseScanSource")
  override def shape: SourceShape[Result] = SourceShape(out)
  override def createLogic(inheritedAttributes: Attributes): GraphStageLogic =
    new GraphStageLogic(shape) {
      var table: Table = _
      var scanner: ResultScanner = _
      override def preStart(): Unit = {
        table = connection.getTable(tableName)
        scanner = table.getScanner(scan)
      }
      setHandler(out, new OutHandler {
        // Called once per unit of downstream demand: fetch exactly one row.
        override def onPull(): Unit = {
          val next = scanner.next()
          if (next == null)
            complete(out) // end of scan
          else
            push(out, next)
        }
      })
      override def postStop(): Unit = {
        if (scanner != null) scanner.close()
        if (table != null) table.close()
        super.postStop()
      }
    }
}
● ResultScanner#next() is passively called inside a callback in a thread-safe way
● the thread is released immediately after each single ResultScanner#next() call
● Results are pushed downstream asynchronously
● when and how many times next() is called is determined by downstream demand
Gist
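As a hedged usage sketch (the system name is ours; Akka 2.5-era materializer API assumed), the stage becomes an ordinary Source via Source.fromGraph:

import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Sink, Source}

implicit val system: ActorSystem = ActorSystem("hbase-scan")
implicit val materializer: ActorMaterializer = ActorMaterializer()

// Materializes the scan as a stream and collects all rows.
val results = Source
  .fromGraph(new HBaseScanStage(connection, tableName, scan))
  .runWith(Sink.seq) // Future[Seq[Result]]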
12. Just a single unresponsive HBase region server caused whole-system degradation
The call queue size of the hslave-5 region server spiked. All Message Read API servers suffered increased latency and falling throughput.
13. Distributed systems are supposed to fail partially, so why did everything fail?
● The native HBase client uses blocking IO
● Requests to the unresponsive HBase server block a thread until the timeout
● All threads in the thread pool were consumed, so the Message Read API servers were not able to respond
[Chart: thread pool status in the Read API servers; #active threads climbing toward the upper limit of the pool size, alongside the HBase IPC queue size]
16. asynchbase
A non-blocking HBase client based on Netty
● https://github.com/OpenTSDB/asynchbase
● Netty 3.9
● Supports reverse scan since 1.8
● Asynchronous interface via Deferred
○ https://github.com/OpenTSDB/async
○ An observer pattern that provides callback interfaces
● Thread safety provided by Deferred
○ The event loop executes volatile checks at each step
○ Safe to mutate state inside callbacks
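As a hedged sketch of what the Deferred interface feels like in practice (the helper name is ours, not asynchbase's), a Deferred can be bridged to a Scala Future with a Promise:

import com.stumbleupon.async.{Callback, Deferred}
import scala.concurrent.{Future, Promise}

// Bridge asynchbase's Deferred[T] to a Scala Future[T].
def deferredToFuture[T](d: Deferred[T]): Future[T] = {
  val p = Promise[T]()
  d.addCallback(new Callback[Unit, T] {
    override def call(result: T): Unit = p.success(result)
  })
  d.addErrback(new Callback[Unit, Exception] {
    override def call(e: Exception): Unit = p.failure(e)
  })
  p.future
}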
17. Introducing a streaming interface to asynchbase with Akka Stream
import java.util
import akka.stream.{Attributes, Outlet, SourceShape}
import akka.stream.stage.{GraphStage, GraphStageLogic, OutHandler}
import org.hbase.async.{KeyValue, Scanner}
import scala.collection.JavaConverters._

// HBaseCallbackConversion (from the accompanying Gist) supplies implicit
// conversions from Scala functions to asynchbase Callback instances.
class HBaseAsyncScanStage(scanner: Scanner)
    extends GraphStage[SourceShape[util.ArrayList[KeyValue]]] with HBaseCallbackConversion {
  val out: Outlet[util.ArrayList[KeyValue]] = Outlet("HBaseAsyncScanStage")
  override def shape: SourceShape[util.ArrayList[KeyValue]] = SourceShape(out)
  override def createLogic(inheritedAttributes: Attributes): GraphStageLogic =
    new GraphStageLogic(shape) {
      var buffer: List[util.ArrayList[KeyValue]] = List.empty
      setHandler(out, new OutHandler {
        override def onPull(): Unit = {
          if (buffer.isEmpty) {
            // No buffered rows: ask asynchbase for the next batch.
            val deferred = scanner.nextRows()
            deferred.addCallbacks(
              (results: util.ArrayList[util.ArrayList[KeyValue]]) => callback.invoke(Option(results)),
              (e: Throwable) => errorback.invoke(e)
            )
          } else {
            // Serve one buffered row per pull.
            val (element, tailBuffer) = (buffer.head, buffer.tail)
            buffer = tailBuffer
            push(out, element)
          }
        }
      })
      override def postStop(): Unit = {
        scanner.close()
        super.postStop()
      }
      // getAsyncCallback makes it safe to touch stage state from asynchbase threads.
      private val callback = getAsyncCallback[Option[util.ArrayList[util.ArrayList[KeyValue]]]] {
        case Some(results) if !results.isEmpty =>
          val element = results.remove(0)
          buffer = results.asScala.toList
          push(out, element)
        case _ => complete(out) // empty batch or null: the scan is finished
      }
      private val errorback = getAsyncCallback[Throwable] { error => fail(out, error) }
    }
}
※ This code contains a serious issue: downstream cancellation must be handled properly. Otherwise a Close request may be fired while a NextRows request is still in flight, which violates the HBase protocol. See how to solve this problem in the Gist.
Gist
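The idea behind the fix, as a hedged sketch rather than the Gist's exact code (the flag names are ours; Akka 2.5-era OutHandler API assumed): track whether a NextRows RPC is in flight and defer closing the Scanner until it completes.

// Inside createLogic:
var inFlight = false   // a nextRows() RPC is currently outstanding
var cancelled = false  // downstream cancelled while the RPC was running

setHandler(out, new OutHandler {
  override def onPull(): Unit = {
    // ...as before, but set inFlight = true before calling scanner.nextRows()
  }
  override def onDownstreamFinish(): Unit = {
    if (inFlight) cancelled = true // defer the Close until the RPC returns
    else { scanner.close(); completeStage() }
  }
})

// In the async callback: set inFlight = false first; if cancelled is set,
// close the scanner and complete the stage instead of pushing rows.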
18. Customizing Scan behavior with downstream pipelines

import akka.stream.ThrottleMode
import scala.concurrent.duration._

HBaseAsyncScanSource(scanner).take(1000)

HBaseAsyncScanSource(scanner)
  .throttle(elements = 100, per = 1.second, maximumBurst = 100, mode = ThrottleMode.Shaping)

HBaseAsyncScanSource(scanner).completionTimeout(5.seconds)

HBaseAsyncScanSource(scanner).recoverWithRetries(10, {
  case _: NotServingRegionException => HBaseAsyncScanSource(scanner)
})

● early termination of the scan when the row-count limit is reached
● rate limiting of the scan iteration
● early termination of the scan by timeout
● retrying if a region server is not serving the region
Gist
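These operators compose; as a hedged sketch (this particular stack is ours, not the deck's), a bounded, time-limited, retrying scan source:

import org.hbase.async.NotServingRegionException
import scala.concurrent.duration._

// Row-count bound, overall deadline, and retry on region movement,
// all applied downstream of the same scan source.
val resilientScan = HBaseAsyncScanSource(scanner)
  .take(1000)
  .completionTimeout(5.seconds)
  .recoverWithRetries(10, {
    case _: NotServingRegionException => HBaseAsyncScanSource(scanner)
  })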
19. Switching from a synchronous API to an asynchronous API
● Switching from a synchronous API to an asynchronous API usually requires rewriting all the APIs
● Abstracting over database drivers is difficult
● Starting with an asynchronous interface like Future[T] is good practice
● Another option for an abstract interface is streams
● Streams can behave like collections, as Future, Option, List, and Try do, but do not require monad transformers to integrate with each other
● A stream interface specification like Reactive Streams (JEP 266) gives a way to connect various asynchronous libraries
● Akka Stream is one of the implementations of Reactive Streams
21. Database access abstraction with streams
● Transport Interface Layer: interface Directive[T], Future[T]; engine Akka HTTP
● Stream Adaptor: interface Source[Out, M], Flow[In, Out, M], Sink[In, M]; engine Akka Stream
● Database Interface Layer: implementation-specific interface; engine: a database driver
○ native HBase client / asynchbase
○ HBaseScanStage / HBaseAsyncScanStage
○ ReadMessageDAS
● UseCase Layer: interface Source[Out, M], Flow[In, Out, M], Sink[In, M]; engine Akka Stream
● Domain Layer: interface Scala collections and case classes; engine Scala standard library

● The stream abstraction mitigates the impact of changes in the underlying implementations
● The database access implementation can be switched by factory functions
● No change was required inside the UseCase and Domain layers
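A hedged sketch of that factory seam (Message is a stand-in domain type; only ReadMessageDAS appears in the deck): the UseCase layer depends on a Source and never on a concrete driver.

import akka.NotUsed
import akka.stream.scaladsl.Source

final case class Message(messageId: Long, body: String) // hypothetical domain type

trait ReadMessageDAS {
  def scanMessages(roomId: Long): Source[Message, NotUsed]
}

// Each implementation would wrap HBaseScanStage (blocking driver) or
// HBaseAsyncScanStage (asynchbase) via Source.fromGraph; stubbed here.
def readMessageDAS(useAsynchbase: Boolean): ReadMessageDAS = ???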
22. Blocking IO vs non-blocking IO: performance test results
Fortunately we have not faced any HBase issues in production since the asynchbase migration.
The following slides show performance test results from tests conducted before the asynchbase deployment.
23. Blocking IO vs non-blocking IO: performance test settings
● A single Message Read API server
○ JVM heap size = 4 GiB
○ CPU request = 3.5
○ CPU limit = 4
● A production workload pattern simulated with the Gatling stress tool
● 1340 requests/second
● Mainly invokes HBase Scan, but there are Get and batch Get calls as well
Both the asynchbase and native HBase client implementations were tested under the same conditions.
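As a hedged sketch of the load shape only (the endpoint, host, and duration are ours, not ChatWork's; Gatling 3-style DSL assumed), a simulation injecting 1340 requests/second might look like:

import io.gatling.core.Predef._
import io.gatling.http.Predef._
import scala.concurrent.duration._

class ReadApiSimulation extends Simulation {
  val httpProtocol = http.baseUrl("http://read-api.example.local")

  // One HTTP request per virtual user, so users/sec roughly equals requests/sec.
  val scn = scenario("message-read")
    .exec(http("scan-messages").get("/rooms/42/messages"))

  setUp(
    scn.inject(constantUsersPerSec(1340).during(10.minutes))
  ).protocols(httpProtocol)
}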
24. Blocking IO vs non-blocking IO: throughput
[Charts: Message Read API server with the native HBase client vs with asynchbase]
throughput: 1000 → 1300 requests/second
25. Blocking IO vs non-blocking IO: latency
[Charts: Message Read API server with the native HBase client vs with asynchbase; note that the y-axis scales differ]
99th percentile: 2000 ms → 300 ms
95th percentile: 1000 ms → 200 ms
26. Blocking IO vs non-blocking IO: thread pool usage
[Charts: Message Read API server with the native HBase client vs with asynchbase]
Note that hbase-dispatcher is an application thread pool, not the Netty IO worker thread pool.
pool size: 600 → 8
active threads: 80 → 2
27. Blocking IO vs non-blocking IO: JVM heap usage
[Charts: Message Read API server with the native HBase client vs with asynchbase]
heap usage: 2.6 GiB → 1.8 GiB
28. Blocking IO vs non-blocking IO: HBase scan metrics
[Charts: Message Read API server with the native HBase client vs with asynchbase; y-axis: average of the sum of milliseconds between next() calls]
29. HBase scan metrics may come to asynchbase
https://github.com/OpenTSDB/asynchbase/pull/184
30. Room for improvement: timeouts and rate limiting
● Proper timeouts and rate limiting are necessary for asynchronous, non-blocking systems
○ Without such reins, an asynchronous system increases its throughput until it consumes all resources
● Timeouts
○ completionTimeout: a timeout based on the total processing time
■ Not ideal for Scan, which has a broad distribution of processing times
○ idleTimeout: a timeout based on the processing time between two elements
■ A single Scan iteration has a sharp distribution of processing times, so this is probably the better strategy (see the sketch after this list)
● Rate limiting
○ Under high workload, the first bottleneck is the throughput of HBase's storage
■ How to implement storage-aware rate limiting?
■ Tuning application resources may be necessary
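A hedged sketch of the idleTimeout strategy (the threshold is ours): fail the scan if the gap between two consecutive rows grows too large, instead of bounding the total scan time.

import scala.concurrent.duration._

// Fails with a TimeoutException if no element arrives within 500 ms of
// the previous one; the total scan time remains unbounded.
HBaseAsyncScanSource(scanner).idleTimeout(500.millis)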
31. Conclusion
● Blocking IO spoils the benefits of distributed databases
○ A partial failure of the database exhausts application threads and makes the application unresponsive
● Non-blocking IO is resilient to partial failure
● Asynchronous streams are great both as a flexible execution model and as an abstract interface
● An asynchronous stream with non-blocking IO outperforms a blocking one
● Our journey toward a resilient system continues