The shift to stream processing at LinkedIn has accelerated over the past few years. We now have over 200 Samza applications in production processing more than 260B events per day.
https://www.bigdataspain.org/2017/talk/apache-samza-jake-maes
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Spark Streaming + Kafka 0.10: an integration story by Joan Viladrosa Riera at...Big Data Spain
Spark Streaming has supported Kafka since it’s inception, but a lot has changed since those times, both in Spark and Kafka sides, to make this integration more fault-tolerant and reliable.
https://www.bigdataspain.org/2017/talk/spark-streaming-kafka-0-10-an-integration-story
Big Data Spain 2017
16th - 17th Kinépolis Madrid
Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry confluent
Apache Beam (unified Batch and strEAM processing!) is a new Apache incubator project. Originally based on years of experience developing Big Data infrastructure within Google (such as MapReduce, FlumeJava, and MillWheel), it has now been donated to the OSS community at large.
Come learn about the fundamentals of out-of-order stream processing, and how Beam’s powerful tools for reasoning about time greatly simplify this complex task. Beam provides a model that allows developers to focus on the four important questions that must be answered by any stream processing pipeline:
What results are being calculated?
Where in event time are they calculated?
When in processing time are they materialized?
How do refinements of results relate?
Furthermore, by cleanly separating these questions from runtime characteristics, Beam programs become portable across multiple runtime environments, both proprietary (e.g., Google Cloud Dataflow) and open-source (e.g., Flink, Spark, et al).
Add Horsepower to AI/ML streaming Pipeline - Pulsar Summit NA 2021StreamNative
The more time the data science teams spend on model training, the less business value is added because no value is created until that model is deployed in production. Traditional HDD-based systems are not suitable for training, which is very IO intensive due to complex transformations that are involved during data preparation. Moreover, Training is not a one-time process. Trends and patterns in the data keep changing rapidly, hence models need to be retrained to address drift issues to continually improve performance in production. Data scientists often experiment with thousands of models, and speeding up the process has significant business implications.
In this talk, we will cover how you can accelerate an AI/ML pipeline by speeding up data loads using the Aerospike database which leverages its hybrid memory architecture to achieve sub-millisecond read/writes. In a hybrid memory architecture the index is stored in-memory (not persisted), and data is stored on persistent storage (SSD) and read directly from the disk. Disk I/O is not required to access the index. For time-sensitive and high throughput use cases such as fraud detection, you need a transactional database at the edge that can handle high-velocity ingestion and support millions of IOPS. The events are then streamed downstream to your AL/ML platform for training or your inference server for predictions. We will share the reference architecture of a highly performant AI/ML training and inference pipelines consisting of Apache Pulsar, Apache Spark 3.0, Aerospike database, and its Spark and Pulsar connectors. This architecture can be extended to other use cases that demand low latency and high throughput while not blowing your budget.
Lessons Learned From PayPal: Implementing Back-Pressure With Akka Streams And...Lightbend
Akka Streams and its amazing handling of streaming with back-pressure should be no surprise to anyone. But it takes a couple of use cases to really see it in action - especially in use cases where the amount of work continues to increase as you’re processing it. This is where back-pressure really shines.
In this talk for Architects and Dev Managers by Akara Sucharitakul, Principal MTS for Global Platform Frameworks at PayPal, Inc., we look at how back-pressure based on Akka Streams and Kafka is being used at PayPal to handle very bursty workloads.
In addition, Akara will also share experiences in creating a platform based on Akka and Akka Streams that currently processes over 1 billion transactions per day (on just 8 VMs), with the aim of helping teams adopt these technologies. In this webinar, you will:
*Start with a sample web crawler use case to examine what happens when each processing pass expands to a larger and larger workload to process.
*Review how we use the buffering capabilities in Kafka and the back-pressure with asynchronous processing in Akka Streams to handle such bursts.
*Look at lessons learned, plus some constructive “rants” about the architectural components, the maturity, or immaturity you’ll expect, and tidbits and open source goodies like memory-mapped stream buffers that can be helpful in other Akka Streams and/or Kafka use cases.
What is the State of my Kafka Streams Application? Unleashing Metrics. | Neil...HostedbyConfluent
"Just as the Apache Kafka Brokers provide JMX metrics to monitor your cluster's health, Kafka Streams provides a rich set of metrics for monitoring your application's health and performance. The metrics to observe for a given use-case of Kafka Streams will vary significantly from application to application. Learning how to build and customize monitoring of those applications will help you maintain a healthy Kafka Streams ecosystem.
Takeaways
* An analysis and overview of the provided metrics, including the new end-to-end metrics of Kafka Streams 2.7.
* See how to extract metrics from your application using existing JMX tooling.
* Walkthrough how to build a dashboard for observing those metrics.
* Explore options of how to add additional JMX resources and Kafka Stream metrics to your application.
* How to verify you built your dashboard correctly by creating a data control set to validate your dashboard.
* Go beyond what you can collect from the Kafka Stream metrics."
Spark Streaming + Kafka 0.10: an integration story by Joan Viladrosa Riera at...Big Data Spain
Spark Streaming has supported Kafka since it’s inception, but a lot has changed since those times, both in Spark and Kafka sides, to make this integration more fault-tolerant and reliable.
https://www.bigdataspain.org/2017/talk/spark-streaming-kafka-0-10-an-integration-story
Big Data Spain 2017
16th - 17th Kinépolis Madrid
Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry confluent
Apache Beam (unified Batch and strEAM processing!) is a new Apache incubator project. Originally based on years of experience developing Big Data infrastructure within Google (such as MapReduce, FlumeJava, and MillWheel), it has now been donated to the OSS community at large.
Come learn about the fundamentals of out-of-order stream processing, and how Beam’s powerful tools for reasoning about time greatly simplify this complex task. Beam provides a model that allows developers to focus on the four important questions that must be answered by any stream processing pipeline:
What results are being calculated?
Where in event time are they calculated?
When in processing time are they materialized?
How do refinements of results relate?
Furthermore, by cleanly separating these questions from runtime characteristics, Beam programs become portable across multiple runtime environments, both proprietary (e.g., Google Cloud Dataflow) and open-source (e.g., Flink, Spark, et al).
Add Horsepower to AI/ML streaming Pipeline - Pulsar Summit NA 2021StreamNative
The more time the data science teams spend on model training, the less business value is added because no value is created until that model is deployed in production. Traditional HDD-based systems are not suitable for training, which is very IO intensive due to complex transformations that are involved during data preparation. Moreover, Training is not a one-time process. Trends and patterns in the data keep changing rapidly, hence models need to be retrained to address drift issues to continually improve performance in production. Data scientists often experiment with thousands of models, and speeding up the process has significant business implications.
In this talk, we will cover how you can accelerate an AI/ML pipeline by speeding up data loads using the Aerospike database which leverages its hybrid memory architecture to achieve sub-millisecond read/writes. In a hybrid memory architecture the index is stored in-memory (not persisted), and data is stored on persistent storage (SSD) and read directly from the disk. Disk I/O is not required to access the index. For time-sensitive and high throughput use cases such as fraud detection, you need a transactional database at the edge that can handle high-velocity ingestion and support millions of IOPS. The events are then streamed downstream to your AL/ML platform for training or your inference server for predictions. We will share the reference architecture of a highly performant AI/ML training and inference pipelines consisting of Apache Pulsar, Apache Spark 3.0, Aerospike database, and its Spark and Pulsar connectors. This architecture can be extended to other use cases that demand low latency and high throughput while not blowing your budget.
Lessons Learned From PayPal: Implementing Back-Pressure With Akka Streams And...Lightbend
Akka Streams and its amazing handling of streaming with back-pressure should be no surprise to anyone. But it takes a couple of use cases to really see it in action - especially in use cases where the amount of work continues to increase as you’re processing it. This is where back-pressure really shines.
In this talk for Architects and Dev Managers by Akara Sucharitakul, Principal MTS for Global Platform Frameworks at PayPal, Inc., we look at how back-pressure based on Akka Streams and Kafka is being used at PayPal to handle very bursty workloads.
In addition, Akara will also share experiences in creating a platform based on Akka and Akka Streams that currently processes over 1 billion transactions per day (on just 8 VMs), with the aim of helping teams adopt these technologies. In this webinar, you will:
*Start with a sample web crawler use case to examine what happens when each processing pass expands to a larger and larger workload to process.
*Review how we use the buffering capabilities in Kafka and the back-pressure with asynchronous processing in Akka Streams to handle such bursts.
*Look at lessons learned, plus some constructive “rants” about the architectural components, the maturity, or immaturity you’ll expect, and tidbits and open source goodies like memory-mapped stream buffers that can be helpful in other Akka Streams and/or Kafka use cases.
What is the State of my Kafka Streams Application? Unleashing Metrics. | Neil...HostedbyConfluent
"Just as the Apache Kafka Brokers provide JMX metrics to monitor your cluster's health, Kafka Streams provides a rich set of metrics for monitoring your application's health and performance. The metrics to observe for a given use-case of Kafka Streams will vary significantly from application to application. Learning how to build and customize monitoring of those applications will help you maintain a healthy Kafka Streams ecosystem.
Takeaways
* An analysis and overview of the provided metrics, including the new end-to-end metrics of Kafka Streams 2.7.
* See how to extract metrics from your application using existing JMX tooling.
* Walkthrough how to build a dashboard for observing those metrics.
* Explore options of how to add additional JMX resources and Kafka Stream metrics to your application.
* How to verify you built your dashboard correctly by creating a data control set to validate your dashboard.
* Go beyond what you can collect from the Kafka Stream metrics."
Malo Denielou - No shard left behind: Dynamic work rebalancing in Apache BeamFlink Forward
http://flink-forward.org/kb_sessions/no-shard-left-behind-dynamic-work-rebalancing-in-apache-beam/
The Apache Beam (incubating) programming model is designed to support several advanced data processing features such as autoscaling and dynamic work rebalancing. In this talk, we will first explain how dynamic work rebalancing not only provides a general and robust solution to the problem of stragglers in traditional data processing pipelines, but also how it allows autoscaling to be truly effective. We will then present how dynamic work rebalancing works as implemented in Google Cloud Dataflow and which path other Apache Beam runners link Apache Flink can follow to benefit from it.
Better Kafka Performance Without Changing Any Code | Simon Ritter, AzulHostedbyConfluent
Apache Kafka is the most popular open-source stream-processing software for collecting, processing, storing, and analyzing data at scale. Most known for its excellent performance, low latency, fault tolerance, and high throughput, it's capable of handling thousands of messages per second. For mission-critical applications, how do you ensure that the performance delivered is the performance required? This is especially important as Kafka is written in Java and Scala and runs on the JVM. The JVM is a fantastic platform that delivers on an internet scale. In this session, we'll explore how making changes to the JVM design can eliminate the problems of garbage collection pauses and raise the throughput of applications. For cloud-based Kafka applications, this can deliver both lower latency and reduced infrastructure costs. All without changing a line of code!
Building Value Within the Heavy Vehicle Industry Using Big Data and Streaming...DataWorks Summit
More than 300,000 globally connected Scania vehicles, trucks and buses, are continuously submitting their GPS positions. In this presentation we will show how Scania analyses these positions to obtain valuable information about the operation of the vehicles. The algorithms have been developed in a research project, FUMA, that is run as a joint venture between Fraunhofer Chalmers Centre and Scania.
In the project we build a continuous delivery pipeline that enables us to iteratively improve our code, our algorithms and the data deliverables. The pipeline runs on a Hortonworks platform using Apache Spark Streaming. In the build pipeline we use Jenkins, Nexus and Ansible to test, deploy and run Apache Spark Streaming jobs and the results are pushed to Apache Kafka. We will highlight and present some of the steps we have taken in order to put a streaming big data application in production at a manufacturing company. We think that people with a general awareness of the challenges with big data, the possibilities of the streaming paradigm and the need for continuous delivery will find this talk very intriguing. In this presentation you will learn how we develop and run the code and how we ensure that we are creating value for Scania.
(Nina Hanzlikova, Zalando) Kafka Summit SF 2018
My team at Zalando fell in love with KStreams and their programming model straight out of the gate. However, as a small team of developers, building out and supporting our infrastructure while still trying to deliver solutions for our business has not always resulted in a smooth journey.
Can a small team of a couple of developers run their own Kafka infrastructure confidently and still spend most of their time developing code?
In this talk, we will dive into some of the problems we experienced while running Kafka brokers and Kafka Streams applications, as well as the consultations we had with other teams around this matter. We will outline some of the pragmatic decisions we made regarding backups, monitoring and operations to minimize our time spent administering our Kafka brokers and various stream applications.
Production Ready Kafka on Kubernetes (Devandra Tagare, Lyft) Kafka Summit SF ...confluent
Getting Kafka running on Kubernetes is only step one of a journey to create a production-ready Kafka cluster. This talk walks through the other steps: 1) Monitoring and remediating faults. 2) Updates to Kubernetes nodes for clusters not using shared storage. 3) Automating Kafka updates and restarts. We present how to create fault-tolerant Kafka clusters on Kubernetes without sacrificing availability, durability, or latency. Learn about Lyft's overlay-free Kubernetes networking driver and how we use it to keep performance on par with non-Kubernetes clusters.
Kickstart your Kafka with Faker Data | Francesco Tisiot, Aiven.ioHostedbyConfluent
We all love to play with the shiny toys, but an event stream with no events is a sorry sight. In this session you’ll see how to create your own streaming dataset for Apache Kafka using Python and the Faker library. You’ll learn how to create a random data producer and define the structure and rate of its message delivery. Randomly-generated data is often hilarious in its own right, and it adds just the right amount of fun to any Kafka and its integrations!
xPatterns is a big data analytics platform as a service that enables a rapid development of enterprise-grade analytical applications. It provides tools, api sets and a management console for building an ELT pipeline with data monitoring and quality gates, a data warehouse for ad-hoc and scheduled querying, analysis, model building and experimentation, tools for exporting data to NoSql and solrCloud cluster for real-time access through low-latency/high-throughput apis as well as dashboard and visualization api/tools leveraging the available data and models. In this presentation we will showcase one of the analytical applications build on top of xPatterns for our largest customer for that runs xPatterns in production on top a data warehouse consisting of several hundreds TB of medical, pharmacy and lab data records consisting of tens of billions of records. We will showcase the xPatterns components in the form of APIs and tools employed throughout the entire lifecycle of this application. The core of the presentation is the evolution of the infrastructure from the Hadoop/Hive stack to the new BDAS Spark, Shark, Mesos and Tachyon, with lessons learned and demos.
Cooperative Task Execution for Apache SparkDatabricks
Apache Spark has enabled a vast assortment of users to express batch, streaming, and machine learning computations, using a mixture of programming paradigms and interfaces. Lately, we observe that different jobs are often implemented as part of the same application to share application logic, state, or to interact with each other. Examples include online machine learning, real-time data transformation and serving, low-latency event monitoring and reporting. Although the recent addition of Structured Streaming to Spark provides the programming interface to enable such unified applications over bounded and unbounded data, the underlying execution engine was not designed to efficiently support jobs with different requirements (i.e., latency vs. throughput) as part of the same runtime. It therefore becomes particularly challenging to schedule such jobs to efficiently utilize the cluster resources while respecting their requirements in terms of task response times. Scheduling policies such as FAIR could alleviate the problem by prioritizing critical tasks, but the challenge remains, as there is no way to guarantee no queuing delays. Even though preemption by task killing could minimize queuing, it would also require task resubmission and loss of progress, leading to wasted cluster resources. In this talk, we present Neptune, a new cooperative task execution model for Spark with fine-grained control over resources such as CPU time. Neptune utilizes Scala coroutines as a lightweight mechanism to suspend task execution with sub-millisecond latency and introduces new scheduling policies that respect diverse task requirements while efficiently sharing the same runtime. Users can directly use Neptune for their continuous applications as it supports all existing DataFrame, DataSet, and RDD operators. We present an implementation of the execution model as part of Spark 2.4.0 and describe the observed performance benefits from running a number of streaming and machine learning workloads on an Azure cluster.
Speaker: Konstantinos Karanasos
Akka at Enterprise Scale: Performance Tuning Distributed ApplicationsLightbend
Organizations like Starbucks, HPE, and PayPal (see our customers) have selected the Akka toolkit for their enterprise scale distributed applications; and when it comes to squeezing out the best possible performance, the secret is using two particular modules in tandem: Akka Cluster and Akka Streams.
In this webinar by Nolan Grace, Senior Solution Architect at Lightbend, we look at these two Akka modules and discuss the features that will push your application architecture to the next tier of performance.
For the full blog post, including the video, visit: https://www.lightbend.com/blog/akka-at-enterprise-scale-performance-tuning-distributed-applications
Pulsar in the Lakehouse: Apache Pulsar™ with Apache Spark™ and Delta Lake - P...StreamNative
In this session, we provide an overview of the “Lakehouse” architecture and how Apache Pulsar™ can be used to support this architecture through integrations with the Apache Spark™ and Delta Lake to build your reliable data lake. We will also discuss the current state of Pulsar + Spark & Delta Lake connectors and discuss real world use cases and present the roadmap on what you can expect in the future of integrations between Spark, Delta Lake, and Pulsar communities.
Change Data Capture (CDC) and the Kafka Scylla Connector now allow Scylla to act as a data producer for Kafka streams. Discover how combining Scylla with the Confluent platform allows to maximize the value of NoSQL data by introducing Scylla as a key component of event driven architecture and enables streaming database changes, enriching these updates via message transformations, and allowing users to efficiently run data pipelines.
Kafka summit SF 2019 - the art of the event-streaming appNeil Avery
Have you ever imagined what it would be like to build a massively scalable streaming application on Kafka, the challenges, the patterns and the thought process involved? How much of the application can be reused? What patterns will you discover? How does it all fit together? Depending upon your use case and business, this can mean many things. Starting out with a data pipeline is one thing, but evolving into a company-wide real-time application that is business critical and entirely dependent upon a streaming platform is a giant leap. Large-scale streaming applications are also called event streaming applications. They are classically different from other data systems; event streaming applications are viewed as a series of interconnected streams that are topologically defined using stream processors; they hold state that models your use case as events. Almost like a deconstructed realtime database.
In this talk, I step through the origins of event streaming systems, understanding how they are developed from raw events to evolve into something that can be adopted at an organizational scale. I start with event-first thinking, Domain Driven Design to build data models that work with the fundamentals of Streams, Kafka Streams, KSQL and Serverless (FaaS). Building upon this, I explain how to build common business functionality by stepping through patterns for Scalable payment processing Run it on rails: Instrumentation and monitoring Control flow patterns (start, stop, pause) Finally, all of these concepts are combined in a solution architecture that can be used at enterprise scale. I will introduce enterprise patterns such as events-as-a-backbone, events as APIs and methods for governance and self-service. You will leave talk with an understanding of how to model events with event-first thinking, how to work towards reusable streaming patterns and most importantly, how it all fits together at scale.
Extending Spark Streaming to Support Complex Event ProcessingOh Chan Kwon
In this talk, we introduce the extensions of Spark Streaming to support (1) SQL-based query processing and (2) elastic-seamless resource allocation. First, we explain the methods of supporting window queries and query chains. As we know, last year, Grace Huang and Jerry Shao introduced the concept of “StreamSQL” that can process streaming data with SQL-like queries by adapting SparkSQL to Spark Streaming. However, we made advances in supporting complex event processing (CEP) based on their efforts. In detail, we implemented the sliding window concept to support a time-based streaming data processing at the SQL level. Here, to reduce the aggregation time of large windows, we generate an efficient query plan that computes the partial results by evaluating only the data entering or leaving the window and then gets the current result by merging the previous one and the partial ones. Next, to support query chains, we made the result of a query over streaming data be a table by adding the “insert into” query. That is, it allows us to apply stream queries to the results of other ones. Second, we explain the methods of allocating resources to streaming applications dynamically, which enable the applications to meet a given deadline. As the rate of incoming events varies over time, resources allocated to applications need to be adjusted for high resource utilization. However, the current Spark's resource allocation features are not suitable for streaming applications. That is, the resources allocated will not be freed when new data are arriving continuously to the streaming applications even though the quantity of the new ones is very small. In order to resolve the problem, we consider their resource utilization. If the utilization is low, we choose victim nodes to be killed. Then, we do not feed new data into the victims to prevent a useless recovery issuing when they are killed. Accordingly, we can scale-in/-out the resources seamlessly.
Scaling Security on 100s of Millions of Mobile Devices Using Apache Kafka® an...confluent
Watch this talk here: https://www.confluent.io/online-talks/scaling-security-on-100s-of-millions-of-mobile-devices-using-kafka-and-scylla-on-demand
Join mobile cybersecurity leader Lookout as they talk through their data ingestion journey.
Lookout enables enterprises to protect their data by evaluating threats and risks at post-perimeter endpoint devices and providing access to corporate data after conditional security scans. Their continuous assessment of device health creates a massive amount of telemetry data, forcing new approaches to data ingestion. Learn how Lookout changed its approach in order to grow from 1.5 million devices to 100 million devices and beyond, by implementing Confluent Platform and switching to Scylla.
Achieving a 50% Reduction in Cross-AZ Network Costs from Kafka (Uday Sagar Si...confluent
Cloud providers like AWS allow free data transfers within an Availability Zone (AZ), but bill users when data moves between AZs. When the data volume streamed through Kafka reaches big data scale, (e.g. numeric data points or user activity tracking), the costs incurred by cross-AZ traffic can add significantly to your monthly cloud spend. Since Kafka serves reads and writes only from leader partitions, for a topic with a replication factor of 3, a message sent through Kafka can cross AZs up to 4 times. Once when a producer produces a message onto broker in a different AZ, two times during Kafka replication, and once more during message consumption. With careful design, we can eliminate the first and last part of the cross AZ traffic. We can also use message compression strategies provided by Kafka to reduce costs during replication. In this talk, we will discuss the architectural choices that allow us to ensure a Kafka message is produced and consumed within a single AZ, as well as an algorithm that lets consumers intelligently subscribe to partitions with leaders in the same AZ. We will also cover use cases in which cross-AZ message streaming is unavoidable due to design limitations. Talk outline: 1) A review of Kafka replication, 2) Cross-AZ traffic implications, 3) Architectural choices for AZ-aware message streaming, 4) Algorithms for AZ-aware producers and consumers, 5) Results, 6) Limitations, 7) Takeaways.
Unified Stream Processing at Scale with Apache Samza - BDS2017Jacob Maes
The shift to stream processing at LinkedIn has accelerated over the past few years. We now have over 200 Samza applications in production processing more than 260B events per day. Many of these are new applications, but there have also been more migrations from existing online and offline applications. To support the influx of new use cases, we have improved the flexibility, efficiency and reliability of Apache Samza.
In this talk, we will take a brief look at the broader streaming ecosystem at LinkedIn, then we will zoom in on a few representative use cases and explain how they are powered by recent advancements to Apache Samza including a unified high level API, flexible deployment model, batch processing, and more.
Malo Denielou - No shard left behind: Dynamic work rebalancing in Apache BeamFlink Forward
http://flink-forward.org/kb_sessions/no-shard-left-behind-dynamic-work-rebalancing-in-apache-beam/
The Apache Beam (incubating) programming model is designed to support several advanced data processing features such as autoscaling and dynamic work rebalancing. In this talk, we will first explain how dynamic work rebalancing not only provides a general and robust solution to the problem of stragglers in traditional data processing pipelines, but also how it allows autoscaling to be truly effective. We will then present how dynamic work rebalancing works as implemented in Google Cloud Dataflow and which path other Apache Beam runners link Apache Flink can follow to benefit from it.
Better Kafka Performance Without Changing Any Code | Simon Ritter, AzulHostedbyConfluent
Apache Kafka is the most popular open-source stream-processing software for collecting, processing, storing, and analyzing data at scale. Most known for its excellent performance, low latency, fault tolerance, and high throughput, it's capable of handling thousands of messages per second. For mission-critical applications, how do you ensure that the performance delivered is the performance required? This is especially important as Kafka is written in Java and Scala and runs on the JVM. The JVM is a fantastic platform that delivers on an internet scale. In this session, we'll explore how making changes to the JVM design can eliminate the problems of garbage collection pauses and raise the throughput of applications. For cloud-based Kafka applications, this can deliver both lower latency and reduced infrastructure costs. All without changing a line of code!
Building Value Within the Heavy Vehicle Industry Using Big Data and Streaming...DataWorks Summit
More than 300,000 globally connected Scania vehicles, trucks and buses, are continuously submitting their GPS positions. In this presentation we will show how Scania analyses these positions to obtain valuable information about the operation of the vehicles. The algorithms have been developed in a research project, FUMA, that is run as a joint venture between Fraunhofer Chalmers Centre and Scania.
In the project we build a continuous delivery pipeline that enables us to iteratively improve our code, our algorithms and the data deliverables. The pipeline runs on a Hortonworks platform using Apache Spark Streaming. In the build pipeline we use Jenkins, Nexus and Ansible to test, deploy and run Apache Spark Streaming jobs and the results are pushed to Apache Kafka. We will highlight and present some of the steps we have taken in order to put a streaming big data application in production at a manufacturing company. We think that people with a general awareness of the challenges with big data, the possibilities of the streaming paradigm and the need for continuous delivery will find this talk very intriguing. In this presentation you will learn how we develop and run the code and how we ensure that we are creating value for Scania.
(Nina Hanzlikova, Zalando) Kafka Summit SF 2018
My team at Zalando fell in love with KStreams and their programming model straight out of the gate. However, as a small team of developers, building out and supporting our infrastructure while still trying to deliver solutions for our business has not always resulted in a smooth journey.
Can a small team of a couple of developers run their own Kafka infrastructure confidently and still spend most of their time developing code?
In this talk, we will dive into some of the problems we experienced while running Kafka brokers and Kafka Streams applications, as well as the consultations we had with other teams around this matter. We will outline some of the pragmatic decisions we made regarding backups, monitoring and operations to minimize our time spent administering our Kafka brokers and various stream applications.
Production Ready Kafka on Kubernetes (Devandra Tagare, Lyft) Kafka Summit SF ...confluent
Getting Kafka running on Kubernetes is only step one of a journey to create a production-ready Kafka cluster. This talk walks through the other steps: 1) Monitoring and remediating faults. 2) Updates to Kubernetes nodes for clusters not using shared storage. 3) Automating Kafka updates and restarts. We present how to create fault-tolerant Kafka clusters on Kubernetes without sacrificing availability, durability, or latency. Learn about Lyft's overlay-free Kubernetes networking driver and how we use it to keep performance on par with non-Kubernetes clusters.
Kickstart your Kafka with Faker Data | Francesco Tisiot, Aiven.ioHostedbyConfluent
We all love to play with the shiny toys, but an event stream with no events is a sorry sight. In this session you’ll see how to create your own streaming dataset for Apache Kafka using Python and the Faker library. You’ll learn how to create a random data producer and define the structure and rate of its message delivery. Randomly-generated data is often hilarious in its own right, and it adds just the right amount of fun to any Kafka and its integrations!
xPatterns is a big data analytics platform as a service that enables a rapid development of enterprise-grade analytical applications. It provides tools, api sets and a management console for building an ELT pipeline with data monitoring and quality gates, a data warehouse for ad-hoc and scheduled querying, analysis, model building and experimentation, tools for exporting data to NoSql and solrCloud cluster for real-time access through low-latency/high-throughput apis as well as dashboard and visualization api/tools leveraging the available data and models. In this presentation we will showcase one of the analytical applications build on top of xPatterns for our largest customer for that runs xPatterns in production on top a data warehouse consisting of several hundreds TB of medical, pharmacy and lab data records consisting of tens of billions of records. We will showcase the xPatterns components in the form of APIs and tools employed throughout the entire lifecycle of this application. The core of the presentation is the evolution of the infrastructure from the Hadoop/Hive stack to the new BDAS Spark, Shark, Mesos and Tachyon, with lessons learned and demos.
Cooperative Task Execution for Apache SparkDatabricks
Apache Spark has enabled a vast assortment of users to express batch, streaming, and machine learning computations, using a mixture of programming paradigms and interfaces. Lately, we observe that different jobs are often implemented as part of the same application to share application logic, state, or to interact with each other. Examples include online machine learning, real-time data transformation and serving, low-latency event monitoring and reporting. Although the recent addition of Structured Streaming to Spark provides the programming interface to enable such unified applications over bounded and unbounded data, the underlying execution engine was not designed to efficiently support jobs with different requirements (i.e., latency vs. throughput) as part of the same runtime. It therefore becomes particularly challenging to schedule such jobs to efficiently utilize the cluster resources while respecting their requirements in terms of task response times. Scheduling policies such as FAIR could alleviate the problem by prioritizing critical tasks, but the challenge remains, as there is no way to guarantee no queuing delays. Even though preemption by task killing could minimize queuing, it would also require task resubmission and loss of progress, leading to wasted cluster resources. In this talk, we present Neptune, a new cooperative task execution model for Spark with fine-grained control over resources such as CPU time. Neptune utilizes Scala coroutines as a lightweight mechanism to suspend task execution with sub-millisecond latency and introduces new scheduling policies that respect diverse task requirements while efficiently sharing the same runtime. Users can directly use Neptune for their continuous applications as it supports all existing DataFrame, DataSet, and RDD operators. We present an implementation of the execution model as part of Spark 2.4.0 and describe the observed performance benefits from running a number of streaming and machine learning workloads on an Azure cluster.
Speaker: Konstantinos Karanasos
Akka at Enterprise Scale: Performance Tuning Distributed ApplicationsLightbend
Organizations like Starbucks, HPE, and PayPal (see our customers) have selected the Akka toolkit for their enterprise scale distributed applications; and when it comes to squeezing out the best possible performance, the secret is using two particular modules in tandem: Akka Cluster and Akka Streams.
In this webinar by Nolan Grace, Senior Solution Architect at Lightbend, we look at these two Akka modules and discuss the features that will push your application architecture to the next tier of performance.
For the full blog post, including the video, visit: https://www.lightbend.com/blog/akka-at-enterprise-scale-performance-tuning-distributed-applications
Pulsar in the Lakehouse: Apache Pulsar™ with Apache Spark™ and Delta Lake - P...StreamNative
In this session, we provide an overview of the “Lakehouse” architecture and how Apache Pulsar™ can be used to support this architecture through integrations with the Apache Spark™ and Delta Lake to build your reliable data lake. We will also discuss the current state of Pulsar + Spark & Delta Lake connectors and discuss real world use cases and present the roadmap on what you can expect in the future of integrations between Spark, Delta Lake, and Pulsar communities.
Change Data Capture (CDC) and the Kafka Scylla Connector now allow Scylla to act as a data producer for Kafka streams. Discover how combining Scylla with the Confluent platform allows to maximize the value of NoSQL data by introducing Scylla as a key component of event driven architecture and enables streaming database changes, enriching these updates via message transformations, and allowing users to efficiently run data pipelines.
Kafka summit SF 2019 - the art of the event-streaming appNeil Avery
Have you ever imagined what it would be like to build a massively scalable streaming application on Kafka, the challenges, the patterns and the thought process involved? How much of the application can be reused? What patterns will you discover? How does it all fit together? Depending upon your use case and business, this can mean many things. Starting out with a data pipeline is one thing, but evolving into a company-wide real-time application that is business critical and entirely dependent upon a streaming platform is a giant leap. Large-scale streaming applications are also called event streaming applications. They are classically different from other data systems; event streaming applications are viewed as a series of interconnected streams that are topologically defined using stream processors; they hold state that models your use case as events. Almost like a deconstructed realtime database.
In this talk, I step through the origins of event streaming systems, understanding how they are developed from raw events to evolve into something that can be adopted at an organizational scale. I start with event-first thinking, Domain Driven Design to build data models that work with the fundamentals of Streams, Kafka Streams, KSQL and Serverless (FaaS). Building upon this, I explain how to build common business functionality by stepping through patterns for Scalable payment processing Run it on rails: Instrumentation and monitoring Control flow patterns (start, stop, pause) Finally, all of these concepts are combined in a solution architecture that can be used at enterprise scale. I will introduce enterprise patterns such as events-as-a-backbone, events as APIs and methods for governance and self-service. You will leave talk with an understanding of how to model events with event-first thinking, how to work towards reusable streaming patterns and most importantly, how it all fits together at scale.
Extending Spark Streaming to Support Complex Event ProcessingOh Chan Kwon
In this talk, we introduce the extensions of Spark Streaming to support (1) SQL-based query processing and (2) elastic-seamless resource allocation. First, we explain the methods of supporting window queries and query chains. As we know, last year, Grace Huang and Jerry Shao introduced the concept of “StreamSQL” that can process streaming data with SQL-like queries by adapting SparkSQL to Spark Streaming. However, we made advances in supporting complex event processing (CEP) based on their efforts. In detail, we implemented the sliding window concept to support a time-based streaming data processing at the SQL level. Here, to reduce the aggregation time of large windows, we generate an efficient query plan that computes the partial results by evaluating only the data entering or leaving the window and then gets the current result by merging the previous one and the partial ones. Next, to support query chains, we made the result of a query over streaming data be a table by adding the “insert into” query. That is, it allows us to apply stream queries to the results of other ones. Second, we explain the methods of allocating resources to streaming applications dynamically, which enable the applications to meet a given deadline. As the rate of incoming events varies over time, resources allocated to applications need to be adjusted for high resource utilization. However, the current Spark's resource allocation features are not suitable for streaming applications. That is, the resources allocated will not be freed when new data are arriving continuously to the streaming applications even though the quantity of the new ones is very small. In order to resolve the problem, we consider their resource utilization. If the utilization is low, we choose victim nodes to be killed. Then, we do not feed new data into the victims to prevent a useless recovery issuing when they are killed. Accordingly, we can scale-in/-out the resources seamlessly.
Scaling Security on 100s of Millions of Mobile Devices Using Apache Kafka® an...confluent
Watch this talk here: https://www.confluent.io/online-talks/scaling-security-on-100s-of-millions-of-mobile-devices-using-kafka-and-scylla-on-demand
Join mobile cybersecurity leader Lookout as they talk through their data ingestion journey.
Lookout enables enterprises to protect their data by evaluating threats and risks at post-perimeter endpoint devices and providing access to corporate data after conditional security scans. Their continuous assessment of device health creates a massive amount of telemetry data, forcing new approaches to data ingestion. Learn how Lookout changed its approach in order to grow from 1.5 million devices to 100 million devices and beyond, by implementing Confluent Platform and switching to Scylla.
Achieving a 50% Reduction in Cross-AZ Network Costs from Kafka (Uday Sagar Si...confluent
Cloud providers like AWS allow free data transfers within an Availability Zone (AZ), but bill users when data moves between AZs. When the data volume streamed through Kafka reaches big data scale, (e.g. numeric data points or user activity tracking), the costs incurred by cross-AZ traffic can add significantly to your monthly cloud spend. Since Kafka serves reads and writes only from leader partitions, for a topic with a replication factor of 3, a message sent through Kafka can cross AZs up to 4 times. Once when a producer produces a message onto broker in a different AZ, two times during Kafka replication, and once more during message consumption. With careful design, we can eliminate the first and last part of the cross AZ traffic. We can also use message compression strategies provided by Kafka to reduce costs during replication. In this talk, we will discuss the architectural choices that allow us to ensure a Kafka message is produced and consumed within a single AZ, as well as an algorithm that lets consumers intelligently subscribe to partitions with leaders in the same AZ. We will also cover use cases in which cross-AZ message streaming is unavoidable due to design limitations. Talk outline: 1) A review of Kafka replication, 2) Cross-AZ traffic implications, 3) Architectural choices for AZ-aware message streaming, 4) Algorithms for AZ-aware producers and consumers, 5) Results, 6) Limitations, 7) Takeaways.
Unified Stream Processing at Scale with Apache Samza - BDS2017Jacob Maes
The shift to stream processing at LinkedIn has accelerated over the past few years. We now have over 200 Samza applications in production processing more than 260B events per day. Many of these are new applications, but there have also been more migrations from existing online and offline applications. To support the influx of new use cases, we have improved the flexibility, efficiency and reliability of Apache Samza.
In this talk, we will take a brief look at the broader streaming ecosystem at LinkedIn, then we will zoom in on a few representative use cases and explain how they are powered by recent advancements to Apache Samza including a unified high level API, flexible deployment model, batch processing, and more.
Strata Singapore: GearpumpReal time DAG-Processing with Akka at ScaleSean Zhong
Gearpump is a Akka based realtime streaming engine, it use Actor to model everything. It has super performance and flexibility. It has performance of 18000000 messages/second and latency of 8ms on a cluster of 4 machines.
Apache Pulsar with MQTT for Edge Computing - Pulsar Summit Asia 2021StreamNative
Today we will span from edge to any and all clouds to support data collection, real-time streaming, sensor ingest, edge computing, IoT use cases and edge AI. Apache Pulsar allows us to build computing at the edge and produce and consume messages at scale in any IoT, hybrid or cloud environment. Apache Pulsar supports MoP which allows for MQTT protocol to be used for high speed messaging.
We will teach you to quickly build scalable open source streaming applications regardless of if you are running in containers, pods, edge devices, VMs, on-premise servers, moving vehicles and any cloud.
The magic behind your Lyft ride prices: A case study on machine learning and ...Karthik Murugesan
Rakesh Kumar and Thomas Weise explore how Lyft dynamically prices its rides with a combination of various data sources, ML models, and streaming infrastructure for low latency, reliability, and scalability—allowing the pricing system to be more adaptable to real-world changes.
Intro to Apache Apex - Next Gen Platform for Ingest and TransformApache Apex
Introduction to Apache Apex - The next generation native Hadoop platform. This talk will cover details about how Apache Apex can be used as a powerful and versatile platform for big data processing. Common usage of Apache Apex includes big data ingestion, streaming analytics, ETL, fast batch alerts, real-time actions, threat detection, etc.
Bio:
Pramod Immaneni is Apache Apex PMC member and senior architect at DataTorrent, where he works on Apache Apex and specializes in big data platform and applications. Prior to DataTorrent, he was a co-founder and CTO of Leaf Networks LLC, eventually acquired by Netgear Inc, where he built products in core networking space and was granted patents in peer-to-peer VPNs.
Will it Scale? The Secrets behind Scaling Stream Processing ApplicationsNavina Ramesh
This talk was presented at the Apache Big Data 2016, North America conference that was held in Vancouver, CA (http://events.linuxfoundation.org/events/archive/2016/apache-big-data-north-america/program/schedule)
Sensor data is streamed in realtime from Arduino + accelerometeres, gyroscopes & compass 3D, ultrasound distance sensor, etc. using UDP protocol. The data processing is done with reactive Java alterantive implementations: callbacks, CompletableFutures and using Spring 5 Reactor library. The web 3D visualization with Three.js is streamed using Server Sent Events (SSE).
A video for the IoT demo is available @YouTube: https://www.youtube.com/watch?v=AB3AWAfcy9U
All source code of the demo is freely available @GitHub: https://github.com/iproduct/reactive-demos-iot
There are more reactive Java demos in the same repository - callbacks, CompletableFuture, realtime event streaming. Soon I'll add a description how to build the device and upload Arduino sketch, as well as describe CompletableFuture and Reactor demos and 3D web visualization part with Three.js. Please stay tuned :)
Apache Flink Internals: Stream & Batch Processing in One System – Apache Flin...ucelebi
An in-depth look at Apache Flink’s Streaming Dataflow Engine. Flink executes data streaming programs directly as streams with low latency and flexible user-defined state and models batch programs as streaming programs on finite data streams.
The slides cover the general design of the runtime and show how the engine is able to support diverse features and workloads without compromising on performance or usability.
Flink Forward, Berlin
October 13, 2015
Flink Forward Berlin 2018: Thomas Weise & Aljoscha Krettek - "Python Streamin...Flink Forward
Python is popular amongst data scientists and engineers for data processing tasks. The big data ecosystem has traditionally been rather JVM centric. Often Java (or Scala) are the only viable option to implement data processing pipelines. That sometimes poses an adoption barrier for organizations that have already invested in other language ecosystems. The Apache Beam project provides a unified programming model for data processing and its ongoing portability effort aims to enable multiple language SDKs (currently Java, Python and Go) on a common set of runners. The combination of Python streaming on the Apache Flink runner is one example. Let’s take a look how the Flink runner translates the Beam model into the native DataStream (or DataSet) API, how the runner is changing to support portable pipelines, how Python user code execution is coordinated with gRPC based services and how a sample pipeline runs on Flink.
Python is popular amongst data scientists and engineers for data processing tasks. The big data ecosystem has traditionally been rather JVM centric. Often Java (or Scala) are the only viable option to implement data processing pipelines. That sometimes poses an adoption barrier for organizations that have already invested in other language ecosystems. The Apache Beam project provides a unified programming model for data processing and its ongoing portability effort aims to enable multiple language SDKs (currently Java, Python and Go) on a common set of runners. The combination of Python streaming on the Apache Flink runner is one example. Let’s take a look how the Flink runner translates the Beam model into the native DataStream (or DataSet) API, how the runner is changing to support portable pipelines, how Python user code execution is coordinated with gRPC based services and how a sample pipeline runs on Flink.
Big Data, Big Quality? by Irene Gonzálvez at Big Data Spain 2017Big Data Spain
Insights can only be as good as the data. The data quality domain is enormously large, so you need to understand your company pain points to know what to focus on first.
https://www.bigdataspain.org/2017/talk/big-data-big-quality
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Scaling a backend for a big data and blockchain environment by Rafael Ríos at...Big Data Spain
2gether is a financial platform based on Blockchain, Big Data and Artificial Intelligence that allows interaction between users and third-party services in a single interface.
https://www.bigdataspain.org/2017/talk/scaling-a-backend-for-a-big-data-and-blockchain-environment
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Disaster Recovery for Big Data by Carlos Izquierdo at Big Data Spain 2017Big Data Spain
All modern Big Data solutions, like Hadoop, Kafka or the rest of the ecosystem tools, are designed as distributed processes and as such include some sort of redundancy for High Availability.
https://www.bigdataspain.org/2017/talk/disaster-recovery-for-big-data
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Presentation: Boost Hadoop and Spark with in-memory technologies by Akmal Cha...Big Data Spain
In this presentation, attendees will see how to speed up existing Hadoop and Spark deployments by just making Apache Ignite responsible for RAM utilization. No code modifications, no new architecture from scratch!
https://www.bigdataspain.org/2017/talk/boost-hadoop-and-spark-with-in-memory-technologies
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Data science for lazy people, Automated Machine Learning by Diego Hueltes at ...Big Data Spain
The power of this new set of tools for Data Science. Is really easy to start applying these technics in your current workflow.
https://www.bigdataspain.org/2017/talk/data-science-for-lazy-people-automated-machine-learning
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Training Deep Learning Models on Multiple GPUs in the Cloud by Enrique Otero ...Big Data Spain
GPUs on the cloud as Infrastructure as a Service (IaaS) seem a commodity. However to efficiently distribute deep learning tasks on several GPUs is challenging.
https://www.bigdataspain.org/2017/talk/training-deep-learning-models-on-multiple-gpus-in-the-cloud
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Unbalanced data: Same algorithms different techniques by Eric Martín at Big D...Big Data Spain
Unbalanced data is a specific data configuration that appears commonly in nature. Applying machine learning techniques to this kind of data is a difficult process, usually addressed by unbalanced reduction techniques.
https://www.bigdataspain.org/2017/talk/unbalanced-data-same-algorithms-different-techniques
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
State of the art time-series analysis with deep learning by Javier Ordóñez at...Big Data Spain
Time series related problems have traditionally been solved using engineered features obtained by heuristic processes.
https://www.bigdataspain.org/2017/talk/state-of-the-art-time-series-analysis-with-deep-learning
Big Data Spain 2017
November 16th - 17th
Trading at market speed with the latest Kafka features by Iñigo González at B...Big Data Spain
Not long ago only banks and hedge funds could afford doing automated and High Frequency Trading, that is, the ability to send buy commodities in microseconds intervals.
https://www.bigdataspain.org/2017/talk/trading-at-market-speed-with-the-latest-kafka-features
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
The Analytic Platform behind IBM’s Watson Data Platform by Luciano Resende a...Big Data Spain
IBM has built a “Data Science Experience” cloud service that exposes Notebook services at web scale.
https://www.bigdataspain.org/2017/talk/the-analytic-platform-behind-ibms-watson-data-platform
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Artificial Intelligence and Data-centric businesses by Óscar Méndez at Big Da...Big Data Spain
Artificial Intelligence and Data-centric businesses.
https://www.bigdataspain.org/2017/talk/tbc
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Why big data didn’t end causal inference by Totte Harinen at Big Data Spain 2017Big Data Spain
Ten years ago there were rumours of the death of causal inference. Big data was supposed to enable us to rely on purely correlational data to predict and control the world.
https://www.bigdataspain.org/2017/talk/why-big-data-didnt-end-causal-inference
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Meme Index. Analyzing fads and sensations on the Internet by Miguel Romero at...Big Data Spain
The Meme of the Internet Index will be the new normal to analyze and predict facts and sensations which go around the Internet.
https://www.bigdataspain.org/2017/talk/meme-index-analyzing-fads-and-sensations-on-the-internet
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Vehicle Big Data that Drives Smart City Advancement by Mike Branch at Big Dat...Big Data Spain
Geotab is a leader in the expanding world of Internet of Things (IoT) and telematics industry with Big Data.
https://www.bigdataspain.org/2017/talk/vehicle-big-data-that-drives-smart-city-advancement
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
End of the Myth: Ultra-Scalable Transactional Management by Ricardo Jiménez-P...Big Data Spain
The talk will focus on explaining why operational databases do not scale due to limitations in legacy transactional management.
https://www.bigdataspain.org/2017/talk/end-of-the-myth-ultra-scalable-transactional-management
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Attacking Machine Learning used in AntiVirus with Reinforcement by Rubén Mart...Big Data Spain
In recent years Machine Learning (ML) and especially Deep Learning (DL) have achieved great success in many areas such as visual recognition, NLP or even aiding in medical research.
https://www.bigdataspain.org/2017/talk/attacking-machine-learning-used-in-antivirus-with-reinforcement
Big Data Spain 2017
16th - 17th Kinépolis Madrid
More people, less banking: Blockchain by Salvador Casquero at Big Data Spain ...Big Data Spain
Primary function of banking sector is promoting economic activity; which means “commerce”, exchanging what someone produces-has for something that someone consumes-desires.
https://www.bigdataspain.org/2017/talk/more-people-less-banking-blockchain
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Make the elephant fly, once again by Sourygna Luangsay at Big Data Spain 2017Big Data Spain
Bol.com has been an early Hadoop user: since 2008 where it was first built for a recommendation algorithm.
https://www.bigdataspain.org/2017/talk/make-the-elephant-fly-once-again
Big Data Spain 2017
16th - 17th Kinépolis Madrid
Feature selection for Big Data: advances and challenges by Verónica Bolón-Can...Big Data Spain
In an era of growing data complexity and volume and the advent of Big Data, feature selection has a key role to play in helping reduce high-dimensionality in machine learning problems.
https://www.bigdataspain.org/2017/talk/feature-selection-for-big-data-advances-and-challenges
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
JMeter webinar - integration with InfluxDB and GrafanaRTTS
Watch this recorded webinar about real-time monitoring of application performance. See how to integrate Apache JMeter, the open-source leader in performance testing, with InfluxDB, the open-source time-series database, and Grafana, the open-source analytics and visualization application.
In this webinar, we will review the benefits of leveraging InfluxDB and Grafana when executing load tests and demonstrate how these tools are used to visualize performance metrics.
Length: 30 minutes
Session Overview
-------------------------------------------
During this webinar, we will cover the following topics while demonstrating the integrations of JMeter, InfluxDB and Grafana:
- What out-of-the-box solutions are available for real-time monitoring JMeter tests?
- What are the benefits of integrating InfluxDB and Grafana into the load testing stack?
- Which features are provided by Grafana?
- Demonstration of InfluxDB and Grafana using a practice web application
To view the webinar recording, go to:
https://www.rttsweb.com/jmeter-integration-webinar
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
As AI technology is pushing into IT I was wondering myself, as an “infrastructure container kubernetes guy”, how get this fancy AI technology get managed from an infrastructure operational view? Is it possible to apply our lovely cloud native principals as well? What benefit’s both technologies could bring to each other?
Let me take this questions and provide you a short journey through existing deployment models and use cases for AI software. On practical examples, we discuss what cloud/on-premise strategy we may need for applying it to our own infrastructure to get it to work from an enterprise perspective. I want to give an overview about infrastructure requirements and technologies, what could be beneficial or limiting your AI use cases in an enterprise environment. An interactive Demo will give you some insides, what approaches I got already working for real.
The Art of the Pitch: WordPress Relationships and SalesLaura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if sometime changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
Search and Society: Reimagining Information Access for Radical FuturesBhaskar Mitra
The field of Information retrieval (IR) is currently undergoing a transformative shift, at least partly due to the emerging applications of generative AI to information access. In this talk, we will deliberate on the sociotechnical implications of generative AI for information access. We will argue that there is both a critical necessity and an exciting opportunity for the IR community to re-center our research agendas on societal needs while dismantling the artificial separation between the work on fairness, accountability, transparency, and ethics in IR and the rest of IR research. Instead of adopting a reactionary strategy of trying to mitigate potential social harms from emerging technologies, the community should aim to proactively set the research agenda for the kinds of systems we should build inspired by diverse explicitly stated sociotechnical imaginaries. The sociotechnical imaginaries that underpin the design and development of information access technologies needs to be explicitly articulated, and we need to develop theories of change in context of these diverse perspectives. Our guiding future imaginaries must be informed by other academic fields, such as democratic theory and critical theory, and should be co-developed with social science scholars, legal scholars, civil rights and social justice activists, and artists, among others.
Key Trends Shaping the Future of Infrastructure.pdfCheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud and open-source; exploring how these areas are likely to mature and develop over the short and long-term, and then considering how organisations can position themselves to adapt and thrive.
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualityInflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
Neuro-symbolic is not enough, we need neuro-*semantic*Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Unified Stream Processing at Scale with Apache Samza by Jake Maes at Big Data Spain 2017
1.
2. 1
Unified Processing at Scale with Apache
Samza
Jake Maes
Staff SW Engineer at LinkedIn
Apache Samza PMC
3. 2
About Me
● Apache Samza PMC member
● LinkedIn 3 years
● 8 years performance & infra development
● Passionate about scale
● Long walks on the peaks
4. 3
Agenda
Intro to Stream Processing
Stream Processing Ecosystem at LinkedIn
Use Case: Pre-Existing Service
Use Case: Batch Streaming
Future
5. 4
Agenda
Intro to Stream Processing
Stream Processing Ecosystem at LinkedIn
Use Case: Pre-Existing Service
Use Case: Batch Streaming
Future
6. 5
About
● Production at LinkedIn since 2014
● Apache top level project since 2014
● 16 Committers
● 74 Contributors
● Known for
Scale
Pluggability
Kafka integration
7. 6
● Low latency
● One message at a time
● Checkpointing, state, durability
● All I/O with high-performance message brokers
Traditional Stream Processing
10. 9
Typical Flow - Two Stages Minimum
Re-
partitio
n
windo
w
ma
p
sendT
o
PageVie
w
Event
PageViewEven
t
ByMemberId
PageViewEventP
er
MemberStream
PageViewRepartitionTask PageViewByMemberIdCounterTask
11. 10
Agenda
Intro to Stream Processing
Stream Processing Ecosystem at LinkedIn
Use Case: Pre-Existing Service
Use Case: Batch Streaming
Future
12. 11
Stream Processing Ecosystem – The Dream
Applications and Services
Samz
a
Kafka
Storag
e
Externa
l
Stream
s
Storage
&
Serving
Brooklin
13. 12
Stream Processing Ecosystem - Reality
Applications and Services
Samz
a
Kafka
Storag
e
Externa
l
Stream
s
Storage
&
Serving
Brooklin
14. 13
Expansion of Stream Processing at LinkedIn
● Influx of new applications
10 -> over 200
● New use cases
Batch Streaming
Remote I/O
Composable API
● Incoming applications have different expectations
● Let’s take a look at two
Services
15. 14
Agenda
Intro to Stream Processing
Stream Processing Ecosystem at LinkedIn
Use Case: Pre-Existing Service
Use Case: Batch Streaming
Future
16. 15
Online Service + Stream Processing
Requirements:
● Deployment model
Cluster environment not suitable
● Remote I/O
Dependencies on other services
I/O latency stalls single threaded processor
Container parallelism - too much overhead
Services
17. 16
App Instance
Embedded Samza
● Zookeeper-based JobCoordinator
Uses Zookeeper for leader election
Leader assigns work to the processors
ZooKeeperZooKeeper
Stream Processor
Samza
Container
Job
Coordinato
r*
App Instance
Stream Processor
Samza
Container
Job
Coordinato
r
App Instance
Stream Processor
Samza
Container
Job
Coordinato
r
* Leader
19. 18
Checkpointing
● Sync – Barrier
● Async - Watermark
t1 t2 t3 tc
t4
checkpoint
callback
3
complet
e
time
callback
1
complet
e
callback
2compl
ete
callback
4
complet
e
20. 19
Performance for Remote I/O
Baseline
Thread pool size =
10
Max concurrency =
1
Thread pool size =
10
Max concurrency =
3
Sync I/O with MultithreadingSingle thread
21. 20
Case Study – Notification Scheduler
Processor
User Chat
Event
User
Action
Event
Connectio
n Activity
Event
Restful
Service
s
Member
profile
database
Aggregatio
n Engine
Channel
Selection
State
store
input1
input2
input3
① Local Data Access
② Remote Database
Lookup
③ Remote Service
Call
outp
ut
22. 21
Agenda
Intro to Stream Processing
Stream Processing Ecosystem at LinkedIn
Use Case: Pre-Existing Service
Use Case: Batch Streaming
Future
23. 22
Offline Jobs
Requirements:
● Performance and low latency
● Resource hungry
Finite jobs can hog resources
Infinite jobs need to be better citizens
● Composable API
● Same app in batch and streaming
Best of both worlds
● HDFS I/O
25. 24
High Level Logic
public class RepartitionAndCounterExample implements StreamApplication {
@Override public void init(StreamGraph graph, Config config) {
MessageStream<PageViewEvent> pve =
graph.getInputStream("pageViewEvent", (k, m) -> (PageViewEvent) m);
OutputStream<String, MyOutputType, MyOutputType> mpv = graph
.getOutputStream("memberPageViews", m -> m.memberId, m -> m);
pve
.partitionBy(m -> m.memberId)
.window(Windows.keyedTumblingWindow(m -> m.memberId, Duration.ofMinutes(5), () -> 0,
(m, c) -> c + 1))
.map(MyOutputType::new)
.sendTo(mpv);
}
} Built-in transform
functions
26. 25
High Level API - Composable Operators
filter select a subset of messages from the stream
map map one input message to an output message
flatMap map one input message to 0 or more output messages
merge union all inputs into a single output stream
partitionBy re-partition the input messages based on a specific field
sendTo send the result to an output stream
sink send the result to an external system (e.g. external DB)
window window aggregation on the input stream
join join messages from two input streams
Stateless
Functions
I/O
Function
s
Stateful
Functions
30. 29
Agenda
Intro to Stream Processing
Stream Processing Ecosystem at LinkedIn
Use Case: Pre-Existing Service
Use Case: Batch Streaming
Future
31. 30
What’s Next?
● SQL
Prototyped 2015
Now getting full time attention
● High Level API extensions
Better config, I/O, windowing, and more
● Beam Runner
Samza performance with Beam API
● Table support