Data stream processing platforms and microservices platform infrastructure and strategies are converging. As we move toward larger, more complex, and more decoupled systems, and as the global information graph continues to grow, our frontier of unsolved challenges grows just as fast. Central challenges for distributed systems include persistence strategies across data centers, zones, or regions; network partitions; data optimization; and system stability in all phases.
How can leveraging CRDTs and Event Sourcing address several core distributed systems challenges? What strategies and patterns are useful in the design, deployment, and running of stateful and stateless applications for the cloud, for example with Kubernetes? Combined with code samples, we will see how Akka Cluster, Multi-DC Persistence, Split Brain Resolver, Sharding and Distributed Data can help solve these problems.
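One such CRDT, a grow-only counter (G-Counter), can be sketched in a few lines of Python. This is an illustration of the general technique, not Akka Distributed Data's API; the node names are invented. Each replica increments only its own slot, and replicas merge by taking element-wise maxima, so merges are commutative, associative, and idempotent and replicas converge regardless of delivery order, which is what makes the type safe under network partitions.

```python
class GCounter:
    """Grow-only counter CRDT: one slot per node, merge = element-wise max."""

    def __init__(self, counts=None):
        self.counts = dict(counts or {})

    def increment(self, node, amount=1):
        # A replica only ever increments its own slot.
        self.counts[node] = self.counts.get(node, 0) + amount

    def merge(self, other):
        # Commutative, associative, idempotent: safe to apply in any order,
        # any number of times, e.g. after a network partition heals.
        merged = dict(self.counts)
        for node, count in other.counts.items():
            merged[node] = max(merged.get(node, 0), count)
        return GCounter(merged)

    @property
    def value(self):
        return sum(self.counts.values())

# Two replicas diverge during a partition, then converge on merge.
a, b = GCounter(), GCounter()
a.increment("node-a", 3)
b.increment("node-b", 2)
assert a.merge(b).value == b.merge(a).value == 5
```

Real CRDT libraries provide richer types (sets, maps, registers) on the same principle: state that merges deterministically without coordination.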
This document discusses Mantis, a reactive stream processing system for operational insights. Mantis allows querying data on-demand, reusing data and results between jobs for efficiency. It enables job chaining through discovery of job outputs and auto-scales jobs and clusters based on workload. Mantis provides high throughput and low latency stream processing while maintaining data guarantees.
(1) The document discusses using an event streaming platform like Apache Kafka for advanced time series analysis (TSA). Typical processing patterns are described for converting raw data into time series and reconstructing graphs and networks from time series data.
(2) A challenge discussed is integrating data streams, experiments, and decision making. The document argues that stream processing using Kafka is better suited than batch processing for real-time business in changing environments and iterative research projects.
(3) The document describes approaches for performing time series analysis and network analysis using Kafka to create time series from event streams and graphs from time series pairs. A simplified architecture for complex streaming analytics using reusable building blocks is presented.
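The two patterns above can be roughly illustrated in plain Python (the function names and the crude co-movement test are invented here, not taken from the talk): raw events are bucketed into fixed windows to form time series, and pairs of series that move together become edges in a reconstructed graph.

```python
from collections import defaultdict

def to_time_series(events, window):
    """Bucket (timestamp, value) events into fixed-size windows, summing values."""
    series = defaultdict(float)
    for ts, value in events:
        series[ts // window] += value
    return [series[bucket] for bucket in sorted(series)]

def correlated(xs, ys):
    """Crude co-movement test: do the two series mostly rise and fall together?"""
    dx = [b - a for a, b in zip(xs, xs[1:])]
    dy = [b - a for a, b in zip(ys, ys[1:])]
    agree = sum(1 for u, v in zip(dx, dy) if u * v > 0)
    return agree >= len(dx) / 2

def build_graph(named_series):
    """Edge between every pair of series that co-move."""
    names = sorted(named_series)
    return {(a, b) for i, a in enumerate(names) for b in names[i + 1:]
            if correlated(named_series[a], named_series[b])}

# Three metrics: cpu and load rise together, mem falls -> one edge.
graph = build_graph({"cpu": [1, 2, 3], "load": [2, 4, 6], "mem": [9, 5, 1]})
assert graph == {("cpu", "load")}
```

A production pipeline would use a proper correlation measure and windowed stream processors, but the shape of the computation is the same.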
The Art of The Event Streaming Application: Streams, Stream Processors and Sc... (confluent)
1) The document discusses the art of building event streaming applications using various techniques like bounded contexts, stream processors, and architectural pillars.
2) Key aspects include modeling the application as a collection of loosely coupled bounded contexts, handling state using Kafka Streams, and building reusable stream processing patterns for instrumentation.
3) Composition patterns involve choreographing and orchestrating interactions between bounded contexts to capture business workflows and functions as event-driven data flows.
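The choreography side of those composition patterns can be sketched as follows (the bus, context names and event types are invented for illustration; the real thing would be Kafka topics): each bounded context subscribes to events, reacts, and emits its own, with no central coordinator.

```python
from collections import defaultdict

class EventBus:
    """In-memory stand-in for a Kafka topic: contexts subscribe by event type."""
    def __init__(self):
        self.handlers = defaultdict(list)
        self.log = []  # ordered record of everything published

    def subscribe(self, event_type, handler):
        self.handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        self.log.append(event_type)
        for handler in self.handlers[event_type]:
            handler(payload)

bus = EventBus()
# Choreography: Payments reacts to Orders' event, Shipping reacts to Payments'.
# No orchestrator tells either context what to do.
bus.subscribe("OrderPlaced", lambda order: bus.publish("PaymentTaken", order))
bus.subscribe("PaymentTaken", lambda order: bus.publish("OrderShipped", order))

bus.publish("OrderPlaced", {"order_id": 1})
assert bus.log == ["OrderPlaced", "PaymentTaken", "OrderShipped"]
```

The workflow emerges from the subscriptions; adding a new context (say, Notifications) means adding a subscriber, not editing a coordinator.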
Lessons From HPE: From Batch To Streaming For 20 Billion Sensors With Lightbe... (Lightbend)
In this guest webinar with Chris McDermott, Lead Data Engineer at HPE, learn how HPE InfoSight, powered by Lightbend Platform, has emerged as the go-to solution for providing real-time metrics and predictive analytics across various network, server, storage, and data center technologies.
Rediscovering the Value of Apache Kafka® in Modern Data Architecture (confluent)
This document discusses the origins and value of Apache Kafka in modern data architectures. It describes how Kafka was created to handle continuous flows of data, addressing limitations in databases and messaging systems. Kafka provides a unified solution for messaging, data storage, and stream processing. It originated from the ideas of treating the log as a first-class citizen and combining messaging, durable storage, and stream processing capabilities into a streaming platform. The document demonstrates how Kafka can be used to build a game scoring application using streams and tables. It recommends ways to learn more about Kafka including trying Confluent Cloud, tutorials, books, and attending Kafka Summit.
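The "log as a first-class citizen" idea can be sketched in a few lines (a deliberate simplification, not Kafka's actual protocol or API): producers append to an ordered log, and each consumer tracks its own offset, so the same retained data serves messaging, durable storage, and replay for stream processing.

```python
class Log:
    """Append-only log; consumers poll independently via their own offsets."""
    def __init__(self):
        self.records = []
        self.offsets = {}  # consumer name -> next offset to read

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1  # offset of the new record

    def poll(self, consumer):
        start = self.offsets.get(consumer, 0)
        batch = self.records[start:]
        self.offsets[consumer] = len(self.records)
        return batch

log = Log()
log.append({"player": "ada", "score": 10})
log.append({"player": "ada", "score": 25})

# Two independent consumers read the same retained records...
assert log.poll("scoreboard") == log.poll("audit")
# ...and a late consumer can still replay history from offset 0.
assert len(log.poll("latecomer")) == 2
```

This decoupling of writers from readers, with retention and replay, is what lets one system cover use cases that previously needed a message queue plus a database plus an ETL job.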
Real-time processing of large amounts of data (confluent)
This document discusses real-time processing of large amounts of data using a streaming platform. It begins with an agenda for the presentation, then discusses how streaming platforms can be used as a central nervous system in enterprises. Several use cases are presented, including using Apache Kafka and the Confluent Platform for applications like fraud detection, customer analytics, and migrating from batch to stream-based data processing. The rest of the document goes into details on Kafka, Confluent Platform, and how they can be used to build stream processing applications.
Reactive Fast Data & the Data Lake with Akka, Kafka, Spark (Todd Fritz)
This document discusses a presentation titled "Reactive Fast Data & the Data Lake with Akka, Kafka, Spark" given by Todd Fritz at DevNexus in February 2017. The presentation agenda covers reactive systems and patterns, fast data, data lakes, the intersection of these topics, and architecture considerations for building systems that can scale to millions of users and billions of messages. Key technologies discussed include Akka, Kafka, and Spark.
Independent of the source of data, the integration of event streams into an Enterprise Architecture gets more and more important in the world of sensors, social media streams and the Internet of Things. Events have to be accepted quickly and reliably, and they have to be distributed and analysed, often with many consumers or systems interested in all or part of the events. Storing such huge event streams into HDFS or a NoSQL datastore is feasible and no longer much of a challenge. But if you want to be able to react fast, with minimal latency, you cannot afford to first store the data and do the analysis/analytics later. You have to be able to include part of your analytics right after you consume the data streams. Products for doing event processing, such as Oracle Event Processing or Esper, have been available for quite a long time and used to be called Complex Event Processing (CEP). In the past few years, another family of products has appeared, mostly out of the Big Data technology space, called Stream Processing or Streaming Analytics. These are mostly open source products/frameworks such as Apache Storm, Spark Streaming, Flink and Kafka Streams, as well as supporting infrastructure such as Apache Kafka. In this talk I will present the theoretical foundations for Stream Processing, discuss the core properties a Stream Processing platform should provide, and highlight what differences you might find between the more traditional CEP and the more modern Stream Processing solutions.
The document discusses using Saga patterns and event sourcing with Kafka. It begins with introductions of Rafael Benevides and Roan Brasil Monteiro. It then provides an overview of moving from a monolithic to a microservices architecture and the challenges with synchronous calls. It introduces event sourcing, command sourcing, and Saga patterns, including choreography-based and orchestration-based approaches. It discusses using Kafka Streams to create an orchestrator and demonstrates Saga patterns with a booking-room use case. It provides a link to a demo implementation on GitHub.
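The orchestration-based variant can be sketched as follows (the booking steps and names are illustrative, loosely modeled on the booking-room use case rather than taken from the talk's GitHub demo): the orchestrator runs each local transaction in order and, on failure, executes the compensating actions for the completed steps in reverse.

```python
def run_saga(steps, state):
    """Each step is (name, action, compensation). On failure, undo in reverse."""
    completed = []
    for name, action, compensate in steps:
        try:
            action(state)
            completed.append((name, compensate))
        except Exception:
            # Compensate completed steps in reverse order (LIFO).
            for _, undo in reversed(completed):
                undo(state)
            return "rolled back"
    return "completed"

# Hypothetical local transactions and their compensations.
def reserve(s): s["rooms"] -= 1
def release(s): s["rooms"] += 1
def charge(s):
    if s["balance"] < s["price"]:
        raise RuntimeError("payment declined")
    s["balance"] -= s["price"]
def refund(s): s["balance"] += s["price"]

booking = [("reserve-room", reserve, release), ("charge-card", charge, refund)]

ok = {"rooms": 5, "balance": 100, "price": 80}
assert run_saga(booking, ok) == "completed" and ok["rooms"] == 4

broke = {"rooms": 5, "balance": 10, "price": 80}
assert run_saga(booking, broke) == "rolled back" and broke["rooms"] == 5
```

In a Kafka-based implementation each step and compensation would be an event exchanged between services, with the orchestrator's state itself kept in a changelog topic for recoverability.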
Kafka Summit SF 2019 - The art of the event-streaming app (Neil Avery)
Have you ever imagined what it would be like to build a massively scalable streaming application on Kafka, the challenges, the patterns and the thought process involved? How much of the application can be reused? What patterns will you discover? How does it all fit together? Depending upon your use case and business, this can mean many things. Starting out with a data pipeline is one thing, but evolving into a company-wide real-time application that is business critical and entirely dependent upon a streaming platform is a giant leap. Large-scale streaming applications are also called event streaming applications. They are classically different from other data systems; event streaming applications are viewed as a series of interconnected streams that are topologically defined using stream processors; they hold state that models your use case as events. Almost like a deconstructed realtime database.
In this talk, I step through the origins of event streaming systems, understanding how they develop from raw events into something that can be adopted at organizational scale. I start with event-first thinking and Domain-Driven Design to build data models that work with the fundamentals of streams, Kafka Streams, KSQL and Serverless (FaaS). Building upon this, I explain how to build common business functionality by stepping through patterns for scalable payment processing; instrumentation and monitoring ("run it on rails"); and control flow (start, stop, pause). Finally, all of these concepts are combined in a solution architecture that can be used at enterprise scale. I will introduce enterprise patterns such as events-as-a-backbone, events as APIs, and methods for governance and self-service. You will leave the talk with an understanding of how to model events with event-first thinking, how to work towards reusable streaming patterns and, most importantly, how it all fits together at scale.
From data stream management to distributed dataflows and beyond (Vasia Kalavri)
Recent efforts by academia and open-source communities have established stream processing as a principal data analysis technology across industry. All major cloud vendors offer streaming dataflow pipelines and online analytics as managed services. Notable use-cases include real-time fault detection in space networks, city traffic management, dynamic pricing for car-sharing, and anomaly detection in financial transactions. At the same time, streaming dataflow systems are increasingly being used for event-driven applications beyond analytics, such as orchestrating microservices and model serving. In the past decades, streaming technology has evolved significantly, however, emerging applications are once more challenging the design decisions of modern streaming systems. In this talk, I will discuss the evolution of stream processing and bring current trends and open problems to the attention of our community.
Reliable Data Ingestion in Big Data / IoT (Guido Schmutz)
Many of the Big Data and IoT use cases are based on combining data from multiple data sources and making them available on a Big Data platform for analysis. The data sources are often very heterogeneous, from simple files and databases to high-volume event streams from sensors (IoT devices). It's important to retrieve this data in a secure and reliable manner and integrate it with the Big Data platform so that it is available for analysis in real time (stream processing) as well as in batch (typical big data processing). In the past, some new tools have emerged which are especially capable of handling the process of integrating data from outside, often called Data Ingestion. From an outside perspective, they are very similar to traditional Enterprise Service Bus infrastructures, which larger organizations often use to handle message-driven and service-oriented systems. But there are also important differences: they are typically easier to scale in a horizontal fashion, offer a more distributed setup, are capable of handling high volumes of data/messages, provide very detailed monitoring at the message level, and integrate very well with the Hadoop ecosystem. This session will present and compare Apache Flume, Apache NiFi, StreamSets and the Kafka ecosystem and show how they handle data ingestion in a Big Data solution architecture.
A Journey to Modern Apps with Containers, Microservices and Big Data (Edward Hsu)
This document discusses the transition to modern enterprise applications using containers, microservices, and big data technologies. It provides examples of how Mayo Clinic and Uber have revolutionized their industries using these technologies. The key benefit of DC/OS is that it provides a datacenter operating system that simplifies deploying and managing modern apps at scale. It allows turning infrastructure into a unified pool of resources and installing distributed services like Spark and Kafka with one command. The DC/OS community contributes many open services, and it can be tried locally in under 20 minutes. Mesosphere Enterprise DC/OS provides additional support for production use.
End of the Myth: Ultra-Scalable Transactional Management by Ricardo Jiménez-P... (Big Data Spain)
The talk will focus on explaining why operational databases do not scale due to limitations in legacy transactional management.
https://www.bigdataspain.org/2017/talk/end-of-the-myth-ultra-scalable-transactional-management
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Architecting Microservices Applications with Instant Analytics (confluent)
View recording here: https://www.confluent.io/online-talks/architecting-microservices-applications-with-instant-analytics
The next-generation architecture for exploring and visualizing event-driven data in real time requires the right technology. Microservices deliver significant deployment and development agility, but raise questions of how data will move between services and how it will be analyzed. This online talk explores how Apache Druid and Apache Kafka® can turn a microservices ecosystem into a distributed real-time application with instant analytics. Apache Kafka and Druid form the backbone of an architecture that meets the demands imposed on the next-generation applications you are building right now. Join industry experts Tim Berglund, Confluent, and Rachel Pedreschi, Imply, as they discuss architecting microservices apps with Druid and Apache Kafka.
What every software engineer should know about streams and tables in Kafka ... (confluent)
This document provides an overview of streams and tables in Apache Kafka. It begins with defining events, streams, and tables. Streams record event history as a sequence, while tables represent the current state. It then discusses how to create tables from streams using aggregation. The document also covers topics, partitions, processing with ksqlDB and Kafka Streams, and other concepts like fault tolerance, elasticity, and capacity planning.
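The stream/table relationship described here reduces to a fold (a sketch in plain Python rather than ksqlDB or Kafka Streams): the stream is the full keyed event history, and the table is the running aggregation of that history, so the table can always be rebuilt by replaying the stream.

```python
def table_from_stream(stream, aggregate):
    """Materialize a table (dict) by folding each keyed event into the prior state."""
    table = {}
    for key, value in stream:
        # aggregate(old_state_or_None, new_value) -> new state for this key
        table[key] = aggregate(table.get(key), value)
    return table

# Stream of (player, points) events; table = current total per player.
stream = [("alice", 3), ("bob", 5), ("alice", 4)]
totals = table_from_stream(stream, lambda old, new: (old or 0) + new)
assert totals == {"alice": 7, "bob": 5}

# The table is derived state: replaying the same stream yields the same table.
assert table_from_stream(stream, lambda old, new: (old or 0) + new) == totals
```

Swapping the aggregation function changes the table: `lambda old, new: new` keeps only the latest value per key, which is exactly the semantics of a changelog-backed KTable.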
Jun Rao, Confluent | Kafka Summit SF 2019 Keynote ft. Chris Kasten, Walmart Labs (confluent)
Apache Kafka is a widely used open-source platform for building real-time data pipelines and streaming applications. It addresses limitations of using databases to handle high volumes of event data by providing a distributed, scalable, and fault-tolerant event streaming platform. Major companies like Royal Bank of Canada and Carnival Cruise Line rely on Kafka's capabilities for applications like fraud detection, digital marketing, and building event-driven systems.
Concepts and Patterns for Streaming Services with Kafka (QAware GmbH)
Cloud Native Night March 2020, Mainz: Talk by Perry Krol (@perkrol, Confluent)
Abstract: Proven approaches such as service-oriented and event-driven architectures are joined by newer techniques such as microservices, reactive architectures, DevOps, and stream processing. Many of these patterns are successful by themselves, but they provide a more holistic and compelling approach when applied together. In this session, Confluent will provide insights into how service-based architectures and stream processing tools such as Apache Kafka® can help you build business-critical systems. You will learn why streaming beats request-response based architectures in complex, contemporary use cases, and why replayable logs such as Kafka provide a backbone for both service communication and shared datasets.
Based on these principles, we will explore how event collaboration and event sourcing patterns increase safety and recoverability with functional, event-driven approaches, apply patterns including Event Sourcing and CQRS, and how to build multi-team systems with microservices and SOA using patterns such as “inside out databases” and “event streams as a source of truth”.
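A compact sketch of Event Sourcing and CQRS together (the account names and helpers here are invented for illustration): commands validate and append events to a log, current state is rebuilt by replaying that log, and a separate read model is projected from the very same events.

```python
events = []  # the event log: the single source of truth

def deposit(account, amount):          # command side: validate, then append
    events.append(("Deposited", account, amount))

def withdraw(account, amount):
    if balance(account) < amount:      # invariant checked against replayed state
        raise ValueError("insufficient funds")
    events.append(("Withdrawn", account, amount))

def balance(account):                  # state = replay of the event log
    total = 0
    for kind, acct, amount in events:
        if acct == account:
            total += amount if kind == "Deposited" else -amount
    return total

def statement(account):                # read model projected from the same events
    return [f"{kind} {amount}" for kind, acct, amount in events if acct == account]

deposit("acc-1", 100)
withdraw("acc-1", 30)
assert balance("acc-1") == 70
assert statement("acc-1") == ["Deposited 100", "Withdrawn 30"]
```

With Kafka as the log, the write side produces to a topic and each read model is a consumer maintaining its own projection, which is the "event streams as a source of truth" pattern mentioned above.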
Apache Kafka - Scalable Message-Processing and more! (Guido Schmutz)
Presentation @ Oracle Code Berlin.
Independent of the source of data, the integration of event streams into an Enterprise Architecture gets more and more important in the world of sensors, social media streams and the Internet of Things. Events have to be accepted quickly and reliably, and they have to be distributed and analysed, often with many consumers or systems interested in all or part of the events. How can we make sure that all these events are accepted and forwarded in an efficient and reliable way? This is where Apache Kafka comes into play: a distributed, highly scalable messaging broker, built for exchanging huge amounts of messages between a source and a target. This session will start with an introduction to Apache Kafka and present the role of Apache Kafka in a modern data / information architecture and the advantages it brings to the table.
Partner Development Guide for Kafka Connect (confluent)
This guide is intended to provide useful background to developers implementing Kafka Connect sources and sinks for their data stores. Visit www.confluent.io for more information.
User Behavior Analysis with Session Windows and Apache Kafka's Streams API (confluent)
For many industries, the need to group together related events based on a period of activity or inactivity is key. Advertising businesses and content producers are just a few examples of where session windows can be used to better understand user behavior.
While such sessionization has been possible in Apache Kafka up to this point, implementing it has been rather complex and required leveraging low-level APIs. In the most recent release of Kafka, however, new capabilities have been added making session windows much easier to implement.
In this online talk, we’ll introduce the concept of a session window, talk about common use cases, and walk through how Apache Kafka can be used for session-oriented use cases.
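The core of sessionization is just grouping by an inactivity gap; a pure-Python sketch, not the Kafka Streams `SessionWindows` API itself: events closer together than the gap fall into one session, and a longer silence starts a new one.

```python
def sessionize(timestamps, gap):
    """Split event timestamps into sessions separated by >= gap of inactivity."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] < gap:
            sessions[-1].append(ts)   # within the gap: extend the current session
        else:
            sessions.append([ts])     # inactivity gap exceeded: open a new session
    return sessions

# Clicks at t=1..3 form one session; the click at t=20 starts another.
clicks = [1, 2, 3, 20, 21]
assert sessionize(clicks, gap=10) == [[1, 2, 3], [20, 21]]
```

The hard parts that Kafka Streams adds on top of this idea are per-key windows, out-of-order arrivals (which can merge two existing sessions), and fault-tolerant window state.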
The document discusses the importance of data governance and schemas for streaming data platforms using Apache Kafka. It recommends using a schema registry to define schemas for Kafka topics, handle schema changes, and prevent incompatible changes. A schema registry provides a single source of truth for schemas, prevents bad data, and allows for increased agility when modifying schemas while maintaining compatibility. It benefits the entire application lifecycle from development to production.
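The compatibility rule a schema registry enforces can be approximated in a few lines. This is a simplification of Avro's schema-resolution rules with invented field descriptors, checking just one backward-compatibility condition: a new consumer schema may add a field only if it carries a default, or records written under the old schema cannot be decoded.

```python
def backward_compatible(old_schema, new_schema):
    """Can a consumer on new_schema read records written with old_schema?
    Schemas are {field_name: {"default": ...}} dicts; a field added by the
    new schema must carry a default, since old records won't contain it."""
    for name, spec in new_schema.items():
        if name not in old_schema and "default" not in spec:
            return False
    return True

v1 = {"user_id": {}, "amount": {}}
v2 = {"user_id": {}, "amount": {}, "currency": {"default": "USD"}}
v3 = {"user_id": {}, "amount": {}, "currency": {}}  # no default

assert backward_compatible(v1, v2)       # safe evolution
assert not backward_compatible(v1, v3)   # a registry would reject this change
```

A real registry checks this (and forward/full variants) at publish time, which is what prevents a producer from silently breaking every downstream consumer.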
Building a Modern, Scalable Cyber Intelligence Platform with Apache Kafka | J... (HostedbyConfluent)
As cyber threats continuously grow in sophistication and frequency, companies need to quickly acclimate to effectively detect, respond, and protect their environments. At Intel, we’ve addressed this need by implementing a modern, scalable Cyber Intelligence Platform (CIP) based on Splunk and Apache Kafka. We believe that CIP positions us for the best defense against cyber threats well into the future.
Our CIP ingests tens of terabytes of data each day and transforms it into actionable insights through streams processing, context-smart applications, and advanced analytics techniques. Kafka serves as a massive data pipeline within the platform. It achieves economies of scale by acquiring data once and consuming it many times. It reduces technical debt by eliminating custom point-to-point connections for producing and consuming data. At the same time, it provides the ability to operate on data in-stream, enabling us to reduce Mean Time to Detect (MTTD) and Mean Time to Respond (MTTR). Faster detection and response ultimately lead to better prevention.
In our session, we'll discuss the details described in the IT@Intel white paper that was published in November 2020 with the same title. We'll share some stream processing techniques, such as filtering and enriching in Kafka, to deliver contextually rich data to Splunk and many of our security controls.
Independent of the source of data, the integration of event streams into an Enterprise Architecture gets more and more important in the world of sensors, social media streams and the Internet of Things. Events have to be accepted quickly and reliably, and they have to be distributed and analysed, often with many consumers or systems interested in all or part of the events. Storing such huge event streams into HDFS or a NoSQL datastore is feasible and no longer much of a challenge. But if you want to be able to react fast, with minimal latency, you cannot afford to first store the data and do the analysis/analytics later. You have to be able to include part of your analytics right after you consume the event streams. Products for doing event processing, such as Oracle Event Processing or Esper, have been available for quite a long time and also used to be called Complex Event Processing (CEP). In the last 3 years, another family of products has appeared, mostly out of the Big Data technology space, called Stream Processing or Streaming Analytics. These are mostly open source products/frameworks such as Apache Storm, Spark Streaming and Apache Samza, as well as supporting infrastructure such as Apache Kafka. In this talk I will present the theoretical foundations for Event and Stream Processing, present what differences you might find between the more traditional CEP and the more modern Stream Processing solutions, and show that a combination will bring the most value.
Building an Enterprise Eventing Framework (Bryan Zelle, Centene; Neil Buesing...) (confluent)
Centene is fundamentally modernizing its legacy monolithic systems to support distributed, real-time, event-driven healthcare information processing. A key part of our architecture is the development of a universal eventing framework to accommodate transformation into an event-driven architecture (EDA). Our application provides a representational state transfer (REST) and remote procedure call (gRPC) interface that allows development teams to publish and consume events with a simple Noun-Verb-Object (NVO) syntax. Embedded within the framework are structured schema evolutions with Confluent Schema Registry and Avro, configurable (self-service) event routing with KTables, dynamic event aggregation with Kafka Streams, distributed event tracing with Jaeger, and event querying against a MongoDB event store hydrated by Kafka Connect. Lastly, we developed techniques to handle long-term event storage within Kafka, specifically surrounding the automated deletion of expired events and re-hydration of missing events. In Centene's first business use case, events related to claim processing of provider reconsiderations were used to provide real-time updates to providers on the status of their claim appeals. To satisfy the business requirement, multiple monolith systems independently leveraged the event framework to stream status updates for display on the Centene Provider Portal instantly. This provided a capability that was brand new to Centene: the ability to interact and engage with our providers in real time through the use of event streams. In this presentation, we will walk you through the architecture of the eventing framework and showcase how business requirements within our claims adjudication domain were solved by leveraging the Kafka Streams DSL and the Confluent Platform. More importantly, we will show how Centene plans on leveraging this framework, written on top of Kafka Streams, to change our culture from batch processing to real-time stream processing.
Deep Learning at Extreme Scale (in the Cloud) with the Apache Kafka Open Sou... (Kai Wähner)
How to Build a Machine Learning Infrastructure with Kafka, Connect, Streams, KSQL, etc…
This talk shows how to build Machine Learning models at extreme scale and how to productionize the built models in mission-critical real time applications by leveraging open source components in the public cloud. The session discusses the relation between TensorFlow and the Apache Kafka ecosystem - and why this is a great fit for machine learning at extreme scale.
The Machine Learning architecture includes: Kafka Connect for continuous high volume data ingestion into the public cloud, TensorFlow leveraging Deep Learning algorithms to build an analytic model on powerful GPUs, Kafka Streams for model deployment and inference in real time, and KSQL for real time analytics of predictions, alerts and model accuracy.
Sensor analytics for predictive alerting in real time is used as real world example from Internet of Things scenarios. A live demo shows the out-of-the-box integration and dynamic scalability of these components on Google Cloud.
Key takeaways for the audience
• Learn how to build a Machine Learning infrastructure at extreme scale and how to productionize the built models in mission-critical real time applications
• Understand the benefits of a machine learning platform on the public cloud
• Learn about an extreme scale Machine Learning architecture around the Apache Kafka open source ecosystem including Kafka Connect, Kafka Streams and KSQL
• See a live demo for an Internet of Things use case: Sensor analytics for predictive alerting in real time
Data scientists and data engineers love Python for transforming, filtering, and processing data to train and deploy analytic models with frameworks such as TensorFlow. However, in real-world deployments, all of these steps require a scalable and reliable infrastructure. This session shows how data experts can use Python for data processing and model inference at scale, leveraging Python, Jupyter, Apache Kafka, and KSQL.
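Stripped of the Kafka and KSQL plumbing, the inference step looks roughly like this (a sketch with an invented threshold "model" standing in for a real TensorFlow one, and made-up sensor fields): consume events, score each one, and emit alerts as a derived stream.

```python
def score(reading):
    """Stand-in for model inference; a real deployment would call a trained model."""
    return 1.0 if reading["temperature"] > 80 else 0.0

def predict_stream(readings, threshold=0.5):
    """Map a stream of sensor readings to a stream of alert events (lazily)."""
    for reading in readings:
        p = score(reading)
        if p >= threshold:
            yield {"sensor": reading["sensor"], "alert": True, "score": p}

readings = [
    {"sensor": "s1", "temperature": 70},
    {"sensor": "s2", "temperature": 95},
]
alerts = list(predict_stream(readings))
assert [a["sensor"] for a in alerts] == ["s2"]
```

In the architecture described above, `readings` would be a Kafka consumer, `score` a loaded TensorFlow model, and the yielded alerts would be produced to an output topic for KSQL to analyze.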
Talk from Oracle Code One / Oracle World 2019 in San Francisco.
Introducing Events and Stream Processing into Nationwide Building Society (Ro...) (confluent)
Facing Open Banking regulation, rapidly increasing transaction volumes and increasing customer expectations, Nationwide took the decision to take load off their back-end systems through real-time streaming of data changes into Kafka. Hear about how Nationwide started their journey with Kafka, from their initial use case of creating a real-time data cache using Change Data Capture, Kafka and Microservices to how Kafka allowed them to build a stream processing backbone used to reengineer the entire banking experience including online banking, payment processing and mortgage applications. See a working demo of the system and what happens to the system when the underlying infrastructure breaks. Technologies covered include: Change Data Capture, Kafka (Avro, partitioning and replication) and using KSQL and Kafka Streams Framework to join topics and process data.
This talk will address new architectures emerging for large-scale streaming analytics: some based on Spark, Mesos, Akka, Cassandra and Kafka (SMACK), and other newer streaming analytics platforms and frameworks using Apache Flink or GearPump. Popular architectures like Lambda separate the layers of computation and delivery, and require many technologies with overlapping functionality. This can result in duplicated code, untyped processes, and high operational overhead, not to mention the cost (e.g. ETL).
I will discuss the problem domain and what is needed in terms of strategies, architecture and application design and code to begin leveraging simpler data flows. We will cover how the particular set of technologies addresses common requirements and how collaboratively they work together to enrich and reinforce each other.
Building Reactive Distributed Systems For Streaming Big Data, Analytics & Mac... (Helena Edelson)
Building Self Healing, Intelligent Platforms, systems that learn, multi-datacenter, removing human intervention with ML. Reactive Summit 2016 @helenaedelson
The document discusses using Saga patterns and event sourcing with Kafka. It begins with introductions of Rafael Benevides and Roan Brasil Monteiro. It then provides an overview of moving from a monolithic to microservices architecture and challenges with synchronous calls. It introduces event sourcing, command sourcing, and Saga patterns including choreography-based and orchestration-based approaches. It discusses using Kafka streams to create an orchestrator and demonstrates Saga patterns with a booking room use case. It provides a link to a demo implementation on GitHub.
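An orchestration-based Saga of the kind described can be sketched as a list of steps paired with compensating actions, run in reverse on failure. The booking-room step names are hypothetical; a real implementation would exchange commands and replies over Kafka topics rather than call functions directly.

```python
# Illustrative sketch of an orchestration-based Saga: each step has a
# compensating action, executed in reverse order if a later step fails.
# Step names are hypothetical stand-ins for the booking-room use case.

def run_saga(steps):
    """steps: list of (do, undo) callables. Returns (ok, log)."""
    log, done = [], []
    for do, undo in steps:
        try:
            log.append(do())
            done.append(undo)
        except Exception as exc:
            log.append(f"failed: {exc}")
            for comp in reversed(done):   # compensate in reverse order
                log.append(comp())
            return False, log
    return True, log

def book_room():   return "room booked"
def cancel_room(): return "room cancelled"
def charge_card(): raise RuntimeError("payment declined")
def refund_card(): return "refund issued"

ok, log = run_saga([(book_room, cancel_room), (charge_card, refund_card)])
print(ok, log)  # False ['room booked', 'failed: payment declined', 'room cancelled']
```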
Kafka Summit SF 2019 - The Art of the Event-Streaming App (Neil Avery)
Have you ever imagined what it would be like to build a massively scalable streaming application on Kafka, the challenges, the patterns and the thought process involved? How much of the application can be reused? What patterns will you discover? How does it all fit together? Depending upon your use case and business, this can mean many things. Starting out with a data pipeline is one thing, but evolving into a company-wide real-time application that is business critical and entirely dependent upon a streaming platform is a giant leap. Large-scale streaming applications are also called event streaming applications. They are classically different from other data systems; event streaming applications are viewed as a series of interconnected streams that are topologically defined using stream processors; they hold state that models your use case as events. Almost like a deconstructed realtime database.
In this talk, I step through the origins of event streaming systems, understanding how they are developed from raw events to evolve into something that can be adopted at an organizational scale. I start with event-first thinking and Domain Driven Design to build data models that work with the fundamentals of Streams, Kafka Streams, KSQL and Serverless (FaaS). Building upon this, I explain how to build common business functionality by stepping through patterns for: scalable payment processing; running it on rails (instrumentation and monitoring); and control flow (start, stop, pause). Finally, all of these concepts are combined in a solution architecture that can be used at enterprise scale. I will introduce enterprise patterns such as events-as-a-backbone, events as APIs, and methods for governance and self-service. You will leave the talk with an understanding of how to model events with event-first thinking, how to work towards reusable streaming patterns and, most importantly, how it all fits together at scale.
From data stream management to distributed dataflows and beyond (Vasia Kalavri)
Recent efforts by academia and open-source communities have established stream processing as a principal data analysis technology across industry. All major cloud vendors offer streaming dataflow pipelines and online analytics as managed services. Notable use cases include real-time fault detection in space networks, city traffic management, dynamic pricing for car-sharing, and anomaly detection in financial transactions. At the same time, streaming dataflow systems are increasingly being used for event-driven applications beyond analytics, such as orchestrating microservices and model serving. Streaming technology has evolved significantly in the past decades; however, emerging applications are once more challenging the design decisions of modern streaming systems. In this talk, I will discuss the evolution of stream processing and bring current trends and open problems to the attention of our community.
Reliable Data Ingestion in Big Data / IoT (Guido Schmutz)
Many Big Data and IoT use cases are based on combining data from multiple data sources and making it available on a Big Data platform for analysis. The data sources are often very heterogeneous, from simple files and databases to high-volume event streams from sensors (IoT devices). It's important to retrieve this data in a secure and reliable manner and integrate it with the Big Data platform so that it is available for analysis in real time (stream processing) as well as in batch (typical big data processing). In the past, new tools have emerged which are especially capable of handling the process of integrating data from outside, often called Data Ingestion. From an outside perspective, they are very similar to traditional Enterprise Service Bus infrastructures, which are often in use in larger organizations to handle message-driven and service-oriented systems. But there are also important differences: they are typically easier to scale horizontally, offer a more distributed setup, are capable of handling high volumes of data/messages, provide very detailed monitoring at message level, and integrate very well with the Hadoop ecosystem. This session will present and compare Apache Flume, Apache NiFi, StreamSets and the Kafka ecosystem, and show how they handle data ingestion in a Big Data solution architecture.
A Journey to Modern Apps with Containers, Microservices and Big Data (Edward Hsu)
This document discusses the transition to modern enterprise applications using containers, microservices, and big data technologies. It provides examples of how Mayo Clinic and Uber have revolutionized their industries using these technologies. The key benefits of DC/OS are that it provides a datacenter operating system that simplifies deploying and managing modern apps at scale. It allows turning infrastructure into a unified pool of resources and installing distributed services like Spark, Kafka with one command. The DC/OS community contributes many open services and it can be tried locally in under 20 minutes. Mesosphere Enterprise DC/OS provides additional support for production use.
End of the Myth: Ultra-Scalable Transactional Management by Ricardo Jiménez-P... (Big Data Spain)
The talk will focus on explaining why operational databases do not scale due to limitations in legacy transactional management.
https://www.bigdataspain.org/2017/talk/end-of-the-myth-ultra-scalable-transactional-management
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Architecting Microservices Applications with Instant Analytics (confluent)
View recording here: https://www.confluent.io/online-talks/architecting-microservices-applications-with-instant-analytics
The next generation architecture for exploring and visualizing event-driven data in real-time requires the right technology. Microservices deliver significant deployment and development agility, but raise questions of how data will move between services and how it will be analyzed. This online talk explores how Apache Druid and Apache Kafka® can turn a microservices ecosystem into a distributed real-time application with instant analytics. Apache Kafka and Druid form the backbone of an architecture that meets the demands imposed on the next generation applications you are building right now. Join industry experts Tim Berglund, Confluent, and Rachel Pedreschi, Imply, as they discuss architecting microservices apps with Druid and Apache Kafka.
What every software engineer should know about streams and tables in Kafka... (confluent)
This document provides an overview of streams and tables in Apache Kafka. It begins with defining events, streams, and tables. Streams record event history as a sequence, while tables represent the current state. It then discusses how to create tables from streams using aggregation. The document also covers topics, partitions, processing with ksqlDB and Kafka Streams, and other concepts like fault tolerance, elasticity, and capacity planning.
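The stream/table duality described above, a table as the aggregation of a stream, can be sketched without Kafka at all. Here a running per-key sum plays the role of a KTable; this is a conceptual illustration, not the Kafka Streams API.

```python
# Illustrative sketch: a "table" is the fold of a "stream" of events.
# Events are plain dicts; the dict returned plays the role of a KTable.

def table_from_stream(events):
    """Fold an event stream into current state (running sum per key)."""
    table = {}
    for ev in events:
        table[ev["key"]] = table.get(ev["key"], 0) + ev["amount"]
    return table

stream = [
    {"key": "alice", "amount": 10},
    {"key": "bob",   "amount": 5},
    {"key": "alice", "amount": 7},
]
print(table_from_stream(stream))  # {'alice': 17, 'bob': 5}
```

The stream records the full history; the table holds only the latest state per key, which is exactly the distinction the document draws.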
Jun Rao, Confluent | Kafka Summit SF 2019 Keynote ft. Chris Kasten, Walmart Labs (confluent)
Apache Kafka is a widely used open-source platform for building real-time data pipelines and streaming applications. It addresses limitations of using databases to handle high volumes of event data by providing a distributed, scalable, and fault-tolerant event streaming platform. Major companies like Royal Bank of Canada and Carnival Cruise Line rely on Kafka's capabilities for applications like fraud detection, digital marketing, and building event-driven systems.
Concepts and Patterns for Streaming Services with Kafka (QAware GmbH)
Cloud Native Night March 2020, Mainz: Talk by Perry Krol (@perkrol, Confluent)
Abstract: Proven approaches such as service-oriented and event-driven architectures are joined by newer techniques such as microservices, reactive architectures, DevOps, and stream processing. Many of these patterns are successful by themselves, but they provide a more holistic and compelling approach when applied together. In this session Confluent will provide insights into how service-based architectures and stream processing tools such as Apache Kafka® can help you build business-critical systems. You will learn why streaming beats request-response based architectures in complex, contemporary use cases, and why replayable logs such as Kafka provide a backbone for both service communication and shared datasets.
Based on these principles, we will explore how event collaboration and event sourcing patterns increase safety and recoverability with functional, event-driven approaches, how to apply patterns including Event Sourcing and CQRS, and how to build multi-team systems with microservices and SOA using patterns such as “inside out databases” and “event streams as a source of truth”.
Apache Kafka - Scalable Message-Processing and More! (Guido Schmutz)
Presentation @ Oracle Code Berlin.
Independent of the source of data, the integration of event streams into an Enterprise Architecture gets more and more important in the world of sensors, social media streams and the Internet of Things. Events have to be accepted quickly and reliably; they have to be distributed and analysed, often with many consumers or systems interested in all or part of the events. How can we make sure that all these events are accepted and forwarded in an efficient and reliable way? This is where Apache Kafka comes into play: a distributed, highly scalable messaging broker, built for exchanging huge amounts of messages between a source and a target. This session will start with an introduction to Apache Kafka, present its role in a modern data / information architecture, and outline the advantages it brings to the table.
Partner Development Guide for Kafka Connect (confluent)
This guide is intended to provide useful background to developers implementing Kafka Connect sources and sinks for their data stores. Visit www.confluent.io for more information.
User Behavior Analysis with Session Windows and Apache Kafka's Streams API (confluent)
For many industries the need to group together related events based on a period of activity or inactivity is key. Advertising businesses, content producers are just a few examples of where session windows can be used to better understand user behavior.
While such sessionization has been possible in Apache Kafka up to this point, implementing it has been rather complex and required leveraging low-level APIs. In the most recent release of Kafka, however, new capabilities have been added making session windows much easier to implement.
In this online talk, we’ll introduce the concept of a session window, talk about common use cases, and walk through how Apache Kafka can be used for session-oriented use cases.
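The session-window semantics described above can be sketched in a few lines: events belong to the same session until a gap of inactivity exceeds a threshold. This is an illustration of the concept only, not the Kafka Streams API, and the 30-unit gap is an arbitrary assumption.

```python
# Illustrative sketch of session windows: events for one user are grouped
# into sessions separated by a gap of inactivity (here > 30 time units).

def sessionize(timestamps, gap=30):
    """Group sorted event timestamps into sessions split on inactivity > gap."""
    sessions, current = [], []
    for ts in sorted(timestamps):
        if current and ts - current[-1] > gap:
            sessions.append(current)     # the inactivity gap closes a session
            current = []
        current.append(ts)
    if current:
        sessions.append(current)
    return sessions

print(sessionize([0, 10, 25, 100, 110, 200]))
# [[0, 10, 25], [100, 110], [200]]
```

A real Kafka Streams job would additionally key the events by user and merge sessions as late events arrive; the grouping rule, however, is the one shown.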
The document discusses the importance of data governance and schemas for streaming data platforms using Apache Kafka. It recommends using a schema registry to define schemas for Kafka topics, handle schema changes, and prevent incompatible changes. A schema registry provides a single source of truth for schemas, prevents bad data, and allows for increased agility when modifying schemas while maintaining compatibility. It benefits the entire application lifecycle from development to production.
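The kind of check a schema registry performs can be sketched with a deliberately simplified schema model, a mapping from field name to whether the field has a default, rather than real Avro. The rule shown (newly added fields must carry defaults so new readers can still read old data) is one facet of backward compatibility.

```python
# Illustrative sketch of a backward-compatibility check, simplified:
# schemas are {field_name: has_default} dicts, not real Avro schemas.

def backward_compatible(old, new):
    """A new schema is accepted only if every added field has a default."""
    added = set(new) - set(old)
    return all(new[f] for f in added)

old = {"id": False, "name": False}
ok_change  = {"id": False, "name": False, "email": True}   # default provided
bad_change = {"id": False, "name": False, "email": False}  # no default
print(backward_compatible(old, ok_change))   # True
print(backward_compatible(old, bad_change))  # False
```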
Building a Modern, Scalable Cyber Intelligence Platform with Apache Kafka | J... (HostedbyConfluent)
As cyber threats continuously grow in sophistication and frequency, companies need to quickly acclimate to effectively detect, respond, and protect their environments. At Intel, we’ve addressed this need by implementing a modern, scalable Cyber Intelligence Platform (CIP) based on Splunk and Apache Kafka. We believe that CIP positions us for the best defense against cyber threats well into the future.
Our CIP ingests tens of terabytes of data each day and transforms it into actionable insights through streams processing, context-smart applications, and advanced analytics techniques. Kafka serves as a massive data pipeline within the platform. It achieves economies of scale by acquiring data once and consuming it many times. It reduces technical debt by eliminating custom point-to-point connections for producing and consuming data. At the same time, it provides the ability to operate on data in-stream, enabling us to reduce Mean Time to Detect (MTTD) and Mean Time to Respond (MTTR). Faster detection and response ultimately lead to better prevention.
In our session, we’ll discuss the details described in the IT@Intel white paper that was published in Nov 2020 with same title. We’ll share some stream processing techniques, such as filtering and enriching in Kafka to deliver contextually rich data to Splunk and many of our security controls.
Independent of the source of data, the integration of event streams into an Enterprise Architecture gets more and more important in the world of sensors, social media streams and the Internet of Things. Events have to be accepted quickly and reliably; they have to be distributed and analysed, often with many consumers or systems interested in all or part of the events. Storing such huge event streams in HDFS or a NoSQL datastore is feasible and not such a challenge anymore. But if you want to be able to react fast, with minimal latency, you cannot afford to first store the data and do the analysis/analytics later. You have to be able to include part of your analytics right after you consume the event streams. Products for event processing, such as Oracle Event Processing or Esper, have been available for quite a long time, and this space used to be called Complex Event Processing (CEP). In the last three years, another family of products has appeared, mostly out of the Big Data technology space, called Stream Processing or Streaming Analytics. These are mostly open source products/frameworks such as Apache Storm, Spark Streaming and Apache Samza, as well as supporting infrastructure such as Apache Kafka. In this talk I will present the theoretical foundations of Event and Stream Processing, discuss the differences you might find between the more traditional CEP and the more modern Stream Processing solutions, and show that a combination brings the most value.
Building an Enterprise Eventing Framework (Bryan Zelle, Centene; Neil Buesing...) (confluent)
Centene is fundamentally modernizing its legacy monolithic systems to support distributed, real-time event-driven healthcare information processing. A key part of our architecture is the development of a universal eventing framework to accommodate transformation into an event-driven architecture (EDA). Our application provides a representational state transfer (REST) and remote procedure call (gRPC) interface that allows development teams to publish and consume events with a simple Noun-Verb-Object (NVO) syntax. Embedded within the framework are structured schema evolutions with Confluent Schema Registry and AVRO, configurable (self-service) event-routing with K-Tables, dynamic event-aggregation with Kafka Streams, distributed event-tracing with Jaeger, and event querying against a MongoDB event-store hydrated by Kafka Connect. Lastly, we developed techniques to handle long-term event storage within Kafka; specifically surrounding the automated deletion of expired events and re-hydration of missing events. In Centene's first business use case, events related to claim processing of provider reconsiderations were used to provide real-time updates to providers on the status of their claim appeals. To satisfy the business requirement, multiple monolith systems independently leveraged the event framework to stream status updates for display on the Centene Provider Portal instantly. This provided a capability that was brand new to Centene: the ability to interact and engage with our providers in real time through the use of event streams. In this presentation, we will walk you through the architecture of the eventing framework and showcase how the business requirements within our claims adjudication domain were solved leveraging the Kafka Streams DSL and the Confluent Platform. And more importantly, how Centene plans on leveraging this framework, written on top of Kafka Streams, to change our culture from batch processing to real-time stream processing.
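A Noun-Verb-Object event can be sketched as a simple envelope. The field names below are assumptions made for illustration, not Centene's actual framework API.

```python
# Hypothetical sketch of a Noun-Verb-Object (NVO) event envelope like the
# one described above; the field names are assumptions, not Centene's API.

def make_event(noun, verb, obj, payload):
    """Build an NVO event: what entity (noun) did what (verb) to what (object)."""
    return {"noun": noun, "verb": verb, "object": obj, "payload": payload}

ev = make_event("Claim", "StatusChanged", "Reconsideration",
                {"claim_id": "C-1", "status": "approved"})
print(ev["noun"], ev["verb"], ev["object"])  # Claim StatusChanged Reconsideration
```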
Deep Learning at Extreme Scale (in the Cloud) with the Apache Kafka Open Sou... (Kai Wähner)
How to Build a Machine Learning Infrastructure with Kafka, Connect, Streams, KSQL, etc…
This talk shows how to build Machine Learning models at extreme scale and how to productionize the built models in mission-critical real time applications by leveraging open source components in the public cloud. The session discusses the relation between TensorFlow and the Apache Kafka ecosystem - and why this is a great fit for machine learning at extreme scale.
The Machine Learning architecture includes: Kafka Connect for continuous high volume data ingestion into the public cloud, TensorFlow leveraging Deep Learning algorithms to build an analytic model on powerful GPUs, Kafka Streams for model deployment and inference in real time, and KSQL for real time analytics of predictions, alerts and model accuracy.
This document discusses data intensive applications and some of the challenges, tools, and best practices related to them. The key challenges with data intensive applications include large quantities of data, complex data structures, and rapidly changing data. Common tools mentioned include NoSQL databases, message queues, caches, search indexes, and batch/stream processing frameworks. The document also discusses concepts like distributed systems architectures, outage case studies, and strategies for improving reliability, scalability, and maintainability in data systems. Engineers working in this field need an accurate understanding of various tools and how to apply the right tools for different use cases while avoiding common pitfalls.
Relational Database Stockholm Syndrome (Neal Murray, 6 Point 6) London 2019 C... (confluent)
Relational databases can become rigid and limit flexibility over time as data needs change. This can lead to services becoming tightly coupled and difficult to independently deploy (Relational Database Stockholm Syndrome). The document discusses an alternative approach that uses a distributed log (Apache Kafka) to store data as events, with domain-specific services processing these events independently. This allows for greater agility, flexibility and independent deployment of services.
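Storing data as events in a distributed log, with services consuming independently, can be sketched as an append-only list plus per-service offsets. This is a toy stand-in for Kafka topics and consumer offsets, not the described system itself.

```python
# Illustrative sketch of a distributed log: one append-only event sequence,
# with each domain service tracking its own read offset independently, so
# services can be deployed, lag, or replay without affecting each other.

class Log:
    def __init__(self):
        self.events = []

    def append(self, event):
        self.events.append(event)

    def read_from(self, offset):
        """Return all events at or after the given offset."""
        return self.events[offset:]

log = Log()
log.append({"type": "AccountOpened", "id": 1})
log.append({"type": "Deposited", "id": 1, "amount": 50})

billing_offset, audit_offset = 0, 1        # services progress independently
print(len(log.read_from(billing_offset)))  # 2 (billing replays everything)
print(len(log.read_from(audit_offset)))    # 1 (audit already saw the first)
```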
This document discusses patterns for scaling systems incrementally. It introduces the ACD/C approach of making systems async, caching results, distributing work, and compromising on consistency as needed. Specific architectures like map reduce and distributed queues are presented. The challenges of partial failures, upgrades, and changing topologies are discussed. Testing is emphasized as critical for managing scaled systems.
Hard Truths About Streaming and Eventing (Dan Rosanova, Microsoft) Kafka Summ... (confluent)
Eventing and streaming open a world of compelling new possibilities to our software and platform designs. They can reduce time to decision and action while lowering total platform cost. But they are not a panacea. Understanding the edges and limits of these architectures can help you avoid painful missteps. This talk will focus on event driven and streaming architectures and how Apache Kafka can help you implement these. It will also discuss key tradeoffs you will face along the way from partitioning schemes to the impact of availability vs. consistency (CAP Theorem). Finally we’ll discuss some challenges of scale for patterns like Event Sourcing and how you can use other tools and even features of Kafka to work around them. This talk assumes a basic understanding of Kafka and distributed computing, but will include brief refresher sections.
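One common workaround for Event Sourcing at scale, of the kind the talk alludes to, is snapshotting: periodically persist state plus a version, then replay only newer events. A minimal sketch, assuming integer state and additive events:

```python
# Illustrative sketch of snapshotting for Event Sourcing: rebuilding state
# replays only the events after the last snapshot, not the whole history.

def apply(state, event):
    return state + event["delta"]

def rebuild(events, snapshot=None):
    """Rebuild state from an optional (state, version) snapshot plus later events."""
    state, version = snapshot or (0, 0)
    for ev in events[version:]:
        state = apply(state, ev)
    return state

events = [{"delta": d} for d in (5, -2, 10, 1)]
full = rebuild(events)                    # replay all 4 events
fast = rebuild(events, snapshot=(13, 3))  # snapshot at version 3, replay 1
print(full, fast)  # 14 14
```

Both paths yield the same state; the snapshot path simply bounds the replay cost, which is the point at scale.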
Chapter: Introduction to Distributed Systems.pptx (Tekle12)
This document provides an introduction to distributed systems. It discusses that a distributed system connects autonomous computers through a network to act as a single system. Key characteristics include distribution of resources, concurrency, and failure independence. Examples given are the internet, cloud computing, and peer-to-peer networks. The document also outlines several design goals for distributed systems like scalability, reliability, performance, and transparency. Finally, it describes different types of distributed systems including cluster computing, grid computing, cloud computing, and internet of things systems.
Building large scale, job processing systems with Scala Akka Actor framework (Vignesh Sukumar)
The document discusses building massive scale, fault tolerant job processing systems using the Scala Akka framework. It describes implementing a master-slave architecture with actors where an agent runs on each storage node to process jobs locally, achieving high throughput. It also covers controlling system load by dynamically adjusting parallelism, and implementing fine-grained fault tolerance through actor supervision strategies.
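The restart-style supervision strategy described can be illustrated in plain Python. The talk uses Scala/Akka; this is only a conceptual stand-in for a supervisor's restart directive with an escalation limit.

```python
# Illustrative sketch (plain Python, not Akka) of a restart supervision
# strategy: the supervisor restarts a failing child task up to a limit,
# then escalates the failure, mirroring fine-grained fault tolerance.

def supervise(task, max_restarts=3):
    """Run task; on failure restart it, escalating after max_restarts."""
    for attempt in range(max_restarts + 1):
        try:
            return task(attempt)
        except Exception:
            continue                      # restart the "child"
    raise RuntimeError("escalated: restart limit exceeded")

def flaky(attempt):
    if attempt < 2:                       # fails twice, then succeeds
        raise IOError("transient failure")
    return "job done"

print(supervise(flaky))  # job done
```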
The document discusses evolving data warehousing strategies and architecture options for implementing a modern data warehousing environment. It begins by describing traditional data warehouses and their limitations, such as lack of timeliness, flexibility, quality, and findability of data. It then discusses how data warehouses are evolving to be more modern by handling all types and sources of data, providing real-time access and self-service capabilities for users, and utilizing technologies like Hadoop and the cloud. Key aspects of a modern data warehouse architecture include the integration of data lakes, machine learning, streaming data, and offering a variety of deployment options. The document also covers data lake objectives, challenges, and implementation options for storing and analyzing large amounts of diverse data sources.
Observability – the good, the bad, and the ugly (Timetrix)
This document discusses observability and incident management. It notes that incidents are expensive and reduce credibility. Common causes of outages include changes, network failures, bugs, human errors, hardware failures, and unspecified issues. The timeline of an outage includes detection, investigation, escalation, and fixing. Many companies have a "zoo" of monitoring solutions that are difficult to manage. Common anti-patterns include an exponential growth of metrics that nobody understands. The document advocates focusing on key performance indicator metrics and using time-series databases, distributed tracing, and machine learning to more quickly detect anomalies and reduce incident timelines. It describes an open source project called Timetrix that combines metrics, events and traces for improved observability.
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1M9hGVj.
Helena Edelson addresses new architectures emerging for large scale streaming analytics - based on Spark, Mesos, Akka, Cassandra and Kafka (SMACK) and other streaming analytics platforms and frameworks using Apache Flink or GearPump. Edelson discusses the problem domain and what is needed in terms of strategies, architecture and application design and code to begin leveraging simpler data flows. Filmed at qconsf.com.
Helena Edelson is a committer to the Spark Cassandra Connector and a contributor to Akka, adding new features in Akka Cluster such as the initial version of the cluster metrics API and AdaptiveLoadBalancingRouter.
This webinar by Orkhan Gasimov (Senior Solution Architect, Consultant, GlobalLogic) was delivered at Java Community Webinar #3 on October 16, 2020.
During the webinar we gave a simplified overview of classical and modern architecture patterns and concepts used in the development of distributed applications over the last decade.
More details and presentation: https://www.globallogic.com/ua/about/events/java-community-webinar-3/
The adoption of container native and cloud native development practices presents new operational challenges. Today’s microservice environments are polyglot, distributed, container-based, highly-scalable, and ephemeral. To understand your system, you need to be able to follow the life of a request across numerous components distributed in multiple environments. Without the proper tools it can feel impossible to determine a root cause of an issue. This requires a new approach to operations. We will review a series of open source observability tools for logging, monitoring, and tracing to help developers achieve operational excellence for running container-based workloads.
Harness the Power of Data in a Big Data Lake discusses strategies for ingesting and processing data in a data lake. It describes how to design a data ingestion framework that accounts for factors like data format, source, size, and location. The document contrasts ETL vs ELT approaches and discusses techniques for batched and change data capture ingestion of both structured and unstructured data. It also provides an overview of tools like Sqoop that can be used to ingest data from relational databases into a data lake.
SignalFx Elasticsearch Metrics Monitoring and Alerting (SignalFx)
From our Feb 25, 2016 webcast on operating Elasticsearch at scale, the metrics to monitor, and how to create low-noise meaningful alerts on Elasticsearch performance.
Product Information - Fuse Management Central 1.0.0 (antonio.carvalho)
Fuse Management Central is an administration platform for OpenText Content Suite/Extended ECM, enabling centralized management of the system while monitoring its components.
Due to its architecture, it separates system administration from business administration, introducing a new layer of security to OpenText Content Suite administration.
Performing Oracle Health Checks Using APEX (Datavail)
With the heavy workload that most, if not all, DBAs face, it’s no wonder there is little time left to perform routine health checks. This presentation deck reviews the real value of health checks, based on the thousands of them performed for clients and how APEX can be used to standardize health checks.
The document discusses transitioning from a monolithic architecture to microservices architecture for an IoT cloud platform. Some key points include:
- Goals include enabling scalability, supporting new markets, and fostering innovation.
- Moving to a microservices architecture can help with scalability, fault tolerance, and independent deployability compared to a monolith.
- Organizational structure should also transition from function-based to product-based to align with the architecture.
- Technical considerations in building microservices include service interfaces, data management, fault tolerance, and DevOps practices.
Similar to Toward Predictability and Stability (20)
Humans have a tendency to invent new problems rather than solve old ones. As we build larger, more complex systems, we unearth global challenges around networks, compute resources and data. Have we neglected to see more elegant examples which existed all along?
It is possible for even the most complex systems to be organized and simplified in ways that may not occur to us. In situations where we still search for the right algorithms, by turning to complex natural systems around us we can find the problem was solved long ago. What we think is a new protocol may in fact be one that has been tested and evolving over hundreds or millions of years. One invented for the early internet is incredibly similar to a strategy evolved by desert ants millions of years ago. And this is why it works.
This talk will address these questions with examples of self-organization, decentralization and diversification from emergent phenomena found in nature.
Disorder And Tolerance In Distributed Systems At ScaleHelena Edelson
Rethinking intelligent resilient systems. Re-framing problems changes how we see and solve them. The intersection of scientific thought and principles parallels much of what we solve as engineers of information (e.g. uncertainty, time, distribution) and need. This talk is an interdisciplinary look at complex adaptive systems and how they innately solve things like resource distribution, growth and rebalancing. From the context of intelligence and systems, this talk will look at ideas around entropy and time, ensemble forecasting, self-organization theory, the butterfly effect, virus-human co-evolution and adaption, natural feedback loops, self-balancing, and adaptation.
Can we leverage these principles, behaviors and strategies to design intelligent systems at scale?
Can seeing things in an interdisciplinary way benefit solving common problems and speed innovation?
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch AnalysisHelena Edelson
Slides from my talk with Evan Chan at Strata San Jose: NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis. Streaming analytics architecture in big data for fast streaming, ad hoc and batch, with Kafka, Spark Streaming, Akka, Mesos, Cassandra and FiloDB. Simplifying to a unified architecture.
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Helena Edelson
O'Reilly Webcast with Myself and Evan Chan on the new SNACK Stack (playoff of SMACK) with FIloDB: Scala, Spark Streaming, Akka, Cassandra, FiloDB and Kafka.
Streaming Analytics with Spark, Kafka, Cassandra and AkkaHelena Edelson
This document discusses a new approach to building scalable data processing systems using streaming analytics with Spark, Kafka, Cassandra, and Akka. It proposes moving away from architectures like Lambda and ETL that require duplicating data and logic. The new approach leverages Spark Streaming for a unified batch and stream processing runtime, Apache Kafka for scalable messaging, Apache Cassandra for distributed storage, and Akka for building fault tolerant distributed applications. This allows building real-time streaming applications that can join streaming and historical data with simplified architectures that remove the need for duplicating data extraction and loading.
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, ScalaHelena Edelson
Scala Days, Amsterdam, 2015: Lambda Architecture - Batch and Streaming with Spark, Cassandra, Kafka, Akka and Scala; Fault Tolerance, Data Pipelines, Data Flows, Data Locality, Akka Actors, Spark, Spark Cassandra Connector, Big Data, Asynchronous data flows. Time series data, KillrWeather, Scalable Infrastructure, Partition For Scale, Replicate For Resiliency, Parallelism
Isolation, Data Locality, Location Transparency
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...Helena Edelson
Regardless of the meaning we are searching for over our vast amounts of data, whether we are in science, finance, technology, energy, health care…, we all share the same problems that must be solved: How do we achieve that? What technologies best support the requirements? This talk is about how to leverage fast access to historical data with real time streaming data for predictive modeling for lambda architecture with Spark Streaming, Kafka, Cassandra, Akka and Scala. Efficient Stream Computation, Composable Data Pipelines, Data Locality, Cassandra data model and low latency, Kafka producers and HTTP endpoints as akka actors...
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Helena Edelson
This document provides an overview of streaming big data with Spark, Kafka, Cassandra, Akka, and Scala. It discusses delivering meaning in near-real time at high velocity and an overview of Spark Streaming, Kafka and Akka. It also covers Cassandra and the Spark Cassandra Connector as well as integration in big data applications. The presentation is given by Helena Edelson, a Spark Cassandra Connector committer and Akka contributor who is a Scala and big data conference speaker working as a senior software engineer at DataStax.
Delivering Meaning In Near-Real Time At High Velocity In Massive Scale with A...Helena Edelson
Streaming Big Data: Delivering Meaning In Near-Real Time At High Velocity At Massive Scale with Apache Spark, Apache Kafka, Apache Cassandra, Akka and the Spark Cassandra Connector. Why this pairing of technologies and How easy it is to implement. Example application: https://github.com/killrweather/killrweather
A tale of scale & speed: How the US Navy is enabling software delivery from l...sonjaschweigert1
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATO’s (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
Building RAG with self-deployed Milvus vector database and Snowpark Container...Zilliz
This talk will give hands-on advice on building RAG applications with an open-source Milvus database deployed as a docker container. We will also introduce the integration of Milvus with Snowpark Container Services.
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024Neo4j
Neha Bajwa, Vice President of Product Marketing, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
Unlocking Productivity: Leveraging the Potential of Copilot in Microsoft 365, a presentation by Christoforos Vlachos, Senior Solutions Manager – Modern Workplace, Uni Systems
Threats to mobile devices are more prevalent and increasing in scope and complexity. Users of mobile devices desire to take full advantage of the features
available on those devices, but many of the features provide convenience and capability but sacrifice security. This best practices guide outlines steps the users can take to better protect personal devices and information.
Essentials of Automations: The Art of Triggers and Actions in FMESafe Software
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
UiPath Test Automation using UiPath Test Suite series, part 5DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 5. In this session, we will cover CI/CD with devops.
Topics covered:
CI/CD with in UiPath
End-to-end overview of CI/CD pipeline with Azure devops
Speaker:
Lyndsey Byblow, Test Suite Sales Engineer @ UiPath, Inc.
In the rapidly evolving landscape of technologies, XML continues to play a vital role in structuring, storing, and transporting data across diverse systems. The recent advancements in artificial intelligence (AI) present new methodologies for enhancing XML development workflows, introducing efficiency, automation, and intelligent capabilities. This presentation will outline the scope and perspective of utilizing AI in XML development. The potential benefits and the possible pitfalls will be highlighted, providing a balanced view of the subject.
We will explore the capabilities of AI in understanding XML markup languages and autonomously creating structured XML content. Additionally, we will examine the capacity of AI to enrich plain text with appropriate XML markup. Practical examples and methodological guidelines will be provided to elucidate how AI can be effectively prompted to interpret and generate accurate XML markup.
Further emphasis will be placed on the role of AI in developing XSLT, or schemas such as XSD and Schematron. We will address the techniques and strategies adopted to create prompts for generating code, explaining code, or refactoring the code, and the results achieved.
The discussion will extend to how AI can be used to transform XML content. In particular, the focus will be on the use of AI XPath extension functions in XSLT, Schematron, Schematron Quick Fixes, or for XML content refactoring.
The presentation aims to deliver a comprehensive overview of AI usage in XML development, providing attendees with the necessary knowledge to make informed decisions. Whether you’re at the early stages of adopting AI or considering integrating it in advanced XML development, this presentation will cover all levels of expertise.
By highlighting the potential advantages and challenges of integrating AI with XML development tools and languages, the presentation seeks to inspire thoughtful conversation around the future of XML development. We’ll not only delve into the technical aspects of AI-powered XML development but also discuss practical implications and possible future directions.
Communications Mining Series - Zero to Hero - Session 1DianaGray10
This session provides introduction to UiPath Communication Mining, importance and platform overview. You will acquire a good understand of the phases in Communication Mining as we go over the platform with you. Topics covered:
• Communication Mining Overview
• Why is it important?
• How can it help today’s business and the benefits
• Phases in Communication Mining
• Demo on Platform overview
• Q/A
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...SOFTTECHHUB
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AIVladimir Iglovikov, Ph.D.
Presented by Vladimir Iglovikov:
- https://www.linkedin.com/in/iglovikov/
- https://x.com/viglovikov
- https://www.instagram.com/ternaus/
This presentation delves into the journey of Albumentations.ai, a highly successful open-source library for data augmentation.
Created out of a necessity for superior performance in Kaggle competitions, Albumentations has grown to become a widely used tool among data scientists and machine learning practitioners.
This case study covers various aspects, including:
People: The contributors and community that have supported Albumentations.
Metrics: The success indicators such as downloads, daily active users, GitHub stars, and financial contributions.
Challenges: The hurdles in monetizing open-source projects and measuring user engagement.
Development Practices: Best practices for creating, maintaining, and scaling open-source libraries, including code hygiene, CI/CD, and fast iteration.
Community Building: Strategies for making adoption easy, iterating quickly, and fostering a vibrant, engaged community.
Marketing: Both online and offline marketing tactics, focusing on real, impactful interactions and collaborations.
Mental Health: Maintaining balance and not feeling pressured by user demands.
Key insights include the importance of automation, making the adoption process seamless, and leveraging offline interactions for marketing. The presentation also emphasizes the need for continuous small improvements and building a friendly, inclusive community that contributes to the project's growth.
Vladimir Iglovikov brings his extensive experience as a Kaggle Grandmaster, ex-Staff ML Engineer at Lyft, sharing valuable lessons and practical advice for anyone looking to enhance the adoption of their open-source projects.
Explore more about Albumentations and join the community at:
GitHub: https://github.com/albumentations-team/albumentations
Website: https://albumentations.ai/
LinkedIn: https://www.linkedin.com/company/100504475
Twitter: https://x.com/albumentations
Climate Impact of Software Testing at Nordic Testing DaysKari Kakkonen
My slides at Nordic Testing Days 6.6.2024
Climate impact / sustainability of software testing discussed on the talk. ICT and testing must carry their part of global responsibility to help with the climat warming. We can minimize the carbon footprint but we can also have a carbon handprint, a positive impact on the climate. Quality characteristics can be added with sustainability, and then measured continuously. Test environments can be used less, and in smaller scale and on demand. Test techniques can be used in optimizing or minimizing number of tests. Test automation can be used to speed up testing.
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?Speck&Tech
ABSTRACT: A prima vista, un mattoncino Lego e la backdoor XZ potrebbero avere in comune il fatto di essere entrambi blocchi di costruzione, o dipendenze di progetti creativi e software. La realtà è che un mattoncino Lego e il caso della backdoor XZ hanno molto di più di tutto ciò in comune.
Partecipate alla presentazione per immergervi in una storia di interoperabilità, standard e formati aperti, per poi discutere del ruolo importante che i contributori hanno in una comunità open source sostenibile.
BIO: Sostenitrice del software libero e dei formati standard e aperti. È stata un membro attivo dei progetti Fedora e openSUSE e ha co-fondato l'Associazione LibreItalia dove è stata coinvolta in diversi eventi, migrazioni e formazione relativi a LibreOffice. In precedenza ha lavorato a migrazioni e corsi di formazione su LibreOffice per diverse amministrazioni pubbliche e privati. Da gennaio 2020 lavora in SUSE come Software Release Engineer per Uyuni e SUSE Manager e quando non segue la sua passione per i computer e per Geeko coltiva la sua curiosità per l'astronomia (da cui deriva il suo nickname deneb_alpha).
Dr. Sean Tan, Head of Data Science, Changi Airport Group
Discover how Changi Airport Group (CAG) leverages graph technologies and generative AI to revolutionize their search capabilities. This session delves into the unique search needs of CAG’s diverse passengers and customers, showcasing how graph data structures enhance the accuracy and relevance of AI-generated search results, mitigating the risk of “hallucinations” and improving the overall customer journey.
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...Zilliz
Join us to introduce Milvus Lite, a vector database that can run on notebooks and laptops, share the same API with Milvus, and integrate with every popular GenAI framework. This webinar is perfect for developers seeking easy-to-use, well-integrated vector databases for their GenAI apps.
2. @helenaedelson
Helena Edelson
● Principal Engineer @ Lightbend
● Member of the Akka team
● Former: Apple, Crowdstrike, VMware,
SpringSource, Tuplejump
● github.com/helena
● twitter.com/helenaedelson
● speakerdeck.com/helenaedelson
Data, Analytics & ML Platform Infrastructure and Cloud Engineer
Former biologist
4. @helenaedelson
When systems reach a critical level of dynamism we have to change our way of
modeling and designing them
• Stateful in a stateless world
• Automation of everything - Ops, *aaS platforms
• Persistence strategies across DCs, zones and regions
• Data and query optimization
• System availability and stability in all states of deployment and rolling restarts
• Leveraging AI / ML
Rethinking Strategies
5. @helenaedelson
Computational model embracing non-determinism
- Actor Model of Computation, Carl Hewitt
• Mathematical theory treating "Actors" as primitives of concurrent computation
• Framework for a theoretical understanding of concurrency
• Asynchronous communication
• Stateful isolated processes
• Non-observable state within
• Decoupling in space and time
The Network and Autonomous Processes
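The actor-model properties above (asynchronous messaging, isolated non-observable state, decoupling in space and time) can be sketched outside of Akka. The following plain-Python snippet is an illustrative toy, not Akka's API; the `Counter` class and message names are invented for this sketch:

```python
import queue
import threading

class Counter:
    """A minimal actor: isolated state, one message processed at a time,
    reachable only through its mailbox (asynchronous message passing)."""
    def __init__(self):
        self._mailbox = queue.Queue()
        self._count = 0  # non-observable state: never read directly from outside
        threading.Thread(target=self._run, daemon=True).start()

    def tell(self, msg):
        """Fire-and-forget send; sender and receiver are decoupled in time."""
        self._mailbox.put(msg)

    def _run(self):
        while True:
            msg = self._mailbox.get()
            if msg == "inc":
                self._count += 1
            elif isinstance(msg, tuple) and msg[0] == "get":
                msg[1].put(self._count)  # reply via a channel, not shared state

actor = Counter()
for _ in range(3):
    actor.tell("inc")
reply = queue.Queue()
actor.tell(("get", reply))
print(reply.get(timeout=1))  # 3
```

Because the mailbox is a FIFO queue drained by a single thread, state changes are serialized without locks around the state itself, which is the essence of the actor model's concurrency guarantee.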
6. @helenaedelson
Principles that Akka stands on can be traced back to the ’70s and ’80s
• Carl Hewitt invented the Actor Model, early 70s
• Jim Gray and Pat Helland on the Tandem System, 80s
• Joe Armstrong, Robert Virding and Mike Williams on Erlang, 1986
Look Back Before Looking Forward
7. @helenaedelson
• From the ’40s and still being heavily developed today across many fields of
research and application in industry.
• 1940s: Cellular automata (CA), originally discovered by Stanislaw Ulam and John
von Neumann, Los Alamos National Laboratory
• 1970s: Conway's Game of Life
• Asynchronous Cellular Automaton
Complex Adaptive Systems, Systems Theory,
early AI
8. @helenaedelson
Can solve problems difficult or impossible for an individual agent or a monolithic
system to solve
• The foundations for artificial neural networks and NLP
• Composed of multiple autonomous agents, interacting to achieve common goals
• Decentralized, with no central point of decision making
• More fault tolerant, no single point of failure
• Reach higher degrees of dependability
Multi-Agent Systems (MAS)
9. @helenaedelson
Complex Adaptive Systems (CAS)
(Word-cloud slide listing related concepts: self-organization theory, emergence, synchronization, amplification, distributed networks, cellular automata, feedback loops, systems evolution, swarming; characterized as local, asynchronous, unpredictable, non-linear, adaptive, versatile.)
13. @helenaedelson
• Stateful - in-memory yet durable and resilient state
• Long-lived - lifecycle is not bound to a specific session, context available until
explicitly destroyed
• Virtual - location transparent and not bound to a physical location
• Addressable - referenced through a stable address
Akka Actors Also Happen To Be
20. @helenaedelson
• Complex Event Processing (CEP) - developed 1989-1995 to analyze event-driven simulations of
distributed systems, abstracting causal event histories, patterns, filtering and aggregation in large,
distributed, time-sensitive systems
• Stream Processing - mid-1990s research in real-time event data analysis; internet companies processing large numbers of events
• Event Sourcing (ES) - from domain-driven design and enterprise development, processing very
complex data models with often smaller datasets than internet companies
• Command Query Responsibility Segregation (CQRS) - isn't about events, but often combined with ES
• Also - CDC
Structuring data as a stream of events
21. @helenaedelson
• How data from system behavior is structured
• Capture all changes as a sequence of events in time
• Store events as an immutable event log / append-only storage
• Preserves the happened-before causality of events
• Replay event log to reconstruct state within a given time window or all
Event Sourcing
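The ideas above (append-only log, immutability, replay to reconstruct state) can be shown in a few lines. This is an illustrative Python sketch; the bank-account domain and event names are invented for the example:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Deposited:
    amount: int

@dataclass(frozen=True)
class Withdrawn:
    amount: int

event_log = []  # append-only: events are never mutated or deleted

def record(event):
    event_log.append(event)  # preserves the happened-before order

def replay(events):
    """Reconstruct current state by folding over the event history."""
    balance = 0
    for e in events:
        if isinstance(e, Deposited):
            balance += e.amount
        elif isinstance(e, Withdrawn):
            balance -= e.amount
    return balance

record(Deposited(100))
record(Withdrawn(30))
record(Deposited(5))
print(replay(event_log))      # 75: current state
print(replay(event_log[:2]))  # 70: state within a given time window
```

Note that replaying a prefix of the log yields the state as of that point in time, which is exactly what makes event sourcing auditable.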
22. @helenaedelson
Requirements - forensics
• Auditable - what is the current state and how it arrived there
• Causality - observe and analyze a system's causal structure
Applications For ES In Distributed
Asynchronous Systems
For example
• Cybersecurity and Vulnerability Detection
• Banking - what is the account balance and how did it arrive at that
• Click stream
• Accounting & Ledgers
• Shopping Cart
• Anything with a sequence of events that lead to X which must be preserved
23. @helenaedelson
A pattern decoupling the write path (commands) from the read path (queries)
• Different access patterns, and differing ratios of reads to writes, are typical
• Different schemas / data structures
• Typically, different teams across the org own the write side and use/own the read side
• No reason to share structure, and doing so is bad practice (no monolith, loose coupling, etc.)
• Command - Writers / Publishers publish without having awareness who needs to
receive it or how to reach them (location, protocol...)
• Query - Readers / Subscribers should be able to subscribe and asynchronously receive
from topics of interest
Command Query Responsibility
Segregation (CQRS)
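A minimal sketch of the command/query split, in illustrative Python (the shopping-cart-style events and function names are invented): the write path turns commands into events without knowing who reads them, and the read path builds its own query-optimized view.

```python
# Write path: commands are validated and turned into events.
events = []

def handle_command(cmd):
    """Command handler: the writer publishes events without awareness of
    who will receive them or how to reach them."""
    kind, item = cmd
    if kind == "add":
        events.append(("item_added", item))
    elif kind == "remove":
        events.append(("item_removed", item))

# Read path: a projection builds a separate, query-optimized view.
def project(events):
    view = {}
    for kind, item in events:
        if kind == "item_added":
            view[item] = view.get(item, 0) + 1
        elif kind == "item_removed":
            view[item] = max(view.get(item, 0) - 1, 0)
    return view

handle_command(("add", "book"))
handle_command(("add", "book"))
handle_command(("remove", "book"))
print(project(events))  # {'book': 1}
```

Because the read model is derived from the events, it can use a completely different schema, be rebuilt at any time, and be owned by a different team than the write model.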
24. @helenaedelson
My old diagram from 3 years ago: Kafka Summit:
Real Time Bidding (RTB)
The write path and model are naturally separate and differ from the read:
25. @helenaedelson
It Doesn't Matter What We Call It or Whether It's Microservices Or A Streaming Data Pipeline
• Ingest large amounts of data from multiple sources, sometimes bursty, without overloading the system
• Write the raw data to a store so that
• when algorithms change, I can run the data stream over for new meaning
• when nodes or applications fail, I can replay data from a checkpoint to recover
• Route the event streams to my ML/Analytics streams
• Process and aggregate inbound data, and store aggregates for querying historical data against the stream
• Not lose data
• Be secure, probably encrypt/decrypt everything
• Not pay massive cloud and data storage fees
• Be sure my team can handle the infrastructure TCO
28. @helenaedelson
Akka Persistence Stateful Actors
• Enables stateful actors to persist their state for recovery and replay from failure
and error
• Events persisted to storage, nothing is mutated (no read-modify-write)
• Allows higher transaction rates and efficient replication
• Only events received by the actor are persisted
• Snapshotting for checkpoint replay
• At least once message delivery semantics
Event Stream As Replication Fabric
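Snapshotting for checkpoint replay, mentioned above, means recovery starts from the latest saved state and replays only the events persisted after it. A toy Python sketch (the counter state and helper names are invented for illustration):

```python
# Snapshot-based recovery: replay only events after the last checkpoint.
event_log = []
snapshot = None  # (state, position in the log at snapshot time)

def persist(event):
    event_log.append(event)

def save_snapshot(state):
    global snapshot
    snapshot = (state, len(event_log))

def recover():
    """Start from the latest snapshot, then replay the remaining events."""
    state, pos = snapshot if snapshot else (0, 0)
    for amount in event_log[pos:]:
        state += amount
    return state

for amount in (10, 20, 30):
    persist(amount)
save_snapshot(60)  # checkpoint after the first three events
persist(5)         # only this event must be replayed on recovery
print(recover())   # 65
```

The trade-off is classic: more frequent snapshots mean faster recovery but more write overhead, while infrequent snapshots mean longer replays after a failure.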
29. @helenaedelson
Connect different event logs with Event-sourced processors for event processing
pipelines or graphs
• Cassandra, Redis, DynamoDB, Couchbase, MongoDB, Hazelcast, JDBC and
more
• Built-in: in-memory heap based journal, local file-system based snapshot-store
and LevelDB based journal
Storage Plugins
30. @helenaedelson
• Your algorithms have changed, you need to replay historic data against the new
logic
• Rolling upgrade, restart, cluster migration
• Error, e.g. after a JVM crash
• Failure, e.g. cluster nodes or a DC went down, a network outage or partition
• Cloud compute layer planned maintenance restarts
• Application throws exception, if a persistent Actor is configured to restart by a
supervisor
Replay Reasons
31. @helenaedelson
Akka out of the box gives us tooling for each of these steps:
• Failure awareness and lifecycle
• Save state of failed node before failure
• Load state that was in flight at time of failure (define time slice)
• Replay from a checkpoint in a snapshot or run the full history
• Resume operations
Failure And Recovery
33. @helenaedelson
● Decentralized peer-to-peer
● Cluster Formation and membership service
● Communication and Consensus
● Leader and Roles
● Cluster Lifecycle and Events
● Failure Detector
● Self-Healing
● CoordinatedShutdown
Akka Cluster: Quick Premise
34. @helenaedelson
Cluster User API
• What roles am I in, what is my address
• Join, Leave, Down
• Programmatic membership control
• Register listeners to cluster events
• Startup when configurable cluster size
reached
• Highly tunable behavior
40. @helenaedelson
• ClusterDomainEvent: base type
• MemberUp: member status changed to Up
• UnreachableMember: member considered unreachable by failure detector
• MemberRemoved: member completely removed from the cluster
• MemberEvent: member status change Up, Removed
• Leader events
• Reachability events
Cluster Events
41. @helenaedelson
• CurrentClusterState: the current snapshot state of the cluster, sent to new
subscribers unless InitialStateAsEvents is specified
• InitialStateAsEvents: receive the current cluster state replayed as events
rather than as a snapshot
Cluster State
44. @helenaedelson
• Masterless
• No leader election
• Role of the leader: the only one who can change member status
• joining to up
• exiting to removed
• Leader decisions are local to the DC
Cluster Leader
47. @helenaedelson
Cluster Membership State
A CRDT which can be deterministically merged
(Lifecycle diagram, flattened here into transitions:)
• Join (user action) → Joining
• Joining → Up (leader action)
• Up → Leaving (user action: Leave)
• Leaving → Exiting (leader action)
• Exiting → Removed (leader action)
• Down (user action) → Down, then Down → Removed (leader action)
54. @helenaedelson
Cluster Singleton
Single point of cluster-wide decisions or coordination
(Diagram: a ClusterSingletonManager runs on every node; the SingletonActor itself runs only on the oldest node.)
58. @helenaedelson
Cluster Singleton: On Failure
(Diagram: when the oldest node is downed or partitioned, the singleton fails over to the next-oldest node's ClusterSingletonManager; the ClusterSingletonProxy keeps routing messages to the current SingletonActor.)
59. @helenaedelson
Cluster Singleton
Guarantees one instance of a particular actor type per cluster
(Slide graphic: a trade-off slider between Strong Consistency and Always Available.)
doc.akka.io/docs/akka/current/scala/cluster-singleton
61. @helenaedelson
An approach to eventual distributed consistency
• Replicate data across the network
• Concurrent updates from different nodes without coordination
• Mathematical properties guarantee eventual consistency
• Updates execute immediately, unaffected by network faults
• Consistency without consensus
• Highly scalable and fault tolerant
Conflict-Free Replicated Data Types (CRDT)
A comprehensive study of Convergent and Commutative Replicated Data Types
62. @helenaedelson
A replicated counter, which converges because the increment / decrement operations
commute
• Service Discovery
• Shopping Cart
• Priority on low latency and full availability
• Computation in delay-tolerant networks
• Data aggregation
• Partition-tolerant cloud computing
• Collaborative text editing
Application Of CRDTs
A few implementations:
• Riak Data Types
• SoundCloud Roshi
• Akka Distributed Data
63. @helenaedelson
1976: The maintenance of duplicate databases, Paul Johnson, Robert Thomas
1984: Efficient solutions to the replicated log and dictionary problems, Gene Wuu, Arthur Bernstein
1988: Scale and performance in a distributed file system, J. Howard, M. Kazar, S. Menees, D. Nichols, M. Satyanarayanan, R. Sidebotham, M. West
1988: Commutativity-based concurrency control for abstract data types, W. Weihl
1989: Concurrency control in groupware systems, C. Ellis, S. Gibbs
1994: Resolving file conflicts in the Ficus file system, P. Reiher, J. Heidemann, D. Ratner, G. Skinner, and G. Popek
1994: Detecting causal relationships in distributed computations: In search of the holy grail, R. Schwarz, F. Mattern
1997: Specification of convergent abstract data types for autonomous mobile computing, C. Baquero, F. Moura
1999: Using structural characteristics for autonomous operation, Carlos Baquero, Francisco Moura
2009: A commutative replicated data type for cooperative editing, N. Preguiça, J. Marquès, M. Shapiro, M. Leţia
2011: A comprehensive study of Convergent and Commutative Replicated Data Types, M. Shapiro, N. Preguiça, C. Baquero, M. Zawirski
Not New
64. @helenaedelson
• Low latency and high availability
• Data availability despite network partitions
• Nodes concurrently update as multi-master
• Async state replication across the cluster
• Granular control of consistency level for reads and writes
• Key-value store like API
Akka Distributed Data
doc.akka.io/docs/akka/current/scala/distributed-data
Replicated in-memory data store using CvRDT to share data between cluster nodes
65. @helenaedelson
Concurrent updates from different nodes resolve via the monotonic merge function.
• Counters: GCounter (grow-only), PNCounter (two GCounters: increment and decrement)
• Registers: Flag (toggle boolean), LWWRegister (Last Write Wins register)
• Sets: GSet (grow-only, merge by union), ORSet (observed-remove, version vector)
• Maps: ORMap, ORMultiMap, LWWMap, PNCounterMap
• Graphs: DAG
Composable For More Advanced Types
A comprehensive study of Convergent and Commutative Replicated Data Types
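The GCounter above is the simplest CRDT to see in code. This is an illustrative Python sketch of the standard grow-only counter design (not Akka's implementation): each node increments only its own slot, and merge takes the per-node maximum, so merges commute and are idempotent.

```python
class GCounter:
    """Grow-only counter CRDT: each node increments its own slot;
    merge takes the per-node maximum, so concurrent updates commute."""
    def __init__(self, counts=None):
        self.counts = dict(counts or {})  # node id -> that node's local count

    def increment(self, node):
        self.counts[node] = self.counts.get(node, 0) + 1

    def merge(self, other):
        nodes = set(self.counts) | set(other.counts)
        merged = {n: max(self.counts.get(n, 0), other.counts.get(n, 0))
                  for n in nodes}
        return GCounter(merged)

    def value(self):
        return sum(self.counts.values())

# Two replicas update concurrently, without coordination...
a, b = GCounter(), GCounter()
a.increment("node-a"); a.increment("node-a")
b.increment("node-b")
# ...and converge to the same value regardless of merge order.
print(a.merge(b).value())  # 3
print(b.merge(a).value())  # 3
```

A PNCounter is then just a pair of these: one GCounter for increments, one for decrements, with the value being their difference.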
66. @helenaedelson
Delta State CRDTs (δ-CRDTs)
• A way to reduce the need for sending the full state for updates
• Sending only what changed
• Merging done on the receiving side
• Eventually consistent by default, and supports opt-in causal
consistency
Delta State Replicated Data Types
Supported types: GCounter, GSet, PNCounter, PNCounterMap, LWWMap, ORMap, ORMultiMap, ORSet, LWWRegister
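The delta idea can be sketched on the grow-only counter: instead of replicating the whole state, an update returns only the changed entry, and the receiver merges that delta. An illustrative Python toy (class and method names invented, not Akka's δ-CRDT API):

```python
class DeltaGCounter:
    """Delta-state grow-only counter: an update returns only the changed
    entry (the delta) instead of shipping the full state."""
    def __init__(self):
        self.counts = {}

    def increment(self, node):
        self.counts[node] = self.counts.get(node, 0) + 1
        return {node: self.counts[node]}  # the delta to replicate

    def merge_delta(self, delta):
        """Merging is still a per-node max, so deltas can be re-delivered
        or arrive out of order without corrupting the state."""
        for node, count in delta.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

    def value(self):
        return sum(self.counts.values())

replica_a, replica_b = DeltaGCounter(), DeltaGCounter()
d1 = replica_a.increment("node-a")
d2 = replica_a.increment("node-a")
replica_b.merge_delta(d2)  # deltas may arrive out of order
replica_b.merge_delta(d1)  # ...and merging remains safe
print(replica_b.value())   # 2
```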
75. @helenaedelson
• By default the data is only kept in memory and replicated to other nodes
• If all nodes are stopped the data is lost
• You can configure it to store on the local disk on each node (LMDB)
• Or implement your own to another store via the trait
• It will be loaded the next time the replicator is started
Configurable Durable Storage
76. @helenaedelson
Distributed Data
Eventually consistent - always accepts writes
(Slide graphic: a trade-off slider between Strong Consistency and Always Available.)
doc.akka.io/docs/akka/current/distributed-data
77. @helenaedelson
• Needing high consistency over availability and low latency
• Big Data - not currently intended for billions of entries
• When a new node is added to the cluster all entries are propagated to it,
hence top level entries should not exceed 100000
• Data is held in memory
• If not using a delta-CRDT, when a data entry is changed the full state of that
entry may be replicated to other nodes.
Not Designed For
78. @helenaedelson
Cluster Sharding
Scale, Resilience & Consistency
• Automatically distribute entities of the same type over several nodes
• Balance resources (memory, disk space, network traffic) across
multiple nodes for scalability
• Location transparency: Interact by logical ID
• Increased fault tolerance - relocation on failure
Life beyond Distributed Transactions
(Diagram: Node 1 hosting ShardRegion SR1 with shards S1, S2, S3.)
79. @helenaedelson
Each Entity Is A Consistency Boundary
(Diagram: a sender on Node 1 sends Message(gid) to its local ShardRegion SR1, which routes it to one of shards S1-S3. Shards are groups of entities; your code is supervised by shards.)
80. @helenaedelson
• Creates entity actors on demand
• Supervises group of entities - defined by the shard ID extraction
N-Shards Per Cluster Node
Entity B-1
SR2
SC
SR1
Shard A
Shard B
Entity A-1
Entity A-2
Entity C-1
Shard C
SR3
ShardCoordinator
ShardRegion 1
ShardRegion 2
ShardRegion 3
81. @helenaedelson
• Creates and supervises its shards
• Knows how to route messages by routing key
ShardRegion Per Cluster Node
Envelope(“c-1”)
Entity B-1
Shard A
Shard B
Entity A-1
Entity A-2
Entity C-1
Shard C
ShardCoordinator
ShardRegion 1
ShardRegion 2
ShardRegion 3
Node 1
Node 2 Node 3
82. @helenaedelson
• Stores Shard to Region mappings with Akka Persistence
• Monitors all cluster node status
• If the ShardCoordinator goes down, it starts up on another node and replays the state
Shard Coordination
Entity B-1
Shard A
Shard B
Entity A-1
Entity A-2
Entity C-1
Shard C
ShardCoordinator
(Cluster Singleton)
ShardRegion 1
ShardRegion 2
ShardRegion 3
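The coordinator's job, stripped to its essence, is to own the shard-to-region map and reassign orphaned shards when a region dies. A simplified Python sketch of a least-shards allocation strategy (names and strategy are illustrative, not Akka's internals):

```python
class ShardCoordinator:
    """Sketch of shard-to-region allocation with least-shards rebalancing."""

    def __init__(self, regions):
        self.allocations = {region: set() for region in regions}

    def allocate(self, shard_id):
        # Least-shard allocation: place on the region owning the fewest shards.
        region = min(self.allocations, key=lambda r: len(self.allocations[r]))
        self.allocations[region].add(shard_id)
        return region

    def region_down(self, region):
        # On failover, orphaned shards are re-allocated to surviving regions.
        orphaned = self.allocations.pop(region)
        return {shard: self.allocate(shard) for shard in orphaned}


coord = ShardCoordinator(["node-1", "node-2", "node-3"])
for shard in ["A", "B", "C"]:
    coord.allocate(shard)
moved = coord.region_down("node-1")
assert all(r in ("node-2", "node-3") for r in moved.values())
```

In Akka the mapping survives coordinator failover because it is persisted (via Akka Persistence) and replayed on the node where the singleton restarts.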
83. @helenaedelson
Start Cluster Sharding On Node
Sending data
Your Entity ID
Extraction function
Your Shard ID
Extraction function
Your custom shard
allocation strategy
Your Envelope type
Or use built-in
HashExtractor
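The two extraction functions above are the heart of sharding: one pulls the entity ID out of the envelope, the other deterministically maps that ID into a fixed shard space, so the same entity always lands in the same shard. A conceptual Python sketch of hash-based extraction (the envelope shape and constants are assumptions; Akka's built-in extractor is analogous):

```python
import hashlib

# A common rule of thumb: ~10x the maximum expected number of cluster nodes.
NUM_SHARDS = 10


def extract_entity_id(envelope):
    # Envelope carries the logical entity ID plus a payload, e.g. ("c-1", msg).
    entity_id, _payload = envelope
    return entity_id


def extract_shard_id(envelope):
    # A stable hash of the entity ID, modulo the shard count, so an entity
    # always maps to the same shard regardless of which node computes it.
    digest = hashlib.md5(extract_entity_id(envelope).encode()).hexdigest()
    return str(int(digest, 16) % NUM_SHARDS)


# The same entity ID always routes to the same shard.
assert extract_shard_id(("c-1", "msg1")) == extract_shard_id(("c-1", "msg2"))
```

Note the use of a stable hash rather than the language's default (salted) one: every node must compute the same shard ID for the same entity, or routing breaks.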
84. @helenaedelson
Cluster Sharding: Failover
Entity B-1
Shard A
Shard B
Entity A-1
Entity A-2
ShardCoordinator
Downed
Location Transparency
Failover
Entity C-1
Shard C
ShardRegion 1
ShardRegion 2
Envelope("c-1")
85. @helenaedelson
Strong Consistency ←→ Always Available
Each entity is a boundary of consistency
Guarantees at most one instance of each entity at a time in the cluster
doc.akka.io/docs/akka/current/scala/cluster-sharding
Cluster Sharding
86. @helenaedelson
"Serverless is a new generation of platform-as-a-service offerings where
the infrastructure provider takes responsibility for receiving client
requests and responding to them, capacity planning, task scheduling,
and operational monitoring. Developers need to worry only about the
logic for processing client requests."
- Adzic et al
Serverless computing: economic and architectural impact
Serverless
87. @helenaedelson
• Automated infrastructure running in a container pool
• A classic data-shipping architecture - we move data to the code, not the other
way round
• Pay per execution time
• Autoscales with load
• Event driven
• Stateless
• Ephemeral (5-15 minutes)
FaaS
89. @helenaedelson
• Load and event spikes needing massive parallelism
• Scaling from 0 to 10,000s of requests and back down to zero
• Simplifies delivery of scale and availability
• As integration layer between various (ephemeral and durable) data sources
• Processing stateless, compute-intensive workloads
• As data backbone moving data from A to B and transforming it
• Can work well for event-driven use cases
What Is FaaS Good At Currently?
91. @helenaedelson
• Functions handle only one event source
• Functions are stateless, ephemeral, and short-lived
• Computational context is easily lost
• Limited options for managing and coordinating distributed state
• Limited options for the right consistency guarantees
• Limited options for durable state, that is scalable and available
• Expensive to load and store state from storage repeatedly
Limitations With Serverless
Distributed state is not well supported for complex distributed data workflows
92. @helenaedelson
• No direct communication, which means applications must pub-sub all data over a storage medium
• Latency is too high for general-purpose distributed computing problems
For a discussion on this, and other limitations with FaaS read the paper,
“Serverless Computing: One Step Forward, Two Steps Back”
by Joe Hellerstein, et al.
FaaS Does Not Have Addressability
97. @helenaedelson
Kubernetes Pod
Kubernetes Pod
Kubernetes Pod
Knative stateful serving
Knative Events
User Function
(JavaScript, Go, Java,…)
Knative Serving of Stateful Functions
User Function
(JavaScript, Go, Java,…)
User Function
(JavaScript, Go, Java,…)
Distributed Datastore
(Cassandra, DynamoDB, Spanner,…)
gRPC
98. @helenaedelson
Kubernetes Pod
Kubernetes Pod
Kubernetes Pod
Kubernetes Pod
Kubernetes Pod
Kubernetes Pod
Knative stateful serving
User Function
(JavaScript, Go, Java,…)
Powered by Akka Cluster Sidecars
User Function
(JavaScript, Go, Java,…)
User Function
(JavaScript, Go, Java,…)
Akka Sidecar
Akka Sidecar
Akka Sidecar
Akka Cluster
Distributed Datastore
(Cassandra, DynamoDB, Spanner,…)