This document discusses streaming architectures and libraries for processing streaming data. It provides an overview of Kafka Streams and Akka Streams, highlighting that Kafka Streams is ideal for ETL, aggregations, joins and "effectively once" requirements, while Akka Streams is suitable for low latency, mid-volume workloads based on graphs of processing nodes. It also includes a Kafka Streams example in Scala for scoring streaming data records using a streaming machine learning model.
Exploring Reactive Integrations With Akka Streams, Alpakka And Apache KafkaLightbend
Since its stable release in 2016, Akka Streams is quickly becoming the de facto standard integration layer between various Streaming systems and products. Enterprises like PayPal, Intel, Samsung and Norwegian Cruise Lines see this is a game changer in terms of designing Reactive streaming applications by connecting pipelines of back-pressured asynchronous processing stages.
This comes from the Reactive Streams initiative in part, which has been long led by Lightbend and others, allowing multiple streaming libraries to inter-operate between each other in a performant and resilient fashion, providing back-pressure all the way. But perhaps even more so thanks to the various integration drivers that have sprung up in the community and the Akka team—including drivers for Apache Kafka, Apache Cassandra, Streaming HTTP, Websockets and much more.
In this webinar for JVM Architects, Konrad Malawski explores the what and why of Reactive integrations, with examples featuring technologies like Akka Streams, Apache Kafka, and Alpakka, a new community project for building Streaming connectors that seeks to “back-pressurize” traditional Apache Camel endpoints.
* An overview of Reactive Streams and what it will look like in JDK 9, and the Akka Streams API implementation for Java and Scala.
* Introduction to Alpakka, a modern, Reactive version of Apache Camel, and its growing community of Streams connectors (e.g. Akka Streams Kafka, MQTT, AMQP, Streaming HTTP/TCP/FileIO and more).
* How Akka Streams and Akka HTTP work with Websockets, HTTP and TCP, with examples in both in Java and Scala.
Moving from Big Data to Fast Data? Here's How To Pick The Right Streaming EngineLightbend
For many businesses, the batch-oriented architecture of Big Data–where data is captured in large, scalable stores, then processed later–is simply too slow: a new breed of “Fast Data” architectures has evolved to be stream-oriented, where data is processed as it arrives, providing businesses with a competitive advantage.
There are many stream processing tools, so which ones should you choose? It helps to consider several factors in the context of your applications:
* Low latency: How low is necessary?
* High volume: How high is required?
* Integration with other tools: Which ones and how?
* Data processing: What kinds? In bulk? As individual events?
In this talk by Dean Wampler, PhD., VP of Fast Data Engineering at Lightbend, we’ll look at the criteria you need to consider when selecting technologies, plus specific examples of how four streaming tools–Akka Streams, Kafka Streams, Apache Flink and Apache Spark serve particular needs and use cases when working with continuous streams of data.
Build Real-Time Streaming ETL Pipelines With Akka Streams, Alpakka And Apache...Lightbend
Things were easier when all our data used to be offline, analyzed overnight in batches. Now our data is online, in motion, and generated constantly. For architects, developers and their businesses, this means that there is an urgent need for tools and applications that can deliver real-time (or near real-time) streaming ETL capabilities.
In this session by Konrad Malawski, author, speaker and Senior Akka Engineer at Lightbend, you will learn how to build these streaming ETL pipelines with Akka Streams, Alpakka and Apache Kafka, and why they matter to enterprises that are increasingly turning to streaming Fast Data applications.
Akka, Spark or Kafka? Selecting The Right Streaming Engine For the JobLightbend
For many businesses, the batch-oriented architecture of Big Data–where data is captured in large, scalable stores, then processed later–is simply too slow: a new breed of “Fast Data” architectures has evolved to be stream-oriented, where data is processed as it arrives, providing businesses with a competitive advantage.
There are many stream processing tools, so which ones should you choose? It helps to consider several factors in the context of your applications:
* Low latency: How low (or high) is needed?
* High volume: How much volume must be handled?
* Integration with other tools: Which ones and how?
* Data processing: What kinds? In bulk? As individual events?
In this talk by Dean Wampler, PhD., VP of Fast Data Engineering at Lightbend, we’ll look at the criteria you need to consider when selecting technologies, plus specific examples of how four streaming tools–Akka Streams, Kafka Streams, Apache Flink and Apache Spark serve particular needs and use cases when working with continuous streams of data.
Revitalizing Enterprise Integration with Reactive StreamsLightbend
With Viktor Klang, Deputy CTO Lightbend, Inc.
As software grows more and more interconnected, and with several generations of software having to interoperate, a new take on the integration of systems is needed—ad hoc, unversioned, and unreplicated scripts just won’t suffice, and the traditional Enterprise Service Bus (ESB) concept has experienced stability, reliability, performance, and scalability problems.
In this webinar, Viktor explores a new take on Enterprise Integration Patterns:
First, he will explore the Reactive Streams standard, an orchestration layer where transformations are standalone, composable, reusable, and—most importantly—using asynchronous flow-control—back pressure—to maintain predictable, stable, behavior over time.
Furthermore, he will go through how one-off workloads relate to continuous, and batch, workloads, and how they can be addressed by that very same orchestration layer.
Finally, he will review how this type of design achieves resilience, scalability, and ultimately—responsiveness.
Putting Kafka In Jail – Best Practices To Run Kafka On Kubernetes & DC/OSLightbend
Apache Kafka–part of Lightbend Fast Data Platform–is a distributed streaming platform that is best suited to run close to the metal on dedicated machines in statically defined clusters. For most enterprises, however, these fixed clusters are quickly becoming extinct in favor of mixed-use clusters that take advantage of all infrastructure resources available.
In this webinar by Sean Glover, Fast Data Engineer at Lightbend, we will review leading Kafka implementations on DC/OS and Kubernetes to see how they reliably run Kafka in container orchestrated clusters and reduce the overhead for a number of common operational tasks with standard cluster resource manager features. You will learn specifically about concerns like:
* The need for greater operational knowhow to do common tasks with Kafka in static clusters, such as applying broker configuration updates, upgrading to a new version, and adding or decommissioning brokers.
* The best way to provide resources to stateful technologies while in a mixed-use cluster, noting the importance of disk space as one of Kafka’s most important resource requirements.
* How to address the particular needs of stateful services in a model that natively favors stateless, transient services.
Akka A to Z: A Guide To The Industry’s Best Toolkit for Fast Data and Microse...Lightbend
Microservices. Streaming data. Event Sourcing and CQRS. Concurrency, routing, self-healing, persistence, clustering… You get the picture. The Akka toolkit makes all of this simple for Java and Scala developers at Amazon, LinkedIn, Starbucks, Verizon and others. So how does Akka provide all these features out of the box?
Join Hugh McKee, Akka expert and Developer Advocate at Lightbend, on an illustrated journey that goes deep into how Akka works–from individual Akka actors to fully distributed clusters across multiple datacenters.
Operationalizing Machine Learning: Serving ML ModelsLightbend
Join O’Reilly author and Lightbend Principal Architect, Boris Lublinsky, as he discusses one of the hottest topics in software engineering today: serving machine learning models.
Typically with machine learning, different groups are responsible for model training and model serving. Data scientists often introduce their own machine-learning tools, causing software engineers to create complementary model-serving frameworks to keep pace. It’s not a very efficient system. In this webinar, Boris demonstrates a more standardized approach to model serving and model scoring:
* How to develop an architecture for serving models in real time as part of input stream processing
* How this approach enables data science teams to update models without restarting existing applications
* Different ways to build this model-scoring solution, using several popular stream processing engines and frameworks
Exploring Reactive Integrations With Akka Streams, Alpakka And Apache KafkaLightbend
Since its stable release in 2016, Akka Streams is quickly becoming the de facto standard integration layer between various Streaming systems and products. Enterprises like PayPal, Intel, Samsung and Norwegian Cruise Lines see this is a game changer in terms of designing Reactive streaming applications by connecting pipelines of back-pressured asynchronous processing stages.
This comes from the Reactive Streams initiative in part, which has been long led by Lightbend and others, allowing multiple streaming libraries to inter-operate between each other in a performant and resilient fashion, providing back-pressure all the way. But perhaps even more so thanks to the various integration drivers that have sprung up in the community and the Akka team—including drivers for Apache Kafka, Apache Cassandra, Streaming HTTP, Websockets and much more.
In this webinar for JVM Architects, Konrad Malawski explores the what and why of Reactive integrations, with examples featuring technologies like Akka Streams, Apache Kafka, and Alpakka, a new community project for building Streaming connectors that seeks to “back-pressurize” traditional Apache Camel endpoints.
* An overview of Reactive Streams and what it will look like in JDK 9, and the Akka Streams API implementation for Java and Scala.
* Introduction to Alpakka, a modern, Reactive version of Apache Camel, and its growing community of Streams connectors (e.g. Akka Streams Kafka, MQTT, AMQP, Streaming HTTP/TCP/FileIO and more).
* How Akka Streams and Akka HTTP work with Websockets, HTTP and TCP, with examples in both in Java and Scala.
Moving from Big Data to Fast Data? Here's How To Pick The Right Streaming EngineLightbend
For many businesses, the batch-oriented architecture of Big Data–where data is captured in large, scalable stores, then processed later–is simply too slow: a new breed of “Fast Data” architectures has evolved to be stream-oriented, where data is processed as it arrives, providing businesses with a competitive advantage.
There are many stream processing tools, so which ones should you choose? It helps to consider several factors in the context of your applications:
* Low latency: How low is necessary?
* High volume: How high is required?
* Integration with other tools: Which ones and how?
* Data processing: What kinds? In bulk? As individual events?
In this talk by Dean Wampler, PhD., VP of Fast Data Engineering at Lightbend, we’ll look at the criteria you need to consider when selecting technologies, plus specific examples of how four streaming tools–Akka Streams, Kafka Streams, Apache Flink and Apache Spark serve particular needs and use cases when working with continuous streams of data.
Build Real-Time Streaming ETL Pipelines With Akka Streams, Alpakka And Apache...Lightbend
Things were easier when all our data used to be offline, analyzed overnight in batches. Now our data is online, in motion, and generated constantly. For architects, developers and their businesses, this means that there is an urgent need for tools and applications that can deliver real-time (or near real-time) streaming ETL capabilities.
In this session by Konrad Malawski, author, speaker and Senior Akka Engineer at Lightbend, you will learn how to build these streaming ETL pipelines with Akka Streams, Alpakka and Apache Kafka, and why they matter to enterprises that are increasingly turning to streaming Fast Data applications.
Akka, Spark or Kafka? Selecting The Right Streaming Engine For the JobLightbend
For many businesses, the batch-oriented architecture of Big Data–where data is captured in large, scalable stores, then processed later–is simply too slow: a new breed of “Fast Data” architectures has evolved to be stream-oriented, where data is processed as it arrives, providing businesses with a competitive advantage.
There are many stream processing tools, so which ones should you choose? It helps to consider several factors in the context of your applications:
* Low latency: How low (or high) is needed?
* High volume: How much volume must be handled?
* Integration with other tools: Which ones and how?
* Data processing: What kinds? In bulk? As individual events?
In this talk by Dean Wampler, PhD., VP of Fast Data Engineering at Lightbend, we’ll look at the criteria you need to consider when selecting technologies, plus specific examples of how four streaming tools–Akka Streams, Kafka Streams, Apache Flink and Apache Spark serve particular needs and use cases when working with continuous streams of data.
Revitalizing Enterprise Integration with Reactive StreamsLightbend
With Viktor Klang, Deputy CTO Lightbend, Inc.
As software grows more and more interconnected, and with several generations of software having to interoperate, a new take on the integration of systems is needed—ad hoc, unversioned, and unreplicated scripts just won’t suffice, and the traditional Enterprise Service Bus (ESB) concept has experienced stability, reliability, performance, and scalability problems.
In this webinar, Viktor explores a new take on Enterprise Integration Patterns:
First, he will explore the Reactive Streams standard, an orchestration layer where transformations are standalone, composable, reusable, and—most importantly—using asynchronous flow-control—back pressure—to maintain predictable, stable, behavior over time.
Furthermore, he will go through how one-off workloads relate to continuous, and batch, workloads, and how they can be addressed by that very same orchestration layer.
Finally, he will review how this type of design achieves resilience, scalability, and ultimately—responsiveness.
Putting Kafka In Jail – Best Practices To Run Kafka On Kubernetes & DC/OSLightbend
Apache Kafka–part of Lightbend Fast Data Platform–is a distributed streaming platform that is best suited to run close to the metal on dedicated machines in statically defined clusters. For most enterprises, however, these fixed clusters are quickly becoming extinct in favor of mixed-use clusters that take advantage of all infrastructure resources available.
In this webinar by Sean Glover, Fast Data Engineer at Lightbend, we will review leading Kafka implementations on DC/OS and Kubernetes to see how they reliably run Kafka in container orchestrated clusters and reduce the overhead for a number of common operational tasks with standard cluster resource manager features. You will learn specifically about concerns like:
* The need for greater operational knowhow to do common tasks with Kafka in static clusters, such as applying broker configuration updates, upgrading to a new version, and adding or decommissioning brokers.
* The best way to provide resources to stateful technologies while in a mixed-use cluster, noting the importance of disk space as one of Kafka’s most important resource requirements.
* How to address the particular needs of stateful services in a model that natively favors stateless, transient services.
Akka A to Z: A Guide To The Industry’s Best Toolkit for Fast Data and Microse...Lightbend
Microservices. Streaming data. Event Sourcing and CQRS. Concurrency, routing, self-healing, persistence, clustering… You get the picture. The Akka toolkit makes all of this simple for Java and Scala developers at Amazon, LinkedIn, Starbucks, Verizon and others. So how does Akka provide all these features out of the box?
Join Hugh McKee, Akka expert and Developer Advocate at Lightbend, on an illustrated journey that goes deep into how Akka works–from individual Akka actors to fully distributed clusters across multiple datacenters.
Operationalizing Machine Learning: Serving ML ModelsLightbend
Join O’Reilly author and Lightbend Principal Architect, Boris Lublinsky, as he discusses one of the hottest topics in software engineering today: serving machine learning models.
Typically with machine learning, different groups are responsible for model training and model serving. Data scientists often introduce their own machine-learning tools, causing software engineers to create complementary model-serving frameworks to keep pace. It’s not a very efficient system. In this webinar, Boris demonstrates a more standardized approach to model serving and model scoring:
* How to develop an architecture for serving models in real time as part of input stream processing
* How this approach enables data science teams to update models without restarting existing applications
* Different ways to build this model-scoring solution, using several popular stream processing engines and frameworks
Akka Revealed: A JVM Architect's Journey From Resilient Actors To Scalable Cl...Lightbend
By now, you’ve probably heard of Akka, the JVM toolkit for building scalable, resilient and resource efficient applications in Java or Scala. With over 12 open-source and commercial modules in the toolkit, Akka takes developers from actors on a single JVM, all the way out to network partition healing and clusters of servers distributed across fleets of JVMs. But with such a broad range of features, how can Architects and Developers grok Akka from a high-level perspective?
In this technical webinar by Hugh McKee, O’Reilly author and Developer Advocate at Lightbend, we introduce Akka from A to Z, starting with a tour from the humble actor and finishing all the way at the clustered systems level. Specifically, we will review:
*How Akka Actors behave, create systems, and manage supervision and routing
*The way Akka embraces Reactive Streams with Akka Streams and Alpakka
*How various components of the Akka toolkit provide out-of-the-box solutions for distributed data, distributed persistence, pub-sub, and ES/CQRS
*How Akka works with microservices, and brings this functionality into Lagom and Play Frameworks
*Looking at Akka clusters, how Akka is used to build distributed clustered systems incorporate clusters within clusters
*What’s needed to orchestrate and deploy complete Reactive Systems
Slides from my madlab presentation on Akka Streams & Reactive Kafka (October 2015), full slides and source here:
https://github.com/markglh/AkkaStreams-Madlab-Slides
Developing Secure Scala Applications With Fortify For ScalaLightbend
From banks to airlines to credit rating agencies, security continues to be a major focus for organizations across various industries. As the newspapers show, it’s heavily damaging to enterprises when security vulnerabilities in their code, infrastructure, or open source frameworks/libraries get exploited.
The good news is that your Scala development team now has a powerful ally for securing their applications. Co-developed by the Fortify team along with Lightbend, the upcoming Fortify for Scala Plugin is the only Static Application Security Testing (SAST) solution to use the official Scala compiler. This plugin automatically identifies code-level security vulnerabilities early in the SDLC, so you can confidently and reliably secure your mission-critical Scala-based applications.
In this webinar by Seth Tisue, Scala Committer and Senior Scala Engineer at Lightbend, and Poonam Yadav, Product Manager for Fortify at Micro Focus, you will learn about:
* Some of the more than 200 vulnerabilities that the Fortify plugin for Scala can catch and help you resolve,
* How the plugin works to analyze, identify and provide actionable recommendations,
* How to integrate it into your modern DevOps environment,
* Why this plugin was co-developed by Lightbend and the Fortify team, and how it benefits your organization’s security professionals / CISO office.
Pakk Your Alpakka: Reactive Streams Integrations For AWS, Azure, & Google CloudLightbend
As the number of systems within an IT infrastructure increases, the number of integrations needed by enterprises also multiplies. Recognizing that the old times of overnight file exchanges are no longer meeting real-time demands, a well-organized enterprise integration strategy is a critical success factor when your systems need to be connected all day.
In this webinar with Enno Runne, Tech Lead for Alpakka at Lightbend, Inc., we’ll look at why integrations should be viewed as streams of data, and how Alpakka—a Reactive Enterprise Integration library for Java and Scala based on Reactive Streams and Akka—fits perfectly for today’s demands on system integrations. Specifically, we will review:
* How Alpakka brings streaming data flows directly to the surface, utilizing the features of Akka to tame the complexity of streams.
* Supported connectors for Amazon Web Services, Microsoft Azure, and Google Cloud, as well as others for event sourcing/persistence/DB technologies and traditional interfaces like FTP, HTTP, etc.
* A deeper look into the use cases for Alpakka’s most utilized interfaces to popular technologies like Apache Kafka, MQTT, and MongoDB.
Lessons Learned From PayPal: Implementing Back-Pressure With Akka Streams And...Lightbend
Akka Streams and its amazing handling of streaming with back-pressure should be no surprise to anyone. But it takes a couple of use cases to really see it in action - especially in use cases where the amount of work continues to increase as you’re processing it. This is where back-pressure really shines.
In this talk for Architects and Dev Managers by Akara Sucharitakul, Principal MTS for Global Platform Frameworks at PayPal, Inc., we look at how back-pressure based on Akka Streams and Kafka is being used at PayPal to handle very bursty workloads.
In addition, Akara will also share experiences in creating a platform based on Akka and Akka Streams that currently processes over 1 billion transactions per day (on just 8 VMs), with the aim of helping teams adopt these technologies. In this webinar, you will:
*Start with a sample web crawler use case to examine what happens when each processing pass expands to a larger and larger workload to process.
*Review how we use the buffering capabilities in Kafka and the back-pressure with asynchronous processing in Akka Streams to handle such bursts.
*Look at lessons learned, plus some constructive “rants” about the architectural components, the maturity, or immaturity you’ll expect, and tidbits and open source goodies like memory-mapped stream buffers that can be helpful in other Akka Streams and/or Kafka use cases.
UDF/UDAF: the extensibility framework for KSQL (Hojjat Jafapour, Confluent) K...confluent
KSQL is the streaming SQL engine for Apache Kafka. It provides an easy and completely interactive SQL interface for stream processing on Kafka. Users can express their processing logic in SQL like statements and KSQL will compile and execute them as Kafka Streams applications. Although KSQL provides a rich set of features and built in functions, many use cases require more domain specific processing logic that cannot be expressed in pure SQL. To enable users to use KSQL in such scenarios, KSQL provides a framework to define complex processing logic as User Defined Functions (UDFs) and User Defined Aggregate Functions (UDAFs). In this talk, we provide a deep dive into the UDF/UDAF framework in KSQL. We explain how users can define their custom UDFs/UDAFs and use them in their queries. We also describe how KSQL utilizes the provided UDFs/UDAFs under the hood to process streams and tables. This deep dive will include an insight into how UDFs process data and how UDAFs keep track of their state. Armed with such knowledge, KSQL users will be able to define and utilize complex data processing logic in their KSQL queries. They will also be able to diagnose and fix issues in defining and deploying their UDFs/UDAFs more efficiently.
Understanding Akka Streams, Back Pressure, and Asynchronous ArchitecturesLightbend
The term 'streams' has been getting pretty overloaded recently–it's hard to know where to best use different technologies with streams in the name. In this talk by noted hAkker Konrad Malawski, we'll disambiguate what streams are and what they aren't, taking a deeper look into Akka Streams (the implementation) and Reactive Streams (the standard).
You'll be introduced to a number of real life scenarios where applying back-pressure helps to keep your systems fast and healthy at the same time. While the focus is mainly on the Akka Streams implementation, the general principles apply to any kind of asynchronous, message-driven architectures.
What's new in Confluent 3.2 and Apache Kafka 0.10.2 confluent
With the introduction of connect and streams API in 2016, Apache Kafka is becoming the defacto solution for anyone looking to build a streaming platform. The community continues to add additional capabilities to make it the complete solution for streaming data.
Join us as we review the latest additions in Apache Kafka 0.10.2. In addition, we’ll cover what’s new in Confluent Enterprise 3.2 that makes it possible for running Kafka at scale.
Akka Streams And Kafka Streams: Where Microservices Meet Fast DataLightbend
In a recent survey, 90% of over 2400 developers reported having at least some real-time functionality in their systems. Enterprises are realizing that the ability to extract value from streaming data in near real-time is the new competitive advantage.
Two technologies–Akka Streams and Kafka Streams–have emerged as popular tools to use with Apache Kafka for addressing the shared requirements of availability, scalability, and resilience for both streaming microservices and Fast Data. So which one should you use for specific use cases?
Application development has come a long way. From client-server, to desktop, to web based applications served by monolithic application servers, the need to serve billions of users and hundreds of devices have become crucial to today's business. Typesafe Reactive Platform helps you to modernize your applications by transforming the most critical parts into microservice-style architectures which support extremely high workloads and allow you to serve millions of end-users.
Kafka Streams: the easiest way to start with stream processingYaroslav Tkachenko
Stream processing is getting more & more important in our data-centric systems. In the world of Big Data, batch processing is not enough anymore - everyone needs interactive, real-time analytics for making critical business decisions, as well as providing great features to the customers.
There are many stream processing frameworks available nowadays, but the cost of provisioning infrastructure and maintaining distributed computations is usually very high. Sometimes you just have to satisfy some specific requirements, like using HDFS or YARN.
Apache Kafka is de facto a standard for building data pipelines. Kafka Streams is a lightweight library (available since 0.10) that uses powerful Kafka abstractions internally and doesn't require any complex setup or special infrastructure - you just deploy it like any other regular application.
In this session I want to talk about the goals behind stream processing, basic techniques and some best practices. Then I'm going to explain main fundamental concepts behind Kafka and explore Kafka Streams syntax and streaming features. By the end of the session you'll be able to write stream processing applications in your domain, especially if you already use Kafka as your data pipeline.
Apache Kafka 0.8 basic training - VerisignMichael Noll
Apache Kafka 0.8 basic training (120 slides) covering:
1. Introducing Kafka: history, Kafka at LinkedIn, Kafka adoption in the industry, why Kafka
2. Kafka core concepts: topics, partitions, replicas, producers, consumers, brokers
3. Operating Kafka: architecture, hardware specs, deploying, monitoring, P&S tuning
4. Developing Kafka apps: writing to Kafka, reading from Kafka, testing, serialization, compression, example apps
5. Playing with Kafka using Wirbelsturm
Audience: developers, operations, architects
Created by Michael G. Noll, Data Architect, Verisign, https://www.verisigninc.com/
Verisign is a global leader in domain names and internet security.
Tools mentioned:
- Wirbelsturm (https://github.com/miguno/wirbelsturm)
- kafka-storm-starter (https://github.com/miguno/kafka-storm-starter)
Blog post at:
http://www.michael-noll.com/blog/2014/08/18/apache-kafka-training-deck-and-tutorial/
Many thanks to the LinkedIn Engineering team (the creators of Kafka) and the Apache Kafka open source community!
Making Scala Faster: 3 Expert Tips For Busy Development TeamsLightbend
In this special guest webinar with Mirco Dotta, co-founder of Triplequote LLC (the creators of Hydra), we take a deeper look into what affects Scala compilation speed, why a combination of language features, external libraries, and type annotations make compilation times generally unpredictable, and what you can do to speed it up by orders of magnitude. We’ll go through:
* Understanding some of the most common bottlenecks in Scala builds.
* Effective use of type class auto-derivation for cutting compilation times.
* What are some average compilation speeds, and how to know if you have a productivity blocker.
Streaming ETL with Apache Kafka and KSQLNick Dearden
Companies new and old are all recognizing the importance of a low-latency, scalable, fault-tolerant data backbone - in the form of the Apache Kafka streaming platform. With Kafka developers can integrate multiple systems and data sources to enable low-latency analytics, event-driven architectures, and the population of downstream systems. What's more, these data pipelines can be built using configuration alone.
In this talk, we'll see how easy it is to capture a stream of data changes in real-time from a database such as MySQL into Kafka using the Kafka Connect framework and then use KSQL to filter, aggregate and join it to other data, and finally stream the results from Kafka out into multiple targets such as Elasticsearch and MySQL. All of this can be accomplished without a single line of Java code!
Rethinking Stream Processing with Apache Kafka: Applications vs. Clusters, St...Michael Noll
My talk at Google DevFest Switzerland, Fribourg, Oct 2017.
https://devfest.ch/schedule/day1?sessionId=118
Abstract:
Modern businesses have data at their core, and this data is changing continuously. How can we harness this torrent of information in real-time? The answer is stream processing, and the technology that has since become the core platform for streaming data is Apache Kafka.
Among the thousands of companies that use Kafka to transform and reshape their industries are the likes of Netflix, Uber, PayPal, and AirBnB, but also established players such as Goldman Sachs, Cisco, and Oracle. Unfortunately, today’s common architectures for real-time data processing at scale suffer from complexity: there are many technologies that need to be stitched and operated together, and each individual technology is often complex by itself. This has led to a strong discrepancy between how we, as engineers, would like to work vs. how we actually end up working in practice.
In this session we talk about how Apache Kafka helps you to radically simplify your data architectures. We cover how you can now build normal applications to serve your real-time processing needs — rather than building clusters or similar special-purpose infrastructure — and still benefit from properties such as high scalability, distributed computing, and fault-tolerance, which are typically associated exclusively with cluster technologies. We discuss common use cases to realize that stream processing in practice often requires database-like functionality, and how Kafka allows you to bridge the worlds of streams and databases when implementing your own core business applications (inventory management for large retailers, patient monitoring in healthcare, fleet tracking in logistics, etc), for example in the form of event-driven, containerized microservices. We will also give a brief shout-out to the recently launched KSQL, a streaming SQL engine for Apache Kafka.
Event Sourcing, Stream Processing and Serverless (Benjamin Stopford, Confluen...confluent
In this talk we’ll look at the relationship between three of the most disruptive software engineering paradigms: event sourcing, stream processing and serverless. We’ll debunk some of the myths around event sourcing. We’ll look at the inevitability of event-driven programming in the serverless space and we’ll see how stream processing links these two concepts together with a single ‘database for events’. As the story unfolds we’ll dive into some use cases, examine the practicalities of each approach-particularly the stateful elements-and finally extrapolate how their future relationship is likely to unfold. Key takeaways include: The different flavors of event sourcing and where their value lies. The difference between stream processing at application- and infrastructure-levels. The relationship between stream processors and serverless functions. The practical limits of storing data in Kafka and stream processors like KSQL."
Introducing Kafka Streams, the new stream processing library of Apache Kafka,...Michael Noll
Video recording: https://www.youtube.com/watch?v=o7zSLNiTZbA
Slides of my talk at Berlin Buzzwords in June 2016.
Abstract:
"In the past few years Apache Kafka has established itself as the world's most popular real-time, large-scale messaging system. It is used across a wide range of industries by thousands of companies such as Netflix, Cisco, PayPal, Twitter, and many others.
In this session I am introducing the audience to Kafka Streams, which is the latest addition to the Apache Kafka project. Kafka Streams is a stream processing library natively integrated with Kafka. It has a very low barrier to entry, easy operationalization, and a high-level DSL for writing stream processing applications. As such it is the most convenient yet scalable option to process and analyze data that is backed by Kafka. We will provide the audience with an overview of Kafka Streams including its design and API, typical use cases, code examples, and an outlook of its upcoming roadmap. We will also compare Kafka Streams' light-weight library approach with heavier, framework-based tools such as Apache Storm and Spark Streaming, which require you to understand and operate a whole different infrastructure for processing real-time data in Kafka."
Landoop presenting how to simplify your ETL process using Kafka Connect for (E) and (L). Introducing KCQL - the Kafka Connect Query Language & how it can simplify fast-data (ingress & egress) pipelines. How KCQL can be used to set up Kafka Connectors for popular in-memory and analytical systems and live demos with HazelCast, Redis and InfluxDB. How to get started with a fast-data docker kafka development environment. Enhance your existing Cloudera (Hadoop) clusters with fast-data capabilities.
Big Data LDN 2018: STREAMING DATA MICROSERVICES WITH AKKA STREAMS, KAFKA STRE...Matt Stubbs
Date: 13th November 2018
Location: Fast Data Theatre
Time: 15:50 - 16:20
Speaker: Dean Wampler
Organisation: Lightbend
About: What if you used microservices for streaming data processing, rather than systems like Spark? I'll examine Kafka-based, microservice applications that use Akka Streams and Kafka Streams libraries for stream processing. I'll discuss the strengths and weaknesses of each tool for particular design needs, with lessons that are applicable to other library choices, too. I'll also contrast them with Spark Streaming and Flink; when should you choose them instead?
Kafka Connect and Streams (Concepts, Architecture, Features)Kai Wähner
High level introduction to Kafka Connect and Kafka Streams, two components of the Apache Kafka open source framework. See the concepts, architecture and features.
Akka Revealed: A JVM Architect's Journey From Resilient Actors To Scalable Cl...Lightbend
By now, you’ve probably heard of Akka, the JVM toolkit for building scalable, resilient and resource efficient applications in Java or Scala. With over 12 open-source and commercial modules in the toolkit, Akka takes developers from actors on a single JVM, all the way out to network partition healing and clusters of servers distributed across fleets of JVMs. But with such a broad range of features, how can Architects and Developers grok Akka from a high-level perspective?
In this technical webinar by Hugh McKee, O’Reilly author and Developer Advocate at Lightbend, we introduce Akka from A to Z, starting with a tour from the humble actor and finishing all the way at the clustered systems level. Specifically, we will review:
*How Akka Actors behave, create systems, and manage supervision and routing
*The way Akka embraces Reactive Streams with Akka Streams and Alpakka
*How various components of the Akka toolkit provide out-of-the-box solutions for distributed data, distributed persistence, pub-sub, and ES/CQRS
*How Akka works with microservices, and brings this functionality into Lagom and Play Frameworks
*Looking at Akka clusters, how Akka is used to build distributed clustered systems incorporate clusters within clusters
*What’s needed to orchestrate and deploy complete Reactive Systems
Slides from my madlab presentation on Akka Streams & Reactive Kafka (October 2015), full slides and source here:
https://github.com/markglh/AkkaStreams-Madlab-Slides
Developing Secure Scala Applications With Fortify For ScalaLightbend
From banks to airlines to credit rating agencies, security continues to be a major focus for organizations across various industries. As the newspapers show, it’s heavily damaging to enterprises when security vulnerabilities in their code, infrastructure, or open source frameworks/libraries get exploited.
The good news is that your Scala development team now has a powerful ally for securing their applications. Co-developed by the Fortify team along with Lightbend, the upcoming Fortify for Scala Plugin is the only Static Application Security Testing (SAST) solution to use the official Scala compiler. This plugin automatically identifies code-level security vulnerabilities early in the SDLC, so you can confidently and reliably secure your mission-critical Scala-based applications.
In this webinar by Seth Tisue, Scala Committer and Senior Scala Engineer at Lightbend, and Poonam Yadav, Product Manager for Fortify at Micro Focus, you will learn about:
* Some of the more than 200 vulnerabilities that the Fortify plugin for Scala can catch and help you resolve,
* How the plugin works to analyze, identify and provide actionable recommendations,
* How to integrate it into your modern DevOps environment,
* Why this plugin was co-developed by Lightbend and the Fortify team, and how it benefits your organization’s security professionals / CISO office.
Pakk Your Alpakka: Reactive Streams Integrations For AWS, Azure, & Google CloudLightbend
As the number of systems within an IT infrastructure increases, the number of integrations needed by enterprises also multiplies. Recognizing that the old times of overnight file exchanges are no longer meeting real-time demands, a well-organized enterprise integration strategy is a critical success factor when your systems need to be connected all day.
In this webinar with Enno Runne, Tech Lead for Alpakka at Lightbend, Inc., we’ll look at why integrations should be viewed as streams of data, and how Alpakka—a Reactive Enterprise Integration library for Java and Scala based on Reactive Streams and Akka—fits perfectly for today’s demands on system integrations. Specifically, we will review:
* How Alpakka brings streaming data flows directly to the surface, utilizing the features of Akka to tame the complexity of streams.
* Supported connectors for Amazon Web Services, Microsoft Azure, and Google Cloud, as well as others for event sourcing/persistence/DB technologies and traditional interfaces like FTP, HTTP, etc.
* A deeper look into the use cases for Alpakka’s most utilized interfaces to popular technologies like Apache Kafka, MQTT, and MongoDB.
Lessons Learned From PayPal: Implementing Back-Pressure With Akka Streams And...Lightbend
Akka Streams and its amazing handling of streaming with back-pressure should be no surprise to anyone. But it takes a couple of use cases to really see it in action - especially in use cases where the amount of work continues to increase as you’re processing it. This is where back-pressure really shines.
In this talk for Architects and Dev Managers by Akara Sucharitakul, Principal MTS for Global Platform Frameworks at PayPal, Inc., we look at how back-pressure based on Akka Streams and Kafka is being used at PayPal to handle very bursty workloads.
In addition, Akara will also share experiences in creating a platform based on Akka and Akka Streams that currently processes over 1 billion transactions per day (on just 8 VMs), with the aim of helping teams adopt these technologies. In this webinar, you will:
*Start with a sample web crawler use case to examine what happens when each processing pass expands to a larger and larger workload to process.
*Review how we use the buffering capabilities in Kafka and the back-pressure with asynchronous processing in Akka Streams to handle such bursts.
*Look at lessons learned, plus some constructive “rants” about the architectural components, the maturity, or immaturity you’ll expect, and tidbits and open source goodies like memory-mapped stream buffers that can be helpful in other Akka Streams and/or Kafka use cases.
UDF/UDAF: the extensibility framework for KSQL (Hojjat Jafapour, Confluent) K...confluent
KSQL is the streaming SQL engine for Apache Kafka. It provides an easy and completely interactive SQL interface for stream processing on Kafka. Users can express their processing logic in SQL like statements and KSQL will compile and execute them as Kafka Streams applications. Although KSQL provides a rich set of features and built in functions, many use cases require more domain specific processing logic that cannot be expressed in pure SQL. To enable users to use KSQL in such scenarios, KSQL provides a framework to define complex processing logic as User Defined Functions (UDFs) and User Defined Aggregate Functions (UDAFs). In this talk, we provide a deep dive into the UDF/UDAF framework in KSQL. We explain how users can define their custom UDFs/UDAFs and use them in their queries. We also describe how KSQL utilizes the provided UDFs/UDAFs under the hood to process streams and tables. This deep dive will include an insight into how UDFs process data and how UDAFs keep track of their state. Armed with such knowledge, KSQL users will be able to define and utilize complex data processing logic in their KSQL queries. They will also be able to diagnose and fix issues in defining and deploying their UDFs/UDAFs more efficiently.
Understanding Akka Streams, Back Pressure, and Asynchronous ArchitecturesLightbend
The term 'streams' has been getting pretty overloaded recently–it's hard to know where to best use different technologies with streams in the name. In this talk by noted hAkker Konrad Malawski, we'll disambiguate what streams are and what they aren't, taking a deeper look into Akka Streams (the implementation) and Reactive Streams (the standard).
You'll be introduced to a number of real life scenarios where applying back-pressure helps to keep your systems fast and healthy at the same time. While the focus is mainly on the Akka Streams implementation, the general principles apply to any kind of asynchronous, message-driven architectures.
What's new in Confluent 3.2 and Apache Kafka 0.10.2 confluent
With the introduction of connect and streams API in 2016, Apache Kafka is becoming the defacto solution for anyone looking to build a streaming platform. The community continues to add additional capabilities to make it the complete solution for streaming data.
Join us as we review the latest additions in Apache Kafka 0.10.2. In addition, we’ll cover what’s new in Confluent Enterprise 3.2 that makes it possible for running Kafka at scale.
Akka Streams And Kafka Streams: Where Microservices Meet Fast DataLightbend
In a recent survey, 90% of over 2400 developers reported having at least some real-time functionality in their systems. Enterprises are realizing that the ability to extract value from streaming data in near real-time is the new competitive advantage.
Two technologies–Akka Streams and Kafka Streams–have emerged as popular tools to use with Apache Kafka for addressing the shared requirements of availability, scalability, and resilience for both streaming microservices and Fast Data. So which one should you use for specific use cases?
Application development has come a long way. From client-server, to desktop, to web based applications served by monolithic application servers, the need to serve billions of users and hundreds of devices have become crucial to today's business. Typesafe Reactive Platform helps you to modernize your applications by transforming the most critical parts into microservice-style architectures which support extremely high workloads and allow you to serve millions of end-users.
Kafka Streams: the easiest way to start with stream processingYaroslav Tkachenko
Stream processing is getting more & more important in our data-centric systems. In the world of Big Data, batch processing is not enough anymore - everyone needs interactive, real-time analytics for making critical business decisions, as well as providing great features to the customers.
There are many stream processing frameworks available nowadays, but the cost of provisioning infrastructure and maintaining distributed computations is usually very high. Sometimes you just have to satisfy some specific requirements, like using HDFS or YARN.
Apache Kafka is de facto a standard for building data pipelines. Kafka Streams is a lightweight library (available since 0.10) that uses powerful Kafka abstractions internally and doesn't require any complex setup or special infrastructure - you just deploy it like any other regular application.
In this session I want to talk about the goals behind stream processing, basic techniques and some best practices. Then I'm going to explain main fundamental concepts behind Kafka and explore Kafka Streams syntax and streaming features. By the end of the session you'll be able to write stream processing applications in your domain, especially if you already use Kafka as your data pipeline.
Apache Kafka 0.8 basic training - VerisignMichael Noll
Apache Kafka 0.8 basic training (120 slides) covering:
1. Introducing Kafka: history, Kafka at LinkedIn, Kafka adoption in the industry, why Kafka
2. Kafka core concepts: topics, partitions, replicas, producers, consumers, brokers
3. Operating Kafka: architecture, hardware specs, deploying, monitoring, P&S tuning
4. Developing Kafka apps: writing to Kafka, reading from Kafka, testing, serialization, compression, example apps
5. Playing with Kafka using Wirbelsturm
Audience: developers, operations, architects
Created by Michael G. Noll, Data Architect, Verisign, https://www.verisigninc.com/
Verisign is a global leader in domain names and internet security.
Tools mentioned:
- Wirbelsturm (https://github.com/miguno/wirbelsturm)
- kafka-storm-starter (https://github.com/miguno/kafka-storm-starter)
Blog post at:
http://www.michael-noll.com/blog/2014/08/18/apache-kafka-training-deck-and-tutorial/
Many thanks to the LinkedIn Engineering team (the creators of Kafka) and the Apache Kafka open source community!
Making Scala Faster: 3 Expert Tips For Busy Development TeamsLightbend
In this special guest webinar with Mirco Dotta, co-founder of Triplequote LLC (the creators of Hydra), we take a deeper look into what affects Scala compilation speed, why a combination of language features, external libraries, and type annotations make compilation times generally unpredictable, and what you can do to speed it up by orders of magnitude. We’ll go through:
* Understanding some of the most common bottlenecks in Scala builds.
* Effective use of type class auto-derivation for cutting compilation times.
* What are some average compilation speeds, and how to know if you have a productivity blocker.
Streaming ETL with Apache Kafka and KSQLNick Dearden
Companies new and old are all recognizing the importance of a low-latency, scalable, fault-tolerant data backbone - in the form of the Apache Kafka streaming platform. With Kafka developers can integrate multiple systems and data sources to enable low-latency analytics, event-driven architectures, and the population of downstream systems. What's more, these data pipelines can be built using configuration alone.
In this talk, we'll see how easy it is to capture a stream of data changes in real-time from a database such as MySQL into Kafka using the Kafka Connect framework and then use KSQL to filter, aggregate and join it to other data, and finally stream the results from Kafka out into multiple targets such as Elasticsearch and MySQL. All of this can be accomplished without a single line of Java code!
Rethinking Stream Processing with Apache Kafka: Applications vs. Clusters, St...Michael Noll
My talk at Google DevFest Switzerland, Fribourg, Oct 2017.
https://devfest.ch/schedule/day1?sessionId=118
Abstract:
Modern businesses have data at their core, and this data is changing continuously. How can we harness this torrent of information in real-time? The answer is stream processing, and the technology that has since become the core platform for streaming data is Apache Kafka.
Among the thousands of companies that use Kafka to transform and reshape their industries are the likes of Netflix, Uber, PayPal, and AirBnB, but also established players such as Goldman Sachs, Cisco, and Oracle. Unfortunately, today’s common architectures for real-time data processing at scale suffer from complexity: there are many technologies that need to be stitched and operated together, and each individual technology is often complex by itself. This has led to a strong discrepancy between how we, as engineers, would like to work vs. how we actually end up working in practice.
In this session we talk about how Apache Kafka helps you to radically simplify your data architectures. We cover how you can now build normal applications to serve your real-time processing needs — rather than building clusters or similar special-purpose infrastructure — and still benefit from properties such as high scalability, distributed computing, and fault-tolerance, which are typically associated exclusively with cluster technologies. We discuss common use cases to realize that stream processing in practice often requires database-like functionality, and how Kafka allows you to bridge the worlds of streams and databases when implementing your own core business applications (inventory management for large retailers, patient monitoring in healthcare, fleet tracking in logistics, etc), for example in the form of event-driven, containerized microservices. We will also give a brief shout-out to the recently launched KSQL, a streaming SQL engine for Apache Kafka.
Event Sourcing, Stream Processing and Serverless (Benjamin Stopford, Confluen...confluent
In this talk we’ll look at the relationship between three of the most disruptive software engineering paradigms: event sourcing, stream processing and serverless. We’ll debunk some of the myths around event sourcing. We’ll look at the inevitability of event-driven programming in the serverless space and we’ll see how stream processing links these two concepts together with a single ‘database for events’. As the story unfolds we’ll dive into some use cases, examine the practicalities of each approach-particularly the stateful elements-and finally extrapolate how their future relationship is likely to unfold. Key takeaways include: The different flavors of event sourcing and where their value lies. The difference between stream processing at application- and infrastructure-levels. The relationship between stream processors and serverless functions. The practical limits of storing data in Kafka and stream processors like KSQL."
Introducing Kafka Streams, the new stream processing library of Apache Kafka,...Michael Noll
Video recording: https://www.youtube.com/watch?v=o7zSLNiTZbA
Slides of my talk at Berlin Buzzwords in June 2016.
Abstract:
"In the past few years Apache Kafka has established itself as the world's most popular real-time, large-scale messaging system. It is used across a wide range of industries by thousands of companies such as Netflix, Cisco, PayPal, Twitter, and many others.
In this session I am introducing the audience to Kafka Streams, which is the latest addition to the Apache Kafka project. Kafka Streams is a stream processing library natively integrated with Kafka. It has a very low barrier to entry, easy operationalization, and a high-level DSL for writing stream processing applications. As such it is the most convenient yet scalable option to process and analyze data that is backed by Kafka. We will provide the audience with an overview of Kafka Streams including its design and API, typical use cases, code examples, and an outlook of its upcoming roadmap. We will also compare Kafka Streams' light-weight library approach with heavier, framework-based tools such as Apache Storm and Spark Streaming, which require you to understand and operate a whole different infrastructure for processing real-time data in Kafka."
Landoop presenting how to simplify your ETL process using Kafka Connect for (E) and (L). Introducing KCQL - the Kafka Connect Query Language & how it can simplify fast-data (ingress & egress) pipelines. How KCQL can be used to set up Kafka Connectors for popular in-memory and analytical systems and live demos with HazelCast, Redis and InfluxDB. How to get started with a fast-data docker kafka development environment. Enhance your existing Cloudera (Hadoop) clusters with fast-data capabilities.
Big Data LDN 2018: STREAMING DATA MICROSERVICES WITH AKKA STREAMS, KAFKA STRE...Matt Stubbs
Date: 13th November 2018
Location: Fast Data Theatre
Time: 15:50 - 16:20
Speaker: Dean Wampler
Organisation: Lightbend
About: What if you used microservices for streaming data processing, rather than systems like Spark? I'll examine Kafka-based, microservice applications that use Akka Streams and Kafka Streams libraries for stream processing. I'll discuss the strengths and weaknesses of each tool for particular design needs, with lessons that are applicable to other library choices, too. I'll also contrast them with Spark Streaming and Flink; when should you choose them instead?
Kafka Connect and Streams (Concepts, Architecture, Features)Kai Wähner
High level introduction to Kafka Connect and Kafka Streams, two components of the Apache Kafka open source framework. See the concepts, architecture and features.
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Helena Edelson
O'Reilly Webcast with Myself and Evan Chan on the new SNACK Stack (playoff of SMACK) with FIloDB: Scala, Spark Streaming, Akka, Cassandra, FiloDB and Kafka.
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...Helena Edelson
Regardless of the meaning we are searching for over our vast amounts of data, whether we are in science, finance, technology, energy, health care…, we all share the same problems that must be solved: How do we achieve that? What technologies best support the requirements? This talk is about how to leverage fast access to historical data with real time streaming data for predictive modeling for lambda architecture with Spark Streaming, Kafka, Cassandra, Akka and Scala. Efficient Stream Computation, Composable Data Pipelines, Data Locality, Cassandra data model and low latency, Kafka producers and HTTP endpoints as akka actors...
SSR: Structured Streaming for R and Machine Learningfelixcss
Stepping beyond ETL in batches, large enterprises are looking at ways to generate more up-to-date insights. As we step into the age of Continuous Application, this session will explore the ever more popular Structure Streaming API in Apache Spark, its application to R, and building examples of machine learning use cases.
Starting with an introduction to the high-level concepts, the session will dive into the core of the execution plan internals and examine how SparkR extends the existing system to add the streaming capability. Learn how to build various data science applications on data streams integrating with R packages to leverage the rich R ecosystem of 10k+ packages.
Session hashtag: #SFdev2
SSR: Structured Streaming on R for Machine Learning with Felix CheungDatabricks
Stepping beyond ETL in batches, large enterprises are looking at ways to generate more up-to-date insights. As we step into the age of Continuous Application, this session will explore the ever more popular Structure Streaming API in Apache Spark, its application to R, and building examples of machine learning use cases.
Starting with an introduction to the high-level concepts, the session will dive into the core of the execution plan internals and examine how SparkR extends the existing system to add the streaming capability. Learn how to build various data science applications on data streams integrating with R packages to leverage the rich R ecosystem of 10k+ packages.
Author: Stefan Papp, Data Architect at “The unbelievable Machine Company“. An overview of Big Data Processing engines with a focus on Apache Spark and Apache Flink, given at a Vienna Data Science Group meeting on 26 January 2017. Following questions are addressed:
• What are big data processing paradigms and how do Spark 1.x/Spark 2.x and Apache Flink solve them?
• When to use batch and when stream processing?
• What is a Lambda-Architecture and a Kappa Architecture?
• What are the best practices for your project?
Jump Start with Apache Spark 2.0 on DatabricksDatabricks
Apache Spark 2.0 has laid the foundation for many new features and functionality. Its main three themes—easier, faster, and smarter—are pervasive in its unified and simplified high-level APIs for Structured data.
In this introductory part lecture and part hands-on workshop you’ll learn how to apply some of these new APIs using Databricks Community Edition. In particular, we will cover the following areas:
What’s new in Spark 2.0
SparkSessions vs SparkContexts
Datasets/Dataframes and Spark SQL
Introduction to Structured Streaming concepts and APIs
Large-Scale Data Science in Apache Spark 2.0Databricks
Data science is one of the only fields where scalability can lead to fundamentally better results. Scalability allows users to train models on more data or to experiment with more types of models, both of which result in better models. It is no accident that the organizations most successful with AI have been those with huge distributed computing resources. In this talk, Matei Zaharia will describe how Apache Spark is democratizing large-scale data science to make it easier for more organizations to build high-quality data and AI products. Matei Zaharia will talk about the new structured APIs in Spark 2.0 that enable more optimization underneath familia programming interfaces, as well as libraries to scale up deep learning or traditional machine learning libraries on Apache Spark.
Speaker: Matei Zaharia
NoSql DBs are really popular in the BigData landscape, but SQL semantic is taking revenge. Instead of learning many DSL, developers prefer to use the well know and universal SQL query, so roughly all big data solutions are forced to support SQL semantic over their data models.
From Document to Graph DBs, from search to streaming platforms, all the ways to query Big data through SQL.
KSQL Deep Dive - The Open Source Streaming Engine for Apache KafkaKai Wähner
Agenda:
Apache Kafka Ecosystem
Kafka Streams as Foundation for KSQL
Motivation for KSQL
KSQL Concepts
Live Demo #1 – Intro to KSQL
KSQL Architecture
Live Demo #2 - Clickstream Analysis
Building a User Defined Function (Example: Machine Learning)
Getting Started
###
The rapidly expanding world of stream processing can be daunting, with new concepts such as various types of time semantics, windowed aggregates, changelogs, and programming frameworks to master.
KSQL is an open-source, Apache 2.0 licensed streaming SQL engine on top of Apache Kafka which aims to simplify all this and make stream processing available to everyone. Even though it is simple to use, KSQL is built for mission-critical and scalable production deployments (using Kafka Streams under the hood).
Benefits of using KSQL include No coding required; no additional analytics cluster needed; streams and tables as first-class constructs; access to the rich Kafka ecosystem. This session introduces the concepts and architecture of KSQL. Use cases such as Streaming ETL, Real-Time Stream Monitoring or Anomaly Detection are discussed. A live demo shows how to setup and use KSQL quickly and easily on top of your Kafka ecosystem.
Using the SDACK Architecture to Build a Big Data ProductEvans Ye
You definitely have heard about the SMACK architecture, which stands for Spark, Mesos, Akka, Cassandra, and Kafka. It’s especially suitable for building a lambda architecture system. But what is SDACK? Apparently it’s very much similar to SMACK except the “D" stands for Docker. While SMACK is an enterprise scale, multi-tanent supported solution, the SDACK architecture is particularly suitable for building a data product. In this talk, I’ll talk about the advantages of the SDACK architecture, and how TrendMicro uses the SDACK architecture to build an anomaly detection data product. The talk will cover:
1) The architecture we designed based on SDACK to support both batch and streaming workload.
2) The data pipeline built based on Akka Stream which is flexible, scalable, and able to do self-healing.
3) The Cassandra data model designed to support time series data writes and reads.
This introductory workshop is aimed at data analysts & data engineers new to Apache Spark and exposes them how to analyze big data with Spark SQL and DataFrames.
In this partly instructor-led and self-paced labs, we will cover Spark concepts and you’ll do labs for Spark SQL and DataFrames
in Databricks Community Edition.
Toward the end, you’ll get a glimpse into newly minted Databricks Developer Certification for Apache Spark: what to expect & how to prepare for it.
* Apache Spark Basics & Architecture
* Spark SQL
* DataFrames
* Brief Overview of Databricks Certified Developer for Apache Spark
KSQL – An Open Source Streaming Engine for Apache KafkaKai Wähner
The rapidly expanding world of stream processing can be daunting, with new concepts such as various types of time semantics, windowed aggregates, changelogs, and programming frameworks to master. KSQL is an open-source, Apache 2.0 licensed streaming SQL engine on top of Apache Kafka which aims to simplify all this and make stream processing available to everyone. The project is managed and open sourced by Confluent.
KSQL makes it easy to read, write, and process streaming data in real-time, at scale, using SQL-like semantics. It offers an easy way to express stream processing logic as an alternative to writing an application in a programming language such as Java, Python or Go. Benefits of using KSQL include: No coding required; no additional analytics cluster needed; streams and tables as first-class constructs; access to the rich Kafka ecosystem.
This session introduces the concepts and architecture of KSQL. Use cases such as Streaming ETL, Real Time Stream Monitoring or Anomaly Detection are discussed. A live demo shows how to setup and use KSQL quickly and easily on top of your Kafka ecosystem.
Steps to Building a Streaming ETL Pipeline with Apache Kafka® and KSQLconfluent
Speaker: Robin Moffatt, Developer Advocate, Confluent
In this talk, we'll build a streaming data pipeline using nothing but our bare hands, the Kafka Connect API and KSQL. We'll stream data in from MySQL, transform it with KSQL and stream it out to Elasticsearch. Options for integrating databases with Kafka using CDC and Kafka Connect will be covered as well.
This is part 2 of 3 in Streaming ETL - The New Data Integration series.
Watch the recording: https://videos.confluent.io/watch/4cVXUQ2jCLgJNmg4kjCRqo?.
Writing Continuous Applications with Structured Streaming Python APIs in Apac...Databricks
Description:
We are amidst the Big Data Zeitgeist era in which data comes at us fast, in myriad forms and formats at intermittent intervals or in a continuous stream, and we need to respond to streaming data immediately. This need has created a notion of writing a streaming application that’s continuous, reacts and interacts with data in real-time. We call this continuous application, which we will discuss.
Abstract:
We are amidst the Big Data Zeitgeist era in which data comes at us fast, in myriad forms and formats at intermittent intervals or in a continuous stream, and we need to respond to streaming data immediately. This need has created a notion of writing a streaming application that’s continuous, reacts and interacts with data in real-time. We call this continuous application.
In this talk we will explore the concepts and motivations behind the continuous application, how Structured Streaming Python APIs in Apache Spark 2.x enables writing continuous applications, examine the programming model behind Structured Streaming, and look at the APIs that support them.
Through a short demo and code examples, I will demonstrate how to write an end-to-end Structured Streaming application that reacts and interacts with both real-time and historical data to perform advanced analytics using Spark SQL, DataFrames and Datasets APIs.
You’ll walk away with an understanding of what’s a continuous application, appreciate the easy-to-use Structured Streaming APIs, and why Structured Streaming in Apache Spark 2.x is a step forward in developing new kinds of streaming applications.
KSQL is an open source streaming SQL engine for Apache Kafka. Come hear how KSQL makes it easy to get started with a wide-range of stream processing applications such as real-time ETL, sessionization, monitoring and alerting, or fraud detection. We'll cover both how to get started with KSQL and some under-the-hood details of how it all works.
Similar to Streaming Microservices With Akka Streams And Kafka Streams (20)
IoT 'Megaservices' - High Throughput Microservices with AkkaLightbend
**********
Watch this presentation on-demand!
https://info.lightbend.com/iot-megaservices-high-throughput-microservices-with-akka-register.html
**********
In this interactive presentation by Hugh McKee, Developer Advocate at Lightbend, we’ll share our experiences helping our clients create a system architecture that can support high throughput microservices (aka "Megaservices"). We’ll do that using IoT demo applications designed to push cloud service providers like Amazon and Google to their limits. Using sample code that you can later run on your own machine, we’ll look at:
* Modeling real-life digital twins for hundreds of thousands of IoT devices in the field, looking into how these megaservices are implemented in Akka.
* Visualizing Akka Actors–which represent IoT digital twins–in a “crop circle” formation that represents a complete distributed Reactive application, and watching at messages are processed across Akka Cluster nodes using cluster sharding.
* Some code behind the whole set up, which is built using OSS like Akka, Java, JavaScript, and Kubernetes.
Follow us on social:
TW: https://twitter.com/lightbend
LI: https://www.linkedin.com/company/lightbend-inc-/
FB: https://www.facebook.com/lightbendOfficial/
For more about Lightbend:
Blog: https://www.lightbend.com/blog
Newsletter: https://www.lightbend.com/newsletter
How Akka Cluster Works: Actors Living in a ClusterLightbend
Hugh McKee, Developer Advocate at Lightbend, demonstrates how Akka Actors work inside of a cluster, including the code and in-browser visualizations you need to grok it.
See the full content with videos here: https://www.lightbend.com/blog/how-akka-cluster-works-actors-living-in-a-cluster
The Reactive Principles: Eight Tenets For Building Cloud Native ApplicationsLightbend
In this presentation by Jonas Bonér, creator of Akka and founder/CTO of Lightbend, we review a set of eight Reactive Principles that enable the design and implementation of Cloud Native applications–applications that are highly concurrent, distributed, performant, scalable, and resilient, while at the same time conserving resources when deploying, operating, and maintaining them.
Putting the 'I' in IoT - Building Digital Twins with Akka MicroservicesLightbend
In this webinar with Hugh McKee, Developer Advocate for Akka Platform, we’ll look at “What on Earth”, a demo exploring how Akka Microservices serves as an ideal solution for high-scale digital twinning for IoT.
For the full presentation, including video, visit: https://www.lightbend.com/blog/iot-building-digital-twins-with-akka-microservices
Akka at Enterprise Scale: Performance Tuning Distributed ApplicationsLightbend
Organizations like Starbucks, HPE, and PayPal (see our customers) have selected the Akka toolkit for their enterprise scale distributed applications; and when it comes to squeezing out the best possible performance, the secret is using two particular modules in tandem: Akka Cluster and Akka Streams.
In this webinar by Nolan Grace, Senior Solution Architect at Lightbend, we look at these two Akka modules and discuss the features that will push your application architecture to the next tier of performance.
For the full blog post, including the video, visit: https://www.lightbend.com/blog/akka-at-enterprise-scale-performance-tuning-distributed-applications
Digital Transformation with Kubernetes, Containers, and MicroservicesLightbend
See the full presentation here: https://www.lightbend.com/blog/digital-transformation-kubernetes-containers-microservices
In this talk by David Ogren, Principal Enterprise Architect at Lightbend, we draw from experiences helping our clients successfully create, migrate to, and manage cloud-native system architectures.
Detecting Real-Time Financial Fraud with Cloudflow on KubernetesLightbend
Deploying a robust streaming data pipeline can be a daunting task when your company’s financial information is at risk. For starters, how do you ensure proper provisioning of resources? How do you preserve end-to-end application and data consistency? How do you make all of this work in the cloud with Kubernetes and avoid YAML hell? Answer: Cloudflow, a new open-source toolkit for simplifying the development, deployment, and operation of streaming data pipelines.
In this webinar by Jonas Bonér, creator of Akka and CTO/Co-Founder of Lightbend, we take a look at Cloudstate, an OSS tool built on Akka, gRPC, Knative, GraalVM, and Kubernetes. Cloudstate lets you model, manage, and scale stateful services while preserving responsiveness by designing for resilience and elasticity.
Digital Transformation from Monoliths to Microservices to Serverless and BeyondLightbend
Join this highly-visual presentation by Hugh McKee, Developer Advocate at Lightbend, to learn more about the ramifications and opportunities along the evolution from monolithic systems, to microservices architectures, to serverless (FaaS).
See the video presentation on the Lightbend blog at: https://www.lightbend.com/blog/digital-transformation-from-monoliths-to-microservices-to-serverless-and-beyond
Akka Anti-Patterns, Goodbye: Six Features of Akka 2.6Lightbend
In this special guest webinar with Akka expert and Reactive System Consultant, Manuel Bernhardt, we review Akka 2.6 release highlights and a selection of 6 former anti-patterns that have now been rendered impossible by design.
Lessons From HPE: From Batch To Streaming For 20 Billion Sensors With Lightbe...Lightbend
In this guest webinar with Chris McDermott, Lead Data Engineer at HPE, learn how HPE InfoSight–powered by Lightbend Platform–has emerged as the go-to solution for providing real-time metrics and predictive analytics across various network, server, storage, and data center technologies.
Microservices, Kubernetes, and Application Modernization Done RightLightbend
In this talk by David Ogren, Enterprise Architect at Lightbend, we draw from experiences helping our clients successfully create, migrate to, and manage cloud-native system architectures. We look at some of the common pitfalls and anti-patterns of modernization efforts, and some of the best practices for taking an incremental approach to transforming legacy systems.
See the full post with video on the Lightbend blog: https://www.lightbend.com/blog/microservices-kubernetes-application-modernization
In this guest webinar by Kevin Webber, we cover the entire architecture of a Reactive system, from a responsive UI implemented with Vue.js, to a fully event sourced collection of microservices implemented with Java, Lagom, Cassandra, and Kafka.
For the full recording, visit: https://www.lightbend.com/blog/full-stack-reactive-in-practice-webinar
Akka and Kubernetes: A Symbiotic Love StoryLightbend
In this webinar by Hugh McKee, Developer Advocate at Lightbend, we take a look at how Akka and Kubernetes enjoy a symbiotic relationship, using live “crop circle” visuals to help. See the full video, slides, and additional resources here:
https://www.lightbend.com/blog/akka-and-kubernetes-a-symbiotic-love-story
Scala 3 Is Coming: Martin Odersky Shares What To KnowLightbend
Join Dr. Martin Odersky, the creator of Scala and co-founder of Lightbend, on a tour of what is in store and highlight some of his favorite features of Scala 3!
Migrating From Java EE To Cloud-Native Reactive SystemsLightbend
A lot of businesses that never before considered themselves as “technology companies” are now faced with digital modernization imperatives that force them to rethink their application and infrastructure architecture. On the path to becoming a digital, on-demand provider, development speed is the ultimate competitive advantage.
This presents challenges to many organizations that have huge investments in legacy Java EE infrastructure, where technical debt and monolithic system architectures require modernization in order to confront various business risks. Usually, changes need to be made within existing frameworks to keep pace with new web-scale organizations.
If your legacy monolith is no longer serving the expanding needs of your business, then join Markus Eisele, Director of Developer Advocacy at Lightbend, to learn what you can do to migrate from Java EE to cloud-native, Reactive systems—as defined by the Reactive Manifesto.
Running Kafka On Kubernetes With Strimzi For Real-Time Streaming ApplicationsLightbend
In this talk by Sean Glover, Principal Engineer at Lightbend, we will review how the Strimzi Kafka Operator, a supported technology in Lightbend Platform, makes many operational tasks in Kafka easy, such as the initial deployment and updates of a Kafka and ZooKeeper cluster.
See the blog post containing the YouTube video here: https://www.lightbend.com/blog/running-kafka-on-kubernetes-with-strimzi-for-real-time-streaming-applications
Designing Events-First Microservices For A Cloud Native WorldLightbend
In this talk by Jonas Bonér, Lightbend CTO/Co-Founder and creator of Akka, we will explore the nature of events, what it means to be event-driven, and how we can unleash the power of events and commands by applying an events first, domain-driven design to microservices-based architectures.
For more information, head over to lightbend.com/blog!
Scala Security: Eliminate 200+ Code-Level Threats With Fortify SCA For ScalaLightbend
Join Jeremy Daggett, Solutions Architect at Lightbend, to see how Fortify SCA for Scala works differently from existing Static Code Analysis tools to help you uncover security issues early in the SDLC of your mission-critical applications.
OpenMetadata Community Meeting - 5th June 2024OpenMetadata
The OpenMetadata Community Meeting was held on June 5th, 2024. In this meeting, we discussed about the data quality capabilities that are integrated with the Incident Manager, providing a complete solution to handle your data observability needs. Watch the end-to-end demo of the data quality features.
* How to run your own data quality framework
* What is the performance impact of running data quality frameworks
* How to run the test cases in your own ETL pipelines
* How the Incident Manager is integrated
* Get notified with alerts when test cases fail
Watch the meeting recording here - https://www.youtube.com/watch?v=UbNOje0kf6E
Software Engineering, Software Consulting, Tech Lead, Spring Boot, Spring Cloud, Spring Core, Spring JDBC, Spring Transaction, Spring MVC, OpenShift Cloud Platform, Kafka, REST, SOAP, LLD & HLD.
Large Language Models and the End of ProgrammingMatt Welsh
Talk by Matt Welsh at Craft Conference 2024 on the impact that Large Language Models will have on the future of software development. In this talk, I discuss the ways in which LLMs will impact the software industry, from replacing human software developers with AI, to replacing conventional software with models that perform reasoning, computation, and problem-solving.
Enterprise Resource Planning System includes various modules that reduce any business's workload. Additionally, it organizes the workflows, which drives towards enhancing productivity. Here are a detailed explanation of the ERP modules. Going through the points will help you understand how the software is changing the work dynamics.
To know more details here: https://blogs.nyggs.com/nyggs/enterprise-resource-planning-erp-system-modules/
Atelier - Innover avec l’IA Générative et les graphes de connaissancesNeo4j
Atelier - Innover avec l’IA Générative et les graphes de connaissances
Allez au-delà du battage médiatique autour de l’IA et découvrez des techniques pratiques pour utiliser l’IA de manière responsable à travers les données de votre organisation. Explorez comment utiliser les graphes de connaissances pour augmenter la précision, la transparence et la capacité d’explication dans les systèmes d’IA générative. Vous partirez avec une expérience pratique combinant les relations entre les données et les LLM pour apporter du contexte spécifique à votre domaine et améliorer votre raisonnement.
Amenez votre ordinateur portable et nous vous guiderons sur la mise en place de votre propre pile d’IA générative, en vous fournissant des exemples pratiques et codés pour démarrer en quelques minutes.
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...Shahin Sheidaei
Games are powerful teaching tools, fostering hands-on engagement and fun. But they require careful consideration to succeed. Join me to explore factors in running and selecting games, ensuring they serve as effective teaching tools. Learn to maintain focus on learning objectives while playing, and how to measure the ROI of gaming in education. Discover strategies for pitching gaming to leadership. This session offers insights, tips, and examples for coaches, team leads, and enterprise leaders seeking to teach from simple to complex concepts.
Quarkus Hidden and Forbidden ExtensionsMax Andersen
Quarkus has a vast extension ecosystem and is known for its subsonic and subatomic feature set. Some of these features are not as well known, and some extensions are less talked about, but that does not make them less interesting - quite the opposite.
Come join this talk to see some tips and tricks for using Quarkus and some of the lesser known features, extensions and development techniques.
How Recreation Management Software Can Streamline Your Operations.pptxwottaspaceseo
Recreation management software streamlines operations by automating key tasks such as scheduling, registration, and payment processing, reducing manual workload and errors. It provides centralized management of facilities, classes, and events, ensuring efficient resource allocation and facility usage. The software offers user-friendly online portals for easy access to bookings and program information, enhancing customer experience. Real-time reporting and data analytics deliver insights into attendance and preferences, aiding in strategic decision-making. Additionally, effective communication tools keep participants and staff informed with timely updates. Overall, recreation management software enhances efficiency, improves service delivery, and boosts customer satisfaction.
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...Globus
The U.S. Geological Survey (USGS) has made substantial investments in meeting evolving scientific, technical, and policy driven demands on storing, managing, and delivering data. As these demands continue to grow in complexity and scale, the USGS must continue to explore innovative solutions to improve its management, curation, sharing, delivering, and preservation approaches for large-scale research data. Supporting these needs, the USGS has partnered with the University of Chicago-Globus to research and develop advanced repository components and workflows leveraging its current investment in Globus. The primary outcome of this partnership includes the development of a prototype enterprise repository, driven by USGS Data Release requirements, through exploration and implementation of the entire suite of the Globus platform offerings, including Globus Flow, Globus Auth, Globus Transfer, and Globus Search. This presentation will provide insights into this research partnership, introduce the unique requirements and challenges being addressed and provide relevant project progress.
Navigating the Metaverse: A Journey into Virtual Evolution"Donna Lenk
Join us for an exploration of the Metaverse's evolution, where innovation meets imagination. Discover new dimensions of virtual events, engage with thought-provoking discussions, and witness the transformative power of digital realms."
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...Globus
The Earth System Grid Federation (ESGF) is a global network of data servers that archives and distributes the planet’s largest collection of Earth system model output for thousands of climate and environmental scientists worldwide. Many of these petabyte-scale data archives are located in proximity to large high-performance computing (HPC) or cloud computing resources, but the primary workflow for data users consists of transferring data, and applying computations on a different system. As a part of the ESGF 2.0 US project (funded by the United States Department of Energy Office of Science), we developed pre-defined data workflows, which can be run on-demand, capable of applying many data reduction and data analysis to the large ESGF data archives, transferring only the resultant analysis (ex. visualizations, smaller data files). In this talk, we will showcase a few of these workflows, highlighting how Globus Flows can be used for petabyte-scale climate analysis.
How to Position Your Globus Data Portal for Success Ten Good PracticesGlobus
Science gateways allow science and engineering communities to access shared data, software, computing services, and instruments. Science gateways have gained a lot of traction in the last twenty years, as evidenced by projects such as the Science Gateways Community Institute (SGCI) and the Center of Excellence on Science Gateways (SGX3) in the US, The Australian Research Data Commons (ARDC) and its platforms in Australia, and the projects around Virtual Research Environments in Europe. A few mature frameworks have evolved with their different strengths and foci and have been taken up by a larger community such as the Globus Data Portal, Hubzero, Tapis, and Galaxy. However, even when gateways are built on successful frameworks, they continue to face the challenges of ongoing maintenance costs and how to meet the ever-expanding needs of the community they serve with enhanced features. It is not uncommon that gateways with compelling use cases are nonetheless unable to get past the prototype phase and become a full production service, or if they do, they don't survive more than a couple of years. While there is no guaranteed pathway to success, it seems likely that for any gateway there is a need for a strong community and/or solid funding streams to create and sustain its success. With over twenty years of examples to draw from, this presentation goes into detail for ten factors common to successful and enduring gateways that effectively serve as best practices for any new or developing gateway.
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Streaming Microservices With Akka Streams And Kafka Streams
1.
2. Check out these resources:
Dean’s book
Webinars
etc.
Fast Data Architectures
for Streaming Applications
Getting Answers Now from Data Sets that Never End
By Dean Wampler, Ph. D., VP of Fast Data Engineering
LIGHTBEND.COM/LEARN
2
5. Service 1
Log &
Other Files
Internet
Services
Service 2
Service 3
Services
Services
N * M links ConsumersProducers
Before:
Why Kafka?
6. Service 1
Log &
Other Files
Internet
Services
Service 2
Service 3
Services
Services
N * M links ConsumersProducers
Before:
Service 1
Log &
Other Files
Internet
Services
Service 2
Service 3
Services
Services
N + M links ConsumersProducers
After:
Why Kafka?
7. Service 1
Log &
Other Files
Internet
Services
Service 2
Service 3
Services
Services
N + M links ConsumersProducers
After:
Why Kafka?
Kafka:
• Simplify dependencies between
services
• Reduce data loss when a service
crashes
• M producers, N consumers
• Simplicity of one “API” for
communication
11. “Record-centric” μ-services
Events Records
A Spectrum of Microservices
Event-driven μ-services
…
Browse
REST
AccountOrders
Shopping
Cart
API Gateway
Inventory
storage
Data
Model
Training
Model
Serving
Other
Logic
13. “Record-centric” μ-services
Events Records
A Spectrum of Microservices
storage
Data
Model
Training
Model
Serving
Other
Logic
Emerged from the right-hand
side.
Kafka Streams pushes to the
left, supporting many event-
processing scenarios.
18. val builder = new StreamsBuilderS // New Scala Wrapper API.
val data = builder.stream[Array[Byte], Array[Byte]](rawDataTopic)
val model = builder.stream[Array[Byte], Array[Byte]](modelTopic)
val modelProcessor = new ModelProcessor
val scorer = new Scorer(modelProcessor)
model.mapValues(bytes => Model.parseBytes(bytes)) // array => record
.filter((key, model) => model.valid) // Successful?
.mapValues(model => ModelImpl.findModel(model))
.process(() => modelProcessor, …) // Set up actual model
data.mapValues(bytes => DataRecord.parseBytes(bytes))
.filter((key, record) => record.valid)
.mapValues(record => new ScoredRecord(scorer.score(record),record))
.to(scoredRecordsTopic)
val streams = new KafkaStreams(
builder.build, streamsConfiguration)
streams.start()
Data
Model
Training
Model
Serving
Raw
Data
Model
Params
Scored
Records
19. val builder = new StreamsBuilderS // New Scala Wrapper API.
val data = builder.stream[Array[Byte], Array[Byte]](rawDataTopic)
val model = builder.stream[Array[Byte], Array[Byte]](modelTopic)
val modelProcessor = new ModelProcessor
val scorer = new Scorer(modelProcessor)
model.mapValues(bytes => Model.parseBytes(bytes)) // array => record
.filter((key, model) => model.valid) // Successful?
.mapValues(model => ModelImpl.findModel(model))
.process(() => modelProcessor, …) // Set up actual model
data.mapValues(bytes => DataRecord.parseBytes(bytes))
.filter((key, record) => record.valid)
.mapValues(record => new ScoredRecord(scorer.score(record),record))
.to(scoredRecordsTopic)
val streams = new KafkaStreams(
builder.build, streamsConfiguration)
streams.start()
Data
Model
Training
Model
Serving
Raw
Data
Model
Params
Scored
Records
20. val builder = new StreamsBuilderS // New Scala Wrapper API.
val data = builder.stream[Array[Byte], Array[Byte]](rawDataTopic)
val model = builder.stream[Array[Byte], Array[Byte]](modelTopic)
val modelProcessor = new ModelProcessor
val scorer = new Scorer(modelProcessor)
model.mapValues(bytes => Model.parseBytes(bytes)) // array => record
.filter((key, model) => model.valid) // Successful?
.mapValues(model => ModelImpl.findModel(model))
.process(() => modelProcessor, …) // Set up actual model
data.mapValues(bytes => DataRecord.parseBytes(bytes))
.filter((key, record) => record.valid)
.mapValues(record => new ScoredRecord(scorer.score(record),record))
.to(scoredRecordsTopic)
val streams = new KafkaStreams(
builder.build, streamsConfiguration)
streams.start()
Data
Model
Training
Model
Serving
Raw
Data
Model
Params
Scored
Records
21. val builder = new StreamsBuilderS // New Scala Wrapper API.
val data = builder.stream[Array[Byte], Array[Byte]](rawDataTopic)
val model = builder.stream[Array[Byte], Array[Byte]](modelTopic)
val modelProcessor = new ModelProcessor
val scorer = new Scorer(modelProcessor)
model.mapValues(bytes => Model.parseBytes(bytes)) // array => record
.filter((key, model) => model.valid) // Successful?
.mapValues(model => ModelImpl.findModel(model))
.process(() => modelProcessor, …) // Set up actual model
data.mapValues(bytes => DataRecord.parseBytes(bytes))
.filter((key, record) => record.valid)
.mapValues(record => new ScoredRecord(scorer.score(record),record))
.to(scoredRecordsTopic)
val streams = new KafkaStreams(
builder.build, streamsConfiguration)
streams.start()
Data
Model
Training
Model
Serving
Raw
Data
Model
Params
Scored
Records
22. val builder = new StreamsBuilderS // New Scala Wrapper API.
val data = builder.stream[Array[Byte], Array[Byte]](rawDataTopic)
val model = builder.stream[Array[Byte], Array[Byte]](modelTopic)
val modelProcessor = new ModelProcessor
val scorer = new Scorer(modelProcessor)
model.mapValues(bytes => Model.parseBytes(bytes)) // array => record
.filter((key, model) => model.valid) // Successful?
.mapValues(model => ModelImpl.findModel(model))
.process(() => modelProcessor, …) // Set up actual model
data.mapValues(bytes => DataRecord.parseBytes(bytes))
.filter((key, record) => record.valid)
.mapValues(record => new ScoredRecord(scorer.score(record),record))
.to(scoredRecordsTopic)
val streams = new KafkaStreams(
builder.build, streamsConfiguration)
streams.start()
Data
Model
Training
Model
Serving
Raw
Data
Model
Params
Scored
Records
23. val builder = new StreamsBuilderS // New Scala Wrapper API.
val data = builder.stream[Array[Byte], Array[Byte]](rawDataTopic)
val model = builder.stream[Array[Byte], Array[Byte]](modelTopic)
val modelProcessor = new ModelProcessor
val scorer = new Scorer(modelProcessor)
model.mapValues(bytes => Model.parseBytes(bytes)) // array => record
.filter((key, model) => model.valid) // Successful?
.mapValues(model => ModelImpl.findModel(model))
.process(() => modelProcessor, …) // Set up actual model
data.mapValues(bytes => DataRecord.parseBytes(bytes))
.filter((key, record) => record.valid)
.mapValues(record => new ScoredRecord(scorer.score(record),record))
.to(scoredRecordsTopic)
val streams = new KafkaStreams(
builder.build, streamsConfiguration)
streams.start()
Data
Model
Training
Model
Serving
Raw
Data
Model
Params
Scored
Records
24. val builder = new StreamsBuilderS // New Scala Wrapper API.
val data = builder.stream[Array[Byte], Array[Byte]](rawDataTopic)
val model = builder.stream[Array[Byte], Array[Byte]](modelTopic)
val modelProcessor = new ModelProcessor
val scorer = new Scorer(modelProcessor)
model.mapValues(bytes => Model.parseBytes(bytes)) // array => record
.filter((key, model) => model.valid) // Successful?
.mapValues(model => ModelImpl.findModel(model))
.process(() => modelProcessor, …) // Set up actual model
data.mapValues(bytes => DataRecord.parseBytes(bytes))
.filter((key, record) => record.valid)
.mapValues(record => new ScoredRecord(scorer.score(record),record))
.to(scoredRecordsTopic)
val streams = new KafkaStreams(
builder.build, streamsConfiguration)
streams.start()
Data
Model
Training
Model
Serving
Raw
Data
Model
Params
Scored
Records
25. val builder = new StreamsBuilderS // New Scala Wrapper API.
val data = builder.stream[Array[Byte], Array[Byte]](rawDataTopic)
val model = builder.stream[Array[Byte], Array[Byte]](modelTopic)
val modelProcessor = new ModelProcessor
val scorer = new Scorer(modelProcessor)
model.mapValues(bytes => Model.parseBytes(bytes)) // array => record
.filter((key, model) => model.valid) // Successful?
.mapValues(model => ModelImpl.findModel(model))
.process(() => modelProcessor, …) // Set up actual model
data.mapValues(bytes => DataRecord.parseBytes(bytes))
.filter((key, record) => record.valid)
.mapValues(record => new ScoredRecord(scorer.score(record),record))
.to(scoredRecordsTopic)
val streams = new KafkaStreams(
builder.build, streamsConfiguration)
streams.start()
Data
Model
Training
Model
Serving
Raw
Data
Model
Params
Scored
Records
30. Akka Cluster
Data
Model
Training
Model
Serving
Raw
Data
Model
Params
Alpakka
implicit val system = ActorSystem("ModelServing")
implicit val materializer = ActorMaterializer()
implicit val executionContext = system.dispatcher
val modelProcessor = new ModelProcessor
val scorer = new Scorer(modelProcessor)
val modelStream: Source[ModelImpl, Consumer.Control] =
Consumer.atMostOnceSource(modelConsumerSettings,
Subscriptions.topics(modelTopic))
.map(input => Model.parseBytes(input.value()))
.filter(model => model.valid).map(_.get)
.map(model => ModelImpl.findModel(model))
.filter(model => model.valid).map(_.get)
val dataStream: Source[Record, Consumer.Control] =
Consumer.atMostOnceSource(dataConsumerSettings,
Subscriptions.topics(rawDataTopic))
.map(input => DataRecord.parseBytes(input.value()))
.filter(record => record.valid).map(_.get)
31. implicit val system = ActorSystem("ModelServing")
implicit val materializer = ActorMaterializer()
implicit val executionContext = system.dispatcher
val modelProcessor = new ModelProcessor
val scorer = new Scorer(modelProcessor)
val modelStream: Source[ModelImpl, Consumer.Control] =
Consumer.atMostOnceSource(modelConsumerSettings,
Subscriptions.topics(modelTopic))
.map(input => Model.parseBytes(input.value()))
.filter(model => model.valid).map(_.get)
.map(model => ModelImpl.findModel(model))
.filter(model => model.valid).map(_.get)
val dataStream: Source[Record, Consumer.Control] =
Consumer.atMostOnceSource(dataConsumerSettings,
Subscriptions.topics(rawDataTopic))
.map(input => DataRecord.parseBytes(input.value()))
.filter(record => record.valid).map(_.get)
Akka Cluster
Data
Model
Training
Model
Serving
Raw
Data
Model
Params
Alpakka
32. implicit val system = ActorSystem("ModelServing")
implicit val materializer = ActorMaterializer()
implicit val executionContext = system.dispatcher
val modelProcessor = new ModelProcessor
val scorer = new Scorer(modelProcessor)
val modelStream: Source[ModelImpl, Consumer.Control] =
Consumer.atMostOnceSource(modelConsumerSettings,
Subscriptions.topics(modelTopic))
.map(input => Model.parseBytes(input.value()))
.filter(model => model.valid).map(_.get)
.map(model => ModelImpl.findModel(model))
.filter(model => model.valid).map(_.get)
val dataStream: Source[Record, Consumer.Control] =
Consumer.atMostOnceSource(dataConsumerSettings,
Subscriptions.topics(rawDataTopic))
.map(input => DataRecord.parseBytes(input.value()))
.filter(record => record.valid).map(_.get)
Akka Cluster
Data
Model
Training
Model
Serving
Raw
Data
Model
Params
Alpakka
33. implicit val system = ActorSystem("ModelServing")
implicit val materializer = ActorMaterializer()
implicit val executionContext = system.dispatcher
val modelProcessor = new ModelProcessor
val scorer = new Scorer(modelProcessor)
val modelStream: Source[ModelImpl, Consumer.Control] =
Consumer.atMostOnceSource(modelConsumerSettings,
Subscriptions.topics(modelTopic))
.map(input => Model.parseBytes(input.value()))
.filter(model => model.valid).map(_.get)
.map(model => ModelImpl.findModel(model))
.filter(model => model.valid).map(_.get)
val dataStream: Source[Record, Consumer.Control] =
Consumer.atMostOnceSource(dataConsumerSettings,
Subscriptions.topics(rawDataTopic))
.map(input => DataRecord.parseBytes(input.value()))
.filter(record => record.valid).map(_.get)
Akka Cluster
Data
Model
Training
Model
Serving
Raw
Data
Model
Params
Alpakka
34. implicit val system = ActorSystem("ModelServing")
implicit val materializer = ActorMaterializer()
implicit val executionContext = system.dispatcher
val modelProcessor = new ModelProcessor
val scorer = new Scorer(modelProcessor)
val modelStream: Source[ModelImpl, Consumer.Control] =
Consumer.atMostOnceSource(modelConsumerSettings,
Subscriptions.topics(modelTopic))
.map(input => Model.parseBytes(input.value()))
.filter(model => model.valid).map(_.get)
.map(model => ModelImpl.findModel(model))
.filter(model => model.valid).map(_.get)
val dataStream: Source[Record, Consumer.Control] =
Consumer.atMostOnceSource(dataConsumerSettings,
Subscriptions.topics(rawDataTopic))
.map(input => DataRecord.parseBytes(input.value()))
.filter(record => record.valid).map(_.get)
Akka Cluster
Data
Model
Training
Model
Serving
Raw
Data
Model
Params
Alpakka
35. val model = ModelStage(modelProcessor)
def keepModelMaterializedValue[M1,M2,M3](m1:M1, m2:M2, m3:M3):M3 = m3
val modelPredictions: Source[Option[Double], ReadableModelStateStore] =
Source.fromGraph {
GraphDSL.create(dataStream, modelStream, model)(
keepModelMaterializedValue) { implicit builder => (d, m, w) =>
import GraphDSL.Implicits._
// Wire the input streams with the model stage (2 in, 1 out)
// dataStream --> | |
// | model | -> predictions
// modelStream -> | |
d ~> w.dataRecordIn
m ~> w.modelRecordIn
SourceShape(w.scoringResultOut)
}
}
Akka Cluster
Data
Model
Training
Model
Serving
Raw
Data
Model
Params
Alpakka
36. val model = ModelStage(modelProcessor)
def keepModelMaterializedValue[M1,M2,M3](m1:M1, m2:M2, m3:M3):M3 = m3
val modelPredictions: Source[Option[Double], ReadableModelStateStore] =
Source.fromGraph {
GraphDSL.create(dataStream, modelStream, model)(
keepModelMaterializedValue) { implicit builder => (d, m, w) =>
import GraphDSL.Implicits._
// Wire the input streams with the model stage (2 in, 1 out)
// dataStream --> | |
// | model | -> predictions
// modelStream -> | |
d ~> w.dataRecordIn
m ~> w.modelRecordIn
SourceShape(w.scoringResultOut)
}
}
Akka Cluster
Data
Model
Training
Model
Serving
Raw
Data
Model
Params
Alpakka
case class ModelStage(modelProcessor: …) extends
GraphStageWithMaterializedValue[…, …] {
val scorer = new Scorer(modelProcessor)
val dataRecordIn = Inlet[Record]("dataRecordIn")
val modelRecordIn = Inlet[ModelImpl]("modelRecordIn")
val scoringResultOut = Outlet[ScoredRecord]("scoringOut")
…
setHandler(dataRecordIn, new InHandler {
override def onPush(): Unit = {
val record = grab(dataRecordIn)
val newRecord = new ScoredRecord(scorer.score(record),record))
push(scoringResultOut, Some(newRecord))
pull(dataRecordIn)
})
…
}
37. val model = ModelStage(modelProcessor)
def keepModelMaterializedValue[M1,M2,M3](m1:M1, m2:M2, m3:M3):M3 = m3
val modelPredictions: Source[Option[Double], ReadableModelStateStore] =
Source.fromGraph {
GraphDSL.create(dataStream, modelStream, model)(
keepModelMaterializedValue) { implicit builder => (d, m, w) =>
import GraphDSL.Implicits._
// Wire the input streams with the model stage (2 in, 1 out)
// dataStream --> | |
// | model | -> predictions
// modelStream -> | |
d ~> w.dataRecordIn
m ~> w.modelRecordIn
SourceShape(w.scoringResultOut)
}
}
Akka Cluster
Data
Model
Training
Model
Serving
Raw
Data
Model
Params
Alpakka
38. val model = ModelStage(modelProcessor)
def keepModelMaterializedValue[M1,M2,M3](m1:M1, m2:M2, m3:M3):M3 = m3
val modelPredictions: Source[Option[Double], ReadableModelStateStore] =
Source.fromGraph {
GraphDSL.create(dataStream, modelStream, model)(
keepModelMaterializedValue) { implicit builder => (d, m, w) =>
import GraphDSL.Implicits._
// Wire the input streams with the model stage (2 in, 1 out)
// dataStream --> | |
// | model | -> predictions
// modelStream -> | |
d ~> w.dataRecordIn
m ~> w.modelRecordIn
SourceShape(w.scoringResultOut)
}
}
Akka Cluster
Data
Model
Training
Model
Serving
Raw
Data
Model
Params
Alpakka
39. val model = ModelStage(modelProcessor)
def keepModelMaterializedValue[M1,M2,M3](m1:M1, m2:M2, m3:M3):M3 = m3
val modelPredictions: Source[Option[Double], ReadableModelStateStore] =
Source.fromGraph {
GraphDSL.create(dataStream, modelStream, model)(
keepModelMaterializedValue) { implicit builder => (d, m, w) =>
import GraphDSL.Implicits._
// Wire the input streams with the model stage (2 in, 1 out)
// dataStream --> | |
// | model | -> predictions
// modelStream -> | |
d ~> w.dataRecordIn
m ~> w.modelRecordIn
SourceShape(w.scoringResultOut)
}
}
Akka Cluster
Data
Model
Training
Model
Serving
Raw
Data
Model
Params
Alpakka
40. d ~> w.dataRecordIn
m ~> w.modelRecordIn
SourceShape(w.scoringResultOut)
}
}
val materializedReadableModelStateStore: ReadableModelStateStore =
modelPredictions
.to(Sink.ignore) // we do not read the results directly
.run() // we run the stream, materializing the stage's StateStore
Akka Cluster
Data
Model
Training
Model
Serving
Raw
Data
Model
Params
Alpakka
41. •Scale scoring with workers and
routers, across a cluster
•Persist actor state with Akka
Persistence
•Connect to almost anything with
Alpakka
• Enterprise Suite
•for production
Other Concerns
Akka Cluster
Model Serving
Other
Logic
Alpakka
Alpakka
Router
Worker
Worker
Worker
Worker
Stateful
Logic
Persistence
actor state
storage
Akka Cluster
Data
Model
Training
Model
Serving
Other
Logic
Raw
Data
Model
Params
Final
Records
Alpakka
Alpakka
storage
42. •Extremely low latency
•Minimal I/O and memory overhead
•Reactive Streams backpressure
•M producers, N consumers, but
directly connected (sort of)
•Use Akka Persistence for durable state
Go Direct or Through Kafka?
Akka Cluster
Model
Serving
Other
Logic
Alpakka
Alpakka
Model
Serving
Other
Logic
Scored
Records
vs. ?
•Higher latency (including queue depth)
•Higher I/O overhead
•Very large buffer (disk size)
•M producers, N consumers, completely
disconnected
•Automatic durability (topics on disk)
43. •Use for smaller, faster
messaging between
“components”.
•Watch for consumer “backup”
•Use Akka Persistence for
important state!
Akka Cluster
Model
Serving
Other
Logic
Alpakka
Alpakka
Model
Serving
Other
Logic
Scored
Records
vs. ?
•Use for larger volumes, more
course-grained service
interactions
•Plan partitioning and
replication carefully
Go Direct or Through Kafka?
46. Check out these resources:
Dean’s book
Webinars
etc.
Fast Data Architectures
for Streaming Applications
Getting Answers Now from Data Sets that Never End
By Dean Wampler, Ph. D., VP of Fast Data Engineering
46
LIGHTBEND.COM/LEARN
47. Serving Machine Learning Models
A Guide to Architecture, Stream Processing Engines,
and Frameworks
By Boris Lublinsky, Fast Data Platform Architect
47
LIGHTBEND.COM/LEARN
48. For even more information:
Tutorial - Building Streaming Applications with Kafka:
Software Architecture Conference New York
Strata Data Conference San Jose
Strata Data Conference London
My talk:
Strata Data Conference San Jose
48