Video and slides synchronized, mp3 and slide download available at URL https://bit.ly/2UkZRIC.
Monal Daxini presents a blueprint for streaming data architectures and a review of desirable features of a streaming engine. He also talks about streaming application patterns and anti-patterns, and use cases and concrete examples using Apache Flink. Filmed at qconsf.com.
Monal Daxini is the Tech Lead for Stream Processing platform for business insights at Netflix. He helped build the petabyte scale Keystone pipeline running on the Flink powered platform. He introduced Flink to Netflix, and also helped define the vision for this platform. He has over 17 years of experience building scalable distributed systems.
This Tutorial will discuss and demonstrate how to implement different realtime streaming analytics patterns. We will start with counting usecases and progress into complex patterns like time windows, tracking objects, and detecting trends. We will start with Apache Storm and progress into Complex Event Processing based technologies.
Solving DEBS Grand Challenge with WSO2 CEPSrinath Perera
The DEBS Grand Challenge is an annual event in which different event-based systems compete to solve a real-world problem. The 2014 challenge is to demonstrate scalable real- time analytics using high-volume sensor data collected from smart plugs over a one and a half month period. This paper aims to show how a general-purpose commercially available event-based system - the WSO2 Complex Event Processor (WSO2 CEP) - was used to solve this problem. We achieved 300k TPS with one node and neared 1Millions TPS with 4 nodes. In addition, we explore areas where we created extensions to the WSO2 CEP engine to better solve the challenge.
ICTER 2014 Invited Talk: Large Scale Data Processing in the Real World: from ...Srinath Perera
Large scale data processing analyses and makes sense of large amounts of data. Although the field itself is not new, it is finding many usecases under the theme "Bigdata" where Google itself, IBM Watson, and Google's Driverless car are some of success stories. Spanning many fields, Large scale data processing brings together technologies like Distributed Systems, Machine Learning, Statistics, and Internet of Things together. It is a multi-billion-dollar industry including use cases like targeted advertising, fraud detection, product recommendations, and market surveys. With new technologies like Internet of Things (IoT), these use cases are expanding to scenarios like Smart Cities, Smart health, and Smart Agriculture. Some usecases like Urban Planning can be slow, which is done in batch mode, while others like stock markets need results within Milliseconds, which are done in streaming fashion. There are different technologies for each case: MapReduce for batch processing and Complex Event Processing and Stream Processing for real-time usecases. Furthermore, the type of analysis range from basic statistics like mean to complicated prediction models based on machine Learning. In this talk, we will discuss data processing landscape: concepts, usecases, technologies and open questions while drawing examples from real world scenarios.
http://icter.org/conference/invited_speeches
With tens of thousands of Java servers running in production in enterprise, Java has become a language of choice for building production systems. If our machines are to exhibit acceptable performance, they require regular tuning.This talk takes a detailed look at techniques for tuning a Java Server.
Distributed Stream Processing - Spark Summit East 2017Petr Zapletal
The demand for stream processing is increasing a lot these days. Immense amounts of data have to be processed fast from a rapidly growing set of disparate data sources. This pushes the limits of traditional data processing infrastructures. These stream-based applications include trading, social networks, Internet of things, system monitoring, and many other examples.
A number of powerful, easy-to-use open source platforms have emerged to address this. But the same problem can be solved differently, various but sometimes overlapping use-cases can be targeted or different vocabularies for similar concepts can be used. This may lead to confusion, longer development time or costly wrong decisions.
View, Act, and React: Shaping Business Activity with Analytics, BigData Queri...Srinath Perera
Sun Tzu said “if you know your enemies and know yourself, you can win a hundred battles without a single loss.” Those words have never been truer than in our time. We are faced with an avalanche of data. Many believe the ability to process and gain insights from a vast array of available data will be the primary competitive advantage for organizations in the years to come.
To make sense of data, you will have to face many challenges: how to collect, how to store, how to process, and how to react fast. Although you can build these systems from bottom up, it is a significant problem. There are many technologies, both open source and proprietary, that you can put together to build your analytics solution, which will likely save you effort and provide a better solution.
In this session, Srinath will discuss WSO2’s middleware offering in BigData and explain how you can put them together to build a solution that will make sense of your data. The session will cover technologies like thrift for collecting data, Cassandra for storing data, Hadoop for analyzing data in batch mode, and Complex event processing for analyzing data real time.
This Tutorial will discuss and demonstrate how to implement different realtime streaming analytics patterns. We will start with counting usecases and progress into complex patterns like time windows, tracking objects, and detecting trends. We will start with Apache Storm and progress into Complex Event Processing based technologies.
Solving DEBS Grand Challenge with WSO2 CEPSrinath Perera
The DEBS Grand Challenge is an annual event in which different event-based systems compete to solve a real-world problem. The 2014 challenge is to demonstrate scalable real- time analytics using high-volume sensor data collected from smart plugs over a one and a half month period. This paper aims to show how a general-purpose commercially available event-based system - the WSO2 Complex Event Processor (WSO2 CEP) - was used to solve this problem. We achieved 300k TPS with one node and neared 1Millions TPS with 4 nodes. In addition, we explore areas where we created extensions to the WSO2 CEP engine to better solve the challenge.
ICTER 2014 Invited Talk: Large Scale Data Processing in the Real World: from ...Srinath Perera
Large scale data processing analyses and makes sense of large amounts of data. Although the field itself is not new, it is finding many usecases under the theme "Bigdata" where Google itself, IBM Watson, and Google's Driverless car are some of success stories. Spanning many fields, Large scale data processing brings together technologies like Distributed Systems, Machine Learning, Statistics, and Internet of Things together. It is a multi-billion-dollar industry including use cases like targeted advertising, fraud detection, product recommendations, and market surveys. With new technologies like Internet of Things (IoT), these use cases are expanding to scenarios like Smart Cities, Smart health, and Smart Agriculture. Some usecases like Urban Planning can be slow, which is done in batch mode, while others like stock markets need results within Milliseconds, which are done in streaming fashion. There are different technologies for each case: MapReduce for batch processing and Complex Event Processing and Stream Processing for real-time usecases. Furthermore, the type of analysis range from basic statistics like mean to complicated prediction models based on machine Learning. In this talk, we will discuss data processing landscape: concepts, usecases, technologies and open questions while drawing examples from real world scenarios.
http://icter.org/conference/invited_speeches
With tens of thousands of Java servers running in production in enterprise, Java has become a language of choice for building production systems. If our machines are to exhibit acceptable performance, they require regular tuning.This talk takes a detailed look at techniques for tuning a Java Server.
Distributed Stream Processing - Spark Summit East 2017Petr Zapletal
The demand for stream processing is increasing a lot these days. Immense amounts of data have to be processed fast from a rapidly growing set of disparate data sources. This pushes the limits of traditional data processing infrastructures. These stream-based applications include trading, social networks, Internet of things, system monitoring, and many other examples.
A number of powerful, easy-to-use open source platforms have emerged to address this. But the same problem can be solved differently, various but sometimes overlapping use-cases can be targeted or different vocabularies for similar concepts can be used. This may lead to confusion, longer development time or costly wrong decisions.
View, Act, and React: Shaping Business Activity with Analytics, BigData Queri...Srinath Perera
Sun Tzu said “if you know your enemies and know yourself, you can win a hundred battles without a single loss.” Those words have never been truer than in our time. We are faced with an avalanche of data. Many believe the ability to process and gain insights from a vast array of available data will be the primary competitive advantage for organizations in the years to come.
To make sense of data, you will have to face many challenges: how to collect, how to store, how to process, and how to react fast. Although you can build these systems from bottom up, it is a significant problem. There are many technologies, both open source and proprietary, that you can put together to build your analytics solution, which will likely save you effort and provide a better solution.
In this session, Srinath will discuss WSO2’s middleware offering in BigData and explain how you can put them together to build a solution that will make sense of your data. The session will cover technologies like thrift for collecting data, Cassandra for storing data, Hadoop for analyzing data in batch mode, and Complex event processing for analyzing data real time.
Streaming data presents new challenges for statistics and machine learning on extremely large data sets. Tools such as Apache Storm, a stream processing framework, can power range of data analytics but lack advanced statistical capabilities. These slides are from the Apache.con talk, which discussed developing streaming algorithms with the flexibility of both Storm and R, a statistical programming language.
At the talk I dicsussed issues of why and how to use Storm and R to develop streaming algorithms; in particular I focused on:
• Streaming algorithms
• Online machine learning algorithms
• Use cases showing how to process hundreds of millions of events a day in (near) real time
See: https://apacheconna2015.sched.org/event/09f5a1cc372860b008bce09e15a034c4#.VUf7wxOUd5o
Within this tutorial we present the results of recent research about the cloud enablement of data streaming systems. We illustrate, based on both industrial as well as academic prototypes, new emerging uses cases and research trends. Specifically, we focus on novel approaches for (1) fault tolerance and (2) scalability in large scale distributed streaming systems. In general, new fault tolerance mechanisms strive to be more robust and at the same time introduce less overhead. Novel load balancing approaches focus on elastic scaling over hundreds of instances based on the data and query workload. Finally, we present open challenges for the next generation of cloud-based data stream processing engines.
Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...Brian O'Neill
This presentation covers our use of Storm and the connectors we've built. It also proposes a design for integrating Storm with real-time web services by embedding parts of topologies directly into the web services layer.
Mining Big Data Streams with APACHE SAMOAAlbert Bifet
In this talk, we present Apache SAMOA, an open-source platform for
mining big data streams with Apache Flink, Storm and Samza. Real time analytics is
becoming the fastest and most efficient way to obtain useful knowledge
from what is happening now, allowing organizations to react quickly
when problems appear or to detect new trends helping to improve their
performance. Apache SAMOA includes algorithms for the most common
machine learning tasks such as classification and clustering. It
provides a pluggable architecture that allows it to run on Apache
Flink, but also with other several distributed stream processing
engines such as Storm and Samza.
Unified Big Data Processing with Apache Spark (QCON 2014)Databricks
While early big data systems, such as MapReduce, focused on batch processing, the demands on these systems have quickly grown. Users quickly needed to run (1) more interactive ad-hoc queries, (2) sophisticated multi-pass algorithms (e.g. machine learning), and (3) real-time stream processing. The result has been an explosion of specialized systems to tackle these new workloads. Unfortunately, this means more systems to learn, manage, and stitch together into pipelines. Spark is unique in taking a step back and trying to provide a *unified* post-MapReduce programming model that tackles all these workloads. By generalizing MapReduce to support fast data sharing and low-latency jobs, we achieve best-in-class performance in a variety of workloads, while providing a simple programming model that lets users easily and efficiently combine them.
Today, Spark is the most active open source project in big data, with high activity in both the core engine and a growing array of standard libraries built on top (e.g. machine learning, stream processing, SQL). I'm going to talk about the latest developments in Spark and show examples of how it can combine processing algorithms to build rich data pipelines in just a few lines of code.
Talk by Databricks CTO and Apache Spark creator Matei Zaharia at QCON San Francisco 2014.
Deep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim HunterDatabricks
Deep learning has shown tremendous successes, yet it often requires a lot of effort to leverage its power. Existing deep learning frameworks require writing a lot of code to run a model, let alone in a distributed manner. Deep Learning Pipelines is a Spark Package library that makes practical deep learning simple based on the Spark MLlib Pipelines API. Leveraging Spark, Deep Learning Pipelines scales out many compute-intensive deep learning tasks. In this talk we dive into – the various use cases of Deep Learning Pipelines such as prediction at massive scale, transfer learning, and hyperparameter tuning, many of which can be done in just a few lines of code. – how to work with complex data such as images in Spark and Deep Learning Pipelines. – how to deploy deep learning models through familiar Spark APIs such as MLlib and Spark SQL to empower everyone from machine learning practitioners to business analysts. Finally, we discuss integration with popular deep learning frameworks.
Introduction into scalable graph analysis with Apache Giraph and Spark GraphXrhatr
Graph relationships are everywhere. In fact, more often than not, analyzing relationships between points in your datasets lets you extract more business value from your data.
Consider social graphs, or relationships of customers to each other and products they purchase, as two of the most common examples. Now, if you think you have a scalability issue just analyzing points in your datasets, imagine what would happen if you wanted to start analyzing the arbitrary relationships between those data points: the amount of potential processing will increase dramatically, and the kind of algorithms you would typically want to run would change as well.
If your Hadoop batch-oriented approach with MapReduce works reasonably well, for scalable graph processing you have to embrace an in-memory, explorative, and iterative approach. One of the best ways to tame this complexity is known as the Bulk synchronous parallel approach. Its two most widely used implementations are available as Hadoop ecosystem projects: Apache Giraph (used at Facebook), and Apache GraphX (as part of a Spark project).
In this talk we will focus on practical advice on how to get up and running with Apache Giraph and GraphX; start analyzing simple datasets with built-in algorithms; and finally how to implement your own graph processing applications using the APIs provided by the projects. We will finally compare and contrast the two, and try to lay out some principles of when to use one vs. the other.
Stream processing engines are becoming pivotal in analyzing data. They have evolved beyond a data transport and simple processing machinery, to one that's capable of complex processing. The necessary features and building blocks of these engines are well known. And most capable engines have a familiar Dataflow based programming model.
As with any new paradigm, building streaming applications requires a different mindset and approach. Hence there is a need for identifying and describing patterns and anti-patterns for building these applications. Currently this mindshare is scarce.
Drawn from my experience working with several engineers within and outside of Netflix, this talk will present the following:
A blueprint for streaming data architectures and a review of desirable features of a streaming engine
Streaming Application patterns and anti-patterns
Use cases and concrete examples using Flink
Attendees will come away with patterns that can be applied to any capable stream processing framework such as Apache Flink.
Over 100 million subscribers from over 190 countries enjoy the Netflix service. This leads to over a trillion events, amounting to 3 PB, flowing through the Keystone infrastructure to help improve customer experience and glean business insights. The self-serve Keystone stream processing service processes these messages in near real-time with at-least once semantics in the cloud. This enables the users to focus on extracting insights, and not worry about building out scalable infrastructure. I’ll share the details about this platform, and our experience building it.
Streaming data presents new challenges for statistics and machine learning on extremely large data sets. Tools such as Apache Storm, a stream processing framework, can power range of data analytics but lack advanced statistical capabilities. These slides are from the Apache.con talk, which discussed developing streaming algorithms with the flexibility of both Storm and R, a statistical programming language.
At the talk I dicsussed issues of why and how to use Storm and R to develop streaming algorithms; in particular I focused on:
• Streaming algorithms
• Online machine learning algorithms
• Use cases showing how to process hundreds of millions of events a day in (near) real time
See: https://apacheconna2015.sched.org/event/09f5a1cc372860b008bce09e15a034c4#.VUf7wxOUd5o
Within this tutorial we present the results of recent research about the cloud enablement of data streaming systems. We illustrate, based on both industrial as well as academic prototypes, new emerging uses cases and research trends. Specifically, we focus on novel approaches for (1) fault tolerance and (2) scalability in large scale distributed streaming systems. In general, new fault tolerance mechanisms strive to be more robust and at the same time introduce less overhead. Novel load balancing approaches focus on elastic scaling over hundreds of instances based on the data and query workload. Finally, we present open challenges for the next generation of cloud-based data stream processing engines.
Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...Brian O'Neill
This presentation covers our use of Storm and the connectors we've built. It also proposes a design for integrating Storm with real-time web services by embedding parts of topologies directly into the web services layer.
Mining Big Data Streams with APACHE SAMOAAlbert Bifet
In this talk, we present Apache SAMOA, an open-source platform for
mining big data streams with Apache Flink, Storm and Samza. Real time analytics is
becoming the fastest and most efficient way to obtain useful knowledge
from what is happening now, allowing organizations to react quickly
when problems appear or to detect new trends helping to improve their
performance. Apache SAMOA includes algorithms for the most common
machine learning tasks such as classification and clustering. It
provides a pluggable architecture that allows it to run on Apache
Flink, but also with other several distributed stream processing
engines such as Storm and Samza.
Unified Big Data Processing with Apache Spark (QCON 2014)Databricks
While early big data systems, such as MapReduce, focused on batch processing, the demands on these systems have quickly grown. Users quickly needed to run (1) more interactive ad-hoc queries, (2) sophisticated multi-pass algorithms (e.g. machine learning), and (3) real-time stream processing. The result has been an explosion of specialized systems to tackle these new workloads. Unfortunately, this means more systems to learn, manage, and stitch together into pipelines. Spark is unique in taking a step back and trying to provide a *unified* post-MapReduce programming model that tackles all these workloads. By generalizing MapReduce to support fast data sharing and low-latency jobs, we achieve best-in-class performance in a variety of workloads, while providing a simple programming model that lets users easily and efficiently combine them.
Today, Spark is the most active open source project in big data, with high activity in both the core engine and a growing array of standard libraries built on top (e.g. machine learning, stream processing, SQL). I'm going to talk about the latest developments in Spark and show examples of how it can combine processing algorithms to build rich data pipelines in just a few lines of code.
Talk by Databricks CTO and Apache Spark creator Matei Zaharia at QCON San Francisco 2014.
Deep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim HunterDatabricks
Deep learning has shown tremendous successes, yet it often requires a lot of effort to leverage its power. Existing deep learning frameworks require writing a lot of code to run a model, let alone in a distributed manner. Deep Learning Pipelines is a Spark Package library that makes practical deep learning simple based on the Spark MLlib Pipelines API. Leveraging Spark, Deep Learning Pipelines scales out many compute-intensive deep learning tasks. In this talk we dive into – the various use cases of Deep Learning Pipelines such as prediction at massive scale, transfer learning, and hyperparameter tuning, many of which can be done in just a few lines of code. – how to work with complex data such as images in Spark and Deep Learning Pipelines. – how to deploy deep learning models through familiar Spark APIs such as MLlib and Spark SQL to empower everyone from machine learning practitioners to business analysts. Finally, we discuss integration with popular deep learning frameworks.
Introduction into scalable graph analysis with Apache Giraph and Spark GraphXrhatr
Graph relationships are everywhere. In fact, more often than not, analyzing relationships between points in your datasets lets you extract more business value from your data.
Consider social graphs, or relationships of customers to each other and products they purchase, as two of the most common examples. Now, if you think you have a scalability issue just analyzing points in your datasets, imagine what would happen if you wanted to start analyzing the arbitrary relationships between those data points: the amount of potential processing will increase dramatically, and the kind of algorithms you would typically want to run would change as well.
If your Hadoop batch-oriented approach with MapReduce works reasonably well, for scalable graph processing you have to embrace an in-memory, explorative, and iterative approach. One of the best ways to tame this complexity is known as the Bulk synchronous parallel approach. Its two most widely used implementations are available as Hadoop ecosystem projects: Apache Giraph (used at Facebook), and Apache GraphX (as part of a Spark project).
In this talk we will focus on practical advice on how to get up and running with Apache Giraph and GraphX; start analyzing simple datasets with built-in algorithms; and finally how to implement your own graph processing applications using the APIs provided by the projects. We will finally compare and contrast the two, and try to lay out some principles of when to use one vs. the other.
Stream processing engines are becoming pivotal in analyzing data. They have evolved beyond a data transport and simple processing machinery, to one that's capable of complex processing. The necessary features and building blocks of these engines are well known. And most capable engines have a familiar Dataflow based programming model.
As with any new paradigm, building streaming applications requires a different mindset and approach. Hence there is a need for identifying and describing patterns and anti-patterns for building these applications. Currently this mindshare is scarce.
Drawn from my experience working with several engineers within and outside of Netflix, this talk will present the following:
A blueprint for streaming data architectures and a review of desirable features of a streaming engine
Streaming Application patterns and anti-patterns
Use cases and concrete examples using Flink
Attendees will come away with patterns that can be applied to any capable stream processing framework such as Apache Flink.
Over 100 million subscribers from over 190 countries enjoy the Netflix service. This leads to over a trillion events, amounting to 3 PB, flowing through the Keystone infrastructure to help improve customer experience and glean business insights. The self-serve Keystone stream processing service processes these messages in near real-time with at-least once semantics in the cloud. This enables the users to focus on extracting insights, and not worry about building out scalable infrastructure. I’ll share the details about this platform, and our experience building it.
Netflix Keystone SPaaS: Real-time Stream Processing as a Service - ABD320 - r...Amazon Web Services
Over 100 million subscribers from over 190 countries enjoy the Netflix service. This leads to over a trillion events, amounting to 3 PB, flowing through the Keystone infrastructure to help improve customer experience and glean business insights. The self-serve Keystone stream processing service processes these messages in near real-time with at-least once semantics in the cloud. This enables the users to focus on extracting insights, and not worry about building out scalable infrastructure. In this session, I share the benefits and our experience building the platform.
Introduction to apache kafka, confluent and why they matterPaolo Castagna
This is a short and introductory presentation on Apache Kafka (including Kafka Connect APIs, Kafka Streams APIs, both part of Apache Kafka) and other open source components part of the Confluent platform (such as KSQL).
This was the first Kafka Meetup in South Africa.
Flink at netflix paypal speaker seriesMonal Daxini
* Over 100 million subscribers from over 190 countries enjoy the Netflix service. This leads to over a trillion events flowing through the Keystone stream processing infrastructure to help glean business insights and improve customer experience. The self-serve infrastructure enables the users to focus on extracting insights, and not worry about building out scalable infrastructure. I’ll share our experience building building this platform with Flink, and lessons learnt.
O'Reilly Media Webcast: Building Real-Time Data PipelinesSingleStore
As our customers tap into new sources of data or modify to existing data pipelines, we are often asked questions like: What technologies should we consider? Where can we reduce data latency? How can we simplify our data architecture?
To eliminate the guesswork, we teamed up with Ben Lorica, Chief Data Scientist at O’Reilly Media to host a webcast centered around building real-time data pipelines.
Real-Time Data Pipelines with Kafka, Spark, and Operational DatabasesSingleStore
Eric Frenkiel, MemSQL CEO and co-founder and Gartner Catalyst. August 11, 2015, San Diego, CA. Watch the Pinterest Demo Video here: https://youtu.be/KXelkQFVz4E
What's new in Confluent 3.2 and Apache Kafka 0.10.2 confluent
With the introduction of connect and streams API in 2016, Apache Kafka is becoming the defacto solution for anyone looking to build a streaming platform. The community continues to add additional capabilities to make it the complete solution for streaming data.
Join us as we review the latest additions in Apache Kafka 0.10.2. In addition, we’ll cover what’s new in Confluent Enterprise 3.2 that makes it possible for running Kafka at scale.
Apache Kafka - Scalable Message-Processing and more !Guido Schmutz
ndependent of the source of data, the integration of event streams into an Enterprise Architecture gets more and more important in the world of sensors, social media streams and Internet of Things. Events have to be accepted quickly and reliably, they have to be distributed and analysed, often with many consumers or systems interested in all or part of the events. How can me make sure that all these event are accepted and forwarded in an efficient and reliable way? This is where Apache Kafaka comes into play, a distirbuted, highly-scalable messaging broker, build for exchanging huge amount of messages between a source and a target.
This session will start with an introduction into Apache and presents the role of Apache Kafka in a modern data / information architecture and the advantages it brings to the table. Additionally the Kafka ecosystem will be covered as well as the integration of Kafka in the Oracle Stack, with products such as Golden Gate, Service Bus and Oracle Stream Analytics all being able to act as a Kafka consumer or producer.
The Power of Distributed Snapshots in Apache FlinkC4Media
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/2FcTsIA.
Stephan Ewen talks about how Apache Flink handles stateful stream processing and how to manage distributed stream processing and data driven applications efficiently with Flink's checkpoints and savepoints. Filmed at qconsf.com.
Stephan Ewen is a PMC member and one of the original creators of Apache Flink, and co-founder and CTO of data Artisans.
Apache Kafka - Scalable Message-Processing and more !Guido Schmutz
Independent of the source of data, the integration of event streams into an Enterprise Architecture gets more and more important in the world of sensors, social media streams and Internet of Things. Events have to be accepted quickly and reliably, they have to be distributed and analysed, often with many consumers or systems interested in all or part of the events. How can me make sure that all these event are accepted and forwarded in an efficient and reliable way? This is where Apache Kafaka comes into play, a distirbuted, highly-scalable messaging broker, build for exchanging huge amount of messages between a source and a target.
This session will start with an introduction into Apache and presents the role of Apache Kafka in a modern data / information architecture and the advantages it brings to the table. Additionally the Kafka ecosystem will be covered as well as the integration of Kafka in the Oracle Stack, with products such as Golden Gate, Service Bus and Oracle Stream Analytics all being able to act as a Kafka consumer or producer.
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Guido Schmutz
Independent of the source of data, the integration and analysis of event streams gets more important in the world of sensors, social media streams and Internet of Things. Events have to be accepted quickly and reliably, they have to be distributed and analyzed, often with many consumers or systems interested in all or part of the events. In this session we compare two popular Streaming Analytics solutions: Spark Streaming and Kafka Streams.
Spark is fast and general engine for large-scale data processing and has been designed to provide a more efficient alternative to Hadoop MapReduce. Spark Streaming brings Spark's language-integrated API to stream processing, letting you write streaming applications the same way you write batch jobs. It supports both Java and Scala.
Kafka Streams is the stream processing solution which is part of Kafka. It is provided as a Java library and by that can be easily integrated with any Java application.
Kafka Connect & Kafka Streams/KSQL - the ecosystem around KafkaGuido Schmutz
After a quick overview and introduction of Apache Kafka, this session cover two components which extend the core of Apache Kafka: Kafka Connect and Kafka Streams/KSQL.
Kafka Connects role is to access data from the out-side-world and make it available inside Kafka by publishing it into a Kafka topic. On the other hand, Kafka Connect is also responsible to transport information from inside Kafka to the outside world, which could be a database or a file system. There are many existing connectors for different source and target systems available out-of-the-box, either provided by the community or by Confluent or other vendors. You simply configure these connectors and off you go.
Kafka Streams is a light-weight component which extends Kafka with stream processing functionality. By that, Kafka can now not only reliably and scalable transport events and messages through the Kafka broker but also analyse and process these event in real-time. Interestingly Kafka Streams does not provide its own cluster infrastructure and it is also not meant to run on a Kafka cluster. The idea is to run Kafka Streams where it makes sense, which can be inside a “normal” Java application, inside a Web container or on a more modern containerized (cloud) infrastructure, such as Mesos, Kubernetes or Docker. Kafka Streams has a lot of interesting features, such as reliable state handling, queryable state and much more. KSQL is a streaming engine for Apache Kafka, providing a simple and completely interactive SQL interface for processing data in Kafka.
Big Data Open Source Security LLC: Realtime log analysis with Mesos, Docker, ...DataStax Academy
We will be talking about the solution we developed for using Mesos, Docker, Kafka, Spark, Cassandra and Solr (DataStax Enterprise Edition) all developed in Go for doing realtime log analysis at scale. Many organizations either need or want log analysis in real time where you can see within a second what is happening within your entire infrastructure. Today, with the hardware available and software systems we have in place, you can develop, build and use as a service these solutions.
http://www.oreilly.com/pub/e/3764
Keystone processes over 700 billion events per day (1 peta byte) with at-least-once processing semantics in the cloud. Monal Daxini details how they used Kafka, Samza, Docker, and Linux at scale to implement a multi-tenant pipeline in AWS cloud within a year. He'll also share plans on offering a Stream Processing as a Service for all of Netflix use.
Similar to Patterns of Streaming Applications (20)
Streaming a Million Likes/Second: Real-Time Interactions on Live VideoC4Media
Video and slides synchronized, mp3 and slide download available at URL https://bit.ly/39NIjLV.
Akhilesh Gupta does a technical deep-dive into how Linkedin uses the Play/Akka Framework and a scalable distributed system to enable live interactions like likes/comments at massive scale at extremely low costs across multiple data centers. Filmed at qconlondon.com.
Akhilesh Gupta is the technical lead for LinkedIn's Real-time delivery infrastructure and LinkedIn Messaging. He has been working on the revamp of LinkedIn’s offerings to instant, real-time experiences. Before this, he was the head of engineering for the Ride Experience program at Uber Technologies in San Francisco.
Next Generation Client APIs in Envoy MobileC4Media
Video and slides synchronized, mp3 and slide download available at URL https://bit.ly/2x0Fav8.
Jose Nino guides the audience through the journey of Mobile APIs at Lyft. He focuses on how the team has reaped the benefits of API generation to experiment with the network transport layer. He also discusses recent developments the team has made with Envoy Mobile and the roadmap ahead. Filmed at qconlondon.com.
Jose Nino works as a Software Engineer at Lyft.
Software Teams and Teamwork Trends Report Q1 2020C4Media
How do we cope with an environment that has been radically disrupted, where people are suddenly thrust into remote work in a chaotic state? What are the emerging good practices and new ideas that are shaping the way in which software development teams work? What can we do to make the workplace a more secure and diverse one while increasing the productivity of our teams? This report aims to assist technical leaders in making mid- to long-term decisions that will have a positive impact on their organisations and teams and help individual contributors find the practices, approaches, tools, techniques, and frameworks that can help them get a better experience at work - irrespective of where they are working from.
Understand the Trade-offs Using Compilers for Java ApplicationsC4Media
Video and slides synchronized, mp3 and slide download available at URL https://bit.ly/2QCmmJ0.
Mark Stoodley examines some of the strengths and weaknesses of the different Java compilation technologies, if one was to apply them in isolation. Stoodley discusses how production JVMs are assembling a combination of these tools that work together to provide excellent performance across the large spectrum of applications written in Java and JVM based languages. Filmed at qconsf.com.
Mark Stoodley joined IBM Canada to build Java JIT compilers for production use and led the team that delivered AOT compilation in the IBM SDK for Java 6. He spent the last five years leading the effort to open source nearly 4.3 million lines of source code from the IBM J9 Java Virtual Machine to create the two open source projects Eclipse OMR and Eclipse OpenJ9, and now co-leads both projects.
Video and slides synchronized, mp3 and slide download available at URL https://bit.ly/2y2yPiS.
Colin McCabe talks about the ongoing effort to replace the use of Zookeeper in Kafka: why they want to do it and how it will work. He discusses the limitations they have found and how Kafka benefits both in terms of stability and scalability by bringing consensus in house. He talks about their progress, what work is remaining, and how contributors can help. Filmed at qconsf.com.
Colin McCabe is a Kafka committer at Confluent, working on the scalability and extensibility of Kafka. Previously, he worked on the Hadoop Distributed Filesystem and the Ceph Filesystem.
Video and slides synchronized, mp3 and slide download available at URL https://bit.ly/2SXXXiD.
Katharina Probst talks about what it means to act like an owner and why teams need ownership to be high-performing. When team members, regardless of whether they have a formal leadership role or not, act like owners, magical things can happen. She shares ideas that we can apply to our own work, and talks about how to recognize when we don’t live up to our own expectations of acting like an owner. Filmed at qconsf.com.
Katharina Probst is a Senior Engineering Leader, Kubernetes & SaaS at Google. Before this, she was leading engineering teams at Netflix, being responsible for the Netflix API, which helps bring Netflix streaming to millions of people around the world. Prior to joining Netflix, she was in the cloud computing team at Google, where she saw cloud computing from the provider side.
Does Java Need Inline Types? What Project Valhalla Can Bring to JavaC4Media
Video and slides synchronized, mp3 and slide download available at URL https://bit.ly/2T04Lw4.
Sergey Kuksenko talks about the performance benefits inline types bring to Java and how to exploit them. Inline/value types are the key part of experimental project Valhalla, which should bring new abilities to the Java language. Filmed at qconsf.com.
Sergey Kuksenko is a Java Performance Engineer at Oracle working on a variety of Java and JVM performance enhancements. He started working as Java Engineer in 1996 and as Java Performance Engineer in 2005. He has had a passion for exploring how Java works on modern hardware.
Do you need service meshes in your tech stack?
This on-line guide aims to answer pertinent questions for software architects and technical leaders, such as: what is a service mesh?, do I need a service mesh?, how do I evaluate the different service mesh offerings? In software architecture, a service mesh is a dedicated infrastructure layer for facilitating service-to-service communications between microservices, often using a sidecar proxy.
Video and slides synchronized, mp3 and slide download available at URL https://bit.ly/2UgQ3BU.
Christie Wilson describes what to expect from CI/CD in 2019, and how Tekton is helping bring that to as many tools as possible, such as Jenkins X and Prow. Wilson talks about Tekton itself and performs a live demo that shows how cloud native CI/CD can help debug, surface and fix mistakes faster. Filmed at qconsf.com.
Christie Wilson is a software engineer at Google, currently leading the Tekton project. Over the past decade, she has worked in the mobile, financial and video game industries. Prior to working at Google she led a team of software developers to build load testing tools for AAA video game titles, and founded the Vancouver chapter of PyLadies.
Video and slides synchronized, mp3 and slide download available at URL https://bit.ly/2S7lDiS.
Sasha Rosenbaum shows how a CI/CD pipeline for Machine Learning can greatly improve both productivity and reliability. Filmed at qconsf.com.
Sasha Rosenbaum is a Program Manager on the Azure DevOps engineering team, focused on improving the alignment of the product with open source software. She is a co-organizer of the DevOps Days Chicago and the DeliveryConf conferences, and recently published a book on Serverless computing in Azure with .NET.
Video and slides synchronized, mp3 and slide download available at URL https://bit.ly/36epVKg.
Todd Montgomery discusses the techniques and lessons learned from implementing Aeron Cluster. His focus is on how Raft can be implemented on Aeron, minimizing the network round trip overhead, and comparing single process to a fully distributed cluster. Filmed at qconsf.com.
Todd Montgomery is a networking hacker who has researched, designed, and built numerous protocols, messaging-oriented middleware systems, and real-time data systems, done research for NASA, contributed to the IETF and IEEE, and co-founded two startups. He currently works as an independent consultant and is active in several open source projects.
Architectures That Scale Deep - Regaining Control in Deep SystemsC4Media
Video and slides synchronized, mp3 and slide download available at URL https://bit.ly/2FWc5Sk.
Ben Sigelman talks about "Deep Systems", their common properties and re-introduces the fundamentals of control theory from the 1960s, including the original conceptualizations of Observability & Controllability. He uses examples from Google & other companies to illustrate how deep systems have damaged people's ability to observe software, and what needs to be done in order to regain control. Filmed at qconsf.com.
Ben Sigelman is a co-founder and the CEO at LightStep, a co-creator of Dapper (Google’s distributed tracing system), and co-creator of the OpenTracing and OpenTelemetry projects (both part of the CNCF). His work and interests gravitate towards observability, especially where microservices, high transaction volumes, and large engineering organizations are involved.
ML in the Browser: Interactive Experiences with Tensorflow.jsC4Media
Video and slides synchronized, mp3 and slide download available at URL https://bit.ly/39SddUL.
Victor Dibia provides a friendly introduction to machine learning, covers concrete steps on how front-end developers can create their own ML models and deploy them as part of web applications. He discusses his experience building Handtrack.js - a library for prototyping real time hand tracking interactions in the browser. Filmed at qconsf.com.
Victor Dibia is a Research Engineer with Cloudera’s Fast Forward Labs. Prior to this, he was a Research Staff Member at the IBM TJ Watson Research Center, New York. His research interests are at the intersection of human computer interaction, computational social science, and applied AI.
Video and slides synchronized, mp3 and slide download available at URL https://bit.ly/2s9T3Vl.
Colin Eberhardt looks at some of the internals of WebAssembly, explores how it works “under the hood”, and looks at how to create a (simple) compiler that targets this runtime. Filmed at qconsf.com.
Colin Eberhardt is the Technology Director at Scott Logic, a UK-based software consultancy where they create complex application for their financial services clients. He is an avid technology enthusiast, spending his evenings contributing to open source projects, writing blog posts and learning as much as he can.
User & Device Identity for Microservices @ Netflix ScaleC4Media
Video and slides synchronized, mp3 and slide download available at URL https://bit.ly/2S9tOgy.
Satyajit Thadeshwar provides useful insights on how Netflix implemented a secure, token-agnostic, identity solution that works with services operating at a massive scale. He shares some of the lessons learned from this process, both from architectural diagrams and code. Filmed at qconsf.com.
Satyajit Thadeshwar is an engineer on the Product Edge Access Services team at Netflix, where he works on some of the most critical services focusing on user and device authentication. He has more than a decade of experience building fault-tolerant and highly scalable, distributed systems.
Video and slides synchronized, mp3 and slide download available at URL https://bit.ly/2Ezs08q.
Justin Ryan talks about Netflix’ scalability issues and some of the ways they addressed it. He shares successes they’ve had from unintuitively partitioning computation into multiple services to get better runtime characteristics. He introduces us to useful probabilistic data structures, innovative bi-directional data passing, open-source projects available from Netflix that make this all possible. Filmed at qconsf.com.
Justin Ryan is Playback Edge Engineering at Netflix. He works on some of the most critical services at Netflix, specifically focusing on user and device authentication. Years of building developer tools has also given him a healthy set of opinions on developer productivity.
Make Your Electron App Feel at Home EverywhereC4Media
Video and slides synchronized, mp3 and slide download available at URL https://bit.ly/2Z4ZJjn.
Kilian Valkhof discusses the process of making an Electron app feel at home on all three platforms: Windows, MacOS and Linux, making devs aware of the pitfalls and how to avoid them. Filmed at qconsf.com.
Kilian Valkhof is a Front-end Developer & User-experience Designer at Firstversionist. He writes about various topics, from design to machine learning, on his personal website, kilianvalkhof.com and is a frequent contributer to open source software. He is part of the Electron governance team that oversees the development of the Electron framework.
Video and slides synchronized, mp3 and slide download available at URL https://bit.ly/344PnB1.
Steve Klabnik goes over the deep details of how async/await works in Rust, covering concepts like coroutines, generators, stack-less vs stack-ful, "pinning", and more. Filmed at qconsf.com.
Steve Klabnik is on the core team of Rust, leads the documentation team, and is an author of "The Rust Programming Language." He is a frequent speaker at conferences and is a prolific open source contributor, previously working on projects such as Ruby and Ruby on Rails.
Video and slides synchronized, mp3 and slide download available at URL https://bit.ly/2OUz6dt.
Chris Riccomini talks about the current state-of-the-art in data pipelines and data warehousing, and shares some of the solutions to current problems dealing with data streaming and warehousing. Filmed at qconsf.com.
Chris Riccomini works as a Software Engineer at WePay.
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreC4Media
Video and slides synchronized, mp3 and slide download available at URL https://bit.ly/2rm4hFD.
Yevgeniy Brikman talks about how to write automated tests for infrastructure code, including the code written for use with tools such as Terraform, Docker, Packer, and Kubernetes. Topics covered include: unit tests, integration tests, end-to-end tests, dependency injection, test parallelism, retries and error handling, static analysis, property testing and CI / CD for infrastructure code. Filmed at qconsf.com.
Yevgeniy Brikman is the co-founder of Gruntwork, a company that provides DevOps as a Service. He is the author of two books published by O'Reilly Media: Hello, Startup and Terraform: Up & Running. Previously, he worked as a software engineer at LinkedIn, TripAdvisor, Cisco Systems, and Thomson Financial.
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
JMeter webinar - integration with InfluxDB and GrafanaRTTS
Watch this recorded webinar about real-time monitoring of application performance. See how to integrate Apache JMeter, the open-source leader in performance testing, with InfluxDB, the open-source time-series database, and Grafana, the open-source analytics and visualization application.
In this webinar, we will review the benefits of leveraging InfluxDB and Grafana when executing load tests and demonstrate how these tools are used to visualize performance metrics.
Length: 30 minutes
Session Overview
-------------------------------------------
During this webinar, we will cover the following topics while demonstrating the integrations of JMeter, InfluxDB and Grafana:
- What out-of-the-box solutions are available for real-time monitoring JMeter tests?
- What are the benefits of integrating InfluxDB and Grafana into the load testing stack?
- Which features are provided by Grafana?
- Demonstration of InfluxDB and Grafana using a practice web application
To view the webinar recording, go to:
https://www.rttsweb.com/jmeter-integration-webinar
Neuro-symbolic is not enough, we need neuro-*semantic*Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Connector Corner: Automate dynamic content and events by pushing a buttonDianaGray10
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
And...
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Generating a custom Ruby SDK for your web service or Rails API using Smithyg2nightmarescribd
Have you ever wanted a Ruby client API to communicate with your web service? Smithy is a protocol-agnostic language for defining services and SDKs. Smithy Ruby is an implementation of Smithy that generates a Ruby SDK using a Smithy model. In this talk, we will explore Smithy and Smithy Ruby to learn how to generate custom feature-rich SDKs that can communicate with any web service, such as a Rails JSON API.
Generating a custom Ruby SDK for your web service or Rails API using Smithy
Patterns of Streaming Applications
1. Monal Daxini
11/ 6 / 2018 @ monaldax
Patterns Of
Streaming Applications
P
O
S
A
2. InfoQ.com: News & Community Site
• Over 1,000,000 software developers, architects and CTOs read the site world-
wide every month
• 250,000 senior developers subscribe to our weekly newsletter
• Published in 4 languages (English, Chinese, Japanese and Brazilian
Portuguese)
• Post content from our QCon conferences
• 2 dedicated podcast channels: The InfoQ Podcast, with a focus on
Architecture and The Engineering Culture Podcast, with a focus on building
• 96 deep dives on innovative topics packed as downloadable emags and
minibooks
• Over 40 new content items per week
Watch the video with slide
synchronization on InfoQ.com!
https://www.infoq.com/presentations/
apache-flink-streaming-app
3. Purpose of QCon
- to empower software development by facilitating the spread of
knowledge and innovation
Strategy
- practitioner-driven conference designed for YOU: influencers of
change and innovation in your teams
- speakers and topics driving the evolution and innovation
- connecting and catalyzing the influencers and innovators
Highlights
- attended by more than 12,000 delegates since 2007
- held in 9 cities worldwide
Presented at QCon San Francisco
www.qconsf.com
4. • 4+ years building stream processing platform at Netflix
• Drove technical vision, roadmap, led implementation
• 17+ years building distributed systems
Profile
@monaldax
5. Set The Stage 8 Patterns
5 Functional 3 Non-Functional
Structure Of The Talk
Stream
Processing ?
@monaldax
6. Disclaimer
Inspired by True Events encountered building and operating a
Stream Processing platform, and use cases that are in
production or in ideation phase in the cloud.
Some code and identifying details have been changed, artistic
liberties have been taken, to protect the privacy of streaming
applications, and for sharing the know-how. Some use cases
may have been simplified.
@monaldax
14. 1. Low latency insights and analytics
2. Process unbounded data sets
3. ETL as data arrives
4. Ad-hoc analytics and Event driven applications
Why Stream Processing?
@monaldax
33. • Create ingest pipelines for different event streams declaratively
• Route events to data warehouse, data stores for analytics
• With at-least-once semantics
• Streaming ETL - Allow declarative filtering and projection
1.1 Use Case / Motivation – Ingest Pipelines
@monaldax
34. 1.1 Keystone Pipeline – A Self-serve Product
• SERVERLESS
• Turnkey – ready to use
• 100% in the cloud
• No code, Managed Code & Operations
@monaldax
35. 1.1 UI To Provision 1 Data Stream, A Filter, & 3 Sinks
41. • Popular stream / topic has high fan-out factor
• Requires large Kafka Clusters, expensive
1.2 Use Case / Motivation – Ingest large streams with high fan-out Efficiently
@monaldax
R
R
Filter
Projection
Kafka
R
TopicA Cluster1
TopicB Cluster1
Events
Producer
42. 1.2 Pattern: Configurable Co-Isolated Router
@monaldax
Co-Isolated Router
R
Filter
Projection
Kafka
TopicA Cluster1
TopicB Cluster1
Merge Routing To Same Kafka Cluster
Events
Producer
44. 2. Script UDF* Component
[Static / Dynamic]
*UDF – User Defined Function
@monaldax
45. 2. Use Case / Motivation – Configurable Business Logic Code
for operations like transformations and filtering
Managed Router / Streaming Job
Source Sink
Biz
Logic
Job DAG
@monaldax
46. 2. Pattern: Static or Dynamic Script UDF (stateless) Component
Comes with all the Pros and Cons of scripting engine
Streaming Job
Source Sink
UDF
Script Engine executes
function defined in the UI
@monaldax
47. val xscript =
new DynamicConfig("x.script")
kakfaSource
.map(new ScriptFunc>on(xscript))
.filter(new ScriptFunc>on(xsricpt2))
.addSink(new NoopSink())
2. Code Snippet: Script UDF Component
// Script Function
val sm = new ScriptEngineManager()
ScriptEngine se =
m.getEngineByName ("nashorn");
se .eval(script)
Contents configurable at runtime
@monaldax
50. 3. User Case - Generating Play Events For
Personalization And Show Discovery
@monaldax
51. @monaldax
3. Use-case: Create play events with current data from services, and lookup
table for analytics. Using lookup table keeps originating events lightweight
Playback
History Service
Video
Metadata
Play Logs
Service call
Periodically updating
lookup data
Streaming Job
Resource
Rate Limiter
52. 3. Pattern: The Enricher
- Rate limit with source or service rate limiter, or with resources
- Pull or push data, Sync / async
Streaming Job
Source Sink
Side Input
• Service call
• Lookup from Data Store
• Static or Periodically updated lookup data
@monaldax
Source / Service
Rate Limiter
53. 3. Code Snippet: The Enricher
val kafkaSource = getSourceBuilder.fromKafka("topic1").build()
val parsedMessages = kafkaSource.flatMap(parser).name(”parser")
val enrichedSessions = parsedMessages.filter(reflushFilter).name(”filter")
.map(playbackEnrichment).name(”service")
.map(dataLookup)
enrichmentSessions.addSink(sink).name("sink")
@monaldax
55. 4. Use Case – Play-Impressions Conversion Rate
@monaldax
56. • 130+ M members
• 10+ B Impressions / day
• 2.5+ B Play Events / day
• ~ 2 TB Processing State
4. Impressions And Plays Scale
@monaldax
57. Streaming Job
• # Impressions per user play
• Impression attributes leading to the play
4. Join Large Streams With Delayed, Out Of Order Events Based on
Event Time
@monaldax
P1P3
I2I1
Sink
Kafka Topics
plays
impressions
58. Understanding Event Time
Image Adapted from The Apache Beam Presentation Material
Event Time
Processing
Time 11:0010:00 15:0014:0013:0012:00
11:0010:00 15:0014:0013:0012:00
Input
Output
1 hour Window
59. impressions
Streaming Job
keyBy F1
4. Use Case: Join Impressions And Plays Stream On Event Time
Co-process
@monaldax
Kafka Topics
keyBy F2playsP2 K
Keyed State
I1 K
I2 K
Merge I2 P2
& Emit
Merge I2 P2
& Emit
60. Source 1
Streaming Job
keyBy F1
Co-process
State 1
State 2
@monaldax
keyBy F2Source 2
Keyed State
Sink
4. Pattern: The Co-process Joiner
• Process and Coalesce events for each stream grouped by same key
• Join if there is a match, evict when joined or timed out
61. 4. Code Snippet – The Co-process Joiner, Setup sources
env.setStreamTimeCharacteristic(EventTime)
val impressionSource = kafkaSrc1
.filter(eventTypeFilter)
.flatMap(impressionParser)
.keyBy(in => (s"${profile_id}_${title_id}"))
val impressionSource = kafkaSrc2
.flatMap(playbackParser)
.keyBy(in => (s"${profile_id}_${title_id}"))
@monaldax
62. env.setStreamTimeCharacteristic(EventTime)
val impressionSource = kafkaSrc1.filter(eventTypeFilter)
.flatMap(impressionParser)
.assignTimestampsAndWatermarks(
new BoundedOutOfOrdernessTimestampExtractor(Time.seconds(10) ) {...})
.keyBy(in => (s"${profile_id}_${title_id}"))
val impressionSource = kafkaSrc2.flatMap(playbackParser)
.assignTimestampsAndWatermarks(
new BoundedOutOfOrdernessTimestampExtractor(Time.seconds(10) ) {...})
.keyBy(in => (s"${profile_id}_${title_id}"))
4. Code Snippet – The Co-process Joiner, Setup sources
63. 4. Code Snippet – The Co-process Joiner, Connect Streams
// Connect
impressionSource.connect(playSessionSource)
.process( new CoprocessImpressionsPlays())
.addSink(kafkaSink)
@monaldax
64. 4. Code Snippet – The Co-process Joiner, Co-process Function
class CoprocessJoin extends CoProcessFunction {
override def processElement1(value, context, collector) {
… // update and reduce state, join with stream 2, set timer
}
override def processElement2(value, context, collector) {
… // update and reduce state, join with stream 2, set timer
}
override def onTimer(timestamp, context, collector) {
… // clear up state based on event time
}
@monaldax
66. 5. Use-case: Publish movie assets CDN location to Cache, to steer
clients to the closest location for playback
EvCache
asset1 OCA1, CA2
asset2 OCA1, CA3
Playback Assets
Service
Service
asset_1 added asset_8 deleted
OCA
OCA1
OCA
OCAN
Open Connect
Appliances (CDN)
Upload asset_1 Delete asset_8
Streaming Job
Source1
Generates Events to
publish all assets
asset1 OCA1, CA2….
asset2 OCA1, CA3 …
@monaldax
Materialized View
67. 5. Use-case: Publish movie assets CDN location to Cache, to steer
clients to the closest location for playback
Streaming Job
Event Publisher
Optional Trigger
to flush view
Materialized View
@monaldax
Materialized View
Sink
68. 5. Code Snippet – Setting up Sources
val fullPublishSource = env .addSource(new FullPublishSourceFunc7on(),
TypeInfoParser.parse("Tuple3<String, Integer, com.ne7lix.AMUpdate>"))
.setParallelism(1);
val kaCaSource = getSourceBuilder().fromKaCa("am_source")
@monaldax
69. 5. Code Snippet – Union Source & Processing
val kafkaSource
.flatMap(new FlatmapFunction())) //split by movie
.assignTimestampsAndWatermarks(new AssignerWithPunctuatedWatermarks[...]())
.union(fullPublishSource) // union with full publish source
.keyBy(0, 1) // (cdn stack, movie)
.process(new UpdateFunction())) // update in-memory state, output at intervals.
.keyBy(0, 1) // (cdn stack, movie)
.process(new PublishToEvCacheFunction())); // publish to evcache
@monaldax
72. 6 Elastic Dev Interface
Spectrum Of Ease, Capability, & Flexibility
• Point & Click, with UDFs
• SQL with UDFs
• Annotation based API with code generation
• Code with reusable components
• (e.g., Data Hygiene, Script Transformer)
EaseofUse
Capability
76. 8. Use Case - Restate Results Due To Outage Or Bug In Business Logic
@monaldax
Time
Checkpoint y Checkpoint x
outage
Checkpoint x+1
Now
77. Time
8. Pattern: Rewind And Restatement
Rewind the source and state to a know good state
@monaldax
Checkpoint y Checkpoint x
outage
Checkpoint x+1
Now