This document discusses approaches for streaming data processing and reasoning about time in streams. It summarizes the limitations of the Lambda architecture and argues that streaming systems alone can provide low-latency and exactly-once processing if they support strong consistency, windowing, and watermark-based triggers. The document also presents Google Cloud Dataflow as a streaming data processing system that provides these capabilities through its aggregation, windowing, and triggers APIs to allow flexible reasoning about event and processing times.
Samza: Real-time Stream Processing at LinkedIn (C4Media)
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1eGbVJv.
Chris Riccomini discusses: Samza's feature set, how Samza integrates with YARN and Kafka, how it's used at LinkedIn, and what's next on the roadmap. Filmed at qconsf.com.
Chris Riccomini is a Staff Software Engineer at LinkedIn, where he is currently working as a committer and PMC member for Apache Samza. He's been involved in a wide range of projects at LinkedIn, including "People You May Know", REST.li, Hadoop, engineering tooling, and OLAP systems. Prior to LinkedIn, he worked on data visualization and fraud modeling at PayPal.
Extending the Yahoo Streaming Benchmark (Jamie Grier)
This presentation describes my own benchmarking of Apache Storm and Apache Flink, based on the work started by Yahoo!. It shows the incredible performance of Apache Flink.
Apache Flink: Streaming Done Right @ FOSDEM 2016 (Till Rohrmann)
The talk I gave at FOSDEM 2016 on the 31st of January.
The talk explains how we can do stateful stream processing with Apache Flink, using the example of counting tweet impressions. It covers Flink's windowing semantics, stateful operators, fault tolerance, and performance numbers. The talk ends with an outlook on what is going to happen in the next couple of months.
Flink Forward Berlin 2017: Kostas Kloudas - Complex Event Processing with Fli... (Flink Forward)
Pattern matching over event streams is increasingly being employed in many areas including financial services and click stream analysis. Flink, as a true stream processing engine, emerges as a natural candidate for these use cases. In this talk, we will present FlinkCEP, a library for Complex Event Processing (CEP) based on Flink. At the conceptual level, we will see the different patterns the library can support, we will present the main building blocks we implemented to support them, and we will discuss possible future additions that will further enhance the coverage of the library. At the practical level, we will show how the integration of FlinkCEP with Flink allows the former to take advantage of Flink's rich ecosystem (e.g. connectors) and its stream processing capabilities, such as support for event-time processing, exactly-once state semantics, fault-tolerance, savepoints and high throughput.
Webhooks do's and dont's: what we learned after integrating +100 APIs - Giuli... (Codemotion)
Modern applications are increasingly oriented toward being a composition of APIs and having a serverless architecture, which is why API developers cannot limit themselves to exposing the most common REST endpoints. Webhooks cannot be missing from a modern API, yet there is nothing in the HTTP API literature that comes close to a standard format for designing them, which has given rise to the most disparate implementations. After integrating more than 100 APIs with Stamplay, we share the pros and cons of the design choices made when developing webhooks.
(DVO204) Monitoring Strategies: Finding Signal in the Noise (Amazon Web Services)
"You need to monitor only a few machines and applications before fixing issues in your environment becomes very complicated. Throw in the type of dynamic infrastructure provided by Amazon EC2, and your static monitoring strategies will most likely not scale. Knowing which metrics to watch and how to troubleshoot based on those metrics will help you solve problems more quickly. In this session, we will look at a framework for your metrics and how to use it to find solutions to the issues that come up. We will cover the three types of monitoring data; what to collect; what should trigger an alert (avoiding an alert storm); and how to follow the resources to find the root causes of problems. Session sponsored by Datadog.
"
Dave Klein, Confluent, Developer Advocate
Apache Kafka is the core of an amazing ecosystem of tools and frameworks that enable us to get more value from our data. In this session we'll have a gentle introduction to Apache Kafka and a survey of some of the more popular components in the Kafka ecosystem.
https://www.meetup.com/KafkaBayArea/events/276592389/
Aljoscha Krettek - The Future of Apache Flink (Flink Forward)
http://flink-forward.org/kb_sessions/the-future-of-apache-flinktm/
In this session we will first have a look at the current state of Apache Flink before diving into some of the upcoming features that are either already in development or still in the design phase. Some of the features currently in development that we are going to cover are: – Dynamic Scaling: Adapting a running program to changing workloads. – Queryable State: External querying of internal Flink state. This has the power to replace key/value stores by turning Flink into a key value store that allows for up to date querying of results. – Side Inputs: Having additional data that evolves over time as input to a stream operation. For the glimpse at the far-off future of Apache Flink™ we dare not make any predictions yet. In the session we will look at the latest whisperings and see what the community is currently thinking up as solutions to existing problems and predicted future challenges in the stream processing space.
Building a real time Tweet map with Flink in six weeks (Matthias Kricke)
In this talk we present OSTMap, a tool which was built by 6 students over the course of 6 weeks. Each student spent as little as 5-10 hours per week and had no prior experience with big data or the frameworks used. We also present the concept of geotemporal indices for our use case.
Data Stream Analytics - Why they are important (Paris Carbone)
Streaming is cool and it can help us do quick analytics and make a profit, but what about tsunamis? This is a motivation talk presented at the SeRC Big Data Workshop in Sweden during spring 2016. It motivates the streaming paradigm and provides examples with Apache Flink.
This talk is an application-driven walkthrough of modern stream processing, exemplified by Apache Flink, and of how it enables new applications and makes old applications easier and more efficient. In this talk, we will walk through several real-world stream processing application scenarios of Apache Flink, highlighting unique features in Flink that make these applications possible. In particular, we will see (1) how support for handling out-of-order streams enables real-time monitoring of cloud infrastructure, (2) how the ability to handle high-volume data streams with low latency SLAs enables real-time alerts in network equipment, (3) how the combination of high throughput and the ability to handle batch as a special case of streaming enables an architecture where exactly the same program is used for real-time and historical data processing, and (4) how stateful stream processing can enable an architecture that eliminates the need for an external database store, leading to more than 100x performance speedup, among many other benefits.
Zoltán Zvara - Advanced visualization of Flink and Spark jobs (Flink Forward)
http://flink-forward.org/kb_sessions/advanced-visualization-of-flink-and-spark-jobs/
Understanding the physical plan of a big data application is often crucial for tracking down bottlenecks and faulty behavior. Although Flink and Spark offer useful Web UI components for monitoring and understanding the logical plan of jobs, both lack a tool that helps to understand the physical plan of the scheduler and the ability to monitor execution at a very low level, along with the communication that occurs between parallel vertex instances. We propose a tool that allows users to monitor job executions in real time, and later to replay and examine them, on any cluster currently supported by Flink or Spark. The tool also offers monitoring of the distribution of keys in a data stream, which can help optimize data partitioning across parallel subtasks in the future.
Extending the Stream/Table Duality into a Trinity, with Graphs (David Allen &...) (confluent)
The stream/table duality in Kafka lets us look at our data in two different ways, whichever is more convenient for our use. But what about when the connections between the data points add much more value to our data? For this, we need to look at our data as a graph. Graphs help drive financial fraud investigations, social media analyses, network & IT management use cases, recommendation engines, and knowledge management. These are all cases where patterns of interaction in your data (for example, a pattern of structured financial transactions) matter more than the individual data points (a single transfer). We'll cover how to easily transform Kafka streams or tables into graphs, and query them declaratively using Cypher or GraphQL. In graph shape, we can enrich our social network streams with powerful graph algorithms that tell us about user and event influence through graph centrality, then stream the results back to Kafka. Stream/table duality becomes the stream/table/graph trinity. We will demonstrate the trinity by: getting started with regular Kafka streams, using Confluent Hub's Neo4j sink, exposing query-able graphs with Cypher & GraphQL, analyzing data with Neo4j's graph algorithms, and transforming graphs back into streams. The trinity means not choosing between representations, but using the best one for your use case. We'll demonstrate how it can be used to tackle social network analysis problems and discuss how the approach can be extended to real-time financial fraud detection and more.
Ted Dunning - Faster and Furiouser: Flink Drift (Flink Forward)
http://flink-forward.org/kb_sessions/faster-and-furiouser-flink-drift/
Not long ago, we had the opportunity to test Apache Flink to see just how fast it would go on a moderately realistic task with fast hardware and with a good streaming transport layer underneath. Our goal was not so much careful comparison with other software, but flat-out speed, Flink against Flink. In the process, we learned a lot about what it takes to go fast. Some of the lessons were ones that we had “learned” a number of times before: – the bottleneck isn’t where you thought it was – copying data is expensive – context switches are expensive – measure twice, cut once But there were some real surprises along the way. The really important knobs weren’t quite what people say you should turn. One of the biggest surprises was the degree to which high performance libraries have threading built into them, which makes the actual concurrency much higher than the apparent concurrency. The result was that at least one cluster parameter needed to be adjusted by 30x to get real
Have Your Cake and Eat It Too -- Further Dispelling the Myths of the Lambda A... (C4Media)
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/15ACXCw.
Tyler Akidau from Google demonstrates Google's MillWheel, a streaming system that promises low latency, strong consistency, and flexibility without relying on the Lambda Architecture. Filmed at qconsf.com.
Tyler Akidau is a Senior Software Engineer at Google. The current Tech Lead for the MillWheel team, he’s spent five years working on massive-scale streaming data processing systems.
AI-Powered Streaming Analytics for Real-Time Customer Experience (Databricks)
Interacting with customers in the moment and in a relevant, meaningful way can be challenging to organizations faced with hundreds of various data sources at the edge, on-premises, and in multiple clouds.
To capitalize on real-time customer data, you need a data management infrastructure that allows you to do three things:
1) Sense: capture event data and stream data from a source, e.g. social media, web logs, machine logs, IoT sensors.
2) Reason: automatically combine and process this data with existing data for context.
3) Act: respond appropriately in a reliable, timely, consistent way. In this session we’ll describe and demo an AI-powered streaming solution that can tackle the entire end-to-end sense-reason-act process at any latency (real-time, streaming, and batch) using Spark Structured Streaming.
The solution uses AI (e.g. A* and NLP for data structure inference and machine learning algorithms for ETL transform recommendations) and metadata to automate data management processes (e.g. parse, ingest, integrate, and cleanse dynamic and complex structured and unstructured data) and guide user behavior for real-time streaming analytics. It’s built on Spark Structured Streaming to take advantage of unified APIs, multi-latency and event time-based processing, out-of-order data delivery, and other capabilities.
You will gain a clear understanding of how to use Spark Structured Streaming for data engineering using an intelligent data streaming solution that unifies fast-lane data streaming and batch lane data processing to deliver in-the-moment next best actions that improve customer experience.
Advanced Stream Processing with Flink and Pulsar - Pulsar Summit NA 2021 Keynote (StreamNative)
In this talk, Till Rohrmann and Addison Higham discuss how Flink allows for ambitious stream processing workflows and how Pulsar and Flink enable new capabilities that push forward the state-of-the-art in streaming. They will also share upcoming features and new capabilities in the integrations between Flink and Pulsar and how these two communities are working together to truly advance the power of stream processing.
Cloud Dataflow - A Unified Model for Batch and Streaming Data Processing (DoiT International)
Dataflow is a unified programming model and a managed service for developing and executing a wide range of data processing patterns including ETL, batch computation, and continuous computation. Cloud Dataflow frees you from operational tasks like resource management and performance optimization.
Flink Forward San Francisco 2019: Real-time Processing with Flink for Machine... (Flink Forward)
Real-time Processing with Flink for Machine Learning at Netflix
Machine learning plays a critical role in providing a great Netflix member experience. It is used to drive many parts of the site including video recommendations, search results ranking, and selection of artwork images. Providing high-fidelity, near real-time data is increasingly important for these machine learning pipelines, especially as multi-armed bandit and reinforcement learning techniques, in addition to more "traditional" supervised learning, become more prevalent. With access to this data, models are able to converge more quickly, features can be updated more frequently, and analysis can be done in a more timely manner.
In this talk, we will focus on the practical details of leveraging Flink to process trillions of events per day, work with the time dimension, and manage large and frequently-changing state. We will discuss different processing schemes and dataflows, scalability and resiliency challenges we tackled, operational considerations, and instrumentation we added for monitoring job health in production.
Independent of the source of data, the integration of event streams into an enterprise architecture is getting more and more important in the world of sensors, social media streams and the Internet of Things. Events have to be accepted quickly and reliably, and they have to be distributed and analysed, often with many consumers or systems interested in all or part of the events. Storing such huge event streams into HDFS or a NoSQL datastore is feasible and not such a challenge anymore. But if you want to be able to react fast, with minimal latency, you cannot afford to first store the data and do the analysis/analytics later. You have to be able to include part of your analytics right after you consume the data streams. Products for doing event processing, such as Oracle Event Processing or Esper, have been available for quite a long time and used to be called Complex Event Processing (CEP). In the past few years, another family of products has appeared, mostly out of the big data technology space, called Stream Processing or Streaming Analytics. These are mostly open source products/frameworks such as Apache Storm, Spark Streaming, Flink and Kafka Streams, as well as supporting infrastructures such as Apache Kafka. In this talk I will present the theoretical foundations of stream processing, discuss the core properties a stream processing platform should provide, and highlight the differences you might find between the more traditional CEP and the more modern stream processing solutions.
The Rise of Streaming SQL and Evolution of Streaming Applications (Srinath Perera)
First-generation stream processors, such as Apache Storm, wanted us to write code. It was a great start. However, when building real-world apps, which are used for a long time and evolve, writing code gets us into trouble.
If we want to query a database or query data stored in Hadoop, we use SQL. Why can't we query streaming data using SQL? We can. Almost all open source stream processors, including Storm, Flink, and Kafka, have adopted SQL.
In this webinar, Srinath will talk about the evolution of stream processing, streaming SQL, the status quo, and what this means to stream applications. He will also dissect the experience of building streaming applications by exploring common patterns and pitfalls.
Il tempo vola: rappresentare e manipolare sequenze di eventi e time series co... (Codemotion)
Representing the passage of time is not a simple task, especially with "traditional" tools. Yet the temporal dimension is fundamental in a thousand different contexts, from statistical analysis to representing cause-and-effect relationships, from forecasting to automatic control. In this talk we will see how to make the best use of OrientDB, a document-graph database, for storing, processing, and querying this kind of information.
DBA Fundamentals Group: Continuous SQL with Kafka and Flink (Timothy Spann)
DBA Fundamentals Group: Continuous SQL with Kafka and Flink
20-Feb-2024
In this talk, I will walk through how someone can set up and run continuous SQL queries against Kafka topics utilizing Apache Flink. We will walk through creating Kafka topics, schemas, and publishing data.
We will then cover consuming Kafka data, joining Kafka topics, and inserting new events into Kafka topics as they arrive. This basic overview will show hands-on techniques, tips, and examples of how to do this.
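As a rough sketch of what such a continuous query can look like, here is a hypothetical Java snippet using Flink's Table API with the Kafka SQL connector. The topic names, fields, and broker address are invented for illustration, and the connector options are the commonly documented ones rather than anything taken from this talk.

// Hypothetical sketch: continuous SQL over Kafka topics with Flink's Table API.
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class ContinuousKafkaSqlSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // Source table backed by a Kafka topic of raw click events (schema is made up).
        tEnv.executeSql(
            "CREATE TABLE clicks (" +
            "  user_id STRING, url STRING, ts TIMESTAMP(3)," +
            "  WATERMARK FOR ts AS ts - INTERVAL '5' SECOND" +
            ") WITH (" +
            "  'connector' = 'kafka', 'topic' = 'clicks'," +
            "  'properties.bootstrap.servers' = 'localhost:9092'," +
            "  'scan.startup.mode' = 'earliest-offset', 'format' = 'json')");

        // Sink table backed by another Kafka topic.
        tEnv.executeSql(
            "CREATE TABLE clicks_per_minute (" +
            "  user_id STRING, window_end TIMESTAMP(3), cnt BIGINT" +
            ") WITH (" +
            "  'connector' = 'kafka', 'topic' = 'clicks-per-minute'," +
            "  'properties.bootstrap.servers' = 'localhost:9092', 'format' = 'json')");

        // Continuous query: new results flow into the sink topic as events arrive.
        tEnv.executeSql(
            "INSERT INTO clicks_per_minute " +
            "SELECT user_id, TUMBLE_END(ts, INTERVAL '1' MINUTE), COUNT(*) " +
            "FROM clicks GROUP BY user_id, TUMBLE(ts, INTERVAL '1' MINUTE)");
    }
}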
Tim Spann
Tim Spann is the Principal Developer Advocate for Data in Motion @ Cloudera where he works with Apache Kafka, Apache Flink, Apache NiFi, Apache Iceberg, TensorFlow, Apache Spark, big data, the IoT, machine learning, and deep learning. Tim has over a decade of experience with the IoT, big data, distributed computing, streaming technologies, and Java programming. Previously, he was a Developer Advocate at StreamNative, Principal Field Engineer at Cloudera, a Senior Solutions Architect at AirisData and a senior field engineer at Pivotal. He blogs for DZone, where he is the Big Data Zone leader, and runs a popular meetup in Princeton on big data, the IoT, deep learning, streaming, NiFi, the blockchain, and Spark. Tim is a frequent speaker at conferences such as IoT Fusion, Strata, ApacheCon, Data Works Summit Berlin, DataWorks Summit Sydney, and Oracle Code NYC. He holds a BS and MS in computer science.
Dataflow - A Unified Model for Batch and Streaming Data Processing (DoiT International)
Batch and Streaming Data Processing and Vizualize 300Tb in 5 Seconds meetup on April 18th, 2016 (http://www.meetup.com/Big-things-are-happening-here/events/229532500)
This session takes an in-depth look at:
- Trends in stream processing
- How streaming SQL has become a standard
- The advantages of Streaming SQL
- Ease of development with streaming SQL: Graphical and Streaming SQL query editors
- Business value of streaming SQL and its related tools: Domain-specific UIs
- Scalable deployment of streaming SQL: Distributed processing
Streaming SQL to unify batch and stream processing: Theory and practice with ... (Fabian Hueske)
SQL is the lingua franca for querying and processing data. To this day, it provides non-programmers with a powerful tool for analyzing and manipulating data. But with the emergence of stream processing as a core technology for data infrastructures, can you still use SQL and bring real-time data analysis to a broader audience?
The answer is yes, you can. SQL fits into the streaming world very well and forms an intuitive and powerful abstraction for streaming analytics. More importantly, you can use SQL as an abstraction to unify batch and streaming data processing. Viewing streams as dynamic tables, you can obtain consistent results from SQL evaluated over static tables and streams alike and use SQL to build materialized views as a data integration tool.
Fabian Hueske and Shuyi Chen explore SQL’s role in the world of streaming data and its implementation in Apache Flink and cover fundamental concepts, such as streaming semantics, event time, and incremental results. They also share their experience using Flink SQL in production at Uber, explaining how Uber leverages Flink SQL to solve its unique business challenges and how the unified stream and batch processing platform enables both technical and nontechnical users to process real-time and batch data reliably using the same SQL at Uber scale.
This slide deck explores trends in stream processing, how streaming SQL has become a standard, the advantages of streaming SQL and more.
View video: https://wso2.com/library/conference/2018/07/wso2con-usa-2018-the-rise-of-streaming-sql/
4. MillWheel - Slava Chernyak, Josh Haberman, Reuven Lax, Daniel Mills, Paul Nordstrom, Sam McVeety, Sam Whittle, and more...
Streaming Flume - Robert Bradshaw, Daniel Mills, and more...
Cloud Dataflow - Robert Bradshaw, Craig Chambers, Reuven Lax, Daniel Mills, Frances Perry, and more...
15. Why consistency is important
• Mostly correct is not good enough
• Required for exactly-once processing
• Required for repeatable results
• Cannot replace batch without it
30. 1. Time-Agnostic Processing - Filters
[Figure: example events on a stream-time axis, 10:00-16:00]
Example Input: Web server traffic logs
Example Output: All traffic from specific domains
Pros: Straightforward; Efficient
Cons: Limited utility
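To make the "time-agnostic" idea concrete, here is a tiny, hypothetical plain-Java sketch of such a filter (the record type and domain list are invented): every record is kept or dropped on its own, so neither event time nor arrival time matters.

// Time-agnostic filtering: keep only log records from specific domains.
import java.util.Set;
import java.util.stream.Stream;

record LogRecord(String domain, String path, long eventTimeMillis) {}

public class DomainFilter {
    static final Set<String> WANTED = Set.of("example.com", "example.org");

    // No buffering, windows, or timestamps needed; each record is independent.
    static Stream<LogRecord> filter(Stream<LogRecord> logs) {
        return logs.filter(r -> WANTED.contains(r.domain()));
    }
}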
31. 1. Time-Agnostic Processing - Hash Join
[Figure: example events on a stream-time axis, 10:00-16:00]
Example Input: Query & Click traffic
Example Output: Joined stream of Query + Click pairs
Pros: Straightforward; Efficient
Cons: Limited utility
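The hash join on this slide can be sketched in the same time-agnostic style. The following hypothetical plain-Java illustration (not the system's actual implementation) matches interleaved query and click events by a shared ID; unmatched events are buffered forever, which is exactly why the pattern has limited utility without windowing.

// Time-agnostic streaming hash join of query and click events on queryId.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

record Query(String queryId, String text) {}
record Click(String queryId, String url) {}
record QueryClick(Query query, Click click) {}

public class StreamingHashJoin {
    private final Map<String, Query> pendingQueries = new HashMap<>();
    private final Map<String, List<Click>> pendingClicks = new HashMap<>();

    // A query arrives: emit joins with any clicks buffered for its id.
    public List<QueryClick> onQuery(Query q) {
        pendingQueries.put(q.queryId(), q);
        List<QueryClick> out = new ArrayList<>();
        for (Click c : pendingClicks.getOrDefault(q.queryId(), List.of())) {
            out.add(new QueryClick(q, c));
        }
        pendingClicks.remove(q.queryId());
        return out;
    }

    // A click arrives: join immediately if the query is known, otherwise buffer it.
    public List<QueryClick> onClick(Click c) {
        Query q = pendingQueries.get(c.queryId());
        if (q != null) {
            return List.of(new QueryClick(q, c));
        }
        pendingClicks.computeIfAbsent(c.queryId(), k -> new ArrayList<>()).add(c);
        return List.of();
    }
}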
32. 2. Approximation via Online Algorithms
[Figure: example events on a stream-time axis, 10:00-16:00]
Example Input: Twitter hashtags
Example Output: Approximate top N hashtags per prefix
Pros: Efficient
Cons: Inexact; Complicated algorithms
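One classic online algorithm for this kind of approximate top-N problem is Misra-Gries ("frequent items"). The sketch below is a minimal, hypothetical Java illustration, not something taken from the talk; it shows the trade-off the slide describes: bounded memory and a single pass, but counts that are only estimates.

// Misra-Gries heavy-hitters sketch: approximate the most frequent items in a stream
// using at most k-1 counters. Returned counts are under-estimates.
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

public class MisraGries {
    private final int k;
    private final Map<String, Long> counters = new HashMap<>();

    public MisraGries(int k) { this.k = k; }

    public void add(String item) {
        if (counters.containsKey(item)) {
            counters.merge(item, 1L, Long::sum);
        } else if (counters.size() < k - 1) {
            counters.put(item, 1L);
        } else {
            // No room: decrement every counter and drop those that reach zero.
            Iterator<Map.Entry<String, Long>> it = counters.entrySet().iterator();
            while (it.hasNext()) {
                Map.Entry<String, Long> e = it.next();
                if (e.getValue() == 1L) it.remove(); else e.setValue(e.getValue() - 1);
            }
        }
    }

    public Map<String, Long> estimates() { return Map.copyOf(counters); }
}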
33. 3. Windowing by Stream Time
[Figure: example events on a stream-time axis, 10:00-16:00]
Example Input: Web server request traffic
Example Output: Per-minute rate of received requests
Pros: Straightforward; Results reflect contents of the stream
Cons: Results don't reflect events as they happened; If approximating event time, usefulness varies
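A per-minute request rate keyed by arrival time can be sketched as follows (hypothetical plain Java); the buckets reflect when records reached the pipeline, not when the requests actually happened, which is both the pro and the con listed above.

// Windowing by stream (processing) time: count requests per minute of arrival.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ProcessingTimeWindows {
    // Minute of arrival (wall clock) -> request count.
    private final Map<Long, Long> countsPerMinute = new ConcurrentHashMap<>();

    public void onRequest(String requestLine) {
        long arrivalMinute = System.currentTimeMillis() / 60_000; // processing time, not event time
        countsPerMinute.merge(arrivalMinute, 1L, Long::sum);
    }

    public Map<Long, Long> snapshot() { return Map.copyOf(countsPerMinute); }
}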
34. 4. Windowing by Event Time - Fixed Windows
[Figure: example events plotted by event time vs. stream time, 10:00-16:00]
Example Input: Twitter hashtags
Example Output: Top N hashtags by prefix per hour
Pros: Reflects events as they occurred
Cons: More complicated buffering; Completeness issues
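Bucketing by event time instead looks like this hypothetical sketch: records are assigned to hourly windows based on the timestamp they carry, which requires buffering per window and leaves open the completeness question of when a window can be considered done.

// Windowing by event time: assign each record to the hour in which it occurred.
// Deciding when an hour is "complete" (the watermark problem) is deliberately not shown.
import java.util.HashMap;
import java.util.Map;

record Hashtag(String tag, long eventTimeMillis) {}

public class EventTimeFixedWindows {
    // Hour of event time -> (tag -> count). A late record simply updates an old window.
    private final Map<Long, Map<String, Long>> windows = new HashMap<>();

    public void onHashtag(Hashtag h) {
        long hourWindow = h.eventTimeMillis() / 3_600_000;
        windows.computeIfAbsent(hourWindow, k -> new HashMap<>())
               .merge(h.tag(), 1L, Long::sum);
    }

    public Map<String, Long> windowContents(long hourWindow) {
        return windows.getOrDefault(hourWindow, Map.of());
    }
}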
35. 4. Windowing by Event Time - Sessions
[Figure: example events plotted by event time vs. stream time, 10:00-16:00]
Example Input: User activity stream
Example Output: Per-session group of activities
Pros: Reflects events as they occurred
Cons: More complicated buffering; Completeness issues
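Session windows can be sketched as grouping a user's activities whenever the gap between consecutive event times stays below a threshold. This is again a hypothetical plain-Java illustration of the idea, operating on an already-sorted list; a real streaming implementation must also merge sessions as out-of-order events arrive.

// Session windowing by event time: split a user's activities into sessions separated
// by inactivity gaps longer than GAP_MILLIS.
import java.util.ArrayList;
import java.util.List;

record Activity(String userId, long eventTimeMillis) {}

public class SessionWindows {
    static final long GAP_MILLIS = 60_000; // one-minute inactivity gap

    static List<List<Activity>> sessionize(List<Activity> sortedByEventTime) {
        List<List<Activity>> sessions = new ArrayList<>();
        List<Activity> current = new ArrayList<>();
        long lastTs = Long.MIN_VALUE;
        for (Activity a : sortedByEventTime) {
            if (!current.isEmpty() && a.eventTimeMillis() - lastTs > GAP_MILLIS) {
                sessions.add(current);          // gap exceeded: close the current session
                current = new ArrayList<>();
            }
            current.add(a);
            lastTs = a.eventTimeMillis();
        }
        if (!current.isEmpty()) sessions.add(current);
        return sessions;
    }
}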
53. Triggers API
PCollection<KV<String, Long>> sums = Pipeline
    .begin()
    .read("userRequests")
    .apply(Window.into(new FixedWindows(2, MINUTE))
        .trigger(new SequenceOf(
            // Early firings: repeat a one-minute periodic trigger until the watermark passes.
            new RepeatUntil(
                new AtPeriod(1, MINUTE),
                new AtWatermark()),
            // On-time firing when the watermark reaches the end of the window.
            new AtWatermark(),
            // Late firings: emit per late element, until 14 days of event time have passed.
            new RepeatUntil(
                new AfterCount(1),
                new AfterDelay(14, DAYS, TimeDomain.EVENT_TIME)))))
    .apply(new Sum());
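For comparison, roughly the same trigger strategy can be expressed with today's Apache Beam API, the open-source descendant of the Dataflow SDK shown above. This is a hedged sketch using Beam's documented trigger builders, not code from the talk; the early firings are approximated with a processing-time delay after the first element in each pane.

// Approximate Apache Beam equivalent: early firings about once a minute before the
// watermark, an on-time firing at the watermark, then one firing per late element,
// with 14 days of allowed lateness in event time.
import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.joda.time.Duration;

public class TriggerSketch {
    static Window<KV<String, Long>> twoMinuteSums() {
        return Window.<KV<String, Long>>into(FixedWindows.of(Duration.standardMinutes(2)))
            .triggering(AfterWatermark.pastEndOfWindow()
                .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                    .plusDelayOf(Duration.standardMinutes(1)))
                .withLateFirings(AfterPane.elementCountAtLeast(1)))
            .withAllowedLateness(Duration.standardDays(14))
            .accumulatingFiredPanes();
    }
}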
54. Lambda vs Streaming
Low-latency, approximate results
Complete, correct results as soon as possible
Ability to deal with changes upstream
56. Triggers API
PCollection<KV<String, Long>> sums = Pipeline
    .begin()
    .read("userRequests")
    .apply(Window.into(new Sessions(1, MINUTE))
        .trigger(new SequenceOf(
            // Same trigger as before, now applied to session windows with a one-minute gap.
            new RepeatUntil(
                new AtPeriod(1, MINUTE),
                new AtWatermark()),
            new AtWatermark(),
            new RepeatUntil(
                new AfterCount(1),
                new AfterDelay(14, DAYS, TimeDomain.EVENT_TIME)))))
    .apply(new Sum());
58. Summary
Lambda is great
Streaming by itself is better :-)
Strong Consistency = Correctness
Streaming = Aggregation + Windowing + Triggers
Tools For Reasoning About Time = Power + Flexibility