While “software is eating the world”, those best able to manage the huge mass of data will come out on top.
The batch processing model has been faithfully serving us for decades. However, it might have reached the end of its usefulness for all but some very specific use-cases. As the pace of business increases, decision makers most often prefer slightly wrong data sooner than 100% accurate data later. Stream processing - or data streaming - exactly matches this usage: instead of managing the entire bulk of data, process pieces of it as soon as they become available.
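To make the contrast concrete, here is a minimal sketch of such a streaming pipeline using Hazelcast Jet, the platform used in the demo (the 4.x API is assumed); the test source, rate, and filter are purely illustrative, not the talk's actual code:

    import com.hazelcast.jet.Jet;
    import com.hazelcast.jet.JetInstance;
    import com.hazelcast.jet.pipeline.Pipeline;
    import com.hazelcast.jet.pipeline.Sinks;
    import com.hazelcast.jet.pipeline.test.TestSources;

    public class StreamingHelloWorld {
        public static void main(String[] args) {
            // Each event is handled the moment it arrives, not in a nightly batch
            Pipeline pipeline = Pipeline.create();
            pipeline.readFrom(TestSources.itemStream(10))       // 10 synthetic events/second
                    .withIngestionTimestamps()                  // attach timestamps on ingestion
                    .filter(event -> event.sequence() % 2 == 0) // illustrative transformation
                    .writeTo(Sinks.logger());

            JetInstance jet = Jet.newJetInstance();
            jet.newJob(pipeline).join();                        // submit and run the job
        }
    }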
In this talk, I’ll define the context in which the old batch processing model was born, the reasons behind the new stream processing one, how they compare, what their pros and cons are, and a list of existing technologies implementing the latter with their most prominent characteristics. I’ll conclude by describing in detail one possible use-case of data streaming that is not possible with batches: displaying in (near) real-time all trains in Switzerland and their position on a map. I’ll go through all the requirements and the design. Finally, using an OpenData endpoint and the Hazelcast platform, I’ll try to impress attendees with a working demo implementation of it.
SCALE - Stream processing and Open Data, a match made in Heaven - Nicolas Fränkel
While “software is eating the world”, those best able to manage the huge mass of data will come out on top.
Some countries in Europe understand the potential of existing data that sits behind closed fences, and have passed laws to make this data available to everyone.
On the other hand, the batch processing model is becoming more and more obsolete: users want the information as soon as possible. While there’s a trade-off between the correctness of data and its speed of delivery, most business decisions do not rely on 100% correct data.
In this talk, I’ll explain how one can leverage the data related to public transportation in Switzerland to display it in (near) real-time on a map.
The document discusses stream processing and provides an overview of Hazelcast Jet. It begins by explaining why streaming is useful and describes different streaming approaches like event-driven programming. It then provides details on Hazelcast Jet, including its concepts of pipelines and jobs. The document also discusses open data standards like GTFS and demonstrates a sample streaming pipeline that enriches public transportation data from open APIs.
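As a sketch of the enrichment idea the deck demonstrates, a Jet pipeline can look up reference data (e.g. a GTFS route name) from a distributed map while position updates stream through; the map names and join logic below are assumptions for illustration, not the deck's actual code:

    import com.hazelcast.jet.pipeline.JournalInitialPosition;
    import com.hazelcast.jet.pipeline.Pipeline;
    import com.hazelcast.jet.pipeline.Sinks;
    import com.hazelcast.jet.pipeline.Sources;
    import java.util.Map.Entry;

    // "vehicle-positions" must have its event journal enabled; names are hypothetical
    Pipeline pipeline = Pipeline.create();
    pipeline.readFrom(Sources.<String, String>mapJournal("vehicle-positions",
                    JournalInitialPosition.START_FROM_CURRENT))
            .withIngestionTimestamps()
            // Enrich each position update with the route name kept in a reference IMap
            .mapUsingIMap("routes",
                    Entry::getKey,                              // vehicle id as lookup key
                    (position, route) -> position.getValue() + " on route " + route)
            .writeTo(Sinks.logger());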
Omid is a transactional framework that allows big data applications to execute ACID transactions on top of HBase. It provides a simple and well-known interface for applications to perform multi-row and multi-table transactions on HBase in a lock-free manner using snapshot isolation. Omid has been used successfully at Yahoo to power applications requiring transactional consistency at web-scale throughput levels for HBase.
Streaming is necessary to handle data rates and latency but SQL is unquestionably the lingua franca of data. Where do the two meet?
Apache Calcite is extending SQL to include streaming, and the Samza, Storm, and Flink projects are each building it into their engines. In this talk, Julian Hyde describes streaming SQL in detail and shows how you can use streaming SQL in your application. He also describes how Calcite’s planner optimizes queries for throughput and latency.
Julian Hyde gave this talk at the first Kafka Summit, San Francisco, 2016/04/26.
Real-time Analytics with Trino and Apache Pinot - Xiang Fu
Trino summit 2021:
Overview of Trino Pinot Connector, which bridges the flexibility of Trino's full SQL support to the power of Apache Pinot's realtime analytics, giving you the best of both worlds.
Strava Labs: Exploring a Billion Activity Dataset from Athletes with Apache S... - Databricks
At Strava we have extensively leveraged Apache Spark to explore our data of over a billion activities, from tens of millions of athletes. This talk will be a survey of the more unique and exciting applications: a Global Heatmap gives a ~2 meter resolution density map of one billion runs, rides, and other activities consisting of three trillion GPS points from 17 billion miles of exercise data. The heatmap was rewritten from a non-scalable system into a highly scalable Spark job, enabling great gains in speed, cost, and quality. Locality-sensitive hashing for GPS traces was used to efficiently cluster 1 billion activities. Additional processes categorize and extract data from each cluster, such as names and statistics. Clustering gives an automated process to extract worldwide geographical patterns of athletes.
Applications include route discovery, recommendation systems, and detection of events and races. A coarse spatiotemporal index of all activity data is stored in Apache Cassandra. Spark streaming jobs maintain this index and compute all space-time intersections (“flybys”) of activities in this index. Intersecting activity pairs are then checked for spatiotemporal correlation; connected components in the graph of highly correlated pairs form “Group Activities”, creating a social graph of shared activities and workout partners. Data from several hundred thousand runners was used to build an improved model of the relationship between running difficulty and elevation gradient (Grade Adjusted Pace).
Bravo Six, Going Realtime. Transitioning Activision Data Pipeline to Streaming - Yaroslav Tkachenko
The Activision Data team has been running a data pipeline for a variety of Activision games for many years. Historically, we used a mix of micro-batch microservices coupled with classic Big Data tools like Hadoop and Hive for ETL. As a result, it could take up to 4-6 hours for data to be available to the end customers.
In the last few years, the adoption of data in the organization skyrocketed. We needed to de-legacy our data pipeline and provide near-realtime access to data in order to improve reporting, gather insights faster, and power web and mobile applications. I want to tell a story about heavily leveraging Kafka Streams and Kafka Connect to reduce the end-to-end latency to minutes, while at the same time making the pipeline easier and cheaper to run. We were able to successfully validate the new data pipeline by launching two massive games just 4 weeks apart.
C*ollege Credit: CEP Distributed Processing on Cassandra with Storm - DataStax
Cassandra provides facilities to integrate with Hadoop. This is sufficient for distributed batch processing, but doesn’t address CEP distributed processing. This webinar will demonstrate the use of Cassandra in Storm. Storm provides a data flow and processing layer that can be used to integrate Cassandra with other external persistence mechanisms (e.g. Elasticsearch) or to calculate dimensional counts for reporting and dashboards. We’ll dive into a sample Storm topology that reads and writes from Cassandra using storm-cassandra bolts.
OLAP Basics and Fundamentals - Bharat Kalia
The document discusses online analytical processing (OLAP) and the need for OLAP capabilities beyond basic data analysis. It describes how OLAP uses multidimensional data models and pre-computed aggregates to provide fast and interactive analysis of data across multiple dimensions. Different approaches for implementing OLAP like ROLAP, MOLAP, and hybrid systems are covered.
Enterprises are increasingly demanding realtime analytics and insights to power use cases like personalization, monitoring and marketing. We will present Pulsar, a realtime streaming system used at eBay, which can scale to millions of events per second with high availability and SQL-like language support, enabling realtime data enrichment, filtering and multi-dimensional metrics aggregation.
We will discuss how Pulsar integrates with a number of open source Apache technologies like Kafka, Hadoop and Kylin (Apache incubator) to achieve high scalability, availability and flexibility. We use Kafka to replay unprocessed events to avoid data loss and to stream realtime events into Hadoop, enabling reconciliation of data between realtime and batch. We use Kylin to provide multi-dimensional OLAP capabilities.
ksqlDB is a stream processing SQL engine which allows stream processing on top of Apache Kafka. ksqlDB is based on Kafka Streams and provides capabilities for consuming messages from Kafka, analysing these messages in near-realtime with a SQL-like language, and producing results back to a Kafka topic. This way, not a single line of Java code has to be written and you can reuse your SQL know-how. This lowers the bar for starting with stream processing significantly.
ksqlDB offers powerful stream processing capabilities, such as joins, aggregations, time windows and support for event time. In this talk I will present how ksqlDB integrates with the Kafka ecosystem and demonstrate how easy it is to implement a solution using ksqlDB for the most part. This will be done in a live demo on a fictitious IoT sample.
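To illustrate how compact this can be, here is a sketch that submits a windowed aggregation from Java through the ksqlDB client API; the stream and column names are invented for the example:

    import io.confluent.ksql.api.client.Client;
    import io.confluent.ksql.api.client.ClientOptions;
    import io.confluent.ksql.api.client.Row;

    ClientOptions options = ClientOptions.create()
            .setHost("localhost")                               // assumed local ksqlDB server
            .setPort(8088);
    Client client = Client.create(options);

    // Hypothetical IoT stream: count readings per device in 1-minute tumbling windows
    String sql = "SELECT device_id, COUNT(*) AS readings "
            + "FROM sensor_events "
            + "WINDOW TUMBLING (SIZE 1 MINUTE) "
            + "GROUP BY device_id EMIT CHANGES;";

    client.streamQuery(sql).thenAccept(result -> {
        Row row;
        while ((row = result.poll()) != null) {                 // blocks until the next row
            System.out.println(row.values());
        }
    }).join();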
Location Analytics - Real-Time Geofencing using Kafka - Guido Schmutz
An important underlying concept behind location-based applications is called geofencing. Geofencing is a process that allows acting on users and/or devices who enter or exit a specific geographical area, known as a geo-fence. A geo-fence can be dynamically generated, as in a radius around a point location, or it can be a predefined set of boundaries (such as secured areas, buildings, borders of counties, states or countries). Geofencing lays the foundation for realising use cases around fleet monitoring, asset tracking, phone tracking across cell sites, connected manufacturing, ride-sharing solutions and many others. Many of the use cases mentioned above require low-latency actions to take place when a device enters, leaves or approaches a geo-fence. That’s where streaming data ingestion and streaming analytics, and therefore the Kafka ecosystem, come into play. This session will present how location analytics applications can be implemented using Kafka, KSQL and Kafka Streams. It highlights the existing features available out of the box and then shows how easy it is to extend them with user-defined functions (UDFs).
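At its core, a radius geo-fence is a distance check. Here is a minimal self-contained sketch (haversine great-circle distance; coordinates in degrees, the fence definition is illustrative):

    // Returns true if (lat, lon) lies within radiusMeters of the fence center
    static boolean insideGeofence(double lat, double lon,
                                  double centerLat, double centerLon, double radiusMeters) {
        final double EARTH_RADIUS_M = 6_371_000d;
        double dLat = Math.toRadians(centerLat - lat);
        double dLon = Math.toRadians(centerLon - lon);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat)) * Math.cos(Math.toRadians(centerLat))
                 * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        double distance = 2 * EARTH_RADIUS_M * Math.asin(Math.sqrt(a));
        return distance <= radiusMeters;
    }

In a streaming setup, such a predicate runs on every incoming position event, and an enter/exit transition of its result is what triggers the low-latency action.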
Riddles of Streaming - Code Puzzlers for Fun & Profit (Nick Dearden, Confluen... - confluent
This document contains questions and answers about various topics in Kafka Streams including:
1. How to handle out-of-order data when reading records into a KTable. The answer is to use a state store and define a window of maximum lateness.
2. How to manage RocksDB databases that are created for each stateful operation like joins and aggregations. The answer is to ensure they are on redundant storage and take periodic snapshots.
3. How fault tolerance is achieved in Kafka Streams. State is automatically migrated in case of server failure allowing another server to resume processing.
4. How to handle exceptions within user code to ensure the application can continue processing. Some operations allow returning
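As a sketch of the "window of maximum lateness" answer from the first riddle, recent Kafka Streams versions let you declare the window size and grace period directly in the DSL; the topic name is illustrative:

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.kstream.Consumed;
    import org.apache.kafka.streams.kstream.TimeWindows;
    import java.time.Duration;

    StreamsBuilder builder = new StreamsBuilder();
    builder.stream("page-views", Consumed.with(Serdes.String(), Serdes.String()))
           .groupByKey()
           // 5-minute windows; records up to 1 minute late are still folded in,
           // anything later is dropped instead of corrupting the result
           .windowedBy(TimeWindows.ofSizeAndGrace(Duration.ofMinutes(5), Duration.ofMinutes(1)))
           .count();   // backed by a local state store, changelogged to Kafka for recovery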
Uber uses streaming analytics platforms like Apache Kafka and Pinot to process billions of messages and petabytes of data per day from streams in near real-time. The presenter discusses Uber's use of SQL as a building block for streaming applications, describing their AthenaX platform, which lets users define streaming jobs with SQL queries, and the self-service ecosystem built around it for production support. Future work may include auto-scaling and multi-data center support for the AthenaX platform.
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac... - DataWorks Summit
LocationTech GeoMesa enables spatial and spatiotemporal indexing and queries for HBase and Accumulo. In this talk, after an overview of GeoMesa’s capabilities in the Cloudera ecosystem, we will dive into how GeoMesa leverages Accumulo’s Iterator interface and HBase’s Filter and Coprocessor interfaces. The goal will be to discuss both what spatial operations can be pushed down into the distributed database and also how the GeoMesa codebase is organized to allow for consistent use across the two database systems.
Arbitrary Stateful Aggregations using Structured Streaming in Apache Spark - Databricks
In this talk, we will introduce some of the new available APIs around stateful aggregation in Structured Streaming, namely flatMapGroupsWithState. We will show how this API can be used to power many complex real-time workflows, including stream-to-stream joins, through live demos using Databricks and Apache Kafka.
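For a flavor of this API family (shown here with the simpler mapGroupsWithState variant rather than flatMapGroupsWithState), the sketch below keeps a running count per key; `events` is assumed to be a streaming Dataset<String> of keys:

    import org.apache.spark.api.java.function.MapFunction;
    import org.apache.spark.api.java.function.MapGroupsWithStateFunction;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Encoders;
    import org.apache.spark.sql.streaming.GroupStateTimeout;

    // Running count per key, kept in Spark-managed state across micro-batches
    MapGroupsWithStateFunction<String, String, Long, String> counter =
            (key, values, state) -> {
                long count = state.exists() ? state.get() : 0L;
                while (values.hasNext()) { values.next(); count++; }
                state.update(count);                     // survives from trigger to trigger
                return key + ": " + count;
            };

    Dataset<String> counts = events
            .groupByKey((MapFunction<String, String>) v -> v, Encoders.STRING())
            .mapGroupsWithState(counter, Encoders.LONG(), Encoders.STRING(),
                    GroupStateTimeout.NoTimeout());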
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath... - Databricks
Stateful processing is one of the most challenging aspects of distributed, fault-tolerant stream processing. The DataFrame APIs in Structured Streaming make it very easy for the developer to express their stateful logic, either implicitly (streaming aggregations) or explicitly (mapGroupsWithState). However, there are a number of moving parts under the hood that make all the magic possible. In this talk, I am going to dive deeper into how stateful processing works in Structured Streaming.
In particular, I’m going to discuss the following.
• Different stateful operations in Structured Streaming
• How state data is stored in a distributed, fault-tolerant manner using State Stores
• How you can write custom State Stores for saving state to external storage systems.
Confluent real time_acquisition_analysis_and_evaluation_of_data_streams_20190... - confluent
Speaker: Perry Krol, Senior Sales Engineer, Confluent Germany GmbH
Title of Talk:
Introduction to Apache Kafka as Event-Driven Open Source Streaming Platform
Abstract:
Apache Kafka is a de facto standard event streaming platform, being widely deployed as a messaging system, and having a robust data integration framework (Kafka Connect) and stream processing API (Kafka Streams) to meet the needs that commonly attend real-time, event-driven data processing.
The open source Confluent Platform adds further components such as KSQL, Schema Registry, REST Proxy, clients for different programming languages, and connectors for different technologies and databases. This session explains the concepts, architecture and technical details, including live demos.
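For reference, producing an event to Kafka from Java takes only a few lines; the broker address, topic, and payload below are assumptions:

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;
    import java.util.Properties;

    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");           // assumed local broker
    props.put("key.serializer", StringSerializer.class.getName());
    props.put("value.serializer", StringSerializer.class.getName());

    try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
        // The same event can then be read independently by consumers,
        // Kafka Connect sinks, or Kafka Streams applications
        producer.send(new ProducerRecord<>("orders", "order-42", "{\"amount\": 12.5}"));
    }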
This document summarizes Julian Hyde's talk on streaming SQL. The key points are:
1) Streaming SQL allows for relational queries over both streaming and stored data, including joins between streams and tables.
2) Queries are valid if the system can provide data with reasonable latency, using techniques like watermarks and monotonic columns.
3) Views, materialized views, and standing queries can be used to maintain windowed histories and summaries of streaming data.
4) A standard streaming SQL allows data in motion and at rest to be accessed together, combining real-time and historical data.
These slides were designed for an Apache Hadoop + Apache Apex workshop (university program).
The audience mainly consisted of third-year engineering students from Computer, IT, and Electronics and Telecom disciplines.
I tried to keep it simple for beginners to understand. Some of the examples use context from India, but in general this should be a good starting point for beginners.
Advanced users/experts may not find this relevant.
This presentation will describe how to go beyond a "Hello world" stream application and build a real-time data-driven product. We will present architectural patterns, go through tradeoffs and considerations when deciding on technology and implementation strategy, and describe how to put the pieces together. We will also cover necessary practical pieces for building real products: testing streaming applications, and how to evolve products over time.
Presented at highloadstrategy.com 2016 by Øyvind Løkling (Schibsted Products & Technology), joint work with Lars Albertsson (independent, www.mapflat.com).
This document describes a parallel and scalable approach called Big-SeqSB-Gen for generating large synthetic sequence databases. It implements Whitney enumerators to generate distinct sequences and uses a parallel sequence generator (PSG) built on Hadoop MapReduce. The PSG was tested on a French Grid5000 cluster and achieved generation of over 18 billion sequences in under 2 hours, demonstrating good scalability and throughput. Future work involves mining patterns from large real sequence datasets.
This talk will be about the reasons behind the new stream processing model, how it compares to the old batch model, what their pros and cons are, and a list of existing technologies implementing stream processing with their most prominent characteristics. It will detail one possible use-case of data streaming that is not possible with batches: displaying in (near) real-time all trains in Switzerland and their position on a map, beginning with an overview of all the requirements and the design. Finally, using an OpenData endpoint and the Hazelcast platform, it will show a working demo implementation of it.
BigData conference - Introduction to stream processing - Nicolas Fränkel
This document discusses stream processing and summarizes a presentation about the topic. It introduces Hazelcast Jet as a stream processing engine and covers open data standards like GTFS. It also describes a demo that uses GTFS data to enrich public transit vehicle position updates in real-time using Hazelcast Jet. The presentation discusses streaming approaches, benefits over batch processing, and provides an overview of stream processing concepts.
Devclub.lv - Introduction to stream processing - Nicolas Fränkel
While “software is eating the world”, those best able to manage the huge mass of data will come out on top.
The batch processing model has been faithfully serving us for decades. However, it might have reached the end of its usefulness for all but some very specific use-cases. As the pace of business increases, decision-makers most often prefer slightly wrong data sooner than 100% accurate data later. Stream processing – or data streaming – exactly matches this usage: instead of managing the entire bulk of data, process pieces of it as soon as they become available.
Big Data Berlin v8.0 Stream Processing with Apache Apex - Apache Apex
This document discusses Apache Apex, an open source stream processing framework. It provides an overview of stream data processing and common use cases. It then describes key Apache Apex capabilities like in-memory distributed processing, scalability, fault tolerance, and state management. The document also highlights several customer use cases from companies like PubMatic, GE, and Silver Spring Networks that use Apache Apex for real-time analytics on data from sources like IoT sensors, ad networks, and smart grids.
Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ... - Dataconomy Media
Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder of DataTorrent presented "Streaming Analytics with Apache Apex" as part of the Big Data, Berlin v 8.0 meetup organised on the 14th of July 2016 at the WeWork headquarters.
[DSC Europe 23] Pramod Immaneni - Real-time analytics at IoT scale - DataScienceConferenc1
Rivian makes adventurous electric vehicles with a mission of a sustainable planet and keeping the world adventurous forever. Rivian's vehicles are born in the cloud and embody the tenets of a software-defined vehicle, where not only user-accessible features such as infotainment are software-driven and updated, but also internal aspects such as vehicle dynamics. Real-time instrumentation and telemetry are the key underpinnings that make all this possible. Rivian has built a cutting-edge real-time stack using a combination of open-source technologies like Kafka, Flink and Druid, and in-house services. This talk will go into how these are combined and leveraged to deliver real-time analytics.
Apache Big Data 2016: Next Gen Big Data Analytics with Apache Apex - Apache Apex
Apache Apex is a next gen big data analytics platform. Originally developed at DataTorrent, it comes with a powerful stream processing engine, a rich set of functional building blocks and an easy-to-use API for the developer to build real-time and batch applications. Apex runs natively on YARN and HDFS and is used in production in various industries. You will learn about the Apex architecture, including its unique features for scalability, fault tolerance and processing guarantees, its programming model and use cases.
http://apachebigdata2016.sched.org/event/6M0L/next-gen-big-data-analytics-with-apache-apex-thomas-weise-datatorrent
Big data serving: Processing and inference at scale in real time - Itai Yaffe
Jon Bratseth (VP Architect) @ Verizon Media:
The big data world has mature technologies for offline analysis and learning from data, but has lacked options for making data-driven decisions in real time.
When it is sufficient to consider a single data point, model servers such as TensorFlow Serving can be used, but in many cases you want to consider many data points to make decisions.
This is a difficult engineering problem combining state, distributed algorithms and low latency, but solving it often makes it possible to create far superior solutions when applying machine learning.
This talk will explain why this is a hard problem, show the advantages of solving it, and introduce the open source Vespa.ai platform which is used to implement such solutions in some of the largest scale problems in the world including the world's third largest ad serving system.
Intro to Apache Apex - Next Gen Platform for Ingest and Transform - Apache Apex
Introduction to Apache Apex - The next generation native Hadoop platform. This talk will cover details about how Apache Apex can be used as a powerful and versatile platform for big data processing. Common usage of Apache Apex includes big data ingestion, streaming analytics, ETL, fast batch alerts, real-time actions, threat detection, etc.
Bio:
Pramod Immaneni is an Apache Apex PMC member and senior architect at DataTorrent, where he works on Apache Apex and specializes in big data platforms and applications. Prior to DataTorrent, he was a co-founder and CTO of Leaf Networks LLC, eventually acquired by Netgear Inc, where he built products in the core networking space and was granted patents in peer-to-peer VPNs.
Next Gen Big Data Analytics with Apache Apex discusses Apache Apex, an open source stream processing framework. It provides an overview of Apache Apex's capabilities for processing continuous, real-time data streams at scale. Specifically, it describes how Apache Apex allows for in-memory, distributed stream processing using a programming model of operators in a directed acyclic graph. It also covers Apache Apex's features for fault tolerance, dynamic scaling, and integration with Hadoop and YARN.
Hadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache Apex - Apache Apex
This is an overview of the architecture of Apache Apex, a big data analytics platform, along with use cases. It comes with a powerful stream processing engine, a rich set of functional building blocks and an easy-to-use API for the developer to build real-time and batch applications. Apex runs natively on YARN and HDFS and is used in production in various industries. You will learn more about two use cases: a leading Ad Tech company serves billions of advertising impressions and collects terabytes of data from several data centers across the world every day. Apex was used to implement rapid actionable insights, for real-time reporting and allocation, utilizing Kafka and files as sources, dimensional computation and low latency visualization. A customer in the IoT space uses Apex for a Time Series service, including efficient storage of time series data, data indexing for quick retrieval and queries at high scale and precision. The platform leverages the high availability, horizontal scalability and operability of Apex.
Putting the Micro into Microservices with Stateful Stream Processing - confluent
1) The document discusses using stateful stream processing to build lightweight microservices that evolve a shared narrative. It outlines various tools from the stream processing toolkit like Kafka, KStreams, KTables, state stores, and transactions that can be used.
2) Various patterns for building stateless, stateful, and joined streaming services are presented, including gates, sidecars and stream-asides. These can be combined to process events and build views.
3) An evolutionary approach is suggested where services start small and stateless, becoming stateful if needed, and layering contexts within contexts. This allows systems to balance sunk costs and future flexibility.
The document discusses Apache Tez, a distributed execution framework for data processing applications. Tez is designed to improve performance over Hadoop MapReduce by expressing computations as dataflow graphs and optimizing resource usage. It aims to empower users with expressive APIs, a flexible runtime model, and simplified deployment. Tez also improves execution performance by eliminating MapReduce overhead, applying dynamic runtime optimization, and managing resources optimally with YARN.
Real time analytics on deep learning @ strata data 2019 - Zhenxiao Luo
This presentation discusses Uber's use of real-time analytics on deep learning models by combining TensorFlow and Presto. It provides an overview of deep learning and distributed deep learning using TensorFlow and Horovod at Uber. It then discusses Uber's use of Presto, an interactive SQL query engine, to enable real-time querying of deep learning models and datasets. Specific optimizations for the Presto Elasticsearch and Tensorflow connectors are also outlined to improve query performance.
To understand an application’s performance, first you have to know what to measure. That’s the easy part. How do you take those measurements? Store them? Analyze them? Get them to the people who need them? Well, that’s where things get complicated, especially in the high-traffic distributed systems of the modern web! Like careful scientists, we must observe our subjects without altering them, and we must report our findings quickly so that we have the data necessary to make smart choices about the health and growth of the system.
Let’s explore the lessons learned by engineers at one of the world’s top web companies in their quest to find meaning at 5 MB/s. We’ll discuss the tools and techniques that enable the collection, indexing, and analysis of billions or more datapoints each hour, and learn how these same approaches can empower your applications and your business, no matter the scale.
SnappyData, the Spark Database. A unified cluster for streaming, transactions... - SnappyData
Apache Spark 2.0 offers many enhancements that make continuous analytics quite simple. In this talk, we will discuss many other things that you can do with your Apache Spark cluster. We explain how a deep integration of Apache Spark 2.0 and in-memory databases can bring you the best of both worlds! In particular, we discuss how to manage mutable data in Apache Spark, run consistent transactions at the same speed as state-of-the-art in-memory grids, build and use indexes for point lookups, and run 100x more analytics queries at in-memory speeds. No need to bridge multiple products or manage and tune multiple clusters. We explain how one can take regular Apache Spark SQL OLAP workloads and speed them up by up to 20x using optimizations in SnappyData.
We then walk through several use-case examples, including IoT scenarios, where one has to ingest streams from many sources, cleanse them, manage the deluge by pre-aggregating and tracking metrics per minute, store all recent data in an in-memory store along with history in a data lake, and permit interactive analytic queries on this constantly growing data. Rather than stitching together multiple clusters as proposed in Lambda, we walk through a design where everything is achieved in a single, horizontally scalable Apache Spark 2.0 cluster. A design that is simpler, a lot more efficient, and lets you do everything from Machine Learning and Data Science to Transactions and Visual Analytics, all in one single cluster.
Similar to BruJUG - Introduction to data streaming (20)
SnowCamp - Adding search to a legacy application - Nicolas Fränkel
Most applications evolve to a point where they need to provide search capabilities. But updating an application is always a risk. Plus, sometimes you don’t have access to the source code. The easiest way to access the data is to get it directly from the database.
The initial load is the easiest step. However, how do you keep the search index in sync with the database? How do you keep the latency between the search store and the source of truth low, so your users don’t have to wait for the next run of the batch to access the newest changes?
In this live coding session, we will show you how you can solve this issue by connecting Elasticsearch to the database with a touch of Hazelcast.
People say GitHub is a developer's CV. One quick look at your commit history, and recruiters know everything about you. This approach has a few problems. Most companies don't even publish their code under an Open Source license. If you work for one of them, and you're not an Open Source developer on evenings and weekends, then you don't stand a chance.
Recently, GitHub has allowed some degree of profile customization. So even if your commit history shows more white than green, you can provide a good entry point for potential employers. But it's only worth the effort you put into it, and the data loses its value quickly. Yet, with a little work and the help of automation tools (such as GitHub Actions), you can present an always up-to-date profile.
Zero-downtime deployment on Kubernetes with Hazelcast - Nicolas Fränkel
Kubernetes allows a lot. After discovering its features, it’s easy to think it can magically transform your application deployment process into a painless non-event. For Hello World applications, that is the case. Unfortunately, not many of us deploy such applications day-to-day, because we need to handle state. Though it would be much easier to have stateless apps, and despite our best efforts in this direction, state is found in (at least) two places: sessions and databases.
You need to think about keeping the state while stopping and starting application nodes. In this talk, I’ll demo how to update a Spring Boot app deployed on a Kubernetes cluster with a non-trivial database schema change with the help of Hazelcast, while keeping the service up during the entire update process.
jLove - A Change-Data-Capture use-case: designing an evergreen cache - Nicolas Fränkel
When one’s app is challenged with poor performance, it’s easy to set up a cache in front of one’s SQL database. It doesn’t fix the root cause (e.g. bad schema design, bad SQL query, etc.) but it gets the job done. If the app is the only component that writes to the underlying database, it’s a no-brainer to update the cache accordingly, so the cache is always up-to-date with the data in the database.
Things start to go sour when the app is not the only component writing to the DB. Among other sources of writes, there are batches, other apps (shared databases exist unfortunately), etc. One might think about a couple of ways to keep data in sync, e.g. polling the DB every now and then, DB triggers, etc. Unfortunately, they all have issues that make them unreliable and/or fragile.
You might have read about Change-Data-Capture before. It’s been described by Martin Kleppmann as turning the database inside out: it means the DB can send change events (INSERT, UPDATE and DELETE) that one can register to. Just the opposite of Event Sourcing, which aggregates events to produce state, CDC is about getting events out of state. Once CDC is implemented, one can subscribe to its events and update the cache accordingly. However, CDC is still in its early stages, and implementations are quite specific.
In this talk, I’ll describe an easy-to-setup architecture that leverages CDC to have an evergreen cache.
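A sketch of the core of such an architecture, using Debezium's embedded engine (one possible CDC implementation); the connector configuration is abbreviated, and `cache` stands in for whatever Map-like cache API is used:

    import io.debezium.engine.ChangeEvent;
    import io.debezium.engine.DebeziumEngine;
    import io.debezium.engine.format.Json;
    import java.util.Properties;

    Properties props = new Properties();
    props.setProperty("name", "cache-refresher");
    props.setProperty("connector.class", "io.debezium.connector.mysql.MySqlConnector");
    props.setProperty("database.hostname", "localhost");        // connection details illustrative
    // ... offset storage and the remaining connector configuration elided ...

    DebeziumEngine<ChangeEvent<String, String>> engine =
            DebeziumEngine.create(Json.class)
                    .using(props)
                    .notifying(event -> {
                        // On every row-level change, refresh the matching cache entry;
                        // `cache` stands for any Map-like cache API (assumption)
                        cache.put(event.key(), event.value());
                    })
                    .build();
    new Thread(engine).start();                                 // DebeziumEngine is a Runnable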
ADDO - Your own Kubernetes controller, not only in Go - Nicolas Fränkel
In Kubernetes, operators allow the API to be extended to your heart's content. If one task requires too much YAML, it’s easy to create an operator to take care of the repetitive cruft and only require a minimum amount of YAML.
On the other hand, since its beginnings, the Go language has been advertised as closer to the hardware, and is now ubiquitous in low-level programming. Kubernetes has been rewritten from Java to Go, and its whole ecosystem revolves around Go. For that reason, it’s only natural that Kubernetes provides a Go-based framework to create your own operator. While it makes sense, it requires organizations willing to go down this road to have Go developers, and/or to train their teams in Go. While perfectly acceptable, this is not the only option. In fact, since Kubernetes is based on REST, why settle for Go and not use your own favorite language?
In this talk, I’ll describe what an operator is, how operators work, and how to design one, and finally demo a Java-based operator that is as good as a Go one.
TestCon Europe - Mutation Testing to the Rescue of Your Tests - Nicolas Fränkel
Unit testing ensures your production code is relevant. But what ensures your testing code is relevant? Come discover mutation testing and make sure you never forget another assert again.
In the realm of testing, the code coverage metric is the one most often talked about. However, it doesn’t mean that a test has been useful or even that an assert has been coded. Mutation testing is a strategy to make sure that the test code is relevant.
In this talk, Nicolas will explain how Code Coverage is computed and what its inherent flaw is. Afterwards, he will describe how Mutation Testing works and how it helps point out code that is tested but leaves out corner cases. He will also demo PIT, a production-grade Java framework that enables Mutation Testing.
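To see why coverage alone is not enough, consider this sketch of a surviving mutant: the test executes every line, so coverage reports 100%, yet a mutation testing tool like PIT would report mutations as surviving because nothing is asserted (the code itself is invented for illustration):

    import org.junit.jupiter.api.Test;

    // Production code under test
    static boolean isAdult(int age) {
        return age > 17;
    }

    // This test executes every line, so line coverage is 100%...
    @Test
    void callsButNeverAsserts() {
        isAdult(18);   // ...yet mutants like 'age > 17' -> 'age >= 17' or
                       // 'return true' -> 'return false' still pass, exposing the weak test
    }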
OSCONF Jaipur - A Hitchhiker's Tour to Containerizing a Java application - Nicolas Fränkel
As “the Cloud” becomes more and more widespread, now is a good time to assess how you can containerize your Java application. I assume you’re able to write a Dockerfile around the generated JAR. However, each time the application’s code changes, the whole image needs to be rebuilt. If you’re deploying to a local Kubernetes cluster environment, this considerably lengthens the feedback loop.
In this demo-based talk, I’ll present different ways to get your Java app in a container: Dockerfile, Jib, and Cloud Native Buildpacks. We will also have a look at what kind of Docker image they generate, how they layer the images, whether those images are compatible with skaffold, etc.
GeekcampSG 2020 - A Change-Data-Capture use-case: designing an evergreen cache - Nicolas Fränkel
CDC is a brand new approach that “turns the database inside out”: it allows getting events out of the database state. This can be leveraged to build a cache that is never stale.
JavaDay Istanbul - 3 improvements in your microservices architecture - Nicolas Fränkel
While a microservices architecture is more scalable than a monolith, it comes with a direct hit on performance.
To cope with that, one performance improvement is to set up a cache. It can be configured for database access, for REST calls, or just to store session state across a cluster of server nodes. In this demo-based talk, I’ll show how the Hazelcast In-Memory Data Grid can help you in each one of those areas and how to configure it. Hint: it’s much easier than one would expect.
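A sketch of the caching part with the Hazelcast Java client; the cluster is assumed to be reachable with default settings, and the map name is illustrative:

    import com.hazelcast.client.HazelcastClient;
    import com.hazelcast.core.HazelcastInstance;
    import com.hazelcast.map.IMap;

    // Connect to a running Hazelcast cluster (default configuration assumed)
    HazelcastInstance client = HazelcastClient.newHazelcastClient();

    // One distributed map, shared by every node: usable as a cache for REST calls,
    // for database access, or as a session store across the cluster
    IMap<String, String> cache = client.getMap("rest-calls");
    cache.put("GET /users/42", "{\"name\": \"Jane\"}");
    String cached = cache.get("GET /users/42");                 // served from the grid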
At a point in the past, it was forecast that Java would die, but the JVM platform would be its legacy. And in fact, for a long time, the JVM has been tremendously successful. Wikipedia itself lists a bunch of languages that run on it, some of them close to Java e.g. Kotlin, some of them very remote e.g. Clojure.
But nowadays, the Cloud is becoming ubiquitous. Containerization is the way to go to alleviate some of the vendor lock-in issues. Kubernetes is a de facto platform. If a container needs to be killed for whatever reason (resource consumption, unhealthiness, etc.), a new one needs to replace it as fast as possible. In that context, the JVM seems to be a dead end: its startup time is huge in comparison to a native process. Likewise, it consumes a lot of memory that just increases the monthly bill.
What does that mean for us developers? Has all the time spent learning the JVM ecosystem been invested with no hope of return on investment? Will we need to invest even more time in new languages, frameworks, libraries, etc.? That is one possibility for sure. But we can also leverage our existing knowledge, and embrace the Cloud and container ways with the help of some tools.
In this talk, I’ll create a simple URL shortener with a “standard” stack: Kotlin, JAX-RS and Hazelcast. Then, with the help of Quarkus and GraalVM, I’ll turn this application into a native executable, with all Cloud/Container-related work moved to the build process.
OSCONF Koshi - Zero downtime deployment with Kubernetes, Flyway and Spring Boot - Nicolas Fränkel
Kubernetes allows a lot. After discovering its features, it’s easy to think it can magically transform your application deployment process into a painless non-event. For Hello World applications, that is the case. Unfortunately, not many of us deploy such applications day-to-day. You need to think about application backward compatibility, possible rollbacks, database schema migration, etc. I believe the latter is one of the biggest pain points. In this talk, I’ll demo how to update a Spring Boot app deployed on a Kubernetes cluster with a non-trivial database schema migration with the help of Flyway, while keeping the service up during the entire update process.
JOnConf - A CDC use-case: designing an Evergreen CacheNicolas Fränkel
This document discusses using change data capture (CDC) and Hazelcast Jet to build an evergreen cache that remains in sync with a database. It covers alternatives to cache invalidation like polling and triggers, introduces CDC and the Debezium implementation, and proposes a Jet job that watches database change events, analyzes them, and updates the cache accordingly to solve the cache freshness problem.
London In-Memory Computing Meetup - A Change-Data-Capture use-case: designing...Nicolas Fränkel
When one’s app is struggling with poor performance, it’s easy to set up a cache in front of one’s SQL database. It doesn’t fix the root cause (e.g. bad schema design, bad SQL queries, etc.) but it gets the job done. If the app is the only component that writes to the underlying database, it’s a no-brainer to update the cache accordingly, so the cache is always up-to-date with the data in the database.
Things start to go sour when the app is not the only component writing to the DB. Among other sources of writes, there are batches, other apps (shared databases exist, unfortunately), etc. One might think of a couple of ways to keep the data in sync, e.g. polling the DB every now and then, DB triggers, etc. Unfortunately, they all have issues that make them unreliable and/or fragile.
In this talk, I will describe an easy-to-setup architecture that leverages CDC to have an evergreen cache.
Java.IL - Your own Kubernetes controller, not only in Go!Nicolas Fränkel
In Kubernetes, operators allow the API to be extended to your heart’s content. If a task requires too much YAML, it’s easy to create an operator to take care of the repetitive cruft and require only a minimum amount of YAML.
On the other hand, since its beginnings, the Go language has been advertised as closer to the hardware, and is now ubiquitous in low-level programming. Kubernetes has been rewritten from Java to Go, and its whole ecosystem revolves around Go. For that reason, it’s only natural that Kubernetes provides a Go-based framework to create your own operator. While it makes sense, it requires organizations willing to go down this road to have Go developers, and/or to train their teams in Go. While perfectly acceptable, this is not the only option. In fact, since Kubernetes is based on REST, why settle for Go and not use your own favorite language?
In this talk, I’ll describe what an operator is, how operators work, how to design one, and finally demo a Java-based operator that is as good as a Go one.
London Java Community - An Experiment in Continuous Deployment of JVM applica...Nicolas Fränkel
A couple of years ago, continuous integration in the JVM ecosystem meant Jenkins. Since then, a lot of other tools have been made available. But new tools don’t mean new features, just new ways. Besides, what about continuous deployment? There’s no tool that allows deploying new versions of a JVM-based application without downtime. The only way to achieve zero downtime is to have multiple nodes deployed on a platform, and let that platform handle the rollout, e.g. Kubernetes.
And yet, achieving true continuous deployment of bytecode on a single JVM instance is possible, if one changes one’s way of looking at things. What if compilation could be seen as a stream of changes? What if those changes could be stored in a data store, and a listener on this data store could stream them to the running production JVM via the Attach API?
In this talk, we'll demo exactly that using Hazelcast and Hazelcast Jet - but it’s possible to re-use the principles that will be shown using other streaming technologies.
OSCONF - Your own Kubernetes controller: not only in GoNicolas Fränkel
This document discusses creating Kubernetes operators and controllers using different programming languages besides Go. It suggests that a Java-based controller is possible using the GraalVM, which allows creating native executables from Java bytecode. Key points covered include what controllers and operators are, that no specific technology stack is required, and that the JVM could be a good option for controller development with GraalVM's support for polyglot programming and creating native applications.
vKUG - Migrating Spring Boot apps from annotation-based config to FunctionalNicolas Fränkel
In recent years, there has been some push-back against frameworks, and more specifically against annotations: some call them magic. Obviously, they make understanding the flow of the application harder.
The latest versions of Spring and Spring Boot follow this trend by offering an additional way to configure beans with explicit code instead of annotations. It’s declarative in the sense that it looks like configuration, though it’s based on Domain-Specific Language(s). This talk aims to demo a step-by-step process to achieve that.
AllTheTalks.online - A Streaming Use-Case: An Experiment in Continuous Deplo...Nicolas Fränkel
Nicolas Fränkel proposes an experimental use case of continuous deployment of bytecode by dynamically loading Java agents into a running JVM using the Attach API. The agent would use the Instrumentation API to modify bytecode of running applications by redefining classes in order to deploy changes made by a continuous integration pipeline without restarting the JVM. Current limitations include a lack of tagging and an inability to add/remove fields or methods via redefinition. Improvements could include streaming changes directly from the CI pipeline and adding metadata to classes from the Git tag.
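A minimal sketch of the attach step described above, assuming the target JVM’s pid is known and that agent.jar declares an agentmain() entry point receiving the Instrumentation instance used to redefine classes; the path and pid are illustrative:

import com.sun.tools.attach.VirtualMachine;

String pid = "12345";                            // hypothetical pid of the running production JVM
VirtualMachine vm = VirtualMachine.attach(pid);  // attach to the target JVM
vm.loadAgent("/path/to/agent.jar");              // triggers agentmain(String, Instrumentation) in the target
vm.detach();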
ING Meetup - Migrating Spring Boot Config Annotations to Functional with KotlinNicolas Fränkel
In recent years, there has been some push-back against frameworks, and more specifically against annotations: some call them magic. Obviously, they make understanding the flow of the application harder. The latest versions of Spring and Spring Boot follow this trend by offering an additional way to configure beans with explicit code instead of annotations. It's declarative in the sense that it looks like configuration, though it's based on Domain-Specific Language(s). This talk aims to demo a step-by-step process to achieve that.
3. @nicolas_frankel
Hazelcast
HAZELCAST IMDG is an operational, in-memory, distributed computing platform that manages data using in-memory storage, and performs parallel execution for breakthrough application speed and scale.
HAZELCAST JET is the ultra-fast, application-embeddable, 3rd-generation stream processing engine for low-latency batch and stream processing.
4. @nicolas_frankel
• Why streaming?
• Streaming approaches
• Hazelcast Jet
• Open Data
• General Transit Feed Specification
• The demo!
• Q&A
Schedule
12. @nicolas_frankel
• Scheduled at regular intervals
• Daily
• Weekly
• Monthly
• Yearly
• Expected to complete within a specific amount of time
Properties of batches
13. @nicolas_frankel
• When the execution time overlaps the next execution schedule
• When the space taken by the data exceeds the storage capacity
• When the batch fails mid-execution
• etc.
Oops
17. @nicolas_frankel
Event Sourcing
“Event sourcing persists the state of a business entity such an Order or a Customer as a sequence of state-changing events. Whenever the state of a business entity changes, a new event is appended to the list of events. Since saving an event is a single operation, it is inherently atomic. The application reconstructs an entity’s current state by replaying the events.”
-- https://microservices.io/patterns/data/event-sourcing.html
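A minimal sketch of the replay mechanism the quote describes, using a hypothetical Order entity and OrderEvent type (not from the talk):

import java.util.List;

interface OrderEvent { String newState(); }

class Order {
    private String state = "NEW";

    void apply(OrderEvent event) { state = event.newState(); }  // each event is one state change

    // current state = fold over the persisted event log
    static Order replay(List<OrderEvent> events) {
        Order order = new Order();
        events.forEach(order::apply);
        return order;
    }
}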
22. @nicolas_frankel
Streaming is smart ETL
• Ingest: in-memory, operational storage
• Combine: join, enrich, group, aggregate
• Stream: windowing, event-time processing
• Compute: distributed and parallel computation
• Transform: filter, clean, convert
• Publish: in-memory, subscriber notifications
Example: notify if response time is 10% over the 24-hour average, second by second
23. @nicolas_frankel
• Real-time dashboards
• Decision making
• Recommendations
• Stats (gaming, infrastructure monitoring)
• Prediction - often based on algorithmic prediction
• Push stream through ML model
• Complex Event Processing
Use Case: Analytics and Decision Making
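As a concrete sketch of the complex event processing case, counting credit-card usages over a 10-second sliding window and flagging cards used more than 10 times (the fraud rule mentioned in the speaker notes at the end). A minimal Jet sketch; Transaction, its accessors, transactionSource() and the exact API version are assumptions:

import com.hazelcast.jet.pipeline.Pipeline;
import com.hazelcast.jet.pipeline.Sinks;
import com.hazelcast.jet.pipeline.WindowDefinition;
import static com.hazelcast.jet.aggregate.AggregateOperations.counting;

Pipeline p = Pipeline.create();
p.drawFrom(transactionSource())                     // hypothetical stream source of transactions
 .withTimestamps(Transaction::timestamp, 1_000)     // event time, 1 s allowed lag
 .groupingKey(Transaction::cardId)
 .window(WindowDefinition.sliding(10_000, 1_000))   // 10 s window, sliding by 1 s
 .aggregate(counting())
 .filter(result -> result.result() > 10)            // more than 10 uses in 10 s: suspicious
 .drainTo(Sinks.logger());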
25. @nicolas_frankel
• Distributed
• On-disk storage
• Messages sent and read from a topic
• Publish-subscribe
• Queue
• Consumer can keep track of the offset
Kafka
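A minimal consumer-side sketch of those properties, assuming a local broker, String messages and an illustrative topic name:

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "demo");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
    consumer.subscribe(List.of("trip-updates"));    // publish-subscribe on a topic
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
    records.forEach(r ->
        System.out.printf("offset=%d value=%s%n", r.offset(), r.value()));  // each consumer tracks its offset
}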
26. @nicolas_frankel
• Apache Flink
• Amazon Kinesis
• IBM Streams
• Hazelcast Jet
• Apache Beam
• Abstraction over some of the above
• …
In-memory stream processing engines
27. @nicolas_frankel
• Apache 2 Open Source
• Single JAR
• Leverages Hazelcast IMDG
• Unified batch/streaming API
• (Hazelcast Jet Enterprise)
Hazelcast Jet
29. @nicolas_frankel
• Declaration (code) that defines and links sources, transforms, and sinks
• Platform-specific SDK (Pipeline API in Jet)
• Client submits the pipeline to the Stream Processing Engine (SPE)
Concept: Pipeline
30. @nicolas_frankel
• Running instance of pipeline in SPE
• SPE executes the pipeline
• Code execution
• Data routing
• Flow control
• Parallel and distributed execution
Concept: Job
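A minimal sketch of the client side of these two concepts, assuming an embedded Jet member and the 3.x API used elsewhere in this deck; pipeline is a Pipeline built as on the previous slide:

import com.hazelcast.jet.Jet;
import com.hazelcast.jet.JetInstance;
import com.hazelcast.jet.Job;

JetInstance jet = Jet.newJetInstance();  // start (or join) a Jet cluster
Job job = jet.newJob(pipeline);          // submit: the SPE plans and executes the pipeline
job.join();                              // wait for completion; a streaming job runs until cancelled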
31. @nicolas_frankel
Imperative model
final String text = "...";
final Map<String, Long> counts = new HashMap<>();
for (String word : text.split("\\W+")) {
  Long count = counts.get(word);
  counts.put(word, count == null ? 1L : count + 1);
}
32. @nicolas_frankel
Declarative model
Map<String, Long> counts = lines.stream()
  .map(String::toLowerCase)
  .flatMap(
    line -> Arrays.stream(line.split("\\W+"))
  )
  .filter(word -> !word.isEmpty())
  .collect(Collectors.groupingBy(
    word -> word, Collectors.counting())
  );
33. @nicolas_frankel
• Multiple nodes
• Scalable storage and performance
• Elasticity
• Data stored, partitioned and replicated
• No single point of failure
What Distributed Means to Hazelcast
34. @nicolas_frankel
Distributed Parallel Processing
Pipeline p = Pipeline.create();
p.drawFrom(Sources.<Long, String>map(BOOK_LINES))
 .flatMap(line -> traverseArray(line.getValue().split("\\W+")))
 .filter(word -> !word.isEmpty())
 .groupingKey(wholeItem())
 .aggregate(counting())
 .drainTo(Sinks.map(COUNTS));
[Diagram: Data Source → from → flatMap → filter → aggregate → to → Data Sink]
Translate declarative code to a Directed Acyclic Graph
36. @nicolas_frankel
« Open data is the idea that some data should be freely available to everyone to use and republish as they wish, without restrictions from copyright, patents or other mechanisms of control. »
-- https://en.wikipedia.org/wiki/Open_data
Open Data
43. @nicolas_frankel
General Transit Feed Specification
”The General Transit Feed Specification (GTFS) […] defines a common format for public transportation schedules and associated geographic information. GTFS feeds let public transit agencies publish their transit data and developers write applications that consume that data in an interoperable way.”
44. @nicolas_frankel
GTFS static model
• agency.txt (required): Transit agencies with service represented in this dataset.
• stops.txt (required): Stops where vehicles pick up or drop off riders. Also defines stations and station entrances.
• routes.txt (required): Transit routes. A route is a group of trips that are displayed to riders as a single service.
• trips.txt (required): Trips for each route. A trip is a sequence of two or more stops that occur during a specific time period.
• stop_times.txt (required): Times that a vehicle arrives at and departs from stops for each trip.
• calendar.txt (conditionally required): Service dates specified using a weekly schedule with start and end dates. This file is required unless all dates of service are defined in calendar_dates.txt.
• calendar_dates.txt (conditionally required): Exceptions for the services defined in calendar.txt. If calendar.txt is omitted, then calendar_dates.txt is required and must contain all dates of service.
• fare_attributes.txt (optional): Fare information for a transit agency's routes.
45. @nicolas_frankel
GTFS static model
• fare_rules.txt (optional): Rules to apply fares for itineraries.
• shapes.txt (optional): Rules for mapping vehicle travel paths, sometimes referred to as route alignments.
• frequencies.txt (optional): Headway (time between trips) for headway-based service or a compressed representation of fixed-schedule service.
• transfers.txt (optional): Rules for making connections at transfer points between routes.
• pathways.txt (optional): Pathways linking together locations within stations.
• levels.txt (optional): Levels within stations.
• feed_info.txt (optional): Dataset metadata, including publisher, version, and expiration information.
• translations.txt (optional): Translated information of a transit agency.
• attributions.txt (optional): Specifies the attributions that are applied to the dataset.
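Since the static model is a set of plain CSV files, loading one is straightforward. A minimal sketch for stops.txt, assuming no quoted fields with embedded commas (real GTFS feeds may have them) and keying on the first column, stop_id:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import java.util.stream.Stream;

static Map<String, String[]> loadStops(Path file) throws IOException {
    Map<String, String[]> stops = new HashMap<>();
    try (Stream<String> lines = Files.lines(file)) {
        lines.skip(1)                                    // skip the header row
             .map(line -> line.split(","))
             .forEach(cols -> stops.put(cols[0], cols)); // stop_id -> all columns
    }
    return stops;
}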
47. @nicolas_frankel
• Open Data
• GTFS static available as downloadable .txt files
• GTFS dynamic available as a REST endpoint
Use-case: Swiss Public Transport
49. @nicolas_frankel
• Source: web service
• Split into trip updates
• Enrich with static trip data
• Enrich with static stop times data
• Transform hours into timestamp
• Enrich with static location data
• Sink: Hazelcast IMDG
The dynamic data pipeline
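A minimal sketch of this pipeline with the Jet Pipeline API; the source, the TripUpdate type, its helper methods and the map names are assumptions, not the actual demo code:

import com.hazelcast.jet.pipeline.Pipeline;
import com.hazelcast.jet.pipeline.Sinks;
import static com.hazelcast.jet.Traversers.traverseIterable;

Pipeline p = Pipeline.create();
p.drawFrom(gtfsRealtimeSource())                            // SourceBuilder-based source polling the REST endpoint
 .withoutTimestamps()                                       // no windowing needed here
 .flatMap(feed -> traverseIterable(feed.getTripUpdates()))  // split into trip updates
 .mapUsingIMap("trips", TripUpdate::getTripId,
     (update, trip) -> update.withTrip(trip))               // enrich with static trip data
 .mapUsingIMap("stop-times", TripUpdate::getTripId,
     (update, times) -> update.withStopTimes(times))        // enrich with static stop times data
 .map(TripUpdate::withTimestamps)                           // transform hours into timestamps
 .mapUsingIMap("stops", TripUpdate::getStopId,
     (update, stop) -> update.withLocation(stop))           // enrich with static location data
 .drainTo(Sinks.map("positions"));                          // sink: Hazelcast IMDG map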
52. @nicolas_frankel
Recap
• Streaming has a lot of benefits
• Leverage Open Data
• It’s the Wild West out there
• No standards
• Real-world data sucks!
• But you can get cool stuff done
Real-time (latency-sensitive) operations combined with analytics:
• Count usages per credit card in the last 10 seconds; flag fraud if > 10
• Real-time querying
• Prediction based on analytics: fraud detection run overnight has low value
• Complex event processing: pattern detection (if A and B -> C); the SPE runs this at scale
• Valuable: IoT support; machine analytics/predictions fit into AI
Without streaming? Stream processing makes use of a multi-processor, multi-node runtime and minimizes costs: data shuffling, context switching.
Jet does distributed parallel processing:
1/ Build an execution plan
2/ Execute it in parallel
How can computation be parallelized?
• Task parallelism: make use of multiprocessor machines; continuous tasks run in parallel and exchange data (MapReduce has just two steps, while stream processing uses a multi-processor, multi-node runtime and minimizes costs such as data shuffling and context switching)
• Data parallelism: distribute data partitions among available resources; the DAG is deployed to all cluster members = more cores
What can be parallelized?
• Source: if partitioned, it can be read in parallel
• Map/filter: can run in parallel
Extending the edges: shuffling/moving data is expensive, so Jet keeps data local, and even reads it locally when co-located with the data source.