Modern event-based/streaming distributed systems embrace the idea that change is inevitable and actually desirable! Without being change-aware, systems are inflexible, can’t evolve or react, and are simply incapable of keeping up with real-time real-world data. But how can we speed up an “Elephant” (PostgreSQL) to be as fast as a “Cheetah” (Kafka)? In this talk, we'll introduce the Debezium PostgreSQL Connector, and explain how to deploy, configure and run it on a Kafka Connect cluster, explore the semantics and format of the change data events (including Schemas and Table/Topic mapping), and test the performance. Finally, we'll show how to stream the change data events into an example downstream system, Elasticsearch, using an open source sink connector.
Presentation for PostgresConf.CN and PGConf.Asia 2021 https://www.highgo.ca/2022/01/19/2021-pg-asia-conference-delivered-another-successful-online-conference-again/
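As a rough illustration of the configuration, table/topic mapping, and change event semantics mentioned in the abstract, here is a minimal sketch. The property names follow the Debezium PostgreSQL connector documentation (Debezium 2.x naming, where `topic.prefix` replaced the older `database.server.name`). The connector name, host, credentials, database, table, and the helper functions `topic_for` and `apply_change` are hypothetical and are not taken from the talk. The events show only the envelope fields `op`, `before` and `after`; real events also carry source metadata and, optionally, a schema.

```python
import json

# Hypothetical connection details; only the property names are Debezium's.
connector = {
    "name": "pg-cdc-demo",  # hypothetical connector name
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "plugin.name": "pgoutput",           # logical decoding plugin
        "database.hostname": "localhost",
        "database.port": "5432",
        "database.user": "postgres",
        "database.password": "change-me",
        "database.dbname": "shop",
        "topic.prefix": "pg1",               # logical server name
        "table.include.list": "public.orders",
    },
}


def topic_for(prefix, schema, table):
    """Default topic name for a captured table: prefix.schema.table."""
    return f"{prefix}.{schema}.{table}"


def apply_change(replica, event, key="id"):
    """Apply one simplified change event to an in-memory copy of the table."""
    op = event["op"]
    if op in ("c", "r", "u"):      # create, snapshot read, update
        row = event["after"]
        replica[row[key]] = row
    elif op == "d":                # delete: only the "before" image is set
        replica.pop(event["before"][key], None)
    return replica


# Simplified, hand-written events for illustration only.
events = [
    {"op": "c", "before": None, "after": {"id": 1, "status": "new"}},
    {"op": "c", "before": None, "after": {"id": 2, "status": "new"}},
    {"op": "u", "before": {"id": 2, "status": "new"},
     "after": {"id": 2, "status": "paid"}},
    {"op": "d", "before": {"id": 1, "status": "new"}, "after": None},
]

replica = {}
for event in events:
    apply_change(replica, event)

print(json.dumps(connector, indent=2))
print(topic_for("pg1", "public", "orders"))
print(replica)
```

Replaying the events in order is what a downstream sink effectively does: the final state depends on applying creates, updates and deletes in the order they were captured.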
Presentation for the July 2018 @medianetlab meetup at NCSR "Demokritos"
Related blog post can be found here: https://medianetlab.gr/mnlab-meetup-kubernetes/
and the video: https://www.youtube.com/watch?v=l2ce5U9bh6M
The Impact of Hardware and Software Version Changes on Apache Kafka Performan... - Paul Brebner
Apache Kafka's performance and scalability can be impacted by both hardware and software dimensions. In this presentation, we explore two recent experiences from running a managed Kafka service.
The first example recounts our experiences with running Kafka on AWS's Graviton2 (ARM) instances. We performed extensive benchmarking but didn't initially see the expected performance benefits. We developed multiple hypotheses to explain the unrealized performance improvement, but we could not experimentally determine the cause. We then profiled the Kafka application, and after identifying and confirming a likely cause, we found a workaround and obtained the hoped-for improved price/performance.
The second example explores the ability of Kafka to scale with increasing partitions. We revisit our previous benchmarking experiments with the newest version of Kafka (3.X), which has the option to replace Zookeeper with the new KRaft protocol. We test the theory that Kafka with KRaft can 'scale to millions of partitions' and also provide valuable experimental feedback on how close KRaft is to being production-ready.
Presentation for the ApacheCon NA Performance Engineering Track, October 6, 2022, Sheraton Hotel, New Orleans.
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa... - Helena Edelson
O'Reilly webcast with Evan Chan and me on the new SNACK Stack (a play on SMACK) with FiloDB: Scala, Spark Streaming, Akka, Cassandra, FiloDB and Kafka.
Streaming Data Analytics with ksqlDB and Superset | Robert Stolz, Preset - HostedbyConfluent
Streaming data systems have been growing rapidly in importance to the modern data stack. Kafka’s ksqlDB provides an interface for analytic tools that speak SQL. Apache Superset, the most popular modern open-source visualization and analytics solution, plugs into nearly any data source that speaks SQL, including Kafka. Here, we review and compare methods for connecting Kafka to Superset to enable streaming analytics use cases including anomaly detection, operational monitoring, and online data integration.
Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes.
Enhancing Apache Kafka for Large Scale Real-Time Data Pipeline at Tencent | K... - HostedbyConfluent
In this session we share our experience of building real-time data pipelines at Tencent PCG, including one that handles 20 trillion daily messages with 700 clusters and 100 Gb/s bursting traffic from a single app. We discuss our roadmap for enhancing Kafka to break its limits in terms of scalability, robustness and cost of operation.
We first built a proxy layer that aggregates physical clusters in a way that is agnostic to the clients. While this architecture solves many operational problems, it requires significant development to stay future-proof. After a retrospective with our customers and a careful study of the community's ongoing work, we then designed a region federation solution in the broker layer, which allows us to deploy clusters at a much larger scale than previously possible while providing better failure recovery and operability. We discuss how we make this development compatible with KIP-500 and KIP-405, and the two KIPs (693, 694) that we submitted for discussion.
Pulsar in the Lakehouse: Apache Pulsar™ with Apache Spark™ and Delta Lake - P... - StreamNative
In this session, we provide an overview of the “Lakehouse” architecture and how Apache Pulsar™ can support it through integrations with Apache Spark™ and Delta Lake to build a reliable data lake. We will also discuss the current state of the Pulsar + Spark and Delta Lake connectors, cover real-world use cases, and present the roadmap for future integrations between the Spark, Delta Lake, and Pulsar communities.
A noETL Parallel Streaming Transformation Loader using Spark, Kafka & Vertica - Data Con LA
ETL, ELT and Lambda architectures have evolved into a [non]Streaming general purpose data ingestion pipeline, that is scalable through distributed processing, for Big Data Analytics over hybrid Data Warehouses in Hadoop and MPP Columnar stores like HPE-Vertica.
Bio: Jack Gudenkauf (https://www.linkedin.com/in/jackglinkedin) has over twenty-nine years of experience designing and implementing Internet scale distributed systems. Jack is currently the CEO & Founder of the startup BigDataInfra. He was previously VP of Big Data at Playtika and a hands-on manager of the Twitter Analytics Data Warehouse team, and spent 15 years at Microsoft shipping 15 products. Prior to Microsoft, he managed his own consulting company, after beginning his career as an MIS Director of several startup companies.
Flight on Zeppelin with Apache Spark™ and Cassandra™ - Alex Ott
Presentation for Cassandra Day Russia (29.05.2020) about Apache Zeppelin and its use with Apache Cassandra & Apache Spark. It also contains information on using Spark + Cassandra.
The examples used are available on GitHub: https://github.com/alexott/zeppelin-demos/tree/master/cassandra-day-russia
Video: https://www.youtube.com/watch?v=9xyiNIlr-ws
Sanger, upcoming Openstack for Bio-informaticians - Peter Clapham
Delivery of a new Bio-informatics infrastructure at the Wellcome Trust Sanger Center. We include how to programmatically create, manage and provide provenance for images used both at Sanger and elsewhere using open source tools and continuous integration.
Kafka Connect is a framework that connects Kafka with external systems. It helps move data into and out of Kafka. Connect makes it simple to use existing connector configurations for common source and sink connectors.
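To make the idea of reusing a connector configuration concrete, the sketch below builds (but does not send) the HTTP request that registers a connector through the Kafka Connect REST API, which accepts a `POST` to `/connectors` with a JSON body containing `name` and `config`. The worker URL, the connector name, the class `org.example.DemoSinkConnector`, the topic, and the helper `build_create_request` are all hypothetical placeholders, so treat this as an assumption-laden outline rather than a deployment recipe.

```python
import json
import urllib.request


def build_create_request(connect_url, name, config):
    """Build the POST /connectors request for a Kafka Connect worker."""
    body = json.dumps({"name": name, "config": config}).encode("utf-8")
    return urllib.request.Request(
        url=f"{connect_url.rstrip('/')}/connectors",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical sink connector configuration.
request = build_create_request(
    "http://localhost:8083",  # assumed worker address; 8083 is the usual default port
    "demo-sink",
    {
        "connector.class": "org.example.DemoSinkConnector",  # placeholder class
        "topics": "demo-topic",
        "tasks.max": "1",
    },
)

print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# Sending it would be: urllib.request.urlopen(request)
```

The same request shape works for source and sink connectors alike; only the `config` map differs.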
Creating Connector to Bridge the Worlds of Kafka and gRPC at Wework (Anoop Di... - confluent
What do you do when you've two different technologies on the upstream and the downstream that are both rapidly being adopted industrywide? How do you bridge them scalably and robustly? At Wework, the upstream data was being brokered by Kafka and the downstream consumers were highly scalable gRPC services. While Kafka was capable of efficiently channeling incoming events in near real-time from a variety of sensors that were used in select Wework spaces, the downstream gRPC services that were user-facing were exceptionally good at serving requests in a concurrent and robust manner. This was a formidable combination, if only there was a way to effectively bridge these two in an optimized way. Luckily, sink Connectors came to the rescue. However, there weren't any for gRPC sinks! So we wrote one.
In this talk, we will briefly cover the advantages of using Connectors and how to create new ones, and then spend time specifically on the gRPC sink Connector and its impact on Wework's data pipeline.
Kubernetes is great for deploying stateless containers, but what about the big data ecosystem? Episode 3 of our Kubernetes series covers how DC/OS enables you to connect your Kubernetes-based applications to co-located big data services.
Slides cover:
1. Why persistence is challenging in distributed architectures
2. How DC/OS helps you take advantage of the services available in the big data ecosystem
3. How to connect Kubernetes to your data services through networking
4. How Apache Flink and Apache Spark work with Kubernetes to enable real-time data processing on DC/OS
Flight on Zeppelin with Apache Spark & Cassandra - Alex Ott
English translation of the presentation done for Cassandra Day Russia about Apache Zeppelin, and how to use it with Apache Spark and Apache Cassandra.
Notebooks are available on Github: https://github.com/alexott/zeppelin-demos/
Spinning Your Drones with Cadence Workflows, Apache Kafka and TensorFlow - All Things Open
Presented at All Things Open 2023
Presented by Paul Brebner - Instaclustr (by Spot by NetApp)
Title: Spinning Your Drones with Cadence Workflows, Apache Kafka and TensorFlow
Abstract: In this talk we’ll build a Drone delivery application, and then use it to do some Machine Learning “on the fly”.
In the 1st part of the talk, we'll build a real-time Drone Delivery demonstration application using a combination of two open-source technologies: Uber’s Cadence (for stateful, scheduled, long-running workflows), and Apache Kafka (for fast streaming data).
With up to 2,000 (simulated) drones and deliveries in progress at once, this application generates a vast flow of spatio-temporal data.
In the 2nd part of the talk, we'll use this platform to explore Machine Learning (ML) over streaming and drifting Kafka data with TensorFlow to try and predict which shops will be busy in advance.
Find more info about All Things Open:
On the web: https://www.allthingsopen.org/
Twitter: https://twitter.com/AllThingsOpen
LinkedIn: https://www.linkedin.com/company/all-things-open/
Instagram: https://www.instagram.com/allthingsopen/
Facebook: https://www.facebook.com/AllThingsOpen
Mastodon: https://mastodon.social/@allthingsopen
Threads: https://www.threads.net/@allthingsopen
2023 conference: https://2023.allthingsopen.org/
Spinning Your Drones with Cadence Workflows, Apache Kafka and TensorFlow - PaulBrebner2
In this talk we’ll build a Drone delivery application, and then use it to do some Machine Learning “on the fly”.
In the 1st part of the talk, we’ll build a real-time Drone Delivery demonstration application using a combination of two open-source technologies: Uber’s Cadence (for stateful, scheduled, long-running workflows), and Apache Kafka (for fast streaming data).
With up to 2,000 (simulated) drones and deliveries in progress at once, this application generates a vast flow of spatio-temporal data.
In the 2nd part of the talk, we’ll use this platform to explore Machine Learning (ML) over streaming and drifting Kafka data with TensorFlow to try and predict which shops will be busy in advance.
Talk from All Things Open 2023 held in Raleigh, USA: https://2023.allthingsopen.org/sessions/spinning-your-drones-with-cadence-workflows-apache-kafka-and-tensorflow/
Spinning your Drones with Cadence Workflows and Apache Kafka - Paul Brebner
The rapid rise in Big Data use cases over the last decade has been accelerated by popular massively scalable open-source technologies such as Apache Cassandra® for storage, Apache Kafka® for streaming, and OpenSearch® for search. Now there’s a new member of the peloton, Cadence, for orchestration - code-based scalable fault-tolerant workflow orchestration. To illustrate the most important Cadence concepts (and more) we’ll build a realistic drone delivery service demonstration application. We’ll also explore what happens when orchestration meets choreography, and use the drone application to illustrate different ways to integrate Cadence with Apache Kafka, including reusing Kafka microservices. But how scalable is Cadence in practice? We’ll fill the sky with drones - how many drones can we get flying at once?
More Related Content
Similar to Change Data Capture (CDC) With Kafka Connect® and the Debezium PostgreSQL Source Connector
Scaling Open Source Big Data Cloud Applications is Easy/Hard - Paul Brebner
In the last decade, the development of modern horizontally scalable open-source Big Data technologies such as Apache Cassandra (for data storage), and Apache Kafka (for data streaming) enabled cost-effective, highly scalable, reliable, low-latency applications, and made these technologies increasingly ubiquitous. To enable reliable horizontal scalability, both Cassandra and Kafka utilize partitioning (for concurrency) and replication (for reliability and availability) across clustered servers. But building scalable applications isn’t as easy as just throwing more servers at the clusters, and unexpected speed humps are common. Consequently, you also need to understand the performance impact of partitions, replication, and clusters; monitor the correct metrics to have an end-to-end view of applications and clusters; conduct careful benchmarking, and scale and tune iteratively to take into account performance insights and optimizations. In this presentation, I will explore some of the performance goals, challenges, solutions, and results I discovered over the last 5 years building multiple realistic demonstration applications. The examples will include trade-offs with elastic Cassandra auto-scaling, scaling a Cassandra and Kafka anomaly detection application to 19 Billion checks per day, and building low-latency streaming data pipelines using Kafka Connect for multiple heterogeneous source and sink systems.
Invited keynote for 5th Workshop on Hot Topics in Cloud Computing Performance (HotCloudPerf 2022) https://hotcloudperf.spec.org/ at ICPE 2022 https://icpe2022.spec.org/
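The abstract above notes that Cassandra and Kafka rely on partitioning for concurrency and replication for reliability. The sketch below illustrates the general idea only: keys are hashed to a partition, and each partition is placed on several brokers. The function names, the broker names, the round-robin placement, and the use of `zlib.crc32` as the hash are all assumptions made for illustration. Kafka's default partitioner uses a murmur2 hash for keyed records, and real replica assignment is more elaborate.

```python
import zlib


def partition_for(key, num_partitions):
    """Map a record key to a partition with a stable hash (CRC32 stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def replicas_for(partition, brokers, replication_factor):
    """Place a partition's replicas on consecutive brokers, round-robin."""
    return [
        brokers[(partition + i) % len(brokers)]
        for i in range(replication_factor)
    ]


brokers = ["broker-0", "broker-1", "broker-2"]  # hypothetical broker names

for key in ["sensor-a", "sensor-b", "sensor-c"]:
    p = partition_for(key, num_partitions=6)
    print(key, "-> partition", p, "on", replicas_for(p, brokers, 2))
```

Because the hash is stable, all records with the same key land in the same partition, which preserves per-key ordering while different keys can be processed in parallel.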
OPEN Talk: Scaling Open Source Big Data Cloud Applications is Easy/Hard - Paul Brebner
DeveloperWeek Management 2022 Conference Presentation https://www.developerweek.com/global/conference/management/schedule/
In the last decade, the development of modern horizontally scalable open-source Big Data technologies such as Apache Cassandra (for data storage), and Apache Kafka (for data streaming) enabled cost-effective, highly scalable, reliable, low-latency applications, and made these technologies increasingly ubiquitous. To enable reliable horizontal scalability, both Cassandra and Kafka utilize partitioning (for concurrency) and replication (for reliability and availability) across clustered servers. But building scalable applications isn’t as easy as just throwing more servers at the clusters, and unexpected speed humps are common. Consequently, you also need to understand the performance impact of partitions, replication, and clusters; monitor the correct metrics to have an end-to-end view of applications and clusters; conduct careful benchmarking, and scale and tune iteratively to take into account performance insights and optimizations. In this presentation, I will explore some of the performance goals, challenges, solutions, and results I discovered over the last 5 years building multiple realistic demonstration applications. The examples will include trade-offs with elastic Cassandra auto-scaling, scaling a Cassandra and Kafka anomaly detection application to 19 Billion checks per day, and building low-latency streaming data pipelines using Kafka Connect for multiple heterogeneous source and sink systems.
In this Cartoon Style Visual Introduction to Apache Kafka we’re going to build a “Postal Service” to deliver party invitations to two groups, Nerds and Pugsters – find out who goes to the party. Along the way we’ll learn about Kafka Producers, Consumers, Groups, Topics, Partitions, Keys, Records, and Delivery Semantics (guaranteed delivery, and who gets what messages). We’ll also have a quick look at Streams (mail sorting) and Connectors (how mail gets delivered between post offices).
Presentation for Open Source 101 2022: https://opensource101.com/sessions/a-visual-introduction-to-apache-kafka/
Video: https://youtu.be/NUnsHFn52sE
This presentation will explore how we added location data to a scalable real-time anomaly detection application, built around Apache Kafka, and Cassandra. Kafka and Cassandra are designed for time-series data, however, it’s not so obvious how they can efficiently process spatiotemporal data (space and time). In order to find location-specific anomalies, we need ways to represent locations, to index locations, and to query locations. We explore alternative geospatial representations including: Latitude/Longitude points, Bounding Boxes, Geohashes, and go vertical with 3D representations, including 3D Geohashes. For each representation we also explore possible Cassandra implementations including: Clustering columns, Secondary indexes, Denormalized tables, and the Cassandra Lucene Index Plugin. To conclude we measure and compare the query throughput of some of the solutions, and summarise the results in terms of accuracy vs. performance to answer the question “Which geospatial data representation and Cassandra implementation is best?”
ApacheCon NA 2020 Geospatial track presentation https://www.apachecon.com/acah2020/tracks/geospatial.html
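Of the representations listed in the abstract above, the geohash is the easiest to show in a few lines. The sketch below implements the standard public geohash encoding (alternating longitude and latitude bisection, five bits per base-32 character). It is not code from the talk, and the function name `geohash_encode` is a hypothetical choice.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat, lon, precision=8):
    """Encode a latitude/longitude pair as a geohash string."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)


print(geohash_encode(57.64911, 10.40744, precision=3))
print(geohash_encode(57.64911, 10.40744, precision=8))
```

A shorter geohash is always a prefix of a longer one for the same point, so nearby locations tend to share prefixes. That prefix property is what makes geohashes usable as partition or clustering keys, at the cost of coarser accuracy for shorter hashes.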
Building a real-time data processing pipeline using Apache Kafka, Kafka Conne... - Paul Brebner
With the rapid onset of the global Covid-19 pandemic at the start of this year, the USA Centers for Disease Control and Prevention (CDC) had to quickly implement a new Covid-19-specific pipeline to collect testing data from all of the USA’s states and territories and carry out other critical steps, including integration, cleaning, checking, enrichment, analysis, and enforcing data governance and privacy. The pipeline then produces multiple consumable results for federal and public agencies. They did this in under 30 days, using Apache Kafka. In this presentation we'll build a similar (but simpler) pipeline for ingesting, integrating, indexing, searching/analysing and visualising some publicly available tidal data. We'll briefly introduce each technology and component, and walk through the steps of using Apache Kafka, Kafka Connect, Elasticsearch and Kibana to build the pipeline and visualise the results.
Grid Middleware – Principles, Practice and Potential - Paul Brebner
A presentation I gave at the UCL Computer Science department in 2004, while on leave from CSIRO and managing the UK OGSA Evaluation Project, working with Wolfgang Emmerich.
Paul Brebner, University College London, Computer Science Department Seminar: "Grid Middleware - Principles, Practice, and Potential", 1 November 2004.
The project page was still here (2020): http://sse.cs.ucl.ac.uk/UK-OGSA/
Grid middleware is easy to install, configure, secure, debug and manage acros... - Paul Brebner
A presentation I gave in 2004, while on leave from CSIRO and managing the UK OGSA Evaluation Project at the UCL Computer Science department, working with Wolfgang Emmerich, in which we "believe 6 impossible things before breakfast". This project encountered and partially solved many of the problems that Cloud computing finally solved.
Paul Brebner, Oxford University Computing Laboratory invited talk: "Grid middleware is easy to install, configure, debug and manage - across multiple sites (One can't believe impossible things)", 15 October 2004.
The project web site is still here (2020): http://sse.cs.ucl.ac.uk/UK-OGSA/
Massively Scalable Real-time Geospatial Data Processing with Apache Kafka and... - Paul Brebner
This presentation will explore how we added location data to a scalable real-time anomaly detection application, built around Apache Kafka, and Cassandra. Kafka and Cassandra are designed for time-series data, however, it’s not so obvious how they can process geospatial data. In order to find location-specific anomalies, we need a way to represent locations, index locations, and query locations. We explore alternative geospatial representations including: Latitude/Longitude points, Bounding Boxes, Geohashes, and go vertical with 3D representations, including 3D Geohashes. To conclude we measure and compare the query throughput of some of the solutions, and summarise the results in terms of accuracy vs. performance to answer the question “Which geospatial data representation and Cassandra implementation is best?”
This is a slightly shorter version of previous ones.
Google Cloud Special Edition, Sydney Data Engineering Meetup
https://www.meetup.com/Sydney-Data-Engineering-Meetup/events/269146076/
Massively Scalable Real-time Geospatial Data Processing with Apache Kafka and...Paul Brebner
This presentation will explore how we added location data to a scalable real-time anomaly detection application, built around Apache Kafka and Cassandra.
Kafka and Cassandra are designed for time-series data; however, it’s not so obvious how they can process geospatial data. In order to find location-specific anomalies, we need a way to represent locations, index locations, and query locations.
We explore alternative geospatial representations including: Latitude/Longitude points, Bounding Boxes, Geohashes, and go vertical with 3D representations, including 3D Geohashes.
For each representation we also explore possible Cassandra implementations including: Clustering columns, Secondary indexes, Denormalized tables, and the Cassandra Lucene Index Plugin.
To conclude we measure and compare the query throughput of some of the solutions, and summarise the results in terms of accuracy vs. performance to answer the question “Which geospatial data representation and Cassandra implementation is best?”
Updated version of presentation for 30 April 2020 Melbourne Distributed Meetup (online)
Melbourne Big Data Meetup Talk: Scaling a Real-Time Anomaly Detection Applica...Paul Brebner
Apache Kafka, Apache Cassandra and Kubernetes are open source big data technologies enabling applications and business operations to scale massively and rapidly. While Kafka and Cassandra underpin the data layer of the stack, providing the capability to stream, disseminate, store and retrieve data at very low latency, Kubernetes is a container orchestration technology that helps in automated application deployment and scaling of application clusters.
In this presentation, Paul will reveal how he architected a massive scale deployment of a streaming data pipeline with Kafka and Cassandra to cater to an example Anomaly detection application running on a Kubernetes cluster and generating and processing a massive number of events. Anomaly detection is a method used to detect unusual events in an event stream.
It is widely used in a range of applications such as financial fraud detection, security, threat detection, website user analytics, sensors, IoT, system health monitoring, etc. When such applications operate at massive scale generating millions or billions of events, they impose significant computational, performance and scalability challenges to anomaly detection algorithms and data layer technologies. Paul will demonstrate the scalability, performance and cost effectiveness of Apache Kafka, Cassandra and Kubernetes, with results from his experiments allowing the Anomaly detection application to scale to 19 Billion anomaly checks per day.
Melbourne Big Data Meetup, March 5 2020
https://www.eventbrite.com/e/melbourne-big-data-meetup-realtime-anomaly-detection-with-cassandra-kafka-tickets-93028445585
Massively Scalable Real-time Geospatial Data Processing with Apache Kafka and...Paul Brebner
Geospatial data makes it possible to leverage location, location, location! Geospatial data is taking off, as companies realize that just about everyone needs the benefits of geospatially aware applications. As a result there is no shortage of unique but demanding use cases in which enterprises are leveraging large-scale and fast geospatial big data processing. The data must be processed in large quantities - and quickly - to reveal hidden spatiotemporal insights vital to businesses and their end users. In the rush to tap into geospatial data, many enterprises will find that representing, indexing and querying geospatially-enriched data is more complex than they anticipated - and might bring about tradeoffs between accuracy, latency, and throughput. This presentation will explore how we added location data to a scalable real-time anomaly detection application, built around Apache Kafka and Cassandra. Kafka and Cassandra are designed for time-series data; however, it’s not so obvious how they can process geospatial data. In order to find location-specific anomalies, we need a way to represent locations, index locations, and query locations. We explore alternative geospatial representations including: Latitude/Longitude points, Bounding Boxes, Geohashes, and go vertical with 3D representations, including 3D Geohashes. To conclude we measure and compare the query throughput of some of the solutions, and summarise the results in terms of accuracy vs. performance to answer the question “Which geospatial data representation and Cassandra implementation is best?”
0b101000 years of computing: a personal timeline - decade "0", the 1980'sPaul Brebner
With the arrival of the 2020's I realised I've now been involved in Computing for 4 decades. So I probably know more about the past of Computing than I will about the future! Here's a personal timeline of hopefully interesting things from the 1980's in Computing (at Waikato University, NZ, and UNSW in Australia).
ApacheCon Berlin 2019: Kongo:Building a Scalable Streaming IoT Application us...Paul Brebner
Join me on a journey of exploration upriver with "Kongo", a scalable streaming IoT logistics demonstration application using Apache Kafka, the popular open source distributed streaming platform. Along the way you'll discover: an example logistics IoT problem domain (involving the rapid movement of thousands of goods by trucks between warehouses, with real-time checking of complex business and safety rules from sensor data); an overview of the Apache Kafka architecture and components; lessons learned from making critical Kafka application design decisions; and an example of Kafka Streams for checking truck load limits. We finish the journey by overcoming final performance challenges and shooting the rapids to scale Kongo on a production Kafka cluster.
https://aceu19.apachecon.com/session/kongo-building-scalable-streaming-iot-application-using-apache-kafka
Apache Kafka, Apache Cassandra and Kubernetes are open source big data technologies enabling applications and business operations to scale massively and rapidly. While Kafka and Cassandra underpin the data layer of the stack, providing the capability to stream, disseminate, store and retrieve data at very low latency, Kubernetes is a container orchestration technology that helps in automated application deployment and scaling of application clusters. In this presentation, we will reveal how we architected a massive scale deployment of a streaming data pipeline with Kafka and Cassandra to cater to an example Anomaly detection application running on a Kubernetes cluster and generating and processing a massive number of events. Anomaly detection is a method used to detect unusual events in an event stream. It is widely used in a range of applications such as financial fraud detection, security, threat detection, website user analytics, sensors, IoT, system health monitoring, etc. When such applications operate at massive scale generating millions or billions of events, they impose significant computational, performance and scalability challenges to anomaly detection algorithms and data layer technologies. We will demonstrate the scalability, performance and cost effectiveness of Apache Kafka, Cassandra and Kubernetes, with results from our experiments allowing the Anomaly detection application to scale to 19 Billion anomaly checks per day.
ApacheCon2019 Talk: Improving the Observability of Cassandra, Kafka and Kuber...Paul Brebner
As distributed applications grow more complex, dynamic, and massively scalable, “observability” becomes more critical. Observability is the practice of using metrics, monitoring and distributed tracing to understand how a system works. In this presentation we’ll explore two complementary Open Source technologies: Prometheus for monitoring application metrics; and OpenTracing and Jaeger for distributed tracing. We’ll discover how they improve the observability of a massively scalable Anomaly Detection system - an application which is built around Apache Cassandra and Apache Kafka for the data layers, and dynamically deployed and scaled on Kubernetes, a container orchestration technology. We will give an overview of Prometheus and OpenTracing/Jaeger, explain how the application is instrumented, and describe how Prometheus and OpenTracing are deployed and configured in a production environment running Kubernetes, to dynamically monitor the application at scale. We conclude by exploring the benefits of monitoring and tracing technologies for understanding, debugging and tuning complex dynamic distributed systems built on Kafka, Cassandra and Kubernetes, and introduce a new use case to enable Cassandra Elastic Autoscaling, by combining Prometheus alerts, Instaclustr’s Provisioning API for Dynamic Resizing, and the new Prometheus monitoring API.
How to Improve the Observability of Apache Cassandra and Kafka applications...Paul Brebner
As distributed cloud applications grow more complex, dynamic, and massively scalable, “observability” becomes more critical.
Observability is the practice of using metrics, monitoring and distributed tracing to understand how a system works.
We’ll explore two complementary Open Source technologies:
Prometheus for monitoring application metrics, and
OpenTracing and Jaeger for distributed tracing.
We’ll discover how they improve the observability of an Anomaly Detection application, deployed on AWS Kubernetes, and using Instaclustr managed Apache Cassandra and Kafka clusters.
Introducing Apache Kafka - a visual overview. Presented at the Canberra Big Data Meetup 7 February 2019. We build a Kafka "postal service" to explain the main Kafka concepts, and explain how consumers receive different messages depending on whether there's a key or not.
9. 1. The Debezium PostgreSQL Connector: Run It
• Download the Debezium PostgreSQL connector
• Deploy it:
o Upload to AWS S3 bucket
o Synchronise with Instaclustr managed Kafka Connect
o "io.debezium.connector.postgresql.PostgresConnector" will be in the list of available connectors on the console
• Configure PostgreSQL
o Set wal_level (write-ahead log level) to logical (the 3rd, non-default level; requires a server restart)
o Create a Debezium user with REPLICATION and LOGIN permissions
o These steps need PostgreSQL admin permissions
• Configure the Debezium connector and run it
o plugin.name must be set to pgoutput (overriding the default); the connector also needs the PG username/password and IP
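
The configuration steps above can be sketched roughly as follows. This is a minimal illustration, not the exact setup used in the talk: the helper names, the user name, password, IP address, database name and the logical server name "pgserver1" are all placeholder assumptions. The property keys are standard Debezium PostgreSQL connector settings, and the name/config wrapper is the shape the Kafka Connect REST API accepts when creating a connector.

```python
import json


def postgres_setup_sql(user: str, password: str) -> list:
    """SQL a PostgreSQL admin might run (hypothetical helper).

    The wal_level change only takes effect after a server restart.
    """
    return [
        "ALTER SYSTEM SET wal_level = logical;",
        f"CREATE ROLE {user} WITH REPLICATION LOGIN PASSWORD '{password}';",
    ]


def debezium_connector_request(name, host, user, password, dbname, port=5432):
    """Build the JSON body for creating the connector (hypothetical helper)."""
    return {
        "name": name,
        "config": {
            "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
            # pgoutput is built into PostgreSQL 10+, so no extra plugin install
            "plugin.name": "pgoutput",
            "database.hostname": host,
            "database.port": str(port),
            "database.user": user,
            "database.password": password,
            "database.dbname": dbname,
            # Assumed logical name, used as the topic prefix;
            # newer Debezium releases use "topic.prefix" instead.
            "database.server.name": "pgserver1",
        },
    }


if __name__ == "__main__":
    # Placeholder values only; substitute real credentials and addresses.
    for statement in postgres_setup_sql("debezium", "change-me"):
        print(statement)
    request = debezium_connector_request(
        "pg-cdc", "10.0.0.5", "debezium", "change-me", "postgres"
    )
    print(json.dumps(request, indent=2))
```

The printed JSON would then be submitted to the Kafka Connect cluster, for example by POSTing it to the cluster's /connectors endpoint or through the managed service's console.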