The document discusses the design and implementation of Spark Streaming connectors for real-time data sources like Azure Event Hubs. It covers key aspects like connecting Event Hubs to Spark Streaming, designing the connector to minimize resource usage, ensuring fault tolerance through checkpointing and recovery, and managing message offsets and processing rates in a distributed manner. The connector design addresses challenges like long-running receivers, extra resource requirements, and data loss during failures. Lessons from the initial receiver-based approach informed the design of a more efficient solution.
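The summary above turns on offset management and checkpoint-based recovery. As a rough PySpark sketch of that recovery pattern, assuming the separate Azure Event Hubs Spark connector is on the classpath, with placeholder connection strings and paths:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("eventhubs-demo").getOrCreate()

eh_conf = {
    # Placeholder; newer connector versions expect this value to be
    # encrypted via the connector's EventHubsUtils.encrypt() helper.
    "eventhubs.connectionString": "Endpoint=sb://<namespace>/;EntityPath=<hub>",
}

stream = spark.readStream.format("eventhubs").options(**eh_conf).load()

# The checkpoint location is what lets a restarted query resume from the
# last committed offsets instead of losing or re-reading data.
query = (stream.writeStream
         .format("parquet")
         .option("path", "/data/eventhubs/out")           # placeholder path
         .option("checkpointLocation", "/chk/eventhubs")  # placeholder path
         .start())
query.awaitTermination()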
Using FLiP with InfluxDB for Edge AI IoT at Scale 2022 - Timothy Spann
https://adtmag.com/webcasts/2021/12/influxdata-february-10.aspx?tc=page0
FLiP Stack (Apache Flink, Apache Pulsar, Apache NiFi, Apache Spark) with InfluxDB for Edge AI and IoT workloads at scale
Tim Spann
Developer Advocate
StreamNative
datainmotion.dev
StreamNative FLiP into ScyllaDB - Scylla Summit 2022 - Timothy Spann
Utilizing Apache Pulsar with Apache NiFi, Apache Flink, Apache Spark and Scylla for fast IoT applications with MQTT and beyond.
This document discusses Spotify's migration of data pipelines to Docker. It provides background on Spotify growing from 50 to 1000 engineers and the challenges of scaling their big data infrastructure. Spotify adopted Docker to help solve packaging and dependency issues, moving pipelines from cron jobs to a REST API and Docker images. Docker is allowing Spotify to transparently migrate their on-premise Hadoop cluster to Google Cloud, handling over 100 petabytes of data and growing.
Using the FLaNK Stack for Edge AI (Flink, NiFi, Kafka, Kudu) - Timothy Spann
Introducing the FLaNK stack, which combines Apache Flink, Apache NiFi, Apache Kafka and Apache Kudu to build fast applications for IoT, AI and rapid ingest.
FLaNK provides a quick set of tools to build applications at any scale for any streaming and IoT use case.
https://www.flankstack.dev/
Tools
Apache Flink, Apache Kafka, Apache NiFi, MiNiFi, Apache MXNet, Apache Kudu, Apache Impala, Apache HDFS
References
https://www.datainmotion.dev/2019/08/rapid-iot-development-with-cloudera.html
https://www.datainmotion.dev/2019/09/powering-edge-ai-for-sensor-reading.html
https://www.datainmotion.dev/2019/05/dataworks-summit-dc-2019-report.html
https://www.datainmotion.dev/2019/03/using-raspberry-pi-3b-with-apache-nifi.html
Track
Community and Industry Impact
PortoTechHub - Hail Hydrate! From Stream to Lake with Apache Pulsar and Friends - Timothy Spann
This document provides an overview and summary of Apache Pulsar, a distributed streaming and messaging platform. It discusses Pulsar's benefits like data durability, scalability, geo-replication and multi-tenancy. It outlines key use cases like message queuing and data streaming. The document also summarizes Pulsar's architecture, subscription modes, connectors, and integration with other technologies like Apache Flink, Apache NiFi and MQTT. It highlights real-world customer implementations and provides demos of ingesting IoT data via Pulsar.
This document provides an introduction and overview of Apache NiFi 1.11.4. It discusses new features such as improved support for partitions in Azure Event Hubs, encrypted repositories, class loader isolation, and support for IBM MQ and the Hortonworks Schema Registry. It also summarizes new reporting tasks, controller services, and processors. Additional features include JDK 11 support and parameter improvements to support CI/CD. The document provides examples of using NiFi with Docker, Kubernetes, and in the cloud. It concludes with useful links for additional NiFi resources.
Using the FLiPN Stack for Edge AI (Flink, NiFi, Pulsar) - Timothy Spann
The document summarizes a presentation about using the FLiPN stack (Flink, NiFi, Pulsar) for edge AI. It discusses the key components - Apache Flink for stream processing, Apache Pulsar for messaging and streaming, and Apache NiFi for dataflow. It provides an overview of their features and benefits. It also demonstrates integrating these technologies with edge devices like NVIDIA Jetson boards and deploying the streaming pipelines to StreamNative Cloud.
Live Demo Jam Expands: The Leading-Edge Streaming Data Platform with NiFi, Kafka, and Flink
Timothy Spann
Twitter - @PaasDev // Blog: www.datainmotion.dev
Frequent speaker at major conferences and events.
Principal DataFlow Field Engineer focused on streaming with Apache NiFi, NiFi Registry, MiNiFi, Kafka, Kafka Connect, Kafka Streams, Flink, Flink SQL, SMM, SRM, SR and EFM.
Previously at E&Y, HPE, Pivotal & Hortonworks
Question #1
What is the most difficult part of an Edge Flow?
Gateway Agent
Edge Data Collection
Processing Data
https://github.com/tspannhw/DemoJam2021
https://github.com/tspannhw/CloudDemo2021
ApacheCon 2021: Apache NiFi 101 - Introduction and Best Practices
Thursday 14:10 UTC
Apache NiFi 101: Introduction and Best Practices
Timothy Spann
In this talk, we will walk step by step through Apache NiFi from the first load to first application. I will include slides, articles and examples to take away as a Quick Start to utilizing Apache NiFi in your real-time dataflows. I will help you get up and running locally on your laptop or in Docker.
DZone Zone Leader and Big Data MVB
@PaasDev
https://github.com/tspannhw https://www.datainmotion.dev/
https://github.com/tspannhw/SpeakerProfile
https://dev.to/tspannhw
https://sessionize.com/tspann/
https://www.slideshare.net/bunkertor
Since April 2016, Spark-as-a-service has been available to researchers in Sweden from the Swedish ICT SICS Data Center at www.hops.site. Researchers work in an entirely UI-driven environment on a platform built with only open-source software.
Spark applications can be either deployed as jobs (batch or streaming) or written and run directly from Apache Zeppelin. Spark applications are run within a project on a YARN cluster with the novel property that Spark applications are metered and charged to projects. Projects are also securely isolated from each other and include support for project-specific Kafka topics. That is, Kafka topics are protected from access by users that are not members of the project. In this talk we will discuss the challenges in building multi-tenant Spark streaming applications on YARN that are metered and easy-to-debug. We show how we use the ELK stack (Elasticsearch, Logstash, and Kibana) for logging and debugging running Spark streaming applications, how we use Grafana and Graphite for monitoring Spark streaming applications, and how users can debug and optimize terminated Spark Streaming jobs using Dr Elephant. We will also discuss the experiences of our users (over 120 users as of Sept 2016): how they manage their Kafka topics and quotas, patterns for how users share topics between projects, and our novel solutions for helping researchers debug and optimize Spark applications.
To conclude, we will also give an overview on our course ID2223 on Large Scale Learning and Deep Learning, in which 60 students designed and ran SparkML applications on the platform.
Cloud Operations with Streaming Analytics using Apache NiFi and Apache Flink - DataWorks Summit
A cloud deployment produces a huge amount of information that could be used to gain better situational awareness and operate it more efficiently. Tools such as the ones provided by the Apache Software Foundation can be used to build a solution to that challenge.
Nowadays cloud deployments are pervasive in businesses, with scalability and multi-tenancy as their core capabilities. This means that these deployments can easily grow beyond 1000 nodes, and efficient operation of these huge clusters requires real-time analysis of logs, metrics, events and configuration data. Performing correlation and finding patterns, not just to get to root causes but also to predict failures and reduce risk, requires tools that go beyond current solutions.
In the prototype developed by Red Hat and KEEDIO (keedio.com), we managed to address the above challenges with Big Data tools like Apache NiFi, Apache Kafka and Apache Flink, which enabled us to process the constant stream of syslog messages (RFC 5424) produced by the Infrastructure as a Service provided by OpenStack services, and also to detect common failure patterns that could arise and generate alerts as needed.
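The prototype itself wires this up with NiFi, Kafka and Flink; purely to illustrate what an RFC 5424 header carries, here is a small Python sketch that parses the fixed header fields (structured data and edge cases are ignored):

import re

# RFC 5424 header: <PRI>VERSION TIMESTAMP HOSTNAME APP-NAME PROCID MSGID ...
RFC5424 = re.compile(
    r"^<(?P<pri>\d{1,3})>(?P<version>\d) (?P<timestamp>\S+) (?P<hostname>\S+) "
    r"(?P<appname>\S+) (?P<procid>\S+) (?P<msgid>\S+) (?P<rest>.*)$"
)

def parse(line):
    m = RFC5424.match(line)
    return m.groupdict() if m else {}

print(parse("<165>1 2017-06-01T22:14:15.003Z host01 nova - ID47 - compute node down"))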
This session is an (Intermediate) talk in our Apache NiFi and Data Science track. It focuses on Apache Flink, Apache NiFi, Apache Kafka and is geared towards Architect, Data Scientist, Data Analyst, Developer / Engineer audiences.
Speaker
Miguel Perez Colino, Senior Design Product Manager, Red Hat
Suneel Marthi, Senior Principal Engineer, Red Hat
DBCC 2021 - FLiP Stack for Cloud Data Lakes - Timothy Spann
With Apache Pulsar, Apache NiFi, Apache Flink. The FLiP(N) Stack for Event processing and IoT. With StreamNative Cloud.
DBCC International – Friday 15.10.2021
Powered by Apache Pulsar, StreamNative provides a cloud-native, real-time messaging and streaming platform to support multi-cloud and hybrid cloud strategies.
OSACon 2021 - Hello Hydrate! From Stream to ClickHouse with Apache Pulsar and... - Timothy Spann
This document provides an overview and introduction to Apache Pulsar and StreamNative. Some key points:
- Apache Pulsar is an open-source distributed messaging and streaming platform built for cloud-native applications. It provides features like data durability, scalability, geo-replication, and multi-tenancy.
- StreamNative helps companies adopt Pulsar for use cases like building microservices, capturing real-time data, and cloud migrations. They provide commercial support for Pulsar through products like StreamNative Cloud.
- The document discusses how Pulsar works, its key capabilities and milestones, and reference architectures for using it with tools like Apache Flink and ClickHouse for unified messaging, streaming
Codeless Pipelines with Pulsar and Flink - Timothy Spann
This document summarizes Tim Spann's presentation on codeless pipelines with Apache Pulsar and Apache Flink. The presentation discusses how StreamNative's platform uses Pulsar and Flink to enable end-to-end streaming data pipelines without code. It provides an overview of Pulsar's capabilities for messaging, stream processing, and integration with other Apache projects like Kafka, NiFi and Flink. Examples are given of ingesting IoT data into Pulsar and running real-time analytics on the data using Flink SQL.
Druid is an open-source analytics data store specially designed to execute OLAP queries on event data. Its speed, scalability and efficiency have made it a popular choice to power user-facing analytic applications, including multiple BI tools and dashboards. However, Druid does not provide important features requested by many of these applications, such as a SQL interface or support for complex operations such as joins. This talk presents our work on extending Druid indexing and querying capabilities using Apache Hive. In particular, our solution allows to index complex query results in Druid using Hive, query Druid data sources from Hive using SQL, and execute complex Hive queries on top of Druid data sources. We describe how we built an extension that brings benefits to both systems alike, leveraging Apache Calcite to overcome the challenge of transparently generating Druid JSON queries from the input Hive SQL queries. We conclude with a demo highlighting the performant and powerful integration of these projects.
The Power of Intelligent Flows: Real-Time IoT Botnet Classification with Apac... - DataWorks Summit
The last 5 years have been marked by an explosion of Internet-connected devices. From cars to solar power, from TVs to juice makers, modern life is filled with interconnected smart devices.
But while those ubiquitous devices enhance the interaction with the technology that surrounds us, the lifecycle management of IoT firmware and poor security design choices still present a significant threat to our daily lives.
Despite the ascent of threats like the Mirai botnet, the amount of published research on how to programmatically detect new IoT devices in the wild has been somewhat limited.
In this presentation we introduce Data Engineering in the context of cyber security, discuss why it is important to move away from the view that security log pipelines are enrichment and indicator matching tools, and push the boundaries of “Simple Event Processing” to demonstrate how Apache NiFi and Apache MiNiFi’s feature rich dataflows can be used to dynamically identify new IoT botnet activities in the wild.
Speakers
Andre Fucs De Miranda, Independent Consultant, Fluenda
Andy LoPresto, Sr. Member of Technical Staff, Hortonworks
Fluentd is an open source log collector that allows flexible collection and routing of log data. It uses JSON format for log messages and supports many input and output plugins. Fluentd can collect logs from files, network services, and applications before routing them to storage and analysis services like MongoDB, HDFS, and Treasure Data. The open source project has grown a large community contributing over 100 plugins to make log collection and processing easier.
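As a small illustration of the JSON-based collection model, a hedged sketch using the fluent-logger Python package (the tag, host and record fields below are placeholders):

from fluent import sender

# Forward one structured JSON record to a local Fluentd agent.
logger = sender.FluentSender("app", host="localhost", port=24224)
logger.emit("login", {"user": "alice", "ip": "10.0.0.1"})
logger.close()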
Matt Franklin - Apache Software (Geekfest) - W2O Group
The document discusses the potential benefits of container technologies like Docker. It notes that containers offer significantly higher density than virtual machines by avoiding hypervisor overhead. This density improvement can lead to major cost reductions by reducing infrastructure needs. Containers also improve developer efficiency by making development environments portable and disposable. This allows more rapid experimentation and innovation, potentially translating to increased revenue. Technologies like Amazon Lambda take the on-demand aspects of containers even further by abstracting compute resources. The document promotes StackEngine as a solution for managing containers at scale in production environments.
This document provides an overview of Apache Flink and discusses why it is suitable for real-world streaming analytics. The document contains an agenda that covers how Flink is a multi-purpose big data analytics framework, why streaming analytics are emerging, why Flink is suitable for real-world streaming analytics, novel use cases enabled by Flink, who is using Flink, and where to go from here. Key points include Flink innovations like custom memory management, its DataSet API, rich windowing semantics, and native iterative processing. Flink's streaming features that make it suitable for real-world use include its pipelined processing engine, stream abstraction, performance, windowing support, fault tolerance, and integration with Hadoop.
Across the globe energy systems are changing, creating unprecedented challenges for the organisations tasked with ensuring the lights stay on. In the UK, National Grid is facing shrinking margins, looming capacity shortages and unpredictable peaks and troughs in energy supply caused by increasing levels of renewable penetration. Open Energi uses its IoT technology to unlock demand-side capacity - from industrial equipment, co-generation and battery storage systems - creating a smarter grid; one that is cleaner, cheaper, more secure and more efficient.
I'll talk about how we use Apache NiFi to orchestrate and coordinate Machine Learning microservices that operate on streams of data coming from IoT devices, providing a layer of fault-tolerance and traceability. With built-in retry logic, backpressure and clustering, NiFi helps us keep hard problems away from our code. It comes with processors that integrate with our cloud provider of choice (Microsoft Azure), fitting seamlessly into our processing pipeline. Finally, its straightforward graphical interface makes it easy enough to use that any team member can step in and troubleshoot a flow with little training.
ApacheCon 2021: Cracking the Nut with Apache Pulsar (FLiP)
by Timothy Spann
Wednesday 17:10 UTC - Cracking the Nut, Solving Edge AI with Apache Tools and Frameworks
Today, data is being generated from devices and containers living at the edge of networks, clouds and data centers. We need to run business logic, analytics and deep learning at the edge before we start our real-time streaming flows. Fortunately, using the all-Apache FLiP stack, we can do this with ease! Streaming AI-powered analytics from the edge to the data center is now a simple use case. With MiNiFi we can ingest the data, run data checks and cleansing, run machine learning and deep learning models, and route our data in real time to Apache NiFi and Apache Pulsar for further transformations and processing. Apache Flink will provide our advanced streaming capabilities, fed in real time via Apache Kafka topics. Apache MXNet models will run both at the edge and in our data centers via Apache NiFi and MiNiFi. Our final data will be stored in various Apache datastores, with event-driven microservices running as Apache Pulsar Functions.
Tools:
Apache Flink, Apache Pulsar, Apache NiFi, MiNiFi, Apache MXNet
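For the edge-to-broker hop described above, a minimal sketch with the Python pulsar-client library (the broker URL, topic and payload fields are placeholders):

import json
import pulsar

# Connect to a Pulsar broker and publish one sensor reading as JSON bytes.
client = pulsar.Client("pulsar://localhost:6650")
producer = client.create_producer("persistent://public/default/iot-sensors")

reading = {"device": "jetson-01", "cputemp": 42.5, "top1": "person"}
producer.send(json.dumps(reading).encode("utf-8"))

client.close()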
Scenic City Summit (2021): Real-Time Streaming in Any and All Clouds, Hybrid and Beyond - Timothy Spann
24-September-2021. Scenic City Summit. Virtual. Real-Time Streaming in Any and All Clouds, Hybrid and Beyond
Apache Pulsar, Apache NiFi, Apache Flink
StreamNative
Tim Spann
https://sceniccitysummit.com/
JustGiving – Serverless Data Pipelines, API, Messaging and Stream Processing - Luis Gonzalez
What to Expect from the Session
• Recap of some AWS services
• Event-driven data platform at JustGiving
• Serverless computing
• Six serverless patterns
• Serverless recommendations and best practices
This document discusses strategies for building large-scale stream infrastructures across multiple data centers using Apache Kafka. It outlines common multi-data center patterns like stretched clusters, active/passive clusters, and active/active clusters. It also covers challenges like maintaining ordering and consumer offsets across data centers and potential solutions.
Spark and MapR Streams: A Motivating Example - Ian Downard
Businesses are discovering the untapped potential of large datasets and data streams through the use of technologies for big data processing and storage. By leveraging these assets they're creating a new generation of applications that derive value from data they used to throw away. In this presentation Ian Downard shows how to build operational environments for these types of applications with the MapR Converged Data Platform, and he describes examples of next-generation applications that use Java APIs for MapR Streams, Apache Spark, Apache Hive, and MapR-DB. He shows how these technologies can be used to join and transform unbounded datasets to find signals and derive new data streams for a financial scenario involving real-time algorithmic trading and historical analysis using SQL. He also discusses how MapR enables you to run real-time data applications with the speed, reliability, and security you need for a production environment.
Strata+Hadoop 2017 San Jose - The Rise of Real Time: Apache Kafka and the Str... - Confluent
The move to streaming architectures from batch processing is a revolution in how companies use data. But what is the state of the union for stream processing, and what gaps remain in the technology we have? How will this technology impact the architectures and applications of the future? Jay Kreps explores the future of Apache Kafka and the stream processing ecosystem.
Data Pipelines Made Simple with Apache Kafka - Confluent
Presentation by Ewen Cheslack-Postava, Engineer, Apache Kafka Committer, Confluent
In streaming workloads, oftentimes the data produced at the source is not useful down the pipeline, or it requires some transformation to get it into usable shape. Similarly, where sensitive data is concerned, filtering of topics is helpful to ensure that the wrong data doesn't get to the wrong place.
The newest release of Apache Kafka now offers the ability to do transformations on individual messages, making it possible to implement finer-grained transformations customized to your unique needs. In this session we'll talk about the new single message transform capabilities, how to use them to implement things like data masking and advanced partitioning, and when you'll need to use more complex tools like the Kafka Streams API instead.
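MaskField is one of the built-in single message transforms; a hedged sketch registering it through the Kafka Connect REST API from Python (the connector class, name and field below are placeholders, not a recommendation):

import requests

config = {
    "name": "masked-source",  # placeholder connector name
    "config": {
        # Placeholder source connector producing structured record values.
        "connector.class": "com.example.SomeSourceConnector",
        "topic": "masked-events",
        # Mask the "ssn" field of each record value in flight.
        "transforms": "mask",
        "transforms.mask.type": "org.apache.kafka.connect.transforms.MaskField$Value",
        "transforms.mask.fields": "ssn",
    },
}
requests.post("http://localhost:8083/connectors", json=config).raise_for_status()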
A Primer on Building Real-Time Data-Driven Products - Lars Albertsson
This document provides an overview of building real-time data products using stream processing. It discusses why stream processing is useful for providing low-latency reactions to data from 1 second to 1 hour. Key aspects covered include using a unified log to decouple producers and consumers, common stream processing building blocks like filtering and joining, and technologies like Spark Streaming, Kafka Streams, and Flink. The document also addresses challenges like out-of-order events and software bugs, and architectural patterns for handling imperfections in streams.
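One of the imperfections mentioned, out-of-order events, is commonly handled with event-time windows plus a watermark; a minimal PySpark sketch (the broker, topic and schema are placeholders):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, window
from pyspark.sql.types import StructType, StringType, TimestampType

spark = SparkSession.builder.appName("unified-log-demo").getOrCreate()

schema = StructType().add("user", StringType()).add("event_time", TimestampType())

events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")  # placeholder
          .option("subscribe", "clicks")                     # placeholder topic
          .load()
          .select(from_json(col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

# Accept events up to 10 minutes late, then count clicks per 5-minute window.
counts = (events
          .withWatermark("event_time", "10 minutes")
          .groupBy(window("event_time", "5 minutes"), "user")
          .count())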
Getting Started with Alluxio + Spark + S3 - Alluxio, Inc.
This document provides an overview of Alluxio and how it can be used with Spark and S3. It discusses how Alluxio enables data sharing between jobs, provides data resilience during application crashes, and consolidates memory usage. It then describes how to visualize the Alluxio, Spark, and S3 stack and when Alluxio would be useful. Finally, it covers setting up Alluxio version 1.1.0 with Spark 1.6.1 and accessing data through the Alluxio filesystem API by changing the URI scheme to alluxio://.
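The scheme change the summary mentions is the whole integration surface; a minimal sketch against the era-appropriate RDD API (the master address and path are placeholders, and the Alluxio client jar is assumed to be on the Spark classpath):

from pyspark import SparkContext

sc = SparkContext(appName="alluxio-demo")

# Identical to reading from s3n:// or hdfs://; only the URI scheme changes.
lines = sc.textFile("alluxio://alluxio-master:19998/data/input.txt")
print(lines.count())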
Tapjoy: Building a Real-Time Data Science Service for Mobile Advertising - SingleStore
Robin Li, Director of Data Engineering and Yohan Chin, VP Data Science at Tapjoy share how to architect the best application experience for mobile users using technologies including Apache Kafka, Apache Spark, and MemSQL.
Speaker: Robin Li - Director of Data Engineering, Tapjoy and Yohan Chin - VP Data Science, Tapjoy
Effective Spark with Alluxio at Strata+Hadoop World San Jose 2017 - Alluxio, Inc.
This document discusses using Alluxio with Spark to improve performance. Alluxio consolidates data in memory across distributed systems to enable faster data sharing between Spark jobs and frameworks. Tests show Alluxio can accelerate Spark workloads by up to 30x when reading from remote storage like S3 by serving data at memory speed. Alluxio also provides data resilience during failures and allows sharing data across jobs more easily.
How to Become a Thought Leader in Your Niche - Leslie Samuel
Are bloggers thought leaders? Here are some tips on how you can become one. Provide great value, put awesome content out there on a regular basis, and help others.
R, Scikit-Learn and Apache Spark ML - What Difference Does It Make? - Villu Ruusmann
This document discusses different machine learning frameworks like R, Scikit-Learn, LightGBM, XGBoost, and Apache Spark ML and compares their capabilities for predictive modeling tasks. It highlights differences in how each framework handles data formats, parameter tuning, model serialization, and execution. It also presents a case study predicting car prices using gradient boosted trees in various frameworks and discusses lessons learned, emphasizing that ease-of-use and integration often outweigh raw performance.
YACE (Yet Another Crossing Engine) is a financial trading application developed using Apache NiFi and QuickFIX/J. It performs continuous crossing of orders based on price/time priority and can run in an engine-free or server-less mode. Orders are placed using binary search for better performance, and they support JSON serialization. The order book is implemented with Java collections. The architecture includes NiFi for flow-based development, distributed hosting, and logging/tracing. Demo screenshots show the NiFi flow used by YACE.
Yi Wang, Tech Lead of AI Platform, Baidu, at MLconf 2017 - MLconf
Yi Wang is the tech lead of the AI Platform at Baidu. The team is a primary contributor to PaddlePaddle, the open source deep learning platform originally developed at Baidu. Before Baidu, he was a founding member of ScaledInference, a Palo Alto-based AI startup. Before that, he was a senior staff engineer at LinkedIn, engineering director of the advertising system at Tencent, and a researcher at Google.
Abstract Summary:
Fault-tolerable Deep Learning on General-purpose Clusters:
Researchers are used to running deep learning jobs on clusters. In industrial applications, AI is built on top of big data, and deep learning is only one stage of the data pipeline. That is where MPI-based clusters are not enough, and general-purpose cluster management systems are necessary to run Web servers like Nginx, log collectors like fluentd and Kafka, data processors on top of Hadoop, Spark, and Storm, and deep learning jobs that improve the quality of the Web service. This talk explains how we integrate PaddlePaddle and Kubernetes to provide an open source fault-tolerant large-scale deep learning platform.
Machine learning is overhyped nowadays. There is a strong belief that this area is exclusively for data scientists with a deep mathematical background who leverage the Python (scikit-learn, Theano, Tensorflow, etc.) or R ecosystem and use specific tools like Matlab, Octave or similar. Of course, there is a big grain of truth in this statement, but we, Java engineers, can also take the best of the machine learning universe from an applied perspective by using our native language and familiar frameworks like Apache Spark. During this introductory presentation, you will get acquainted with the simplest machine learning tasks and algorithms, like regression, classification and clustering, widen your outlook, and use Apache Spark MLlib to distinguish pop music from heavy metal and simply have fun.
Source code: https://github.com/tmatyashovsky/spark-ml-samples
Design by Yarko Filevych: http://filevych.com/
This document provides tips and best practices for optimizing Apache Spark performance and resource allocation. It discusses:
- The components of Spark including executors, drivers, and tasks
- Configuring Spark on YARN and dynamic resource allocation
- Optimizing memory usage, avoiding data skew, and reducing serialization costs
- Best practices for Spark Streaming around microbatching, fault tolerance, and performance
- Recommendations for running Spark on cloud object stores like S3
This document provides information about purchasing a 3Com ETHERLINK II TP I 8-bit ISA Ethernet controller from Launch 3 Telecom. It describes payment and shipping options, same-day shipping availability, warranty and return policies, and additional services offered like repair, maintenance contracts, and asset recovery. Contact information is provided to purchase the product or learn more about Launch 3 Telecom's services.
Inca shamans used minerals, animals, plants, and songs to heal the sick. They performed surgeries using coca or other plants as local anesthetics. They recognized the healing properties of many native herbs such as coca. They believed that illness was a bodily imbalance or a bad omen, so healing benefited the whole community. They used rituals with plants, prayers, baths, and psychoactive substances to induce healing trances.
Preliminary
Kant said that one should not speak of man's worth but of his dignity, since any value is measurable and can enter into comparative calculation. This reflection matters because it makes us notice man's transcendence beyond practical values.
Man is not subordinated to another; on the contrary, all values are subordinated to him.
The document discusses various space agency websites and research papers, including the ISRO Forbs web app, a NASA web blog organized into pages 1-3, and two NASA research papers. It lists the titles of different online resources from Indian and American space organizations without providing much detail about the content within.
Kappa Architecture on Apache Kafka and Querona: datamass.io - Piotr Czarnas
This document discusses Kappa Architecture, an alternative to Lambda Architecture for event processing. Kappa Architecture uses a single stream of events from Apache Kafka as the input, rather than separating batch and stream processing. It reads all events from Kafka and runs analytics on the full data set to enable both learning from historical events and reacting to new events. The document outlines how Kappa Architecture provides benefits like avoiding duplicate processing logic and making actionable analytics easier. It also describes how to read bounded batches of events from Kafka for analytics using tools like Apache Spark.
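Reading a bounded batch of events back out of Kafka, as the last point describes, looks roughly like this in a hedged PySpark sketch (the broker and topic are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kappa-batch").getOrCreate()

# Batch-read a bounded offset range of the same log the streaming jobs consume,
# so historical reprocessing and live processing share one source of truth.
batch = (spark.read.format("kafka")
         .option("kafka.bootstrap.servers", "broker:9092")  # placeholder
         .option("subscribe", "events")                     # placeholder topic
         .option("startingOffsets", "earliest")
         .option("endingOffsets", "latest")
         .load())

print(batch.count())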
Sector is a distributed file system that stores files on local disks of nodes without splitting files. Sphere is a parallel data processing engine that processes data locally using user-defined functions like MapReduce. Sector/Sphere is open source, supports fault tolerance through replication, and provides security through user accounts and encryption. Performance tests show Sector/Sphere outperforms Hadoop for sorting and malware analysis benchmarks by processing data locally.
Sector is a distributed file system that stores files on local disks of nodes without splitting files. Sphere is a parallel data processing engine that processes data locally using user-defined functions like MapReduce. Sector/Sphere is open source, written in C++, and provides high performance distributed storage and processing for large datasets across wide areas using techniques like UDT for fast data transfer. Experimental results show it outperforms Hadoop for certain applications by exploiting data locality.
This document provides an overview of Oracle Stream Analytics capabilities for processing fast streaming data. It discusses deployment approaches on Oracle Cloud, hybrid cloud, and on-premises. It also covers event processing techniques like pattern detection, time windows, and continuous querying enabled by Oracle Stream Analytics. Specific use cases for retail and healthcare are also presented.
Building Continuous Application with Structured Streaming and Real-Time Data ... - Databricks
This document summarizes a presentation about building a structured streaming connector for continuous applications using Azure Event Hubs as the streaming data source. It discusses key design considerations like representing offsets, implementing the getOffset and getBatch methods required by structured streaming sources, and challenges with testing asynchronous behavior. It also outlines issues contributed back to the Apache Spark community around streaming checkpoints and recovery.
Amazon Kinesis is a managed service for real-time processing of streaming big data at any scale. It allows users to create streams to ingest and process large amounts of data in real-time. Kinesis provides high durability, performance, and elasticity through features like automatic shard management and the ability to seamlessly scale streams. It also offers integration with other AWS services like S3, Redshift, and DynamoDB for storage and analytics. The document discusses various aspects of Kinesis including how to ingest and consume data, best practices, and advantages over self-managed solutions.
Spark Streaming Recipes and "Exactly Once" Semantics Revised - Michael Spector
This document discusses stream processing with Apache Spark. It begins with an overview of Spark Streaming and its advantages over other frameworks like low latency and rich APIs. It then covers core Spark Streaming concepts like windowing and achieving "exactly once" semantics through checkpointing and write ahead logs. The document presents two examples of using Spark Streaming for analytics and aggregation with transactional and snapshotted approaches. It concludes with notes on deployment with Mesos/Marathon and performance tuning Spark Streaming jobs.
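The checkpoint-based recovery that underpins those semantics looks roughly like this with the classic DStream API (the checkpoint path and socket source are placeholders; newer applications would use Structured Streaming instead):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

CHECKPOINT = "hdfs:///chk/streaming-demo"  # placeholder path

def create_context():
    sc = SparkContext(appName="recipes-demo")
    ssc = StreamingContext(sc, 5)  # 5-second batches
    ssc.checkpoint(CHECKPOINT)     # offsets and DAG metadata survive restarts
    lines = ssc.socketTextStream("localhost", 9999)  # placeholder source
    lines.count().pprint()
    return ssc

# A fresh start builds a new context; after a crash the context is rebuilt
# from the checkpoint, which is the basis of the recovery guarantees.
ssc = StreamingContext.getOrCreate(CHECKPOINT, create_context)
ssc.start()
ssc.awaitTermination()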
Getting real-time analytics for device/application/business monitoring from trillions of events and petabytes of data, as companies like Netflix, Uber, Alibaba, PayPal, eBay and Metamarkets do.
Set your Data in Motion with Confluent & Apache Kafka Tech Talk Series LME - Confluent
Confluent Platform is supporting the London Metal Exchange's Kafka Centre of Excellence across a number of projects, with the main objective of providing a reliable, resilient, scalable and overall efficient Kafka-as-a-Service model to teams across the entire London Metal Exchange estate.
This document provides an overview of the Confluent streaming platform and Apache Kafka. It discusses how streaming platforms can be used to publish, subscribe and process streams of data in real-time. It also highlights challenges with traditional architectures and how the Confluent platform addresses them by allowing data to be ingested from many sources and processed using stream processing APIs. The document also summarizes key components of the Confluent platform like Kafka Connect for streaming data between systems, the Schema Registry for ensuring compatibility, and Control Center for monitoring the platform.
Highlights and Challenges from Running Spark on Mesos in Production by Morri ... - Spark Summit
This document discusses AppsFlyer's experience running Spark on Mesos in production for retention data processing and analytics. Key points include:
- AppsFlyer processes over 30 million installs and 5 billion sessions daily for retention reporting across 18 dimensions using Spark, Mesos, and S3.
- Challenges included timeouts and errors when using Spark's S3 connectors due to the eventual consistency of S3, which was addressed by using more robust connectors and configuration options.
- A coarse-grained Mesos scheduling approach was found to be more stable than fine-grained, though it has limitations like static core allocation that future Mesos improvements may address.
- Tuning jobs for coarse-
Fast Streaming into ClickHouse with Apache Pulsar - Timothy Spann
https://github.com/tspannhw/SpeakerProfile/tree/main/2022/talks
Fast Streaming into Clickhouse with Apache Pulsar
https://github.com/tspannhw/FLiPC-FastStreamingIntoClickhouseWithApachePulsar
https://www.meetup.com/San-Francisco-Bay-Area-ClickHouse-Meetup/events/285271332/
Fast Streaming into Clickhouse with Apache Pulsar - Meetup 2022
StreamNative - Apache Pulsar - Stream to Altinity Cloud - Clickhouse
May the 4th Be With You!
04-May-2022 ClickHouse Meetup
-- Local MergeTree table holding the raw Jetson telemetry (all fields kept as String)
CREATE TABLE iotjetsonjson_local
(
uuid String,
camera String,
ipaddress String,
networktime String,
top1pct String,
top1 String,
cputemp String,
gputemp String,
gputempf String,
cputempf String,
runtime String,
host String,
filename String,
host_name String,
macaddress String,
te String,
systemtime String,
cpu String,
diskusage String,
memory String,
imageinput String
)
ENGINE = MergeTree()
PARTITION BY uuid
ORDER BY (uuid);
-- Cluster-wide Distributed table that spreads inserts across the local tables
CREATE TABLE iotjetsonjson ON CLUSTER '{cluster}' AS iotjetsonjson_local
ENGINE = Distributed('{cluster}', default, iotjetsonjson_local, rand());
-- Strongest classifications (confidence above 40%), most recent first
select uuid, top1pct, top1, gputempf, cputempf
from iotjetsonjson
where toFloat32OrZero(top1pct) > 40
order by toFloat32OrZero(top1pct) desc, systemtime desc;

-- Latest readings with full device telemetry
select uuid, systemtime, networktime, te, top1pct, top1, cputempf, gputempf, cpu, diskusage, memory, filename
from iotjetsonjson
order by systemtime desc;

-- Peak confidence and temperatures per detected class
select top1, max(toFloat32OrZero(top1pct)), max(gputempf), max(cputempf)
from iotjetsonjson
group by top1;

-- Same aggregate, ordered by peak confidence
select top1, max(toFloat32OrZero(top1pct)) as maxTop1, max(gputempf), max(cputempf)
from iotjetsonjson
group by top1
order by maxTop1;
Tim Spann
Developer Advocate
StreamNative
Independent of the source of data, the integration of event streams into an Enterprise Architecture gets more and more important in the world of sensors, social media streams and the Internet of Things. Events have to be accepted quickly and reliably, and they have to be distributed and analysed, often with many consumers or systems interested in all or part of the events. Storing such huge event streams into HDFS or a NoSQL datastore is feasible and not such a challenge anymore. But if you want to be able to react fast, with minimal latency, you cannot afford to first store the data and do the analysis/analytics later. You have to be able to include part of your analytics right after you consume the data streams. Products for doing event processing, such as Oracle Event Processing or Esper, have been available for quite a long time and used to be called Complex Event Processing (CEP). In the past few years, another family of products appeared, mostly out of the Big Data technology space, called Stream Processing or Streaming Analytics. These are mostly open source products/frameworks such as Apache Storm, Spark Streaming, Flink and Kafka Streams, as well as supporting infrastructure such as Apache Kafka. In this talk I will present the theoretical foundations for Stream Processing, discuss the core properties a Stream Processing platform should provide and highlight what differences you might find between the more traditional CEP and the more modern Stream Processing solutions.
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline... - Provectus
Apache Beam is an open source, unified model and set of language-specific SDKs for defining and executing data processing pipelines, as well as data ingestion and integration flows, supporting both batch and streaming use cases. In this presentation I will provide a general overview of Apache Beam and a programming model comparison of Apache Beam vs. Apache Spark.
In early March, Harbour IT hosted a breakfast session in conjunction with VMware – “vForum Wrap – All the best bits from VMware’s vForum 2010”.
Held in both the Norwest and Sydney offices, local customers were given a VMware update from guest speaker Bo Leksono. The presentation covered the latest VMware technology and the steps to follow on your journey to the cloud.
1) Reactive programming is a new programming paradigm that is asynchronous and non-blocking, treating data flows as event-driven streams.
2) Traditional REST APIs are synchronous and blocking with limitations on concurrent users, while reactive programming supports asynchronous operations, uses fewer threads, and enables back pressure on data streams.
3) Key aspects of reactive programming include reactive streams specifications, publishers that represent data sources, subscribers, and asynchronous non-blocking libraries like RxJava and Project Reactor that implement the specifications.
Spring Boot & Spring Cloud on Pivotal Application Service - Alexandre RomanVMware Tanzu
- The document discusses how Pivotal Cloud Foundry (PCF) helps developers run Spring applications at scale through features like the Java Buildpack, Spring deployment profiles, Spring Cloud Connector, and Spring Cloud Services for service discovery, configuration, and circuit breaking.
- It also outlines the ecosystem of services on PCF for Spring apps, including Pivotal Cloud Cache, MySQL for PCF, RabbitMQ for PCF, and Redis for PCF.
- The presentation concludes with a demo of pushing a Spring Boot app to PCF, observing logs, binding services, and using Spring Cloud features.
Spark Development Lifecycle at Workday - ApacheCon 2020Pavel Hardak
Presented by Eren Avsarogullari and Pavel Hardak (ApacheCon 2020)
https://www.linkedin.com/in/erenavsarogullari/
https://www.linkedin.com/in/pavelhardak/
Apache Spark is the backbone of Workday's Prism Analytics Platform, supporting various data processing use-cases such as Data Ingestion, Preparation(Cleaning, Transformation & Publishing) and Discovery. At Workday, we extend Spark OSS repo and build custom Spark releases covering our custom patches on the top of Spark OSS patches. Custom Spark release development introduces the challenges when supporting multiple Spark versions against to a single repo and dealing with large numbers of customers, each of which can execute their own long-running Spark Applications. When building the custom Spark releases and new Spark features, dedicated Benchmark pipeline is also important to catch performance regression by running the standard TPC-H & TPC-DS queries against to both Spark versions and monitoring Spark driver & executors' runtime behaviors before production. At deployment phase, we also follow progressive roll-out plan leveraged by Feature Toggles used to enable/disable the new Spark features at the runtime. As part of our development lifecycle, Feature Toggles help on various use cases such as selection of Spark compile-time and runtime versions, running test pipelines against to both Spark versions on the build pipeline and supporting progressive roll-out deployment when dealing with large numbers of customers and long-running Spark Applications. On the other hand, executed Spark queries' operation level runtime behaviors are important for debugging and troubleshooting. Incoming Spark release is going to introduce new SQL Rest API exposing executed queries' operation level runtime metrics and we transform them to queryable Hive tables in order to track operation level runtime behaviors per executed query. In the light of these, this session aims to cover Spark feature development lifecycle at Workday by covering custom Spark Upgrade model, Benchmark & Monitoring Pipeline and Spark Runtime Metrics Pipeline details through used patterns and technologies step by step.
Apache Spark Development Lifecycle @ Workday - ApacheCon 2020Eren Avşaroğulları
Workday uses Apache Spark as the foundational technology for its Prism Analytics product. It has developed a custom Spark upgrade model to handle upgrading Spark across its multi-tenant environment. Workday also collects runtime metrics on Spark SQL queries using a custom metrics pipeline and REST API. Future plans include upgrading to Spark 3.x and improving multi-tenancy support through a "Multiverse" deployment model.
2. About Us
Arijit Tarafdar
Software Engineer @ Azure HDInsight. Works on the Spark Streaming/Structured
Streaming service in Azure. Committee member of XGBoost@DMLC and Apache MXNet
(incubating). Spark contributor. Known as CodingCat on GitHub.
Nan Zhu
Software Engineer @ Azure HDInsight. Works on Spark/Spark Streaming on Azure.
Previously worked with other distributed platforms like DryadLINQ and MPI. Also
worked on graph coloring algorithms which were contributed to ADOL-C
(https://projects.coin-or.org/ADOL-C).
3. Real Time Data Analytics
[Diagram: continuous application architecture and the role of Spark connectors]
• Processing Engine (Spark Streaming, Structured Streaming): produces the real-time analytics results
• Continuous Data Source Control Manager: delivers real-time data to Spark at scale
• Continuous Data Source API: real-time view of data (message queue or files filtered by timestamp)
• Persistent Data Storage Layer: Blobs/Queues/Tables/Files
Not only is the size of data increasing, but also its velocity:
◦ Sensors, IoT devices, social networks and online transactions are all generating
data that needs to be monitored constantly and acted upon quickly.
4. Outline
•Recap of Spark Streaming
•Introduction to Event Hubs
•Connecting Azure Event Hubs and Spark Streaming
•Design Considerations for Spark Streaming Connector
•Contributions Back to Community
•Future Work
5. Spark Streaming - Background
[Diagram: micro batch @ t. Each stream (Stream 1 ... Stream N) is a sequence of
RDDs over time (RDD @ t, RDD @ t-1, ..., RDD @ 0), and each RDD is processed by
a set of parallel tasks (Task 1, Task 2, ..., Task L/M), one task per
partition. The batch duration generates a single RDD; a window duration spans
multiple batch durations.]
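To make the batch/window relationship concrete, here is a minimal Spark
Streaming sketch; the socket source, the durations, and the word count are
illustrative only and not part of the deck.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MicroBatchSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("micro-batch-sketch")
    // Batch duration: one RDD is generated per 5-second interval.
    val ssc = new StreamingContext(conf, Seconds(5))
    val lines = ssc.socketTextStream("localhost", 9999)
    // Window duration (a multiple of the batch duration) combines several
    // batch RDDs; the slide duration controls how often the window fires.
    val counts = lines
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKeyAndWindow(_ + _, Seconds(30), Seconds(10))
    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}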
6. Azure Event Hubs - Introduction
[Diagram: Event Hubs namespaces (Namespace 1 ... Namespace M) each contain
multiple event hubs (Event Hubs 1 ... Event Hubs L/N), and each event hub
consists of partitions (Partition 1 ... Partition J/K/P).]
8. Data Flow – Event Hubs
• Proactive message delivery
• Efficient in terms of communication cost
• Data source treated as a commit log of events
• Events are read in batches per receive() call (sketched below)
[Diagram: events flow from the Event Hubs partition (server side) into a
prefetch queue on the Event Hubs client, and from there into the streaming
application; the partition is ordered from old to new.]
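A sketch of the client-side pattern these bullets describe. The class and its
internals are hypothetical, but real Event Hubs clients expose a similar
receive(max) call that is served from an internal prefetch queue rather than
from the network.

import java.util.concurrent.LinkedBlockingQueue
import scala.collection.JavaConverters._

final class PrefetchingClient(prefetchCapacity: Int) {
  // Filled proactively by a background pump thread (not shown) that pulls
  // events from the Event Hubs partition ahead of the application's needs.
  private val prefetchQueue = new LinkedBlockingQueue[Array[Byte]](prefetchCapacity)

  /** Each receive() call hands the application a batch of already-fetched
    * events straight from the prefetch queue, keeping communication cheap. */
  def receive(maxBatchSize: Int): Seq[Array[Byte]] = {
    val batch = new java.util.ArrayList[Array[Byte]](maxBatchSize)
    prefetchQueue.drainTo(batch, maxBatchSize)
    batch.asScala.toSeq
  }
}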
9. Event Hubs – Offset Management
• Event Hubs expects offset management to be performed on the receiver side
• Spark Streaming uses a DFS-based persistent store (HDFS, ADLS, etc.)
• The offset is stored per consumer group, per partition, per event hub, per Event Hubs namespace
/* An interface to read/write offset for a given Event Hubs
   namespace/name/partition */
@SerialVersionUID(1L)
trait OffsetStore extends Serializable {
  def open(): Unit
  def write(offset: String): Unit
  def read(): String
  def close(): Unit
}
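A minimal DFS-backed sketch of the trait above, assuming a Hadoop-compatible
file system (HDFS, ADLS, ...) and one offset file per
namespace/name/consumer-group/partition; the names and path layout are
illustrative.

import java.nio.charset.StandardCharsets
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

class DfsOffsetStore(checkpointDir: String, namespace: String, name: String,
                     consumerGroup: String, partitionId: String) extends OffsetStore {
  private val path =
    new Path(s"$checkpointDir/$namespace/$name/$consumerGroup/$partitionId")
  private var fs: FileSystem = _

  override def open(): Unit = fs = path.getFileSystem(new Configuration())

  override def write(offset: String): Unit = {
    val out = fs.create(path, true) // overwrite the previously saved offset
    try out.write(offset.getBytes(StandardCharsets.UTF_8)) finally out.close()
  }

  override def read(): String = {
    if (!fs.exists(path)) return "-1" // "-1" conventionally means start of stream
    val in = fs.open(path)
    try scala.io.Source.fromInputStream(in, "UTF-8").mkString finally in.close()
  }

  override def close(): Unit = () // FileSystem instances are cached and shared
}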
19. Bridging Spark Streaming and Event Hubs WITHOUT a Receiver
How does the idea extend to other data sources (in Azure and in your IT
infrastructure)?
20. Extra Resource Requirements
Requirements in Event Hubs | Receiver-based Connection | Problems
Long-running receiver / proactive message fetching | Long-running receiver tasks | Extra resource requirements
Fault-tolerance mechanism | WAL / Spark checkpoint | Perf. issues / data loss due to Spark bugs; no recovery from code update
Client-side offset management | Offset store | Looks fine…
21. From Event Hubs to General Data Sources (1)
•Communication Pattern
• Azure Event Hubs: long-running receiver, proactive data delivery
• Kafka: receivers may start and shut down freely, passive data delivery
The most critical factor in designing a resource-efficient Spark Streaming
connector!
22. Tackling Extra Resource Requirements
Reduce resource requirements: compact data receiving and processing in the same task.
[Diagram: Azure Event Hubs (EvH-Namespace-1 / EventHub-1, partitions P1 ... PN)
feeds an EventHubsRDD whose tasks run the customized receiver logic; .map()
with user-defined lambdas yields a MapPartitionsRDD within the same Spark
tasks.]
Inspired by the Kafka Direct DStream, but more challenging here because of a
different communication pattern!
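A sketch (with hypothetical names such as EventHubsClient) of this compact
design: each task builds a short-lived client for its partition, pulls exactly
the batch's messages, and feeds the user-defined lambdas directly, with no
standing receiver.

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical client facade standing in for the real Event Hubs client API.
trait EventHubsClient { def receive(count: Int): Seq[Array[Byte]]; def close(): Unit }
object EventHubsClient { def connect(partitionId: String, fromOffset: String): EventHubsClient = ??? }

case class EventHubsPartition(index: Int, partitionId: String,
                              fromOffset: String, msgCount: Int) extends Partition

class EventHubsRDD(sc: SparkContext, ranges: Array[EventHubsPartition])
    extends RDD[Array[Byte]](sc, Nil) {

  override protected def getPartitions: Array[Partition] =
    ranges.map(p => p: Partition)

  override def compute(split: Partition, ctx: TaskContext): Iterator[Array[Byte]] = {
    val p = split.asInstanceOf[EventHubsPartition]
    // The receiver lives only as long as the task: created here, closed on completion.
    val client = EventHubsClient.connect(p.partitionId, p.fromOffset)
    ctx.addTaskCompletionListener[Unit](_ => client.close())
    client.receive(p.msgCount).iterator // bounded pull for exactly this batch
  }
}
// eventHubsRDD.map(userLambda) then yields the MapPartitionsRDD from the
// diagram, so receiving and processing share the same task.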
23. Bridging the Spark Execution Model and the Expected Communication Pattern
[Diagram: Azure Event Hubs (EvH-Namespace-1 / EventHub-1, partitions P1 ... PN)
feeds a passive message delivery layer exposing a blocking
Recv(expectedMsgNum: Int) API; on top of it, a single Spark task runs the
customized receiver logic and the user-defined lambdas (EventHubsRDD, .map(),
MapPartitionsRDD).]
The long-running/proactive receiver expected by Event Hubs
vs.
transient tasks started for each batch by Spark
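One way to realize the passive message delivery layer named in the diagram is a
blocking adapter over the proactive fetcher; this is a sketch under those
assumptions, not the connector's exact code.

import java.util.concurrent.LinkedBlockingQueue

final class PassiveDeliveryLayer(queue: LinkedBlockingQueue[Array[Byte]]) {
  /** Blocking API: return exactly expectedMsgNum messages, waiting as needed
    * while the proactive fetcher fills the queue behind the scenes. This lets
    * a transient per-batch Spark task consume from a source that expects a
    * long-running, proactively fed receiver. */
  def recv(expectedMsgNum: Int): Seq[Array[Byte]] = {
    val out = Vector.newBuilder[Array[Byte]]
    var remaining = expectedMsgNum
    while (remaining > 0) {
      out += queue.take() // blocks until the fetcher delivers a message
      remaining -= 1
    }
    out.result()
  }
}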
24. Takeaways (1)
Requirements in Event Hubs | Receiver-based Connection | Problems | Solution
Long-running receiver / proactive message fetching | Long-running receiver tasks | Extra resource requirements | Compact data receiving/processing, facilitated by passive message delivery
The communication pattern in the data source plays the key role in the
resource-efficient design of a Spark Streaming connector.
26. Fault Tolerance
Requirements in Event Hubs | Receiver-based Connection | Problems
Long-running receiver / proactive message fetching | Long-running receiver tasks | Extra resource requirements
Fault-tolerance mechanism | WAL / Spark checkpoint | Perf. issues / data loss due to Spark bugs; no recovery from code update
Client-side offset management | Offset store | Looks fine…
27. From Event Hubs to General Data Sources (2)
•Fault Tolerance
• Capability
• Guarantee graceful recovery (no data loss, resume from where you stopped,
etc.) when the application stops for any reason
• Efficiency
• Minimal impact on application performance and user deployment
28. Capability – Recover from an Unexpected Stop
[Diagram: stream L is a sequence of RDDs (RDD L-0 ... RDD L-(t-1), RDD L-t); an
unexpected application stop happens after the last checkpoint time. On
recovery, RDD L-(t-1) and RDD L-t are restored from the checkpoint, or
re-evaluated.]
29. Capability – Recover from a Planned Stop
[Diagram: stream L runs up to RDD L-(t-1), the application is stopped for an
upgrade, and it resumes with the updated implementation at RDD L-(2t).]
Spark's checkpoint mechanism serializes everything and does not recognize a
re-compiled class, so a checkpoint cannot carry an application across an
upgrade. Instead, the resumed application fetches the latest offset from the
Offset Store. Your connector shall maintain this!
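A sketch of this recovery path, reusing the OffsetStore trait from slide 9; the
makeStore factory and the per-partition layout are hypothetical.

def startingOffsets(partitionIds: Seq[String],
                    makeStore: String => OffsetStore): Map[String, String] =
  partitionIds.map { partitionId =>
    val store = makeStore(partitionId)
    store.open()
    // "-1" conventionally means start of stream; anything else resumes
    // exactly where the pre-upgrade application stopped.
    try partitionId -> store.read()
    finally store.close()
  }.toMap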
30. Efficiency - What Should Be Contained in Checkpoint Files?
• Checkpointing takes your computing resources!
• Received event data: too large
• The range of messages to be processed in each batch: small enough to persist quickly
[Diagram: Azure Event Hubs partitions (P1 ... PN) map, via the passive message
delivery layer and its blocking Recv(expectedMsgNum: Int) API, to an
EventHubsRDD and its derived MapPartitionsRDD.]
Persist this mapping relationship, i.e. use Event Hubs itself as the data backup.
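What gets persisted can be as small as this (illustrative types): the per-batch
mapping of partition to offset range, from which the exact batch can be
re-fetched from Event Hubs on recovery.

case class OffsetRange(namespace: String, eventHub: String, partitionId: String,
                       fromOffset: String, messageCount: Int)

case class BatchPlan(batchTime: Long, ranges: Seq[OffsetRange])
// Persisting a BatchPlan is cheap (a few strings and counts per partition),
// and replaying it re-creates the same batch RDD without storing the events.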
31. Efficiency - Checkpoint Cleanup
•Connectors for data sources that require client-side offset management
generate data/files for each batch
• You have to clean them up SAFELY
• Keep recovery feasible
• Coordinate with Spark's checkpoint process
• Override clearCheckpointData() in EventHubsDStream (our implementation of
DStream), as sketched below
• Triggered by batch completion
• Delete all offset records outside the remembering window
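A sketch of the override this slide describes, assuming (as for a real
connector) a package with access to DStream's package-private internals;
deleteOffsetRecordsBefore is a hypothetical helper.

package org.apache.spark.streaming.eventhubs

import scala.reflect.ClassTag
import org.apache.spark.streaming.{StreamingContext, Time}
import org.apache.spark.streaming.dstream.InputDStream

abstract class EventHubsDStream[T: ClassTag](ssc: StreamingContext)
  extends InputDStream[T](ssc) {

  /** Hypothetical helper: delete offset records written before `threshold`. */
  protected def deleteOffsetRecordsBefore(threshold: Time): Unit

  override private[streaming] def clearCheckpointData(time: Time): Unit = {
    super.clearCheckpointData(time)
    // Triggered on batch completion; keep everything inside the remember
    // window so recovery of still-replayable batches stays feasible.
    deleteOffsetRecordsBefore(time - rememberDuration)
  }
}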
32. Takeaways (2)
Requirements in Event Hubs | Receiver-based Connection | Problems | Solution
Fault-tolerance mechanism | WAL / Spark checkpoint | Perf. issues / data loss due to Spark bugs; no recovery from code update | Checkpoint the mapping relationship instead of the data; self-managed offset store; coordinated checkpoint cleanup
Fault-tolerance design is about the interaction with Spark Streaming's
checkpointing.
34. Offset Management
Requirements in Event Hubs | Receiver-based Connection | Problems
Long-running receiver / proactive message fetching | Long-running receiver tasks | Extra resource requirements
Fault-tolerance mechanism | WAL / Spark checkpoint | Data loss due to Spark bugs
Client-side offset management | Offset store | Looks fine… Is it really fine???
35. From Event Hubs to General Data Sources (3)
•Message Addressing and Rate Control
36. Message Addressing
• Why message addressing?
• When creating a client instance of the data source in a Spark task, where should it start receiving?
• Without this info, you have to replay the stream for every newly created client
[Diagram: the first data source client starts from the first message; after a
fault, or for the next batch, a newly created client has to know where to
start.]
• Design options (modeled in the sketch below):
• Xth message (X: 0, 1, 2, 3, 4, ...)
• requires server-side metadata to map the message ID to the offset in the storage system
• Actual offset
• simpler server-side design
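The addressing options above (plus the enqueue-time variant mentioned in the
speaker notes) can be captured in a small model; this is purely illustrative,
not the connector's actual types.

sealed trait StartingPoint
case object StartOfStream extends StartingPoint
case class FromOffset(offset: String) extends StartingPoint         // "actual offset" option
case class FromSequenceNumber(seqNo: Long) extends StartingPoint    // "Xth message" option
case class FromEnqueueTime(epochMillis: Long) extends StartingPoint // filter by enqueue time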
37. Rate Control
• Why rate control?
• Prevent messages from flooding the processing pipeline
• e.g. when you just start processing a queued-up data source after a long
stop, consuming all messages at once may crash your processing engine!
• Design options (see the sketch below)
• Number of messages: "consume 1000 messages in the next batch"
• assumes homogeneous processing overhead per message
• Size of messages: "receive at most 1000 bytes in the next batch"
• complicated server-side logic: the server must track the delivered size
• "larger messages take longer to process" is not always true
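A count-based rate-control sketch matching the first design option above: cap
each partition's contribution to the next batch so a queued-up source cannot
flood the pipeline. The function and parameter names are illustrative.

def nextBatchCounts(backlogPerPartition: Map[String, Long],
                    maxMessagesPerPartition: Long): Map[String, Long] =
  backlogPerPartition.map { case (partitionId, backlog) =>
    // Consume at most the configured cap, even if far more is queued up.
    partitionId -> math.min(backlog, maxMessagesPerPartition)
  }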
38. Kafka's Choice
• Message addressing: Xth message (0, 1, 2, 3, 4, ...)
• Rate control: number of messages (0, 1, 2, 3, 4, ...)
[Diagram: for each batch the driver alone decides how many messages to process
and where to start (Batch 0: messages 0 - 999; Batch 1: messages 1000 - 1999);
executors then fetch exactly those ranges from Kafka.]
39. Azure Event Hubs' Choice
• Message addressing: byte offset of messages (0, size of msg 0, size of (msg 0 + msg 1), ...)
• Rate control: number of messages (0, 1, 2, 3, 4, ...)
This combination demands a totally different connector design and
implementation!
40. Distributed Information for Rate Control and Message Addressing
[Diagram: as with Kafka, the driver plans each batch ("how many messages are to
be processed in the next batch, and where to start?" Batch 0: messages 0 - 999;
Batch 1: messages 1000 - 1999). But with Event Hubs the driver cannot answer
"what is the offset of the 1000th message?": that answer appears only on the
executor side, when a task receives the message and reads its metadata.]
Build a channel to pass information from the executors to the driver!
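One possible executor-to-driver channel is an accumulator; the sketch below
ships each partition's last-seen (partitionId, offset) pair back to the driver
once the batch's action has run. This is illustrative, not necessarily the
connector's exact mechanism.

import scala.collection.JavaConverters._
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// batch: (partitionId, offsetFromMessageMetadata) pairs as received by tasks
def collectEndOffsets(sc: SparkContext,
                      batch: RDD[(String, String)]): Map[String, String] = {
  val endOffsets = sc.collectionAccumulator[(String, String)]("partition-end-offsets")
  batch.foreachPartition { records =>
    var last: (String, String) = null
    records.foreach(r => last = r)         // remember the last message seen
    if (last != null) endOffsets.add(last) // ship it back to the driver
  }
  // Back on the driver: use these as start points when planning the next batch.
  endOffsets.value.asScala.toMap
}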
45. Takeaways (3)
• There are multiple server-side design options for message addressing and rate control
• To design and implement a Spark Streaming connector, you have to understand
which options the server side has adopted
The key is the combination!
46. Contributing Back to the Community
Failed recovery from checkpoint, caused by a multi-threading issue in the Spark
Streaming scheduler
https://issues.apache.org/jira/browse/SPARK-19280
One realistic example of its impact: you may get wrong results when you use
Kafka with reduceByWindow and recover from a failure
Data loss caused by improper post-batch-completed processing
https://issues.apache.org/jira/browse/SPARK-18905
Inconsistent behavior of Spark Streaming checkpoints
https://issues.apache.org/jira/browse/SPARK-19233
47. Summary
• The Spark Streaming connector for Azure Event Hubs enables users to perform
various types of analytics over streaming data from a fully managed,
cloud-scale message/telemetry ingestion service
• https://github.com/hdinsight/spark-eventhubs
• Design and implementation of Spark Streaming connectors
• Coordinate the execution model with the communication pattern
• Fault tolerance (Spark Streaming checkpoints vs. self-managed fault-tolerance facilities)
• Message addressing and rate control (server & connector co-design)
• Contributing back to the community
• Microsoft is the organization with the most open source contributors in 2016!
• http://www.businessinsider.com/microsoft-github-open-source-2016-9
48. If You Do Not Want to Handle This Complexity
Move to Azure HDInsight…
49. Future Work
Structured Streaming integration with Event Hubs (to be released at the end of the month)
Streaming data visualization with Power BI (alpha release)
Streaming ETL solutions on Azure HDInsight!
50. Thank You!!!
Build a powerful & robust data analytics pipeline with Spark @ Azure
HDInsight!!!
Editor's Notes
Two types of datasets
Bounded: Finite, unchanging datasets
Unbounded: Infinite datasets that are appended to continuously
Unbounded – data is generated all the time, and we want insights now
The connector is the glue between an unbounded data source like Event Hubs and a powerful processing engine like Spark
The goal is to deliver near-real-time analysis or views
Micro-batching processes a continuous, infinite data source
Batches are scheduled at a regular time interval or after a certain number of events are received
A discretized stream (DStream) is the highest-level abstraction over the continuous creation and expiration of RDDs
Batch duration – a single RDD is generated
Window duration – a multiple of the batch duration; may use multiple RDDs
RDDs contain partitions, one task per partition
High throughput and low latency, offered as a platform-as-a-service on Azure
No cluster setup required, no monitoring required
Users can concentrate only on the ingress and egress of data
An Event Hubs namespace is a collection of event hubs, an event hub is a collection of partitions, and a partition is a sequential collection of events
Up to 32 partitions per event hub, but this can be increased if required
-HTTP or AMQP with transport level security (TLS/SSL)
-HTTP has higher message transmission overhead
-AMQP has higher connection setup overhead
-Consumer group gives logical view of event hubs partitions, including addressing same partition at different offsets
-Up to 20 consumer groups per event hubs
-1 receiver per consumer group
Each partition can be viewed as a commit log
Event Hubs client maintains prefetch queue to proactively get messages from the server
Receive call by application gets messages in batch from the prefetch queue to the caller.
No support from Event Hubs server yet
Offset is managed by the Event Hubs connector at the Spark application side
Uses a distributed file system like HDFS, ADLS, etc.
Offset is stored per consumer group, per partition, per event hub, per event hub namespace
Event Hubs clients are initialized with an initial offset from which Event Hubs will start sending data
Offset is determined in one of three ways – start of stream, previously saved offset, enqueue time
- How do we bridge
Reliable receivers – received data is backed up in a reliable persistent store (WAL), so no data is lost between application restarts
Reliable receivers – the offset is saved after saving to the persistent store and pushing to the block manager
Both the executors and the driver use the WAL
On application restart, data is processed from the WAL first, up to the offset saved before the previous application stop
Receiver tasks then start the Event Hubs clients, one per partition, with the last offset saved for each partition.
Describe each parameter.
Extends the Spark-provided Receiver class with the specific type Array[Byte], which is the exact content of the user data per event.
The storage level determines whether to spill to disk when memory usage reaches capacity.
onStart() establishes connections to Event Hubs
onStop() cleans up the connections
store() reliably stores data to the block manager
restart() calls stop and then start.
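The receiver these notes describe could look roughly like the sketch below. The
EventHubsClient facade is hypothetical, while the Receiver methods (onStart,
onStop, store, isStopped, restart) are the real Spark API.

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Hypothetical client facade standing in for the real Event Hubs client API.
trait EventHubsClient { def receive(count: Int): Seq[Array[Byte]]; def close(): Unit }
object EventHubsClient { def connect(partitionId: String, fromOffset: String): EventHubsClient = ??? }

class EventHubsReceiver(partitionId: String, startOffset: String,
                        storageLevel: StorageLevel) // controls spilling to disk when memory fills
  extends Receiver[Array[Byte]](storageLevel) {

  @volatile private var client: EventHubsClient = _

  override def onStart(): Unit = {
    // Establish the connection to Event Hubs: one receiver per partition,
    // starting from the last saved offset.
    client = EventHubsClient.connect(partitionId, startOffset)
    new Thread("eventhubs-receiver") {
      override def run(): Unit =
        while (!isStopped()) {
          // store() reliably hands each event body to the block manager.
          client.receive(100).foreach(body => store(body))
        }
    }.start()
  }

  // restart() in the Receiver API simply calls onStop() and then onStart().
  override def onStop(): Unit = if (client != null) client.close()
}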