Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and Kafka | Timothy Spann
Apache NiFi, Apache Flink, Apache Kafka
Timothy Spann
Principal Developer Advocate
Cloudera
Data in Motion
https://budapestdata.hu/2023/en/speakers/timothy-spann/
Timothy Spann
Principal Developer Advocate
Cloudera (US)
LinkedIn · GitHub · datainmotion.dev
June 8 · Online · English talk
Building Modern Data Streaming Apps with NiFi, Flink and Kafka
In my session, I will show you some best practices I have discovered over the last 7 years in building data streaming applications, including IoT, CDC, logs, and more.
In my modern approach, we utilize several open-source frameworks to combine the best features of each. We often start with Apache NiFi as the orchestrator of streams flowing into Apache Kafka. From there, we build streaming ETL with Apache Flink SQL. We will stream data into Apache Iceberg.
We use the best streaming tools for the application at hand with FLaNK. See flankstack.dev.
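To make the pattern concrete, here is a minimal PyFlink sketch of the Kafka -> Flink SQL -> Iceberg leg of that pipeline. The topic name, broker address, schema, and Iceberg catalog settings are all assumptions for illustration; the NiFi side would already be routing events into the Kafka topic.

```python
# Minimal PyFlink sketch of the Kafka -> Flink SQL -> Iceberg pattern.
# Topic, broker, schema, and catalog settings are illustrative assumptions.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Kafka source: events that NiFi has already routed into a topic
t_env.execute_sql("""
    CREATE TABLE sensor_events (
        sensor_id STRING,
        temperature DOUBLE,
        event_time TIMESTAMP(3)
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'sensor-events',
        'properties.bootstrap.servers' = 'localhost:9092',
        'scan.startup.mode' = 'earliest-offset',
        'format' = 'json'
    )
""")

# Iceberg sink, here with a local Hadoop catalog for simplicity
t_env.execute_sql("""
    CREATE TABLE sensor_events_iceberg (
        sensor_id STRING,
        temperature DOUBLE,
        event_time TIMESTAMP(3)
    ) WITH (
        'connector' = 'iceberg',
        'catalog-name' = 'demo',
        'catalog-type' = 'hadoop',
        'warehouse' = 'file:///tmp/iceberg-warehouse'
    )
""")

# The streaming ETL itself: a continuous INSERT that runs until cancelled
t_env.execute_sql("""
    INSERT INTO sensor_events_iceberg
    SELECT sensor_id, temperature, event_time
    FROM sensor_events
    WHERE temperature IS NOT NULL
""")
```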
BIO
Tim Spann is a Principal Developer Advocate in Data In Motion for Cloudera. He works with Apache NiFi, Apache Pulsar, Apache Kafka, Apache Flink, Flink SQL, Apache Pinot, Trino, Apache Iceberg, Delta Lake, Apache Spark, Big Data, IoT, Cloud, AI/DL, machine learning, and deep learning. Tim has over ten years of experience with the IoT, big data, distributed computing, messaging, streaming technologies, and Java programming.
Previously, he was a Developer Advocate at StreamNative, Principal DataFlow Field Engineer at Cloudera, a Senior Solutions Engineer at Hortonworks, a Senior Solutions Architect at AirisData, a Senior Field Engineer at Pivotal and a Team Leader at HPE. He blogs for DZone, where he is the Big Data Zone leader, and runs a popular meetup in Princeton & NYC on Big Data, Cloud, IoT, deep learning, streaming, NiFi, the blockchain, and Spark. Tim is a frequent speaker at conferences such as ApacheCon, DeveloperWeek, Pulsar Summit and many more. He holds a BS and MS in computer science.
GSJUG: Mastering Data Streaming Pipelines 09May2023 | Timothy Spann
https://www.meetup.com/futureofdata-princeton/events/293233881/
This is a repost from the Garden State Java Users Group Event.
Join me at https://www.meetup.com/garden-state-java-user-group/events/293229660/
See: https://www.eventbrite.com/e/mastering-data-streaming-pipelines-tickets-627677218457?_ga=2.253257801.1787151623.1682868226-741104479.1678110925
Please note that registration via EventBrite is required to attend either in-person or online.
We are happy to announce that Tim Spann will be our special guest for the May 9, 2023 meeting!
Abstract:
In this session, Tim will show you some best practices that he has discovered over the last seven years in building data streaming applications, including IoT, CDC, logs, and more.
In his modern approach, we utilize several Apache frameworks to combine the best features of each. We often start with Apache NiFi as the orchestrator of streams flowing into Apache Kafka. From there, we build streaming ETL with Apache Flink and enhance events with NiFi enrichment. We build continuous queries against our topics with Flink SQL.
We will show where Java fits in as sources, enrichments, NiFi processors, and sinks.
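As a rough illustration of those source and sink roles (the talk itself shows the Java equivalents), here is a minimal sketch with the kafka-python client; the topic names and broker address are assumptions.

```python
# Minimal kafka-python sketch of the source and sink roles (the talk shows
# the Java equivalents). Topic names and broker address are assumptions.
import json

from kafka import KafkaConsumer, KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
# Source side: publish an event for downstream NiFi/Flink SQL processing
producer.send("iot-events", {"device": "thermostat-1", "temp_f": 71.3})
producer.flush()

# Sink side: consume results a continuous query wrote back to a topic
consumer = KafkaConsumer(
    "iot-alerts",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for record in consumer:
    print(record.value)  # e.g., hand off to a dashboard or JVM service
```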
We hope to see you on May 9!
Speaker
Timothy Spann
Tim Spann is a Principal Developer Advocate in Data In Motion for Cloudera. He works with Apache NiFi, Apache Pulsar, Apache Kafka, Apache Flink, Flink SQL, Apache Pinot, Trino, Apache Iceberg, Delta Lake, Apache Spark, Big Data, IoT, Cloud, AI/DL, machine learning, and deep learning. Tim has over ten years of experience with the IoT, big data, distributed computing, messaging, streaming technologies, and Java programming.
Previously, he was a Developer Advocate at StreamNative, Principal DataFlow Field Engineer at Cloudera, a Senior Solutions Engineer at Hortonworks, a Senior Solutions Architect at AirisData, a Senior Field Engineer at Pivotal and a Team Leader at HPE. He blogs for DZone, where he is the Big Data Zone leader, and runs a popular meetup in Princeton & NYC on Big Data, Cloud, IoT, deep learning, streaming, NiFi, the blockchain, and Spark. Tim is a frequent speaker at conferences such as ApacheCon, DeveloperWeek, Pulsar Summit and many more. He holds a BS and MS in computer science.
Best Practices for Building Hybrid-Cloud Architectures | Hans Jespersen (Confluent)
Afternoon opening presentation during Confluent’s streaming event in Paris, presented by Hans Jespersen, VP WW Systems Engineering at Confluent.
Delivering the power of data using Spring Cloud Data Flow and DataStax Enterprise | VMware Tanzu
SpringOne Platform 2017
Gilbert Lau, DataStax; Wayne Lund, Pivotal
"Spring Cloud Data Flow satisfies all of the demands of modern streaming and task workloads. A growing number of customers are viewing Pivotal Cloud Foundry as an ideal runtime for these types of workloads to take advantage of all of the microservice architecture features of Spring Boot apps leveraging Spring Cloud Services. This is only half of the equation. Once the streaming data is persisted on their database, our customers want to generate actionable insights to provide the best customer experience to stay on top of the competitive marketplace. DataStax Enterprise (DSE) is a single and unified big data platform with Apache Cassandra NoSQL database at its core. Integrated within each node of DSE is powerful indexing, search through Apache Solr, analytics through Apache Spark, and a enterprise-ready graph functionality. It is by far the only operational data platform which can scale linearly in excess of 1,000 nodes, with no single point of failure, and is capable of providing real-time active-everywhere replication across many datacenters and cloud providers.
In this presentation and demo we will take a common social data set and show SCDF advantages on PCF for microservice scaling and pipelining data into a DataStax Enterprise Cassandra NoSQL database. Then followed by extracting meaningful information through DataStax Enterprise Search, DataStax Enterprise Analytics, and DataStax Cassandra Service Broker Tile for PCF using a Spring Boot Dashboard application."
Streaming Time Series Data With Kenny Gorman and Elena Cuevas | Current 2022 (Hosted by Confluent)
Modern streaming use cases are generating massive amounts of data, much of which needs to be organized and queried over time. The sheer amount and complexity of this data present new challenges for data engineers and developers alike.
To solve this problem, Apache Kafka and MongoDB Time Series collections are a powerful combination. In this talk, Kenny Gorman and Elena Cuevas will present how Apache Kafka on Confluent Cloud can stream massive amounts of data to Time Series collections via the MongoDB Connector for Apache Kafka. Elena and Kenny will discuss the required configuration details and critical components of Confluent Cloud and MongoDB Atlas, as well as some tips, tricks, and best practices.
You will leave armed with the knowledge of how Confluent Cloud, Apache Kafka, MongoDB Atlas, and Time Series collections fit into your event-driven architecture.
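For reference, registering the MongoDB sink connector is a single call to the Kafka Connect REST API. The sketch below is hedged: the connector class name and the timeseries.timefield property come from the MongoDB Connector for Apache Kafka documentation, while the URI, topic, database, collection, and field names are assumptions.

```python
# Hedged sketch: registering the MongoDB sink connector via the Kafka
# Connect REST API. The connector class and 'timeseries.timefield' follow
# the MongoDB Connector for Apache Kafka docs; URI, topic, database,
# collection, and field names are assumptions.
import requests

connector = {
    "name": "mongodb-timeseries-sink",
    "config": {
        "connector.class": "com.mongodb.kafka.connect.MongoSinkConnector",
        "topics": "sensor-readings",
        "connection.uri": "mongodb+srv://user:pass@cluster.example.net",
        "database": "metrics",
        "collection": "readings",
        # Routes writes into a MongoDB Time Series collection keyed on 'ts'
        "timeseries.timefield": "ts",
    },
}
resp = requests.post("http://localhost:8083/connectors", json=connector)
resp.raise_for_status()
```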
PartnerSkillUp: Enable a Streaming CDC Solution | Timothy Spann
Tim Spann
Principal Developer Advocate in Data In Motion for Cloudera, Global
https://attend.cloudera.com/skillupseriesseptember14
Streaming Change Data Capture (CDC) Two Unique Ways
In this session, learn how to use Debezium with Flink, Kafka, and NiFi for Change Data Capture using two different mechanisms: Kafka Connect and Flink SQL.
With the virtual nature of today's world, streaming data is more critical than ever. Join Cloudera Chief Data-In-Motion Principal Tim Spann and Partner Solutions Engineer Salvador Alamazan as they look closely at key CDC use cases, discuss why Debezium is the best option for handling CDC, and use examples to show you how to demonstrate value.
This is a must-attend experience!
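As a hedged sketch of the Flink SQL mechanism described above, the flink-cdc 'postgres-cdc' connector (which embeds Debezium) can expose a PostgreSQL table as a changelog stream; all connection details below are assumptions.

```python
# Hedged sketch: the flink-cdc 'postgres-cdc' source (Debezium under the
# hood) turns a PostgreSQL table into a changelog stream. Connection
# details, database names, and the replication slot are assumptions.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())
t_env.execute_sql("""
    CREATE TABLE customers_cdc (
        id INT,
        name STRING,
        email STRING,
        PRIMARY KEY (id) NOT ENFORCED
    ) WITH (
        'connector' = 'postgres-cdc',
        'hostname' = 'localhost',
        'port' = '5432',
        'username' = 'debezium',
        'password' = 'dbz',
        'database-name' = 'shop',
        'schema-name' = 'public',
        'table-name' = 'customers',
        'slot.name' = 'flink_cdc_slot'
    )
""")

# Every insert, update, and delete on the Postgres table arrives as a
# changelog row in the continuous query below
t_env.execute_sql("SELECT * FROM customers_cdc").print()
```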
Zoom Webinar
September 14, 2023
10:00am–11:00am EDT
FLaNK Stack
Apache NiFi
Apache Flink
Apache Kafka
Kafka Connect
Flink SQL
Cloudera DataFlow
Cloudera SQL Stream Builder
Cloudera Streams Messaging Manager
Debezium
PostgreSQL
IBM DB2
Oracle DB
Beyond the Brokers: A Tour of the Kafka Ecosystem | Florent Ramière (Confluent)
During the Confluent streaming event in Paris, Florent Ramière, Technical Account Manager at Confluent, goes beyond the brokers, introducing a whole new ecosystem with Kafka Streams, KSQL, Kafka Connect, REST Proxy, Schema Registry, MirrorMaker, etc.
Cloud-Native Patterns for Data-Intensive Applications | VMware Tanzu
Are you interested in learning how to schedule batch jobs in container runtimes?
Maybe you’re wondering how to apply continuous delivery in practice for data-intensive applications? Perhaps you’re looking for an orchestration tool for data pipelines?
Questions like these are common, so rest assured that you’re not alone.
In this webinar, we’ll cover the recent feature improvements in Spring Cloud Data Flow. More specifically, we’ll discuss data processing use cases and how they simplify the overall orchestration experience in cloud runtimes like Cloud Foundry and Kubernetes.
Please join us and be part of the community discussion!
Presenters:
Sabby Anandan, Product Manager
Mark Pollack, Software Engineer, Pivotal
Building Real-time Travel Alerts
In this session, we will walk through how to build a complete streaming application to send alerts based on travel advisories from public data. We will also join in other relevant data sources and push out alerts.
We will show you how to build this streaming application with Apache NiFi, Apache Kafka, and Apache Flink and show you when/why/how, and what to build to maximize performance, productivity, and ease of development.
Let's get streaming.
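One way the alerting step might look as a continuous Flink SQL query (a hedged sketch; topic names, fields, and the severity threshold are assumptions):

```python
# Hedged sketch of the alerting step as a continuous Flink SQL query.
# Topic names, fields, and the severity threshold are assumptions.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

ddl = """
    CREATE TABLE {name} (
        country STRING, severity INT, summary STRING
    ) WITH (
        'connector' = 'kafka',
        'topic' = '{topic}',
        'properties.bootstrap.servers' = 'localhost:9092',
        'scan.startup.mode' = 'earliest-offset',
        'format' = 'json'
    )
"""
t_env.execute_sql(ddl.format(name="travel_advisories", topic="travel-advisories"))
t_env.execute_sql(ddl.format(name="travel_alerts", topic="travel-alerts"))

# Runs continuously: each high-severity advisory becomes an alert event
t_env.execute_sql("""
    INSERT INTO travel_alerts
    SELECT country, severity, summary
    FROM travel_advisories
    WHERE severity >= 3
""")
```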
Apache Flink
Apache Kafka
Apache NiFi
FLaNK Stack
Tim Spann
Big Data Conference Europe 2023
JConWorld: Continuous SQL with Kafka and Flink | Timothy Spann
In this talk, I will walk through how to set up and run continuous SQL queries against Kafka topics utilizing Apache Flink. We will walk through creating Kafka topics and schemas and publishing data.
We will then cover consuming Kafka data, joining Kafka topics, and inserting new events into Kafka topics as they arrive. This basic overview will show hands-on techniques, tips, and examples of how to do this.
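A minimal sketch of the kind of continuous query the talk outlines, joining two Kafka-backed tables and inserting the enriched events into a third topic as they arrive (schemas, topic names, and broker address are assumptions):

```python
# Sketch of a continuous join: enrich order events with customer data and
# write the results back to Kafka as they arrive. All names are assumptions.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

def kafka_table(name: str, topic: str, schema: str) -> None:
    # Declare a JSON-over-Kafka table; the broker address is an assumption
    t_env.execute_sql(f"""
        CREATE TABLE {name} ({schema}) WITH (
            'connector' = 'kafka',
            'topic' = '{topic}',
            'properties.bootstrap.servers' = 'localhost:9092',
            'scan.startup.mode' = 'earliest-offset',
            'format' = 'json'
        )
    """)

kafka_table("orders", "orders", "order_id STRING, customer_id INT, amount DOUBLE")
kafka_table("customers", "customers", "id INT, name STRING, region STRING")
kafka_table("enriched_orders", "enriched-orders",
            "order_id STRING, amount DOUBLE, name STRING, region STRING")

# The continuous query: new events are joined and emitted as they arrive
t_env.execute_sql("""
    INSERT INTO enriched_orders
    SELECT o.order_id, o.amount, c.name, c.region
    FROM orders AS o
    JOIN customers AS c ON o.customer_id = c.id
""")
```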
Tim Spann is the Principal Developer Advocate for Data in Motion @ Cloudera, where he works with Apache Kafka, Apache Flink, Apache NiFi, Apache Iceberg, TensorFlow, Apache Spark, big data, the IoT, machine learning, and deep learning. Tim has over a decade of experience with the IoT, big data, distributed computing, streaming technologies, and Java programming. Previously, he was a Developer Advocate at StreamNative, Principal Field Engineer at Cloudera, a Senior Solutions Architect at AirisData, and a Senior Field Engineer at Pivotal. He blogs for DZone, where he is the Big Data Zone leader, and runs a popular meetup in Princeton on big data, the IoT, deep learning, streaming, NiFi, the blockchain, and Spark. Tim is a frequent speaker at conferences such as IoT Fusion, Strata, ApacheCon, DataWorks Summit Berlin, DataWorks Summit Sydney, and Oracle Code NYC. He holds a BS and MS in computer science. https://www.datainmotion.dev/p/about-me.html https://dzone.com/users/297029/bunkertor.html
https://www.youtube.com/channel/UCDIDMDfje6jAvNE8DGkJ3_w?view_as=subscriber
Building Real-time Pipelines with FLaNK: A Case Study with Transit Data | Timothy Spann
In this session, we will explore the powerful combination of Apache Flink, Apache NiFi, and Apache Kafka for building real-time data processing pipelines. We will present a case study using the FLaNK-MTA project, which leverages these technologies to process and analyze real-time data from the New York City Metropolitan Transportation Authority (MTA). By integrating Flink, NiFi, and Kafka, FLaNK-MTA demonstrates how to efficiently collect, transform, and analyze high-volume data streams, enabling timely insights and decision-making.
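As a rough sketch of the ingest step that FLaNK-MTA automates with NiFi, the loop below polls a transit feed and publishes each record to Kafka. The feed URL is hypothetical; the real flows are in the repository linked later in this entry.

```python
# Rough sketch of the ingest step (NiFi handles this in the real project):
# poll a transit feed and publish each vehicle position to Kafka.
# The feed URL is hypothetical; see the FLaNK-MTA repository for real flows.
import json
import time

import requests
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

FEED_URL = "https://example.org/mta/bus-positions.json"  # hypothetical endpoint

while True:
    for vehicle in requests.get(FEED_URL, timeout=10).json():
        producer.send("transit-events", vehicle)  # one event per position report
    producer.flush()
    time.sleep(30)  # poll interval; tune to the feed's refresh rate
```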
Takeaways:
Understanding the integration of Apache Flink, Apache NiFi, and Apache Kafka for real-time data processing
Insights into building scalable and fault-tolerant data processing pipelines
Best practices for data collection, transformation, and analytics with FLaNK-MTA as a reference
Knowledge of use cases and potential business impact of real-time data processing pipelines
https://github.com/tspannhw/FLaNK-MTA/tree/main
https://medium.com/@tspann/finding-the-best-way-around-7491c76ca4cb
apache nifi
apache kafka
apache flink
apache iceberg
apache parquet
real-time streaming
tim spann
principal developer advocate
cloudera
datainmotion.dev
Leveraging Mainframe Data for Modern Analytics | Confluent
“The mainframe is going away” is as true now as it was 10, 20, and 30 years ago. Mainframes are still crucial in handling critical business transactions; however, they were built for an era when batch data movement was the norm, and they can be difficult to integrate into today’s data-driven, real-time, analytics-focused business processes as well as the environments that support them. Until now.
Join experts from Confluent, Attunity, and Capgemini for a one-hour online talk session where you’ll learn how to:
Unlock your mainframe data with unique change data capture (CDC) functionality without incurring the complexity and expense that come with sending ongoing queries into the mainframe database
Understand how CDC benefits advanced analytics approaches such as deep machine learning and predictive analytics
Deliver ongoing streams of data in real-time to the most demanding analytics environments
Ensure that your analytics environment includes the broadest possible range of data sources and destinations while ensuring true enterprise-grade functionality
Identify use cases that can help you get started delivering value to the business moving from POC to Pilot to Production
Most data visualisation solutions today still work on data sources that are stored persistently in a data store, using the so-called “data at rest” paradigm. More and more data sources today provide a constant stream of data, from IoT devices to social media streams. These data streams publish at high velocity, and messages often have to be processed as quickly as possible. For processing and analytics on the data, so-called stream processing solutions are available, but these provide minimal or no visualisation capabilities. One way is to first persist the data into a data store and then use a traditional data visualisation solution to present the data.
If latency is not an issue, such a solution might be good enough. Another question is which data store solution is necessary to keep up with the high read and write load. If it is not an RDBMS but a NoSQL database, then not all traditional visualisation tools will integrate with the specific data store. Another option is to use a streaming visualisation solution. These are specially built for streaming data and often do not support batch data. A much better solution would be to have one tool capable of handling both batch and streaming data. This talk presents different architecture blueprints for integrating data visualisation into a fast data solution and highlights some of the products available to implement these blueprints.
Unlocking Financial Data with Real-Time Pipelines
Tim Spann
Twitter: @PaasDev // Blog: datainmotion.dev
Principal Developer Advocate.
Princeton Future of Data Meetup.
ex-Pivotal, ex-Hortonworks, ex-StreamNative, ex-PwC
https://medium.com/@tspann
https://github.com/tspannhw
https://events.linuxfoundation.org/open-source-finance-forum-new-york/
Open Source in Finance Forum NYC
November 1, 2023
Using Apache NiFi with Apache Pulsar for Fast Data On-Ramp | Timothy Spann
https://pulsar-summit.org/event/europe-2023/schedule
https://pulsar-summit.org/event/europe-2023/sessions/europe-2023-using-apache-nifi-with-apache-pulsar-for-fast-data-on-ramp
12:30 PM - 1:00 PM CEST, May 23
Using Apache NiFi with Apache Pulsar for Fast Data On-Ramp
As the Pulsar community grows, more and more connectors will be added. To enhance the availability of sources and sinks and to make use of the greater Apache streaming community, joining forces between Apache NiFi and Apache Pulsar is a perfect fit. Apache NiFi also adds the benefits of ELT, ETL, data crunching, transformation, validation, and batch data processing. Once data is ready to be an event, NiFi can launch it into Pulsar at light speed.
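The hand-off itself is small: once NiFi has validated and transformed a record, publishing it to Pulsar is a few lines with the pulsar-client library. A minimal sketch follows; the service URL and topic are assumptions, and in practice NiFi's Pulsar processors perform this step inside the flow.

```python
# Minimal sketch of the on-ramp hand-off using the pulsar-client library.
# Service URL and topic are assumptions; in practice NiFi's Pulsar
# processors perform this publish step inside the flow.
import pulsar

client = pulsar.Client("pulsar://localhost:6650")
producer = client.create_producer("persistent://public/default/ready-events")

# A record NiFi has already validated and transformed, ready to be an event
producer.send(b'{"device":"sensor-7","status":"ok"}')

client.close()
```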
Timothy Spann
Principal Developer Advocate for Data in Motion @ Cloudera
Streaming Data Ingest and Processing with Apache Kafka | Attunity
Apache™ Kafka is a fast, scalable, durable, and fault-tolerant publish-subscribe messaging system. It offers high throughput, reliability, and replication. To manage growing data volumes, many companies are leveraging Kafka for streaming data ingest and processing.
Join experts from Confluent, the creators of Apache™ Kafka, and the experts at Attunity, a leader in data integration software, for a live webinar where you will learn how to:
-Realize the value of streaming data ingest with Kafka
-Turn databases into live feeds for streaming ingest and processing
-Accelerate data delivery to enable real-time analytics
-Reduce skill and training requirements for data ingest
The recorded webinar on slide 32 includes a demo using automation software (Attunity Replicate) to stream live changes from a database into Kafka and also includes a Q&A with our experts.
For more information, please go to www.attunity.com/kafka.
Qlik and Confluent Success Stories with Kafka - How Generali and Skechers Kee... | Hosted by Confluent
Converting production databases into live data streams for Apache Kafka can be labor intensive and costly. As Kafka architectures grow, complexity also rises as data teams begin to configure clusters for redundancy, partitions for performance, and consumer groups for correlated analytics processing. In this breakout session, you’ll hear data streaming success stories from Generali and Skechers that leverage Qlik Data Integration and Confluent. You’ll discover how Qlik’s data integration platform lets organizations automatically produce real-time transaction streams into Kafka, Confluent Platform, or Confluent Cloud, deliver faster business insights from data, and enable streaming analytics as well as streaming ingestion for modern analytics. Learn how these customers use Qlik and Confluent to:
- Turn databases into live data feeds
- Simplify and automate the real-time data streaming process
- Accelerate data delivery to enable real-time analytics
Learn how Skechers and Generali breathe new life into data in the cloud and stay ahead of changing demands, while lowering over-reliance on resources, production time, and costs.
Big Data Q2 Customer Education Webcast: New DMX Change Data Capture for Hadoop | Precisely
Watch our latest quarterly customer education webcast to learn about the latest advancements in Syncsort DMX and DMX-h data integration software, including our new product DMX Change Data Capture (CDC).
Many of our customers use DMX-h to quickly and efficiently populate their data lakes with enterprise-wide data, to power a variety of use cases, including data as a service, data archiving, fraud detection, and Customer 360. But, as important as it is to populate the data lake, it’s equally important to keep that data current for accurate decision making.
DMX Change Data Capture makes it easy and efficient to keep your data lake fresh after the initial load with real-time data replication that continually applies changes made on your traditional systems to your cluster.
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, AI, big data, real-time, robots, and Milvus.
A lively discussion with NJ Gen AI Meetup lead Prasad and Procure.FYI's co-founder.
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will present related topics such as vector databases, LLMs, and managing data at scale. The intended audience includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly the Milvus Meetup and is sponsored by Zilliz, maintainers of Milvus.
DATA SUMMIT 24: Building Real-Time Pipelines With FLaNK | Timothy Spann
Timothy Spann, Principal Developer Advocate, Streaming, Cloudera. Future of Data Meetup, Startup Grind, AI Camp.
The combination of Apache Flink, Apache NiFi, and Apache Kafka for building real-time data processing pipelines is extremely powerful, as demonstrated by this case study using the FLaNK-MTA project. The project leverages these technologies to process and analyze real-time data from the New York City Metropolitan Transportation Authority (MTA). FLaNK-MTA demonstrates how to efficiently collect, transform, and analyze high-volume data streams, enabling timely insights and decision-making.
Apache NiFi
Apache Kafka
Apache Flink
Apache Iceberg
LLM
Generative AI
Slack
PostgreSQL
Generative AI on Enterprise Cloud with NiFi and Milvus | Timothy Spann
Apache NiFi
Milvus
Apache Kafka
Apache Flink
Cloudera Machine Learning
Cloudera DataFlow
https://medium.com/@tspann/building-a-milvus-connector-for-nifi-34372cb3c7fa
https://www.meetup.com/futureofdata-princeton/events/300737266/
https://lu.ma/q7pcfyjn
If you're interested in working with Generative AI on the cloud, this virtual workshop is for you.
Tim Spann from Cloudera and Yujian Tang from Zilliz will cover how you can implement your own GenAI workflows on the cloud at enterprise scale.
9:00 - 9:05: Intro
9:05 - 9:15: What is Milvus
9:15 - 9:25: Cloudera Development Platform
9:25 - 10:00: Demo
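For a taste of the Milvus side of the workshop, here is a compact pymilvus sketch using Milvus Lite; the collection name and vector dimension are assumptions, and the vectors stand in for real embedding-model outputs produced upstream (for example, by a NiFi pipeline).

```python
# Compact pymilvus sketch using Milvus Lite (a local, file-backed Milvus).
# Collection name and dimension are assumptions; the vectors stand in for
# real embedding-model outputs produced upstream.
from pymilvus import MilvusClient

client = MilvusClient("milvus_demo.db")
client.create_collection(collection_name="docs", dimension=4)

client.insert(collection_name="docs", data=[
    {"id": 1, "vector": [0.1, 0.2, 0.3, 0.4], "text": "NiFi orchestrates flows"},
    {"id": 2, "vector": [0.4, 0.3, 0.2, 0.1], "text": "Kafka buffers events"},
])

# Nearest-neighbor search with a query vector (normally a query embedding)
hits = client.search(collection_name="docs", data=[[0.1, 0.2, 0.3, 0.4]],
                     limit=1, output_fields=["text"])
print(hits)
```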
https://www.youtube.com/watch?v=IfWIzKsoHnA
https://github.com/tspannhw/SpeakerProfile
https://www.linkedin.com/in/yujiantang/
Conf42 LLM: Adding Generative AI to Real-Time Streaming Pipelines | Timothy Spann
https://www.youtube.com/watch?v=Yeua8NlzQ3Y
https://www.conf42.com/Large_Language_Models_LLMs_2024_Tim_Spann_generative_ai_streaming
Adding Generative AI to Real-Time Streaming Pipelines
Abstract
Let’s build streaming pipelines that convert streaming events into prompts, call LLMs, and process the results.
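A hedged sketch of that pipeline shape in plain Python (in the talk this is orchestrated by NiFi processors): consume an event from Kafka, turn it into a prompt, call an LLM over HTTP, and publish the result. The inference endpoint and topic names are hypothetical.

```python
# Hedged sketch of the pipeline shape: Kafka event -> prompt -> LLM -> Kafka.
# The inference endpoint and topic names are hypothetical; the talk builds
# this codelessly with NiFi processors.
import json

import requests
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "chat-questions",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

LLM_URL = "http://localhost:8000/generate"  # hypothetical model endpoint

for event in consumer:
    prompt = f"Answer this chat question concisely: {event.value['text']}"
    reply = requests.post(LLM_URL, json={"prompt": prompt}, timeout=60).json()
    producer.send("chat-answers", {
        "question": event.value["text"],
        "answer": reply.get("text", ""),
    })
```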
Summary
Tim Spann: My talk is adding generative AI to real-time streaming pipelines. I'm going to discuss a couple of different open source technologies. We'll touch on Kafka, NiFi, Flink, Python, and Iceberg. All the slides and all the code on GitHub are out there.
LLMs, if you didn't know, are rapidly evolving. There are a lot of different ways to interact with models. That enrichment, transformation, and processing really needs tools. The amount of models, projects, and software that is available is massive.
NiFi supports hundreds of different inputs and can convert them on the fly. It's a great way to distribute your data quickly to whoever needs it, without duplication and without tight coupling. It's fun to find new things to integrate with.
So what we can do is, well, I want to get a meetup chat going. I have a processor here that just listens for events as they come from Slack. And then I'm going to clean it up, add a couple of fields, and push that out to Slack. Every model needs a little bit of different tweaking.
NiFi acts as a whole website. And as you see here, it can be GET, POST, PUT, whatever you want. We send that response back to Flink and it shows up here. Thank you for attending this talk. I'm going to be speaking at some other events very shortly.
Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi, Tim Spann here. My talk is adding generative AI to real-time streaming pipelines, and we're here for the large language model conference at Conf42, which is always a nice one, a great place to be. I'm going to discuss a couple of different open source technologies that work together to enable you to build real-time pipelines using large language models. So we'll touch on Kafka, NiFi, Flink, Python, and Iceberg, and I'll show you a little bit of each one in the demos. I've been working with data, machine learning, streaming, IoT, and some other things for a number of years, and you can contact me at any of these places, whether Twitter or whatever it's called, some different blogs, or in person at my meetups and at different conferences around the world. I do a weekly newsletter covering streaming, ML, a lot of LLM, open source, Python, Java, all kinds of fun stuff. As I mentioned, I do a bunch of different meetups. They are not just on the east coast of the US; they are available virtually live, and I also put them on YouTube, and if you need them somewhere else, let me know. We publish all the slides and all the code on GitHub. Everything you need is out there. Let's get into the talk. LLM, if you didn't know, is rapidly evolving. While you're typing down the things that you use, it
2024 XTREMEJ: Building Real-time Pipelines with FLaNK - A Case Study with Transit Data | Timothy Spann
https://xtremej.dev/2023/schedule/
Building Real-time Pipelines with FLaNK: A Case Study with Transit Data
Overview of the problem, the application (code walkthrough and running), overview of FLaNK, introduction to NiFi, introduction to Kafka, and introduction to Flink.
28 March 2024 - Codeless Generative AI Pipelines
https://www.meetup.com/futureofdata-princeton/events/299440871/
https://www.meetup.com/real-time-analytics-meetup-ny/events/299290822/
Note: The event is seat-limited, therefore please complete your registration here. Only people completing the form will be able to attend.
-----------------------
We're excited to invite you to join us in person for a Real-Time Analytics exploration!
Join us for an evening of insights and networking as we delve into the OSS technologies shaping the field!
Agenda:
05:30-06:00: Pizza and friends
06:00-06:40: Codeless GenAI Pipelines with Flink, Kafka, NiFi
06:40-07:20: Real-Time Analytics in the Corporate World: How Apache Pinot® Powers Industry Leaders
07:20-07:30: Q&A
Codeless GenAI Pipelines with Flink, Kafka, NiFi | Tim Spann, Cloudera
Explore the power of real-time streaming with GenAI using Apache NiFi. Learn how NiFi simplifies data engineering workflows, allowing you to focus on creativity over technical complexities. I'll guide you through practical examples, showcasing NiFi's automation impact from ingestion to delivery. Whether you're a seasoned data engineer or new to GenAI, this talk offers valuable insights into optimizing workflows. Join us to unlock the potential of real-time streaming and witness how NiFi makes data engineering a breeze for GenAI applications!
Real-Time Analytics in the Corporate World: How Apache Pinot® Powers Industry Leaders | Viktor Gamov, StarTree
Explore how industry leaders like LinkedIn, Uber Eats, and Stripe are mastering real-time data with Viktor as your guide. Discover how Apache Pinot transforms data into actionable insights instantly. Viktor will showcase Pinot's features, including the Star-Tree Index, and explain why it's a game-changer in data strategy. This session is for everyone, from data geeks to business gurus, eager to uncover the future of tech. Join us and be wowed by the power of real-time analytics with Apache Pinot!
-------
Tim Spann is a Principal Developer Advocate in Data In Motion for Cloudera.
He works with Apache NiFi, Apache Kafka, Apache Pulsar, Apache Flink, Flink SQL, Apache Pinot, Trino, Apache Iceberg, Delta Lake, Apache Spark, Big Data, IoT, Cloud, AI/DL, machine learning, and deep learning. Tim has over ten years of experience with the IoT, big data, distributed computing, messaging, streaming technologies, and Java programming. Previously, he was a Developer Advocate at StreamNative, Principal DataFlow Field Engineer at Cloudera, a Senior Solutions Engineer at Hortonworks, a Senior Solutions Architect at AirisData, a Senior Field Engineer at Pivotal and a Team Leader at HPE. He blogs for DZone, where he is the Big Data Zone leader, and runs a popular meetup in Princeton & NYC on Big Data, Cloud, IoT, deep learning, streaming, NiFi, the blockchain, and Spark. Tim is a frequent speaker at conferences such as ApacheCon, DeveloperWeek, Pulsar Summit and many more.
TCFPro24: Building Real-Time Generative AI Pipelines | Timothy Spann
https://princetonacm.acm.org/tcfpro/
18th Annual IEEE IT Professional Conference (ITPC)
Armstrong Hall at The College of New Jersey
Friday, March 15th, 2024 | 10:00 AM to 5:00 PM
IT Professional Conference at Trenton Computer Festival
IEEE Information Technology Professional Conference on Friday, March 15th, 2024
Building Real-Time Generative AI Pipelines
In this talk, Tim will delve into the exciting realm of building real-time generative AI pipelines with streaming capabilities. The discussion will revolve around the integration of cutting-edge technologies to create dynamic and responsive systems that harness the power of generative algorithms.
From leveraging streaming data sources to implementing advanced machine learning models, the presentation will explore the key components necessary for constructing a robust real-time generative AI pipeline. Practical insights, use cases, and best practices will be shared, offering a comprehensive guide for developers and data scientists aspiring to design and implement dynamic AI systems in a streaming environment.
Tim will show a live demo of how we can use Apache NiFi to provide a live chat between a person in Slack and several LLM models, all orchestrated with Apache NiFi, Apache Kafka, and Python. We will use RAG against Chroma and Pinecone vector data stores, Hugging Face and WatsonX.AI LLMs, and add additional context with NiFi lookups of stocks, weather, and other data streams in real time.
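A small sketch of the RAG step under stated assumptions, using Chroma as the vector store and a hypothetical inference endpoint standing in for the Hugging Face or WatsonX.AI models mentioned above:

```python
# Small RAG sketch: Chroma as the vector store, with a hypothetical HTTP
# endpoint standing in for the Hugging Face / WatsonX.AI models above.
import requests
import chromadb

store = chromadb.Client()
docs = store.create_collection("context-docs")
docs.add(ids=["1", "2"],
         documents=["NiFi orchestrates streaming pipelines.",
                    "Kafka buffers events between systems."])

question = "What does NiFi do?"
hits = docs.query(query_texts=[question], n_results=2)
context = " ".join(hits["documents"][0])

prompt = f"Use this context to answer.\nContext: {context}\nQuestion: {question}"
answer = requests.post("http://localhost:8000/generate",  # hypothetical endpoint
                       json={"prompt": prompt}, timeout=60).json()
print(answer)
```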
Timothy Spann
Tim Spann is a Principal Developer Advocate in Data In Motion for Cloudera. He works with Apache NiFi, Apache Pulsar, Apache Kafka, Apache Flink, Flink SQL, Apache Pinot, Trino, Apache Iceberg, Delta Lake, Apache Spark, Big Data, IoT, Cloud, AI/DL, machine learning, and deep learning. Tim has over ten years of experience with the IoT, big data, distributed computing, messaging, streaming technologies, and Java programming.
Previously, he was a Developer Advocate at StreamNative, Principal DataFlow Field Engineer at Cloudera, a Senior Solutions Engineer at Hortonworks, a Senior Solutions Architect at AirisData, a Senior Field Engineer at Pivotal and a Team Leader at HPE. He blogs for DZone, where he is the Big Data Zone leader, and runs a popular meetup in Princeton & NYC on Big Data, Cloud, IoT, deep learning, streaming, NiFi, the blockchain, and Spark.
Tim is a frequent speaker at conferences such as ApacheCon, DeveloperWeek, Pulsar Summit and many more. He holds a BS and MS in computer science.
2024 February 28 - NYC Meetup: Unlocking Financial Data with Real-Time Pipelines | Timothy Spann
https://www.meetup.com/futureofdata-newyork/events/298660453/
Unlocking Financial Data with Real-Time Pipelines
(Flink Analytics on Stocks with SQL)
By Timothy Spann
Financial institutions thrive on accurate and timely data to drive critical decision-making processes, risk assessments, and regulatory compliance. However, managing and processing vast amounts of financial data in real-time can be a daunting task. To overcome this challenge, modern data engineering solutions have emerged, combining powerful technologies like Apache Flink, Apache NiFi, Apache Kafka, and Iceberg to create efficient and reliable real-time data pipelines. In this talk, we will explore how this technology stack can unlock the full potential of financial data, enabling organizations to make data-driven decisions swiftly and with confidence.
Introduction: Financial institutions operate in a fast-paced environment where real-time access to accurate and reliable data is crucial. Traditional batch processing falls short when it comes to handling rapidly changing financial markets and responding to customer demands promptly. In this talk, we will delve into the power of real-time data pipelines, utilizing the strengths of Apache Flink, Apache NiFi, Apache Kafka, and Iceberg, to unlock the potential of financial data. I will be utilizing NiFi 2.0 with Python and Vector Databases.
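As a hedged sketch of the "Flink Analytics on Stocks with SQL" idea, here is a one-minute tumbling-window average price per symbol over a Kafka stream of ticks via PyFlink; the topic, schema, and broker are assumptions.

```python
# Hedged sketch: one-minute tumbling-window average price per symbol over a
# Kafka stream of stock ticks. Topic, schema, and broker are assumptions.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())
t_env.execute_sql("""
    CREATE TABLE stock_ticks (
        symbol STRING,
        price DOUBLE,
        ts TIMESTAMP(3),
        WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'stocks',
        'properties.bootstrap.servers' = 'localhost:9092',
        'scan.startup.mode' = 'latest-offset',
        'format' = 'json'
    )
""")

# Continuous aggregation: emits one row per symbol per closed window
t_env.execute_sql("""
    SELECT symbol,
           TUMBLE_START(ts, INTERVAL '1' MINUTE) AS window_start,
           AVG(price) AS avg_price
    FROM stock_ticks
    GROUP BY symbol, TUMBLE(ts, INTERVAL '1' MINUTE)
""").print()
```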
Timothy Spann
Principal Developer Advocate, Cloudera
Tim Spann is a Principal Developer Advocate in Data In Motion for Cloudera. He works with Apache NiFi, Apache Kafka, Apache Pulsar, Apache Flink, Flink SQL, Apache Pinot, Trino, Apache Iceberg, Delta Lake, Apache Spark, Big Data, IoT, Cloud, AI/DL, machine learning, and deep learning. Tim has over ten years of experience with the IoT, big data, distributed computing, messaging, streaming technologies, and Java programming. Previously, he was a Developer Advocate at StreamNative, Principal DataFlow Field Engineer at Cloudera, a Senior Solutions Engineer at Hortonworks, a Senior Solutions Architect at AirisData, a Senior Field Engineer at Pivotal and a Team Leader at HPE. He blogs for DZone, where he is the Big Data Zone leader, and runs a popular meetup in Princeton & NYC on Big Data, Cloud, IoT, deep learning, streaming, NiFi, the blockchain, and Spark. Tim is a frequent speaker at conferences such as ApacheCon, DeveloperWeek, Pulsar Summit and many more. He holds a BS and MS in computer science.
https://twitter.com/PaaSDev
https://www.linkedin.com/in/timothyspann/
https://medium.com/@tspann
https://github.com/tspannhw/FLiPStackWeekly/
Conf42-Python-Building Apache NiFi 2.0 Python Processors
https://www.conf42.com/Python_2024_Tim_Spann_apache_nifi_2_processors
Building Apache NiFi 2.0 Python Processors
Abstract
Let’s enhance real-time streaming pipelines with smart Python code, adding code for vector databases and LLMs.
Summary
Tim Spann: I'm going to be talking today about building Apache NiFi 2.0 Python processors. One of the main purposes of supporting Python in the streaming tool Apache NiFi is to interface with new machine learning, AI, and generative AI libraries. He says Python is a real game changer for Cloudera.
You're just going to add some metadata around it. It's a great way to pass a file along without changing it too substantially. We really need you to have Python 3.10 and again JDK 21 on your machine. You have to be smart about how you use these models.
There are a ton of Python processors available. You can use them in multiple ways. We're still in the early world of Python processors, so now's the time to start putting yours out there. I'd love to see a lot of people write their own.
When we are parsing documents here, again, this is the Python one; I'm picking PDF. There are lots of different things you could do. If you're interested in writing your own Python code for Apache NiFi, definitely reach out. Thanks.
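As a minimal sketch of the kind of processor described above (the class name and attribute are hypothetical; the structure follows the NiFi 2.0 Python FlowFileTransform interface), a processor that passes a file along unchanged while only adding metadata might look like:

from nifiapi.flowfiletransform import FlowFileTransform, FlowFileTransformResult

class AddMetadata(FlowFileTransform):
    class Java:
        implements = ['org.apache.nifi.python.processor.FlowFileTransform']

    class ProcessorDetails:
        version = '2.0.0'
        description = 'Passes a FlowFile through unchanged, adding metadata attributes.'
        tags = ['metadata', 'example']

    def __init__(self, **kwargs):
        super().__init__()

    def transform(self, context, flowfile):
        # Leave the content as-is; only attach a new attribute (hypothetical name)
        return FlowFileTransformResult(relationship='success',
                                       attributes={'processed.by': 'AddMetadata'})

Dropped into NiFi's Python extensions directory, the class is discovered at startup, assuming Python 3.10+ and JDK 21 as noted in the talk.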
Conf42Python -Using Apache NiFi, Apache Kafka, RisingWave, and Apache Iceberg with Stock Data and LLM
Abstract
In this talk, we’ll discuss how to use Apache NiFi, Apache Kafka, RisingWave, and Apache Iceberg to process and analyze stock data. We will demonstrate the ingestion, processing, and analysis of stock data, and illustrate how to use an LLM to generate predictions from the analyzed data.
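As a hedged sketch of the RisingWave portion (the topic, columns, and connector options are illustrative and depend on the RisingWave version), the Kafka stock stream can be registered as a source and aggregated continuously:

CREATE SOURCE stocks (symbol VARCHAR, price DOUBLE PRECISION, ts TIMESTAMPTZ)
WITH (
  connector = 'kafka',
  topic = 'stocks',
  properties.bootstrap.server = 'localhost:9092'
) FORMAT PLAIN ENCODE JSON;

CREATE MATERIALIZED VIEW avg_price AS
SELECT symbol, AVG(price) AS avg_price
FROM stocks
GROUP BY symbol;

The materialized view stays up to date as events arrive; its results could then feed the LLM prediction stage described above.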
Karin Wolok
Developer Relations, Dev Marketing, and Community Programming @ Project Elevate
Tim Spann
Principal Developer Advocate @ Cloudera
https://www.conf42.com/Python_2024_Karin_Wolok_Tim_Spann_nifi__kafka_risingwave_iceberg_llm
2024 Feb AI Meetup NYC GenAI_LLMs_ML_Data Codeless Generative AI PipelinesTimothy Spann
2024 Feb AI Meetup NYC GenAI_LLMs_ML_Data Codeless Generative AI Pipelines
https://www.aicamp.ai/event/eventdetails/W2024022214
apache nifi
llm
generative ai
gen ai
ml
dl
machine learning
apache kafka
apache flink
postgresql
python
AI Meetup (NYC): GenAI, LLMs, ML and Data
Feb 22, 05:30 PM EST
Welcome to the monthly in-person AI meetup in New York City, in collaboration with Microsoft. Join us for deep-dive tech talks on AI, GenAI, LLMs, and machine learning, with food/drink and networking with speakers and fellow developers.
Agenda:
* 5:30pm~6:00pm: Checkin, Food/drink and networking
* 6:00pm~6:10pm: Welcome/community update
* 6:10pm~8:30pm: Tech talks
* 8:30pm: Q&A, Open discussion
Tech Talk: Searching and Reasoning Over Multimedia Data with Vector Databases and LMMs
Speaker: Zain Hasan (Weaviate)
Abstract: In this talk, Zain Hasan will discuss how we can use open-source multimodal embedding models in conjunction with large generative multimodal models that can see, hear, read, and feel data(!) to perform cross-modal search (searching audio with images, videos with text, etc.) and multimodal retrieval augmented generation (MM-RAG) at the billion-object scale with the help of open-source vector databases. He will also demonstrate, with live code demos, how performing this cross-modal retrieval in real time enables users to use LLMs that can reason over their enterprise multimodal data. This talk will revolve around how we can scale the usage of multimodal embedding and generative models in production.
Tech Talk: Codeless Generative AI Pipelines
Speaker: Timothy Spann (Cloudera)
Abstract: Join us for an insightful talk on leveraging the power of real-time streaming tools, specifically Apache NiFi, to revolutionize GenAI data engineering. In this session, we’ll explore how the integration of Apache NiFi can automate the entire process of prompt building, making it a seamless and efficient task.
Speakers/Topics:
Stay tuned as we are updating speakers and schedules. If you have a keen interest in speaking to our community, we invite you to submit topics for consideration: Submit Topics
Sponsors:
We are actively seeking sponsors to support our community, whether by offering venue space, providing food/drink, or contributing cash sponsorship. Sponsors will have the chance to speak at the meetups, receive prominent recognition, and gain exposure to our extensive membership base of 20,000+ local or 300K+ developers worldwide.
Venue:
Microsoft NYC - Times Square, 11 Times Square, New York, NY 10036
Room Name: Central Park West 6501
Community on Slack/Discord
- Event chat: chat and connect with speakers and attendees
- Sharing blogs, events, job openings, projects collaborations
Join Slack (search and join the #newyork channel) | Join Discord
DBA Fundamentals Group: Continuous SQL with Kafka and FlinkTimothy Spann
DBA Fundamentals Group: Continuous SQL with Kafka and Flink
20-Feb-2024
In this talk, I will walk through how someone can set up and run continuous SQL queries against Kafka topics utilizing Apache Flink. We will walk through creating Kafka topics, schemas, and publishing data.
We will then cover consuming Kafka data, joining Kafka topics, and inserting new events into Kafka topics as they arrive. This basic overview will show hands-on techniques, tips, and examples of how to do this.
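As a hedged sketch of those steps (topic names, columns, and connector options are illustrative, not taken from the talk), a Kafka topic can be declared as a Flink SQL table and queried continuously:

-- Declare a Kafka topic as a Flink SQL table
CREATE TABLE orders (
  order_id STRING,
  customer_id STRING,
  amount DOUBLE,
  ts TIMESTAMP(3),
  WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
) WITH (
  'connector' = 'kafka',
  'topic' = 'orders',
  'properties.bootstrap.servers' = 'localhost:9092',
  'format' = 'json',
  'scan.startup.mode' = 'earliest-offset'
);

-- Continuous query: route large orders into another Kafka-backed table
-- (big_orders would be declared with the same kind of DDL)
INSERT INTO big_orders
SELECT order_id, customer_id, amount, ts
FROM orders
WHERE amount > 1000;

New events inserted into the source topic appear in the query results as they arrive, which is the "continuous" part of continuous SQL.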
Tim Spann
Tim Spann is the Principal Developer Advocate for Data in Motion @ Cloudera where he works with Apache Kafka, Apache Flink, Apache NiFi, Apache Iceberg, TensorFlow, Apache Spark, big data, the IoT, machine learning, and deep learning. Tim has over a decade of experience with the IoT, big data, distributed computing, streaming technologies, and Java programming. Previously, he was a Developer Advocate at StreamNative, Principal Field Engineer at Cloudera, a Senior Solutions Architect at AirisData and a senior field engineer at Pivotal. He blogs for DZone, where he is the Big Data Zone leader, and runs a popular meetup in Princeton on big data, the IoT, deep learning, streaming, NiFi, the blockchain, and Spark. Tim is a frequent speaker at conferences such as IoT Fusion, Strata, ApacheCon, Data Works Summit Berlin, DataWorks Summit Sydney, and Oracle Code NYC. He holds a BS and MS in computer science.
OSACon 2023_ Unlocking Financial Data with Real-Time PipelinesTimothy Spann
OSACon 2023_ Unlocking Financial Data with Real-Time Pipelines
Unlocking Financial Data with Real-Time Pipelines
Financial institutions thrive on accurate and timely data to drive critical decision-making processes, risk assessments, and regulatory compliance. However, managing and processing vast amounts of financial data in real-time can be a daunting task. To overcome this challenge, modern data engineering solutions have emerged, combining powerful technologies like Apache Flink, Apache NiFi, Apache Kafka, and Iceberg to create efficient and reliable real-time data pipelines. In this talk, we will explore how this technology stack can unlock the full potential of financial data, enabling organizations to make data-driven decisions swiftly and with confidence.
Introduction: Financial institutions operate in a fast-paced environment where real-time access to accurate and reliable data is crucial. Traditional batch processing falls short when it comes to handling rapidly changing financial markets and responding to customer demands promptly. In this talk, we will delve into the power of real-time data pipelines, utilizing the strengths of Apache Flink, Apache NiFi, Apache Kafka, and Iceberg, to unlock the potential of financial data.
Key Points to be Covered:
Introduction to Real-Time Data Pipelines:
a. The limitations of traditional batch processing in the financial domain.
b. Understanding the need for real-time data processing.
Apache Flink: Powering Real-Time Stream Processing:
a. Overview of Apache Flink and its role in real-time stream processing.
b. Use cases for Apache Flink in the financial industry.
c. How Flink enables fast, scalable, and fault-tolerant processing of streaming financial data.
Apache Kafka: Building Resilient Event Streaming Platforms:
a. Introduction to Apache Kafka and its role as a distributed streaming platform.
b. Kafka's capabilities in handling high-throughput, fault-tolerant, and real-time data streaming.
c. Integration of Kafka with financial data sources and consumers.
Apache NiFi: Data Ingestion and Flow Management:
a. Overview of Apache NiFi and its role in data ingestion and flow management.
b. Data integration and transformation capabilities of NiFi for financial data.
c. Utilizing NiFi to collect and process financial data from diverse sources.
Iceberg: Efficient Data Lake Management:
a. Understanding Iceberg and its role in managing large-scale data lakes.
b. Iceberg's schema evolution and table-level metadata capabilities.
c. How Iceberg simplifies data lake management in financial institutions.
Real-World Use Cases (see the sketch after this list):
a. Real-time fraud detection using Flink, Kafka, and NiFi.
b. Portfolio risk analysis with Iceberg and Flink.
c. Streamlined regulatory reporting leveraging all four technologies.
Best Practices and Considerations:
a. Architectural considerations when building real-time financial data pipelines.
b. Ensuring data integrity, security, and compliance in real-time pipelines.
c. Scalability and performance considerations.
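As a hedged sketch of use case (a), real-time fraud detection (the table name, fields, and threshold are hypothetical), a continuous Flink SQL query can flag cards with an unusual burst of transactions:

-- Count transactions per card in one-minute tumbling windows
-- and keep only cards exceeding a hypothetical threshold
SELECT card_id,
       TUMBLE_END(ts, INTERVAL '1' MINUTE) AS window_end,
       COUNT(*) AS txn_count
FROM transactions
GROUP BY card_id, TUMBLE(ts, INTERVAL '1' MINUTE)
HAVING COUNT(*) > 10;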
[EN]DSS23_tspann_Integrating LLM with Streaming Data PipelinesTimothy Spann
[EN]DSS23_tspann_Integrating LLM with Streaming Data Pipelines
https://dssconf.pl/en/#agenda-section
Integrating LLM with Streaming Data Pipelines
Timothy Spann, Principal Developer Advocate, Cloudera
APACHE NIFI, APACHE FLINK, APACHE KAFKA, LLM, HUGGINGFACE, REST, STREAMING
In this talk and demo I will walk through how to add LLMs to your streaming pipelines by integration through Apache NiFi.
https://github.com/tspannhw/FLaNK-watsonx.ai
Cloudera
streaming
llm
generative ai
slack to slack
AIDevWorld 23 Apache NiFi 101 Introduction and Best Practices
https://sched.co/1RoAO
Timothy Spann, Cloudera, Principal Developer Advocate
In this talk, we will walk step by step through Apache NiFi from the first load to first application. I will include slides, articles and examples to take away as a Quick Start to utilizing Apache NiFi in your real-time dataflows. I will help you get up and running locally on your laptop, Docker or in CDP Public Cloud.
Wednesday November 1, 2023 12:00pm - 12:25pm PDT
VIRTUAL AI DevWorld -- Main Stage https://app.hopin.com/events/api-world-2023-ai-devworld/stages
Retail & E-Commerce AI (Industry AI Conference)
Session Type OPEN TALK
Track or Conference Retail & E-Commerce AI (Industry AI Conference), Industry AI Conference, VIRTUAL, Tensorflow & PyTorch & Open Source Frameworks (AI/ML Engineering Conference), AI/ML Engineering Conference, AI DevWorld
In-Person/Virtual Virtual, Virtual Exclusive
apache nifi
Timothy Spann
Cloudera
Principal Developer Advocate for Data in Motion
Tim Spann is the Principal Developer Advocate for Data in Motion @ Cloudera where he works with Apache Kafka, Apache Flink, Apache NiFi, Apache Iceberg, TensorFlow, Apache Spark, big data, the IoT, machine learning, and deep learning. Tim has over a decade of experience with the IoT, big data, distributed computing, streaming technologies, and Java programming. Previously, he was a Developer Advocate at StreamNative, Principal Field Engineer at Cloudera, a Senior Solutions Architect at AirisData and a senior field engineer at Pivotal. He blogs for DZone, where he is the Big Data Zone leader, and runs a popular meetup in Princeton on big data, the IoT, deep learning, streaming, NiFi, the blockchain, and Spark. Tim is a frequent speaker at conferences such as IoT Fusion, Strata, ApacheCon, Data Works Summit Berlin, DataWorks Summit Sydney, and Oracle Code NYC. He holds a BS and MS in computer science.
cloudera dataflow
4. FLaNK Stack
Tim Spann
@PaasDev // Blog: www.datainmotion.dev
Principal Developer Advocate.
Princeton Future of Data Meetup.
ex-Pivotal, ex-Hortonworks, ex-StreamNative, ex-PwC
https://medium.com/@tspann
https://github.com/tspannhw
Apache NiFi x Apache Kafka x Apache Flink
5. Future of Data - Princeton + Virtual
@PaasDev
https://www.meetup.com/futureofdata-princeton/
From Big Data to AI to Streaming to Containers to Cloud to Analytics to Cloud Storage to Fast Data to Machine Learning to Microservices to ...
6. CDP IS THE ONLY HYBRID DATA PLATFORM
Hybrid. Open. Portable. Secure.
[Diagram: Cloudera Data Platform and the Open Data Lakehouse spanning object stores: S3, GCS, ADLS, Ozone]
8. CLOUDERA FLOW MANAGEMENT - POWERED BY APACHE NiFi
Ingest and manage data from edge to cloud using a no-code interface
● #1 data ingestion/movement engine
● Strong community
● Product maturity: over 11 years
● Deploy on-premises or in the cloud
● 400+ pre-built processors
● Built-in data provenance
● Guaranteed delivery
● Throttling and backpressure
9. Cloudera Flow Management
Ingest and manage data from edge to cloud using a no-code interface
ACQUIRE | PROCESS | DELIVER
• Over 300 pre-built processors
• Easy to build your own processors
• Parse, enrich & apply schema
• Filter, Split, Merge & Route
• Throttle & Backpressure
• Guaranteed delivery
• Full data provenance
• Ecosystem integration
Advanced tooling to industrialize flow development (Flow Development Life Cycle)
[Diagram: acquire/deliver endpoints include FTP, SFTP, HL7, UDP, XML, HTTP, EMAIL, HTML, IMAGE, SYSLOG; processing operations include HASH, MERGE, EXTRACT, DUPLICATE, SPLIT, ROUTE TEXT, ROUTE CONTENT, ROUTE CONTEXT, CONTROL RATE, DISTRIBUTE LOAD, GEOENRICH, SCAN, REPLACE, TRANSLATE, CONVERT, ENCRYPT, TAIL, EVALUATE, EXECUTE]
10. SQL-BASED ROUTING WITH NiFi’s QueryRecord Processor
● QueryRecord processor: executes a SQL statement against records and writes the results to the flow file content.
● CSVReader: looks up the schema from Schema Registry and converts CSV records into process records.
● SQL execution via Apache Calcite: runs the configured SQL against the process records for routing.
● CSVRecordSetWriter: converts the query results from process records back into CSV for the flow file content.
Route streams (for example, geo and speed streams) using standard SQL instead of complex regular expressions.
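For example (the property name and field are hypothetical), a QueryRecord dynamic property named speed could hold a Calcite SQL statement such as:

SELECT * FROM FLOWFILE WHERE speed > 100

Records matching the query are written to a relationship named after the property, so the geo and speed streams can each get their own query and their own downstream route.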
11. Key Differentiators
Comprehensive streaming platform – The only vendor to offer an open and comprehensive streaming platform for real-time data ingestion and processing to produce prescriptive and predictive analytics
Stream to Cloud – Extend the same on-premises streaming capabilities to the cloud with full support for multi-cloud and hybrid cloud models
400+ pre-built processors – The only product to offer such comprehensive connectivity to a wide range of data sources from edge to cloud
Enterprise-Grade Security & Governance – Deploy your streaming applications with confidence and trust, with Cloudera SDX offering unified security and governance across the entire platform
Democratize access to real-time data – Enable data analysts and other personas to quickly build streaming applications with just SQL
12. Development & Runtime of DataFlow Functions
Step 1. Develop functions on a local workstation or in CDP Public Cloud using the no-code UI designer.
Step 2. Run functions on serverless compute services in AWS, Azure & GCP (AWS Lambda, Azure Functions, Google Cloud Functions).
13. DataFlow Functions Use Cases
Trigger-Based, Batch, Scheduled and Microservice Use Cases

Serverless Trigger-Based File Processing Pipeline
Develop & run data processing pipelines when files are created or updated in any of the cloud object stores.
Example: When a photo is uploaded to object storage, a data flow is triggered which runs image resizing code and delivers the resized image to different locations.

Serverless Workflows / Orchestration
Chain different low-code functions to build complex workflows.
Example: Automate the handling of support tickets in a call center or orchestrate data movement across different cloud services.

Serverless Scheduled Tasks
Develop and run scheduled tasks without any code on pre-defined timed intervals.
Example: Offload an external database running on-premises into the cloud once a day every morning at 4:00 a.m.

Serverless Microservices
Build and deploy serverless independent modules that power your application's microservices architecture.
Example: Event-driven functions for easy communication between thousands of decoupled services that power a ride-sharing application.

Serverless Web APIs
Easily build endpoints for your web applications with HTTP APIs without any code using DFF and any of the cloud providers' function triggers.
Example: Build high-performance, scalable web applications across multiple data centers.

Serverless Customized Triggers
With the DFF State feature, build flows to create customized triggers allowing access to on-premises or external services.
Example: Near real-time offloading of files from a remote SFTP server.
15. ReadyFlows
• Cloudera-provided flow definitions
• Cover most common data flow use cases
• Can be deployed and adjusted as needed
• Made available through docs during Tech Preview
16. Deployment Wizard
• Turns flow definitions into flow deployments
• Guides users through providing required configuration
• Pick from pre-defined NiFi node sizes
• Define KPIs for the deployment
[Screens: Start Deployment Wizard, Provide Parameters, Configure Sizing & Scaling, Define KPIs]
17. Key Performance Indicators
• Visibility into flow deployments
• Track high-level flow performance
• Track in-depth NiFi component metrics
• Defined in the Deployment Wizard
• Monitoring & alerts in Deployment Details
[Screens: KPI Definition in Deployment Wizard, KPI Monitoring]
18. Dashboard
• Central monitoring view
• Monitors flow deployments across CDP environments
• Monitors flow deployment health & performance
• Drill into a flow deployment to monitor system metrics and deployment events
19. Data Flow Design for Everyone
• Cloud-native data flow development
• Developers get their own sandbox
• Start developing flows without installing NiFi
• Redesigned visual canvas
• Optimized interaction patterns
• Integration into the CDF-PC Catalog for versioning
21. NiFi Ingesting REST APIs
● NiFi consumes streams (CDC, REST, sensors)
● Distributes in real time to Kafka and MySQL at the same time
● Flink SQL consumes from Kafka
● TiDB CDC -> Kafka
https://ossinsight.io/docs/api
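As a hedged sketch of that first step outside NiFi (the endpoint path and topic name are hypothetical; in the flow itself NiFi handles this with processors such as InvokeHTTP), polling the OSS Insight REST API and publishing to Kafka could look like:

import json
import requests
from kafka import KafkaProducer  # kafka-python

producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    value_serializer=lambda v: json.dumps(v).encode('utf-8'),
)

# Hypothetical endpoint; see https://ossinsight.io/docs/api for real paths
resp = requests.get('https://api.ossinsight.io/v1/trends/repos/')
resp.raise_for_status()

# Publish the JSON payload so Flink SQL can consume it downstream
producer.send('ossinsight-trends', resp.json())
producer.flush()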
25. Why Kudu?
A simultaneous combination of sequential and random reads and writes
Time Series Data: Can you insert time series data in real time? How long does it take to prepare it for analysis? Can you get results and act fast enough to change outcomes?
Machine Data Analytics: Can you handle large volumes of machine-generated data? Do you have the tools to identify problems or threats? Can your system do machine learning?
Online Reporting: How fast can you add data to your data store? Are you trading off the ability to do broad analytics for the ability to make updates? Are you retaining only part of your data?
26. Online Transactional Processing (OLTP) and Online Analytical Processing (OLAP)
HTAP Options - Apache Kudu
https://hpi.de/fileadmin/user_upload/hpi/navigation/10_forschung/20_future_soc_lab/Poster/2019-1/Tozun_FSOC-Poster_20191_150443.pdf
29. SQL STREAM BUILDER (CLOUDERA SSB)
SQL Stream Builder allows developers, analysts, and data scientists to write streaming applications with industry-standard SQL. No Java or Scala code development required.
Simplifies access to data in Kafka & Flink. Connectors to batch data in HDFS, Kudu, Hive, S3, JDBC, CDC and more.
Enrich streaming data with batch data in a single tool.
Democratize access to real-time data with just SQL.
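As a hedged sketch of the "enrich streaming data with batch data" idea (table and column names are hypothetical), SSB-style Flink SQL can join a Kafka stream against a batch dimension table with a lookup join:

SELECT o.order_id, o.amount, c.customer_name
FROM orders_stream AS o
JOIN customers FOR SYSTEM_TIME AS OF o.proc_time AS c
  ON o.customer_id = c.customer_id;

Here orders_stream would carry a processing-time attribute (declared as proc_time AS PROCTIME() in its DDL) and customers would be backed by a batch connector such as JDBC, Kudu, or Hive.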
33. HTAP: INGEST OF ALL DATA
[Diagram: Data Sources feeding Cloudera Data Flow, Cloudera Streaming Analytics, and Cloudera Streams Processing (Kafka), landing in the Lakehouse]
34. LLM USE CASE
[Diagram: Data in Motion on Cloudera Data Platform (CDP) captures, processes & distributes any data, anywhere; unstructured file types and other enterprise data feed a Vector DB and AI Model; structured sources feed the Open Data Lakehouse and materialized views; applications/APIs and streams connect the pieces]
35. Live Q&A
[Architecture diagram: a hybrid cloud flow that collects from Travel Advisories, Weather Reports, Documents, Social Media, Internal Data, GitHub Data, and a REST API; stores, enriches, and reports through Data Flow, a Data Warehouse, SQL Stream Builder, a messaging broker, and Data Visualization; AI-based enhancements (Vector Database, LLM, Machine Learning) predict and automate; features include real-time alerting, aggregations, enrichments, timestamps, input sentences, and generated text]
37. CSP Community Edition
● Kafka, Kafka Connect, SMM, SR, Flink, and SSB in Docker
● Runs in Docker
● Try new features quickly
● Develop applications locally
● Docker Compose file of CSP to run from the command line without any dependencies, including Flink, SQL Stream Builder, Kafka, Kafka Connect, Streams Messaging Manager and Schema Registry
○ $> docker compose up
● Licensed under the Cloudera Community License
● Unsupported
● Community Group Hub for CSP
● Find it on docs.cloudera.com under Applications
38. Open Source Edition
● Apache NiFi in Docker
● Runs in Docker
● Try new features quickly
● Develop applications locally
● Docker NiFi:
○ docker run --name nifi -p 8443:8443 -d \
    -e SINGLE_USER_CREDENTIALS_USERNAME=admin \
    -e SINGLE_USER_CREDENTIALS_PASSWORD=ctsBtRBKHRAx69EqUghvvgEvjnaLjFEB \
    apache/nifi:latest
● Licensed under the ASF License
● Unsupported
https://hub.docker.com/r/apache/nifi