Tracking crime as it occurs with Apache Phoenix, Apache HBase and Apache NiFi.
Ingesting JSON Crime Feeds, XML Feeds, Twitter feeds, Traffic Camera Images.
Cloudera Operational DB (Apache HBase & Apache Phoenix) - Timothy Spann
Cloudera Operational DB (Apache HBase & Apache Phoenix)
Using Apache NiFi 1.10 to read/write from HBase
Dec 2019, Timothy Spann, Field Engineer, Data in Motion
Princeton Meetup, 10 Dec 2019
https://www.meetup.com/futureofdata-princeton/events/266496424/
Hosted By PGA Fund at:
https://pga.fund/coworking-space/
Princeton Growth Accelerator
5 Independence Way, 4th Floor, Princeton, NJ
Building Event-Driven Microservices using Kafka Streams (Stathis Souris, ThousandEyes) - London Microservices
Recorded at the London Microservices Meetup: https://www.meetup.com/London-Microservices/
- Date: 14th of October 2020
- Video: https://youtu.be/Arzr0T0hrCw
- Event page: https://www.meetup.com/London-Microservices/events/273266418/
Follow us on Twitter! https://twitter.com/LondonMicrosvc
---
Building Event-Driven Microservices using Kafka Streams
Stathis Souris, ThousandEyes
Streaming is all the rage these days, but can business systems be built using stream processing?
We'll explore this question by looking at Streaming Microservices using Kafka Streams.
We'll also discuss some of the patterns that we currently use in real-life production microservices at ThousandEyes (part of Cisco) and things to avoid.
Key takeaways:
- Basic Kafka concepts
- Kafka Streams
- Discussion of various event-driven services built using Kafka Streams
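The kind of stateful, per-key processing that Kafka Streams provides can be sketched in plain Python (no broker involved; event names are hypothetical). This mimics a groupByKey().count() topology that emits a changelog of updated counts:

```python
from collections import defaultdict

def count_by_key(events):
    """Consume (key, value) pairs and maintain a per-key count,
    emitting the updated count after each event (a changelog stream)."""
    counts = defaultdict(int)
    changelog = []
    for key, _value in events:
        counts[key] += 1
        changelog.append((key, counts[key]))
    return dict(counts), changelog

state, changelog = count_by_key(
    [("user-1", "click"), ("user-2", "click"), ("user-1", "view")]
)
```

In the real DSL the state lives in a fault-tolerant state store backed by a Kafka changelog topic; the simulation only shows the aggregation logic.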
Stathis spent several years as a Software Engineer in Athens, Greece, before moving to London and joining ThousandEyes (now part of Cisco).
He enjoys working with large distributed systems using technologies such as Kafka, Elasticsearch, Java, and Kotlin.
Mm.. FLaNK Stack (MiNiFi MXNet Flink NiFi Kudu Kafka) - Timothy Spann
Mm.. FLaNK Stack (MiNiFi MXNet Flink NiFi Kudu Kafka)
A quick discussion and demo of the FLaNK stack.
Streaming development with Apache NiFi, Apache Kafka, Apache Flink and friends.
Dec 2019, Timothy Spann, Field Engineer, Data in Motion
Princeton Meetup, 10 Dec 2019
https://www.meetup.com/futureofdata-princeton/events/266496424/
Hosted By PGA Fund at:
https://pga.fund/coworking-space/
Princeton Growth Accelerator
5 Independence Way, 4th Floor, Princeton, NJ
Battle Tested Event-Driven Patterns for your Microservices Architecture - Natan Silnitsky
During the past couple of years, I've implemented, or witnessed implementations of, several key patterns of event-driven messaging design on top of Kafka. These patterns have helped us build a robust distributed microservices system at Wix that easily handles increasing traffic and storage needs across many different use cases.
In this talk I will share these patterns with you, including:
* Consume and Project (data decoupling)
* End-to-end Events (Kafka+websockets)
* In-memory KV stores (consume and query with zero latency)
* Event transactions (exactly-once delivery)
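The in-memory KV store pattern above can be sketched as follows; this is a minimal Python illustration assuming a simplified (key, value) event shape, not Wix's actual implementation:

```python
# A service consumes a compacted topic into a local dict and serves
# all reads from memory, so queries never leave the process.
class InMemoryView:
    def __init__(self):
        self._store = {}

    def apply(self, key, value):
        """Apply one consumed event; a None value acts as a tombstone (delete)."""
        if value is None:
            self._store.pop(key, None)
        else:
            self._store[key] = value

    def get(self, key):
        """Query with no network hop: reads hit local memory only."""
        return self._store.get(key)

view = InMemoryView()
for key, value in [("cart-1", {"items": 2}), ("cart-2", {"items": 5}), ("cart-1", None)]:
    view.apply(key, value)
```

On restart, the view is rebuilt by re-consuming the compacted topic from the beginning, which is what makes the pattern safe despite keeping state only in memory.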
At BAADER, we design and engineer innovative and holistic solutions that ensure intelligent, safe, efficient and sustainable food processing in all phases, from the handling of live and raw protein materials to the finished food products. As a key player in the food value chain, we aim to take further significant steps toward greater efficiency, traceability, transparency, profitability, and sustainability through new digital solutions.
During our digital transformation we are working on two ends: on the one hand, there are many brownfield factories unprepared for the digital journey; on the other hand, we have powerful greenfield technologies like Apache Kafka. Now we have to bring two mindsets together – robust food processing machinery and highly scalable software technologies. In this talk, we will present how we successfully started to ingest various kinds of IoT data into our Kafka cluster, spotlighted from both ends.
After running Hadoop on-premises in production for some time, we decided to build a Hadoop platform in AWS to extend the on-premises Hadoop cluster into a hybrid platform.
In this presentation, we first briefly state our motivation and requirements for building a cloud platform. Moving to the cloud not only opened up new technical possibilities, it also helped us make our way of working more agile. We explain how we set up a team of internal and external experts, defined an agile working mode, and how this approach worked for us.
Building Streaming Data Pipelines with Google Cloud Dataflow and Confluent Cloud - Hosted by Confluent
We will demonstrate how easy it is to use Confluent Cloud as the data source of your Beam pipelines. You will learn how to process the information that comes from Confluent Cloud in real time, transform it, and feed it back to your Kafka topics and other parts of your architecture.
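Stripped of the Beam and Kafka plumbing, the transform step of such a pipeline is just a pure function over records. A minimal sketch in Python, with hypothetical field names:

```python
import json

def enrich(record_bytes):
    """Parse a JSON record consumed from a topic, add a derived field,
    and re-serialize it for the output topic."""
    record = json.loads(record_bytes)
    record["amount_usd_cents"] = int(round(record["amount_usd"] * 100))
    return json.dumps(record).encode("utf-8")

out = enrich(b'{"order_id": 7, "amount_usd": 12.5}')
```

In a real Beam pipeline this function would sit inside a Map/ParDo step between the Kafka read and write transforms.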
IoT Architectures for a Digital Twin with Apache Kafka, IoT Platforms and Machine Learning - Kai Wähner
A digital twin is a digital replica of a living or non-living physical entity. This session discusses the benefits and IoT architectures of a digital twin in Industrial IoT (IIoT) and its relation to Apache Kafka, IoT frameworks and Machine Learning. Kafka is often used as a central event streaming platform to build a scalable and reliable digital twin for real-time streaming sensor data. A live demo shows a scalable digital twin infrastructure for condition monitoring and predictive maintenance in real time for a connected car infrastructure leveraging Kafka, MQTT and TensorFlow.
Key Take-Aways:
• Learn about use cases and characteristics of a digital twin in various industries
• Understand how to build a digital twin for every single one of tens of thousands of IoT devices or machines
• See different IoT architectures with Kafka and other IoT technologies and products, including edge, hybrid and global deployments
• Understand the relation to Machine Learning and bring added value to your IoT infrastructure by enabling use cases like predictive maintenance
• Understand how Apache Kafka enables scalable and flexible end-to-end processing from IIoT data to various backend applications
• Watch a live demo of end-to-end integration, real-time processing and analytics of thousands of IoT devices
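The per-device twin idea above can be sketched as a small state object per device. This is an illustrative Python sketch; the field names and the 80.0-degree threshold are assumptions, not taken from the talk:

```python
# Each sensor reading updates the device's mirrored state, and a simple
# threshold rule flags candidates for predictive maintenance.
class DeviceTwin:
    def __init__(self, device_id, temp_limit=80.0):
        self.device_id = device_id
        self.temp_limit = temp_limit
        self.last_reading = None
        self.needs_maintenance = False

    def update(self, reading):
        """Apply one sensor event (a dict) to the twin's state."""
        self.last_reading = reading
        if reading.get("temperature", 0.0) > self.temp_limit:
            self.needs_maintenance = True

twins = {}
for event in [{"device": "car-42", "temperature": 75.0},
              {"device": "car-42", "temperature": 91.3}]:
    twin = twins.setdefault(event["device"], DeviceTwin(event["device"]))
    twin.update(event)
```

At scale, the per-device state would live in a Kafka-backed state store or database keyed by device ID rather than a plain dict, but the update logic is the same.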
More details:
https://www.kai-waehner.de/blog/2019/11/28/apache-kafka-industrial-iot-iiot-build-an-open-scalable-reliable-digital-twin/
https://www.kai-waehner.de/blog/2020/03/25/architectures-digital-twin-digital-thread-apache-kafka-iot-platforms-machine-learning/
https://youtu.be/Q3eKPEVwNVY
Fully-Managed, Multi-Tenant Kafka Clusters: Tips, Tricks, and Tools (Christop...) - Confluent
Running a multi-tenant Kafka platform designed for the enterprise can be challenging. You need to manage and plan for data growth, support an ever-increasing number of use cases, and ensure your developers can be productive with the latest tools in the Apache Kafka ecosystem — all while maintaining the stability and performance of Kafka itself.
At Bloomberg, we run a fully-managed, multi-tenant Kafka platform that is used by developers across the enterprise. The variety of use cases for Kafka leads to bursty workloads, latency-sensitive workloads, and topologies where partitions are fanned out across hundreds or thousands of consumer groups running side-by-side in the same cluster.
In this talk, we will give a brief overview of our platform and share some of our experiences and tools for running multi-tenant stretched clusters, managing data growth with compression, and mitigating the impact of various application patterns on shared clusters.
Kafka Summit 2021 - Why MQTT and Kafka are a match made in heaven - Dominik Obermaier
A fast and efficient integration of end device data into data processing systems is becoming increasingly important in the Internet of Things. Factors such as secure and reliable data transmission, real-time data processing and the analysis of huge amounts of data afterwards play a major role. In order to implement an architecture that enables smooth communication between end devices and systems, tools are needed that are designed for a type of communication that takes these factors into account.
This presentation shows the strengths and application areas of MQTT, the de-facto standard communication protocol for the Internet of Things, and Apache Kafka, which is often used for data streaming in the Internet of Things, and explains how and why the technologies complement each other ideally.
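One concrete way the two complement each other: MQTT's hierarchical topics can be collapsed into a small number of Kafka topics with the device path as the record key, preserving per-device ordering within a partition. The mapping below is an illustrative convention, not a standard:

```python
def mqtt_to_kafka(mqtt_topic, payload):
    """Map an MQTT publish like "factory/line1/sensor7/temp" to a
    (kafka_topic, key, value) record for a bridge between the systems."""
    parts = mqtt_topic.split("/")
    kafka_topic = "iot." + parts[0]          # coarse routing by top level
    key = "/".join(parts[1:]) or parts[0]    # device path as the record key
    return kafka_topic, key, payload

record = mqtt_to_kafka("factory/line1/sensor7/temp", b"21.7")
```

Keying by device path means all readings from one device land in the same partition, which is what downstream stream processors rely on for ordered per-device processing.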
How are leading companies deploying Spark with Hadoop in production? What insights have they learned, and what key factors should you consider to put your innovative Spark-based app to work faster? Hear real-life customer examples of turning data into action with Spark and Hadoop, and how advanced users deploy Hadoop and Spark applications in one cluster with better reliability and performance at production scale.
Help, My Kafka is Broken! (Emma Humber & Gantigmaa Selenge, IBM) Kafka Summit... - Hosted by Confluent
While Apache Kafka is designed to be fault-tolerant, there will be times when your Kafka environment just isn’t working as expected.
Whether it’s a newly configured application not processing messages, or an outage in a high-load, mission-critical production environment, it’s crucial to get up and running as quickly and safely as possible.
IBM has hosted production Kafka environments for several years and has in-depth knowledge of how to diagnose and resolve problems rapidly and accurately to ensure minimal impact to end users.
This session will discuss our experiences of how to most effectively collect and understand Kafka diagnostics. We’ll talk through using these diagnostics to work out what’s gone wrong, and how to recover from a system outage. Using this new-found knowledge, you will be equipped to handle any problem your cluster throws at you.
Building a Codeless Log Pipeline w/ Confluent Sink Connector | Pollyanna Vale... - Hosted by Confluent
Kubernetes became the de-facto standard for running cloud-native applications. And many users turn to it also to run stateful applications such as Apache Kafka. You can use different tools to deploy Kafka on Kubernetes - write your own YAML files, use Helm Charts, or go for one of the available operators. But there is one thing all of these have in common. You still need very good knowledge of Kubernetes to make sure your Kafka cluster works properly in all situations. This talk will cover different Kubernetes features such as resources, affinity, tolerations, pod disruption budgets, topology spread constraints and more. And it will explain why they are important for Apache Kafka and how to use them. If you are interested in running Kafka on Kubernetes and do not know all of these, this is a talk for you.
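As a concrete illustration of one of the features mentioned, a pod disruption budget for a Kafka cluster might look like the sketch below; the label names and broker count are assumptions, not taken from the talk:

```yaml
# Keep at least 2 of 3 brokers available during voluntary disruptions
# (e.g. node drains during cluster upgrades), so the cluster never loses
# quorum/replication capacity to routine Kubernetes maintenance.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: kafka-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: kafka-broker
```

Topology spread constraints and anti-affinity serve the complementary goal of keeping those broker pods on different nodes and zones in the first place.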
Big Data Kappa | Mark Senerth, The Walt Disney Company - DMED, Data Tech - Hosted by Confluent
In a world with an ever-growing shift towards event-driven streaming data, Kafka is firmly embedded at the epicenter of any data platform's central nervous system. To aid the shift of analytics towards true event time, we have implemented a pure Kappa architecture – effectively turning the database inside out. By extending the concept of a truly idempotent stream of events, Kafka has been elevated to the source of truth. We have eliminated extra network trips for joins as well as for querying state, which has significantly improved processing performance while also reducing processing latency. Tune in to discuss challenges, tips, and lessons learned while implementing a pure Kappa architecture. I will address hurdles such as scaling, warm standbys, schema evolution, and batch replay strategies, highlighting issues prevalent in any streaming Kappa-based architecture. Streaming big data in and of itself comes with its own set of challenges, such as serialization formats, encryption, and strategies to efficiently utilize message headers. I invite each and every one of you to embark on a journey discussing a means to an end: processing billions of records each day.
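The "database inside out" idea above can be sketched as a replay over a keyed, compacted event log. A minimal Python illustration, assuming (offset, key, value) events with None acting as a tombstone:

```python
def replay(log):
    """Rebuild current state from a list of (offset, key, value) events.
    Later offsets win, and a None value deletes the key (a tombstone) -
    the same rule Kafka log compaction applies when retaining only the
    latest record per key."""
    state = {}
    for _offset, key, value in sorted(log):
        if value is None:
            state.pop(key, None)
        else:
            state[key] = value
    return state

state = replay([(0, "a", 1), (1, "b", 2), (2, "a", 3), (3, "b", None)])
```

Because replay is idempotent, any consumer can reconstruct the same state from scratch, which is what lets the log itself serve as the source of truth.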
IoT Data Platforms: Processing IoT Data with Apache Kafka™ - Confluent
Apache Kafka is a de-facto standard in most IoT data platforms and stream processing solutions. Kafka is the ideal solution to ingest and process sensor data at scale and in real time. This talk introduces Kafka, Confluent (the company driving Kafka evolution), typical IoT architectures and solutions.
Continuous SQL with SQL Stream Builder
Eventador / Cloudera
Flink SQL, Kafka, Apache NiFi, SMM, Schema Registry, Avro, JSON, Apache Calcite
Future of Data New York Meetup
Kenny Gorman, Tim Spann, John Kuchmek
Event: https://www.meetup.com/de-DE/Vienna-Kafka-meetup/events/262314643/
Speaker: Patrik Kleindl (patrik.kleindl@bearingpoint.com)
Slides of the introduction to Apache Kafka and some popular use cases.
Slides were provided by Confluent (confluent.io)
Project Ouroboros: Using StreamSets Data Collector to Help Manage the StreamS... - Pat Patterson
On a typical day we see hundreds of downloads of StreamSets Data Collector, our open source data integration tool. We used to wrangle our download logs using a combination of the AWS S3 command line, sed, grep, awk and other tools, all run from a shell script (on my laptop!) once a week. This was a classic example of a brittle, hard to maintain, custom data integration. One day it dawned on me, "This is crazy, we have a tool that can do all this!". In this session, I'll explain how I built a dataflow pipeline to stream content delivery network (CDN) logs from S3 to MySQL in real-time, allowing us to gain valuable insights into our open source community. You'll also learn how we use the same techniques to not only gain insights into our community on Slack, but also build tools to better serve them.
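The log wrangling described above ultimately reduces to parsing CDN access log lines into structured records. A minimal Python sketch, assuming a hypothetical common-log-style format rather than the CDN's actual one:

```python
import re

# Pattern for lines like:
# 203.0.113.9 - - [10/Oct/2019:13:55:36 +0000] "GET /download/sdc.tgz HTTP/1.1" 200
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d+)'
)

def parse_line(line):
    """Return a dict of fields for a matching log line, else None."""
    m = LOG_LINE.match(line)
    return m.groupdict() if m else None

rec = parse_line(
    '203.0.113.9 - - [10/Oct/2019:13:55:36 +0000] "GET /download/sdc.tgz HTTP/1.1" 200'
)
```

In a pipeline tool like Data Collector this parsing is a configured stage rather than hand-written code, which is exactly the maintainability win the talk describes.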
While SQL is a simple declarative language, it can be used in very advanced ways when querying streams of data on Kafka. In this talk, Kenny will discuss techniques like advanced time specification, complex event processing (CEP), unifying sparse events, restart from failure, and even using Kafka metadata like message size. He will deep dive into how schema management, data serialization formats, Apache Flink, and SQL all work together to successfully process data, and cover advanced SQL techniques, architecture, recovery, and scalability strategies from a full-stack point of view.
Attendees will see a demo of an end-to-end processing pipeline showing features and capabilities of SQL Stream Builder, including powerful new capabilities within the SQL engine itself. They will leave well versed in rich techniques for processing data with SQL at scale and gain new tips and tricks to use in their day-to-day work.
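One of the time-specification techniques mentioned, tumbling-window aggregation, can be illustrated outside any SQL engine. A minimal Python sketch, assuming (timestamp_seconds, value) events; in a streaming SQL dialect this corresponds to a GROUP BY over a fixed time window:

```python
from collections import defaultdict

def tumbling_sum(events, window_seconds=60):
    """Assign each event to a tumbling window by its event timestamp and
    sum values per window, keyed by the window's start time."""
    windows = defaultdict(int)
    for ts, value in events:
        window_start = (ts // window_seconds) * window_seconds
        windows[window_start] += value
    return dict(windows)

result = tumbling_sum([(5, 10), (59, 1), (61, 7)])
```

A real engine additionally handles late and out-of-order events via watermarks; the sketch shows only the window-assignment arithmetic.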
Transform Your Mainframe and IBM i Data for the Cloud with Precisely and Apache Kafka - Hosted by Confluent
Your mainframe and IBM i platforms do hard work for your business, supporting essential computing transactions every day. However, mainframe data does not easily integrate with the cloud platforms driving data-driven, real-time, analytics-focused business processes. Integrating data from this critical technology often results in high costs, missed deadlines, and unhappy customers. So, what can you do? Join us to hear how Precisely Connect can help use the power of Apache Kafka to eliminate data silos and make cloud-based, event-driven data architectures a reality. Start your cloud transformation journey today, knowing you don't need to leave essential transaction data behind! Learn more about:
• Where to begin your cloud transformation journey using mainframe and IBM i data and Apache Kafka
• What you need to move mainframe and IBM i data to the cloud while reducing costs, modernizing architectures, and using the staff you have today
• How Precisely Connect customers are using change data capture and Apache Kafka to deliver real-time insights to the cloud
Insurance companies are facing challenges similar to those of other disrupted market segments, such as changing customer expectations and the resulting need to differentiate themselves anew as a brand in a challenging market environment, all while operating under very strict regulatory pressure. Generali Switzerland, like many market leaders in every industry, has understood the power of data to reimagine its markets, customers, products, and business model, and managed this change by building its Connection Platform within one year.
Christian Nicoll, Director of Platform Engineering & Operations at Generali Switzerland guides us through their journey of setting up an event-driven architecture to support their digital transformation project.
Attend this online talk and learn more about:
-How Generali managed to assemble various parts into one platform
-The architecture of the Generali Connection Platform, including Confluent, Kafka, and Attunity.
-Their challenges, best practices, and lessons learned
-Generali’s plans of expanding and scaling the Connection Platform
-Additional Use Cases in regulated markets like retail banking
Achieving Real-Time Analytics at Hermes | Zulf Qureshi, HVR and Dr. Stefan Ro... - Hosted by Confluent
Hermes, Germany's largest post-independent logistics service provider for deliveries, had one main goal—make faster and smarter data-driven business decisions. But with high volumes of diverse and disparate data, how can you effectively leverage it as an asset for real-time insights and business intelligence? During this session, Hermes will share their data challenges and how HVR's high volume data replication capabilities enabled Hermes to securely and seamlessly integrate data into Kafka for real-time decision-making and greater visibility into the entire logistics process.
Apache Kafka and the Data Mesh | Michael Noll, Confluent - Hosted by Confluent
Data mesh is a relatively recent term that describes a set of principles that good modern data systems uphold. A kind of “microservices” for the data-centric world. While the data mesh is not technology-specific as a pattern, the building of systems that adopt and implement data mesh principles have a relatively long history under different guises.
In this talk, we share our recommendations and picks of what every developer should know about building a streaming data mesh with Kafka. We introduce the four principles of the data mesh: domain-driven decentralization, data as a product, self-service data platform, and federated governance. We then cover topics such as the differences between working with event streams versus centralized approaches and highlight the key characteristics that make streams a great fit for implementing a mesh, such as their ability to capture both real-time and historical data. We’ll examine how to onboard data from existing systems into a mesh, modelling the communication within the mesh, how to deal with changes to your domain’s “public” data, give examples of global standards for governance, and discuss the importance of taking a product-centric view on data sources and the data sets they share.
How a distributed graph analytics platform uses Apache Kafka for data ingestion - Hosted by Confluent
Using Kafka to stream data into TigerGraph, a distributed graph database, is a common pattern in our customers' data architectures. In the TigerGraph database, the Kafka Connect framework was used to build the native S3 data loader. In TigerGraph Cloud, we will be building native integration with many data sources such as Azure Blob Storage and Google Cloud Storage, using Kafka as an integrated component of the Cloud Portal.
In this session, we will discuss both architectures: (1) the built-in Kafka Connect framework within the TigerGraph database, and (2) using a Kafka cluster for cloud-native integration with other popular data sources. A demo will be provided for both data streaming processes.
Fully-Managed, Multi-Tenant Kafka Clusters: Tips, Tricks, and Tools (Christop...confluent
Running a multi-tenant Kafka platform designed for the enterprise can be challenging. You need to manage and plan for data growth, support an ever-increasing number of use cases, and ensure your developers can be productive with the latest tools in the Apache Kafka ecosystem — all while maintaining the stability and performance of Kafka itself.
At Bloomberg, we run a fully-managed, multi-tenant Kafka platform that is used by developers across the enterprise. The variety of use cases for Kafka leads to bursty workloads, latency-sensitive workloads, and topologies where partitions are fanned out across hundreds or thousands of consumer groups running side-by-side in the same cluster.
In this talk, we will give a brief overview of our platform and share some of our experiences and tools for running multi-tenant stretched clusters, managing data growth with compression, and mitigating the impact of various application patterns on shared clusters.
Kafka Summit 2021 - Why MQTT and Kafka are a match made in heavenDominik Obermaier
A fast and efficient integration of end device data into data processing systems is becoming increasingly important in the Internet of Things. Factors such as secure and reliable data transmission, real-time data processing and the analysis of huge amounts of data afterwards play a major role. In order to implement an architecture that enables smooth communication between end devices and systems, tools are needed that are designed for a type of communication that takes these factors into account.
This presentation shows the strengths and application areas of MQTT, the de-facto standard communication protocol for the Internet of Things, and Apache Kafka, which is often used for data streaming in the Internet of Things, and explains how and why the technologies complement each other ideally.
How are leading companies deploying Spark with Hadoop in production? What insights have they learned and what key considerations should you consider to put your Spark-based innovative app to work faster? Hear real-life customer examples of turning data into action using Spark and Hadoop and how advanced users are deploying Hadoop and Spark applications in one cluster with better reliability and performance at production scale.
Help, My Kafka is Broken! (Emma Humber & Gantigmaa Selenge, IBM) Kafka Summit...HostedbyConfluent
While Apache Kafka is designed to be fault-tolerant, there will be times when your Kafka environment just isn’t working as expected.
Whether it’s a newly configured application not processing messages, or an outage in a high-load, mission-critical production environment, it’s crucial to get up and running as quickly and safely as possible.
IBM has hosted production Kafka environments for several years and has in-depth knowledge of how to diagnose and resolve problems rapidly and accurately to ensure minimal impact to end users.
This session will discuss our experiences of how to most effectively collect and understand Kafka diagnostics. We’ll talk through using these diagnostics to work out what’s gone wrong, and how to recover from a system outage. Using this new-found knowledge, you will be equipped to handle any problem your cluster throws at you.
Building a Codeless Log Pipeline w/ Confluent Sink Connector | Pollyanna Vale...HostedbyConfluent
Kubernetes became the de-facto standard for running cloud-native applications. And many users turn to it also to run stateful applications such as Apache Kafka. You can use different tools to deploy Kafka on Kubernetes - write your own YAML files, use Helm Charts, or go for one of the available operators. But there is one thing all of these have in common. You still need very good knowledge of Kubernetes to make sure your Kafka cluster works properly in all situations. This talk will cover different Kubernetes features such as resources, affinity, tolerations, pod disruption budgets, topology spread constraints and more. And it will explain why they are important for Apache Kafka and how to use them. If you are interested in running Kafka on Kubernetes and do not know all of these, this is a talk for you.
Big Data Kappa | Mark Senerth, The Walt Disney Company - DMED, Data TechHostedbyConfluent
In a world where there is an ever growing shift towards event driven streaming data, Kafka is firmly embedded in the epicenter of any Data Platform’s central nervous system. In an attempt to aide in the shift of analytics towards true event time, we have implemented a pure Kappa architecture - effectively turning the database inside out. Through extending the concept of a truly idempotent stream of events, Kafka has been elevated to the source of truth. We have eliminated extra network trips for joins as well as querying state which has significantly improved processing performance while also reducing processing latency. Tune in to discuss challenges, tips and lessons learned while implementing a pure Kappa Architecture. I will address hurdles such as scaling, warm standbys, schema evolution, and batch replay strategies - highlighting issues prevalent with any streaming Kappa based architecture. Streaming big data in and of itself comes with its own set of challenges - such as serialization formats, encryption, and strategies to efficiently utilize message headers. I invite each and every one of you to embark on a journey discussing a means to an end - resulting in processing billions of records each day.
IoT Data Platforms: Processing IoT Data with Apache Kafka™confluent
Apache Kafka is a de-facto standard in most IoT data platforms and stream processing solutions. Kafka is the ideal solution to ingest and process sensor data at scale and in real time. This talk introduces Kafka, Confluent (the company driving Kafka evolution), typical IoT architectures and solutions.
Continuous SQL with SQL Stream Builder
Eventador | Cloudera
Flink SQL, Apache Kafka, Apache NiFi, SMM, Schema Registry
Avro, JSON, Apache Calcite. Meetup: Future of Data New York
Kenny Gorman, Tim Spann, John Kuchmek
Event: https://www.meetup.com/de-DE/Vienna-Kafka-meetup/events/262314643/
Speaker: Patrik Kleindl (patrik.kleindl@bearingpoint.com)
Slides of the introduction to Apache Kafka and some popular use cases.
Slides were provided by Confluent (confluent.io)
Project Ouroboros: Using StreamSets Data Collector to Help Manage the StreamS... | Pat Patterson
On a typical day we see hundreds of downloads of StreamSets Data Collector, our open source data integration tool. We used to wrangle our download logs using a combination of the AWS S3 command line, sed, grep, awk and other tools, all run from a shell script (on my laptop!) once a week. This was a classic example of a brittle, hard to maintain, custom data integration. One day it dawned on me, "This is crazy, we have a tool that can do all this!". In this session, I'll explain how I built a dataflow pipeline to stream content delivery network (CDN) logs from S3 to MySQL in real-time, allowing us to gain valuable insights into our open source community. You'll also learn how we use the same techniques to not only gain insights into our community on Slack, but also build tools to better serve them.
While SQL is a simple declarative language, it can be used in very advanced ways when querying streams of data on Kafka - in this talk Kenny will discuss techniques like advanced time specification, complex event processing (CEP), unifying sparse events, restart from failure, and even using Kafka metadata like message size. He will deep dive into how schema management, data serialization formats, Apache Flink and SQL all work together to successfully process data. He will cover advanced SQL techniques, architecture, recovery and scalability strategies from a full stack point of view.
Attendees will see a demo of end-to-end processing pipeline showing features and capabilities of SQL Stream Builder that show new powerful capabilities within the SQL engine itself. They will leave well versed in rich techniques for processing data with SQL at scale and gain new tips and tricks to use in their day to day work.
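The "advanced time specification" the talk mentions usually means event-time windowing, the kind of query SQL Stream Builder expresses with Flink SQL's `TUMBLE`. The sketch below is a pure-Python stand-in for what the engine computes; the query in the comment and the field layout are illustrative assumptions, not taken from the demo.

```python
# Minimal sketch of event-time tumbling-window aggregation, roughly what
#   SELECT window_start, COUNT(*) FROM events
#   GROUP BY TUMBLE(event_time, INTERVAL '60' SECOND)
# computes in Flink SQL (hypothetical query for illustration).
from collections import defaultdict

def tumble_count(events, window_size_s):
    """Count events per tumbling event-time window.

    Each event is (event_time_seconds, payload); windows are aligned to
    multiples of window_size_s, matching TUMBLE's alignment.
    """
    counts = defaultdict(int)
    for ts, _payload in events:
        window_start = (ts // window_size_s) * window_size_s
        counts[window_start] += 1
    return dict(counts)

events = [(3, "a"), (59, "b"), (61, "c"), (125, "d")]
```

Because windows are keyed by event time rather than arrival time, a replayed or late stream produces the same window counts, which is what makes restart-from-failure tractable.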
Transform Your Mainframe and IBM i Data for the Cloud with Precisely and Apac... | HostedbyConfluent
Your mainframe and IBM i platforms do hard work for your business, supporting essential computing transactions every day. However, mainframe data does not easily integrate with the cloud platforms driving data-driven, real-time, analytics-focused business processes. Integrating data from this critical technology often results in high costs, missed deadlines, and unhappy customers. So, what can you do? Join us to hear how Precisely Connect can help use the power of Apache Kafka to eliminate data silos and make cloud-based, event-driven data architectures a reality. Start your cloud transformation journey today, knowing you don’t need to leave essential transaction data behind! Learn more about: • Where to begin your cloud transformation journey using mainframe and IBM i data and Apache Kafka • What you need to move mainframe and IBM i data to the cloud while reducing costs, modernizing architectures, and using the staff you have today • How Precisely Connect customers are using change data capture and Apache Kafka to deliver real-time insights to the cloud
Insurance companies face the same challenges as other disrupted market segments, such as changing customer expectations and the resulting need to differentiate themselves anew as a brand in a challenging market environment - all while operating under very strict regulatory pressure. Generali Switzerland, like many market leaders in every industry, has understood the power of data to reimagine its markets, customers, products, and business model, and managed this change by building its Connection Platform within one year.
Christian Nicoll, Director of Platform Engineering & Operations at Generali Switzerland guides us through their journey of setting up an event-driven architecture to support their digital transformation project.
Attend this online talk and learn more about:
-How Generali managed to assemble various parts into one platform
-The architecture of the Generali Connection Platform, including Confluent, Kafka, and Attunity.
-Their challenges, best practices, and lessons learned
-Generali’s plans of expanding and scaling the Connection Platform
-Additional Use Cases in regulated markets like retail banking
Achieving Real-Time Analytics at Hermes | Zulf Qureshi, HVR and Dr. Stefan Ro... | HostedbyConfluent
Hermes, Germany's largest post-independent logistics service provider for deliveries, had one main goal—make faster and smarter data-driven business decisions. But with high volumes of diverse and disparate data, how can you effectively leverage it as an asset for real-time insights and business intelligence? During this session, Hermes will share their data challenges and how HVR's high volume data replication capabilities enabled Hermes to securely and seamlessly integrate data into Kafka for real-time decision-making and greater visibility into the entire logistics process.
Apache Kafka and the Data Mesh | Michael Noll, Confluent | HostedbyConfluent
Data mesh is a relatively recent term that describes a set of principles that good modern data systems uphold. A kind of “microservices” for the data-centric world. While the data mesh is not technology-specific as a pattern, the building of systems that adopt and implement data mesh principles have a relatively long history under different guises.
In this talk, we share our recommendations and picks of what every developer should know about building a streaming data mesh with Kafka. We introduce the four principles of the data mesh: domain-driven decentralization, data as a product, self-service data platform, and federated governance. We then cover topics such as the differences between working with event streams versus centralized approaches and highlight the key characteristics that make streams a great fit for implementing a mesh, such as their ability to capture both real-time and historical data. We’ll examine how to onboard data from existing systems into a mesh, modelling the communication within the mesh, how to deal with changes to your domain’s “public” data, give examples of global standards for governance, and discuss the importance of taking a product-centric view on data sources and the data sets they share.
How a distributed graph analytics platform uses Apache Kafka for data ingesti... | HostedbyConfluent
Using Kafka to stream data into TigerGraph, a distributed graph database, is a common pattern in our customers’ data architecture. In the TigerGraph database, the Kafka Connect framework was used to build the native S3 data loader. In TigerGraph Cloud, we will be building native integration with many data sources such as Azure Blob Storage and Google Cloud Storage, using Kafka as an integrated component for the Cloud Portal.
In this session, we will be discussing both architectures: 1. the built-in Kafka Connect framework within the TigerGraph database; 2. using a Kafka cluster for cloud-native integration with other popular data sources. A demo will be provided for both data streaming processes.
Meetup Streaming Data Pipeline Development | Timothy Spann
Meetup Streaming Data Pipeline Development
28 June 2023 6pm EST
Milwaukee meetup
https://www.meetup.com/futureofdata-princeton/events/292976004/
Details
This will be a hybrid event with a Zoom. The in-person event will be in Milwaukee.
In this interactive session, Tim will lead participants through how to best build streaming data pipelines. He will cover how to build applications from some common use cases and highlight tips, tricks, best practices and patterns.
He will show how to build the easy way and then dive deep into the underlying open source technologies including Apache NiFi, Apache Flink, Apache Kafka and Apache Iceberg.
If you wish to follow along, please download open source projects beforehand. You can also download this helpful streaming platform: https://docs.cloudera.com/csp-ce/latest/installation/topics/csp-ce-installing-ce.html
All source code and slides will be shared for those interested in building their own FLaNK Apps. https://www.flankstack.dev/
https://www.thecapitalgrille.com/locations/wi/milwaukee/milwaukee/8027
The Capital Grille 310 W Wisconsin Ave, Milwaukee, WI 53203
Limited seating; preference will be given to NLIT attendees.
A peek at the menu (not pizza):
RISOTTO FRITTERS WITH FRESH MOZZARELLA AND PROSCIUTTO
SLICED SIRLOIN WITH ROQUEFORT AND BALSAMIC ONIONS
MINIATURE LOBSTER AND CRAB CAKES
WILD MUSHROOM AND HERBED CHEESE
You can join the meeting virtually here (no meat or cheese virtually):
Future of Data Milwaukee Meetup Streaming Data Pipeline Development 28 June 2023
Future of Data: New Jersey - Princeton, Edison, Holmdel
Introducing Cloudera DataFlow (CDF) 2.13.19 | Cloudera, Inc.
Watch this webinar to understand how Hortonworks DataFlow (HDF) has evolved into the new Cloudera DataFlow (CDF). Learn about key capabilities that CDF delivers, such as:
-Powerful data ingestion powered by Apache NiFi
-Edge data collection by Apache MiNiFi
-IoT-scale streaming data processing with Apache Kafka
-Enterprise services to offer unified security and governance from edge-to-enterprise
Meetup: Streaming Data Pipeline Development | Timothy Spann
Meetup: Streaming Data Pipeline Development
In this interactive session, Tim will lead participants through how to best build streaming data pipelines. He will cover how to build applications from some common use cases and highlight tips, tricks, best practices and patterns.
He will show how to build the easy way and then dive deep into the underlying open source technologies including Apache NiFi, Apache Flink, Apache Kafka and Apache Iceberg.
If you wish to follow along, please download open source projects beforehand. You can also download this helpful streaming platform: https://docs.cloudera.com/csp-ce/latest/installation/topics/csp-ce-installing-ce.html
All source code and slides will be shared for those interested in building their own FLaNK Apps. https://www.flankstack.dev/
You can join the meeting virtually here:
https://cloudera.zoom.us/j/91603330726
Speaker - Tim Spann
Tim Spann is a Principal Developer Advocate in Data In Motion for Cloudera. He works with Apache NiFi, Apache Pulsar, Apache Kafka, Apache Flink, Flink SQL, Apache Pinot, Trino, Apache Iceberg, DeltaLake, Apache Spark, Big Data, IoT, Cloud, AI/DL, machine learning, and deep learning. Tim has over ten years of experience with the IoT, big data, distributed computing, messaging, streaming technologies, and Java programming. Previously, he was a Developer Advocate at StreamNative, Principal DataFlow Field Engineer at Cloudera, a Senior Solutions Engineer at Hortonworks, a Senior Solutions Architect at AirisData, a Senior Field Engineer at Pivotal and a Team Leader at HPE. He blogs for DZone, where he is the Big Data Zone leader, and runs a popular meetup in Princeton & NYC on Big Data, Cloud, IoT, deep learning, streaming, NiFi, the blockchain, and Spark. Tim is a frequent speaker at conferences such as ApacheCon, DeveloperWeek, Pulsar Summit and many more. He holds a BS and MS in computer science.
Successful AI/ML Projects with End-to-End Cloud Data Engineering | Databricks
Trusted, high-quality data and efficient use of data engineers’ time are critical success factors for AI/ML projects. Enterprise data is complex—it comes from several sources, in a variety of formats, and at varied speeds. For your machine learning projects on Apache Spark, you need a holistic approach to data engineering: finding & discovering, ingesting & integrating, server-less processing at scale, and data governance. Stop by this session for an overview on how to set up AI/ML projects for success while Informatica takes the heavy lifting out of your data engineering.
Overcoming the Challenges of Architecting for the Cloud | Zscaler
The concept of backhauling traffic to a centralized datacenter worked when both users and applications resided there. But, the migration of applications from the data center to the cloud requires organizations to rethink their branch and network architectures. What is the best approach to manage costs, reduce risk, and deliver the best user experience for all your users?
Watch this webcast to uncover the five key requirements to overcome these challenges and securely route your branch traffic direct to the cloud.
OSACon 2023_ Unlocking Financial Data with Real-Time Pipelines | Timothy Spann
OSACon 2023_ Unlocking Financial Data with Real-Time Pipelines
Unlocking Financial Data with Real-Time Pipelines
Financial institutions thrive on accurate and timely data to drive critical decision-making processes, risk assessments, and regulatory compliance. However, managing and processing vast amounts of financial data in real-time can be a daunting task. To overcome this challenge, modern data engineering solutions have emerged, combining powerful technologies like Apache Flink, Apache NiFi, Apache Kafka, and Iceberg to create efficient and reliable real-time data pipelines. In this talk, we will explore how this technology stack can unlock the full potential of financial data, enabling organizations to make data-driven decisions swiftly and with confidence.
Introduction: Financial institutions operate in a fast-paced environment where real-time access to accurate and reliable data is crucial. Traditional batch processing falls short when it comes to handling rapidly changing financial markets and responding to customer demands promptly. In this talk, we will delve into the power of real-time data pipelines, utilizing the strengths of Apache Flink, Apache NiFi, Apache Kafka, and Iceberg, to unlock the potential of financial data.
Key Points to be Covered:
Introduction to Real-Time Data Pipelines: a. The limitations of traditional batch processing in the financial domain. b. Understanding the need for real-time data processing.
Apache Flink: Powering Real-Time Stream Processing: a. Overview of Apache Flink and its role in real-time stream processing. b. Use cases for Apache Flink in the financial industry. c. How Flink enables fast, scalable, and fault-tolerant processing of streaming financial data.
Apache Kafka: Building Resilient Event Streaming Platforms: a. Introduction to Apache Kafka and its role as a distributed streaming platform. b. Kafka's capabilities in handling high-throughput, fault-tolerant, and real-time data streaming. c. Integration of Kafka with financial data sources and consumers.
Apache NiFi: Data Ingestion and Flow Management: a. Overview of Apache NiFi and its role in data ingestion and flow management. b. Data integration and transformation capabilities of NiFi for financial data. c. Utilizing NiFi to collect and process financial data from diverse sources.
Iceberg: Efficient Data Lake Management: a. Understanding Iceberg and its role in managing large-scale data lakes. b. Iceberg's schema evolution and table-level metadata capabilities. c. How Iceberg simplifies data lake management in financial institutions.
Real-World Use Cases: a. Real-time fraud detection using Flink, Kafka, and NiFi. b. Portfolio risk analysis with Iceberg and Flink. c. Streamlined regulatory reporting leveraging all four technologies.
Best Practices and Considerations: a. Architectural considerations when building real-time financial data pipelines. b. Ensuring data integrity, security, and compliance in real-time pipelines. c. Scalability an
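The real-time fraud detection use case above can be made concrete with a small sketch: a stateful per-account velocity check of the kind a Flink job might apply to a Kafka stream of transactions. The thresholds, window size, and record layout are illustrative assumptions, not from the talk.

```python
# Hedged sketch of streaming fraud detection: flag an account that
# exceeds a transaction-count threshold within a sliding time window.
from collections import deque

def fraud_flags(transactions, max_txns=3, window_s=60):
    """Flag a transaction when its account exceeds max_txns within window_s.

    transactions is an iterable of (event_time_seconds, account, amount),
    assumed ordered by event time, as a Kafka partition would deliver them.
    """
    recent = {}   # account -> deque of recent event times (the job's state)
    flagged = []
    for ts, account, amount in transactions:
        times = recent.setdefault(account, deque())
        while times and ts - times[0] > window_s:
            times.popleft()  # expire events that fell out of the window
        times.append(ts)
        if len(times) > max_txns:
            flagged.append((ts, account, amount))
    return flagged

txns = [(0, "A", 10.0), (10, "A", 20.0), (20, "A", 5.0),
        (30, "A", 99.0), (500, "A", 1.0)]
```

In the full pipeline, NiFi would deliver the transaction stream to Kafka, Flink would hold this per-account state fault-tolerantly in checkpoints, and flagged events would land in Iceberg for audit and replay.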
InfluxData Architecture for IoT | Noah Crowley | InfluxData
Noah will walk you through a typical data architecture for an IoT deployment, from sensor to edge to cloud, followed by a hands-on demo gathering data from a device, displaying it on a dashboard, and triggering alerts.
2024 February 28 - NYC - Meetup Unlocking Financial Data with Real-Time Pipel... | Timothy Spann
2024 February 28 - NYC - Meetup Unlocking Financial Data with Real-Time Pipelines
https://www.meetup.com/futureofdata-newyork/events/298660453/
Unlocking Financial Data with Real-Time Pipelines
(Flink Analytics on Stocks with SQL)
By Timothy Spann
Financial institutions thrive on accurate and timely data to drive critical decision-making processes, risk assessments, and regulatory compliance. However, managing and processing vast amounts of financial data in real-time can be a daunting task. To overcome this challenge, modern data engineering solutions have emerged, combining powerful technologies like Apache Flink, Apache NiFi, Apache Kafka, and Iceberg to create efficient and reliable real-time data pipelines. In this talk, we will explore how this technology stack can unlock the full potential of financial data, enabling organizations to make data-driven decisions swiftly and with confidence.
Introduction: Financial institutions operate in a fast-paced environment where real-time access to accurate and reliable data is crucial. Traditional batch processing falls short when it comes to handling rapidly changing financial markets and responding to customer demands promptly. In this talk, we will delve into the power of real-time data pipelines, utilizing the strengths of Apache Flink, Apache NiFi, Apache Kafka, and Iceberg, to unlock the potential of financial data. I will be utilizing NiFi 2.0 with Python and Vector Databases.
Timothy Spann
Principal Developer Advocate, Cloudera
https://twitter.com/PaaSDev
https://www.linkedin.com/in/timothyspann/
https://medium.com/@tspann
https://github.com/tspannhw/FLiPStackWeekly/
Building Real-time Pipelines with FLaNK_ A Case Study with Transit Data | Timothy Spann
Building Real-time Pipelines with FLaNK_ A Case Study with Transit Data
Building Real-time Pipelines with FLaNK: A Case Study with Transit Data
In this session, we will explore the powerful combination of Apache Flink, Apache NiFi, and Apache Kafka for building real-time data processing pipelines. We will present a case study using the FLaNK-MTA project, which leverages these technologies to process and analyze real-time data from the New York City Metropolitan Transportation Authority (MTA). By integrating Flink, NiFi, and Kafka, FLaNK-MTA demonstrates how to efficiently collect, transform, and analyze high-volume data streams, enabling timely insights and decision-making.
Takeaways:
Understanding the integration of Apache Flink, Apache NiFi, and Apache Kafka for real-time data processing
Insights into building scalable and fault-tolerant data processing pipelines
Best practices for data collection, transformation, and analytics with FLaNK-MTA as a reference
Knowledge of use cases and potential business impact of real-time data processing pipelines
https://github.com/tspannhw/FLaNK-MTA/tree/main
https://medium.com/@tspann/finding-the-best-way-around-7491c76ca4cb
apache nifi
apache kafka
apache flink
apache iceberg
apache parquet
real-time streaming
tim spann
principal developer advocate
cloudera
datainmotion.dev
Using the FLaNK Stack for edge ai (flink, nifi, kafka, kudu) | Timothy Spann
Using the FLaNK Stack for edge ai (flink, nifi, kafka, kudu)
Introducing the FLaNK stack, which combines Apache Flink, Apache NiFi, Apache Kafka and Apache Kudu to build fast applications for IoT, AI, and rapid ingest.
FLaNK provides a quick set of tools to build applications at any scale for any streaming and IoT use cases.
https://www.flankstack.dev/
Tools
Apache Flink, Apache Kafka, Apache NiFi, MiNiFi, Apache MXNet, Apache Kudu, Apache Impala, Apache HDFS
References
https://www.datainmotion.dev/2019/08/rapid-iot-development-with-cloudera.html
https://www.datainmotion.dev/2019/09/powering-edge-ai-for-sensor-reading.html
https://www.datainmotion.dev/2019/05/dataworks-summit-dc-2019-report.html
https://www.datainmotion.dev/2019/03/using-raspberry-pi-3b-with-apache-nifi.html
Track
Community and Industry Impact
The Edge to AI Deep Dive Barcelona Meetup March 2019 | Timothy Spann
The Edge to AI Deep Dive Barcelona Meetup March 2019
A deep dive demo of using MiNiFi, NiFi, CDSW for real-time AI at the edge, in a local cluster, in the cloud and in a Data Science platform at scale with real-time streaming and data storage.
Apache NiFi, MiNiFi, NiFi Registry, Cloudera Data Science Workbench (CDSW), Python, Pyspark, Spark SQL, Apache Calcite, Apache Parquet, Apache MXNet, GluonCV.
A Journey to the Cloud with Data Virtualization | Denodo
Watch this Fast Data Strategy Virtual Summit with speakers Cijo Thomas Isaac, Big Data Architect, Asurion & Nick Sarkisian, Associate Vice President - North America Analytics Head, HCL here: https://buff.ly/2KwLvj3
As Asurion expanded its operations globally, its global client base expected the highest-quality customer service, something Asurion prides itself on. At the same time, Asurion's brand-new digital home premium support required strong predictive analytics, IoT, and big data architecture support to provide its customers with the best user experience.
Attend this session to learn:
• How Asurion built its hybrid cloud environment using data virtualization
• Why centralizing security and data governance is key to their data architecture
• Why data virtualization is important for their advanced analytics and data science
Unlocking Financial Data with Real-Time Pipelines
tspannOSSFinance_UnlockingFinancialDatawithReal-TimePipelines.pdf
Tim Spann
Twitter: @PaasDev // Blog: datainmotion.dev
Principal Developer Advocate.
Princeton Future of Data Meetup.
ex-Pivotal, ex-Hortonworks, ex-StreamNative, ex-PwC
https://medium.com/@tspann
https://github.com/tspannhw
https://events.linuxfoundation.org/open-source-finance-forum-new-york/
Open Source in Finance Forum NYC
November 1, 2023
Similar to Tracking crime as it occurs with apache phoenix, apache hbase and apache nifi (20)
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, ai, big data, real-time, robots and Milvus.
A lively discussion with NJ Gen AI Meetup Lead, Prasad and Procure.FYI's Co-Found
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will come present about related topics such as vector databases, LLMs, and managing data at scale. The intended audience of this group includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly the Milvus Meetup, and is sponsored by Zilliz, maintainers of Milvus.
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK | Timothy Spann
Building Real-Time Pipelines With FLaNK
Timothy Spann, Principal Developer Advocate, Streaming - Cloudera Future of Data meetup, startup grind, AI Camp
The combination of Apache Flink, Apache NiFi, and Apache Kafka for building real-time data processing pipelines is extremely powerful, as demonstrated by this case study using the FLaNK-MTA project. The project leverages these technologies to process and analyze real-time data from the New York City Metropolitan Transportation Authority (MTA). FLaNK-MTA demonstrates how to efficiently collect, transform, and analyze high-volume data streams, enabling timely insights and decision-making.
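The collect-and-transform step of a FLaNK-style transit pipeline can be sketched as a pure-Python stand-in for a NiFi record transform that normalizes raw feed records before they are published to Kafka for Flink analytics. The field names below are hypothetical, not the actual MTA feed schema; see the FLaNK-MTA repository for the real flows.

```python
# Sketch of a NiFi-style record transform for transit data: keep only
# the fields downstream consumers need, coerce types, tag the source.
def transform(record):
    """Normalize one raw transit record into the shape published to Kafka."""
    return {
        "route": record["route_id"].upper(),     # normalize route casing
        "stop": record["stop_id"],
        "delay_s": int(record.get("delay", 0)),  # coerce string delay to int
        "source": "mta-feed",                    # provenance tag for lineage
    }

raw = {"route_id": "m15", "stop_id": "401234", "delay": "120"}
```

Keeping transforms small, pure, and schema-explicit like this is what makes the pipeline easy to test, replay, and evolve as upstream feeds change.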
Apache NiFi
Apache Kafka
Apache Flink
Apache Iceberg
LLM
Generative AI
Slack
Postgresql
Generative AI on Enterprise Cloud with NiFi and Milvus | Timothy Spann
Gen AI on Enterprise Cloud
Apache NiFi
Milvus
Apache Kafka
Apache Flink
Cloudera Machine Learning
Cloudera DataFlow
https://medium.com/@tspann/building-a-milvus-connector-for-nifi-34372cb3c7fa
https://www.meetup.com/futureofdata-princeton/events/300737266/
https://lu.ma/q7pcfyjn?source=post_page-----34372cb3c7fa--------------------------------&tk=TTyakY
If you're interested in working with Generative AI on the cloud, this virtual workshop is for you.
Tim Spann from Cloudera and Yujian Tang from Zilliz will cover how you can implement your own GenAI workflows on the cloud at enterprise scale.
9:00 - 9:05: Intro
9:05 - 9:15: What is Milvus
9:15 - 9:25: Cloudera Development Platform
9:25 - 10:00: Demo
Location
https://www.youtube.com/watch?v=IfWIzKsoHnA
https://github.com/tspannhw/SpeakerProfile
https://www.linkedin.com/in/yujiantang/
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines | Timothy Spann
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
https://www.youtube.com/watch?v=Yeua8NlzQ3Y
https://www.conf42.com/Large_Language_Models_LLMs_2024_Tim_Spann_generative_ai_streaming
Adding Generative AI to Real-Time Streaming Pipelines
Abstract
Let’s build streaming pipelines that convert streaming events into prompts, call LLMs, and process the results.
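The event-to-prompt-to-LLM pattern the abstract describes can be sketched in a few lines. The model call below is a stub standing in for a real endpoint (an assumption, not a real API); in the talk's stack that hop would be an HTTP or Kafka round-trip from a NiFi/Flink pipeline.

```python
# Minimal sketch: convert a streaming event into a prompt, call an LLM,
# and attach the result back onto the event for downstream consumers.
def event_to_prompt(event):
    """Render one chat event into a prompt string."""
    return f"Summarize this chat message from {event['user']}: {event['text']}"

def call_llm(prompt):
    # Stub for a real model endpoint; a production pipeline would POST
    # the prompt to a serving API and parse the response.
    return f"[summary of: {prompt}]"

def enrich(event):
    """Process one event through the prompt -> LLM -> result pipeline."""
    prompt = event_to_prompt(event)
    return {**event, "llm_response": call_llm(prompt)}

event = {"user": "tim", "text": "When is the next meetup?"}
```

Keeping the prompt rendering, model call, and enrichment as separate steps mirrors how the pipeline is wired in NiFi: each step is an independent processor that can be swapped (for a different model, say) without touching the others.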
Summary
Tim Spann: My talk is adding generative AI to real time streaming pipelines. I'm going to discuss a couple of different open source technologies. We'll touch on Kafka, Nifi, Flink, Python, Iceberg. All the slides, all the code and GitHub are out there.
Llm, if you didn't know, is rapidly evolving. There's a lot of different ways to interact with models. That enrichment, transformation, processing really needs tools. The amount of models and projects and software that are available is massive.
Nifi supports hundreds of different inputs and can convert them on the fly. Great way to distribute your data quickly to whoever needs it without duplication, without tight coupling. Fun to find new things to integrate into.
So what we can do is, well, I want to get a meetup chat going. I have a processor here that just listens for events as they come from slack. And then I'm going to clean it up, add a couple fields and push that out to slack. Every model is a little bit of different tweaking.
Nifi acts as a whole website. And as you see here, it can be get, post, put, whatever you want. We send that response back to flink and it shows up here. Thank you for attending this talk. I'm going to be speaking at some other events very shortly.
Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi, Tim Spann here. My talk is adding generative AI to real time streaming pipelines, and we're here for the large language model conference at Comp 42, which is always a nice one, great place to be. I'm going to discuss a couple of different open source technologies that work together to enable you to build real time pipelines using large language models. So we'll touch on Kafka, Nifi, Flink, Python, Iceberg, and I'll show you a little bit of each one in the demos. I've been working with data machine learning, streaming IoT, some other things for a number of years, and you could contact me at any of these places, whether Twitter or whatever it's called, some different blogs, or in person at my meetups and at different conferences around the world. I do a weekly newsletter, cover streaming ML, a lot of LLM, open source, Python, Java, all kinds of fun stuff, as I mentioned, do a bunch of different meetups. They are not just in the east coast of the US, they are available virtually live, and I also put them on YouTube, and if you need them somewhere else, let me know. We publish all the slides, all the code and GitHub. Everything you need is out there. Let's get into the talk. Llm, if you didn't know, is rapidly evolving. While you're typing down the things that you use, it
2024 XTREMEJ_ Building Real-time Pipelines with FLaNK_ A Case Study with Tra... | Timothy Spann
2024 XTREMEJ_ Building Real-time Pipelines with FLaNK_ A Case Study with Transit Data
https://xtremej.dev/2023/schedule/
Building Real-time Pipelines with FLaNK: A Case Study with Transit Data
Overview of the problem, the application (code walkthru and running), overview of FLaNK, introduction to NiFi, introduction to Kafka, and introduction to Flink.
28March2024-Codeless-Generative-AI-Pipelines
https://www.meetup.com/futureofdata-princeton/events/299440871/
https://www.meetup.com/real-time-analytics-meetup-ny/events/299290822/
******Note*****
The event is seat-limited, therefore please complete your registration here. Only people completing the form will be able to attend.
-----------------------
We're excited to invite you to join us in-person, for a Real-Time Analytics exploration!
Join us for an evening of insights, networking as we delve into the OSS technologies shaping the field!
Agenda:
05:30-06:00: Pizza and friends
06:00- 06:40: Codeless GenAI Pipelines with Flink, Kafka, NiFi
06:40- 07:20 Real-Time Analytics in the Corporate World: How Apache Pinot® Powers Industry Leaders
07:20-07:30 QNA
Codeless GenAI Pipelines with Flink, Kafka, NiFi | Tim Spann, Cloudera
Explore the power of real-time streaming with GenAI using Apache NiFi. Learn how NiFi simplifies data engineering workflows, allowing you to focus on creativity over technical complexities. I'll guide you through practical examples, showcasing NiFi's automation impact from ingestion to delivery. Whether you're a seasoned data engineer or new to GenAI, this talk offers valuable insights into optimizing workflows. Join us to unlock the potential of real-time streaming and witness how NiFi makes data engineering a breeze for GenAI applications!
Real-Time Analytics in the Corporate World: How Apache Pinot® Powers Industry Leaders | Viktor Gamov, StarTree
Explore how industry leaders like LinkedIn, Uber Eats, and Stripe are mastering real-time data with Viktor as your guide. Discover how Apache Pinot transforms data into actionable insights instantly. Viktor will showcase Pinot's features, including the Star-Tree Index, and explain why it's a game-changer in data strategy. This session is for everyone, from data geeks to business gurus, eager to uncover the future of tech. Join us and be wowed by the power of real-time analytics with Apache Pinot!
-------
Tim Spann is a Principal Developer Advocate in Data In Motion for Cloudera.
He works with Apache NiFi, Apache Kafka, Apache Pulsar, Apache Flink, Flink SQL, Apache Pinot, Trino, Apache Iceberg, DeltaLake, Apache Spark, Big Data, IoT, Cloud, AI/DL, machine learning, and deep learning. Tim has over ten years of experience with the IoT, big data, distributed computing, messaging, streaming technologies, and Java programming. Previously, he was a Developer Advocate at StreamNative, Principal DataFlow Field Engineer at Cloudera, a Senior Solutions Engineer at Hortonworks, a Senior Solutions Architect at AirisData, a Senior Field Engineer at Pivotal and a Team Leader at HPE. He blogs for DZone, where he is the Big Data Zone leader, and runs a popular meetup in Princeton & NYC on Big Data, Cloud, IoT, deep learning, streaming, NiFi, the blockchain, and Spark. Tim is a frequent speaker at conferences such as ApacheCon, DeveloperWeek, Pulsar Summit and many more.
TCFPro24 Building Real-Time Generative AI PipelinesTimothy Spann
https://princetonacm.acm.org/tcfpro/
18th Annual IEEE IT Professional Conference (ITPC)
Armstrong Hall at The College of New Jersey
Friday, March 15th, 2024 | 10:00 AM to 5:00 PM
IT Professional Conference at Trenton Computer Festival
IEEE Information Technology Professional Conference on Friday, March 15th, 2024
TCFPro24 Building Real-Time Generative AI Pipelines
Building Real-Time Generative AI Pipelines
In this talk, Tim will delve into the exciting realm of building real-time generative AI pipelines with streaming capabilities. The discussion will revolve around the integration of cutting-edge technologies to create dynamic and responsive systems that harness the power of generative algorithms.
From leveraging streaming data sources to implementing advanced machine learning models, the presentation will explore the key components necessary for constructing a robust real-time generative AI pipeline. Practical insights, use cases, and best practices will be shared, offering a comprehensive guide for developers and data scientists aspiring to design and implement dynamic AI systems in a streaming environment.
Tim will show a live demo of how we can use Apache NiFi to provide a live chat between a person in Slack and several LLM models, all orchestrated with Apache NiFi, Apache Kafka and Python. We will use RAG against Chroma and Pinecone vector data stores, Hugging Face and WatsonX.AI LLMs, and add additional context with NiFi lookups of stocks, weather and other data streams in real time.
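As a rough illustration of the retrieval step in the demo above, here is a toy prompt builder. This is a sketch only: the real pipeline uses Chroma or Pinecone with proper embeddings, while this stand-in scores relevance by naive word overlap so it runs with no dependencies.

```python
# Toy retrieval-augmented generation (RAG) prompt builder.
# Illustrative only: a real pipeline would query a vector store such as
# Chroma or Pinecone with an embedding model, as in the demo.

def score(query: str, doc: str) -> int:
    """Crude relevance score: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def build_rag_prompt(query: str, docs: list[str], k: int = 2) -> str:
    """Retrieve the k most relevant docs and splice them into an LLM prompt."""
    top = sorted(docs, key=lambda d: score(query, d), reverse=True)[:k]
    context = "\n".join(f"- {d}" for d in top)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

docs = [
    "NiFi routes and transforms flowfiles between processors.",
    "Kafka topics store ordered, replayable event streams.",
    "Iceberg tables support schema evolution on data lakes.",
]
prompt = build_rag_prompt("How does Kafka store event streams?", docs)
print(prompt)
```

The resulting prompt would then be sent to the LLM (Hugging Face, WatsonX.AI, etc.); swapping the `score` function for a vector-store query is the only structural change needed.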
Timothy Spann
Tim Spann is a Principal Developer Advocate in Data In Motion for Cloudera. He works with Apache NiFi, Apache Pulsar, Apache Kafka, Apache Flink, Flink SQL, Apache Pinot, Trino, Apache Iceberg, DeltaLake, Apache Spark, Big Data, IoT, Cloud, AI/DL, machine learning, and deep learning. Tim has over ten years of experience with the IoT, big data, distributed computing, messaging, streaming technologies, and Java programming.
Previously, he was a Developer Advocate at StreamNative, Principal DataFlow Field Engineer at Cloudera, a Senior Solutions Engineer at Hortonworks, a Senior Solutions Architect at AirisData, a Senior Field Engineer at Pivotal and a Team Leader at HPE. He blogs for DZone, where he is the Big Data Zone leader, and runs a popular meetup in Princeton & NYC on Big Data, Cloud, IoT, deep learning, streaming, NiFi, the blockchain, and Spark.
Tim is a frequent speaker at conferences such as ApacheCon, DeveloperWeek, Pulsar Summit and many more. He holds a BS and MS in computer science.
Conf42-Python-Building Apache NiFi 2.0 Python Processors
https://www.conf42.com/Python_2024_Tim_Spann_apache_nifi_2_processors
Building Apache NiFi 2.0 Python Processors
Abstract
Let’s enhance real-time streaming pipelines with smart Python code. Adding code for vector databases and LLM.
Summary
Tim Spann: I'm going to be talking today about building Apache NiFi 2.0 Python processors. One of the main purposes of supporting Python in the streaming tool Apache NiFi is to interface with new machine learning, AI, and generative AI. He says Python is a real game changer for Cloudera.
You're just going to add some metadata around it. It's a great way to pass a file along without changing it too substantially. You really need to have Python 3.10 and JDK 21 on your machine. You've got to be smart about how you use these models.
There are a ton of Python processors available. You can use them in multiple ways. We're still in the early days of Python processors, so now's the time to start putting yours out there. We'd love to see a lot of people write their own.
When we are parsing documents here, again, this is the Python processor; I'm picking a PDF. There are lots of different things you could do. If you're interested in writing your own Python code for Apache NiFi, definitely reach out. Thank you.
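To give a feel for the shape of such a processor, here is a minimal transform-style sketch. In a real NiFi 2.0 deployment the result type and base class come from NiFi's bundled `nifiapi` package and carry more metadata; a small stand-in class is defined here (an assumption, not the real API) so the example runs on its own.

```python
# Sketch of a NiFi 2.0-style Python processor. In a real deployment the
# result type comes from NiFi's bundled nifiapi package; this minimal
# stand-in lets the sketch run standalone.

class FlowFileTransformResult:          # stand-in for nifiapi's result type
    def __init__(self, relationship, contents=None, attributes=None):
        self.relationship = relationship
        self.contents = contents
        self.attributes = attributes or {}

class UppercaseText:
    """Transform-style processor: upper-cases flowfile content, tags it
    with an attribute, and routes it to the 'success' relationship."""

    def transform(self, context, flowfile_bytes: bytes) -> FlowFileTransformResult:
        text = flowfile_bytes.decode("utf-8")
        return FlowFileTransformResult(
            relationship="success",
            contents=text.upper().encode("utf-8"),
            attributes={"transformed.by": "UppercaseText"},
        )

result = UppercaseText().transform(None, b"hello nifi")
print(result.contents)  # b'HELLO NIFI'
```

The same skeleton (decode content, transform, return contents plus attributes and a relationship) is what a PDF-parsing or vector-database processor would follow, just with heavier logic inside `transform`.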
Conf42Python -Using Apache NiFi, Apache Kafka, RisingWave, and Apache Iceberg with Stock Data and LLM
Abstract
In this talk, we’ll discuss how to use Apache NiFi, Apache Kafka, RisingWave, and Apache Iceberg to process and analyze stock data. We will demonstrate the ingestion, processing, and analysis of stock data, and illustrate how to use an LLM to generate predictions from the analyzed data.
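The final prompt-building step of such a pipeline can be sketched in a few lines. The numbers and wording here are illustrative, not from the talk: the ingestion and analytics really happen in NiFi, Kafka, and RisingWave, while this only shows folding a computed feature into an LLM prompt.

```python
# Toy sketch: compute a simple moving average over recent closing prices
# and fold it into a prompt for an LLM. In the talk's pipeline the
# ingestion and analytics run in NiFi/Kafka/RisingWave; this illustrates
# only the shape of the prompt-building step.

def moving_average(closes, window=3):
    """Average of the last `window` closing prices."""
    recent = closes[-window:]
    return sum(recent) / len(recent)

closes = [101.0, 103.5, 102.0, 104.5, 106.0]   # hypothetical closes
ma = moving_average(closes)
prompt = (f"The last close was {closes[-1]:.2f} and the 3-period moving "
          f"average is {ma:.2f}. Briefly assess the short-term trend.")
print(prompt)
```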
Karin Wolok
Developer Relations, Dev Marketing, and Community Programming @ Project Elevate
Tim Spann
Principal Developer Advocate @ Cloudera
https://www.conf42.com/Python_2024_Karin_Wolok_Tim_Spann_nifi__kafka_risingwave_iceberg_llm
2024 Feb AI Meetup NYC GenAI_LLMs_ML_Data Codeless Generative AI PipelinesTimothy Spann
2024 Feb AI Meetup NYC GenAI_LLMs_ML_Data Codeless Generative AI Pipelines
https://www.aicamp.ai/event/eventdetails/W2024022214
apache nifi
llm
generative ai
gen ai
ml
dl
machine learning
apache kafka
apache flink
postgresql
python
AI Meetup (NYC): GenAI, LLMs, ML and Data
Feb 22, 05:30 PM EST
Welcome to the monthly in-person AI meetup in New York City, in collaboration with Microsoft. Join us for deep-dive tech talks on AI, GenAI, LLMs and machine learning, food/drink, and networking with speakers and fellow developers.
Agenda:
* 5:30pm~6:00pm: Checkin, Food/drink and networking
* 6:00pm~6:10pm: Welcome/community update
* 6:10pm~8:30pm: Tech talks
* 8:30pm: Q&A, Open discussion
Tech Talk: Searching and Reasoning Over Multimedia Data with Vector Databases and LMMs
Speaker: Zain Hasan (Weaviate LinkedIn)
Abstract: In this talk, Zain Hasan will discuss how we can use open-source multimodal embedding models in conjunction with large generative multimodal models that can see, hear, read, and feel data(!) to perform cross-modal search (searching audio with images, videos with text, etc.) and multimodal retrieval augmented generation (MM-RAG) at the billion-object scale with the help of open source vector databases. He will also demonstrate, with live code demos, how being able to perform this cross-modal retrieval in real time enables users to use LLMs that can reason over their enterprise multimodal data. This talk will revolve around how we can scale the usage of multimodal embedding and generative models in production.
Tech Talk: Codeless Generative AI Pipelines
Speaker: Timothy Spann (Cloudera LinkedIn)
Abstract: Join us for an insightful talk on leveraging the power of real-time streaming tools, specifically Apache NiFi, to revolutionize GenAI data engineering. In this session, we’ll explore how the integration of Apache NiFi can automate the entire process of prompt building, making it a seamless and efficient task.
Speakers/Topics:
Stay tuned as we are updating speakers and schedules. If you have a keen interest in speaking to our community, we invite you to submit topics for consideration: Submit Topics
Sponsors:
We are actively seeking sponsors to support our community, whether by offering venue space, providing food/drink, or cash sponsorship. Sponsors will have the chance to speak at the meetups, receive prominent recognition, and gain exposure to our extensive membership base of 20,000+ local or 300K+ developers worldwide.
Venue:
Microsoft NYC - Times Square, 11 Times Square, New York, NY 10036
Room Name: Central Park West 6501
Community on Slack/Discord
- Event chat: chat and connect with speakers and attendees
- Sharing blogs, events, job openings, projects collaborations
Join Slack (search and join the #newyork channel) | Join Discord
DBA Fundamentals Group: Continuous SQL with Kafka and FlinkTimothy Spann
DBA Fundamentals Group: Continuous SQL with Kafka and Flink
20-Feb-2024
In this talk, I will walk through how someone can set up and run continuous SQL queries against Kafka topics utilizing Apache Flink. We will walk through creating Kafka topics, schemas, and publishing data.
We will then cover consuming Kafka data, joining Kafka topics, and inserting new events into Kafka topics as they arrive. This basic overview will show hands-on techniques, tips, and examples of how to do this.
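As a broker-free illustration of the join-and-insert pattern the talk covers, here is a toy sketch. The topic names, fields, and data are hypothetical: in the real pipeline the "topics" are Kafka topics and the join is a continuous Flink SQL statement, not a Python loop.

```python
# Toy, broker-free sketch of the continuous-SQL pattern: consume two
# "topics", join them on a key, and emit enriched events to a third.
# In the real pipeline these are Kafka topics and the join is Flink SQL.

orders = [  # events from a hypothetical 'orders' topic
    {"order_id": 1, "user_id": "a", "amount": 50},
    {"order_id": 2, "user_id": "b", "amount": 75},
]
users = [   # events from a hypothetical 'users' topic
    {"user_id": "a", "name": "Alice"},
    {"user_id": "b", "name": "Bob"},
]

def join_streams(orders, users):
    """Equivalent in spirit to the continuous query:
    INSERT INTO enriched_orders
    SELECT o.order_id, u.name, o.amount
    FROM orders o JOIN users u ON o.user_id = u.user_id;"""
    names_by_id = {u["user_id"]: u["name"] for u in users}
    for o in orders:
        yield {"order_id": o["order_id"],
               "name": names_by_id[o["user_id"]],
               "amount": o["amount"]}

enriched = list(join_streams(orders, users))
print(enriched[0])  # {'order_id': 1, 'name': 'Alice', 'amount': 50}
```

The key difference in Flink is that the query never terminates: new events arriving on either topic continuously produce new joined rows in the output topic.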
Tim Spann
Tim Spann is the Principal Developer Advocate for Data in Motion @ Cloudera where he works with Apache Kafka, Apache Flink, Apache NiFi, Apache Iceberg, TensorFlow, Apache Spark, big data, the IoT, machine learning, and deep learning. Tim has over a decade of experience with the IoT, big data, distributed computing, streaming technologies, and Java programming. Previously, he was a Developer Advocate at StreamNative, Principal Field Engineer at Cloudera, a Senior Solutions Architect at AirisData and a senior field engineer at Pivotal. He blogs for DZone, where he is the Big Data Zone leader, and runs a popular meetup in Princeton on big data, the IoT, deep learning, streaming, NiFi, the blockchain, and Spark. Tim is a frequent speaker at conferences such as IoT Fusion, Strata, ApacheCon, Data Works Summit Berlin, DataWorks Summit Sydney, and Oracle Code NYC. He holds a BS and MS in computer science.
Building Real-time Travel Alerts
In this session, we will walk through how to build a complete streaming application to send alerts based on travel advisories from public data. We will also join in other data sources of relevance and push out alerts.
We will show you how to build this streaming application with Apache NiFi, Apache Kafka, and Apache Flink and show you when/why/how, and what to build to maximize performance, productivity, and ease of development.
Let's get streaming.
Apache Flink
Apache Kafka
Apache NiFi
FLaNK Stack
Tim Spann
Big Data Conference Europe 2023
JConWorld_ Continuous SQL with Kafka and FlinkTimothy Spann
JConWorld: Continuous SQL with Kafka and Flink
In this talk, I will walk through how someone can set up and run continuous SQL queries against Kafka topics utilizing Apache Flink. We will walk through creating Kafka topics, schemas, and publishing data.
We will then cover consuming Kafka data, joining Kafka topics, and inserting new events into Kafka topics as they arrive. This basic overview will show hands-on techniques, tips, and examples of how to do this.
Tim Spann is the Principal Developer Advocate for Data in Motion @ Cloudera where he works with Apache Kafka, Apache Flink, Apache NiFi, Apache Iceberg, TensorFlow, Apache Spark, big data, the IoT, machine learning, and deep learning. Tim has over a decade of experience with the IoT, big data, distributed computing, streaming technologies, and Java programming. Previously, he was a Developer Advocate at StreamNative, Principal Field Engineer at Cloudera, a Senior Solutions Architect at AirisData and a senior field engineer at Pivotal. He blogs for DZone, where he is the Big Data Zone leader, and runs a popular meetup in Princeton on big data, the IoT, deep learning, streaming, NiFi, the blockchain, and Spark. Tim is a frequent speaker at conferences such as IoT Fusion, Strata, ApacheCon, Data Works Summit Berlin, DataWorks Summit Sydney, and Oracle Code NYC. He holds a BS and MS in computer science. https://www.datainmotion.dev/p/about-me.html https://dzone.com/users/297029/bunkertor.html
https://www.youtube.com/channel/UCDIDMDfje6jAvNE8DGkJ3_w?view_as=subscriber
[EN]DSS23_tspann_Integrating LLM with Streaming Data PipelinesTimothy Spann
[EN]DSS23_tspann_Integrating LLM with Streaming Data Pipelines
https://dssconf.pl/en/#agenda-section
Integrating LLM with Streaming Data Pipelines
Timothy Spann, Principal Developer Advocate, Cloudera
APACHE NIFI, APACHE FLINK, APACHE KAFKA, LLM, HUGGINGFACE, REST, STREAMING
In this talk and demo I will walk through how to add LLMs to your streaming pipelines by integration through Apache NiFi.
https://github.com/tspannhw/FLaNK-watsonx.ai
Cloudera
streaming
llm
generative ai
slack to slack
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...pchutichetpong
M Capital Group (“MCG”) expects demand to grow and supply to evolve, facilitated through institutional investment rotation out of offices and into work from home (“WFH”), while the need for data storage keeps expanding as global internet usage grows, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as progressing cloud services and edge sites, allowing the industry to see strong expected annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, represented through the recent second bankruptcy filing of Sungard, which blames “COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services”, the industry has seen key adjustments, where MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment will be driving market momentum forward. The continuous injection of capital by alternative investment firms, as well as the growing infrastructural investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x larger by value in 2026, will likely help propel center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
Adjusting OpenMP PageRank : SHORT REPORT / NOTESSubhajit Sahu
For massive graphs that fit in RAM, but not in GPU memory, it is possible to take advantage of a shared memory system with multiple CPUs, each with multiple cores, to accelerate PageRank computation. If the NUMA architecture of the system is properly taken into account with good vertex partitioning, the speedup can be significant. To take steps in this direction, experiments are conducted to implement PageRank in OpenMP using two different approaches, uniform and hybrid. The uniform approach runs all primitives required for PageRank in OpenMP mode (with multiple threads). On the other hand, the hybrid approach runs certain primitives in sequential mode (i.e., sumAt, multiply).
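For reference, the computation being parallelized above can be sketched as a plain power iteration. This toy version (graph, damping factor, and dead-end handling are illustrative choices, not taken from the report) shows the per-iteration rank update that the OpenMP primitives accelerate.

```python
# Minimal PageRank power iteration, illustrating the computation the
# report parallelizes with OpenMP. Dead ends spread their rank uniformly.

def pagerank(links, damping=0.85, iters=50):
    n = len(links)
    rank = {v: 1.0 / n for v in links}          # uniform initial ranks
    for _ in range(iters):
        new = {v: (1.0 - damping) / n for v in links}   # teleport term
        for v, outs in links.items():
            if outs:
                share = damping * rank[v] / len(outs)
                for w in outs:                  # distribute along out-links
                    new[w] += share
            else:                               # dead end: spread uniformly
                for w in links:
                    new[w] += damping * rank[v] / n
        rank = new
    return rank

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}   # tiny example graph
ranks = pagerank(links)
print(ranks)
```

The uniform OpenMP approach parallelizes every primitive in this loop (the rank-distribution and reduction steps), while the hybrid approach keeps small primitives like the element-wise multiply and sum sequential.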
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation of ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. Slowdown on the GPU is likely caused by a large submission of small workloads, and is expected to be a non-issue when the computation is performed on massive graphs.
Adjusting primitives for graph : SHORT REPORT / NOTESSubhajit Sahu
Graph algorithms, like PageRank, can use Compressed Sparse Row (CSR), an adjacency-list based graph representation.
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...John Andrews
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Tracking crime as it occurs with apache phoenix, apache hbase and apache nifi
1. Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
HENRY SOWELL
Technical Director
Cloudera Government Solutions
TIMOTHY SPANN
Field Engineer, Data in Motion
Cloudera