This document provides instructions for completing a use case walkthrough as part of the Best in Flow competition. It includes steps to read and filter a stream of syslog data from Kafka, define a filter rule to identify critical events, and write the filtered events to another Kafka topic. The steps customize an existing Kafka-to-Kafka ReadyFlow, modify parameters, start services, verify the data and the filter, and finally publish and deploy the flow to production with auto-scaling enabled.
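To make the filter step concrete, here is a minimal Python sketch of the same idea outside the ReadyFlow, using kafka-python; the topic names, broker address, and the `severity` field are assumptions rather than values from the walkthrough.

```python
# Minimal sketch: read syslog JSON from one Kafka topic, keep only
# critical-severity events, and write them to another topic.
import json
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    'syslog_raw',                      # assumed source topic name
    bootstrap_servers='localhost:9092',
    value_deserializer=lambda v: json.loads(v.decode('utf-8')),
)
producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    value_serializer=lambda v: json.dumps(v).encode('utf-8'),
)

for record in consumer:
    event = record.value
    # Syslog severities 0-2 cover emergency, alert, and critical.
    if int(event.get('severity', 7)) <= 2:
        producer.send('syslog_critical', event)   # assumed target topic name
```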
This document provides guidance on using cloud tools to build data assets like Kafka topics, schemas, and Iceberg tables. It describes how to create each type of data asset step-by-step within the CDP Sandbox environment. The document also lists streaming data topics and examples that are available for use in applications. It suggests options for integrating external public data sources or data simulators. References are provided for additional documentation and code examples.
This document provides instructions for several StreamSets Academy labs:
1. The "Lab: Set Up a Deployment" lab guides the user to create a deployment in StreamSets Cloud, generate an install script, and register execution engines to the deployment from their lab environment.
2. The "Lab: First Pipeline to Test Deployment" lab has the user build a simple pipeline with a dev data generator origin and trash destination to test their new deployment.
3. The "Lab: Build a Pipeline" lab modifies the first pipeline to connect to real data from the Zomato dataset using a directory origin and adds a stream selector processor and local FS destination.
4. The "Lab: Run a
New Flash Builder 4 WSDL and HTTP Connectors (rtretola)
This document provides instructions for setting up a Java SDK and Tomcat server on Windows and Mac OS X systems in order to run a Flash Builder project. It describes downloading and configuring a Java SDK by setting the JAVA_HOME environment variable. It then explains how to navigate to the Tomcat directory in the command line and start the server using specific commands for Windows and Mac. The document tests that the server is running properly by accessing certain URLs and describes how a crossdomain.xml policy file works to allow access to remote data services. It concludes by outlining the initial steps to create a new Flash Builder project and connect to REST data from an XML service using MXML and ActionScript.
Cast Iron Cloud Integration Best Practices (Sarath Ambadas)
This document provides best practices for developing and managing WebSphere Cast Iron integrations. It discusses naming conventions, error handling, orchestration development, appliance configuration, performance tuning, and upgrade processes. Development best practices include splitting large orchestrations, using configuration properties, and testing before deploying. Appliance best practices involve monitoring resources and purging logs. Performance can be improved by configuring connection pooling, batch processing, and tuning job concurrency. Upgrades involve backing up repositories and deploying existing projects to new versions.
- The document provides instructions for a hands-on lab to demonstrate big data concepts in Azure.
- It includes steps to create an Azure storage account and load sample data, set up an HDInsight Hadoop cluster, ingest data using Stream Analytics, and visualize data with Power BI.
- The labs will have participants create various Azure services, load and query data, and gain an understanding of how to use common pieces of a big data and analytics solution in Azure.
Openfire xmpp server on windows server 2012 r2 with spark sso (laonap166)
1. The document provides step-by-step instructions for configuring single sign-on between an Openfire XMPP server and Spark client on Windows Server 2012 R2 using Kerberos authentication. It describes setting up Active Directory, installing and configuring Openfire and Spark, and modifying registry settings to enable Kerberos ticket sharing. The configuration involves creating service principals, a keytab file, GSSAPI and Kerberos configuration files, and enabling SASL in Openfire. Testing is done on virtual machines for a domain controller, Openfire server, and Spark client.
You are tasked with gaining privileged access to a Windows 2008 server through a capture-the-flag event. You first use Metasploit to exploit vulnerable MS SQL services to get an unprivileged shell. Then, you use the exploit suggester module to find exploits for privilege escalation, using the ms16_014_wmi_recv_notif exploit to achieve a privileged shell. Finally, you perform an action like deleting important files to cause an information security breach on the target system.
This document provides instructions for deploying a simple LAMP stack application using Cloud Application Manager. It defines the database and app tiers separately, connecting them with a binding. The database tier is an Amazon RDS MySQL instance. The app tier installs Apache, PHP and connects to the database using the binding. It takes under 30 minutes to complete the deployment.
How to Perform Test Automation With Gauge & Selenium Framework (Sarah Elson)
Gauge is a free, open source test automation framework released by ThoughtWorks, the creators of Selenium. Test automation with the Gauge framework lets you create readable and maintainable tests in the language of your choice. It suits users who want to integrate continuous testing into their CI/CD (Continuous Integration and Continuous Delivery) process to support faster release cycles, and it is gaining popularity as a test automation framework for cross-browser testing.
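As a rough illustration of how Gauge step implementations can drive Selenium, here is a hedged Python sketch using the getgauge library; the step text, URL, and page structure are invented for the example, and the Gauge spec file that calls this step is not shown.

```python
# Hypothetical Gauge step implementation driving Selenium (sketch only).
from getgauge.python import step
from selenium import webdriver
from selenium.webdriver.common.by import By


@step("Search the docs for <query>")
def search_docs(query):
    driver = webdriver.Chrome()          # assumes a local Chrome/driver setup
    try:
        driver.get("https://example.org/docs")   # placeholder URL
        box = driver.find_element(By.NAME, "q")  # placeholder element name
        box.send_keys(query)
        box.submit()
        assert query.lower() in driver.page_source.lower()
    finally:
        driver.quit()
```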
Wamp & LAMP - Installation and Configuration (Chetan Soni)
This document provides instructions for installing and configuring WAMP (Windows, Apache, MySQL, PHP) and LAMP (Linux, Apache, MySQL, PHP) servers on Windows and Linux respectively. For the WAMP installation, it describes downloading and installing Apache, PHP, MySQL, and configuring them to work together. It then tests the installation with sample PHP files. For the LAMP installation, it describes initial steps like installing gcc and logging in as root before explaining how to install Apache, PHP and MySQL from source code.
How to Transfer Magento Project from One Server to another Server (Kaushal Mewar)
This document provides step-by-step instructions for migrating a Magento project from one server to another server. It describes how to take backups of the database and files, upload them to the new server, extract the files, import the database, and configure settings like the database connection details and domain names. The process involves backing up the database using PHPMyAdmin or command line tools, compressing the files using the file manager or command line, importing the database and extracting files on the new server, editing configuration files, clearing caches, and updating domain name server settings.
Here are the key points covered in the essay:
- Exercise 15.1 involves creating a custom backup job in Windows 7 to back up selected files and folders to a hard disk partition.
- The C: system drive does not appear as a backup destination because you cannot back up a drive to itself.
- A warning appears when selecting the X: drive for backup because although it appears as a separate drive letter, it is physically located on the same hard disk as the system drive C:. Backing up to this location would not provide the benefits of an off-site backup if the hard disk failed.
- When selecting folders and files for backup, you must ensure the selected items are not part of an operating system
This series of tutorials walks through getting started in the ConnectSMART creator, troubleshooting common issues, and downloading and installing maintenance release upgrades.
The document discusses PowerCenter 9.x upgrade strategies presented by Softpath at the Atlanta User Group. It introduces the presenters and provides an overview of Softpath. Various upgrade approaches - such as zero downtime, parallel, cloned, and in-place upgrades - are presented along with their benefits, risks, and time requirements. The stages of an upgrade including planning, preparation work, installation, testing, and production implementation are also outlined.
Google Hacking Lab ClassNameDate This is an introducti.docx (whittemorelucilla)
This document provides instructions for setting up a virtual lab environment using Kali Linux and Metasploitable VMs to demonstrate penetration testing techniques. It describes how to download, install, and configure the necessary virtual machines and tools. The document then guides the user through launching Metasploit and exploring its modules, conducting searches and gathering information about exploits and payloads. It also includes steps for using specific exploits against the Metasploitable VM to deliver a reverse shell payload and obtain a foothold on the target system.
Basic commands for powershell : Configuring Windows PowerShell and working wi... (Hitesh Mohapatra)
This document provides an overview of common PowerShell commands for automating tasks and managing configurations in Windows. It discusses commands for configuring the PowerShell console and ISE application, finding available commands, getting help, and viewing services, events, and processes. The document also covers using the history, setting execution policies, filtering output, and managing aliases, modules, drives and sessions. Specific commands demonstrated include Get-Command, Get-Help, Get-Service, Get-EventLog, Get-Process, Clear-History, Set-ExecutionPolicy, Select-Object, and more.
This document describes Lab 2 of a WebSphere Cast Iron lab session which focuses on connecting to a database. The lab will create a project with 3 orchestrations - one to insert a record into a database table via an HTTP request, another to query the database and return all records. It introduces concepts like database endpoints and configuration properties. Students will create variables to store parts of an HTTP request string, and use substring functions to parse the string and map values to variables for database insertion.
1) This document provides 30 tips for using XPages in 60 minutes.
2) The tips cover general programming, debugging, user interface design, using XPages in the Notes client, and working with Dojo.
3) Example tips include using scoped variables to store data, calling Java classes from XPages, turning on debugging to view error messages, using themes for global configuration, and enabling Dojo parseOnLoad to initialize widgets.
Pharo is a modern and powerful Smalltalk environment that is open source, supports many platforms, and actively adds new features. Version 5.0 includes performance improvements from the new Spur VM, as well as new debugging tools and a unified FFI. An example web application built with Teapot and PunQLite demonstrates how easily full-stack web applications can now be developed in Pharo.
This document provides instructions for setting up a simple registration example using Struts. It involves:
1. Modifying the struts-config.xml file to map the URL /actions/register1.do to the RegisterAction1 class.
2. Adding a <forward> element to struts-config.xml to specify that the result1.jsp page should be displayed when RegisterAction1 returns "success".
3. Creating the RegisterAction1 class to handle requests to /actions/register1.do. When executed, it will always return "success".
The end result is that accessing /actions/register1.do via a web browser will invoke the RegisterAction1 class and
This document discusses how to set up an Apache ActiveMQ master-slave topology using a shared broker data directory. It provides instructions for installing ActiveMQ on two machines, configuring the brokers, and starting them in a way that the slave will take over if the master fails. The master and slave brokers are configured to share a data directory so they can maintain the same state across the cluster.
Here are the steps to configure the target in Scribe Workbench:
1. Click Configure Target. The Configure Target dialog box appears.
2. Click the Connections drop-down and select Scribe Sample. This is the target database.
3. Expand Tables and select Accounts. This will be the target table.
4. Click OK. The target fields from the Accounts table will display in the target pane.
5. The target fields may be in a different order than the source fields. You can rearrange the order of the target fields by dragging and dropping them.
6. Notice that the Ref column shows the reference number for each target field, similar to the source fields. These reference numbers
TechDays 2010 Portugal - Scaling your data tier with app fabric 16x9 (Nuno Godinho)
This document discusses using Windows Server AppFabric caching to scale data layers. AppFabric caching provides a distributed, in-memory cache that can span machines and processes. It addresses issues like limited cache memory on individual servers. The document outlines how AppFabric caching works, how to install and configure it, and how to access the cache through the API. It also describes features like data distribution, eviction policies, and change notifications that allow the cache to efficiently scale to large workloads and data sets.
The document provides best practices and recommendations for developing data flows with Cloudera DataFlow (CDF). It discusses topics such as flow development best practices, container-based data flow deployment options in CDF, and interactive development using test sessions. Common errors and resources for additional documentation are also listed.
1. The document describes the steps to create and manage database users to support security. It involves creating users "BOB" and "JACK", granting them privileges to access databases, and testing their access.
2. The procedure walks through creating output log files to capture each MySQL session, then creating users, assigning passwords, granting privileges on databases and tables, and testing access levels.
3. The conclusions depend on testing access as each user; the procedure demonstrates how to restrict access through user accounts and privileges while still allowing necessary data access (a minimal Python sketch of the user-creation and grant steps follows below).
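A rough Python equivalent of the user-creation and grant steps, using mysql-connector-python; the host, passwords, and sample database name are placeholders rather than the lab's actual values.

```python
# Sketch of the lab's CREATE USER / GRANT steps via mysql-connector-python.
import mysql.connector

cnx = mysql.connector.connect(host='localhost', user='root', password='root_password')
cur = cnx.cursor()

for username in ('BOB', 'JACK'):
    # Create each lab user with a placeholder password.
    cur.execute(
        f"CREATE USER IF NOT EXISTS '{username}'@'localhost' IDENTIFIED BY 'ChangeMe!1'"
    )
    # Grant read-only access to one sample database; adjust as the lab requires.
    cur.execute(f"GRANT SELECT ON sampledb.* TO '{username}'@'localhost'")

cur.execute("FLUSH PRIVILEGES")
cnx.close()
```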
This document provides information about installing and configuring Linux, the Apache web server, the PostgreSQL database, and Apache Tomcat on a Linux system. It discusses installing Ubuntu using VirtualBox, creating users and groups, setting file permissions, and important Linux files and directories. It also covers configuring the Apache server and Tomcat, installing and configuring PostgreSQL, and some self-study questions about the Linux boot process, run levels, finding the kernel version, and learning about NIS, NFS, and RPM package management.
Codeless Generative AI Pipelines
(GenAI with Milvus)
https://ml.dssconf.pl/user.html#!/lecture/DSSML24-041a/rate
Discover the potential of real-time streaming in the context of GenAI as we delve into the intricacies of Apache NiFi and its capabilities. Learn how this tool can significantly simplify the data engineering workflow for GenAI applications, allowing you to focus on the creative aspects rather than the technical complexities. I will guide you through practical examples and use cases, showing the impact of automation on prompt building. From data ingestion to transformation and delivery, witness how Apache NiFi streamlines the entire pipeline, ensuring a smooth and hassle-free experience.
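For a feel of the Milvus side of such a pipeline, here is a small pymilvus sketch; the collection name, vector dimension, and dummy vectors are assumptions, and in the real flow NiFi would supply embeddings produced by a model.

```python
# Small pymilvus sketch of the vector-store step a NiFi pipeline could feed.
from pymilvus import MilvusClient

client = MilvusClient(uri="http://localhost:19530")
client.create_collection(collection_name="chat_chunks", dimension=384)

# In a real flow the vectors would come from an embedding model; dummies here.
client.insert(
    collection_name="chat_chunks",
    data=[{"id": 1, "vector": [0.0] * 384, "text": "hello from the stream"}],
)

hits = client.search(
    collection_name="chat_chunks",
    data=[[0.0] * 384],
    limit=3,
    output_fields=["text"],
)
print(hits)
```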
Timothy Spann
https://www.youtube.com/@FLaNK-Stack
https://medium.com/@tspann
https://www.datainmotion.dev/
milvus, unstructured data, vector database, zilliz, cloud, vectors, python, deep learning, generative ai, genai, nifi, kafka, flink, streaming, iot, edge
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, ai, big data, real-time, robots and Milvus.
A lively discussion with NJ Gen AI Meetup Lead, Prasad, and Procure.FYI's Co-Founder
Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will come present about related topics such as vector databases, LLMs, and managing data at scale. The intended audience of this group includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs.This meetup was formerly Milvus Meetup, and is sponsored by Zilliz maintainers of Milvus.
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK (Timothy Spann)
Building Real-Time Pipelines With FLaNK
Timothy Spann, Principal Developer Advocate, Streaming - Cloudera Future of Data meetup, startup grind, AI Camp
The combination of Apache Flink, Apache NiFi, and Apache Kafka for building real-time data processing pipelines is extremely powerful, as demonstrated by this case study using the FLaNK-MTA project. The project leverages these technologies to process and analyze real-time data from the New York City Metropolitan Transportation Authority (MTA). FLaNK-MTA demonstrates how to efficiently collect, transform, and analyze high-volume data streams, enabling timely insights and decision-making.
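A stripped-down Python sketch of the ingest step in a FLaNK-style pipeline might look like the following; the feed URL, topic name, and response shape are placeholders, not the actual MTA API, and NiFi would normally handle scheduling and retries.

```python
# Sketch: poll a transit REST endpoint and publish each record to Kafka.
import json
import time

import requests
from kafka import KafkaProducer

FEED_URL = "https://example.org/mta/feed.json"   # placeholder endpoint
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

while True:
    response = requests.get(FEED_URL, timeout=10)
    response.raise_for_status()
    # "entity" is an assumed key in the placeholder response shape.
    for record in response.json().get("entity", []):
        producer.send("mta_events", record)       # assumed topic name
    time.sleep(30)
```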
Apache NiFi
Apache Kafka
Apache Flink
Apache Iceberg
LLM
Generative AI
Slack
Postgresql
Generative AI on Enterprise Cloud with NiFi and Milvus (Timothy Spann)
Gen AI on Enterprise Cloud
Apache NiFi
Milvus
Apache Kafka
Apache Flink
Cloudera Machine Learning
Cloudera DataFlow
https://medium.com/@tspann/building-a-milvus-connector-for-nifi-34372cb3c7fa
https://www.meetup.com/futureofdata-princeton/events/300737266/
https://lu.ma/q7pcfyjn?source=post_page-----34372cb3c7fa--------------------------------&tk=TTyakY
If you're interested in working with Generative AI on the cloud, this virtual workshop is for you.
Tim Spann from Cloudera and Yujian Tang from Zilliz will cover how you can implement your own GenAI workflows on the cloud at enterprise scale.
9:00 - 9:05: Intro
9:05 - 9:15: What is Milvus
9:15 - 9:25: Cloudera Development Platform
9:25 - 10:00: Demo
Location
https://www.youtube.com/watch?v=IfWIzKsoHnA
https://github.com/tspannhw/SpeakerProfile
https://www.linkedin.com/in/yujiantang/
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines (Timothy Spann)
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
https://www.youtube.com/watch?v=Yeua8NlzQ3Y
https://www.conf42.com/Large_Language_Models_LLMs_2024_Tim_Spann_generative_ai_streaming
Adding Generative AI to Real-Time Streaming Pipelines
Abstract
Let’s build streaming pipelines that convert streaming events into prompts, call LLMs, and process the results.
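As a bare-bones illustration of that pattern, the sketch below consumes events from Kafka, turns each one into a prompt, calls an LLM over HTTP, and publishes the answer; the model endpoint and topic names are placeholders for whatever services you run.

```python
# Sketch: streaming events -> prompt -> LLM call -> result back to Kafka.
import json

import requests
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "chat_events",                         # assumed input topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

LLM_URL = "http://localhost:8000/generate"  # placeholder model-serving endpoint

for record in consumer:
    prompt = f"Answer the user question: {record.value.get('text', '')}"
    reply = requests.post(LLM_URL, json={"prompt": prompt}, timeout=60).json()
    producer.send("chat_replies", {"prompt": prompt, "answer": reply.get("text")})
```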
Summary
Tim Spann: My talk is adding generative AI to real-time streaming pipelines. I'm going to discuss a couple of different open source technologies. We'll touch on Kafka, NiFi, Flink, Python, and Iceberg. All the slides and all the code on GitHub are out there.
LLMs, if you didn't know, are rapidly evolving. There are a lot of different ways to interact with models. That enrichment, transformation, and processing really needs tools. The number of models, projects, and software packages available is massive.
NiFi supports hundreds of different inputs and can convert them on the fly. It's a great way to distribute your data quickly to whoever needs it, without duplication and without tight coupling. It's fun to find new things to integrate into.
So what we can do is, well, I want to get a meetup chat going. I have a processor here that just listens for events as they come in from Slack. Then I'm going to clean them up, add a couple of fields, and push that out to Slack. Every model needs a little bit of different tweaking.
NiFi can act as a whole website, and as you see here, it can handle GET, POST, PUT, whatever you want. We send that response back to Flink and it shows up here. Thank you for attending this talk. I'm going to be speaking at some other events very shortly.
Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi, Tim Spann here. My talk is adding generative AI to real time streaming pipelines, and we're here for the large language model conference at Comp 42, which is always a nice one, great place to be. I'm going to discuss a couple of different open source technologies that work together to enable you to build real time pipelines using large language models. So we'll touch on Kafka, Nifi, Flink, Python, Iceberg, and I'll show you a little bit of each one in the demos. I've been working with data machine learning, streaming IoT, some other things for a number of years, and you could contact me at any of these places, whether Twitter or whatever it's called, some different blogs, or in person at my meetups and at different conferences around the world. I do a weekly newsletter, cover streaming ML, a lot of LLM, open source, Python, Java, all kinds of fun stuff, as I mentioned, do a bunch of different meetups. They are not just in the east coast of the US, they are available virtually live, and I also put them on YouTube, and if you need them somewhere else, let me know. We publish all the slides, all the code and GitHub. Everything you need is out there. Let's get into the talk. Llm, if you didn't know, is rapidly evolving. While you're typing down the things that you use, it
2024 XTREMEJ_ Building Real-time Pipelines with FLaNK_ A Case Study with Tra... (Timothy Spann)
2024 XTREMEJ_ Building Real-time Pipelines with FLaNK_ A Case Study with Transit Data
https://xtremej.dev/2023/schedule/
Building Real-time Pipelines with FLaNK: A Case Study with Transit Data
Overview of the problem, the application (code walkthru and running), overview of FLaNK, introduction to NiFi, introduction to Kafka, and introduction to Flink.
28March2024-Codeless-Generative-AI-Pipelines
https://www.meetup.com/futureofdata-princeton/events/299440871/
https://www.meetup.com/real-time-analytics-meetup-ny/events/299290822/
******Note*****
The event is seat-limited, therefore please complete your registration here. Only people completing the form will be able to attend.
-----------------------
We're excited to invite you to join us in person for a Real-Time Analytics exploration!
Join us for an evening of insights and networking as we delve into the OSS technologies shaping the field!
Agenda:
05:30-06:00: Pizza and friends
06:00- 06:40: Codeless GenAI Pipelines with Flink, Kafka, NiFi
06:40- 07:20 Real-Time Analytics in the Corporate World: How Apache Pinot® Powers Industry Leaders
07:20-07:30 QNA
Codeless GenAI Pipelines with Flink, Kafka, NiFi | Tim Spann, Cloudera
Explore the power of real-time streaming with GenAI using Apache NiFi. Learn how NiFi simplifies data engineering workflows, allowing you to focus on creativity over technical complexities. I'll guide you through practical examples, showcasing NiFi's automation impact from ingestion to delivery. Whether you're a seasoned data engineer or new to GenAI, this talk offers valuable insights into optimizing workflows. Join us to unlock the potential of real-time streaming and witness how NiFi makes data engineering a breeze for GenAI applications!
Real-Time Analytics in the Corporate World: How Apache Pinot® Powers Industry Leaders | Viktor Gamov, StarTree
Explore how industry leaders like LinkedIn, Uber Eats, and Stripe are mastering real-time data with Viktor as your guide. Discover how Apache Pinot transforms data into actionable insights instantly. Viktor will showcase Pinot's features, including the Star-Tree Index, and explain why it's a game-changer in data strategy. This session is for everyone, from data geeks to business gurus, eager to uncover the future of tech. Join us and be wowed by the power of real-time analytics with Apache Pinot!
-------
Tim Spann is a Principal Developer Advocate in Data In Motion for Cloudera.
He works with Apache NiFi, Apache Kafka, Apache Pulsar, Apache Flink, Flink SQL, Apache Pinot, Trino, Apache Iceberg, DeltaLake, Apache Spark, Big Data, IoT, Cloud, AI/DL, machine learning, and deep learning. Tim has over ten years of experience with the IoT, big data, distributed computing, messaging, streaming technologies, and Java programming. Previously, he was a Developer Advocate at StreamNative, Principal DataFlow Field Engineer at Cloudera, a Senior Solutions Engineer at Hortonworks, a Senior Solutions Architect at AirisData, a Senior Field Engineer at Pivotal and a Team Leader at HPE. He blogs for DZone, where he is the Big Data Zone leader, and runs a popular meetup in Princeton & NYC on Big Data, Cloud, IoT, deep learning, streaming, NiFi, the blockchain, and Spark. Tim is a frequent speaker at conferences such as ApacheCon, DeveloperWeek, Pulsar Summit and many more.
TCFPro24 Building Real-Time Generative AI Pipelines (Timothy Spann)
https://princetonacm.acm.org/tcfpro/
18th Annual IEEE IT Professional Conference (ITPC)
Armstrong Hall at The College of New Jersey
Friday, March 15th, 2024 | 10:00 AM to 5:00 PM
IT Professional Conference at Trenton Computer Festival
IEEE Information Technology Professional Conference on Friday, March 15th, 2024
TCFPro24 Building Real-Time Generative AI Pipelines
Building Real-Time Generative AI Pipelines
In this talk, Tim will delve into the exciting realm of building real-time generative AI pipelines with streaming capabilities. The discussion will revolve around the integration of cutting-edge technologies to create dynamic and responsive systems that harness the power of generative algorithms.
From leveraging streaming data sources to implementing advanced machine learning models, the presentation will explore the key components necessary for constructing a robust real-time generative AI pipeline. Practical insights, use cases, and best practices will be shared, offering a comprehensive guide for developers and data scientists aspiring to design and implement dynamic AI systems in a streaming environment.
Tim will show a live demo showing we can use Apache NiFi to provide a live chat between a person in Slack and several LLM models all orchestrated with Apache NiFi, Apache Kafka and Python. We will use RAG against Chroma and Pinecone vector data stores, Hugging Face and WatsonX.AI LLM, and add additional context with NiFi lookups of stocks, weather and other data streams in real-time.
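To show what the Chroma lookup in such a demo might look like in isolation, here is a tiny chromadb sketch; the collection name, documents, and query are invented for illustration.

```python
# Sketch of a RAG-style lookup against Chroma before prompting an LLM.
import chromadb

client = chromadb.Client()
collection = client.create_collection("meetup_context")   # assumed collection name

collection.add(
    ids=["doc1", "doc2"],
    documents=[
        "NiFi routes Slack questions to several LLM models.",
        "Kafka carries the prompts and responses between services.",
    ],
)

# Retrieve the most relevant context to prepend to the LLM prompt.
results = collection.query(query_texts=["How do prompts reach the model?"], n_results=1)
print(results["documents"])
```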
Timothy Spann
Tim Spann is a Principal Developer Advocate in Data In Motion for Cloudera. He works with Apache NiFi, Apache Pulsar, Apache Kafka, Apache Flink, Flink SQL, Apache Pinot, Trino, Apache Iceberg, DeltaLake, Apache Spark, Big Data, IoT, Cloud, AI/DL, machine learning, and deep learning. Tim has over ten years of experience with the IoT, big data, distributed computing, messaging, streaming technologies, and Java programming.
Previously, he was a Developer Advocate at StreamNative, Principal DataFlow Field Engineer at Cloudera, a Senior Solutions Engineer at Hortonworks, a Senior Solutions Architect at AirisData, a Senior Field Engineer at Pivotal and a Team Leader at HPE. He blogs for DZone, where he is the Big Data Zone leader, and runs a popular meetup in Princeton & NYC on Big Data, Cloud, IoT, deep learning, streaming, NiFi, the blockchain, and Spark.
Tim is a frequent speaker at conferences such as ApacheCon, DeveloperWeek, Pulsar Summit and many more. He holds a BS and MS in computer science.
2024 February 28 - NYC - Meetup Unlocking Financial Data with Real-Time Pipel... (Timothy Spann)
2024 February 28 - NYC - Meetup Unlocking Financial Data with Real-Time Pipelines
https://www.meetup.com/futureofdata-newyork/events/298660453/
Unlocking Financial Data with Real-Time Pipelines
(Flink Analytics on Stocks with SQL )
By Timothy Spann
Financial institutions thrive on accurate and timely data to drive critical decision-making processes, risk assessments, and regulatory compliance. However, managing and processing vast amounts of financial data in real-time can be a daunting task. To overcome this challenge, modern data engineering solutions have emerged, combining powerful technologies like Apache Flink, Apache NiFi, Apache Kafka, and Iceberg to create efficient and reliable real-time data pipelines. In this talk, we will explore how this technology stack can unlock the full potential of financial data, enabling organizations to make data-driven decisions swiftly and with confidence.
Introduction: Financial institutions operate in a fast-paced environment where real-time access to accurate and reliable data is crucial. Traditional batch processing falls short when it comes to handling rapidly changing financial markets and responding to customer demands promptly. In this talk, we will delve into the power of real-time data pipelines, utilizing the strengths of Apache Flink, Apache NiFi, Apache Kafka, and Iceberg, to unlock the potential of financial data. I will be utilizing NiFi 2.0 with Python and Vector Databases.
Timothy Spann
Principal Developer Advocate, Cloudera
Tim Spann is a Principal Developer Advocate in Data In Motion for Cloudera. He works with Apache NiFi, Apache Kafka, Apache Pulsar, Apache Flink, Flink SQL, Apache Pinot, Trino, Apache Iceberg, DeltaLake, Apache Spark, Big Data, IoT, Cloud, AI/DL, machine learning, and deep learning. Tim has over ten years of experience with the IoT, big data, distributed computing, messaging, streaming technologies, and Java programming. Previously, he was a Developer Advocate at StreamNative, Principal DataFlow Field Engineer at Cloudera, a Senior Solutions Engineer at Hortonworks, a Senior Solutions Architect at AirisData, a Senior Field Engineer at Pivotal and a Team Leader at HPE. He blogs for DZone, where he is the Big Data Zone leader, and runs a popular meetup in Princeton & NYC on Big Data, Cloud, IoT, deep learning, streaming, NiFi, the blockchain, and Spark. Tim is a frequent speaker at conferences such as ApacheCon, DeveloperWeek, Pulsar Summit and many more. He holds a BS and MS in computer science.
https://twitter.com/PaaSDev
https://www.linkedin.com/in/timothyspann/
https://medium.com/@tspann
https://github.com/tspannhw/FLiPStackWeekly/
Conf42-Python-Building Apache NiFi 2.0 Python Processors
https://www.conf42.com/Python_2024_Tim_Spann_apache_nifi_2_processors
Building Apache NiFi 2.0 Python Processors
Abstract
Let’s enhance real-time streaming pipelines with smart Python code. Adding code for vector databases and LLM.
Summary
Tim Spann: I'm going to be talking today about building Apache NiFi 2.0 Python processors. One of the main purposes of supporting Python in the streaming tool Apache NiFi is to interface with new machine learning, AI, and Gen AI libraries. He says Python is a real game changer for Cloudera.
You're just going to add some metadata around it. It's a great way to pass a file along without changing it too substantially. We really need you to have Python 3.10 and, again, JDK 21 on your machine. You've got to be smart about how you use these models.
There are a ton of Python processors available. You can use them in multiple ways. We're still in the early days of Python processors, so now's the time to start putting yours out there. I'd love to see a lot of people write their own.
When we are parsing documents here, again, this is the Python one; I'm picking PDF. There are lots of different things you could do. If you're interested in writing your own Python code for Apache NiFi, definitely reach out. Thanks.
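Based on the published NiFi 2.x Python extension examples, a minimal FlowFileTransform processor looks roughly like the sketch below; exact module paths, class attributes, and lifecycle details can vary by release, so treat this as an illustrative sketch rather than the definitive API.

```python
# Illustrative NiFi 2.x Python processor sketch (details may differ by release).
from nifiapi.flowfiletransform import FlowFileTransform, FlowFileTransformResult


class UppercaseText(FlowFileTransform):
    """Uppercases the text content of each incoming FlowFile."""

    class Java:
        implements = ['org.apache.nifi.python.processor.FlowFileTransform']

    class ProcessorDetails:
        version = '2.0.0'
        description = 'Uppercases the text content of each FlowFile.'

    def __init__(self, **kwargs):
        super().__init__()

    def transform(self, context, flowfile):
        # Read the FlowFile content, transform it, and route to 'success'.
        text = flowfile.getContentsAsBytes().decode('utf-8')
        return FlowFileTransformResult(
            relationship='success',
            contents=text.upper(),
            attributes={'transformed': 'true'},
        )
```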
Conf42Python - Using Apache NiFi, Apache Kafka, RisingWave, and Apache Iceberg with Stock Data and LLM
Abstract
In this talk, we’ll discuss how to use Apache NiFi, Apache Kafka, RisingWave, and Apache Iceberg to process and analyze stock data. We demonstrated the ingestion, processing, and analysis of stock data. Additionally, we illustrated how to use an LLM to generate predictions from the analyzed data.
Karin Wolok
Developer Relations, Dev Marketing, and Community Programming @ Project Elevate
Tim Spann
Principal Developer Advocate @ Cloudera
https://www.conf42.com/Python_2024_Karin_Wolok_Tim_Spann_nifi__kafka_risingwave_iceberg_llm
2024 Feb AI Meetup NYC GenAI_LLMs_ML_Data Codeless Generative AI Pipelines (Timothy Spann)
2024 Feb AI Meetup NYC GenAI_LLMs_ML_Data Codeless Generative AI Pipelines
https://www.aicamp.ai/event/eventdetails/W2024022214
apache nifi
llm
generative ai
gen ai
ml
dl
machine learning
apache kafka
apache flink
postgresql
python
AI Meetup (NYC): GenAI, LLMs, ML and Data
Feb 22, 05:30 PM EST
Welcome to the monthly in-person AI meetup in New York City, in collaboration with Microsoft. Join us for deep dive tech talks on AI, GenAI, LLMs and machine learning, food/drink, networking with speakers and fellow developers
Agenda:
* 5:30pm~6:00pm: Checkin, Food/drink and networking
* 6:00pm~6:10pm: Welcome/community update
* 6:10pm~8:30pm: Tech talks
* 8:30pm: Q&A, Open discussion
Tech Talk: Searching and Reasoning Over Multimedia Data with Vector Databases and LMMs
Speaker: Zain Hasan (Weaviate LinkedIn)
Abstract: In this talk, Zain Hasan will discuss how we can use open-source multimodal embedding models in conjunction with large generative multimodal models that can see, hear, read, and feel data(!) to perform cross-modal search (searching audio with images, videos with text, etc.) and multimodal retrieval augmented generation (MM-RAG) at the billion-object scale with the help of open source vector databases. I will also demonstrate, with live code demos, how being able to perform this cross-modal retrieval in real time enables users to use LLMs that can reason over their enterprise multimodal data. This talk will revolve around how we can scale the usage of multimodal embedding and generative models in production.
Tech Talk: Codeless Generative AI Pipelines
Speaker: Timothy Spann (Cloudera LinkedIn)
Abstract: Join us for an insightful talk on leveraging the power of real-time streaming tools, specifically Apache NiFi, to revolutionize GenAI data engineering. In this session, we’ll explore how the integration of Apache NiFi can automate the entire process of prompt building, making it a seamless and efficient task.
Speakers/Topics:
Stay tuned as we are updating speakers and schedules. If you have a keen interest in speaking to our community, we invite you to submit topics for consideration: Submit Topics
Sponsors:
We are actively seeking sponsors to support our community, whether by offering venue space, providing food and drink, or contributing cash sponsorship. Sponsors will have the chance to speak at the meetups, receive prominent recognition, and gain exposure to our extensive membership base of 20,000+ local or 300K+ developers worldwide.
Venue:
Microsoft NYC - Times Square, 11 Times Square, New York, NY 10036
Room Name: Central Park West 6501
Community on Slack/Discord
- Event chat: chat and connect with speakers and attendees
- Sharing blogs, events, job openings, projects collaborations
Join Slack (search and join the #newyork channel) | Join Discord
DBA Fundamentals Group: Continuous SQL with Kafka and Flink (Timothy Spann)
DBA Fundamentals Group: Continuous SQL with Kafka and Flink
20-Feb-2024
In this talk, I will walk through how someone can set up and run continuous SQL queries against Kafka topics utilizing Apache Flink. We will walk through creating Kafka topics, schemas, and publishing data.
We will then cover consuming Kafka data, joining Kafka topics, and inserting new events into Kafka topics as they arrive. This basic overview will show hands-on techniques, tips, and examples of how to do this.
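A minimal PyFlink sketch of a continuous query over a Kafka topic is shown below; the table schema, topic, and broker address are placeholders, and running it requires the Flink SQL Kafka connector jar on the job's classpath.

```python
# Sketch: continuous SQL over a Kafka topic with the PyFlink Table API.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Register a Kafka-backed source table (topic/broker names are placeholders).
t_env.execute_sql("""
    CREATE TABLE stock_events (
        symbol STRING,
        price DOUBLE,
        event_time TIMESTAMP(3)
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'stocks',
        'properties.bootstrap.servers' = 'localhost:9092',
        'properties.group.id' = 'flink-demo',
        'format' = 'json',
        'scan.startup.mode' = 'earliest-offset'
    )
""")

# Continuous query: average price per symbol, updated as new events arrive.
result = t_env.sql_query(
    "SELECT symbol, AVG(price) AS avg_price FROM stock_events GROUP BY symbol"
)
result.execute().print()
```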
Tim Spann
Tim Spann is the Principal Developer Advocate for Data in Motion @ Cloudera where he works with Apache Kafka, Apache Flink, Apache NiFi, Apache Iceberg, TensorFlow, Apache Spark, big data, the IoT, machine learning, and deep learning. Tim has over a decade of experience with the IoT, big data, distributed computing, streaming technologies, and Java programming. Previously, he was a Developer Advocate at StreamNative, Principal Field Engineer at Cloudera, a Senior Solutions Architect at AirisData and a senior field engineer at Pivotal. He blogs for DZone, where he is the Big Data Zone leader, and runs a popular meetup in Princeton on big data, the IoT, deep learning, streaming, NiFi, the blockchain, and Spark. Tim is a frequent speaker at conferences such as IoT Fusion, Strata, ApacheCon, Data Works Summit Berlin, DataWorks Summit Sydney, and Oracle Code NYC. He holds a BS and MS in computer science.
OSACon 2023_ Unlocking Financial Data with Real-Time Pipelines (Timothy Spann)
OSACon 2023_ Unlocking Financial Data with Real-Time Pipelines
Unlocking Financial Data with Real-Time Pipelines
Financial institutions thrive on accurate and timely data to drive critical decision-making processes, risk assessments, and regulatory compliance. However, managing and processing vast amounts of financial data in real-time can be a daunting task. To overcome this challenge, modern data engineering solutions have emerged, combining powerful technologies like Apache Flink, Apache NiFi, Apache Kafka, and Iceberg to create efficient and reliable real-time data pipelines. In this talk, we will explore how this technology stack can unlock the full potential of financial data, enabling organizations to make data-driven decisions swiftly and with confidence.
Introduction: Financial institutions operate in a fast-paced environment where real-time access to accurate and reliable data is crucial. Traditional batch processing falls short when it comes to handling rapidly changing financial markets and responding to customer demands promptly. In this talk, we will delve into the power of real-time data pipelines, utilizing the strengths of Apache Flink, Apache NiFi, Apache Kafka, and Iceberg, to unlock the potential of financial data.
Key Points to be Covered:
Introduction to Real-Time Data Pipelines: a. The limitations of traditional batch processing in the financial domain. b. Understanding the need for real-time data processing.
Apache Flink: Powering Real-Time Stream Processing: a. Overview of Apache Flink and its role in real-time stream processing. b. Use cases for Apache Flink in the financial industry. c. How Flink enables fast, scalable, and fault-tolerant processing of streaming financial data.
Apache Kafka: Building Resilient Event Streaming Platforms: a. Introduction to Apache Kafka and its role as a distributed streaming platform. b. Kafka's capabilities in handling high-throughput, fault-tolerant, and real-time data streaming. c. Integration of Kafka with financial data sources and consumers.
Apache NiFi: Data Ingestion and Flow Management: a. Overview of Apache NiFi and its role in data ingestion and flow management. b. Data integration and transformation capabilities of NiFi for financial data. c. Utilizing NiFi to collect and process financial data from diverse sources.
Iceberg: Efficient Data Lake Management: a. Understanding Iceberg and its role in managing large-scale data lakes. b. Iceberg's schema evolution and table-level metadata capabilities. c. How Iceberg simplifies data lake management in financial institutions.
Real-World Use Cases: a. Real-time fraud detection using Flink, Kafka, and NiFi. b. Portfolio risk analysis with Iceberg and Flink. c. Streamlined regulatory reporting leveraging all four technologies.
Best Practices and Considerations: a. Architectural considerations when building real-time financial data pipelines. b. Ensuring data integrity, security, and compliance in real-time pipelines. c. Scalability an
Zoom is a comprehensive platform designed to connect individuals and teams efficiently. With its user-friendly interface and powerful features, Zoom has become a go-to solution for virtual communication and collaboration. It offers a range of tools, including virtual meetings, team chat, VoIP phone systems, online whiteboards, and AI companions, to streamline workflows and enhance productivity.
SMS API Integration in Saudi Arabia | Best SMS API Service (Yara Milbes)
Discover the benefits and implementation of SMS API integration in the UAE and Middle East. This comprehensive guide covers the importance of SMS messaging APIs, the advantages of bulk SMS APIs, and real-world case studies. Learn how CEQUENS, a leader in communication solutions, can help your business enhance customer engagement and streamline operations with innovative CPaaS, reliable SMS APIs, and omnichannel solutions, including WhatsApp Business. Perfect for businesses seeking to optimize their communication strategies in the digital age.
Do you want Software for your Business? Visit Deuglo
Deuglo has top Software Developers in India. They are experts in software development and help design and create custom Software solutions.
Deuglo follows a seven-step method for delivering its services to customers, called the software development life cycle (SDLC).
Requirements — collecting the requirements is the first phase in the SDLC process.
Feasibility Study — after the requirements are complete, the team assesses feasibility before moving to the design phase.
Design — in this phase, they start designing the software.
Coding — when the design is complete, the developers start coding the software.
Testing — once coding is done, the testing team starts testing.
Installation — after testing is complete, the application is deployed to the live server and launched.
Maintenance — once development is complete and customers start using the software, it is supported and updated as needed.
E-commerce Development Services- Hornet DynamicsHornet Dynamics
For any business hoping to succeed in the digital age, having a strong online presence is crucial. We offer Ecommerce Development Services that are customized according to your business requirements and client preferences, enabling you to create a dynamic, safe, and user-friendly online store.
8 Best Automated Android App Testing Tool and Framework in 2024.pdfkalichargn70th171
Regarding mobile operating systems, two major players dominate our thoughts: Android and iPhone. With Android leading the market, software development companies are focused on delivering apps compatible with this OS. Ensuring an app's functionality across various Android devices, OS versions, and hardware specifications is critical, making Android app testing essential.
Artificia Intellicence and XPath Extension FunctionsOctavian Nadolu
The purpose of this presentation is to provide an overview of how you can use AI from XSLT, XQuery, Schematron, or XML Refactoring operations, the potential benefits of using AI, and some of the challenges we face.
Neo4j - Product Vision and Knowledge Graphs - GraphSummit ParisNeo4j
Dr. Jesús Barrasa, Head of Solutions Architecture for EMEA, Neo4j
Découvrez les dernières innovations de Neo4j, et notamment les dernières intégrations cloud et les améliorations produits qui font de Neo4j un choix essentiel pour les développeurs qui créent des applications avec des données interconnectées et de l’IA générative.
Hand Rolled Applicative User ValidationCode KataPhilip Schwarz
Could you use a simple piece of Scala validation code (granted, a very simplistic one too!) that you can rewrite, now and again, to refresh your basic understanding of Applicative operators <*>, <*, *>?
The goal is not to write perfect code showcasing validation, but rather, to provide a small, rough-and ready exercise to reinforce your muscle-memory.
Despite its grandiose-sounding title, this deck consists of just three slides showing the Scala 3 code to be rewritten whenever the details of the operators begin to fade away.
The code is my rough and ready translation of a Haskell user-validation program found in a book called Finding Success (and Failure) in Haskell - Fall in love with applicative functors.
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...Crescat
Crescat is industry-trusted event management software, built by event professionals for event professionals. Founded in 2017, we have three key products tailored for the live event industry.
Crescat Event for concert promoters and event agencies. Crescat Venue for music venues, conference centers, wedding venues, concert halls and more. And Crescat Festival for festivals, conferences and complex events.
With a wide range of popular features such as event scheduling, shift management, volunteer and crew coordination, artist booking and much more, Crescat is designed for customisation and ease-of-use.
Over 125,000 events have been planned in Crescat and with hundreds of customers of all shapes and sizes, from boutique event agencies through to international concert promoters, Crescat is rigged for success. What's more, we highly value feedback from our users and we are constantly improving our software with updates, new features and improvements.
If you plan events, run a venue or produce festivals and you're looking for ways to make your life easier, then we have a solution for you. Try our software for free or schedule a no-obligation demo with one of our product specialists today at crescat.io
Transform Your Communication with Cloud-Based IVR SolutionsTheSMSPoint
Discover the power of Cloud-Based IVR Solutions to streamline communication processes. Embrace scalability and cost-efficiency while enhancing customer experiences with features like automated call routing and voice recognition. Accessible from anywhere, these solutions integrate seamlessly with existing systems, providing real-time analytics for continuous improvement. Revolutionize your communication strategy today with Cloud-Based IVR Solutions. Learn more at: https://thesmspoint.com/channel/cloud-telephony
UI5con 2024 - Boost Your Development Experience with UI5 Tooling ExtensionsPeter Muessig
The UI5 tooling is the development and build tooling of UI5. It is built in a modular and extensible way so that it can be easily extended by your needs. This session will showcase various tooling extensions which can boost your development experience by far so that you can really work offline, transpile your code in your project to use even newer versions of EcmaScript (than 2022 which is supported right now by the UI5 tooling), consume any npm package of your choice in your project, using different kind of proxies, and even stitching UI5 projects during development together to mimic your target environment.
Measures in SQL (SIGMOD 2024, Santiago, Chile)Julian Hyde
SQL has attained widespread adoption, but Business Intelligence tools still use their own higher level languages based upon a multidimensional paradigm. Composable calculations are what is missing from SQL, and we propose a new kind of column, called a measure, that attaches a calculation to a table. Like regular tables, tables with measures are composable and closed when used in queries.
SQL-with-measures has the power, conciseness and reusability of multidimensional languages but retains SQL semantics. Measure invocations can be expanded in place to simple, clear SQL.
To define the evaluation semantics for measures, we introduce context-sensitive expressions (a way to evaluate multidimensional expressions that is consistent with existing SQL semantics), a concept called evaluation context, and several operations for setting and modifying the evaluation context.
A talk at SIGMOD, June 9–15, 2024, Santiago, Chile
Authors: Julian Hyde (Google) and John Fremlin (Google)
https://doi.org/10.1145/3626246.3653374
Essentials of Automations: The Art of Triggers and Actions in FMESafe Software
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
What is Augmented Reality Image Trackingpavan998932
Augmented Reality (AR) Image Tracking is a technology that enables AR applications to recognize and track images in the real world, overlaying digital content onto them. This enhances the user's interaction with their environment by providing additional information and interactive elements directly tied to physical images.
UI5con 2024 - Keynote: Latest News about UI5 and it’s EcosystemPeter Muessig
Learn about the latest innovations in and around OpenUI5/SAPUI5: UI5 Tooling, UI5 linter, UI5 Web Components, Web Components Integration, UI5 2.x, UI5 GenAI.
Recording:
https://www.youtube.com/live/MSdGLG2zLy8?si=INxBHTqkwHhxV5Ta&t=0
WhatsApp offers simple, reliable, and private messaging and calling services for free worldwide. With end-to-end encryption, your personal messages and calls are secure, ensuring only you and the recipient can access them. Enjoy voice and video calls to stay connected with loved ones or colleagues. Express yourself using stickers, GIFs, or by sharing moments on Status. WhatsApp Business enables global customer outreach, facilitating sales growth and relationship building through showcasing products and services. Stay connected effortlessly with group chats for planning outings with friends or staying updated on family conversations.
OpenMetadata Community Meeting - 5th June 2024OpenMetadata
The OpenMetadata Community Meeting was held on June 5th, 2024. In this meeting, we discussed about the data quality capabilities that are integrated with the Incident Manager, providing a complete solution to handle your data observability needs. Watch the end-to-end demo of the data quality features.
* How to run your own data quality framework
* What is the performance impact of running data quality frameworks
* How to run the test cases in your own ETL pipelines
* How the Incident Manager is integrated
* Get notified with alerts when test cases fail
Watch the meeting recording here - https://www.youtube.com/watch?v=UbNOje0kf6E
Best in Flow Competition Tutorials
Authors: Michael Kohs, George Vetticaden, Timothy Spann
Date: 04/18/2023
Last Updated: 5/3/2023
Useful Data Assets
Setting Your Workload Password
Creating a Kafka Topic
Use Case Walkthrough
1. Reading and filtering a stream of syslog data
2. Writing critical syslog events to Apache Iceberg for analysis
3. Resize image flow deployed as serverless function
Use Case Walkthrough for Competition
Notice
This document assumes that you have registered for an account, activated it and logged into
the CDP Sandbox. This is for authorized users only who have attended the webinar and have
read the training materials.
A short guide and references are listed here.
Competition Resources
Login to the Cluster
https://login.cdpworkshops.cloudera.com/auth/realms/se-workshop-5/protocol/saml/clients/cdp-sso
Kafka Broker connection string
● oss-kafka-demo-corebroker2.oss-demo.qsm5-opic.cloudera.site:9093,
● oss-kafka-demo-corebroker1.oss-demo.qsm5-opic.cloudera.site:9093,
● oss-kafka-demo-corebroker0.oss-demo.qsm5-opic.cloudera.site:9093
Kafka Topics
● syslog_json
● syslog_avro
● syslog_critical
Schema Registry Hostname
● oss-kafka-demo-master0.oss-demo.qsm5-opic.cloudera.site
Schema Name
● syslog
● syslog_avro
● syslog_transformed
Syslog Filter Rule
● SELECT * FROM FLOWFILE WHERE severity <= 2
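In syslog (RFC 5424), lower severity numbers indicate more urgent events: 0 = Emergency, 1 = Alert, 2 = Critical. The rule above therefore keeps only the most urgent events; the QueryRecord processor used later in the walkthrough evaluates it as SQL against each FlowFile, with FLOWFILE as the table name. As a rough, illustrative sketch only (not part of the competition setup), the same style of rule could be tightened further:
SELECT * FROM FLOWFILE
WHERE severity <= 2
AND facility = 4   -- hypothetical extra condition; facility 4 is only an example value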
Access Key and Private Key for Machine User in DataFlow Function
● Access Key: eda9f909-d1c2-4934-bad7-95ec6e326de8
● Private Key: eon6eFzLlxZI/gpU0dWtht21DI60MkSQZjIzeWSGBSI=
The following keys are needed if you want to deploy a DataFlow Function that you build during
the Best in Flow Competition.
Your Workload User Name and Password
1. Click on your name at the bottom left corner of the screen for a menu to pop up.
2. Click on Profile to be redirected to your user’s profile page with important information.
If your Workload Password does not say currently set or you forgot it, follow the steps below to
reset it. Your userid is shown above as the Workload User Name.
Setting Workload Password
You will need to define your workload password that will be used to access non-SSO
interfaces. You may read more about it here. Please keep it with you. If you have
forgotten it, you will be able to repeat this process and define another one.
1. From the Home Page, click on your User Name (Ex: tim) at the lower left corner.
2. Click on the Profile option.
3. Click the option Set Workload Password.
4. Enter a suitable Password and Confirm Password.
5. Click the button Set Workload Password.
6. Check that you got the message Workload password is currently set or, alternatively, look for a message next to Workload Password which says (Workload password is currently set). Save the password you configured, as well as the workload user name, for use later.
Create a Kafka Topic
The tutorials require you to create an Apache Kafka topic to send your data to; this section shows how to create that topic. You will also need this information to create topics for any of your own custom applications for the competition.
1. Navigate to Data Hub Clusters from the Home Page
Info: You can always navigate back to the home page by clicking the app switcher icon
at the top left of your screen.
2. Navigate to the oss-kafka-demo cluster
3. Navigate to Streams Messaging Manager
Info: Streams Messaging Manager (SMM) is a tool for working with Apache Kafka.
4. You are now in SMM.
5. Navigate to the round icon third from the top and click this Topic button.
6. You are now in the Topic browser.
7. Click Add New to build a new topic.
8. Enter the name of your topic prefixed with your Workload User Name, ex:
<<replace_with_userid>>_syslog_critical.
9. For settings, create the topic with 3 partitions, cleanup.policy: delete, and availability maximum, as shown above.
10. After successfully creating the topic, close the tab that opened when navigating to Streams Messaging Manager.
Congratulations! You have built a new topic.
1. Reading and filtering a stream of syslog data
You have been tasked with filtering a noisy stream of syslog events which are available in a
Kafka topic. The goal is to identify critical events and write them to the Kafka topic you just created.
Related documentation is here.
1.1 Open ReadyFlow & start Test Session
1. Navigate to DataFlow from the Home Page
2. Navigate to the ReadyFlow Gallery
3. Explore the ReadyFlow Gallery
Info:
The ReadyFlow Gallery is where you can find out-of-box templates for common data movement
use cases. You can directly create deployments from a ReadyFlow or create new drafts and
modify the processing logic according to your needs before deploying.
4. Select the “Kafka filter to Kafka” ReadyFlow.
5. Get your user ID from your profile; it is usually the first part of your email address. For example, if your email is tim@sparkdeveloper.com, your user ID is tim. This is your "Workload User Name", which you will need for several steps, so remember it.
6. You already created a new topic to receive data in the setup section.
<<replace_with_userid>>_syslog_critical Ex: tim_syslog_critical.
7. Click on "Create New Draft" to open the ReadyFlow in the Designer with the name youruserid_kafkafilterkafka, for example tim_kafkafilterkafka. If your name has periods, underscores or other non-alphanumeric characters, just leave those out. Select from the available workspaces in the dropdown; you should only have one available.
8. Start a Test Session by either clicking on the start a test session link in the banner or
going to Flow Options and selecting Start in the Test Session section.
9. In the Test Session creation wizard, select the latest NiFi version and click Start Test Session. Leave the other options at their default values. Notice how the status at the top now says "Initializing Test Session".
Info:
Test Sessions provision infrastructure on the fly and allow you to start and stop individual processors and send data through your flow. By running data through processors step by step and using the data viewer as needed, you're able to validate your processing logic during development in an iterative way without having to treat your entire data flow as one deployable unit.
1.2 Modifying the flow to read syslog data
The flow consists of three processors and looks very promising for our use case. The first
processor reads data from a Kafka topic, the second processor allows us to filter the events
before the third processor writes the filtered events to another Kafka topic.
All we have to do now to reach our goal is to customize its configuration to our use case.
1. Provide values for predefined parameters
a. Navigate to Flow Options→ Parameters
b. Some settings already have parameters defined; for those that do not, set the values manually. Make sure you create a parameter for the Group Id.
c. Configure the following parameters:
● CDP Workload User: <your own workload user ID that you saved when you configured your workload password>
● CDP Workload User Password: <your own workload user password you configured in the earlier step>
● Filter Rule: SELECT * FROM FLOWFILE WHERE severity <= 2
● Data Input Format: AVRO
● Data Output Format: JSON
● Kafka Consumer Group ID (ConsumeFromKafka): <<replace_with_userid>>_cdf (Ex: tim_cdf)
● Group ID (ConsumeFromKafka): <<replace_with_userid>>_cdf (Ex: tim_cdf)
● Kafka Broker Endpoint (comma-separated list of Kafka Broker addresses): oss-kafka-demo-corebroker2.oss-demo.qsm5-opic.cloudera.site:9093, oss-kafka-demo-corebroker1.oss-demo.qsm5-opic.cloudera.site:9093, oss-kafka-demo-corebroker0.oss-demo.qsm5-opic.cloudera.site:9093
● Kafka Destination Topic (must be unique): <<replace_with_userid>>_syslog_critical (Ex: tim_syslog_critical)
● Kafka Producer ID (must be unique): <<replace_with_userid>>_cdf_producer1 (Ex: tim_cdf_producer1)
● Kafka Source Topic: syslog_avro
● Schema Name: syslog
● Schema Registry Hostname (hostname from the Kafka cluster): oss-kafka-demo-master0.oss-demo.qsm5-opic.cloudera.site
d. Click Apply Changes to save the parameter values.
e. If confirmation is requested, click OK.
2. Start Controller Services
a. Navigate to Flow Options → Services
b. Select the CDP_Schema_Registry service and click the Enable Service and Referencing Components action. If the service does not enable, there may be an error or an extra space in one of the parameters; for example, AVRO must not contain a newline or blank spaces. The first thing to try if you have an issue is to stop the Designer environment and then restart the test session. Check the Tips guide for more help or contact us on bestinflow.slack.com.
c. Start from the top of the list and enable all remaining Controller Services.
d. Make sure all services have been enabled. You may need to reload the page or try it in a new tab.
3. If your processors have all started because you started your controller services, it is best to stop them all by right-clicking on each one and clicking 'Stop', and then start them one at a time so you can follow the process more easily. Start the ConsumeFromKafka processor using the right-click action menu or the Start button in the configuration drawer.
After starting the processor, you should see events starting to queue up in the success_ConsumeFromKafka-FilterEvents connection.
4. Verify data being consumed from Kafka
a. Right-click on the success_ConsumeFromKafka-FilterEvents connection and
select List Queue
Info:
The List Queue interface shows you all flow files that are being queued in this
connection. Click on a flow file to see its metadata in the form of attributes. In our
use case, the attributes tell us a lot about the Kafka source from which we are
consuming the data. Attributes change depending on the source you’re working
with and can also be used to store additional metadata that you generate in your
flow.
b. Select any flow file in the queue and click the book icon to open it in the Data
Viewer
Info: The Data Viewer displays the content of the selected flow file and shows
you the events that we have received from Kafka. It automatically detects the
data format - in this case JSON - and presents it in human readable format.
c. Scroll through the content and note how we are receiving syslog events with
varying severity.
5. Define filter rule to filter out low severity events
a. Return to the Flow Designer by closing the Data Viewer tab and clicking Back To
Flow Designer in the List Queue screen.
b. Select the Filter Events processor on the canvas. We are using a QueryRecord
processor to filter out low severity events. The QueryRecord processor is very
flexible and can run several filtering or routing rules at once.
c. In the configuration drawer, scroll down until you see the filtered_events property.
We are going to use this property to filter out the events. Click on the menu at the
end of the row and select Go To Parameter.
d. If you wish to change this, you can change the Parameter value.
e. Click Apply Changes to update the parameter value. Return to the Flow Designer.
f. Start the Filter Events processor using the right-click menu or the Start icon in the
configuration drawer.
6. Verify that the filter rule works
a. After starting the Filter Events processor, flow files will start queueing up in the
filtered_events-FilterEvents-WriteToKafka connection
b. Right click the filtered_events-FilterEvents-WriteToKafka connection and select
List Queue.
c. Select a few random flow files and open them in the Data Viewer to verify that
only events with severity <=2 are present.
d. Navigate back to the Flow Designer canvas.
7. Write the filtered events to the Kafka alerts topic
Now all that is left is to start the WriteToKafka processor to write our filtered high severity events to the <<replace_with_userid>>_syslog_critical Kafka topic you created earlier.
a. Select the WriteToKafka processor and explore its properties in the configuration
drawer.
b. Note how we are plugging in many of our parameters to configure this processor.
Values like Kafka Brokers, Topic Name, Username, Password and the Record
Writer have all been parameterized and use the values that we provided in the
very beginning.
c. Start the WriteToKafka processor using the right-click menu or the Start icon in
the configuration drawer.
Congratulations! You have successfully customized this ReadyFlow and achieved your goal of
sending critical alerts to a dedicated topic! Now that you are done with developing your flow, it is
time to deploy it in production!
1.3 Publishing your flow to the catalog
1. Stop the Test Session
a. Click the toggle next to Active Test Session to stop your Test Session
b. Click “End” in the dialog to confirm. The Test Session is now stopping and
allocated resources are being released
2. Publish your modified flow to the Catalog
a. Open the “Flow Options” menu at the top
b. Click “Publish” to make your modified flow available in the Catalog
c. Prefix your username to the Flow Name and provide a Flow Description. Click Publish.
d. You are now redirected to your published flow definition in the Catalog.
Info: The Catalog is the central repository for all your deployable flow definitions.
From here you can create auto-scaling deployments from any version or create
new drafts and update your flow processing logic to create new versions of your
flow.
1.4 Creating an auto-scaling flow deployment
1. As soon as you publish your flow, it should take you to the Catalog. If it does not
then locate your flow definition in the Catalog
a. Make sure you have navigated to the Catalog
b. If you have closed the sidebar, search for your published flow by typing <<yourid>> into the search bar in the Catalog. Click on the flow definition that matches the name you gave it earlier.
c. After opening the side panel, click Deploy, select the available environment from the drop-down menu, and click Continue to start the Deployment Wizard.
d. If you have any issues, log out, close your browser, restart your browser, try an incognito window, and log in again. Also see the "Best Practices Guide".
2. Complete the Deployment Wizard
The Deployment Wizard guides you through a six-step process to create a flow deployment. Throughout the six steps you will choose the NiFi configuration of your flow, provide parameters, and define KPIs. At the end of the process, you are able to generate a CLI command to automate future deployments.
Note: The Deployment name has a cap of 27 characters which needs to be considered as
you write the prod name.
a. Provide a name such as <<your_username>>_kafkatokafka_prod to indicate the
use case and that you are deploying a production flow. Click Next.
b. The NiFi Configuration screen allows you to customize the runtime that will
execute your flow. You have the opportunity to pick from various released NiFi
versions.
Select the Latest Version and make sure Automatically start flow upon successful
deployment is checked.
Click Next.
c. The Parameters step is where you provide values for all the parameters that you
defined in your flow. In this example, you should recognize many of the prefilled
values from the previous exercise - including the Filter Rule and our Kafka
Source and Kafka Destination Topics.
To advance, you have to provide values for all parameters. Select the No Value
option to only display parameters without default values.
You should now only see one parameter - the CDP Workload User Password
parameter which is sensitive. Sensitive parameter values are removed when you
publish a flow to the catalog to make sure passwords don’t leak.
Provide your CDP Workload User Password and click Next to continue.
d. The Sizing & Scaling step lets you choose the resources that you want to
allocate for this deployment. You can choose from several node configurations
and turn on Auto-Scaling.
Let’s choose the Extra Small Node Size and turn on Auto-Scaling from 1-3
nodes. Click Next to advance.
e. The Key Performance Indicators (KPI) step allows you to monitor flow
performance. You can create KPIs for overall flow performance metrics or
in-depth processor or connection metrics.
Add the following KPI:
● KPI Scope: Entire Flow
● Metric to Track: Data Out
● Alerts:
○ Trigger alert when metric is less than: 1 MB/sec
○ Alert will be triggered when metrics are outside the boundary(s) for: 1 Minute
Add the following KPI:
● KPI Scope: Processor
● Processor Name: ConsumeFromKafka
● Metric to Track: Bytes Received
● Alerts:
○ Trigger alert when metric is less than: 512 KBytes/sec
○ Alert will be triggered when metrics are outside the boundary(s) for: 30 seconds
Review the KPIs and click Next.
f. In the Review page, review your deployment details.
Notice that in this page there's a >_ View CLI Command link. You will use the
information in the page in the next section to deploy a flow using the CLI. For now
you just need to save the script and dependencies provided there:
i. Click on the >_ View CLI Command link and familiarize yourself with the
content.
ii. Download the 2 JSON dependency files by clicking on the download button:
1. Flow Deployment Parameters JSON
2. Flow Deployment KPIs JSON
iii. Copy the command at the end of this page and save that in a file called
deploy.sh
iv. Close the Equivalent CDP CLI Command tab.
g. Click Deploy to initiate the flow deployment!
h. You are redirected to the Deployment Dashboard where you can monitor the
progress of your deployment. Creating the deployment should only take a few
minutes.
i. Congratulations! Your flow deployment has been created and is already
processing Syslog events!
Please wait until your application has finished the Deploying and Importing Flow stages. Wait for Good Health.
1.5 Monitoring your flow deployment
1. Notice how the dashboard shows you the data rates at which a deployment currently
receives and sends data. The data is also visualized in a graph that shows the two
metrics over time.
2. Change the Metrics Window setting at the top right. You can visualize as much as 1 Day.
3. Click on the yourid_kafkafilterkafka_prod deployment. The side panel opens and
shows more detail about the deployment. On the KPIs tab it will show information about
the KPIs that you created when deploying the flow.
Using the two KPIs Bytes Received and Data Out we can observe that our flow is
filtering out data as expected since it reads more than it sends out.
Wait a few minutes so some data and metrics can be generated.
4. Switch to the System Metrics tab where you can observe the current CPU utilization rate
for the deployment. Our flow is not doing a lot of heavy transformation, so it should hover around ~10% CPU usage.
5. Close the side panel by clicking anywhere on the Dashboard.
6. Notice how your yourid_kafkafilterkafka_prod deployment shows Concerning Health
status. Hover over the warning icon and click View Details.
7. You will be redirected to the Alerts tab of the deployment. Here you get an overview of
active and past alerts and events. Expand the Active Alert to learn more about its cause.
After expanding the alert, it is clear that it is caused by a KPI threshold breach for
sending less than 1MB/s to external systems as defined earlier when you created the
deployment.
1.6 Managing your flow deployment
1. Click on the yourid_kafkafilterkafka_prod deployment in the Dashboard. In the side panel,
click Manage Deployment at the top right.
2. You are now being redirected to the Deployment Manager. The Deployment Manager
allows you to reconfigure the deployment: modify KPIs, change the number of NiFi nodes, turn auto-scaling on/off, or update parameter values.
3. Explore the NiFi UI for the deployment. Click the Actions menu and click on View in NiFi.
4. You are being redirected to the NiFi cluster running the flow deployment. You can use
this view for in-depth troubleshooting. Users can have read-only or read/write
permissions to the flow deployment.
2. Writing critical syslog events to Apache Iceberg for analysis
A few weeks have passed since you built your data flow with DataFlow Designer to filter
out critical syslog events to a dedicated Kafka topic. Now that everyone has better
visibility into real-time health, management wants to do historical analysis on the data.
Your company is evaluating Apache Iceberg to build an open data lakehouse and you
are tasked with building a flow that ingests the most critical syslog events into an Iceberg
table.
Ensure your table is built and accessible.
Create an Apache Iceberg Table
1. From the Home page, click Data Hub Clusters. Navigate to oss-kudu-demo in the Data Hubs list.
2. Navigate to Hue from the Kudu Data Hub.
3. Inside Hue you can now create your table. You will have your own database to work
with. To get to your database, click on the ‘<’ icon next to default database. You should
see your specific database in the format: <YourEmailWithUnderscores>_db. Click on
your database to go to the SQL Editor.
4. Create your Apache Iceberg table with the SQL below, clicking the play icon to execute the SQL query. Note that the table name must be prefixed with your Workload User Name (userid).
CREATE TABLE <<userid>>_syslog_critical_archive
(priority int, severity int, facility int, version int, event_timestamp bigint, hostname string,
body string, appName string, procid string, messageid string,
structureddata struct<sdid:struct<eventid:string,eventsource:string,iut:string>>)
STORED BY ICEBERG;
5. Once you have sent data to your table, you can query it.
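For example, once the flow from section 2.4 has written some records, a quick sanity check in Hue could look like the following (the table name assumes the userid prefix tim; replace it with your own Workload User Name):
SELECT hostname, severity, COUNT(*) AS events
FROM tim_syslog_critical_archive
GROUP BY hostname, severity
ORDER BY events DESC
LIMIT 10;
This simply confirms that rows are arriving and shows which hosts are producing the most critical events.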
Additional Documentation
● Create a Table
● Query a Table
● Apache Iceberg Table Properties
2.1 Open ReadyFlow & start Test Session
1. Navigate to DataFlow from the Home Page
2. Navigate to the ReadyFlow Gallery
3. Explore the ReadyFlow Gallery
4. Search for the “Kafka to Iceberg” ReadyFlow.
5. Click on “Create New Draft” to open the ReadyFlow in the Designer named
yourid_kafkatoiceberg Ex: tim_kafkatoiceberg
6. Start a Test Session by either clicking on the start a test session link in the banner or
going to Flow Options and selecting Start in the Test Session section.
7. In the Test Session creation wizard, select the latest NiFi version and click Start Test
Session. Notice how the status at the top now says “Initializing Test Session”.
2.2 Modifying the flow to read syslog data
The flow consists of three processors and looks very promising for our use case. The first
processor reads data from a Kafka topic, the second processor gives us the option to batch up
events and create larger files which are then written out to Iceberg by the PutIceberg processor.
All we have to do now to reach our goal is to customize its configuration to our use case.
1. Provide values for predefined parameters
a. Navigate to Flow Options→ Parameters
b. Select all parameters that show No value set and provide the following values
● CDP Workload User: <your own workload user name>
● CDP Workload User Password: <your own workload user password>
● Data Input Format (this flow supports AVRO, JSON and CSV): JSON
● Hive Catalog Namespace: <YourEmailWithUnderScores_db>
● Iceberg Table Name: <<replace_with_userid>>_syslog_critical_archive
● Kafka Broker Endpoint (comma-separated list of Kafka Broker addresses): oss-kafka-demo-corebroker2.oss-demo.qsm5-opic.cloudera.site:9093, oss-kafka-demo-corebroker1.oss-demo.qsm5-opic.cloudera.site:9093, oss-kafka-demo-corebroker0.oss-demo.qsm5-opic.cloudera.site:9093
● Kafka Consumer Group Id: <<replace_with_userid>>_cdf (Ex: tim_cdf)
● Kafka Source Topic: <<replace_with_userid>>_syslog_critical (Ex: tim_syslog_critical)
● Schema Name: syslog
● Schema Registry Hostname: oss-kafka-demo-master0.oss-demo.qsm5-opic.cloudera.site
c. Click Apply Changes to save the parameter values
2. Start Controller Services
a. Navigate to Flow Options → Services
b. Select the CDP_Schema_Registry service and click the Enable Service and Referencing Components action.
c. Start from the top of the list and enable all remaining Controller Services, including KerberosPasswordUserService, HiveCatalogService, AvroReader, …
d. Click OK if confirmation is requested.
e. Make sure all services have been enabled.
3. Start the ConsumeFromKafka processor using the right click action menu or the Start
button in the configuration drawer. It might already be started.
After starting the processor, you should see events starting to queue up in the success connection leading out of ConsumeFromKafka.
NOTE:
To receive data on your topic, you will need either the first use case's deployment still running, or you can run that flow from another Flow Designer Test Session.
2.3 Changing the flow to modify the schema for Iceberg integration
Our data warehouse team has created an Iceberg table into which they want us to ingest the critical syslog data. A challenge we are facing is that not all column names in the Iceberg table
match our syslog record schema. So we have to add functionality to our flow that allows us to
change the schema of our syslog records. To do this, we will be using the JoltTransformRecord
processor.
1. Add a new JoltTransformRecord processor by dragging the processor icon onto the canvas.
2. In the Add Processor window, select the JoltTransformRecord type and name the
processor TransformSchema.
3. Validate that your new processor now appears on the canvas.
4. Create connections from ConsumeFromKafka to TransformSchema by hovering over the
ConsumeFromKafka processor and dragging the arrow that appears to
TransformSchema. Pick the success relationship to connect.
Now connect the success relationship of TransformSchema to the MergeRecords
processor.
5. Now that we have connected our new TransformSchema processor, we can delete the
original connection between ConsumeFromKafka and MergeRecords.
Make sure that the ConsumeFromKafka processor is stopped. Then select the
connection, empty the queue if needed, and then delete it. Now all syslog events that we receive will go through the TransformSchema processor.
6. To make sure that our schema transformation works, we have to create a new Record
Writer Service and use it as the Record Writer for the TransformSchema processor.
Select the TransformSchema processor and open the configuration panel. Scroll to the
Properties section, click the three dot menu in the Record Writer row and select Add
Service to create a new Record Writer.
7. Select AvroRecordSetWriter, name it TransformedSchemaWriter, and click Add.
Click Apply in the configuration panel to save your changes.
8. Now click the three dot menu again and select Go To Service to configure our new Avro
Record Writer.
9. To configure our new Avro Record Writer, provide the following values:
● Schema Write Strategy (specify whether/how CDF should write schema information): Embed Avro Schema
● Schema Access Strategy (specify how CDF identifies the schema to apply): Use ‘Schema Name’ Property
● Schema Registry (the Schema Registry that stores our schema): CDP_Schema_Registry
● Schema Name (the schema name to look up in the Schema Registry): syslog_transformed
10. Convert the value that you provided for Schema Name into a parameter. Click on the
three dot menu in the Schema Name row and select Convert To Parameter.
11. Give the parameter the name Schema Name Transformed and click “add”. You have
now created a new parameter from a value that can be used in more places in your data
flow.
12. Apply your configuration changes and Enable the Service by clicking the power icon.
Now you have configured your new Schema Writer and can return to the Flow Designer canvas.
If you have any issues, end the test session and restart it. If your login timed out, close your browser and log in again.
13. Click Back to Flow Designer to navigate back to the canvas.
14. Select TransformSchema to configure it and provide the following values:
● Record Reader (service used to parse incoming events): AvroReader
● Record Writer (service used to format outgoing events): TransformedSchemaWriter
● Jolt Specification (describes how to modify the incoming JSON data; we are standardizing on lowercase field names and renaming the timestamp field to event_timestamp):
[
  {
    "operation": "shift",
    "spec": {
      "appName": "appname",
      "timestamp": "event_timestamp",
      "structuredData": {
        "SDID": {
          "eventId": "structureddata.sdid.eventid",
          "eventSource": "structureddata.sdid.eventsource",
          "iut": "structureddata.sdid.iut"
        }
      },
      "*": {
        "@": "&"
      }
    }
  }
]
15. Scroll to Relationships, select Terminate for the failure and original relationships, and click Apply.
16. Start your ConsumeFromKafka and TransformSchema processors and validate that the
transformed data matches our Iceberg table schema.
17. Once events are queuing up in the connection between TransformSchema and MergeRecords, right-click the connection and select List Queue.
18. Select any of the queued files and select the book icon to open it in the Data Viewer
19. Notice how all field names have been transformed to lower case and how the timestamp
field has been renamed to event_timestamp.
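If you later query the Iceberg table in Hue, the renamed and nested fields produced by this transform can be addressed directly. A small illustrative query (again assuming the tim prefix; adjust to your own table name):
SELECT hostname,
       event_timestamp,
       structureddata.sdid.eventid AS event_id
FROM tim_syslog_critical_archive
LIMIT 10;
The struct columns map onto the structureddata struct defined in the CREATE TABLE statement at the beginning of this use case.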
2.4 Merging records and start writing to Iceberg
Now that we have verified that our schema is being transformed as needed, it’s time to start the
remaining processors and write our events into the Iceberg table. The MergeRecords processor is
configured to batch events up to increase efficiency when writing to Iceberg. The final processor,
WriteToIceberg, takes our Avro records and writes them into a Parquet-formatted table.
1. Tip: You can change the configuration to something like “30 sec” to speed up
processing.
2. Select the MergeRecords processor and explore its configuration. It is configured to
batch events up for at least 30 seconds or until the queued-up events have reached the Maximum Bin Size of 1 GB. You will want to lower these values for testing.
3. Start the MergeRecords processor and verify that it batches up events and writes them
out after 30 seconds.
4. Select the WriteToIceberg processor and explore its configuration. Notice how it relies on
several parameters to establish a connection to the right database and table.
5. Start the WriteToIceberg processor and verify that it writes records successfully to
Iceberg. If the metrics on the processor increase and you don’t see any warnings or
events being written to the failure_WriteToIceberg connection, your writes are
successful!
Congratulations! With this you have completed the second use case.
You may want to log into Hue to check your data has loaded.
Feel free to publish your flow to the catalog and create a deployment just like you did for
the first one.
3. Resize image flow deployed as serverless function
DataFlow Functions provides a new, efficient way to run your event-driven Apache NiFi data
flows. You can have your flow executed within AWS Lambda, Azure Functions or Google Cloud
Functions and define the trigger that should start its execution.
DataFlow Functions is perfect for use cases such as:
- Processing files as soon as they land in the cloud provider object store
- Creating microservices over HTTPS
- CRON driven use cases
- etc
In this use case, we will be deploying a NiFi flow that will be triggered by HTTPS requests to
resize images. Once deployed, the cloud provider will expose an HTTPS endpoint that you’ll be able to call to send an image; the call will trigger the NiFi flow, which will return a resized image based on your parameters.
The deployment of the flow as a function will have to be done within your cloud provider.
The below tutorial will use AWS as the cloud provider. If you’re using Azure or Google Cloud,
you can still refer to this documentation to deploy the flow as a function.
3.1 Designing the flow for AWS Lambda
1. Go into Cloudera DataFlow / Flow Design and create a new draft with a name of your
choice.
2. Drag and drop an Input Port named input onto the canvas. When triggered, AWS
Lambda is going to inject into that input port a FlowFile containing the information about
the HTTPS call that has been made.
Example of the payload that AWS Lambda will inject as a FlowFile (a JSON document with fields such as headers and body, which the next steps read):
3. Drag and drop an EvaluateJsonPath processor, call it ExtractHTTPHeaders. We’re going
to use this to extract the HTTP headers that we want to keep in our flow. Add two
properties configured as below. They will save the HTTP headers (resize-height and resize-width) as FlowFile attributes; we will add these headers when making the call with our image to specify the dimensions of the resized image.
resizeHeight => $.headers.resize-height
resizeWidth => $.headers.resize-width
Note: don’t forget to set Destination to “flowfile-attribute” and click Apply.
4. Drag and drop another EvaluateJsonPath processor and then change its name to a unique one. This one will be used to retrieve the content of the body field from the
payload we received and use it as the new content of the FlowFile. This field contains
the actual representation of the image we have been sending over HTTP with Base 64
encoding.
body => $.body
5. Drag and drop a Base64EncodeContent processor and change the mode to Decode.
This will Base64 decode the content of the FlowFile to retrieve its binary format.
6. Drag and drop a ResizeImage processor. Use the previously created FlowFile attributes
to specify the new dimensions of the image. Also, specify true for maintaining the ratio.
7. Drag and drop a Base64EncodeContent processor. To send back the resized image to
the user, AWS Lambda expects us to send back a specific JSON payload with the Base
64 encoding of the image.
8. Drag and drop a ReplaceText processor. We use it to extract the Base 64 representation
of the resized image and add it to the expected JSON payload. Add the below JSON in
“Replacement Value” and change “Evaluation Mode” to “Entire text”.
{
"statusCode": 200,
"headers": { "Content-Type": "image/png" },
"isBase64Encoded": true,
"body": "$1"
}
9. Drag and drop an output port.
10. Connect all the components together; you can auto-terminate the unused relationships. The result should look like this:
You can now publish the flow into the DataFlow Catalog in the Flow Options menu:
Make sure to give it a name that is unique (you can prefix it with your name):
Once the flow is published, make sure to copy the CRN of the published version (it will end with /v.1):
3.2 Deploying the flow as a function in AWS Lambda
First things first, go into DataFlow Functions and download the binary for running DataFlow
Functions in AWS Lambda:
This should download a binary with a name similar to:
naaf-aws-lambda-1.0.0.2.3.7.0-100-bin.zip
Once you have the binary, make sure you also have:
● The CRN of the flow you published in the DataFlow Catalog
● The Access Key provided with these instructions in the “Competition Resources” section
● The Private Key provided with these instructions in the “Competition Resources” section
To speed up the deployment, we’re going to leverage some scripts to automate it. This assumes that your AWS CLI is properly configured locally on your laptop and that the jq command is available for reading JSON payloads. You can now follow the instructions from
this page here.
However, if you wish to deploy the flow in AWS Lambda manually through the AWS UI, you can
follow the steps described here.