Best in Flow Competition Tutorials
Authors: Michael Kohs, George Vetticaden, Timothy Spann
Date: 04/18/2023
Last Updated: 05/03/2023
Useful Data Assets
Setting Your Workload Password
Creating a Kafka Topic
Use Case Walkthrough
1. Reading and filtering a stream of syslog data
2. Writing critical syslog events to Apache Iceberg for analysis
3. Resize image flow deployed as serverless function
Use Case Walkthrough for Competition
Notice
This document assumes that you have registered for an account, activated it, and logged into
the CDP Sandbox. It is intended for authorized users only who have attended the webinar and
read the training materials.
A short guide and references are listed here.
Competition Resources
Login to the Cluster
https://login.cdpworkshops.cloudera.com/auth/realms/se-workshop-5/protocol/saml/clients/cdp-sso
Kafka Broker connection string
● oss-kafka-demo-corebroker2.oss-demo.qsm5-opic.cloudera.site:9093,
● oss-kafka-demo-corebroker1.oss-demo.qsm5-opic.cloudera.site:9093,
● oss-kafka-demo-corebroker0.oss-demo.qsm5-opic.cloudera.site:9093
Kafka Topics
● syslog_json
● syslog_avro
● syslog_critical
Schema Registry Hostname
● oss-kafka-demo-master0.oss-demo.qsm5-opic.cloudera.site
Schema Name
● syslog
● syslog_avro
● syslog_transformed
Syslog Filter Rule
● SELECT * FROM FLOWFILE WHERE severity <= 2
Access Key and Private Key for Machine User in DataFlow Function
● Access Key: eda9f909-d1c2-4934-bad7-95ec6e326de8
● Private Key: eon6eFzLlxZI/gpU0dWtht21DI60MkSQZjIzeWSGBSI=
The following keys are needed if you want to deploy a DataFlow Function that you build during
the Best in Flow Competition.
Your Workload User Name and Password
1. Click on your name at the bottom left corner of the screen for a menu to pop up.
2. Click on Profile to be redirected to your user’s profile page with important information.
If your Workload Password does not say currently set, or you forgot it, follow the steps below to
reset it. Your user ID is shown on the profile page as the Workload User Name.
Setting Workload Password
You will need to define a workload password, which is used to access non-SSO
interfaces. You can read more about it here. Keep it somewhere safe; if you forget it,
you can repeat this process to define a new one.
1. From the Home Page, click on your User Name (Ex: tim) at the lower left corner.
2. Click on the Profile option.
3. Click the option Set Workload Password.
4. Enter a suitable Password and Confirm Password.
5. Click the button Set Workload Password.
Check that you see the message Workload password is currently set (it appears next to
Workload Password on your profile page). Save the password you configured, along with
your Workload User Name, for later use.
Create a Kafka Topic
The tutorials require you to create an Apache Kafka topic to send your data to; this section shows how
you can create that topic. You will also need this information to create topics for any of your own
custom applications for the competition.
1. Navigate to Data Hub Clusters from the Home Page
Info: You can always navigate back to the home page by clicking the app switcher icon
at the top left of your screen.
2. Navigate to the oss-kafka-demo cluster
3. Navigate to Streams Messaging Manager
Info: Streams Messaging Manager (SMM) is a tool for working with Apache Kafka.
4. You are now in SMM.
5. In the left navigation, click the Topic button (the round icon, third from the top).
6. You are now in the Topic browser.
7. Click Add New to build a new topic.
8. Enter the name of your topic prefixed with your Workload User Name, ex:
<<replace_with_userid>>_syslog_critical.
9. For settings, create the topic with 3 partitions, cleanup.policy: delete, and availability
maximum, as shown above.
10. After successfully creating the topic, close the tab that opened when navigating to Streams
Messaging Manager.
Congratulations! You have built a new topic.
1. Reading and filtering a stream of syslog data
You have been tasked with filtering a noisy stream of syslog events which are available in a
Kafka topic. The goal is to identify critical events and write them to the Kafka topic you just
created.
Related documentation is here.
1.1 Open ReadyFlow & start Test Session
1. Navigate to DataFlow from the Home Page
2. Navigate to the ReadyFlow Gallery
3. Explore the ReadyFlow Gallery
Info:
The ReadyFlow Gallery is where you can find out-of-box templates for common data movement
use cases. You can directly create deployments from a ReadyFlow or create new drafts and
modify the processing logic according to your needs before deploying.
4. Select the “Kafka filter to Kafka” ReadyFlow.
5. Get your user ID from your profile; it is usually the first part of your email address. For
example, if the email is tim@sparkdeveloper.com, the user ID is tim. This is your
“Workload User Name”, which you will need for several things, so remember it.
6. You already created a new topic to receive data in the setup section:
<<replace_with_userid>>_syslog_critical (Ex: tim_syslog_critical).
7. Click on “Create New Draft” to open the ReadyFlow in the Designer
with the name youruserid_kafkafilterkafka, for example tim_kafkafilterkafka. If your
name has periods, underscores, or other non-alphanumeric characters, just leave those
out. Select a workspace from the dropdown; you should only have one
available.
8. Start a Test Session by either clicking on the start a test session link in the banner or
going to Flow Options and selecting Start in the Test Session section.
9. In the Test Session creation wizard, select the latest NiFi version and click Start Test
Session. Leave the other options at their default values. Notice how the status at the top
now says “Initializing Test Session”.
Info:
Test Sessions provision infrastructure on the fly and allow you to start and stop individual
processors and send data through your flow. By running data through processors step by step
and using the data viewer as needed, you are able to validate your processing logic during
development in an iterative way, without having to treat your entire data flow as one deployable
unit.
1.2 Modifying the flow to read syslog data
The flow consists of three processors and looks very promising for our use case. The first
processor reads data from a Kafka topic, the second processor allows us to filter the events
before the third processor writes the filtered events to another Kafka topic.
All we have to do now to reach our goal is to customize its configuration to our use case.
1. Provide values for predefined parameters
a. Navigate to Flow Options→ Parameters
b. Some settings are already defined as parameters; for the others, you can set the
values manually. Make sure you create a parameter for the Group Id.
c. Configure the following parameters:
● CDP Workload User: <Your own workload user ID that you saved when you configured your workload password>
● CDP Workload User Password: <Your own workload user password you configured in the earlier step>
● Filter Rule: SELECT * FROM FLOWFILE WHERE severity <= 2
● Data Input Format: AVRO
● Data Output Format: JSON
● Kafka Consumer Group ID (used by ConsumeFromKafka): <<replace_with_userid>>_cdf (Ex: tim_cdf)
● Group ID (used by ConsumeFromKafka): <<replace_with_userid>>_cdf (Ex: tim_cdf)
● Kafka Broker Endpoint (comma-separated list of Kafka Broker addresses):
oss-kafka-demo-corebroker2.oss-demo.qsm5-opic.cloudera.site:9093,
oss-kafka-demo-corebroker1.oss-demo.qsm5-opic.cloudera.site:9093,
oss-kafka-demo-corebroker0.oss-demo.qsm5-opic.cloudera.site:9093
● Kafka Destination Topic (must be unique): <<replace_with_userid>>_syslog_critical (Ex: tim_syslog_critical)
● Kafka Producer ID (must be unique): <<replace_with_userid>>_cdf_producer1 (Ex: tim_cdf_producer1)
● Kafka Source Topic: syslog_avro
● Schema Name: syslog
● Schema Registry Hostname (hostname from the Kafka cluster): oss-kafka-demo-master0.oss-demo.qsm5-opic.cloudera.site
d. Click Apply Changes to save the parameter values
e. If confirmation is requested, Click Ok.
2. Start Controller Services
a. Navigate to Flow Options → Services
b. Select the CDP_Schema_Registry service and click the Enable Service and Referencing
Components action. If it does not enable, there may be an error or an extra space
in one of the parameters; for example, AVRO must not contain a new line or blank
spaces. The first thing to try if you have an issue is to stop the Design
environment and then restart the test session. Check the Tips guide for more
help, or contact us on bestinflow.slack.com.
c. Start from the top of the list and enable all remaining Controller services
d. Make sure all services have been enabled. You may need to reload the page or
try it in a new tab.
3. If your processors have all started because you started your controller services, it is
best to stop them all by right-clicking on each one and clicking ‘Stop’, and then start them
one at a time so you can follow the process more easily. Start the ConsumeFromKafka
processor using the right-click action menu or the Start button in the configuration
drawer.
After starting the processor, you should see events starting to queue up in the
success_ConsumeFromKafka-FilterEvents connection.
4. Verify data being consumed from Kafka
a. Right-click on the success_ConsumeFromKafka-FilterEvents connection and
select List Queue
Info:
The List Queue interface shows you all flow files that are being queued in this
connection. Click on a flow file to see its metadata in the form of attributes. In our
use case, the attributes tell us a lot about the Kafka source from which we are
consuming the data. Attributes change depending on the source you’re working
with and can also be used to store additional metadata that you generate in your
flow.
b. Select any flow file in the queue and click the book icon to open it in the Data
Viewer
Info: The Data Viewer displays the content of the selected flow file and shows
you the events that we have received from Kafka. It automatically detects the
data format - in this case JSON - and presents it in human readable format.
c. Scroll through the content and note how we are receiving syslog events with
varying severity.
5. Define filter rule to filter out low severity events
a. Return to the Flow Designer by closing the Data Viewer tab and clicking Back To
Flow Designer in the List Queue screen.
b. Select the Filter Events processor on the canvas. We are using a QueryRecord
processor to filter out low severity events. The QueryRecord processor is very
flexible and can run several filtering or routing rules at once (see the example after step 7).
c. In the configuration drawer, scroll down until you see the filtered_events property.
We are going to use this property to filter out the events. Click on the menu at the
end of the row and select Go To Parameter.
d. If you wish to change this, you can change the Parameter value.
e. Click Apply Changes to update the parameter value. Return back to the Flow
Designer
f. Start the Filter Events processor using the right-click menu or the Start icon in the
configuration drawer.
6. Verify that the filter rule works
a. After starting the Filter Events processor, flow files will start queueing up in the
filtered_events-FilterEvents-WriteToKafka connection
b. Right click the filtered_events-FilterEvents-WriteToKafka connection and select
List Queue.
c. Select a few random flow files and open them in the Data Viewer to verify that
only events with severity <=2 are present.
d. Navigate back to the Flow Designer canvas.
7. Write the filtered events to the Kafka alerts topic
Now all that is left is to start the WriteToKafka processor to write our filtered high-severity
events to the syslog_critical Kafka topic.
a. Select the WriteToKafka processor and explore its properties in the configuration
drawer.
b. Note how we are plugging in many of our parameters to configure this processor.
Values like Kafka Brokers, Topic Name, Username, Password and the Record
Writer have all been parameterized and use the values that we provided in the
very beginning.
c. Start the WriteToKafka processor using the right-click menu or the Start icon in
the configuration drawer.
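As noted in step 5, QueryRecord can evaluate several rules at once: each dynamic property you add to the Filter Events processor becomes its own outbound relationship. A minimal sketch of what an additional, hypothetical routing rule could look like (the property name warning_events and the severity range are illustrative and not part of this ReadyFlow):

-- Hypothetical extra dynamic property "warning_events" on the Filter Events (QueryRecord) processor;
-- records matching this query would be routed to a new warning_events relationship.
SELECT * FROM FLOWFILE WHERE severity > 2 AND severity <= 4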
Congratulations! You have successfully customized this ReadyFlow and achieved your goal of
sending critical alerts to a dedicated topic! Now that you are done with developing your flow, it is
time to deploy it in production!
1.3 Publishing your flow to the catalog
1. Stop the Test Session
a. Click the toggle next to Active Test Session to stop your Test Session
b. Click “End” in the dialog to confirm. The Test Session is now stopping and
allocated resources are being released
2. Publish your modified flow to the Catalog
a. Open the “Flow Options” menu at the top
b. Click “Publish” to make your modified flow available in the Catalog
c. Prefix your username to the Flow Name and provide a Flow Description. Click
Publish.
d. You are now redirected to your published flow definition in the Catalog.
Info: The Catalog is the central repository for all your deployable flow definitions.
From here you can create auto-scaling deployments from any version or create
new drafts and update your flow processing logic to create new versions of your
flow.
1.4 Creating an auto-scaling flow deployment
1. As soon as you publish your flow, it should take you to the Catalog. If it does not,
locate your flow definition in the Catalog.
a. Make sure you have navigated to the Catalog
b. If you have closed the sidebar, search for your published flow (<<yourid>>…) using the
search bar in the Catalog. Click on the flow definition that matches the name you
gave it earlier.
c. After opening the side panel, click Deploy, select the available environment from
the drop down menu and click Continue to start the Deployment Wizard.
d. If you have any issues, log out, close and restart your browser, try an
incognito window, and log in again. Also see the “Best Practices Guide”.
2. Complete the Deployment Wizard
The Deployment Wizard guides you through a six step process to create a flow
deployment. Throughout the six steps you will choose the NiFi configuration of your flow,
provide parameters and define KPIs. At the end of the process, you are able to generate
a CLI command to automate future deployments.
Note: The Deployment name is capped at 27 characters, which needs to be considered as
you write the prod name.
a. Provide a name such as <<your_username>>_kafkatokafka_prod to indicate the
use case and that you are deploying a production flow. Click Next.
b. The NiFi Configuration screen allows you to customize the runtime that will
execute your flow. You have the opportunity to pick from various released NiFi
versions.
Select the Latest Version and make sure Automatically start flow upon successful
deployment is checked.
Click Next.
c. The Parameters step is where you provide values for all the parameters that you
defined in your flow. In this example, you should recognize many of the prefilled
values from the previous exercise - including the Filter Rule and our Kafka
Source and Kafka Destination Topics.
To advance, you have to provide values for all parameters. Select the No Value
option to only display parameters without default values.
You should now only see one parameter - the CDP Workload User Password
parameter which is sensitive. Sensitive parameter values are removed when you
publish a flow to the catalog to make sure passwords don’t leak.
Provide your CDP Workload User Password and click Next to continue.
d. The Sizing & Scaling step lets you choose the resources that you want to
allocate for this deployment. You can choose from several node configurations
and turn on Auto-Scaling.
Let’s choose the Extra Small Node Size and turn on Auto-Scaling from 1-3
nodes. Click Next to advance.
e. The Key Performance Indicators (KPI) step allows you to monitor flow
performance. You can create KPIs for overall flow performance metrics or
in-depth processor or connection metrics.
Add the following KPI
● KPI Scope: Entire Flow
● Metric to Track: Data Out
● Alerts:
○ Trigger alert when metric is less than: 1 MB/sec
○ Alert will be triggered when metrics is outside the boundary(s) for:
1 Minute
Add the following KPI
● KPI Scope: Processor
● Processor Name: ConsumeFromKafka
● Metric to Track: Bytes Received
● Alerts:
○ Trigger alert when metric is less than: 512 KBytes/sec
○ Alert will be triggered when metrics is outside the boundary(s) for:
30 seconds
Review the KPIs and click Next.
f. In the Review page, review your deployment details.
Notice that on this page there is a >_ View CLI Command link. You will use the
information on this page in the next section to deploy a flow using the CLI. For now
you just need to save the script and dependencies provided there:
i. Click on the >_ View CLI Command link and familiarize yourself with the
content.
ii. Download the 2 JSON dependency files by clicking on the download button:
1. Flow Deployment Parameters JSON
2. Flow Deployment KPIs JSON
iii. Copy the command at the end of this page and save that in a file called
deploy.sh
iv. Close the Equivalent CDP CLI Command tab.
g. Click Deploy to initiate the flow deployment!
h. You are redirected to the Deployment Dashboard where you can monitor the
progress of your deployment. Creating the deployment should only take a few
minutes.
i. Congratulations! Your flow deployment has been created and is already
processing Syslog events!
Please wait until your application has finished Deploying and Importing Flow. Wait for it to report Good Health.
1.5 Monitoring your flow deployment
1. Notice how the dashboard shows you the data rates at which a deployment currently
receives and sends data. The data is also visualized in a graph that shows the two
metrics over time.
2. Change the Metrics Window setting at the top right. You can visualize as much as 1 Day.
3. Click on the yourid_kafkafilterkafka_prod deployment. The side panel opens and
shows more detail about the deployment. On the KPIs tab it will show information about
the KPIs that you created when deploying the flow.
Using the two KPIs Bytes Received and Data Out we can observe that our flow is
filtering out data as expected since it reads more than it sends out.
Wait a few minutes so that some data and metrics can be generated.
4. Switch to the System Metrics tab where you can observe the current CPU utilization rate
for the deployment. Our flow is not doing a lot of heavy transformation, so it should hover
around at ~10% CPU usage.
5. Close the side panel by clicking anywhere on the Dashboard.
6. Notice how your yourid_kafkafilterkafka_prod deployment shows Concerning Health
status. Hover over the warning icon and click View Details.
7. You will be redirected to the Alerts tab of the deployment. Here you get an overview of
active and past alerts and events. Expand the Active Alert to learn more about its cause.
After expanding the alert, it is clear that it is caused by a KPI threshold breach for
sending less than 1MB/s to external systems as defined earlier when you created the
deployment.
1.6 Managing your flow deployment
1. Click on the yourid_kafkafilterkafka_prod deployment in the Dashboard. In the side panel,
click Manage Deployment at the top right.
2. You are now being redirected to the Deployment Manager. The Deployment Manager
allows you to reconfigure the deployment and modify KPIs, modify the number of NiFi
nodes or turn auto-scaling on/off or update parameter values.
3. Explore NiFi UI for deployment. Click the Actions menu and click on View in NiFi.
4. You are being redirected to the NiFi cluster running the flow deployment. You can use
this view for in-depth troubleshooting. Users can have read-only or read/write
permissions to the flow deployment.
2. Writing critical syslog events to Apache Iceberg for analysis
A few weeks have passed since you built your data flow with DataFlow Designer to filter
out critical syslog events to a dedicated Kafka topic. Now that everyone has better
visibility into real-time health, management wants to do historical analysis on the data.
Your company is evaluating Apache Iceberg to build an open data lakehouse and you
are tasked with building a flow that ingests the most critical syslog events into an Iceberg
table.
Ensure your table is built and accessible.
Create an Apache Iceberg Table
1. From the Home page, click Data Hub Clusters. Navigate to oss-kudu-demo in
the Data Hubs list.
2. Navigate to Hue from the Kudu Data Hub.
3. Inside of Hue you can now create your table. You will have your own database to work
with. To get to your database, click on the ‘<’ icon next to default database. You should
see your specific database in the format: <YourEmailWithUnderscores>_db. Click on
your database to go to the SQL Editor.
4. Create your Apache Iceberg table with the SQL below, clicking the play icon to
execute the SQL query. Note that the table name must be prefixed with your Workload
User Name (userid).
CREATE TABLE <<userid>>_syslog_critical_archive
(priority int, severity int, facility int, version int, event_timestamp bigint, hostname string,
body string, appName string, procid string, messageid string,
structureddata struct<sdid:struct<eventid:string,eventsource:string,iut:string>>)
STORED BY ICEBERG;
5. Once you have sent data to your table, you can query it.
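For example, once data has been ingested, you can run queries like the following in Hue to verify the table. This is a minimal sketch, assuming the <<userid>>-prefixed table name created above:

-- Count how many critical events have been archived so far
SELECT COUNT(*) FROM <<userid>>_syslog_critical_archive;

-- Inspect the most recent events
SELECT severity, hostname, appname, event_timestamp
FROM <<userid>>_syslog_critical_archive
ORDER BY event_timestamp DESC
LIMIT 10;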
Additional Documentation
● Create a Table
● Query a Table
● Apache Iceberg Table Properties
2.1 Open ReadyFlow & start Test Session
1. Navigate to DataFlow from the Home Page
2. Navigate to the ReadyFlow Gallery
3. Explore the ReadyFlow Gallery
4. Search for the “Kafka to Iceberg” ReadyFlow.
5. Click on “Create New Draft” to open the ReadyFlow in the Designer, named
yourid_kafkatoiceberg (Ex: tim_kafkatoiceberg).
6. Start a Test Session by either clicking on the start a test session link in the banner or
going to Flow Options and selecting Start in the Test Session section.
7. In the Test Session creation wizard, select the latest NiFi version and click Start Test
Session. Notice how the status at the top now says “Initializing Test Session”.
2.2 Modifying the flow to read syslog data
The flow consists of three processors and looks very promising for our use case. The first
processor reads data from a Kafka topic, the second processor gives us the option to batch up
events and create larger files which are then written out to Iceberg by the PutIceberg processor.
All we have to do now to reach our goal is to customize its configuration to our use case.
1. Provide values for predefined parameters
a. Navigate to Flow Options→ Parameters
b. Select all parameters that show No value set and provide the following values:
● CDP Workload User: <Your own workload user name>
● CDP Workload User Password: <Your own workload user password>
● Data Input Format (this flow supports AVRO, JSON and CSV): JSON
● Hive Catalog Namespace: <YourEmailWithUnderscores>_db
● Iceberg Table Name: <<replace_with_userid>>_syslog_critical_archive
● Kafka Broker Endpoint (comma-separated list of Kafka Broker addresses):
oss-kafka-demo-corebroker2.oss-demo.qsm5-opic.cloudera.site:9093,
oss-kafka-demo-corebroker1.oss-demo.qsm5-opic.cloudera.site:9093,
oss-kafka-demo-corebroker0.oss-demo.qsm5-opic.cloudera.site:9093
● Kafka Consumer Group Id: <<replace_with_userid>>_cdf (Ex: tim_cdf)
● Kafka Source Topic: <<replace_with_userid>>_syslog_critical (Ex: tim_syslog_critical)
● Schema Name: syslog
● Schema Registry Hostname: oss-kafka-demo-master0.oss-demo.qsm5-opic.cloudera.site
c. Click Apply Changes to save the parameter values
2. Start Controller Services
a. Navigate to Flow Options → Services
b. Select CDP_Schema_Registry service and click Enable Service and Referencing
Components action
c. Start from the top of the list and enable all remaining Controller services including
KerberosPasswordUserService, HiveCatalogService, AvroReader, …
d. Click Ok if confirmation is asked.
e. Make sure all services have been enabled
3. Start the ConsumeFromKafka processor using the right click action menu or the Start
button in the configuration drawer. It might already be started.
After starting the processor, you should see events starting to queue up in the
success_ConsumeFromKafka-FilterEvents connection.
NOTE:
To receive data from your topic, you will need either the first deployment to still be running, or
you will need to run the first flow from another Flow Designer Test Session.
2.3 Changing the flow to modify the schema for Iceberg integration
Our data warehouse team has created an Iceberg table into which they want us to ingest the critical
syslog data. A challenge we are facing is that not all column names in the Iceberg table
match our syslog record schema, so we have to add functionality to our flow that allows us to
change the schema of our syslog records. To do this, we will be using the JoltTransformRecord
processor.
1. Add a new JoltTransformRecord to the canvas by dragging the processor icon to the
canvas.
2. In the Add Processor window, select the JoltTransformRecord type and name the
processor TransformSchema.
3. Validate that your new processor now appears on the canvas.
4. Create connections from ConsumeFromKafka to TransformSchema by hovering over the
ConsumeFromKafka processor and dragging the arrow that appears to
TransformSchema. Pick the success relationship to connect.
Now connect the success relationship of TransformSchema to the MergeRecords
processor.
5. Now that we have connected our new TransformSchema processor, we can delete the
original connection between ConsumeFromKafka and MergeRecords.
Make sure that the ConsumeFromKafka processor is stopped. Then select the
connection, empty the queue if needed, and then delete it. Now all syslog events that
we receive will go through the TransformSchema processor.
6. To make sure that our schema transformation works, we have to create a new Record
Writer Service and use it as the Record Writer for the TransformSchema processor.
Select the TransformSchema processor and open the configuration panel. Scroll to the
Properties section, click the three dot menu in the Record Writer row and select Add
Service to create a new Record Writer.
7. Select AvroRecordSetWriter, name it TransformedSchemaWriter and click Add.
Click Apply in the configuration panel to save your changes.
8. Now click the three dot menu again and select Go To Service to configure our new Avro
Record Writer.
9. To configure our new Avro Record Writer, provide the following values:
● Schema Write Strategy (whether/how CDF should write schema information): Embed Avro Schema
● Schema Access Strategy (how CDF identifies the schema to apply): Use ‘Schema Name’ Property
● Schema Registry (the Schema Registry that stores our schema): CDP_Schema_Registry
● Schema Name (the schema name to look up in the Schema Registry): syslog_transformed
10. Convert the value that you provided for Schema Name into a parameter. Click on the
three dot menu in the Schema Name row and select Convert To Parameter.
11. Give the parameter the name Schema Name Transformed and click “add”. You have
now created a new parameter from a value that can be used in more places in your data
flow.
12. Apply your configuration changes and Enable the Service by clicking the power icon.
Now that you have configured our new Schema Writer, we can return to the Flow
Designer canvas.
If you have any issues, end the test session and restart it. If your login timed out, close
your browser and log in again.
13. Click Back to Flow Designer to navigate back to the canvas.
14. Select TransformSchema to configure it and provide the following values:
● Record Reader (service used to parse incoming events): AvroReader
● Record Writer (service used to format outgoing events): TransformedSchemaWriter
● Jolt Specification (describes how to modify the incoming JSON data; we are standardizing on lower case field names and renaming the timestamp field to event_timestamp):
[
  {
    "operation": "shift",
    "spec": {
      "appName": "appname",
      "timestamp": "event_timestamp",
      "structuredData": {
        "SDID": {
          "eventId": "structureddata.sdid.eventid",
          "eventSource": "structureddata.sdid.eventsource",
          "iut": "structureddata.sdid.iut"
        }
      },
      "*": {
        "@": "&"
      }
    }
  }
]
15. Scroll to Relationships, select Terminate for the failure and original relationships, and
click Apply.
16. Start your ConsumeFromKafka and TransformSchema processor and validate that the
transformed data matches our Iceberg table schema.
17. Once events are queuing up in the connection between TransformSchema and
MergeRecord, right click the connection and select List Queue.
18. Select any of the queued files and select the book icon to open it in the Data Viewer
19. Notice how all field names have been transformed to lower case and how the timestamp
field has been renamed to event_timestamp.
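For reference, this is the intended effect of the Jolt shift specification from step 14 on a single record. The sample values below are purely illustrative and abridged; only the structure matters.

Before TransformSchema:
{
  "severity": 1,
  "hostname": "host-1.example.com",
  "appName": "router",
  "timestamp": 1681837375000,
  "structuredData": {
    "SDID": {
      "eventId": "123",
      "eventSource": "syslog",
      "iut": "4"
    }
  }
}

After TransformSchema:
{
  "severity": 1,
  "hostname": "host-1.example.com",
  "appname": "router",
  "event_timestamp": 1681837375000,
  "structureddata": {
    "sdid": {
      "eventid": "123",
      "eventsource": "syslog",
      "iut": "4"
    }
  }
}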
2.4 Merging records and start writing to Iceberg
Now that we have verified that our schema is being transformed as needed, it’s time to start the
remaining processors and write our events into the Iceberg table. The MergeRecords processor is
configured to batch events up to increase efficiency when writing to Iceberg. The final processor,
WriteToIceberg takes our Avro records and writes them into a Parquet formatted table.
1. Tip: You can change the MergeRecords time configuration to something like “30 sec” to speed up
processing.
2. Select the MergeRecords processor and explore its configuration. It is configured to
batch events up for at least 30 seconds or until the queued up events have reached
Maximum Bin Size of 1GB. You will want to lower these for testing.
3. Start the MergeRecords processor and verify that it batches up events and writes them
out after 30 seconds.
4. Select the WriteToIceberg processor and explore its configuration. Notice how it relies on
several parameters to establish a connection to the right database and table.
5. Start the WriteToIceberg processor and verify that it writes records successfully to
Iceberg. If the metrics on the processor increase and you don’t see any warnings or
events being written to the failure_WriteToIceberg connection, your writes are
successful!
Congratulations! With this you have completed the second use case.
You may want to log into Hue to check that your data has loaded.
Feel free to publish your flow to the catalog and create a deployment just like you did for
the first one.
3. Resize image flow deployed as serverless function
DataFlow Functions provides a new, efficient way to run your event-driven Apache NiFi data
flows. You can have your flow executed within AWS Lambda, Azure Functions or Google Cloud
Functions and define the trigger that should start its execution.
DataFlow Functions is perfect for use cases such as:
- Processing files as soon as they land in the cloud provider object store
- Creating microservices over HTTPS
- CRON driven use cases
- etc
In this use case, we will be deploying a NiFi flow that will be triggered by HTTPS requests to
resize images. Once deployed, the cloud provider will provide an HTTPS endpoint that you will be
able to call to send an image; this triggers the NiFi flow, which returns a resized image based
on your parameters.
The deployment of the flow as a function will have to be done within your cloud provider.
The below tutorial will use AWS as the cloud provider. If you’re using Azure or Google Cloud,
you can still refer to this documentation to deploy the flow as a function.
3.1 Designing the flow for AWS Lambda
1. Go into Cloudera DataFlow / Flow Design and create a new draft with a name of your
choice.
2. Drag and drop an Input Port named input onto the canvas. When triggered, AWS
Lambda is going to inject into that input port a FlowFile containing the information about
the HTTPS call that has been made.
Example of payload that will be injected by AWS Lambda as a FlowFile:
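A minimal, illustrative sketch of the shape of that payload (abridged; real AWS Lambda function URL events include additional fields, and the values shown here are placeholders):

{
  "headers": {
    "resize-height": "200",
    "resize-width": "200"
  },
  "body": "iVBORw0KGgo...",
  "isBase64Encoded": true
}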
3. Drag and drop an EvaluateJsonPath processor and call it ExtractHTTPHeaders. We are going
to use this to extract the HTTP headers that we want to keep in our flow. Add two
properties configured as below. This will save the HTTP headers (resize-height and
resize-width), which we will add when making a call with our image to specify the
dimensions of the resized image, as FlowFile attributes.
resizeHeight => $.headers.resize-height
resizeWidth => $.headers.resize-width
Note: don’t forget to change Destination to “flowfile-attribute” and click Apply.
4. Drag and drop another EvaluateJsonPath processor and change its name to a
unique one. This one will be used to retrieve the content of the body field from the
payload we received and use it as the new content of the FlowFile. This field contains
the Base64-encoded representation of the image we have been sending over HTTP.
body => $.body
5. Drag and drop a Base64EncodeContent processor and change the mode to Decode.
This will Base64 decode the content of the FlowFile to retrieve its binary format.
6. Drag and drop a ResizeImage processor. Use the previously created FlowFile attributes
to specify the new dimensions of the image. Also, specify true for maintaining the ratio.
7. Drag and drop a Base64EncodeContent processor. To send back the resized image to
the user, AWS Lambda expects us to send back a specific JSON payload with the Base
64 encoding of the image.
8. Drag and drop a ReplaceText processor. We use it to extract the Base64 representation
of the resized image and add it to the expected JSON payload. Add the JSON below in
“Replacement Value” and change “Evaluation Mode” to “Entire text”.
{
"statusCode": 200,
"headers": { "Content-Type": "image/png" },
"isBase64Encoded": true,
"body": "$1"
}
9. Drag and drop an output port.
10. Connect all the components together; you can auto-terminate the unused relationships.
This should look like this:
You can now publish the flow into the DataFlow Catalog in the Flow Options menu:
Make sure to give it a name that is unique (you can prefix it with your name):
Once the flow is published, make sure to copy the CRN of the published version (it will end with
/v.1):
3.2 Deploying the flow as a function in AWS Lambda
First things first: go into DataFlow Functions and download the binary for running DataFlow
Functions in AWS Lambda.
This should download a binary with a name similar to:
naaf-aws-lambda-1.0.0.2.3.7.0-100-bin.zip
Once you have the binary, make sure you also have:
● The CRN of the flow you published in the DataFlow Catalog
● The Access Key that has been provided with these instructions in “Competition
Resources” section
● The Private Key that has been provided with these instructions in “Competition
Resources” section
In order to speed up the deployment, we are going to leverage some scripts to automate it. This
assumes that the AWS CLI is properly configured locally on your laptop and that the jq command
is available for reading JSON payloads. You can now follow the instructions from
this page here.
However, if you wish to deploy the flow in AWS Lambda manually through the AWS UI, you can
follow the steps described here.

Transform Your Communication with Cloud-Based IVR SolutionsTransform Your Communication with Cloud-Based IVR Solutions
Transform Your Communication with Cloud-Based IVR Solutions
 
UI5con 2024 - Boost Your Development Experience with UI5 Tooling Extensions
UI5con 2024 - Boost Your Development Experience with UI5 Tooling ExtensionsUI5con 2024 - Boost Your Development Experience with UI5 Tooling Extensions
UI5con 2024 - Boost Your Development Experience with UI5 Tooling Extensions
 
Measures in SQL (SIGMOD 2024, Santiago, Chile)
Measures in SQL (SIGMOD 2024, Santiago, Chile)Measures in SQL (SIGMOD 2024, Santiago, Chile)
Measures in SQL (SIGMOD 2024, Santiago, Chile)
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
 
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdf
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdfAutomated software refactoring with OpenRewrite and Generative AI.pptx.pdf
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdf
 
What is Augmented Reality Image Tracking
What is Augmented Reality Image TrackingWhat is Augmented Reality Image Tracking
What is Augmented Reality Image Tracking
 
UI5con 2024 - Keynote: Latest News about UI5 and it’s Ecosystem
UI5con 2024 - Keynote: Latest News about UI5 and it’s EcosystemUI5con 2024 - Keynote: Latest News about UI5 and it’s Ecosystem
UI5con 2024 - Keynote: Latest News about UI5 and it’s Ecosystem
 
LORRAINE ANDREI_LEQUIGAN_HOW TO USE WHATSAPP.pptx
LORRAINE ANDREI_LEQUIGAN_HOW TO USE WHATSAPP.pptxLORRAINE ANDREI_LEQUIGAN_HOW TO USE WHATSAPP.pptx
LORRAINE ANDREI_LEQUIGAN_HOW TO USE WHATSAPP.pptx
 
OpenMetadata Community Meeting - 5th June 2024
OpenMetadata Community Meeting - 5th June 2024OpenMetadata Community Meeting - 5th June 2024
OpenMetadata Community Meeting - 5th June 2024
 

BestInFlowCompetitionTutorials03May2023

8. Enter the name of your topic prefixed with your Workload User Name, ex: <<replace_with_userid>>_syslog_critical.
9. For the settings, create the topic with 3 partitions, cleanup.policy: delete, and maximum availability.
10. After successfully creating the topic, close the tab that opened when you navigated to Streams Messaging Manager.
Congratulations! You have built a new topic.
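If you want to double-check the topic outside of SMM, you can point a command-line consumer at it. This is only a sketch: it assumes you have the Kafka client tools available and a client.properties file configured for SASL_SSL with your workload user name and password; adjust it to your environment, or simply verify the topic in SMM instead.
kafka-console-consumer \
  --bootstrap-server oss-kafka-demo-corebroker1.oss-demo.qsm5-opic.cloudera.site:9093 \
  --topic <<replace_with_userid>>_syslog_critical \
  --from-beginning \
  --consumer.config client.properties
The topic will stay empty until a flow writes to it; you will do that in the first use case below.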
1. Reading and filtering a stream of syslog data
You have been tasked with filtering a noisy stream of syslog events which are available in a Kafka topic. The goal is to identify critical events and write them to the Kafka topic you just created. Related documentation is here.
1.1 Open ReadyFlow & start Test Session
1. Navigate to DataFlow from the Home Page
2. Navigate to the ReadyFlow Gallery
3. Explore the ReadyFlow Gallery
Info: The ReadyFlow Gallery is where you can find out-of-the-box templates for common data movement use cases. You can directly create deployments from a ReadyFlow or create new drafts and modify the processing logic according to your needs before deploying.
4. Select the “Kafka filter to Kafka” ReadyFlow.
5. Get your user id from your profile. It is usually the first part of your email address; for example, the email tim@sparkdeveloper.com gives the user id tim. This is your “Workload User Name”, which you will need for several things, so keep it handy.
6. You already created a new topic to receive data in the setup section: <<replace_with_userid>>_syslog_critical, ex: tim_syslog_critical.
7. Click “Create New Draft” to open the ReadyFlow in the Designer with the name youruserid_kafkafilterkafka, for example tim_kafkafilterkafka. If your user name contains periods, underscores or other non-alphanumeric characters, leave those out. Select from the available workspaces in the dropdown; you should only have one available.
8. Start a Test Session by either clicking the start a test session link in the banner or going to Flow Options and selecting Start in the Test Session section.
9. In the Test Session creation wizard, select the latest NiFi version and click Start Test Session. Leave the other options at their default values. Notice how the status at the top now says “Initializing Test Session”.
Info: Test Sessions provision infrastructure on the fly and allow you to start and stop individual processors and send data through your flow. By running data through processors step by step and using the data viewer as needed, you are able to validate your processing logic during development in an iterative way without having to treat your entire data flow as one deployable unit.
1.2 Modifying the flow to read syslog data
The flow consists of three processors and looks very promising for our use case. The first processor reads data from a Kafka topic, the second processor allows us to filter the events, and the third processor writes the filtered events to another Kafka topic. All we have to do now to reach our goal is to customize the flow’s configuration to our use case.
1. Provide values for predefined parameters
a. Navigate to Flow Options → Parameters
b. Some settings are already defined as parameters; for those that are not, you can set them manually. Make sure you create a parameter for the Group Id.
c. Configure the following parameters:
● CDP Workload User: <Your own workload user ID that you saved when you configured your workload password>
● CDP Workload User Password: <Your own workload user password you configured in the earlier step>
● Filter Rule: SELECT * FROM FLOWFILE WHERE severity <= 2
● Data Input Format: AVRO
● Data Output Format: JSON
● Kafka Consumer Group ID (ConsumeFromKafka): <<replace_with_userid>>_cdf, ex: tim_cdf
● Group ID (ConsumeFromKafka): <<replace_with_userid>>_cdf, ex: tim_cdf
● Kafka Broker Endpoint (comma-separated list of Kafka Broker addresses): oss-kafka-demo-corebroker2.oss-demo.qsm5-opic.cloudera.site:9093, oss-kafka-demo-corebroker1.oss-demo.qsm5-opic.cloudera.site:9093, oss-kafka-demo-corebroker0.oss-demo.qsm5-opic.cloudera.site:9093
● Kafka Destination Topic (must be unique): <<replace_with_userid>>_syslog_critical, ex: tim_syslog_critical
● Kafka Producer ID (must be unique): <<replace_with_userid>>_cdf_producer1, ex: tim_cdf_producer1
● Kafka Source Topic: syslog_avro
● Schema Name: syslog
● Schema Registry Hostname (hostname from the Kafka cluster): oss-kafka-demo-master0.oss-demo.qsm5-opic.cloudera.site
d. Click Apply Changes to save the parameter values.
e. If confirmation is requested, click Ok.
2. Start Controller Services
a. Navigate to Flow Options → Services
b. Select the CDP_Schema_Registry service and click the Enable Service and Referencing Components action. If the service will not enable, the cause is often a typo or an extra space in one of the parameters; for example, the value AVRO must not contain a newline or blank spaces. The first thing to try if you have an issue is to stop the Designer environment and then restart the test session. Check the Tips guide for more help or contact us on bestinflow.slack.com.
c. Start from the top of the list and enable all remaining Controller Services.
d. Make sure all services have been enabled. You may need to reload the page or try it in a new tab.
3. If your processors all started when you enabled your controller services, it is best to stop them all by right-clicking each one and clicking Stop, then start them one at a time so you can follow the process more easily. Start the ConsumeFromKafka processor using the right-click action menu or the Start button in the configuration drawer. After starting the processor, you should see events starting to queue up in the success_ConsumeFromKafka-FilterEvents connection.
4. Verify data being consumed from Kafka
a. Right-click on the success_ConsumeFromKafka-FilterEvents connection and select List Queue.
Info: The List Queue interface shows you all flow files that are being queued in this connection. Click on a flow file to see its metadata in the form of attributes. In our use case, the attributes tell us a lot about the Kafka source from which we are consuming the data. Attributes change depending on the source you’re working with and can also be used to store additional metadata that you generate in your flow.
b. Select any flow file in the queue and click the book icon to open it in the Data Viewer.
Info: The Data Viewer displays the content of the selected flow file and shows you the events that we have received from Kafka. It automatically detects the data format - in this case JSON - and presents it in a human-readable format.
c. Scroll through the content and note how we are receiving syslog events with varying severity.
5. Define the filter rule to filter out low severity events
a. Return to the Flow Designer by closing the Data Viewer tab and clicking Back To Flow Designer in the List Queue screen.
b. Select the Filter Events processor on the canvas. We are using a QueryRecord processor to filter out low severity events. The QueryRecord processor is very flexible and can run several filtering or routing rules at once (see the sketch after this step for an example of an additional rule).
c. In the configuration drawer, scroll down until you see the filtered_events property. We are going to use this property to filter the events. Click the menu at the end of the row and select Go To Parameter.
d. If you wish to change the filter, you can change the Parameter value.
e. Click Apply Changes to update the parameter value, then return to the Flow Designer.
f. Start the Filter Events processor using the right-click menu or the Start icon in the configuration drawer.
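To illustrate how QueryRecord routing works: each dynamic property on the processor holds one SQL statement and becomes its own outgoing relationship that routes the matching records. The second property below is a hypothetical example only (the name warning_events and its rule are not part of this tutorial; the field names come from the syslog schema used here):
filtered_events => SELECT * FROM FLOWFILE WHERE severity <= 2
warning_events => SELECT * FROM FLOWFILE WHERE severity = 4 AND hostname LIKE 'web%'
In this ReadyFlow only filtered_events is used, and its value is supplied by the Filter Rule parameter you configured earlier.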
6. Verify that the filter rule works
a. After starting the Filter Events processor, flow files will start queueing up in the filtered_events-FilterEvents-WriteToKafka connection.
b. Right-click the filtered_events-FilterEvents-WriteToKafka connection and select List Queue.
c. Select a few random flow files and open them in the Data Viewer to verify that only events with severity <= 2 are present.
d. Navigate back to the Flow Designer canvas.
7. Write the filtered events to the Kafka alerts topic
Now all that is left is to start the WriteToKafka processor to write our filtered high severity events to the syslog_critical Kafka topic.
a. Select the WriteToKafka processor and explore its properties in the configuration drawer.
b. Note how we are plugging in many of our parameters to configure this processor. Values like Kafka Brokers, Topic Name, Username, Password and the Record Writer have all been parameterized and use the values that we provided at the very beginning.
c. Start the WriteToKafka processor using the right-click menu or the Start icon in the configuration drawer.
Congratulations! You have successfully customized this ReadyFlow and achieved your goal of sending critical alerts to a dedicated topic! Now that you are done developing your flow, it is time to deploy it to production!
1.3 Publishing your flow to the catalog
1. Stop the Test Session
a. Click the toggle next to Active Test Session to stop your Test Session.
b. Click “End” in the dialog to confirm. The Test Session is now stopping and allocated resources are being released.
2. Publish your modified flow to the Catalog
a. Open the “Flow Options” menu at the top.
b. Click “Publish” to make your modified flow available in the Catalog.
c. Prefix your username to the Flow Name and provide a Flow Description. Click Publish.
d. You are now redirected to your published flow definition in the Catalog.
Info: The Catalog is the central repository for all your deployable flow definitions. From here you can create auto-scaling deployments from any version or create new drafts and update your flow processing logic to create new versions of your flow.
1.4 Creating an auto-scaling flow deployment
1. As soon as you publish your flow, it should take you to the Catalog. If it does not, locate your flow definition in the Catalog.
a. Make sure you have navigated to the Catalog.
b. If you have closed the sidebar, search for your published flow <<yourid>> in the search bar in the Catalog. Click on the flow definition that matches the name you gave it earlier.
c. After opening the side panel, click Deploy, select the available environment from the drop-down menu and click Continue to start the Deployment Wizard.
d. If you have any issues, log out, close your browser, restart your browser, try an incognito window and log in again. Also see the “Best Practices Guide”.
2. Complete the Deployment Wizard
The Deployment Wizard guides you through a six-step process to create a flow deployment. Throughout the six steps you will choose the NiFi configuration of your flow, provide parameters and define KPIs. At the end of the process, you are able to generate a CLI command to automate future deployments.
Note: The Deployment name has a cap of 27 characters, which needs to be considered as you write the prod name.
a. Provide a name such as <<your_username>>_kafkatokafka_prod to indicate the use case and that you are deploying a production flow. Click Next.
b. The NiFi Configuration screen allows you to customize the runtime that will execute your flow. You have the opportunity to pick from various released NiFi versions. Select the Latest Version and make sure Automatically start flow upon successful deployment is checked. Click Next.
c. The Parameters step is where you provide values for all the parameters that you defined in your flow. In this example, you should recognize many of the prefilled values from the previous exercise - including the Filter Rule and our Kafka Source and Kafka Destination Topics. To advance, you have to provide values for all parameters. Select the No Value option to only display parameters without default values. You should now only see one parameter - the CDP Workload User Password parameter, which is sensitive. Sensitive parameter values are removed when you publish a flow to the catalog to make sure passwords don’t leak. Provide your CDP Workload User Password and click Next to continue.
d. The Sizing & Scaling step lets you choose the resources that you want to allocate for this deployment. You can choose from several node configurations and turn on Auto-Scaling. Let’s choose the Extra Small Node Size and turn on Auto-Scaling from 1-3 nodes. Click Next to advance.
e. The Key Performance Indicators (KPI) step allows you to monitor flow performance. You can create KPIs for overall flow performance metrics or in-depth processor or connection metrics.
Add the following KPI:
● KPI Scope: Entire Flow
● Metric to Track: Data Out
● Alerts:
○ Trigger alert when metric is less than: 1 MB/sec
○ Alert will be triggered when metric is outside the boundary(s) for: 1 Minute
Add the following KPI:
● KPI Scope: Processor
● Processor Name: ConsumeFromKafka
● Metric to Track: Bytes Received
● Alerts:
○ Trigger alert when metric is less than: 512 KBytes/sec
○ Alert will be triggered when metric is outside the boundary(s) for: 30 seconds
Review the KPIs and click Next.
f. On the Review page, review your deployment details. Notice that this page has a >_ View CLI Command link. You will use the information on this page in the next section to deploy a flow using the CLI. For now you just need to save the script and dependencies provided there:
i. Click on the >_ View CLI Command link and familiarize yourself with the content.
ii. Download the 2 JSON dependency files by clicking the download button:
1. Flow Deployment Parameters JSON
2. Flow Deployment KPIs JSON
iii. Copy the command at the end of the page and save it in a file called deploy.sh.
iv. Close the Equivalent CDP CLI Command tab.
g. Click Deploy to initiate the flow deployment!
h. You are redirected to the Deployment Dashboard where you can monitor the progress of your deployment. Creating the deployment should only take a few minutes.
i. Congratulations! Your flow deployment has been created and is already processing Syslog events! Please wait until your application is done Deploying and Importing Flow. Wait for Good Health.
1.5 Monitoring your flow deployment
1. Notice how the dashboard shows you the data rates at which a deployment currently receives and sends data. The data is also visualized in a graph that shows the two metrics over time.
2. Change the Metrics Window setting at the top right. You can visualize as much as 1 Day.
3. Click on the yourid_kafkafilterkafka_prod deployment. The side panel opens and shows more detail about the deployment. The KPIs tab shows information about the KPIs that you created when deploying the flow. Using the two KPIs, Bytes Received and Data Out, we can observe that our flow is filtering out data as expected, since it reads more than it sends out.
Wait a few minutes so that some data and metrics can be generated.
4. Switch to the System Metrics tab, where you can observe the current CPU utilization rate for the deployment. Our flow is not doing a lot of heavy transformation, so it should hover around ~10% CPU usage.
5. Close the side panel by clicking anywhere on the Dashboard.
6. Notice how your yourid_kafkafilterkafka_prod deployment shows a Concerning Health status. Hover over the warning icon and click View Details.
7. You will be redirected to the Alerts tab of the deployment. Here you get an overview of active and past alerts and events. Expand the Active Alert to learn more about its cause. After expanding the alert, it is clear that it is caused by a KPI threshold breach for sending less than 1 MB/s to external systems, as defined earlier when you created the deployment.
1.6 Managing your flow deployment
1. Click on the yourid_kafkafilterkafka_prod deployment in the Dashboard. In the side panel, click Manage Deployment at the top right.
2. You are now redirected to the Deployment Manager. The Deployment Manager allows you to reconfigure the deployment: modify KPIs, modify the number of NiFi nodes, turn auto-scaling on/off, or update parameter values.
3. Explore the NiFi UI for the deployment. Click the Actions menu and click View in NiFi.
4. You are redirected to the NiFi cluster running the flow deployment. You can use this view for in-depth troubleshooting. Users can have read-only or read/write permissions to the flow deployment.
2. Writing critical syslog events to Apache Iceberg for analysis
A few weeks have passed since you built your data flow with DataFlow Designer to filter critical syslog events into a dedicated Kafka topic. Now that everyone has better visibility into real-time health, management wants to do historical analysis on the data. Your company is evaluating Apache Iceberg to build an open data lakehouse, and you are tasked with building a flow that ingests the most critical syslog events into an Iceberg table. Ensure your table is built and accessible.
Create an Apache Iceberg Table
1. From the Home page, click Data Hub Clusters. Navigate to oss-kudu-demo in the Data Hubs list.
2. Navigate to Hue from the Kudu Data Hub.
3. Inside Hue you can now create your table. You will have your own database to work with. To get to your database, click the ‘<’ icon next to the default database. You should see your specific database in the format <YourEmailWithUnderscores>_db. Click on your database to go to the SQL Editor.
4. Create your Apache Iceberg table with the SQL below, clicking the play icon to execute the query. Note that the table name must be prefixed with your Workload User Name (userid).
CREATE TABLE <<userid>>_syslog_critical_archive
(priority int, severity int, facility int, version int, event_timestamp bigint,
hostname string, body string, appName string, procid string, messageid string,
structureddata struct<sdid:struct<eventid:string,eventsource:string,iut:string>>)
STORED BY ICEBERG;
5. Once you have sent data to your table, you can query it (an example query is shown at the end of this use case).
Additional Documentation
● Create a Table
● Query a Table
● Apache Iceberg Table Properties
2.1 Open ReadyFlow & start Test Session
1. Navigate to DataFlow from the Home Page
2. Navigate to the ReadyFlow Gallery
3. Explore the ReadyFlow Gallery
4. Search for the “Kafka to Iceberg” ReadyFlow.
5. Click “Create New Draft” to open the ReadyFlow in the Designer, named yourid_kafkatoiceberg, ex: tim_kafkatoiceberg.
6. Start a Test Session by either clicking on the start a test session link in the banner or going to Flow Options and selecting Start in the Test Session section.
7. In the Test Session creation wizard, select the latest NiFi version and click Start Test Session. Notice how the status at the top now says “Initializing Test Session”.
2.2 Modifying the flow to read syslog data
The flow consists of three processors and looks very promising for our use case. The first processor reads data from a Kafka topic, the second processor gives us the option to batch up events and create larger files, which are then written out to Iceberg by the PutIceberg processor. All we have to do now to reach our goal is to customize its configuration to our use case.
1. Provide values for predefined parameters
a. Navigate to Flow Options → Parameters
b. Select all parameters that show No value set and provide the following values:
● CDP Workload User: <Your own workload user name>
● CDP Workload User Password: <Your own workload user password>
● Data Input Format (this flow supports AVRO, JSON and CSV): JSON
● Hive Catalog Namespace: <YourEmailWithUnderScores>_db
● Iceberg Table Name: <<replace_with_userid>>_syslog_critical_archive
● Kafka Broker Endpoint (comma-separated list of Kafka Broker addresses): oss-kafka-demo-corebroker2.oss-demo.qsm5-opic.cloudera.site:9093, oss-kafka-demo-corebroker1.oss-demo.qsm5-opic.cloudera.site:9093, oss-kafka-demo-corebroker0.oss-demo.qsm5-opic.cloudera.site:9093
● Kafka Consumer Group Id: <<replace_with_userid>>_cdf, ex: tim_cdf
● Kafka Source Topic: <<replace_with_userid>>_syslog_critical, ex: tim_syslog_critical
● Schema Name: syslog
● Schema Registry Hostname: oss-kafka-demo-master0.oss-demo.qsm5-opic.cloudera.site
c. Click Apply Changes to save the parameter values.
2. Start Controller Services
a. Navigate to Flow Options → Services
b. Select the CDP_Schema_Registry service and click the Enable Service and Referencing Components action.
c. Start from the top of the list and enable all remaining Controller Services, including KerberosPasswordUserService, HiveCatalogService, AvroReader, and so on.
d. Click Ok if confirmation is asked.
e. Make sure all services have been enabled.
3. Start the ConsumeFromKafka processor using the right-click action menu or the Start button in the configuration drawer. It might already be started. After starting the processor, you should see events starting to queue up in the success_ConsumeFromKafka-FilterEvents connection.
NOTE: To receive data from your topic, you will need either the first deployment still running or to run that flow from another Flow Designer Test Session.
2.3 Changing the flow to modify the schema for Iceberg integration
Our data warehouse team has created an Iceberg table into which they want us to ingest the critical syslog data. A challenge we are facing is that not all column names in the Iceberg table match our syslog record schema, so we have to add functionality to our flow that allows us to change the schema of our syslog records. To do this, we will be using the JoltTransformRecord processor.
1. Add a new JoltTransformRecord to the canvas by dragging the processor icon onto the canvas.
2. In the Add Processor window, select the JoltTransformRecord type and name the processor TransformSchema.
3. Validate that your new processor now appears on the canvas.
4. Create a connection from ConsumeFromKafka to TransformSchema by hovering over the ConsumeFromKafka processor and dragging the arrow that appears to TransformSchema. Pick the success relationship to connect. Now connect the success relationship of TransformSchema to the MergeRecords processor.
5. Now that we have connected our new TransformSchema processor, we can delete the original connection between ConsumeFromKafka and MergeRecords. Make sure that the ConsumeFromKafka processor is stopped. Then select the connection, empty the queue if needed, and then delete it. Now all syslog events that we receive will go through the TransformSchema processor.
6. To make sure that our schema transformation works, we have to create a new Record Writer Service and use it as the Record Writer for the TransformSchema processor. Select the TransformSchema processor and open the configuration panel. Scroll to the Properties section, click the three-dot menu in the Record Writer row and select Add Service to create a new Record Writer.
7. Select AvroRecordSetWriter, name it TransformedSchemaWriter and click Add. Click Apply in the configuration panel to save your changes.
8. Now click the three-dot menu again and select Go To Service to configure our new Avro Record Writer.
9. To configure our new Avro Record Writer, provide the following values:
● Schema Write Strategy (specify whether/how CDF should write schema information): Embed Avro Schema
● Schema Access Strategy (specify how CDF identifies the schema to apply): Use ‘Schema Name’ Property
● Schema Registry (the Schema Registry that stores our schema): CDP_Schema_Registry
● Schema Name (the schema name to look up in the Schema Registry): syslog_transformed
10. Convert the value that you provided for Schema Name into a parameter. Click on the three-dot menu in the Schema Name row and select Convert To Parameter.
11. Give the parameter the name Schema Name Transformed and click Add. You have now created a new parameter from a value, and it can be used in more places in your data flow.
12. Apply your configuration changes and Enable the Service by clicking the power icon. Now you have configured our new Schema Writer and we can return to the Flow Designer canvas. If you have any issues, end the test session and restart it. If your login timed out, close your browser and log in again.
13. Click Back to Flow Designer to navigate back to the canvas.
14. Select TransformSchema to configure it and provide the following values:
● Record Reader (service used to parse incoming events): AvroReader
● Record Writer (service used to format outgoing events): TransformedSchemaWriter
● Jolt Specification (the specification that describes how to modify the incoming JSON data; we are standardizing on lower case field names and renaming the timestamp field to event_timestamp):
[
  {
    "operation": "shift",
    "spec": {
      "appName": "appname",
      "timestamp": "event_timestamp",
      "structuredData": {
        "SDID": {
          "eventId": "structureddata.sdid.eventid",
          "eventSource": "structureddata.sdid.eventsource",
          "iut": "structureddata.sdid.iut"
        }
      },
      "*": {
        "@": "&"
      }
    }
  }
]
15. Scroll to Relationships, select Terminate for the failure and original relationships, and click Apply.
16. Start your ConsumeFromKafka and TransformSchema processors and validate that the transformed data matches our Iceberg table schema.
17. Once events are queuing up in the connection between TransformSchema and MergeRecords, right-click the connection and select List Queue.
18. Select any of the queued files and select the book icon to open it in the Data Viewer.
19. Notice how all field names have been transformed to lower case and how the timestamp field has been renamed to event_timestamp.
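To make the intended effect of the transformation concrete, here is an illustrative before/after pair. The field values are made up; the field names are taken from the syslog schema and the Iceberg table definition used in this use case.
Before TransformSchema (as consumed from Kafka):
{
  "priority": 10, "severity": 2, "facility": 1, "version": 1,
  "timestamp": 1684108800000, "hostname": "host1.example.com",
  "body": "service failed", "appName": "myapp", "procid": "1234", "messageid": "ID47",
  "structuredData": { "SDID": { "eventId": "123", "eventSource": "app", "iut": "3" } }
}
After TransformSchema:
{
  "priority": 10, "severity": 2, "facility": 1, "version": 1,
  "event_timestamp": 1684108800000, "hostname": "host1.example.com",
  "body": "service failed", "appname": "myapp", "procid": "1234", "messageid": "ID47",
  "structureddata": { "sdid": { "eventid": "123", "eventsource": "app", "iut": "3" } }
}
Fields that are already lower case pass through unchanged; only appName, timestamp and the structuredData hierarchy are renamed to match the Iceberg columns.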
2.4 Merging records and starting to write to Iceberg
Now that we have verified that our schema is being transformed as needed, it’s time to start the remaining processors and write our events into the Iceberg table. The MergeRecords processor is configured to batch events up to increase efficiency when writing to Iceberg. The final processor, WriteToIceberg, takes our Avro records and writes them into a Parquet-formatted table.
1. Tip: You can change the configuration to something like “30 sec” to speed up processing.
2. Select the MergeRecords processor and explore its configuration. It is configured to batch events up for at least 30 seconds or until the queued-up events have reached the Maximum Bin Size of 1 GB. You will want to lower these values for testing.
3. Start the MergeRecords processor and verify that it batches up events and writes them out after 30 seconds.
4. Select the WriteToIceberg processor and explore its configuration. Notice how it relies on several parameters to establish a connection to the right database and table.
5. Start the WriteToIceberg processor and verify that it writes records successfully to Iceberg. If the metrics on the processor increase and you don’t see any warnings or events being written to the failure_WriteToIceberg connection, your writes are successful!
Congratulations! With this you have completed the second use case. You may want to log into Hue to check that your data has loaded.
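A quick check in Hue might look like the following sketch. The column names come from the table you created earlier; the ORDER BY and LIMIT clauses are illustrative:
SELECT hostname, severity, appname, event_timestamp, body
FROM <<userid>>_syslog_critical_archive
ORDER BY event_timestamp DESC
LIMIT 10;
If rows come back with only severity values of 2 or lower, the whole pipeline - filtering, schema transformation and Iceberg ingestion - is working end to end.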
Feel free to publish your flow to the catalog and create a deployment just like you did for the first one.
3. Resize image flow deployed as serverless function
DataFlow Functions provides a new, efficient way to run your event-driven Apache NiFi data flows. You can have your flow executed within AWS Lambda, Azure Functions or Google Cloud Functions and define the trigger that should start its execution. DataFlow Functions is perfect for use cases such as:
- Processing files as soon as they land in the cloud provider object store
- Creating microservices over HTTPS
- CRON-driven use cases
- etc.
In this use case, we will be deploying a NiFi flow that is triggered by HTTPS requests to resize images. Once deployed, the cloud provider will provide an HTTPS endpoint that you can call to send an image; the call will trigger the NiFi flow, which will return a resized image based on your parameters.
The deployment of the flow as a function has to be done within your cloud provider. The tutorial below uses AWS as the cloud provider. If you’re using Azure or Google Cloud, you can still refer to this documentation to deploy the flow as a function.
3.1 Designing the flow for AWS Lambda
1. Go into Cloudera DataFlow / Flow Design and create a new draft with a name of your choice.
2. Drag and drop an Input Port named input onto the canvas. When triggered, AWS Lambda will inject into that input port a FlowFile containing the information about the HTTPS call that was made (a sketch of such a payload is shown after step 10 below).
3. Drag and drop an EvaluateJsonPath processor and call it ExtractHTTPHeaders. We’re going to use this to extract the HTTP headers that we want to keep in our flow. Add two properties configured as below. They will save as FlowFile attributes the HTTP headers (resize-height and resize-width) that we will be adding when making a call with our image to specify the dimensions of the resized image.
resizeHeight => $.headers.resize-height
resizeWidth => $.headers.resize-width
Note: don’t forget to change Destination to “flowfile-attribute” and click Apply.
4. Drag and drop another EvaluateJsonPath processor and change its name to a unique one. This one will be used to retrieve the content of the body field from the payload we received and use it as the new content of the FlowFile. This field contains the actual representation of the image we have been sending over HTTP with Base64 encoding.
body => $.body
5. Drag and drop a Base64EncodeContent processor and change the mode to Decode. This will Base64-decode the content of the FlowFile to retrieve its binary format.
6. Drag and drop a ResizeImage processor. Use the previously created FlowFile attributes to specify the new dimensions of the image. Also, specify true for maintaining the ratio.
7. Drag and drop a Base64EncodeContent processor. To send the resized image back to the user, AWS Lambda expects us to send back a specific JSON payload with the Base64 encoding of the image.
8. Drag and drop a ReplaceText processor. We use it to wrap the Base64 representation of the resized image in the expected JSON payload. Add the JSON below in “Replacement Value” and change “Evaluation Mode” to “Entire text”.
{
  "statusCode": 200,
  "headers": { "Content-Type": "image/png" },
  "isBase64Encoded": true,
  "body": "$1"
}
9. Drag and drop an output port.
10. Connect all the components together; you can auto-terminate the unused relationships.
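For reference, the payload that AWS Lambda injects as a FlowFile (mentioned in step 2) typically looks something like the trimmed sketch below. The exact fields depend on how the function is exposed (for example, a Lambda function URL), and the header and body values here are illustrative only:
{
  "version": "2.0",
  "routeKey": "$default",
  "headers": {
    "content-type": "image/png",
    "resize-width": "400",
    "resize-height": "300"
  },
  "isBase64Encoded": true,
  "body": "iVBORw0KGgoAAAANSUhEUgAA..."
}
The two EvaluateJsonPath processors pull resize-width and resize-height out of headers and promote body to the FlowFile content, which is then Base64-decoded before resizing.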
You can now publish the flow to the DataFlow Catalog from the Flow Options menu. Make sure to give it a unique name (you can prefix it with your name).
Once the flow is published, make sure to copy the CRN of the published version (it will end with /v.1).
3.2 Deploying the flow as a function in AWS Lambda
First things first, go into DataFlow Functions and download the binary for running DataFlow Functions in AWS Lambda.
This should download a binary with a name similar to: naaf-aws-lambda-1.0.0.2.3.7.0-100-bin.zip
Once you have the binary, make sure you also have:
● The CRN of the flow you published in the DataFlow Catalog
● The Access Key that has been provided with these instructions in the “Competition Resources” section
● The Private Key that has been provided with these instructions in the “Competition Resources” section
To speed up the deployment, we’re going to leverage some scripts that automate it. They assume that the AWS CLI is properly configured locally on your laptop and that the jq command is available for reading JSON payloads. You can now follow the instructions from this page here. However, if you wish to deploy the flow in AWS Lambda manually through the AWS UI, you can follow the steps described here.
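Once the function is deployed and exposed over HTTPS, you can exercise it with a simple client call. The sketch below assumes a Lambda function URL (the URL, file names and dimensions are placeholders) and that the endpoint Base64-encodes binary request bodies, as Lambda function URLs do by default:
curl -X POST "https://<your-function-url>.lambda-url.<region>.on.aws/" \
  -H "resize-width: 400" \
  -H "resize-height: 300" \
  --data-binary "@original.png" \
  -o resized.png
If everything is wired correctly, resized.png should contain the image scaled to the requested dimensions while preserving the aspect ratio.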