WarsawITDays_ ApacheNiFi202

MAIN ORGANIZER: ORGANIZING COMMITTEE: dozens of organizations from the IT / data science sector (full list on the event website)
Apache NiFi 202: Integration and Best Practices
Tim Spann
Principal Developer Advocate, Cloudera

FLiPN-FLaNK Stack
Tim Spann
@PaasDev // Blog: www.datainmotion.dev
Principal Developer Advocate.
Princeton Future of Data Meetup.
https://github.com/tspannhw/EverythingApacheNiFi
https://medium.com/@tspann
Apache NiFi x Apache Kafka x Apache Flink

CONNECTED DEVICES ARE EVERYWHERE
• From the edge to the data center: capture data from all these sources and scale with a data streaming platform in a hybrid architecture
• Example signals: miles driven, tire wear-out, wearing of the doors, engine wear, rising temperature
• Diagram axes: data capacity, compute speed

TODAY’S NEEDS FOR DATA STREAMING
Gain Competitive Advantage
“Many leading enterprises realize that real-time analytics — the analytics of
the present — is an incredible competitive advantage because they can
act now to serve fickle customers, fix operational problems, power
internet-of-things (IoT) apps, and respond decisively to competitors.”
Forrester
• Supply chain impacts manufacturing
• Predict customer buying pattern
• Utilities prevent power outage
• Telecoms deliver continuous QoS
• Reduce cyber threats

FLaNK Stack
MiNiFi Agent
https://flankstack.dev/

NiFi Flow Example - Visually Build Serverless Apps
1. Pick Data Source(s): Apache Kafka, Databases, Files, REST Endpoints, Cloud Resources, NoSQL, JMS, MQTT, TCP/IP, etc.
2. Validate and Convert: Validate types and nulls, convert types, run SQL, and check against schemas.
3. Aggregate or Eliminate: Build up key data from multiple sources, or remove unneeded data via queries on records.
4. Send to Data Sink(s): Apache Pulsar, Databases, Files, REST Endpoints, Cloud Resources, NoSQL, JMS, MQTT, TCP/IP, etc.
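The four steps can be sketched outside NiFi as a chain of small functions. This is only an illustration of the pattern, not how NiFi implements it: the in-memory source, the field names, and the 50-degree cutoff are all hypothetical.

```python
from typing import Iterable, Iterator, List


def source() -> Iterator[dict]:
    """Step 1: stand-in for a data source (hypothetical in-memory readings)."""
    yield {"sensor": "a", "temp": "21.5"}
    yield {"sensor": "b", "temp": None}
    yield {"sensor": "c", "temp": "98.0"}


def validate_and_convert(records: Iterable[dict]) -> Iterator[dict]:
    """Step 2: drop records with a null reading and convert text to float."""
    for record in records:
        if record.get("temp") is None:
            continue
        yield {**record, "temp": float(record["temp"])}


def eliminate(records: Iterable[dict], max_temp: float = 50.0) -> Iterator[dict]:
    """Step 3: remove records that are not needed downstream."""
    return (record for record in records if record["temp"] <= max_temp)


def sink(records: Iterable[dict]) -> List[dict]:
    """Step 4: stand-in for a data sink; here we just collect into a list."""
    return list(records)


result = sink(eliminate(validate_and_convert(source())))
```

In NiFi each function would be a processor on the canvas and the calls between them would be connections.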

CLOUDERA DATAFLOW - APACHE NIFI
Solve the First Mile Data Collection Problem
Connect to any data source anywhere, process, and deliver to any destination
• Sources: data born in the cloud, data born outside the cloud
• Ingest: active connectors (connect & pull), passive gateway endpoint (send)
• Process: route, filter, enrich, transform
• Distribute: deliver through connectors to any destination

LOG ANALYTICS ARCHITECTURE
Data Sources – Edge 2
• Collect: data collection at the edge with NiFi / MiNiFi / Kafka clients
• Distribute: data filtered by NiFi
• Buffer: Kafka topics
• Syndicate: services powered by Kafka
• Analyze: streaming analytics apps, with stream processing powered by SQL Stream Builder on Flink
• Analyze: streaming OLAP analytics & time series store powered by Kudu
• Learn: build and enrich models with Cloudera Machine Learning (CML)
• Visualize: data visualization with Cloudera DataViz

SYSLOG RFC 5424
• PRI: the "priority" value, computed as Facility (what kind of message) * 8 + Severity (how urgent the message is)
• VERSION: always "1" for RFC 5424
• TIMESTAMP: an ISO 8601 timestamp; the "T" and "Z" must be uppercase
• HOSTNAME: using the FQDN (fully qualified domain name) is recommended
• APP-NAME: usually the name of the device or application that produced the message
• PROCID: often the process name or process ID ("-", the nil value, in the example)
• MSGID: should identify the type of message
• STRUCTURED-DATA: named lists of key-value pairs for easy parsing and searching
• MSG: details about the event
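The PRI arithmetic and the header layout above can be illustrated with a short parser. This is a minimal sketch, not NiFi's Syslog5424 reader: the regular expression is a simplification, STRUCTURED-DATA and MSG are left unparsed, and the sample message follows the style of the examples in RFC 5424.

```python
import re
from typing import NamedTuple

# Severity codes 0..7 as named in the syslog specification.
SEVERITY_NAMES = ["emergency", "alert", "critical", "error",
                  "warning", "notice", "informational", "debug"]


class SyslogHeader(NamedTuple):
    facility: int
    severity: int
    version: int
    timestamp: str
    hostname: str
    app_name: str
    procid: str
    msgid: str
    rest: str  # STRUCTURED-DATA and MSG, left unparsed in this sketch


# Simplified pattern: <PRI>VERSION, five space-separated fields, then the rest.
HEADER_RE = re.compile(
    r"<(\d{1,3})>(\d{1,2}) (\S+) (\S+) (\S+) (\S+) (\S+) (.*)", re.DOTALL)


def parse_header(line: str) -> SyslogHeader:
    match = HEADER_RE.match(line)
    if match is None:
        raise ValueError("not an RFC 5424 style header")
    pri, version, *fields = match.groups()
    # PRI = Facility * 8 + Severity, so divmod recovers both parts.
    facility, severity = divmod(int(pri), 8)
    return SyslogHeader(facility, severity, int(version), *fields)


SAMPLE = ("<34>1 2003-10-11T22:14:15.003Z mymachine.example.com su - ID47 - "
          "'su root' failed for lonvick on /dev/pts/8")
header = parse_header(SAMPLE)
```

For the sample, PRI 34 decodes to facility 4 and severity 2 (`SEVERITY_NAMES[2]` is "critical"), and PROCID carries the nil value "-".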

NIFI CLUSTER ARCHITECTURE
• Zero-Master Clustering - Each node in a NiFi cluster performs the same tasks on the data, but each operates on a different set of data.
• Cluster Coordinator – Elected by ZooKeeper and responsible for disconnecting and connecting nodes.
• Primary Node – Elected by ZooKeeper. On this node, it is possible to run Isolated Processors.
As a DataFlow manager, you can interact with the NiFi cluster through the user interface (UI) of any node. Any change you make is replicated to all nodes in the cluster, allowing for multiple entry points.

CONCURRENT TASKS FOR PROCESSORS
• Concurrent Tasks sets how many threads a single processor may use at once. Raising it lets that processor work on more FlowFiles in parallel, using system resources that are then not usable by other processors
• Increasing this value typically allows the processor to handle more data in the same amount of time

QUEUE CONFIGURATION
• FlowFile Expiration - Data that cannot be processed in a timely fashion can be automatically removed from the flow
• Back Pressure Thresholds - Thresholds indicate how much data should be allowed to exist in the queue before the component that is the source of the Connection is no longer scheduled to run. This allows the system to avoid being overrun with data
• Load Balance Strategy – Strategy to distribute the data in a flow across the nodes in the cluster. When enabled, compression can be configured on FlowFile contents and attributes
• Prioritization – Determines the order in which FlowFiles are processed
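How expiration, back pressure, and prioritization interact can be shown with a toy queue. This is a sketch under stated assumptions, not NiFi's implementation: the `Connection` class and its methods are hypothetical, only an object-count threshold is modeled, and age is measured from enqueue time for simplicity.

```python
import heapq
import itertools


class Connection:
    """Toy model of a queue between two processors."""

    def __init__(self, object_threshold=10_000, expiration_s=None):
        self.object_threshold = object_threshold  # back pressure object count
        self.expiration_s = expiration_s          # None means never expire
        self._heap = []
        self._seq = itertools.count()             # keeps FIFO order within a priority

    def is_full(self):
        """True when the source should no longer be scheduled to run."""
        return len(self._heap) >= self.object_threshold

    def offer(self, flowfile, priority, now):
        # Lower priority numbers are dequeued first.
        heapq.heappush(self._heap, (priority, next(self._seq), now, flowfile))

    def poll(self, now):
        """Return the next unexpired FlowFile, or None if the queue is empty."""
        while self._heap:
            _, _, enqueued_at, flowfile = heapq.heappop(self._heap)
            if self.expiration_s is not None and now - enqueued_at > self.expiration_s:
                continue  # expired data is dropped from the flow
            return flowfile
        return None


conn = Connection(object_threshold=2, expiration_s=60)
accepted = []
for name, priority in [("low", 5), ("high", 1), ("extra", 1)]:
    if conn.is_full():
        break  # back pressure: the source stops producing
    conn.offer(name, priority, now=0)
    accepted.append(name)

first = conn.poll(now=10)    # the prioritized FlowFile comes out first
second = conn.poll(now=100)  # the remaining one is older than 60 s and expires
```

The third FlowFile is never enqueued because the threshold of two objects was already reached.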

RECORD-ORIENTED DATA WITH NiFi
• Record Readers - Avro, CSV, Grok, IPFIX, JASN1, JSON, Parquet, Scripted, Syslog5424, Syslog, WindowsEvent, XML
• Record Writers - Avro, CSV, FreeFormText, JSON, Parquet, Scripted, XML
• Record Readers and Writers support referencing a schema registry for retrieving schemas when necessary.
• Enables processors to accept any data format without having to worry about the parsing and serialization logic.
• Allows us to keep FlowFiles larger, each consisting of multiple records, which results in far better performance.
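The reader/writer split can be sketched in a few lines: the conversion step knows nothing about CSV or JSON, it only passes records from a reader to a writer. The function names here are hypothetical and the sketch uses Python's standard library rather than NiFi's record API.

```python
import csv
import io
import json
from typing import Callable, List


def read_csv_records(content: str) -> List[dict]:
    """Reader: parse CSV text into a list of records (one dict per row)."""
    return [dict(row) for row in csv.DictReader(io.StringIO(content))]


def write_json_records(records: List[dict]) -> str:
    """Writer: serialize all records into a single JSON array."""
    return json.dumps(records)


def convert(content: str,
            reader: Callable[[str], List[dict]],
            writer: Callable[[List[dict]], str]) -> str:
    """Format-agnostic step: any reader can be paired with any writer."""
    return writer(reader(content))


csv_content = "id,name\n1,alpha\n2,beta\n"
json_content = convert(csv_content, read_csv_records, write_json_records)
```

Both rows travel together as one unit of content, which mirrors the point about keeping many records in a single FlowFile.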

RUNNING SQL ON FLOWFILES
• Evaluates one or more SQL queries against the contents of a FlowFile.
• This can be used, for example, for field-specific filtering, transformation, and row-level filtering.
• Columns can be renamed, simple calculations and aggregations performed.
• The SQL statement must be valid ANSI SQL and is powered by Apache Calcite.
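The idea of querying records with SQL can be tried outside NiFi. The sketch below swaps in SQLite from Python's standard library in place of Apache Calcite, so dialect details will differ. The table name FLOWFILE follows the convention of NiFi's QueryRecord processor, while the `query_records` helper and the sample records are hypothetical.

```python
import sqlite3
from typing import List


def query_records(records: List[dict], sql: str) -> List[dict]:
    """Load records into an in-memory table named FLOWFILE and run one query."""
    conn = sqlite3.connect(":memory:")
    conn.row_factory = sqlite3.Row
    columns = list(records[0].keys())
    column_list = ", ".join('"' + name + '"' for name in columns)
    conn.execute("CREATE TABLE FLOWFILE (" + column_list + ")")
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(
        "INSERT INTO FLOWFILE VALUES (" + placeholders + ")",
        [tuple(record[name] for name in columns) for record in records],
    )
    rows = [dict(row) for row in conn.execute(sql).fetchall()]
    conn.close()
    return rows


readings = [
    {"sensor": "a", "temperature": 21.5},
    {"sensor": "b", "temperature": 100.0},
    {"sensor": "c", "temperature": 75.0},
]

# Row-level filtering, a renamed column, and a simple calculation.
hot = query_records(
    readings,
    "SELECT sensor AS device, temperature * 9 / 5 + 32 AS temp_f "
    "FROM FLOWFILE WHERE temperature > 50 ORDER BY sensor",
)

# A simple aggregation over all records.
summary = query_records(
    readings,
    "SELECT COUNT(*) AS n, MAX(temperature) AS hottest FROM FLOWFILE",
)
```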

STREAMS MESSAGING WITH KAFKA
• Highly reliable distributed messaging system.
• Decouples applications and enables many-to-many patterns.
• Publish-Subscribe semantics.
• Horizontal scalability.
• Efficient implementation to operate at speed with big data volumes.
• Organized by topic to support several use cases.

Stateless Engine
• Granular containers per flow
• Flows From NiFi Registry
https://www.datainmotion.dev/2019/11/exploring-apache-nifi-110-parameters.html
bin/nifi.sh stateless RunFromRegistry Continuous --file kafka.json
https://github.com/apache/nifi/blob/ea1becac4fc519c54b8b4d21773e68f8da364755/nifi-nar-bundles/nifi-framework-bundle/nifi-framework/nifi-stateless/README.md

Stateless Engine
• See also Parameters
• Docker
• YARN
• Kubernetes (K8s)
• Stateful NiFi clusters
• Apache OpenWhisk (FaaS)
{
  "registryUrl": "http://tspann-mbp15-hw14277:18080",
  "bucketId": "140b30f0-5a47-4747-9021-19d4fde7f993",
  "flowId": "0540e1fd-c7ca-46fb-9296-e37632021945",
  "ssl": {
    "keystoreFile": "",
    "keystorePass": "",
    "keyPass": "",
    "keystoreType": "",
    "truststoreFile": "/Library/Java/JavaVirtualMachines/amazon-corretto-11.jdk/Contents/Home/lib/security/cacerts",
    "truststorePass": "changeit",
    "truststoreType": "JKS"
  },
  "parameters": {
    "broker": "4.317.852.100:9092",
    "topic": "iot",
    "group_id": "nifi-stateless-kafka-consumer",
    "DestinationDirectory": "/tmp/nifistateless/output2/",
    "output_dir": "/Users/tspann/Documents/nifi-1.10.0-SNAPSHOT/logs/output"
  }
}
https://github.com/tspannhw/stateless-examples
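Before handing a file like kafka.json to the stateless runner, it can help to check that it parses and names a flow. The sketch below is a hypothetical pre-flight check, not part of NiFi: treating `registryUrl`, `bucketId`, and `flowId` as required is an assumption drawn from the example above, and the values in `EXAMPLE` are placeholders.

```python
import json

# Assumption: these three keys identify the flow to fetch from NiFi Registry.
REQUIRED_KEYS = ("registryUrl", "bucketId", "flowId")


def load_stateless_config(text: str) -> dict:
    """Parse a stateless config and fail early if the flow is not identified."""
    config = json.loads(text)
    missing = [key for key in REQUIRED_KEYS if not config.get(key)]
    if missing:
        raise ValueError("missing required keys: " + ", ".join(missing))
    # Parameters are optional in this sketch; default to an empty mapping.
    config.setdefault("parameters", {})
    return config


# Placeholder values for illustration only.
EXAMPLE = """
{
  "registryUrl": "http://registry.example:18080",
  "bucketId": "example-bucket-id",
  "flowId": "example-flow-id",
  "parameters": {"topic": "iot"}
}
"""

config = load_stateless_config(EXAMPLE)
```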

Parameters
• Parameters
• Parameter Context

Parameters
• Advanced Editors
• Easy to Use
• PARAM

RetryFlowFile
• Configurable Retries
• Maximum #
• Penalties
• When to Fail
• Reuse Mode
https://medium.com/@abdelkrim.hadjidj/apache-nifi-1-10-series-simplifying-error-handling-7de86f130acd
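The retry-counting idea behind the bullets above can be sketched as a small routing function. This is a toy model, not the processor's code: the function name is hypothetical, the attribute name and the limit of 3 are assumed defaults, and penalties and reuse modes are not modeled.

```python
from typing import Dict, Tuple


def route_retry(attributes: Dict[str, str],
                max_retries: int = 3,
                retry_attribute: str = "flowfile.retries") -> Tuple[str, Dict[str, str]]:
    """Return the relationship to route to and the updated attributes."""
    raw = attributes.get(retry_attribute, "0")
    try:
        count = int(raw)
    except ValueError:
        # The counter attribute holds something that is not a number.
        return "failure", attributes
    if count >= max_retries:
        return "retries_exceeded", attributes
    # Still under the limit: bump the counter and try again.
    return "retry", {**attributes, retry_attribute: str(count + 1)}


# Loop one FlowFile's attributes through the router until it gives up.
attributes: Dict[str, str] = {}
history = []
while True:
    relationship, attributes = route_retry(attributes)
    history.append(relationship)
    if relationship != "retry":
        break
```

With a limit of 3, the FlowFile is routed to retry three times and then to retries_exceeded.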

BackPressure Prediction
• OrdinaryLeastSquares
• SimpleRegression
• Enable analytics feature
http://lonnifi.blogspot.com/2019/11/back-pressure-prediction-deep-dive.html?es_id=5233333939
https://youtu.be/Tt8TSlHu7PE
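The ordinary least squares idea can be illustrated by fitting a line to recent queue sizes and extrapolating to the back pressure threshold. This is a plain-Python sketch of the math, not NiFi's analytics code; the function names and the sample observations are hypothetical.

```python
from typing import List, Optional, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def seconds_until_threshold(times: List[float],
                            counts: List[float],
                            threshold: float) -> Optional[float]:
    """Estimate the time left before the queue reaches the threshold."""
    slope, intercept = fit_line(times, counts)
    if slope <= 0:
        return None  # queue is flat or draining, so no prediction
    time_at_threshold = (threshold - intercept) / slope
    return max(0.0, time_at_threshold - times[-1])


# Hypothetical observations: queue size sampled once a minute.
times = [0.0, 60.0, 120.0, 180.0]
counts = [1000.0, 2000.0, 3000.0, 4000.0]
remaining = seconds_until_threshold(times, counts, threshold=10_000)
draining = seconds_until_threshold(times, counts[::-1], threshold=10_000)
```

With the queue growing by 1,000 objects a minute from 4,000, the estimate comes out at about 360 seconds until 10,000 objects.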

Flow Catalog
• Central repository for flow definitions
• Import existing NiFi flows
• Manage flow definitions
• Initiate flow deployments

ReadyFlows
• Cloudera provided flow definitions
• Cover most common data flow use cases
• Can be deployed and adjusted as needed
• Made available through docs during Tech Preview

Deployment Wizard
• Turns flow definitions into flow deployments
• Guides users through providing required configuration
• Pick from pre-defined NiFi node sizes
• Define KPIs for the deployment
Wizard steps: Start Deployment Wizard, Provide Parameters, Configure Sizing & Scaling, Define KPIs

Key Performance Indicators
• Visibility into flow deployments
• Track high level flow performance
• Track in-depth NiFi component metrics
• Defined in Deployment Wizard
• Monitoring & Alerts in Deployment Details
Screenshots: KPI Definition in Deployment Wizard, KPI Monitoring

Dashboard
• Central Monitoring View
• Monitors flow deployments across CDP environments
• Monitors flow deployment health & performance
• Drill into flow deployment to monitor system metrics and deployment events

DATA FLOW DESIGN FOR EVERYONE
• Cloud-native data flow development
• Developers get their own sandbox
• Start developing flows without installing NiFi
• Redesigned visual canvas
• Optimized interaction patterns
• Integration into CDF-PC Catalog for versioning

© 2022 Cloudera, Inc. All rights reserved.
DATAFLOW FUNCTIONS - NO CODE SERVERLESS DEVELOPMENT
First no-code UI in the industry to quickly build & deploy functions to any Function as a Service (FaaS) solution
• Developers develop functions on their local developer workstation or in CDP Public Cloud using the no-code designer
• Deploy the functions on Function as a Service (FaaS) solutions on AWS, Azure & GCP: AWS Lambda, Azure Functions, Google Cloud Functions

EVOLUTION OF DATAFLOW OFFERINGS
Direction of evolution: infrastructure abstraction, simplicity, operational efficiency, pay for value
• Bare Metal: cluster based products on-prem
• Virtual Machines (Cloud): cluster-as-service offering on public cloud
• K8S / Containers: next-gen cloud service (CDF-PC)
• DF Deployments on CDF-PC: high throughput, low latency streaming use cases
• DF Functions on CDF-PC (new): serverless NiFi flows for event driven and microservice use cases
CDF-PC benefits:
1. Reduced operational overhead
2. Cost effective through auto-scaling
3. Central monitoring & easy CI/CD integration

Resources

Thank you for watching!
Remember to rate the presentation and
leave your questions in the section below.
www.WarszawskieDniInformatyki.pl 31 March - 1 April 2023 Politechnika Warszawska + online