MAIN ORGANIZER: ORGANIZING COMMITTEE: dozens of organizations from the IT / data science sector (full list on the event website)
Apache NiFi 202: Integration and Best Practices
Tim Spann
Principal Developer Advocate, Cloudera
3
FLiPN-FLaNK Stack
Tim Spann
@PaasDev // Blog: www.datainmotion.dev
Principal Developer Advocate.
Princeton Future of Data Meetup.
https://github.com/tspannhw/EverythingApacheNiFi
https://medium.com/@tspann
Apache NiFi x Apache Kafka x Apache Flink
4
CONNECTED DEVICES ARE EVERYWHERE
EDGE / DATA CENTER
Capture data from all these sources and scale with a data streaming platform in a hybrid architecture
Example signals: miles driven, tire wear-out, wearing of the doors, engine wear, rising temperature
Data capacity, compute speed
5
TODAY’S NEEDS FOR DATA STREAMING
Gain Competitive Advantage
“Many leading enterprises realize that real-time analytics — the analytics of
the present — is an incredible competitive advantage because they can
act now to serve fickle customers, fix operational problems, power
internet-of-things (IoT) apps, and respond decisively to competitors.”
Forrester
• Supply chain impacts manufacturing
• Predict customer buying pattern
• Utilities prevent power outage
• Telecoms deliver continuous QoS
• Reduce cyber threats
FLaNK Stack
MiNiFi
Agent
https://flankstack.dev/
7
NiFi Flow Example - Visually Build Serverless Apps
1. Pick Data Source(s): Apache Kafka, Databases, Files, REST Endpoints, Cloud Resources, NoSQL, JMS, MQTT, TCP/IP, etc.
2. Validate and Convert: validate types and nulls, convert types, run SQL, and check against schemas.
3. Aggregate or Eliminate: build up key data from multiple sources, or remove unneeded records via queries on records.
4. Send to Data Sink(s): Apache Pulsar, Databases, Files, REST Endpoints, Cloud Resources, NoSQL, JMS, MQTT, TCP/IP, etc.
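The four steps above can be sketched in plain Python. This only illustrates the pattern and is not NiFi code: the in-memory source, the record fields, and every function name here are hypothetical.

```python
from typing import Dict, Iterable, Iterator, List

# Hypothetical in-memory source; in NiFi this would be Kafka, a database, files, etc.
RAW_EVENTS = [
    {"device": "truck-1", "temp_c": "71.5"},
    {"device": "truck-2", "temp_c": None},   # null value, dropped in step 2
    {"device": "truck-1", "temp_c": "88.0"},
    {"device": "truck-3", "temp_c": "n/a"},  # not a number, dropped in step 2
]

def pick_source() -> Iterator[dict]:
    """Step 1: pick a data source."""
    yield from RAW_EVENTS

def validate_and_convert(events: Iterable[dict]) -> Iterator[dict]:
    """Step 2: reject nulls and bad types, convert strings to floats."""
    for event in events:
        try:
            yield {"device": event["device"], "temp_c": float(event["temp_c"])}
        except (KeyError, TypeError, ValueError):
            continue

def aggregate(events: Iterable[dict]) -> Dict[str, float]:
    """Step 3: keep only the highest temperature per device."""
    hottest: Dict[str, float] = {}
    for event in events:
        current = hottest.get(event["device"], float("-inf"))
        hottest[event["device"]] = max(current, event["temp_c"])
    return hottest

def send_to_sink(result: Dict[str, float], sink: List[str]) -> None:
    """Step 4: deliver to a sink (a plain list stands in for Pulsar, files, etc.)."""
    for device, temp in sorted(result.items()):
        sink.append(f"{device},{temp}")

sink: List[str] = []
send_to_sink(aggregate(validate_and_convert(pick_source())), sink)
```

In NiFi each step would be one or more processors on the canvas rather than a function call.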
Connect to any data source anywhere, process, and deliver to any
destination
CLOUDERA DATAFLOW - APACHE NIFI
Solve the First Mile Data Collection Problem
• Ingest: active connectors (connect & pull) and passive connectors (gateway endpoint that sources send to), for data born in the cloud and data born outside the cloud
• Process: route, filter, enrich, transform
• Distribute: connectors deliver to any destination
9
LOG ANALYTICS ARCHITECTURE
Data Sources: Edge 1, Edge 2, Edge 3
• Collect: data collection at the edge with NiFi / MiNiFi / Kafka clients
• Distribute: data filtered by NiFi
• Buffer: Kafka (syndicate topics)
• Syndicate: services powered by Kafka
• Analyze: streaming analytics apps; stream processing powered by SQL Stream Builder on Flink
• Analyze: streaming OLAP analytics and time series store powered by Kudu
• Visualize: data visualization with Cloudera DataViz
• Learn: build and enrich models with Cloudera Machine Learning (CML)
10
SYSLOG RFC 5424
• PRI — or "priority", Facility (what kind of message) * 8 + Severity (how urgent is the message)
• VERSION — version is always "1" for RFC 5424
• TIMESTAMP — must follow the RFC 3339 profile of ISO 8601, with uppercase "T" and "Z"
• HOSTNAME — using the FQDN (fully qualified domain name) is recommended
• APP-NAME — usually the name of the device or application that produced the message
• PROCID — often used to provide the process name or process ID ("-" is the nil value, as in the example)
• MSGID — should identify the type of message
• STRUCTURED-DATA — named lists of key-value pairs for easy parsing and searching
• MSG — details about the event
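The header fields listed above can be pulled apart with a small parser. This is a minimal sketch, not what NiFi's Syslog5424 reader does internally: the regular expression is simplified, STRUCTURED-DATA and MSG are left unparsed in `rest`, and the names `parse_rfc5424` and `SyslogHeader` are made up for the example.

```python
import re
from typing import NamedTuple, Optional

# <PRI>VERSION TIMESTAMP HOSTNAME APP-NAME PROCID MSGID, then STRUCTURED-DATA and MSG
_HEADER = re.compile(
    r"<(?P<pri>\d{1,3})>(?P<version>\d{1,2}) "
    r"(?P<timestamp>\S+) (?P<hostname>\S+) (?P<app_name>\S+) "
    r"(?P<procid>\S+) (?P<msgid>\S+) (?P<rest>.*)",
    re.DOTALL,
)

class SyslogHeader(NamedTuple):
    facility: int
    severity: int
    version: int
    timestamp: Optional[str]
    hostname: Optional[str]
    app_name: Optional[str]
    procid: Optional[str]
    msgid: Optional[str]
    rest: str  # STRUCTURED-DATA followed by MSG, not parsed further here

def _nil(value: str) -> Optional[str]:
    """RFC 5424 uses "-" as the nil value."""
    return None if value == "-" else value

def parse_rfc5424(line: str) -> SyslogHeader:
    match = _HEADER.match(line)
    if match is None:
        raise ValueError("not an RFC 5424 message")
    # PRI = Facility * 8 + Severity, so divmod recovers both parts.
    facility, severity = divmod(int(match["pri"]), 8)
    return SyslogHeader(
        facility,
        severity,
        int(match["version"]),
        _nil(match["timestamp"]),
        _nil(match["hostname"]),
        _nil(match["app_name"]),
        _nil(match["procid"]),
        _nil(match["msgid"]),
        match["rest"],
    )

header = parse_rfc5424(
    "<34>1 2003-10-11T22:14:15.003Z mymachine.example.com su - ID47 - "
    "'su root' failed for lonvick on /dev/pts/8"
)
```

For this sample line, a PRI of 34 decodes to facility 4 and severity 2, and the PROCID of "-" becomes `None`.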
11
NIFI CLUSTER ARCHITECTURE
• Zero-Master Clustering - Each node in a NiFi
cluster performs the same tasks on the data, but each
operates on a different set of data.
• Cluster Coordinator – Elected by ZooKeeper and
responsible for disconnecting and connecting nodes.
• Primary Node – Elected by ZooKeeper. Isolated
Processors run only on this node.
As a DataFlow manager, you can interact with the NiFi
cluster through the user interface (UI) of any node. Any
change you make is replicated to all nodes in the cluster,
allowing for multiple entry points.
12
CONCURRENT TASKS FOR
PROCESSORS
• Concurrent Tasks sets how many FlowFiles a single
processor can work on at the same time; the system
resources it uses are then not available to other Processors
• Increasing this value typically allows the processor to
handle more data in the same amount of time
13
QUEUE CONFIGURATION
• FlowFile Expiration - Data that cannot be processed
in a timely fashion can be automatically removed from
the flow
• Back Pressure Thresholds - Thresholds indicate
how much data should be allowed to exist in the
queue before the component that is the source of the
Connection is no longer scheduled to run. This allows
the system to avoid being overrun with data
• Load Balance Strategy – Strategy to distribute the
data in a flow across the nodes in the cluster. When
enabled, compression can be configured on FlowFile
contents and attributes
• Prioritization – Determines the order in which flow
files are processed
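The queue settings above interact, which a toy model makes easier to see. This sketch assumes an object-count threshold only and a single numeric priority; the class and method names are hypothetical and do not mirror NiFi's internal classes.

```python
import heapq
import itertools
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass(order=True)
class QueuedFlowFile:
    priority: int                          # lower value is polled first
    seq: int                               # tie-breaker: first in, first out
    enqueued_at: float = field(compare=False)
    name: str = field(compare=False)

class ConnectionQueue:
    """Toy connection queue with back pressure, expiration and prioritization."""

    def __init__(self, object_threshold: int, expiration_s: float) -> None:
        self.object_threshold = object_threshold
        self.expiration_s = expiration_s
        self._heap: List[QueuedFlowFile] = []
        self._seq = itertools.count()

    def back_pressure_applied(self) -> bool:
        # The upstream component checks this before it is scheduled to run.
        return len(self._heap) >= self.object_threshold

    def offer(self, name: str, priority: int, now: float) -> None:
        heapq.heappush(self._heap, QueuedFlowFile(priority, next(self._seq), now, name))

    def poll(self, now: float) -> Optional[str]:
        # Expired FlowFiles are dropped instead of being handed downstream.
        while self._heap:
            flowfile = heapq.heappop(self._heap)
            if now - flowfile.enqueued_at <= self.expiration_s:
                return flowfile.name
        return None

queue = ConnectionQueue(object_threshold=2, expiration_s=60.0)
queue.offer("a", priority=5, now=0.0)
queue.offer("b", priority=1, now=50.0)
source_may_run = not queue.back_pressure_applied()
first = queue.poll(now=70.0)    # "b": higher priority and still fresh
second = queue.poll(now=70.0)   # None: "a" is 70 s old and has expired
```

Note that the threshold does not reject data in this model; it only tells the source to stop running, which matches the description of back pressure above.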
14
RECORD-ORIENTED DATA WITH NiFi
• Record Readers - Avro, CSV, Grok, IPFIX, JASN1,
JSON, Parquet, Scripted, Syslog5424, Syslog,
WindowsEvent, XML
• Record Writers - Avro, CSV, FreeFormText, JSON,
Parquet, Scripted, XML
• Record Reader and Writer support referencing a
schema registry for retrieving schemas when
necessary.
• Enables processors to accept any data format without
having to worry about the parsing and serialization
logic.
• Allows us to keep FlowFiles larger, each consisting of
multiple records, which results in far better
performance.
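The reader/writer split described above can be illustrated with the Python standard library. This is only an analogy for how a record-based processor such as ConvertRecord is configured; `read_csv`, `write_json`, and `convert_record` are hypothetical helpers, and no schema registry is involved.

```python
import csv
import io
import json
from typing import Callable, Dict, List

Record = Dict[str, str]

def read_csv(content: str) -> List[Record]:
    """Stand-in for a CSV record reader: content in, records out."""
    return [dict(row) for row in csv.DictReader(io.StringIO(content))]

def write_json(records: List[Record]) -> str:
    """Stand-in for a JSON record writer: records in, content out."""
    return json.dumps(records)

def convert_record(
    content: str,
    reader: Callable[[str], List[Record]],
    writer: Callable[[List[Record]], str],
) -> str:
    # The conversion logic never looks at the format; it only sees records.
    return writer(reader(content))

# One FlowFile carrying several records, rather than one FlowFile per record.
csv_flowfile = "device,temp_c\ntruck-1,71.5\ntruck-2,88.0\n"
json_flowfile = convert_record(csv_flowfile, read_csv, write_json)
```

Swapping the format means passing a different reader or writer; the conversion function itself does not change.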
15
RUNNING SQL ON FLOWFILES
• Evaluates one or more SQL queries against
the contents of a FlowFile.
• This can be used, for example, for
field-specific filtering, transformation, and
row-level filtering.
• Columns can be renamed, and simple calculations
and aggregations can be performed.
• The SQL statement must be valid ANSI SQL
and is powered by Apache Calcite.
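To show the kind of queries meant here, the sketch below runs SQL over a handful of records. It uses SQLite from the Python standard library as a stand-in for Apache Calcite, so dialect details may differ; the table name `FLOWFILE` follows the convention of NiFi's QueryRecord processor, while the records and the `query_records` helper are made up.

```python
import sqlite3
from typing import Dict, List

records = [
    {"device": "truck-1", "temp_c": 71.5},
    {"device": "truck-2", "temp_c": 88.0},
    {"device": "truck-1", "temp_c": 93.0},
]

def query_records(rows: List[Dict], sql: str) -> List[Dict]:
    """Load the records into an in-memory table and evaluate one query."""
    conn = sqlite3.connect(":memory:")
    conn.row_factory = sqlite3.Row
    conn.execute("CREATE TABLE FLOWFILE (device TEXT, temp_c REAL)")
    conn.executemany("INSERT INTO FLOWFILE VALUES (:device, :temp_c)", rows)
    result = [dict(row) for row in conn.execute(sql)]
    conn.close()
    return result

# Row-level filtering plus a renamed column.
hot = query_records(
    records,
    "SELECT device, temp_c AS temperature FROM FLOWFILE "
    "WHERE temp_c > 80 ORDER BY temp_c",
)

# A simple aggregation.
per_device = query_records(
    records,
    "SELECT device, MAX(temp_c) AS max_temp FROM FLOWFILE "
    "GROUP BY device ORDER BY device",
)
```

Each query would typically map to its own outgoing relationship in the processor, so one FlowFile can feed several filtered or aggregated outputs.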
16
STREAMS MESSAGING WITH KAFKA
• Highly reliable distributed messaging system.
• Decouples applications and enables many-to-many patterns.
• Publish-Subscribe semantics.
• Horizontal scalability.
• Efficient implementation to operate at speed with big
data volumes.
• Organized by topic to support several use cases.
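Publish-subscribe with decoupled consumers can be modeled in a few lines. This is a toy in-memory model, not the Kafka client API: there are no partitions, brokers, or persistence, and the `Topic` class is hypothetical. The group names are illustrative.

```python
from collections import defaultdict
from typing import Dict, List

class Topic:
    """Append-only log with one read offset per consumer group."""

    def __init__(self) -> None:
        self._log: List[str] = []
        self._offsets: Dict[str, int] = defaultdict(int)

    def publish(self, message: str) -> None:
        # Producers only append; they know nothing about consumers.
        self._log.append(message)

    def poll(self, group_id: str) -> List[str]:
        # Each group reads from its own offset, so groups do not affect each other.
        start = self._offsets[group_id]
        self._offsets[group_id] = len(self._log)
        return self._log[start:]

iot = Topic()
iot.publish("temp=71.5")
iot.publish("temp=88.0")

analytics_first = iot.poll("analytics")   # both messages
archiver_first = iot.poll("archiver")     # both messages again, independently

iot.publish("temp=93.0")
analytics_second = iot.poll("analytics")  # only the new message
```

Because every group keeps its own offset, adding a new consuming application does not require any change on the producing side.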
17
Stateless Engine
• Granular containers per flow
• Flows From NiFi Registry
https://www.datainmotion.dev/2019/11/exploring-apache-nifi-110-parameters.html
bin/nifi.sh stateless RunFromRegistry Continuous --file kafka.json
https://github.com/apache/nifi/blob/ea1becac4fc519c54b8b4d21773e68f8da364755/nifi-nar-bundles/nifi-framework-bundle/nifi-framework/nifi-stateless/README.md
18
Stateless Engine
• See also Parameters
• Docker
• YARN
• Kubernetes (K8s)
• Stateful NiFi clusters
• Apache OpenWhisk (FaaS)
https://www.datainmotion.dev/2019/11/exploring-apache-nifi-110-parameters.html
{
  "registryUrl": "http://tspann-mbp15-hw14277:18080",
  "bucketId": "140b30f0-5a47-4747-9021-19d4fde7f993",
  "flowId": "0540e1fd-c7ca-46fb-9296-e37632021945",
  "ssl": {
    "keystoreFile": "",
    "keystorePass": "",
    "keyPass": "",
    "keystoreType": "",
    "truststoreFile": "/Library/Java/JavaVirtualMachines/amazon-corretto-11.jdk/Contents/Home/lib/security/cacerts",
    "truststorePass": "changeit",
    "truststoreType": "JKS"
  },
  "parameters": {
    "broker": "4.317.852.100:9092",
    "topic": "iot",
    "group_id": "nifi-stateless-kafka-consumer",
    "DestinationDirectory": "/tmp/nifistateless/output2/",
    "output_dir": "/Users/tspann/Documents/nifi-1.10.0-SNAPSHOT/logs/output"
  }
}
https://github.com/tspannhw/stateless-examples
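A configuration file like the one above can be generated per environment instead of edited by hand. The sketch below is an assumption-laden helper, not part of NiFi: the function names are hypothetical, the registry URL and IDs are placeholders, and the "ssl" section is left out for brevity (whether the stateless runner accepts a file without it has not been verified here). The command mirrors the `RunFromRegistry` invocation shown on the earlier Stateless Engine slide and is only built, not executed.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List

def write_stateless_config(
    path: Path,
    registry_url: str,
    bucket_id: str,
    flow_id: str,
    parameters: Dict[str, str],
) -> Dict:
    """Write a config with the same top-level keys as the slide's example."""
    config = {
        "registryUrl": registry_url,
        "bucketId": bucket_id,
        "flowId": flow_id,
        "parameters": dict(parameters),
    }
    path.write_text(json.dumps(config, indent=2))
    return config

def stateless_command(config_path: Path) -> List[str]:
    """Build (but do not run) the command line from the slide."""
    return [
        "bin/nifi.sh", "stateless", "RunFromRegistry", "Continuous",
        "--file", str(config_path),
    ]

workdir = Path(tempfile.mkdtemp())
config_path = workdir / "kafka.json"
config = write_stateless_config(
    config_path,
    registry_url="http://localhost:18080",  # placeholder registry address
    bucket_id="<bucket-id>",                # placeholder, copy from NiFi Registry
    flow_id="<flow-id>",                    # placeholder, copy from NiFi Registry
    parameters={"topic": "iot", "group_id": "nifi-stateless-kafka-consumer"},
)
command = stateless_command(config_path)
```

Keeping the parameters in code makes it straightforward to produce one file per environment with different brokers or topics.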
19
Parameters
• Parameters
• Parameter Context
https://www.datainmotion.dev/2019/11/exploring-apache-nifi-110-parameters.html
20
Parameters
• Advanced Editors
• Easy to Use
• PARAM
https://www.datainmotion.dev/2019/11/exploring-apache-nifi-110-parameters.html
21
RetryFlowFile
• Configurable Retries
• Maximum #
• Penalties
• When to Fail
• Reuse Mode
https://medium.com/@abdelkrim.hadjidj/apache-nifi-1-10-series-simplifying-error-handling-7de86f130acd
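The retry-counting idea can be sketched as a small routing function. This is an illustration of the concept rather than the processor's implementation: the attribute name, relationship names, and default of three retries are assumptions modeled on RetryFlowFile, and penalties and reuse modes are not modeled.

```python
from typing import Dict, Tuple

def route_retry(
    attributes: Dict[str, str],
    max_retries: int = 3,
    retry_attribute: str = "flowfile.retries",
) -> Tuple[str, Dict[str, str]]:
    """Return the relationship to route to and the updated attributes."""
    raw = attributes.get(retry_attribute, "0")
    try:
        count = int(raw)
    except ValueError:
        # The attribute exists but is not a number: give up rather than guess.
        return "failure", attributes
    if count >= max_retries:
        return "retries_exceeded", attributes
    updated = dict(attributes)
    updated[retry_attribute] = str(count + 1)
    return "retry", updated

# Simulate a FlowFile that keeps failing downstream and loops back.
attrs: Dict[str, str] = {"filename": "reading.json"}
history = []
while True:
    relationship, attrs = route_retry(attrs)
    history.append(relationship)
    if relationship != "retry":
        break
```

With the default of three retries the loop above records three "retry" routes followed by "retries_exceeded".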
22
BackPressure
Prediction
• OrdinaryLeastSquares
• SimpleRegression
• Enable analytics feature
http://lonnifi.blogspot.com/2019/11/back-pressure-prediction-deep-dive.html?es_id=5233333939
https://youtu.be/Tt8TSlHu7PE
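The idea behind regression-based back pressure prediction can be shown with an ordinary least squares line fitted to queue-size samples. This is a simplified sketch and not NiFi's analytics code: the sample data, the threshold, and both function names are invented for the example.

```python
from typing import List, Optional, Tuple

def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def seconds_until_back_pressure(
    times_s: List[float], queued_objects: List[float], threshold: float
) -> Optional[float]:
    """Estimate how long until the queue reaches the threshold, if it is growing."""
    slope, intercept = fit_line(times_s, queued_objects)
    if slope <= 0:
        return None  # queue is flat or draining, no back pressure predicted
    time_at_threshold = (threshold - intercept) / slope
    return max(0.0, time_at_threshold - times_s[-1])

# Hypothetical samples: the queue grows by 1,000 objects per minute.
times = [0.0, 60.0, 120.0, 180.0]
counts = [1000.0, 2000.0, 3000.0, 4000.0]
eta = seconds_until_back_pressure(times, counts, threshold=10000.0)
```

For these samples the estimate is about 360 seconds after the last observation; a draining queue returns `None`.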
23
Flow Catalog
• Central repository for flow
definitions
• Import existing NiFi flows
• Manage flow definitions
• Initiate flow deployments
24
ReadyFlows
• Cloudera provided flow
definitions
• Cover most common data flow
use cases
• Can be deployed and adjusted
as needed
• Made available through docs
during Tech Preview
25
Deployment
Wizard
• Turns flow definitions into flow
deployments
• Guides users through providing
required configuration
• Pick from pre-defined NiFi
node sizes
• Define KPIs for the deployment
Wizard steps: Start Deployment Wizard, Provide Parameters, Configure Sizing & Scaling, Define KPIs
26
Key Performance
Indicators
• Visibility into flow deployments
• Track high level flow
performance
• Track in-depth NiFi component
metrics
• Defined in Deployment Wizard
• Monitoring & Alerts in
Deployment Details
Screens shown: KPI Definition in Deployment Wizard, KPI Monitoring
27
Dashboard
• Central Monitoring View
• Monitors flow deployments
across CDP environments
• Monitors flow deployment
health & performance
• Drill into flow deployment to
monitor system metrics and
deployment events
28
DATA FLOW
DESIGN FOR
EVERYONE
• Cloud-native data flow
development
• Developers get their own
sandbox
• Start developing flows without
installing NiFi
• Redesigned visual canvas
• Optimized interaction patterns
• Integration into CDF-PC Catalog
for versioning
DATAFLOW FUNCTIONS
30
© 2022 Cloudera, Inc. All rights reserved.
DATAFLOW FUNCTIONS - NO CODE SERVERLESS DEVELOPMENT
First no-code UI in the industry to quickly build & deploy functions to any Function as a Service (FaaS) solution
• Developers build functions on their local workstation or in CDP Public Cloud using the no-code designer
• Functions are deployed to FaaS solutions on AWS, Azure & GCP: AWS Lambda, Azure Functions, Google Cloud Functions
31
EVOLUTION OF DATAFLOW OFFERINGS
Direction of evolution: infrastructure abstraction, simplicity, operational efficiency; pay for value
• Bare Metal: cluster-based products, on-prem
• Virtual Machines (Cloud): cluster-as-a-service offering, Public Cloud
• K8S / Containers: DF Deployments on CDF-PC, a next-gen cloud service for high throughput, low latency streaming use cases
• Serverless NiFi flows (new): DF Functions on CDF-PC, for event driven and microservice use cases
CDF-PC benefits:
1. Reduced operational overhead
2. Cost effective through auto-scaling
3. Central monitoring & easy CI/CD integration
DEMO
33
THANK YOU
34
Resources
Thank you for watching!
Remember to rate the presentation and
leave your questions in the section below.
www.WarszawskieDniInformatyki.pl 31 March - 1 April 2023 Politechnika Warszawska + online