Integrating Apache NiFi and Apache Flink

© Hortonworks Inc. 2011 – 2015. All Rights Reserved
Feb 4th 2016
Bryan Bende – Member of Technical Staff

Outline
• Introduction to NiFi
• NiFi Site-To-Site
• Flink + NiFi Integration
• Use Case Discussion

About Me
• Member of Technical Staff at Hortonworks
• Apache NiFi Committer & PMC Member since June 2015
• Contributed NiFi + Flink Streaming Integration
• Twitter: @bbende / Blog: bryanbende.com

Introduction to Apache NiFi

Apache NiFi
• Powerful and reliable system to process and distribute data
• Directed graphs of data routing and transformation
• Web-based user interface for creating, monitoring, and controlling data flows
• Highly configurable: modify data flow at runtime, dynamically prioritize data
• Data Provenance tracks data through the entire system
• Easily extensible through development of custom components
[1] https://nifi.apache.org/

NiFi - Terminology
FlowFile
• Unit of data moving through the system
• Content + Attributes (key/value pairs)
Processor
• Performs the work, can access FlowFiles
Connection
• Links between processors
• Queues that can be dynamically prioritized
Process Group
• Set of processors and their connections
• Receive data via input ports, send data via output ports

NiFi - User Interface
• Drag and drop processors to build a flow
• Start, stop, and configure components in real time
• View errors and corresponding error messages
• View statistics and health of data flow
• Create templates of common processors & connections

NiFi - Provenance
• Tracks data at each point as it flows through the system
• Records, indexes, and makes events available for display
• Handles fan-in/fan-out, i.e. merging and splitting data
• View attributes and content at given points in time

NiFi - Queue Prioritization
• Configure a prioritizer per connection
• Determine what is important for your data – time based, arrival order, importance of a data set
• Funnel many connections down to a single connection to prioritize across data sets
• Develop your own prioritizer if needed
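As an illustration of what a custom prioritizer boils down to, here is a minimal, self-contained sketch of a comparator-driven queue. It does not use NiFi's API: `QueuedFlowFile`, `PriorityAttributeComparator`, `PrioritizedConnection`, and the convention that a lower numeric `priority` attribute goes first are all assumptions made for this example.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical, simplified FlowFile: an arrival sequence number plus attributes.
final class QueuedFlowFile {
    final long arrivalOrder;
    final Map<String, String> attributes;

    QueuedFlowFile(long arrivalOrder, Map<String, String> attributes) {
        this.arrivalOrder = arrivalOrder;
        this.attributes = new HashMap<>(attributes);
    }
}

// Orders by the numeric "priority" attribute (lower value first, an assumed
// convention), then by arrival order so equal priorities stay first-in, first-out.
final class PriorityAttributeComparator implements Comparator<QueuedFlowFile> {
    private static final int DEFAULT_PRIORITY = Integer.MAX_VALUE;

    @Override
    public int compare(QueuedFlowFile a, QueuedFlowFile b) {
        int byPriority = Integer.compare(priorityOf(a), priorityOf(b));
        return byPriority != 0 ? byPriority : Long.compare(a.arrivalOrder, b.arrivalOrder);
    }

    private static int priorityOf(QueuedFlowFile flowFile) {
        String raw = flowFile.attributes.get("priority");
        if (raw == null) {
            return DEFAULT_PRIORITY; // missing attribute sorts last
        }
        try {
            return Integer.parseInt(raw.trim());
        } catch (NumberFormatException e) {
            return DEFAULT_PRIORITY; // unparseable values sort last
        }
    }
}

// Hypothetical stand-in for a connection whose queue is ordered by a prioritizer.
final class PrioritizedConnection {
    private final PriorityQueue<QueuedFlowFile> queue =
            new PriorityQueue<>(new PriorityAttributeComparator());
    private long nextArrival = 0;

    void enqueue(Map<String, String> attributes) {
        queue.add(new QueuedFlowFile(nextArrival++, attributes));
    }

    // Returns the highest-priority item, or null when the queue is empty.
    QueuedFlowFile poll() {
        return queue.poll();
    }
}
```

Funnelling several connections into one queue of this kind is what lets a single ordering rule apply across data sets.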

NiFi - Extensibility
Built from the ground up with extensions in mind
Service-loader pattern for…
• Processors
• Controller Services
• Reporting Tasks
• Prioritizers
Extensions packaged as NiFi Archives (NARs)
• Deploy to the NiFi lib directory and restart
• Provides ClassLoader isolation
• Same model as standard components
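The service-loader pattern mentioned above can be sketched with the JDK's `java.util.ServiceLoader`. This is only an illustration of the discovery mechanism: `SketchExtension` and `ExtensionDiscovery` are hypothetical names, and NiFi's real extension interfaces and NAR class loading are not modelled here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

// Hypothetical extension point used only for this sketch.
interface SketchExtension {
    String name();
}

final class ExtensionDiscovery {
    // Returns the names of all SketchExtension providers visible to the loader.
    // A provider is registered by listing its class name in a
    // META-INF/services/<fully qualified interface name> file on the classpath.
    static List<String> discover(ClassLoader loader) {
        List<String> names = new ArrayList<>();
        for (SketchExtension extension : ServiceLoader.load(SketchExtension.class, loader)) {
            names.add(extension.name());
        }
        return names;
    }
}
```

With no provider registered, `discover` simply returns an empty list; adding a jar that carries the services file and an implementation makes it show up without changing the calling code.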

NiFi - Architecture
[Diagram: standalone NiFi. A single OS/Host runs one JVM containing the Flow Controller, a Web Server, and Processor 1 through Extension N, with the FlowFile, Content, and Provenance Repositories on local storage.]
[Diagram: NiFi cluster. The master is the NiFi Cluster Manager (NCM), an OS/Host running a JVM with a Web Server and the request replicator. The slaves are the NiFi nodes, each an OS/Host running a JVM with a Flow Controller and Web Server, and FlowFile, Content, and Provenance Repositories on local storage.]

NiFi Site-To-Site

NiFi Site-To-Site
• Direct communication between two NiFi instances
• Push to Input Port on receiver, or Pull from Output Port on source
• Communicate between clusters, standalone instances, or both
• Handles load balancing and reliable delivery
• Secure connections using certificates (optional)

Site-To-Site Push
• Source connects Remote Process Group to Input Port on destination
• Site-To-Site takes care of load balancing across the nodes in the cluster
[Diagram: a standalone NiFi with a Remote Process Group (RPG) pushes to the Input Ports on Node 1 and Node 2 of a cluster managed by the NCM.]

Site-To-Site Pull
• Destination connects Remote Process Group to Output Port on the source
• If the source were a cluster, each destination node would pull from every node in the source cluster
[Diagram: Node 1 and Node 2 of a cluster managed by the NCM each use a Remote Process Group (RPG) to pull from the Output Port of a standalone NiFi.]

Site-To-Site Client
• Code for Site-To-Site broken out into reusable module
• https://github.com/apache/nifi/tree/master/nifi-commons/nifi-site-to-site-client
• Can be used from any Java program to push/pull from NiFi
[Diagram: a Java program embedding the Site-To-Site Client connects to the Output Ports on Node 1 and Node 2 of a cluster managed by the NCM.]

Flink + NiFi Integration

Flink + NiFi Integration
• Use Site-To-Site Client in Flink Streaming
• NiFiSource to pull data from NiFi Output Port
• NiFiSink to push data to NiFi Input Port
• NiFiDataPacket to represent data to/from NiFi (think FlowFile)
public interface NiFiDataPacket {
    byte[] getContent();
    Map<String, String> getAttributes();
}
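To make the content-plus-attributes shape concrete, the following is a minimal sketch of one way to implement the interface. The interface is copied locally so the example compiles on its own; `SimpleDataPacket`, `ofText`, and `contentAsText` are hypothetical names for this illustration and are not claimed to be part of the Flink connector.

```java
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Local copy of the interface shown on the slide, so the sketch is self-contained.
interface NiFiDataPacket {
    byte[] getContent();
    Map<String, String> getAttributes();
}

// Hypothetical immutable implementation: defensive copies on the way in and out.
final class SimpleDataPacket implements NiFiDataPacket {
    private final byte[] content;
    private final Map<String, String> attributes;

    SimpleDataPacket(byte[] content, Map<String, String> attributes) {
        this.content = content.clone();
        this.attributes = Collections.unmodifiableMap(new HashMap<>(attributes));
    }

    // Convenience factory (an assumption of this sketch): UTF-8 text as content.
    static SimpleDataPacket ofText(String text, Map<String, String> attributes) {
        return new SimpleDataPacket(text.getBytes(StandardCharsets.UTF_8), attributes);
    }

    @Override
    public byte[] getContent() {
        return content.clone();
    }

    @Override
    public Map<String, String> getAttributes() {
        return attributes;
    }

    String contentAsText() {
        return new String(content, StandardCharsets.UTF_8);
    }
}
```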

NiFi Source Example
StreamExecutionEnvironment env =
    StreamExecutionEnvironment.getExecutionEnvironment();
SiteToSiteClientConfig clientConfig = new SiteToSiteClient.Builder()
    .url("http://localhost:8080/nifi")
    .portName("Data for Flink")
    .requestBatchCount(…)
    .buildConfig();
SourceFunction<NiFiDataPacket> nifiSource = new NiFiSource(clientConfig);
DataStream<NiFiDataPacket> streamSource = env.addSource(nifiSource);

NiFi Sink Example
StreamExecutionEnvironment env =
    StreamExecutionEnvironment.getExecutionEnvironment();
SiteToSiteClientConfig clientConfig = new SiteToSiteClient.Builder()
    .url("http://localhost:8080/nifi")
    .portName("Data from Flink")
    .buildConfig();
// Creates a NiFiDataPacket from incoming data of a given type
// Here we are creating NiFiDataPackets for each String
NiFiDataPacketBuilder<String> dpb = ...
DataStreamSink<String> dataStream = ...
    .addSink(new NiFiSink<>(clientConfig, dpb));
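The builder's job is to turn each stream element into content plus attributes. The sketch below shows that idea for Strings in plain Java. It deliberately avoids the Flink connector types: `SketchDataPacket`, `SketchPacketBuilder`, `StringPacketBuilder`, and the `mime.type` attribute are assumptions made for illustration, and the real builder interface may take additional arguments.

```java
import java.io.Serializable;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the packet type: content plus attributes.
interface SketchDataPacket {
    byte[] getContent();
    Map<String, String> getAttributes();
}

// Hypothetical, simplified builder contract: one packet per incoming element.
// Serializable because stream functions are typically shipped to workers.
interface SketchPacketBuilder<T> extends Serializable {
    SketchDataPacket createPacket(T value);
}

final class StringPacketBuilder implements SketchPacketBuilder<String> {
    private static final long serialVersionUID = 1L;

    @Override
    public SketchDataPacket createPacket(String value) {
        // Encode the String as UTF-8 content and describe it with one attribute.
        final byte[] content = value.getBytes(StandardCharsets.UTF_8);
        final Map<String, String> attributes = new HashMap<>();
        attributes.put("mime.type", "text/plain");
        final Map<String, String> readOnly = Collections.unmodifiableMap(attributes);

        return new SketchDataPacket() {
            @Override
            public byte[] getContent() {
                return content.clone();
            }

            @Override
            public Map<String, String> getAttributes() {
                return readOnly;
            }
        };
    }
}
```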

Use Case Discussion

Drive Data to Flink for Analysis
• Drive data from sources to central data center for analysis
• Tiered collection approach at various locations, think regional data centers
[Diagram: two Edge NiFi instances send data to a Core NiFi, which feeds Flink.]
Dynamically Adjusting Data Flow
• Push analytic results from Flink back to NiFi
• Push results back to edge locations/devices to change behavior
[Diagram: Flink sends results to the Core NiFi, which passes them on to the two Edge NiFi instances.]

Example: Dynamic Log Collection
1. Logs filtered by level and sent from Edge -> Core
2. Flink produces new filter levels based on rate & sends back to core
3. Edge polls core for new filter levels & updates filtering
[Diagram: Edge NiFi filters logs and sends them from its Logs Output to the Log Input on Core NiFi, which forwards them from its Log Output to the Flink analytic. Flink sends new filters to the Result Input on Core NiFi, where the result is stored and served. Edge NiFi polls that service, fetches the result, and updates its filter.]
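The edge side of the dynamic log collection example (filter by level, then update the level after a poll) can be sketched in plain Java as follows. This is not the NiFi flow itself: `LogLevel`, `EdgeLogFilter`, the "LEVEL message" line format, and the choice to drop lines without a recognizable level are all assumptions of this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Declaration order defines severity: DEBUG is lowest, ERROR is highest.
enum LogLevel { DEBUG, INFO, WARN, ERROR }

// Hypothetical edge-side filter holding the current minimum level.
final class EdgeLogFilter {
    private volatile LogLevel minimum;

    EdgeLogFilter(LogLevel initialMinimum) {
        this.minimum = initialMinimum;
    }

    // Called after polling the core for a new filter level.
    void updateMinimum(LogLevel newMinimum) {
        this.minimum = newMinimum;
    }

    // A line passes if its level is at or above the current minimum.
    boolean accepts(String line) {
        LogLevel level = parseLevel(line);
        return level != null && level.compareTo(minimum) >= 0;
    }

    List<String> filter(List<String> lines) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            if (accepts(line)) {
                kept.add(line);
            }
        }
        return kept;
    }

    // Assumes the level is the first whitespace-separated token of the line.
    // Returns null when no known level is found.
    static LogLevel parseLevel(String line) {
        if (line == null) {
            return null;
        }
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return null;
        }
        String first = trimmed.split("\\s+", 2)[0].toUpperCase(Locale.ROOT);
        try {
            return LogLevel.valueOf(first);
        } catch (IllegalArgumentException e) {
            return null;
        }
    }
}
```

Keeping the minimum level in one mutable field is what lets a poll result change behavior without restarting the filter.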

Dynamic Log Collection – Edge NiFi

Dynamic Log Collection – Core NiFi

Dynamic Log Collection – Flink Streaming
StreamExecutionEnvironment env = ...
// Pull log data from NiFi over Site-To-Site
SiteToSiteClientConfig clientConfig = getSourceConfig(props);
DataStream<NiFiDataPacket> streamSource =
    env.addSource(new NiFiSource(clientConfig));
// Count log levels per time window (window size in milliseconds)
int windowMs = ...
LogLevelFlatMap logLevelFlatMap = new LogLevelFlatMap(...);
DataStream<LogLevels> counts = streamSource
    .flatMap(logLevelFlatMap)
    .timeWindowAll(Time.of(windowMs, TimeUnit.MILLISECONDS))
    .apply(new LogLevelWindowCounter());
// Turn the counts into new filter levels and push them back to NiFi
double rate = ...
SiteToSiteClientConfig sinkConfig = getSinkConfig(props);
NiFiDataPacketBuilder<LogLevels> builder = new DictionaryBuilder(windowMs, rate);
counts.addSink(new NiFiSink<>(sinkConfig, builder));
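Stripped of the streaming machinery, the analytic amounts to counting levels per window and comparing a rate against a threshold. The following plain-Java sketch shows that logic under stated assumptions: `WindowLevelCounter` and `FilterLevelRule` are hypothetical names, and the rule itself (widen collection to DEBUG when errors per second exceed the threshold, otherwise collect only ERROR) is invented for illustration and is not taken from the job above.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Counts how often each level appears in one window of extracted levels.
final class WindowLevelCounter {
    static Map<String, Integer> count(List<String> levelsInWindow) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String level : levelsInWindow) {
            counts.merge(level, 1, Integer::sum);
        }
        return counts;
    }
}

// Hypothetical rule mapping a window's counts to the next filter level.
final class FilterLevelRule {
    static String decide(Map<String, Integer> counts, long windowMs, double errorsPerSecondThreshold) {
        if (windowMs <= 0) {
            throw new IllegalArgumentException("windowMs must be positive");
        }
        int errors = counts.getOrDefault("ERROR", 0);
        double errorsPerSecond = errors / (windowMs / 1000.0);
        // Assumed policy: a high error rate means more detail is worth collecting.
        return errorsPerSecond > errorsPerSecondThreshold ? "DEBUG" : "ERROR";
    }
}
```

In the streaming job the counting would happen per window and the decision would be wrapped into a data packet for the sink; here both steps are ordinary static methods so they can be exercised without a cluster.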

Dynamic Log Collection – Full Flow
[Diagram: logs flow from the two Edge NiFi instances to the Core NiFi and on to Flink; new filters flow back from Flink through the Core NiFi to each Edge NiFi.]

Summary
• Use NiFi to drive data from sources to Flink
• Leverage Flink results to adjust your dataflows
Sources
• [1] https://nifi.apache.org/
Resources
• https://github.com/bbende/nifi-streaming-examples
• https://github.com/apache/flink/tree/master/flink-examples/flink-examples-streaming
• https://flink.apache.org/news/2015/02/09/streaming-example.html
Contact Info:
• Email: bbende@hortonworks.com
• Twitter: @bbende

Thank you
