Introduction to DataFlow management using Apache NiFi
Presented by: Anshuman Ghosh
Topics we will cover
 DataFlow and its common challenges
 What is Apache NiFi – History, key features, core components
 Architecture to start with NiFi (single-server setup)
 Architecture to scale with NiFi (NiFi cluster setup)
 Fundamentals of NiFi Web UI
 Building a NiFi DataFlow Processor
 Live demo
 Testing
 Deployment and automation
 What next?
 Q&A
DataFlow
 The term “DataFlow” can be used in a variety of contexts.
 In our context it is the flow of information between systems.
 It is crucial to have a robust platform to create, manage and automate the
flow of enterprise data.
 There are many tools for data gathering and data flow, but more often
than not we lack an integrated platform for that.
 Ideally, all of these tools would work together with seamless integration.
What enterprises look for
To be able to get data from any source
… To the systems that perform Analytics
… And to those that make the data available to users
Common DataFlow challenges
 System failure
 Mismatch between data production and consumption rates
 Dynamically changing data priorities
 Changes in protocols and formats; new systems, new protocols
 Need for bidirectional data flow
 Transparency and control
 Security and privacy
Brief history of Apache NiFi
 Developed at NSA (National Security Agency, USA) for over 8 years.
 Onyara’s engineers, while at the NSA, developed a project called “Niagara
Files”, which later went on to become NiFi.
 Through the NSA Technology Transfer Program it was made available as an open-
source Apache project, “Apache NiFi”, in 2014.
 Hortonworks partnered with Onyara on “Hortonworks DataFlow, powered by
Apache NiFi”.
What is Apache NiFi
 Holistically Apache NiFi is an integrated platform to collect, conduct and
curate real-time data (data in motion).
 Provides end-to-end DataFlow management from any source* to any
destination*.
 Provides data logistics – real-time operational visibility and control of
DataFlow.
 Supports powerful and scalable directed graphs of data routing and data
transformation.
 All these in a reliable and secure manner.
*complete list of sources and destinations in the official documentation
Key features
 Guaranteed data delivery – “at least once” semantics
 Data buffering and Back pressure
 Data prioritization in queue
 Flow-specific settings for “latency vs. throughput” trade-offs
 Data provenance
 Visual control
 Flow templates
 Recovery/ Recording through content repository
 Clustering to scale-out
 Security
 Classloader Isolation
Core components of NiFi
 NiFi at its core follows the concepts of Flow-Based Programming.
 The core components of NiFi are (a minimal, illustrative Processor sketch follows this slide)
 FlowFile – the packet of information moving through the flow (attributes plus content)
 FlowFile Processor – the processing engine; a black box that does the actual work.
 Connection – the link between Processors; acts as a bounded buffer (queue).
 Flow Controller – the scheduler; manages threads and the exchange of FlowFiles between Processors.
 Process Group – a compact function or subnet of Processors and Connections
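For orientation only, here is a minimal sketch of what these pieces look like in code. It is not from the slides: the class name, the "greeting" attribute and the single "success" relationship are illustrative assumptions; only the AbstractProcessor, ProcessSession and Relationship types are the actual NiFi API.

```java
// Minimal, illustrative Processor sketch (assumed names; not from the slides).
import java.util.Set;

import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.Relationship;

public class HelloFlowFileProcessor extends AbstractProcessor {

    // A Relationship is the named outlet that a Connection attaches to
    static final Relationship REL_SUCCESS =
            new Relationship.Builder().name("success").build();

    @Override
    public Set<Relationship> getRelationships() {
        return Set.of(REL_SUCCESS);
    }

    @Override
    public void onTrigger(final ProcessContext context, final ProcessSession session) {
        // The Flow Controller schedules this method; the session hands us a
        // FlowFile from the incoming Connection's queue (the bounded buffer)
        FlowFile flowFile = session.get();
        if (flowFile == null) {
            return; // nothing queued on this trigger
        }
        // A FlowFile is attributes (metadata) plus content; here we only add an attribute
        flowFile = session.putAttribute(flowFile, "greeting", "hello");
        // Hand the FlowFile to whatever Connection is bound to the 'success' relationship
        session.transfer(flowFile, REL_SUCCESS);
    }
}
```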
Core components diagram
 This is how a typical NiFi DataFlow might look
NiFi Architecture
 NiFi executes within a JVM on a host Operating System.
NiFi Architecture – Clustering
 Typical NiFi cluster
Core components of NiFi Cluster
 NiFi Cluster Manager
 Nodes
 Primary Node
 Isolated Processors
 Heartbeats
Fundamentals of the Web UI
Building a DataFlow Processor
 Drag the “Processor” icon from the “Component Toolbar” onto the canvas; this
opens the ‘Add Processor’ dialog
Building a DataFlow Processor
 General ‘SETTINGS’ for the processor
Building a DataFlow Processor
 ‘SCHEDULING’ information
Building a DataFlow Processor
 Setting up mandatory and optional ‘PROPERTIES’
Building a DataFlow Processor
 Auto alert mechanism
 If there is a configuration error, the processor cannot be started
Building a DataFlow Processor
 If everything is set, we are ready to start the processor
Demo 1
 In this demo, we will go through a NiFi DataFlow that deals with the
following steps
 Connect to Kafka and consume from a topic.
 Store consumed data in local storage (optional).
 Anonymize IP address.
 Merge content before writing to HDFS (small file issues).
 Finally store Kafka data onto HDFS
 Look into error handling.
 Look into use of expression language.
Demo 2
 In this demo, we will go through a NiFi DataFlow that deals with the
following steps
 Collect/ fetch data files from a local location.
 Update/ add attributes.
 Parse JSON strings to DB Insert statements.
 Connect to PostgreSQL and Insert.
 Error handling.
Unit testing components
 For component testing, the nifi-mock module can be used with JUnit.
 The TestRunner interface allows us to test Processors and Controller Services.
 We need to instantiate and obtain a new TestRunner (org.apache.nifi.util)
 Add Controller Services and configure them
 Set Processor properties with setProperty(PropertyDescriptor, String)
 Enqueue FlowFiles by using the enqueue methods of the TestRunner class
 Start the Processor by calling the run() method of TestRunner
 Validate output using TestRunner’s assertAllFlowFilesTransferred and
assertTransferCount methods
 More details can be found here – https://nifi.apache.org/docs/nifi-
docs/html/developer-guide.html#testing
 Add the nifi-mock Maven dependency
 Call the static newTestRunner method of the TestRunners class
 Call addControllerService to add a Controller Service
 Set its properties with setProperty(ControllerService, PropertyDescriptor, String)
 Enable services with enableControllerService(ControllerService)
 Set Processor properties with setProperty(PropertyDescriptor, String)
 Enqueue input using the overloaded enqueue methods for byte[], InputStream, or Path
 Call run(int); this invokes the methods annotated with @OnScheduled, the Processor’s
onTrigger method, and then the @OnUnscheduled and finally @OnStopped
methods
 Validate results with the assertAllFlowFilesTransferred and assertTransferCount methods
 Access output FlowFiles by calling getFlowFilesForRelationship() (a test sketch follows below)
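Putting those steps together, a minimal test might look like the sketch below. The processor class (MyJsonProcessor), its SOME_PROPERTY descriptor, the REL_SUCCESS relationship and the "my.attribute" attribute are hypothetical placeholders; only TestRunner, TestRunners and MockFlowFile are the actual nifi-mock API (JUnit 4 assumed).

```java
// Minimal TestRunner sketch; MyJsonProcessor and its members are hypothetical.
import java.nio.charset.StandardCharsets;

import org.apache.nifi.util.MockFlowFile;
import org.apache.nifi.util.TestRunner;
import org.apache.nifi.util.TestRunners;
import org.junit.Test;

public class MyJsonProcessorTest {

    @Test
    public void testFlowFileRoutedToSuccess() {
        // Create a TestRunner for the processor under test
        final TestRunner runner = TestRunners.newTestRunner(MyJsonProcessor.class);

        // Configure processor properties
        runner.setProperty(MyJsonProcessor.SOME_PROPERTY, "some-value");

        // Enqueue test input as the content of a FlowFile
        runner.enqueue("{\"ip\":\"10.0.0.1\"}".getBytes(StandardCharsets.UTF_8));

        // Trigger once: @OnScheduled -> onTrigger -> @OnUnscheduled/@OnStopped
        runner.run(1);

        // Validate routing and inspect the output FlowFile
        runner.assertAllFlowFilesTransferred(MyJsonProcessor.REL_SUCCESS, 1);
        final MockFlowFile out =
                runner.getFlowFilesForRelationship(MyJsonProcessor.REL_SUCCESS).get(0);
        out.assertAttributeExists("my.attribute");
    }
}
```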
Error handling
 The following can occur:
 Unexpected data format
 Network connection or disk failure
 Bug in a Processor
 ProcessException vs. all other exceptions (e.g. NullPointerException)
 ProcessException – the framework rolls back and penalizes the FlowFiles
 All others – the framework rolls back, penalizes the FlowFiles and yields the Processor (see the sketch below)
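To make that distinction concrete, below is an illustrative sketch in the same shape as the earlier Processor example (class name, relationships and the caught exception type are assumptions, not from the slides): an expected, data-related problem keeps the FlowFile in the flow by routing it to a 'failure' relationship, while anything unexpected is thrown as a ProcessException so the framework rolls back the session and penalizes the FlowFile.

```java
// Illustrative error-handling sketch (assumed names; not from the slides).
import java.util.Set;

import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.Relationship;
import org.apache.nifi.processor.exception.ProcessException;

public class ExampleErrorHandlingProcessor extends AbstractProcessor {

    static final Relationship REL_SUCCESS =
            new Relationship.Builder().name("success").build();
    static final Relationship REL_FAILURE =
            new Relationship.Builder().name("failure").build();

    @Override
    public Set<Relationship> getRelationships() {
        return Set.of(REL_SUCCESS, REL_FAILURE);
    }

    @Override
    public void onTrigger(final ProcessContext context, final ProcessSession session)
            throws ProcessException {
        FlowFile flowFile = session.get();
        if (flowFile == null) {
            return;
        }
        try {
            // ... read/transform the FlowFile content here ...
            session.transfer(flowFile, REL_SUCCESS);
        } catch (IllegalArgumentException e) {
            // Unexpected data format: an expected kind of failure, so keep the
            // data in the flow and route it to 'failure' for separate handling
            getLogger().error("Input could not be parsed", e);
            session.transfer(flowFile, REL_FAILURE);
        } catch (Exception e) {
            // Anything else: throwing ProcessException makes the framework
            // roll back the session and penalize the FlowFile
            throw new ProcessException("Unexpected failure while processing", e);
        }
    }
}
```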
Testing automation, Deployment
 NiFi provides a ‘ReST’ API for all components; the complete documentation can
be found here: https://nifi.apache.org/docs/nifi-docs/rest-api/index.html
 The Apache NiFi community is working to improve this area
 We can set up the deployment in the following way (a ReST-driven sketch follows this list)
 Create the application, i.e. the entire DataFlow, on your local machine and test it.
 Create a Process Group around it (optional).
 Create a template. (Can be done from the Web UI or a ReST API call)
 Download the template. (Can be done from the Web UI or a ReST API call)
 Use a ReST API call to import the template into the new environment.
 Use ReST API calls to update Processors (properties, schedule, settings, etc.)
 Use a ReST API call to instantiate the template
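As an illustration of the “download” and “instantiate” steps, here is a hedged sketch using Java’s built-in HttpClient against the ReST API. The base URL, template id and process-group id are placeholders, security (TLS/tokens) is ignored, and the endpoint paths should be verified against the REST API reference linked above for the NiFi version in use.

```java
// Hedged deployment sketch over the NiFi ReST API (placeholder ids and host).
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;

public class NiFiTemplateDeploy {

    public static void main(String[] args) throws Exception {
        final String nifi = "http://nifi-host:8080/nifi-api";   // placeholder base URL
        final String templateId = "<template-uuid>";            // placeholder
        final String processGroupId = "<process-group-uuid>";   // placeholder
        final HttpClient client = HttpClient.newHttpClient();

        // 1) Download a template as XML (GET /templates/{id}/download)
        HttpRequest download = HttpRequest.newBuilder()
                .uri(URI.create(nifi + "/templates/" + templateId + "/download"))
                .GET()
                .build();
        client.send(download, HttpResponse.BodyHandlers.ofFile(Path.of("template.xml")));

        // 2) Instantiate the template onto a process group's canvas
        //    (POST /process-groups/{id}/template-instance with a small JSON body)
        String body = "{\"templateId\":\"" + templateId + "\",\"originX\":0.0,\"originY\":0.0}";
        HttpRequest instantiate = HttpRequest.newBuilder()
                .uri(URI.create(nifi + "/process-groups/" + processGroupId + "/template-instance"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        HttpResponse<String> response =
                client.send(instantiate, HttpResponse.BodyHandlers.ofString());
        System.out.println("Instantiate returned HTTP " + response.statusCode());
    }
}
```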
Deployment
 There is one more option:
 Copy the whole flow (flow.xml.gz) from one environment to another.
 This copies the entire canvas.
 Sensitive property encryption needs to be taken care of (the sensitive properties key must match across environments).
What is next
 We are planning to keep working on the testing and deployment side and will share updates.
 Please read more on NiFi development here –
https://nifi.apache.org/docs/nifi-docs/html/developer-guide.html
 And for user guide – https://nifi.apache.org/docs/nifi-docs/html/user-
guide.html
 We have carried out POCs on some of our real use cases; please find them
here
 Link HDFS data ingestion using Apache NiFi
 Link How to setup Apache NiFi
 Link Expression Language Guide
 For any questions and/or suggestions, please come by or write to us.
Q&A
 Questions?
Thank you!
Presented by: Anshuman Ghosh
