This document summarizes a presentation on developing with Apache NiFi. It discusses NiFi's REST API for programmatic access, the NiFi developer guide for building custom processors, and tips for contributing to the NiFi project through the GitHub pull request process. It also gives an overview of key aspects of the NiFi architecture, such as its repositories and the FlowFile lifecycle.
NiFi REST API
• The REST API provides programmatic access to command and control a NiFi instance in real time.
• Start and stop processors, monitor queues, query provenance data, and more.
NiFi REST API
• Every component in NiFi has a unique ID.
• Every operation on a component is a REST request to the NiFi instance.
• Most operations need to specify a component ID.
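• REST API usage:
https:// /nifi-api/process-groups/015d1045-0b88-1db2-da38-cb71ac006792/process-groups
- NiFi instance URL: the scheme and host at the start of the URL
- REST path: /nifi-api/process-groups
- Unique component ID: 015d1045-0b88-1db2-da38-cb71ac006792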
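The URL structure above can be sketched in a few lines of Java. This is only an illustration of how the instance URL, REST path, and component ID fit together. The host nifi.example.com:8443 and the helper name childProcessGroupsUri are assumptions made for the example, not part of the presentation.

```java
import java.net.URI;

class NiFiRestUrlSketch {
    // Root of the REST API, as shown in the example URL.
    static final String API_ROOT = "/nifi-api";

    // Joins instance URL + REST path + component ID into one request URI.
    static URI childProcessGroupsUri(String instanceUrl, String processGroupId) {
        // Drop a trailing slash so the joined URL has no "//" in the path.
        String base = instanceUrl.endsWith("/")
                ? instanceUrl.substring(0, instanceUrl.length() - 1)
                : instanceUrl;
        return URI.create(base + API_ROOT + "/process-groups/" + processGroupId + "/process-groups");
    }

    public static void main(String[] args) {
        URI uri = childProcessGroupsUri(
                "https://nifi.example.com:8443",          // hypothetical host
                "015d1045-0b88-1db2-da38-cb71ac006792");  // component ID from the slide
        System.out.println(uri);
    }
}
```

An HTTP client would then send its request to this URI. Authentication and TLS setup depend on how the instance is secured and are left out here.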
NiFi REST API
• RevisionDTO – identifies the version of the component as seen by the client
• ProcessGroupDTO – the component body of a ProcessGroup
• PositionDTO – the position on the canvas
• All DTO and Entity classes are provided by the following Maven artifact:
<dependency>
<groupId>org.apache.nifi</groupId>
<artifactId>nifi-client-dto</artifactId>
<version>1.1.2</version>
</dependency>
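To show how these three DTOs relate, here is a minimal sketch that builds the kind of JSON body a request to create a process group might carry. The field names (revision, version, component, name, position, x, y) are assumptions meant to mirror the DTO classes. Real code would more likely populate the classes from nifi-client-dto and serialize them with a JSON library.

```java
class ProcessGroupPayloadSketch {
    // Hand-built JSON, for illustration only. The name is not escaped,
    // so this sketch is only safe for names without quotes or backslashes.
    static String createProcessGroupJson(long revisionVersion, String name, double x, double y) {
        return "{\"revision\":{\"version\":" + revisionVersion + "},"
                + "\"component\":{\"name\":\"" + name + "\","
                + "\"position\":{\"x\":" + x + ",\"y\":" + y + "}}}";
    }

    public static void main(String[] args) {
        // Revision 0 is assumed here for a component that does not exist yet.
        System.out.println(createProcessGroupJson(0, "demo", 100.0, 200.0));
    }
}
```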
REST API Recap
• Every component in NiFi has a unique ID.
• Every operation on a component is a REST request to the NiFi instance.
• Most operations need to specify a component ID.
NiFi in Depth
• Repositories
• Life of FlowFile
FlowFile Mechanism in Depth
NiFi in Depth
• FlowFiles are at the heart of NiFi and its flow-based design.
• A FlowFile is a data record that consists of a pointer to its content and its attributes, and is associated with provenance events.
• Attributes are key/value pairs that act as metadata for the FlowFile.
• Content is the actual data of the file.
• Provenance is a record of what has happened to the FlowFile.
NiFi in Depth
• The repositories are immutable.
• The benefits of this are many, including: substantial reduction in storage space required for the typical complex graphs of processing, natural replay capability, taking advantage of OS caching, reduced random read/write performance hits, and being easy to reason over.
• All three repositories are directories on local storage that persist data.
NiFi in Depth
• The FlowFile Repository contains metadata for all current FlowFiles in the flow.
• The Content Repository holds the content for current and past FlowFiles.
• The Provenance Repository holds the history of FlowFiles.
NiFi in Depth
• FlowFiles are held in a map in JVM memory.
• FlowFile metadata includes:
- Attributes
- A pointer to the actual content of the FlowFile
- State (which Connection/Queue it belongs in)
• The FlowFile Repository acts as NiFi’s “Write-Ahead Log”.
• Each change happens as a transactional unit of work.
NiFi in Depth
• NiFi recovers a FlowFile by restoring a snapshot of the FlowFile.
• A snapshot is taken automatically and periodically by the system.
• The system computes a new base checkpoint by serializing the FlowFile map to disk with the filename ‘.partial’.
• Step-by-step WAL in NiFi:
https://cwiki.apache.org/confluence/display/NIFI/NiFi%27s+Write-Ahead+Log+Implementation
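The write-ahead-log idea (log each change, checkpoint periodically, recover from snapshot plus logged changes) can be sketched with plain collections. This is a toy in-memory model under simplifying assumptions, not NiFi's implementation. The class and method names are hypothetical, and nothing is written to disk.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class WriteAheadLogSketch {
    // Live state: FlowFile ID -> name of the queue currently holding it.
    private final Map<String, String> live = new HashMap<>();
    // Last checkpoint (copy of the live map) and the edits logged since then.
    private Map<String, String> snapshot = new HashMap<>();
    private final List<String[]> journal = new ArrayList<>();

    // Log the change first, then apply it to the in-memory map.
    void update(String flowFileId, String queue) {
        journal.add(new String[] {flowFileId, queue});
        live.put(flowFileId, queue);
    }

    // Checkpoint: capture the whole map, then discard the edits it now covers.
    void checkpoint() {
        snapshot = new HashMap<>(live);
        journal.clear();
    }

    // Recovery: start from the snapshot and replay the journal in order.
    Map<String, String> recover() {
        Map<String, String> restored = new HashMap<>(snapshot);
        for (String[] edit : journal) {
            restored.put(edit[0], edit[1]);
        }
        return restored;
    }

    // Defensive copy of the live map, used here only for comparison.
    Map<String, String> liveView() {
        return new HashMap<>(live);
    }

    public static void main(String[] args) {
        WriteAheadLogSketch repo = new WriteAheadLogSketch();
        repo.update("ff-1", "queue-A");
        repo.checkpoint();
        repo.update("ff-1", "queue-B");
        System.out.println(repo.recover().equals(repo.liveView()));
    }
}
```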
Content Repository
• The largest of the repositories; it uses immutability and copy-on-write to maximize speed and thread-safety.
• Resource Claims are Java objects that point to specific files on disk.
• Each FlowFile has a “Content Claim” object, which holds:
- a reference to the Resource Claim
- the offset of the content within the file
- the length of the content
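The relationship between a Resource Claim and a Content Claim can be illustrated with a small model. This is a sketch only: the class names echo the slide's terms but are not NiFi's actual classes, and a byte array stands in for the file on disk.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ContentClaimSketch {
    // Stand-in for a Resource Claim: identifies one container file.
    static final class ResourceClaim {
        final String path;   // hypothetical file name
        final byte[] bytes;  // in-memory stand-in for the file's contents
        ResourceClaim(String path, byte[] bytes) {
            this.path = path;
            this.bytes = bytes;
        }
    }

    // Stand-in for a Content Claim: a slice (offset + length) of a resource claim.
    static final class ContentClaim {
        final ResourceClaim resourceClaim;
        final int offset;
        final int length;
        ContentClaim(ResourceClaim resourceClaim, int offset, int length) {
            this.resourceClaim = resourceClaim;
            this.offset = offset;
            this.length = length;
        }
        // Reads only this claim's bytes out of the shared container.
        byte[] read() {
            return Arrays.copyOfRange(resourceClaim.bytes, offset, offset + length);
        }
    }

    public static void main(String[] args) {
        ResourceClaim file = new ResourceClaim(
                "container-0001", "helloworld".getBytes(StandardCharsets.UTF_8));
        ContentClaim first = new ContentClaim(file, 0, 5);
        ContentClaim second = new ContentClaim(file, 5, 5);
        System.out.println(new String(first.read(), StandardCharsets.UTF_8));
        System.out.println(new String(second.read(), StandardCharsets.UTF_8));
    }
}
```

In this model, several FlowFiles can share one file on disk while each sees only its own content.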
Provenance Repository
• Holds the history of each FlowFile and provides data lineage (chain of custody).
• When a provenance event is created, it copies all of the FlowFile’s attributes, its content pointer, and its state to one location in the Provenance Repository.
• Provenance Repository design decisions:
https://cwiki.apache.org/confluence/display/NIFI/Persistent+Provenance+Repository+Design
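The copy-at-event-time behaviour described above can be shown with a small immutable record. This is an illustrative model, not NiFi's provenance classes. The content pointer format "claim-1:0:128" and the attribute values are made up for the example.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

class ProvenanceEventSketch {
    final String eventType;
    final Map<String, String> attributes; // copy taken when the event is created
    final String contentPointer;

    ProvenanceEventSketch(String eventType, Map<String, String> attributes, String contentPointer) {
        this.eventType = eventType;
        // Copy the attributes so later changes to the FlowFile do not alter history.
        this.attributes = Collections.unmodifiableMap(new HashMap<>(attributes));
        this.contentPointer = contentPointer;
    }

    public static void main(String[] args) {
        Map<String, String> attrs = new HashMap<>();
        attrs.put("filename", "orders.csv");
        ProvenanceEventSketch received = new ProvenanceEventSketch("RECEIVE", attrs, "claim-1:0:128");

        // The FlowFile changes afterwards; the recorded event keeps the old value.
        attrs.put("filename", "orders-renamed.csv");
        System.out.println(received.attributes.get("filename"));
    }
}
```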
Repositories Recap
• The FlowFile Repository contains metadata for all current FlowFiles in the flow.
• The Content Repository holds the content for current and past FlowFiles.
• The Provenance Repository holds the history of FlowFiles.
• Best practices:
- Analyze the contents of a FlowFile as few times as possible.
- Extract key information into attributes.
- Updating the FlowFile Repository is much faster than updating the Content Repository.
Life of FlowFile
• Data Ingress → Pass by Reference → Copy-On-Write → Data Egress
• An important aspect of flow-based programming is the resource-constrained relationships between the black boxes.
• A FlowFile is routed from one processor to another simply by passing a reference to the FlowFile.
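Pass-by-reference and copy-on-write can be illustrated with an append-only content store. This is a conceptual sketch with hypothetical names. It models only the idea that routing copies nothing and that modified content goes to a new location while the original stays intact.

```java
import java.util.ArrayList;
import java.util.List;

class CopyOnWriteSketch {
    // A FlowFile here is just a pointer (index) into the content store.
    static final class FlowFile {
        final int claim;
        FlowFile(int claim) {
            this.claim = claim;
        }
    }

    // Append-only content store: existing entries are never modified.
    private final List<String> contentStore = new ArrayList<>();

    // Data ingress: write the content once and hand back a pointer to it.
    FlowFile ingest(String data) {
        contentStore.add(data);
        return new FlowFile(contentStore.size() - 1);
    }

    // Pass by reference: routing hands over the same FlowFile; no content is copied.
    FlowFile route(FlowFile flowFile) {
        return flowFile;
    }

    // Copy-on-write: new content is written to a new claim; the old claim is untouched.
    FlowFile modify(FlowFile flowFile, String newData) {
        contentStore.add(newData);
        return new FlowFile(contentStore.size() - 1);
    }

    String content(FlowFile flowFile) {
        return contentStore.get(flowFile.claim);
    }

    public static void main(String[] args) {
        CopyOnWriteSketch repo = new CopyOnWriteSketch();
        FlowFile original = repo.ingest("raw");
        FlowFile routed = repo.route(original);
        FlowFile changed = repo.modify(routed, "RAW");
        System.out.println(repo.content(original) + " -> " + repo.content(changed));
    }
}
```

Because the original content is still present after the modification, earlier versions remain available for replay.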
Data Egress
• Eventually a FlowFile is “DROPPED”: it is no longer being processed and is available for deletion.
• It remains in the FlowFile Repository until the next repository checkpoint (24 hours default), at which point all old content claims are released.
• Periodically, the Content Repository asks the Resource Claim Manager which Resource Claims can be cleaned up.
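One way to picture the cleanup question is reference counting: a resource claim becomes eligible for cleanup once no FlowFile points at it any more. The sketch below assumes that model. The class and method names are hypothetical and do not reproduce NiFi's Resource Claim Manager API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class ResourceClaimManagerSketch {
    // Resource claim ID -> number of FlowFiles still referencing it.
    private final Map<String, Integer> claimantCounts = new HashMap<>();

    // Called when a FlowFile starts pointing at the resource claim.
    void increment(String resourceClaim) {
        claimantCounts.merge(resourceClaim, 1, Integer::sum);
    }

    // Called when a FlowFile is dropped or its content pointer moves elsewhere.
    void decrement(String resourceClaim) {
        claimantCounts.computeIfPresent(resourceClaim, (id, count) -> Math.max(0, count - 1));
    }

    // Claims with no remaining claimants can be cleaned up.
    Set<String> destructable() {
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : claimantCounts.entrySet()) {
            if (entry.getValue() == 0) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        ResourceClaimManagerSketch manager = new ResourceClaimManagerSketch();
        manager.increment("claim-1");
        manager.increment("claim-1");
        manager.increment("claim-2");
        manager.decrement("claim-1"); // one FlowFile still uses claim-1
        manager.decrement("claim-2"); // nobody uses claim-2 any more
        System.out.println(manager.destructable());
    }
}
```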
Common Processor Patterns
• Data Ingress
• Data Egress
• Route Based on Content
• Route Based on Attribute
• Split Content
• Update Attributes Based on Content
• Enrich/Modify Content
Error Handling
• Throwing ProcessException signals a known failure; for this or any other Exception, the framework rolls back the session.
• Don’t catch general Exception or Throwable.
• Penalization vs. yielding
Session rollback
• ProcessSession provides transactionality.
• Call commit() or rollback() to end the session.
• Best practice is to keep it simple.
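The transactional behaviour of a session can be modelled with two queues. The sketch below is a simplified stand-in, not the NiFi ProcessSession API: the method names get, transfer, commit, and rollback only mirror the idea that nothing leaves the session until commit, and that rollback puts everything back.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class SessionSketch {
    final Deque<String> input = new ArrayDeque<>();   // incoming queue
    final List<String> success = new ArrayList<>();   // destination queue
    private final List<String> taken = new ArrayList<>();   // pulled in this session
    private final List<String> pending = new ArrayList<>(); // transfers not yet committed

    // Pull the next FlowFile from the input queue, remembering it for rollback.
    String get() {
        String flowFile = input.poll();
        if (flowFile != null) {
            taken.add(flowFile);
        }
        return flowFile;
    }

    // Buffer the transfer; nothing is visible downstream yet.
    void transfer(String flowFile) {
        pending.add(flowFile);
    }

    // Commit: all buffered transfers become visible together.
    void commit() {
        success.addAll(pending);
        taken.clear();
        pending.clear();
    }

    // Rollback: discard buffered work and restore the input queue's original order.
    void rollback() {
        for (int i = taken.size() - 1; i >= 0; i--) {
            input.addFirst(taken.get(i));
        }
        taken.clear();
        pending.clear();
    }

    public static void main(String[] args) {
        SessionSketch session = new SessionSketch();
        session.input.add("a");
        session.input.add("b");
        session.transfer(session.get());
        session.rollback();
        System.out.println(session.input + " " + session.success);
    }
}
```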
Testing
• NiFi provides a mock framework for Processor testing.
Use the TestRunner interface.
• 1 - Add a Controller Service if needed
runner.addControllerService()
• 2 - Set property values
Map<String, String> attributes = new HashMap<>();
attributes.put("property name", "property value");
• 3 - Enqueue FlowFiles
runner.enqueue("Select ….".getBytes(), attributes);
• 4 - Run the processor
runner.run();
runner.assertAllFlowFilesTransferred(Success, 1);
Recap Developer guide
• Understand the life cycle of a Processor
• Understand the supporting component APIs
• Understand general processor patterns
• Understand how to handle processing failures
• Understand how to test a processor
Contribution preparation
• NiFi Contributor Guide
https://cwiki.apache.org/confluence/display/NIFI/Contributor+Guide
• Git Feature Branch Workflow
https://www.atlassian.com/git/tutorials/comparing-workflows
• How to Write a Git Commit Message
https://chris.beams.io/posts/git-commit/
Contribution feedback
• Don’t leave trailing whitespace.
• Follow the GitHub pull request procedure.
• Start the commit title with the JIRA issue ID, e.g. NIFI-2829.
• Open source CI builds fail all the time; don’t panic.
• Stay patient and humble when receiving reviewers’ feedback.
Contribution feedback
• When dealing with time zone problems, consider running the build in different time zones.
• Java 1.8 includes a standard library (java.time) with good support for handling date and time issues:
https://docs.oracle.com/javase/8/docs/api/java/time/package-summary.html
https://magiclen.org/java-8-date-time-api/
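As a small illustration of why time zones matter in tests, the sketch below formats one instant in two explicitly named zones using java.time. The helper name formatIn and the chosen zones are only examples. The point is that code relying on the system default zone can produce different results on a contributor's machine and on a CI server, while an explicit ZoneId gives the same result everywhere.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

class TimeZoneSketch {
    // Formats the instant in the given zone instead of the JVM's default zone.
    static String formatIn(Instant instant, String zoneId) {
        return ZonedDateTime.ofInstant(instant, ZoneId.of(zoneId))
                .format(DateTimeFormatter.ISO_OFFSET_DATE_TIME);
    }

    public static void main(String[] args) {
        Instant instant = Instant.parse("2017-06-01T00:00:00Z");
        System.out.println(formatIn(instant, "UTC"));
        System.out.println(formatIn(instant, "Asia/Tokyo"));
    }
}
```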
Reference
• Official Apache NiFi
https://nifi.apache.org/
• All Micron NiFi instances
http://nifi.micron.com/
• Hortonworks forum