NIFI DEVELOPER GUIDE
Presenter Deon Huang
2017/7/7
Agenda
• NiFi REST API
• NiFi In Depth
• NiFi Developer Guide
• Custom Processor
• Contribution Sharing
NiFi REST API
• The REST API provides programmatic access to command and control a
NiFi instance in real time.
• Start and stop processors, monitor queues, query provenance data, and
more.
NiFi REST API
What happens?
NiFi REST API
We’ve sent a REST request to the NiFi instance
NiFi REST API
Request URL
Component ID
Request body we actually send
NiFi REST API
• Every component in NiFi has a unique ID.
• Every operation on a component is a REST request to the NiFi instance.
• Most operations need to specify a component ID.
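• https:// /nifi-api/process-groups/015d1045-0b88-1db2-da38-cb71ac006792/process-groups
- https:// : NiFi instance URL
- /nifi-api : REST API usage
- /process-groups/.../process-groups : REST path
- 015d1045-0b88-1db2-da38-cb71ac006792 : unique component ID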
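The URL anatomy above can be sketched in plain Java. This is a minimal sketch, not NiFi client code: the class name `NifiUrls` and the method `childProcessGroups` are hypothetical, and the instance URL is passed in as a parameter because the slide does not show the host. Only the `/nifi-api/process-groups/{id}/process-groups` path is taken from the slide.

```java
import java.net.URI;

/** Hypothetical helper that assembles a NiFi REST URL from its parts. */
final class NifiUrls {
    private NifiUrls() {}

    /**
     * Builds the URL that addresses the child process groups of a process group.
     *
     * @param instanceUrl base URL of the NiFi instance (assumed to have no trailing slash)
     * @param componentId unique ID of the parent process group
     */
    static URI childProcessGroups(String instanceUrl, String componentId) {
        // instance URL + REST API prefix + REST path containing the component ID
        return URI.create(instanceUrl + "/nifi-api/process-groups/" + componentId + "/process-groups");
    }
}
```

Calling `NifiUrls.childProcessGroups("https://nifi.example.invalid", "015d1045-0b88-1db2-da38-cb71ac006792")` reproduces the path on the slide, with `nifi.example.invalid` standing in as a placeholder host.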
NiFi REST API
• RevisionDTO
NiFi REST API
• RevisionDTO – identifies the version of the component as seen by the client
• ProcessGroupDTO – component body of a ProcessGroup
• PositionDTO – position on the canvas
• All DTO and Entity classes are provided by this Maven dependency:
<dependency>
    <groupId>org.apache.nifi</groupId>
    <artifactId>nifi-client-dto</artifactId>
    <version>1.1.2</version>
</dependency>
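To see how the three DTOs fit together, the request body for creating a process group can be approximated as nested JSON. This is a rough sketch under stated assumptions: the class `ProcessGroupRequest` is hypothetical and builds the string by hand instead of using the real DTO classes, and the field names (`revision`, `version`, `component`, `name`, `position`, `x`, `y`) are assumed from the DTO names and should be checked against the REST API documentation for your NiFi version.

```java
/** Hypothetical builder for the JSON body of a "create process group" request. */
final class ProcessGroupRequest {
    private ProcessGroupRequest() {}

    /**
     * @param revisionVersion component version as last seen by the client (0 for a new component)
     * @param name            name of the new process group (assumed to need no JSON escaping)
     * @param x               horizontal position on the canvas
     * @param y               vertical position on the canvas
     */
    static String toJson(long revisionVersion, String name, double x, double y) {
        return "{"
            + "\"revision\":{\"version\":" + revisionVersion + "},"  // RevisionDTO
            + "\"component\":{"                                      // ProcessGroupDTO
            + "\"name\":\"" + name + "\","
            + "\"position\":{\"x\":" + x + ",\"y\":" + y + "}"       // PositionDTO
            + "}}";
    }
}
```

In real code the DTO classes from the dependency above, serialized by a JSON library, would replace the string concatenation.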
REST API Recap
• Every component in NiFi has a unique ID.
• Every operation on a component is a REST request to the NiFi instance.
• Most operations need to specify a component ID.
NiFi in Depth
• Repositories
• Life of FlowFile
FlowFile Mechanism in Depth
NiFi Architecture
• Attributes
1. HashMap in the JVM
2. WAL in the FlowFile Repository
• Content
- Immutable on disk
NiFi in Depth
• FlowFiles are at the heart of NiFi and its flow-based design.
• A FlowFile is a data record consisting of a pointer to its content and its
attributes, and it is associated with provenance events.
• Attributes are key/value pairs that act as metadata for the FlowFile
• Content is the actual data of the file
• Provenance is a record of what has happened to the FlowFile
NiFi in Depth
• The repositories are immutable.
• This has many benefits: a substantial reduction in the storage space
required for typical complex processing graphs, natural replay
capability, use of OS caching, fewer random read/write performance
hits, and a design that is easy to reason about.
• All three repositories are directories on local storage used to persist data.
NiFi in Depth
• The FlowFile repository contains metadata for all current FlowFiles in the
flow
• The Content Repository holds the content for current and past FlowFiles
• The Provenance Repository holds the history of FlowFiles
NiFi in Depth
• FlowFiles are held in a Map in JVM memory
• FlowFile metadata includes
- Attributes
- A pointer to the actual content of the FlowFile
- State (which Connection/Queue the FlowFile belongs in)
• The FlowFile Repository acts as NiFi’s “Write-Ahead Log”
• Each change happens as a transactional unit of work
NiFi in Depth
• NiFi recovers a FlowFile by restoring a snapshot of the FlowFile
• A snapshot is taken automatically and periodically by the system
• A new base checkpoint is computed by serializing the FlowFile map to disk
with the filename ‘.partial’
• Step by Step WAL in NiFi
https://cwiki.apache.org/confluence/display/NIFI/NiFi%27s+Write-Ahead+Log+Implementation
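The snapshot-plus-log idea described above can be illustrated with a toy model. This is a simplified sketch, not NiFi's implementation: the class `ToyWriteAheadLog` is hypothetical, and the "disk" is simulated with in-memory collections so the example stays self-contained.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy write-ahead log: every update is logged before it is applied to the in-memory map. */
final class ToyWriteAheadLog {
    private Map<String, String> snapshot = new HashMap<>();    // last checkpoint (simulated disk)
    private final List<String[]> journal = new ArrayList<>();  // updates since the checkpoint (simulated disk)
    private final Map<String, String> live = new HashMap<>();  // working state in JVM memory

    /** Records which queue a FlowFile is in: log first, then apply. */
    void update(String flowFileId, String queue) {
        journal.add(new String[] {flowFileId, queue});
        live.put(flowFileId, queue);
    }

    /** Writes a new base checkpoint and discards the journal entries it covers. */
    void checkpoint() {
        snapshot = new HashMap<>(live);
        journal.clear();
    }

    /** Simulates a restart: restore the snapshot, then replay the journal on top of it. */
    Map<String, String> recover() {
        Map<String, String> restored = new HashMap<>(snapshot);
        for (String[] entry : journal) {
            restored.put(entry[0], entry[1]);
        }
        return restored;
    }
}
```

Because every change reaches the journal before the map, a crash between checkpoints loses nothing: `recover()` rebuilds the same state the live map held.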
Content Repository
• The largest of the repositories; it uses immutability and copy-on-write to
maximize speed and thread safety
• Resource Claims are Java objects that point to specific files on disk
• The FlowFile has a “Content Claim” object
- a reference to the Resource Claim
- offset of the content within the file
- length of the content
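The three parts of a content claim listed above can be modeled in a few lines. This is an illustrative sketch only: `ToyResourceClaim` and `ToyContentClaim` are hypothetical classes, and a byte array stands in for the file on disk.

```java
import java.util.Arrays;

/** Toy stand-in for a file on disk that holds the content of many FlowFiles. */
final class ToyResourceClaim {
    final byte[] bytes;

    ToyResourceClaim(byte[] bytes) {
        this.bytes = bytes;
    }
}

/** Toy content claim: which resource claim, where the content starts, and how long it is. */
final class ToyContentClaim {
    final ToyResourceClaim resourceClaim;
    final int offset;
    final int length;

    ToyContentClaim(ToyResourceClaim resourceClaim, int offset, int length) {
        this.resourceClaim = resourceClaim;
        this.offset = offset;
        this.length = length;
    }

    /** Returns a copy of only the bytes that belong to this claim. */
    byte[] read() {
        return Arrays.copyOfRange(resourceClaim.bytes, offset, offset + length);
    }
}
```

With a resource claim holding the bytes of `helloworld`, a content claim with offset 5 and length 5 reads back `world`, while a second claim with offset 0 and length 5 can share the same underlying storage.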
Provenance Repository
• Stores the history of each FlowFile, providing data lineage (chain of custody)
• When a provenance event is created, it copies all the FlowFile’s
attributes, its content pointer, and its state to one location in the
Provenance Repo
• Provenance Repository design decisions
https://cwiki.apache.org/confluence/display/NIFI/Persistent+Provenance+Repository+Design
Provenance Repository
• Provenance Event
- CLONE
- ATTRIBUTES_MODIFIED
- CONTENT_MODIFIED
- CREATE
- DROP
- EXPIRE
- FORK
- JOIN
- ROUTE
…
Repositories Recap
• The FlowFile repository contains metadata for all current FlowFiles in the
flow
• The Content Repository holds the content for current and past FlowFiles
• The Provenance Repository holds the history of FlowFiles
• Best practice
- Analyze the content of a FlowFile as few times as possible
- Extract key information into attributes
- Updating the FlowFile Repository is much faster than updating the Content Repository
Life of FlowFile
• Data Ingress → Pass by Reference → Copy-On-Write → Data Egress
• An important aspect of flow-based programming is the
resource-constrained relationships between the black boxes.
• FlowFiles are routed from one processor to another simply by passing a
reference to the FlowFile
Pass by Reference
Funnels
Copy On Write
Update Attribute
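Pass by reference and copy-on-write, as named in the slide titles above, can be illustrated with a small immutable type. This is a conceptual sketch, not NiFi's FlowFile class: `ToyFlowFile` is hypothetical, and a byte array stands in for the content claim.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Toy immutable FlowFile: attributes plus a reference to content that is never edited in place. */
final class ToyFlowFile {
    final Map<String, String> attributes;
    final byte[] content; // treated as immutable; stands in for a content claim

    ToyFlowFile(Map<String, String> attributes, byte[] content) {
        this.attributes = Collections.unmodifiableMap(new HashMap<>(attributes));
        this.content = content;
    }

    /** Attribute update: new metadata, same content reference (no content is copied). */
    ToyFlowFile withAttribute(String key, String value) {
        Map<String, String> copy = new HashMap<>(attributes);
        copy.put(key, value);
        return new ToyFlowFile(copy, content);
    }

    /** Content update: the new bytes are stored separately and the original stays untouched. */
    ToyFlowFile withContent(byte[] newContent) {
        return new ToyFlowFile(attributes, newContent.clone());
    }
}
```

Updating an attribute is cheap because only the small metadata map is rebuilt, which matches the recap advice to extract key information into attributes instead of repeatedly touching content.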
Data Egress
• Eventually a FlowFile is “DROPPED”: it is no longer being processed and is
available for deletion.
• It remains in the FlowFile Repository until the next repository checkpoint,
when all old content claims are released. Provenance events are kept for a
configurable time after that (24 hours by default).
• Periodically, the Content Repo asks the Resource Claim Manager which
Resource Claims can be cleaned up.
Developer Guide
• Processor
• Reporting Task
• ControllerService
• FlowFilePrioritizer
• AuthorityProvider
Supporting API
• ProcessSession
• ProcessContext
• PropertyDescriptor
• Validator
• ValidationContext
• PropertyValue
• Relationship
• StateManager
• ComponentLog
Processor Life Cycle
• Processor Initialization →
• Exposing Processor’s Relationships →
• Exposing Processor Properties →
• Validating Processor Properties →
• Triggered and Performing the Work →
• ProcessSession finishes
Component Life Cycle
• @OnAdded
• @OnEnabled
• @OnRemoved
• @OnScheduled
• @OnUnscheduled
• @OnStopped
• @OnShutdown
Common Processor Patterns
• Data Ingress
• Data Egress
• Route Based on Content
• Route Based on Attribute
• Split Content
• Update Attributes Based on Content
• Enrich/Modify Content
Error Handling
• A ProcessException signals a known failure; any other Exception is treated
as unexpected. In both cases the session is rolled back.
• Don’t catch general Exception or Throwable.
• Penalization vs. yielding: a FlowFile is penalized, a Processor yields
Session rollback
• ProcessSession provides transactionality
• Call commit() or rollback() to end the session.
• Best practice is to keep session handling simple
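The transactional behavior described above can be pictured with a toy session. This is a conceptual sketch, not the real ProcessSession: `ToySession` is hypothetical, strings stand in for FlowFiles, and a deque and a list stand in for the incoming and outgoing queues.

```java
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Toy session: work is buffered and only becomes visible downstream on commit. */
final class ToySession {
    private final Deque<String> inputQueue;
    private final List<String> outputQueue;
    private final List<String> taken = new ArrayList<>();   // items pulled in this session
    private final List<String> pending = new ArrayList<>(); // transfers not yet committed

    ToySession(Deque<String> inputQueue, List<String> outputQueue) {
        this.inputQueue = inputQueue;
        this.outputQueue = outputQueue;
    }

    /** Takes the next item from the input queue, or returns null if the queue is empty. */
    String get() {
        String item = inputQueue.poll();
        if (item != null) {
            taken.add(item);
        }
        return item;
    }

    /** Buffers an item for the output queue; nothing is visible downstream yet. */
    void transfer(String item) {
        pending.add(item);
    }

    /** Publishes all buffered transfers and ends the unit of work. */
    void commit() {
        outputQueue.addAll(pending);
        pending.clear();
        taken.clear();
    }

    /** Discards buffered transfers and returns taken items to the head of the input queue, in order. */
    void rollback() {
        for (int i = taken.size() - 1; i >= 0; i--) {
            inputQueue.addFirst(taken.get(i));
        }
        taken.clear();
        pending.clear();
    }
}
```

After a rollback the input queue looks as if the session never ran, which is why a failed unit of work can simply be retried later.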
Testing
• NiFi provides a mock framework for Processor testing.
Use the TestRunner interface
• 1-Add a ControllerService if needed
runner.addControllerService("service id", service);
• 2-Set property values
runner.setProperty("property name", "property value");
• 3-Enqueue FlowFiles, optionally with attributes
Map<String, String> attributes = new HashMap<>();
attributes.put("attribute name", "attribute value");
runner.enqueue("Select ….".getBytes(), attributes);
• 4-Run the processor and check the result
runner.run();
runner.assertAllFlowFilesTransferred(Success, 1);
Recap Developer guide
• Understand life cycle of Processor
• Understand supporting component API
• Understand common processor patterns
• Understand how to handle process failure
• Understand how to test processor
Contribution preparation
• NiFi Contributor Guide
https://cwiki.apache.org/confluence/display/NIFI/Contributor+Guide
• Git Feature Branch Workflow
https://www.atlassian.com/git/tutorials/comparing-workflows
• How to Write a Git Commit Message
https://chris.beams.io/posts/git-commit/
Contribution feedback
• Don’t leave trailing whitespace
• Follow the GitHub pull request procedure
• Commit titles start with the JIRA ticket ID, e.g. NIFI-2829
• The open-source CI fails all the time; don’t panic.
• Stay patient and humble when receiving reviewers’ feedback.
Contribution feedback
• When dealing with time zone problems, consider that the code may be
built and run in a different time zone.
• Java 1.8 has a standard library (java.time) that provides great support
for dealing with time issues.
https://docs.oracle.com/javase/8/docs/api/java/time/package-summary.html
https://magiclen.org/java-8-date-time-api/
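One way to keep code and tests independent of the build machine's time zone is to pass the zone explicitly instead of relying on the system default. The sketch below uses only the java.time package mentioned above; the class name `TimeZoneDemo` and the chosen pattern are illustrative assumptions.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

/** Formats timestamps with an explicit zone so results do not depend on the machine's default. */
final class TimeZoneDemo {
    private static final DateTimeFormatter FORMAT = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm");

    private TimeZoneDemo() {}

    /** Renders the instant as local date and time in the given zone. */
    static String format(Instant instant, ZoneId zone) {
        return ZonedDateTime.ofInstant(instant, zone).format(FORMAT);
    }
}
```

For `Instant.parse("2017-07-07T00:00:00Z")`, this returns `2017-07-07 00:00` with `ZoneId.of("UTC")` and `2017-07-07 08:00` with `ZoneId.of("Asia/Taipei")`, on any build machine.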
Reference
• Official Apache NiFi
https://nifi.apache.org/
• All Micron NiFi instances
http://nifi.micron.com/
• Hortonworks forum
