SlideShare a Scribd company logo
1 of 34
Apache NiFi Deep Dive
Bryan Bende – Member of Technical Staff
NJ Hadoop Meetup – May 10th 2016
2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Simplistic View of Enterprise Data Flow
Data Flow
Process and Analyze
Data
Acquire Data
Store Data
3 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Different organizations/business units across different geographic locations…
Realistic View of Enterprise Data Flow
4 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Interacting with different business partners and customers
Realistic View of Enterprise Data Flow
5 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache NiFi
• Created to address the challenges of global enterprise dataflow
• Key features:
– Visual Command and Control
– Data Lineage (Provenance)
– Data Prioritization
– Data Buffering/Back-Pressure
– Control Latency vs. Throughput
– Secure Control Plane / Data Plane
– Scale Out Clustering
– Extensibility
6 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache NiFi
What is Apache NiFi used for?
• Reliable and secure transfer of data between systems
• Delivery of data from sources to analytic platforms
• Enrichment and preparation of data:
– Conversion between formats
– Extraction/Parsing
– Routing decisions
What is Apache NiFi NOT used for?
• Distributed Computation
• Complex Event Processing
• Joins / Complex Rolling Window Operations
7 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache NiFi Deep Dive
8 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Terminology
FlowFile
• Unit of data moving through the system
• Content + Attributes (key/value pairs)
Processor
• Performs the work, can access FlowFiles
Connection
• Links between processors
• Queues that can be dynamically prioritized
Process Group
• Set of processors and their connections
• Receive data via input ports, send data via output ports
9 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Visual Command & Control
• Drag and drop processors to build a flow
• Start, stop, and configure components in real time
• View errors and corresponding error messages
• View statistics and health of data flow
• Create templates of common processor & connections
10 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Provenance/Lineage
• Tracks data at each point as it flows
through the system
• Records, indexes, and makes events
available for display
• Handles fan-in/fan-out, i.e. merging
and splitting data
• View attributes and content at given
points in time
11 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Prioritization
• Configure a prioritizer per connection
• Determine what is important for your
data – time based, arrival order,
importance of a data set
• Funnel many connections down to a
single connection to prioritize across
data sets
• Develop your own prioritizer if needed
12 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Back-Pressure
• Configure back-pressure per connection
• Based on number of FlowFiles or total
size of FlowFiles
• Upstream processor no longer scheduled
to run until below threshold
13 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Latency vs. Throughput
• Choose between lower latency, or higher throughput on each processor
• Higher throughput allows framework to batch together all operations for
the selected amount of time for improved performance
• Processor developer determines whether to support this by using
@SupportsBatching annotation
14 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Security
Control Plane
• Pluggable authentication
– 2-Way SSL, LDAP, Kerberos
• Pluggable authorization
– File-based authority provider out of the box
– Multiple roles to defines access controls
• Audit trail of all user actions
Data Plane
• Optional 2-Way SSL between cluster nodes
• Optional 2-Way SSL on Site-To-Site connections (NiFi-to-NiFi)
• Encryption/Decryption of data through processors
• Provenance for audit trail of data
15 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Extensibility
Built from the ground up with extensions in mind
Service-loader pattern for…
• Processors
• Controller Services
• Reporting Tasks
• Prioritizers
Extensions packaged as NiFi Archives (NARs)
• Deploy NiFi lib directory and restart
• Provides ClassLoader isolation
• Same model as standard components
16 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Rapid Ecosystem Adoption: 130+ Processors
HTTP
Syslog
Email
HTML
Image
Hash Encrypt
Extract
TailMerge
Evaluate
Duplicate Execute
Scan
GeoEnrich
Replace
ConvertSplit
Translate
HL7
FTP
UDP
XML
SFTP
Route Content
Route Context
Route Text
Control Rate
Distribute Load
AMQP
17 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Architecture
OS/Host
JVM
Flow Controller
Web Server
Processor 1 Extension N
FlowFile
Repository
Content
Repository
Provenance
Repository
Local Storage
OS/Host
JVM
Flow Controller
Web Server
Processor 1 Extension N
FlowFile
Repository
Content
Repository
Provenance
Repository
Local Storage
OS/Host
JVM
NiFi Cluster Manager – Request Replicator
Web Server
Master
NiFi Cluster
Manager (NCM)
OS/Host
JVM
Flow Controller
Web Server
Processor 1 Extension N
FlowFile
Repository
Content
Repository
Provenance
Repository
Local Storage
Slaves
NiFi Nodes
18 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
NiFi Architecture – Repositories - Pass by reference
FlowFile Content Provenance
F1 C1 C1 P1 F1
Excerpt of demo flow… What’s happening inside the repositories…
BEFORE
AFTER
F2 C1 C1 P3 F2 – Clone (F1)
F1 C1 P2 F1 – Route
P1 F1 – Create
19 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
NiFi Architecture – Repositories – Copy on Write
FlowFile Content Provenance
F1 C1 C1 P1 F1 - CREATE
Excerpt of demo flow… What’s happening inside the repositories…
BEFORE
AFTER
F1 C1
F1.1 C2 C2 (encrypted)
C1 (plaintext)
P2 F1.1 - MODIFY
P1 F1 - CREATE
20 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Performance & Scaling
• Optimize I/O…
• Separate partition for each repository
• Multiple partitions for content repository
• RAID configurations for redundancy & striping
• Tune JVM Memory, GC, and # of threads
• Scale up with a cluster
• 100s of thousands of events per second per node
• Scale down to a Raspberry Pi
• 10s of thousands of events per second
21 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache NiFi Site-To-Site
22 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Site-To-Site
• Direct communication between two NiFi instances
• Push to Input Port on receiver, or Pull from Output Port on source
• Communicate between clusters, standalone instances, or both
• Handles load balancing and reliable delivery
• Secure connections using certificates (optional)
23 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Site-To-Site Push
• Source connects Remote Process Group to Input Port on destination
• Site-To-Site takes care of load balancing across the nodes in the cluster
NCM
Node 1
Input Port
Node 2
Input Port
Standalone NiFi
RPG
24 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Site-To-Site Pull
• Destination connects Remote Process Group to Output Port on the source
• If source was a cluster, each node would pull from each node in cluster
NCM
Node 1
RPG
Node 2
RPG
Standalone NiFi
Output Port
25 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Site-To-Site Client
• Code for Site-To-Site broken out into reusable module
• https://github.com/apache/nifi/tree/master/nifi-commons/nifi-site-to-site-client
• Foundation for integration with stream processing platforms
Java Program
Site-To-Site Client
Node 1
Output Port
NCM
Node 2
Output Port
26 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Current Stream Processing Integrations
Spark Streaming - NiFi Spark Receiver
• https://github.com/apache/nifi/tree/master/nifi-external/nifi-spark-receiver
Storm – NiFi Spout & Bolt
• https://github.com/apache/nifi/tree/master/nifi-external/nifi-storm-spout
Flink – NiFi Source & Sink
• https://github.com/apache/flink/tree/master/flink-streaming-connectors/flink-connector-nifi
Apex - NiFi Input Operators & Output Operators
• https://github.com/apache/incubator-apex-
malhar/tree/master/contrib/src/main/java/com/datatorrent/contrib/nifi
27 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Bi-Directional Data Flows
28 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Drive Data to Core for Analysis
NiFi
Stream
Processing
NiFi
NiFi
• Drive data from sources to central data center for analysis
• Tiered collection approach at various locations, think regional data centers
Edge
Edge
Core
Batch
Analytics
29 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Dynamically Adjusting Data Flows
• Push analytic results back to core NiFi
• Push results back to edge locations/devices to change behavior
NiFi
NiFi
NiFi
Edge
Edge
Core
Batch
Analytics
Stream
Processing
30 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Future Work
31 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache NiFi Roadmap
• HA Control Plane
– Zero Master cluster, Web UI accessible from any node
– Auto-Election of “Cluster Coordinator” and “Primary Node” through ZooKeeper
• HA Data Plane
– Ability to replicate data across nodes in a cluster
• Multi-Tenancy
– Restrict Access to portions of a flow
– Allow people/groups with in an organization to only access their portions of the flow
• Extension Registry
– Create a central repository of NARs and Templates
– Move most NARs out of Apache NiFi distribution, ship with a minimal set
32 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache NiFi Roadmap
• Variable Registry
– Define environment specific variables through the UI, reference through EL
– Make templates more portable across environments/instances
• Redesign of User Interface
– Modernize look & feel, improve usability, support multi-tenancy
• Continued Development of Integration Points
– New processors added continuously!
• MiNiFi
– Complimentary data collection agent to NiFi’s current approach
– Small, lightweight, centrally managed agent that integrates with NiFi for follow-on dataflow
management
33 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Thanks!
Resources
• Apache NiFi Mailing Lists
– https://nifi.apache.org/mailing_lists.html
• Apache NiFi Documentation
– https://nifi.apache.org/docs.html
• Getting started developing extensions
– https://cwiki.apache.org/confluence/display/NIFI/Maven+Projects+for+Extensions
– https://nifi.apache.org/developer-guide.html
Contact Info:
– Email: bbende@hortonworks.com
– Twitter: @bbende
34 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Thank you

More Related Content

What's hot

Dataflow Management From Edge to Core with Apache NiFi
Dataflow Management From Edge to Core with Apache NiFiDataflow Management From Edge to Core with Apache NiFi
Dataflow Management From Edge to Core with Apache NiFi
DataWorks Summit
 
ORC File & Vectorization - Improving Hive Data Storage and Query Performance
ORC File & Vectorization - Improving Hive Data Storage and Query PerformanceORC File & Vectorization - Improving Hive Data Storage and Query Performance
ORC File & Vectorization - Improving Hive Data Storage and Query Performance
DataWorks Summit
 

What's hot (20)

Building Data Pipelines for Solr with Apache NiFi
Building Data Pipelines for Solr with Apache NiFiBuilding Data Pipelines for Solr with Apache NiFi
Building Data Pipelines for Solr with Apache NiFi
 
Apache Nifi Crash Course
Apache Nifi Crash CourseApache Nifi Crash Course
Apache Nifi Crash Course
 
Dataflow Management From Edge to Core with Apache NiFi
Dataflow Management From Edge to Core with Apache NiFiDataflow Management From Edge to Core with Apache NiFi
Dataflow Management From Edge to Core with Apache NiFi
 
Apache NiFi Meetup - Princeton NJ 2016
Apache NiFi Meetup - Princeton NJ 2016Apache NiFi Meetup - Princeton NJ 2016
Apache NiFi Meetup - Princeton NJ 2016
 
Hive 3 - a new horizon
Hive 3 - a new horizonHive 3 - a new horizon
Hive 3 - a new horizon
 
Securing Hadoop with Apache Ranger
Securing Hadoop with Apache RangerSecuring Hadoop with Apache Ranger
Securing Hadoop with Apache Ranger
 
Dataflow with Apache NiFi - Apache NiFi Meetup - 2016 Hadoop Summit - San Jose
Dataflow with Apache NiFi - Apache NiFi Meetup - 2016 Hadoop Summit - San JoseDataflow with Apache NiFi - Apache NiFi Meetup - 2016 Hadoop Summit - San Jose
Dataflow with Apache NiFi - Apache NiFi Meetup - 2016 Hadoop Summit - San Jose
 
NiFi Best Practices for the Enterprise
NiFi Best Practices for the EnterpriseNiFi Best Practices for the Enterprise
NiFi Best Practices for the Enterprise
 
Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Tog...
Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Tog...Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Tog...
Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Tog...
 
ORC File & Vectorization - Improving Hive Data Storage and Query Performance
ORC File & Vectorization - Improving Hive Data Storage and Query PerformanceORC File & Vectorization - Improving Hive Data Storage and Query Performance
ORC File & Vectorization - Improving Hive Data Storage and Query Performance
 
Integrating Apache Spark and NiFi for Data Lakes
Integrating Apache Spark and NiFi for Data LakesIntegrating Apache Spark and NiFi for Data Lakes
Integrating Apache Spark and NiFi for Data Lakes
 
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiHow to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
 
Nifi workshop
Nifi workshopNifi workshop
Nifi workshop
 
Introduction to Apache NiFi 1.11.4
Introduction to Apache NiFi 1.11.4Introduction to Apache NiFi 1.11.4
Introduction to Apache NiFi 1.11.4
 
LLAP: long-lived execution in Hive
LLAP: long-lived execution in HiveLLAP: long-lived execution in Hive
LLAP: long-lived execution in Hive
 
Data ingestion and distribution with apache NiFi
Data ingestion and distribution with apache NiFiData ingestion and distribution with apache NiFi
Data ingestion and distribution with apache NiFi
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Kafka Streams vs. KSQL for Stream Processing on top of Apache Kafka
Kafka Streams vs. KSQL for Stream Processing on top of Apache KafkaKafka Streams vs. KSQL for Stream Processing on top of Apache Kafka
Kafka Streams vs. KSQL for Stream Processing on top of Apache Kafka
 
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
 
Real time stock processing with apache nifi, apache flink and apache kafka
Real time stock processing with apache nifi, apache flink and apache kafkaReal time stock processing with apache nifi, apache flink and apache kafka
Real time stock processing with apache nifi, apache flink and apache kafka
 

Similar to NJ Hadoop Meetup - Apache NiFi Deep Dive

Using Spark Streaming and NiFi for the next generation of ETL in the enterprise
Using Spark Streaming and NiFi for the next generation of ETL in the enterpriseUsing Spark Streaming and NiFi for the next generation of ETL in the enterprise
Using Spark Streaming and NiFi for the next generation of ETL in the enterprise
DataWorks Summit
 

Similar to NJ Hadoop Meetup - Apache NiFi Deep Dive (20)

Big Data Day LA 2016/ Big Data Track - Building scalable enterprise data flow...
Big Data Day LA 2016/ Big Data Track - Building scalable enterprise data flow...Big Data Day LA 2016/ Big Data Track - Building scalable enterprise data flow...
Big Data Day LA 2016/ Big Data Track - Building scalable enterprise data flow...
 
Integrating Apache NiFi and Apache Apex
Integrating Apache NiFi and Apache Apex Integrating Apache NiFi and Apache Apex
Integrating Apache NiFi and Apache Apex
 
State of the Apache NiFi Ecosystem & Community
State of the Apache NiFi Ecosystem & CommunityState of the Apache NiFi Ecosystem & Community
State of the Apache NiFi Ecosystem & Community
 
Integrating NiFi and Apex
Integrating NiFi and ApexIntegrating NiFi and Apex
Integrating NiFi and Apex
 
HDF Powered by Apache NiFi Introduction
HDF Powered by Apache NiFi IntroductionHDF Powered by Apache NiFi Introduction
HDF Powered by Apache NiFi Introduction
 
Integrating Apache NiFi and Apache Flink
Integrating Apache NiFi and Apache FlinkIntegrating Apache NiFi and Apache Flink
Integrating Apache NiFi and Apache Flink
 
Integrating NiFi and Flink
Integrating NiFi and FlinkIntegrating NiFi and Flink
Integrating NiFi and Flink
 
Integrating Apache NiFi and Apache Flink
Integrating Apache NiFi and Apache FlinkIntegrating Apache NiFi and Apache Flink
Integrating Apache NiFi and Apache Flink
 
Integrating Apache NiFi and Apache Flink
Integrating Apache NiFi and Apache FlinkIntegrating Apache NiFi and Apache Flink
Integrating Apache NiFi and Apache Flink
 
Integrating Apache NiFi and Apache Flink
Integrating Apache NiFi and Apache FlinkIntegrating Apache NiFi and Apache Flink
Integrating Apache NiFi and Apache Flink
 
Apache NiFi - Flow Based Programming Meetup
Apache NiFi - Flow Based Programming MeetupApache NiFi - Flow Based Programming Meetup
Apache NiFi - Flow Based Programming Meetup
 
HDF: Hortonworks DataFlow: Technical Workshop
HDF: Hortonworks DataFlow: Technical WorkshopHDF: Hortonworks DataFlow: Technical Workshop
HDF: Hortonworks DataFlow: Technical Workshop
 
Apache NiFi in the Hadoop Ecosystem
Apache NiFi in the Hadoop EcosystemApache NiFi in the Hadoop Ecosystem
Apache NiFi in the Hadoop Ecosystem
 
Data Con LA 2018 - Streaming and IoT by Pat Alwell
Data Con LA 2018 - Streaming and IoT by Pat AlwellData Con LA 2018 - Streaming and IoT by Pat Alwell
Data Con LA 2018 - Streaming and IoT by Pat Alwell
 
Harnessing Data-in-Motion with HDF 2.0, introduction to Apache NIFI/MINIFI
Harnessing Data-in-Motion with HDF 2.0, introduction to Apache NIFI/MINIFIHarnessing Data-in-Motion with HDF 2.0, introduction to Apache NIFI/MINIFI
Harnessing Data-in-Motion with HDF 2.0, introduction to Apache NIFI/MINIFI
 
Hortonworks Data in Motion Webinar Series - Part 1
Hortonworks Data in Motion Webinar Series - Part 1Hortonworks Data in Motion Webinar Series - Part 1
Hortonworks Data in Motion Webinar Series - Part 1
 
Using Spark Streaming and NiFi for the next generation of ETL in the enterprise
Using Spark Streaming and NiFi for the next generation of ETL in the enterpriseUsing Spark Streaming and NiFi for the next generation of ETL in the enterprise
Using Spark Streaming and NiFi for the next generation of ETL in the enterprise
 
Curing the Kafka blindness—Streams Messaging Manager
Curing the Kafka blindness—Streams Messaging ManagerCuring the Kafka blindness—Streams Messaging Manager
Curing the Kafka blindness—Streams Messaging Manager
 
The Avant-garde of Apache NiFi
The Avant-garde of Apache NiFiThe Avant-garde of Apache NiFi
The Avant-garde of Apache NiFi
 
The Avant-garde of Apache NiFi
The Avant-garde of Apache NiFiThe Avant-garde of Apache NiFi
The Avant-garde of Apache NiFi
 

More from Bryan Bende

More from Bryan Bende (8)

Apache NiFi SDLC Improvements
Apache NiFi SDLC ImprovementsApache NiFi SDLC Improvements
Apache NiFi SDLC Improvements
 
Apache NiFi Meetup - Introduction to NiFi Registry
Apache NiFi Meetup - Introduction to NiFi RegistryApache NiFi Meetup - Introduction to NiFi Registry
Apache NiFi Meetup - Introduction to NiFi Registry
 
Devnexus 2018 - Let Your Data Flow with Apache NiFi
Devnexus 2018 - Let Your Data Flow with Apache NiFiDevnexus 2018 - Let Your Data Flow with Apache NiFi
Devnexus 2018 - Let Your Data Flow with Apache NiFi
 
You Can't Search Without Data
You Can't Search Without DataYou Can't Search Without Data
You Can't Search Without Data
 
Apache NiFi Record Processing
Apache NiFi Record ProcessingApache NiFi Record Processing
Apache NiFi Record Processing
 
Taking DataFlow Management to the Edge with Apache NiFi/MiNiFi
Taking DataFlow Management to the Edge with Apache NiFi/MiNiFiTaking DataFlow Management to the Edge with Apache NiFi/MiNiFi
Taking DataFlow Management to the Edge with Apache NiFi/MiNiFi
 
Document Similarity with Cloud Computing
Document Similarity with Cloud ComputingDocument Similarity with Cloud Computing
Document Similarity with Cloud Computing
 
Real-Time Inverted Search NYC ASLUG Oct 2014
Real-Time Inverted Search NYC ASLUG Oct 2014Real-Time Inverted Search NYC ASLUG Oct 2014
Real-Time Inverted Search NYC ASLUG Oct 2014
 

Recently uploaded

Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
panagenda
 
Breaking Down the Flutterwave Scandal What You Need to Know.pdf
Breaking Down the Flutterwave Scandal What You Need to Know.pdfBreaking Down the Flutterwave Scandal What You Need to Know.pdf
Breaking Down the Flutterwave Scandal What You Need to Know.pdf
UK Journal
 
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo DiehlFuture Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Peter Udo Diehl
 

Recently uploaded (20)

AI revolution and Salesforce, Jiří Karpíšek
AI revolution and Salesforce, Jiří KarpíšekAI revolution and Salesforce, Jiří Karpíšek
AI revolution and Salesforce, Jiří Karpíšek
 
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT
 
Oauth 2.0 Introduction and Flows with MuleSoft
Oauth 2.0 Introduction and Flows with MuleSoftOauth 2.0 Introduction and Flows with MuleSoft
Oauth 2.0 Introduction and Flows with MuleSoft
 
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptxUnpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
 
Google I/O Extended 2024 Warsaw
Google I/O Extended 2024 WarsawGoogle I/O Extended 2024 Warsaw
Google I/O Extended 2024 Warsaw
 
Enterprise Knowledge Graphs - Data Summit 2024
Enterprise Knowledge Graphs - Data Summit 2024Enterprise Knowledge Graphs - Data Summit 2024
Enterprise Knowledge Graphs - Data Summit 2024
 
What's New in Teams Calling, Meetings and Devices April 2024
What's New in Teams Calling, Meetings and Devices April 2024What's New in Teams Calling, Meetings and Devices April 2024
What's New in Teams Calling, Meetings and Devices April 2024
 
A Business-Centric Approach to Design System Strategy
A Business-Centric Approach to Design System StrategyA Business-Centric Approach to Design System Strategy
A Business-Centric Approach to Design System Strategy
 
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
 
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
 
ECS 2024 Teams Premium - Pretty Secure
ECS 2024   Teams Premium - Pretty SecureECS 2024   Teams Premium - Pretty Secure
ECS 2024 Teams Premium - Pretty Secure
 
Extensible Python: Robustness through Addition - PyCon 2024
Extensible Python: Robustness through Addition - PyCon 2024Extensible Python: Robustness through Addition - PyCon 2024
Extensible Python: Robustness through Addition - PyCon 2024
 
IESVE for Early Stage Design and Planning
IESVE for Early Stage Design and PlanningIESVE for Early Stage Design and Planning
IESVE for Early Stage Design and Planning
 
Portal Kombat : extension du réseau de propagande russe
Portal Kombat : extension du réseau de propagande russePortal Kombat : extension du réseau de propagande russe
Portal Kombat : extension du réseau de propagande russe
 
Breaking Down the Flutterwave Scandal What You Need to Know.pdf
Breaking Down the Flutterwave Scandal What You Need to Know.pdfBreaking Down the Flutterwave Scandal What You Need to Know.pdf
Breaking Down the Flutterwave Scandal What You Need to Know.pdf
 
Salesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
Salesforce Adoption – Metrics, Methods, and Motivation, Antone KomSalesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
Salesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
 
How we scaled to 80K users by doing nothing!.pdf
How we scaled to 80K users by doing nothing!.pdfHow we scaled to 80K users by doing nothing!.pdf
How we scaled to 80K users by doing nothing!.pdf
 
Designing for Hardware Accessibility at Comcast
Designing for Hardware Accessibility at ComcastDesigning for Hardware Accessibility at Comcast
Designing for Hardware Accessibility at Comcast
 
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo DiehlFuture Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
 
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
 

NJ Hadoop Meetup - Apache NiFi Deep Dive

  • 1. Apache NiFi Deep Dive Bryan Bende – Member of Technical Staff NJ Hadoop Meetup – May 10th 2016
  • 2. 2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Simplistic View of Enterprise Data Flow Data Flow Process and Analyze Data Acquire Data Store Data
  • 3. 3 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Different organizations/business units across different geographic locations… Realistic View of Enterprise Data Flow
  • 4. 4 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Interacting with different business partners and customers Realistic View of Enterprise Data Flow
  • 5. 5 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Apache NiFi • Created to address the challenges of global enterprise dataflow • Key features: – Visual Command and Control – Data Lineage (Provenance) – Data Prioritization – Data Buffering/Back-Pressure – Control Latency vs. Throughput – Secure Control Plane / Data Plane – Scale Out Clustering – Extensibility
  • 6. 6 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Apache NiFi What is Apache NiFi used for? • Reliable and secure transfer of data between systems • Delivery of data from sources to analytic platforms • Enrichment and preparation of data: – Conversion between formats – Extraction/Parsing – Routing decisions What is Apache NiFi NOT used for? • Distributed Computation • Complex Event Processing • Joins / Complex Rolling Window Operations
  • 7. 7 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Apache NiFi Deep Dive
  • 8. 8 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Terminology FlowFile • Unit of data moving through the system • Content + Attributes (key/value pairs) Processor • Performs the work, can access FlowFiles Connection • Links between processors • Queues that can be dynamically prioritized Process Group • Set of processors and their connections • Receive data via input ports, send data via output ports
  • 9. 9 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Visual Command & Control • Drag and drop processors to build a flow • Start, stop, and configure components in real time • View errors and corresponding error messages • View statistics and health of data flow • Create templates of common processor & connections
  • 10. 10 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Provenance/Lineage • Tracks data at each point as it flows through the system • Records, indexes, and makes events available for display • Handles fan-in/fan-out, i.e. merging and splitting data • View attributes and content at given points in time
  • 11. 11 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Prioritization • Configure a prioritizer per connection • Determine what is important for your data – time based, arrival order, importance of a data set • Funnel many connections down to a single connection to prioritize across data sets • Develop your own prioritizer if needed
  • 12. 12 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Back-Pressure • Configure back-pressure per connection • Based on number of FlowFiles or total size of FlowFiles • Upstream processor no longer scheduled to run until below threshold
  • 13. 13 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Latency vs. Throughput • Choose between lower latency, or higher throughput on each processor • Higher throughput allows framework to batch together all operations for the selected amount of time for improved performance • Processor developer determines whether to support this by using @SupportsBatching annotation
  • 14. 14 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Security Control Plane • Pluggable authentication – 2-Way SSL, LDAP, Kerberos • Pluggable authorization – File-based authority provider out of the box – Multiple roles to defines access controls • Audit trail of all user actions Data Plane • Optional 2-Way SSL between cluster nodes • Optional 2-Way SSL on Site-To-Site connections (NiFi-to-NiFi) • Encryption/Decryption of data through processors • Provenance for audit trail of data
  • 15. 15 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Extensibility Built from the ground up with extensions in mind Service-loader pattern for… • Processors • Controller Services • Reporting Tasks • Prioritizers Extensions packaged as NiFi Archives (NARs) • Deploy NiFi lib directory and restart • Provides ClassLoader isolation • Same model as standard components
  • 16. 16 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Rapid Ecosystem Adoption: 130+ Processors HTTP Syslog Email HTML Image Hash Encrypt Extract TailMerge Evaluate Duplicate Execute Scan GeoEnrich Replace ConvertSplit Translate HL7 FTP UDP XML SFTP Route Content Route Context Route Text Control Rate Distribute Load AMQP
  • 17. 17 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Architecture OS/Host JVM Flow Controller Web Server Processor 1 Extension N FlowFile Repository Content Repository Provenance Repository Local Storage OS/Host JVM Flow Controller Web Server Processor 1 Extension N FlowFile Repository Content Repository Provenance Repository Local Storage OS/Host JVM NiFi Cluster Manager – Request Replicator Web Server Master NiFi Cluster Manager (NCM) OS/Host JVM Flow Controller Web Server Processor 1 Extension N FlowFile Repository Content Repository Provenance Repository Local Storage Slaves NiFi Nodes
  • 18. 18 © Hortonworks Inc. 2011 – 2016. All Rights Reserved NiFi Architecture – Repositories - Pass by reference FlowFile Content Provenance F1 C1 C1 P1 F1 Excerpt of demo flow… What’s happening inside the repositories… BEFORE AFTER F2 C1 C1 P3 F2 – Clone (F1) F1 C1 P2 F1 – Route P1 F1 – Create
  • 19. 19 © Hortonworks Inc. 2011 – 2016. All Rights Reserved NiFi Architecture – Repositories – Copy on Write FlowFile Content Provenance F1 C1 C1 P1 F1 - CREATE Excerpt of demo flow… What’s happening inside the repositories… BEFORE AFTER F1 C1 F1.1 C2 C2 (encrypted) C1 (plaintext) P2 F1.1 - MODIFY P1 F1 - CREATE
  • 20. 20 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Performance & Scaling • Optimize I/O… • Separate partition for each repository • Multiple partitions for content repository • RAID configurations for redundancy & striping • Tune JVM Memory, GC, and # of threads • Scale up with a cluster • 100s of thousands of events per second per node • Scale down to a Raspberry Pi • 10s of thousands of events per second
  • 21. 21 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Apache NiFi Site-To-Site
  • 22. 22 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Site-To-Site • Direct communication between two NiFi instances • Push to Input Port on receiver, or Pull from Output Port on source • Communicate between clusters, standalone instances, or both • Handles load balancing and reliable delivery • Secure connections using certificates (optional)
  • 23. 23 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Site-To-Site Push • Source connects Remote Process Group to Input Port on destination • Site-To-Site takes care of load balancing across the nodes in the cluster NCM Node 1 Input Port Node 2 Input Port Standalone NiFi RPG
  • 24. 24 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Site-To-Site Pull • Destination connects Remote Process Group to Output Port on the source • If source was a cluster, each node would pull from each node in cluster NCM Node 1 RPG Node 2 RPG Standalone NiFi Output Port
  • 25. 25 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Site-To-Site Client • Code for Site-To-Site broken out into reusable module • https://github.com/apache/nifi/tree/master/nifi-commons/nifi-site-to-site-client • Foundation for integration with stream processing platforms Java Program Site-To-Site Client Node 1 Output Port NCM Node 2 Output Port
  • 26. 26 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Current Stream Processing Integrations Spark Streaming - NiFi Spark Receiver • https://github.com/apache/nifi/tree/master/nifi-external/nifi-spark-receiver Storm – NiFi Spout & Bolt • https://github.com/apache/nifi/tree/master/nifi-external/nifi-storm-spout Flink – NiFi Source & Sink • https://github.com/apache/flink/tree/master/flink-streaming-connectors/flink-connector-nifi Apex - NiFi Input Operators & Output Operators • https://github.com/apache/incubator-apex- malhar/tree/master/contrib/src/main/java/com/datatorrent/contrib/nifi
  • 27. 27 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Bi-Directional Data Flows
  • 28. 28 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Drive Data to Core for Analysis NiFi Stream Processing NiFi NiFi • Drive data from sources to central data center for analysis • Tiered collection approach at various locations, think regional data centers Edge Edge Core Batch Analytics
  • 29. 29 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Dynamically Adjusting Data Flows • Push analytic results back to core NiFi • Push results back to edge locations/devices to change behavior NiFi NiFi NiFi Edge Edge Core Batch Analytics Stream Processing
  • 30. 30 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Future Work
  • 31. 31 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Apache NiFi Roadmap • HA Control Plane – Zero Master cluster, Web UI accessible from any node – Auto-Election of “Cluster Coordinator” and “Primary Node” through ZooKeeper • HA Data Plane – Ability to replicate data across nodes in a cluster • Multi-Tenancy – Restrict Access to portions of a flow – Allow people/groups with in an organization to only access their portions of the flow • Extension Registry – Create a central repository of NARs and Templates – Move most NARs out of Apache NiFi distribution, ship with a minimal set
  • 32. 32 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Apache NiFi Roadmap • Variable Registry – Define environment specific variables through the UI, reference through EL – Make templates more portable across environments/instances • Redesign of User Interface – Modernize look & feel, improve usability, support multi-tenancy • Continued Development of Integration Points – New processors added continuously! • MiNiFi – Complimentary data collection agent to NiFi’s current approach – Small, lightweight, centrally managed agent that integrates with NiFi for follow-on dataflow management
  • 33. 33 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Thanks! Resources • Apache NiFi Mailing Lists – https://nifi.apache.org/mailing_lists.html • Apache NiFi Documentation – https://nifi.apache.org/docs.html • Getting started developing extensions – https://cwiki.apache.org/confluence/display/NIFI/Maven+Projects+for+Extensions – https://nifi.apache.org/developer-guide.html Contact Info: – Email: bbende@hortonworks.com – Twitter: @bbende
  • 34. 34 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Thank you