SlideShare a Scribd company logo
1 of 36
Apache NiFi in the
Hadoop Ecosystem
Bryan Bende – Member of Technical Staff
Hadoop Summit 2016 Dublin
2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Outline
• What is Apache NiFi?
• Hadoop Ecosystem Integrations
• Use-Case/Architecture Discussion
• Future Work
3 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
About Me
• Member of Technical Staff on Hortonworks DataFlow Team
• Apache NiFi Committer & PMC Member
• Integrations for HBase, Solr, Syslog, Stream Processing
• Twitter: @bbende / Blog: bryanbende.com
4 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
What is Apache NiFi?
5 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Simplistic View of Enterprise Data Flow
Data Flow
Process and Analyze
Data
Acquire Data
Store Data
6 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Different organizations/business units across different geographic locations…
Realistic View of Enterprise Data Flow
7 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Interacting with different business partners and customers
Realistic View of Enterprise Data Flow
8 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache NiFi
• Created to address the challenges of global enterprise dataflow
• Key features:
– Visual Command and Control
– Data Lineage (Provenance)
– Data Prioritization
– Data Buffering/Back-Pressure
– Control Latency vs. Throughput
– Secure Control Plane / Data Plane
– Scale Out Clustering
– Extensibility
9 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache NiFi
What is Apache NiFi used for?
• Reliable and secure transfer of data between systems
• Delivery of data from sources to analytic platforms
• Enrichment and preparation of data:
– Conversion between formats
– Extraction/Parsing
– Routing decisions
What is Apache NiFi NOT used for?
• Distributed Computation
• Complex Event Processing
• Joins / Complex Rolling Window Operations
10 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Hadoop Ecosystem Integrations
11 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
HDFS Ingest
MergeContent
• Merges into appropriately sized files for HDFS
• Based on size, number of messages, and time
UpdateAttribute
• Sets the HDFS directory and filename
• Use expression language to dynamically bin by date:
/data/${now():format('yyyy/MM/dd/HH')}/
PutHDFS
• Writes FlowFile content to HDFS
• Supports Conflict Resolution Strategy and Kerberos
authentication
12 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
HDFS Retrieval
ListHDFS
• Periodically perform listing on HDFS directory
• Produces FlowFile per HDFS file
• Flow only contains HDFS path & filename
FetchHDFS
• Retrieves a file from HDFS
• Use incoming FlowFiles to dynamically fetch:
HDFS Filename: ${path}/${filename}
13 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
HDFS Retrieval in a Cluster
NCM
Node 1 (Primary)
ListHDFS
HDFS
FetchHDFS
RPG
Input Port
Node 2
ListHDFS
FetchHDFS
RPG
Input Port
• Perform “list” operation on
primary node
• Send results to Remote
Process Group pointing back
to same cluster
• Redistributes results to all
nodes to perform “fetch” in
parallel
• Same approach for ListFile +
FetchFile and ListSFTP +
FetchSFTP
14 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
HBase Integration
• ControllerService wrapping HBase Client
• Implementation provided for HBase 1.1.2 Client
• Other implementations could be built as an extension
15 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
HBase Ingest – Single Cell
• Table, Row Id, Col Family, and Col Qualifier
provided in processor, or dynamically from
attributes
• FlowFile content becomes the cell value
• Batch Size to specify maximum number of cells
for a single ‘put’ operation
16 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
HBase Ingest – Full Row
• Table and Column Family provided in
processor, or dynamically from attributes
• Row ID can be a field in JSON, or a FlowFile
attribute
• JSON Field/Values become Column Qualifiers
and Values
• Complex Field Strategy
– Fail
– Warn
– Ignore
– Text
17 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Kafka Integration
PutKafka
• Provide Broker and Topic Name
• Publishes FlowFile content as one or more messages
• Ability to send large delimited content, split into
messages by NiFi
GetKafka
• Provide ZK Connection String and Topic Name
• Produces a FlowFile for each message consumed
18 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Stream Processing Integration
• Stream processing systems
generally pull data, then push
results
• NiFi Site-To-Site pushes and pulls
between NiFi instances
• The Site-To-Site Client can be used
from a stream processing platform
https://github.com/apache/nifi/tree/master/nifi
-commons/nifi-site-to-site-client
Stream Processing
Site-To-Site Client
NiFi
Output Port
Stream Processing
Site-To-Site Client
NiFi
Input Port
Pull
Push
19 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Site-to-Site Client Overview
• Push to Input Port, or Pull from Output Port
• Communicate with NiFi clusters, or standalone instances
• Handles load balancing and reliable delivery
• Secure connections using certificates (optional)
SiteToSiteClientConfig clientConfig =
new SiteToSiteClient.Builder()
.url(“http://localhost:8080/nifi”)
.portName(”My Port”)
.buildConfig();
20 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Site-To-Site Client Pulling
SiteToSiteClient client = ...
Transaction transaction =
client.createTransaction(TransferDirection.RECEIVE);
DataPacket dataPacket = transaction.receive();
while (dataPacket != null) {
...
}
transaction.confirm();
transaction.complete();
21 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Site-To-Site Client Pushing
SiteToSiteClient client = ...
Transaction transaction =
client.createTransaction(TransferDirection.SEND);
NiFiDataPacket data = ...
transaction.send(data.getContent(), data.getAttributes());
transaction.confirm();
transaction.complete();
22 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Current Stream Processing Integrations
Spark Streaming - NiFi Spark Receiver
• https://github.com/apache/nifi/tree/master/nifi-external/nifi-spark-receiver
Storm – NiFi Spout
• https://github.com/apache/nifi/tree/master/nifi-external/nifi-storm-spout
Flink – NiFi Source & Sink
• https://github.com/apache/flink/tree/master/flink-streaming-connectors/flink-connector-nifi
Apex - NiFi Input Operators & Output Operators
• https://github.com/apache/incubator-apex-
malhar/tree/master/contrib/src/main/java/com/datatorrent/contrib/nifi
23 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Other Relevant Integrations
• GetSolr, PutSolrContentStream
• FetchElasticSearch, PutElasticSearch
• GetMongo, PutMongo
• QueryCassandra, PutCassandraQL
• GetCouchbaseKey, PutCouchbaseKey
• QueryDatabaseTable, ExecuteSQL, PutSQL
• GetSplunk, PutSplunk
• And more! https://nifi.apache.org/docs.html
24 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Use-Case/Architecture Discussion
25 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Drive Data to Core for Analysis
NiFi
Stream
Processing
NiFi
NiFi
• Drive data from sources to central data center for analysis
• Tiered collection approach at various locations, think regional data centers
Edge
Edge
Core
Batch
Analytics
26 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Dynamically Adjusting Data Flows
• Push analytic results back to core NiFi
• Push results back to edge locations/devices to change behavior
NiFi
NiFi
NiFi
Edge
Edge
Core
Batch
Analytics
Stream
Processing
27 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
1. Logs filtered by level and sent from Edge -> Core
2. Stream processing produces new filters based on rate & sends back to core
3. Edge polls core for new filter levels & updates filtering
Example: Dynamic Log Collection
Core NiFi
Streaming
Processing
Edge NiFi
Logs Logs
New Filters
Logs Output Log Input Log Output
Result Input Store Result
Service Fetch ResultPoll Service
Filter
New Filters
New
Filters
Poll
Analytic
28 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Dynamic Log Collection – Edge NiFi
29 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Dynamic Log Collection – Core NiFi
30 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Dynamic Log Collection Summary
NiFi
NiFi
NiFi
Edge
Edge
Core
Stream
Processing
Logs
Logs
Logs
New FiltersNew Filters
New Filters
31 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Future Work
32 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
The Future – Ecosystem Integrations
• Ambari
– Support a fully managed NiFi Cluster through Ambari
– Monitoring, management, upgrades, etc.
• Ranger
– Ability to delegate authorization decisions to Ranger
– Manage authorization controls through Ranger
• Atlas
– Track lineage from the source to destination
– Apply tags to data as its acquired
33 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
The Future – Apache NiFi
• HA Control Plane
– Zero Master cluster, Web UI accessible from any node
– Auto-Election of “Cluster Coordinator” and “Primary Node” through ZooKeeper
• HA Data Plane
– Ability to replicate data across nodes in a cluster
• Multi-Tenancy
– Restrict Access to portions of a flow
– Allow people/groups with in an organization to only access their portions of the flow
• Extension Registry
– Create a central repository of NARs and Templates
– Move most NARs out of Apache NiFi distribution, ship with a minimal set
34 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
The Future – Apache NiFi
• Variable Registry
– Define environment specific variables through the UI, reference through EL
– Make templates more portable across environments/instances
• Redesign of User Interface
– Modernize look & feel, improve usability, support multi-tenancy
• Continued Development of Integration Points
– New processors added continuously!
• MiNiFi
– Complimentary data collection agent to NiFi’s current approach
– Small, lightweight, centrally managed agent that integrates with NiFi for follow-on dataflow
management
35 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Thanks!
• Questions?
• Contact Info:
– Email: bbende@hortonworks.com
– Twitter: @bbende
36 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Thank you

More Related Content

What's hot

ORC File & Vectorization - Improving Hive Data Storage and Query Performance
ORC File & Vectorization - Improving Hive Data Storage and Query PerformanceORC File & Vectorization - Improving Hive Data Storage and Query Performance
ORC File & Vectorization - Improving Hive Data Storage and Query Performance
DataWorks Summit
 
Designing Event-Driven Applications with Apache NiFi, Apache Flink, Apache Sp...
Designing Event-Driven Applications with Apache NiFi, Apache Flink, Apache Sp...Designing Event-Driven Applications with Apache NiFi, Apache Flink, Apache Sp...
Designing Event-Driven Applications with Apache NiFi, Apache Flink, Apache Sp...
Timothy Spann
 

What's hot (20)

Introduction to data flow management using apache nifi
Introduction to data flow management using apache nifiIntroduction to data flow management using apache nifi
Introduction to data flow management using apache nifi
 
Apache NiFi Crash Course Intro
Apache NiFi Crash Course IntroApache NiFi Crash Course Intro
Apache NiFi Crash Course Intro
 
Apache Spark and Apache Ignite: Where Fast Data Meets the IoT with Denis Magda
Apache Spark and Apache Ignite: Where Fast Data Meets the IoT with Denis MagdaApache Spark and Apache Ignite: Where Fast Data Meets the IoT with Denis Magda
Apache Spark and Apache Ignite: Where Fast Data Meets the IoT with Denis Magda
 
NJ Hadoop Meetup - Apache NiFi Deep Dive
NJ Hadoop Meetup - Apache NiFi Deep DiveNJ Hadoop Meetup - Apache NiFi Deep Dive
NJ Hadoop Meetup - Apache NiFi Deep Dive
 
Apache Nifi Crash Course
Apache Nifi Crash CourseApache Nifi Crash Course
Apache Nifi Crash Course
 
Integrating NiFi and Flink
Integrating NiFi and FlinkIntegrating NiFi and Flink
Integrating NiFi and Flink
 
Making Structured Streaming Ready for Production
Making Structured Streaming Ready for ProductionMaking Structured Streaming Ready for Production
Making Structured Streaming Ready for Production
 
Apache Atlas: Governance for your Data
Apache Atlas: Governance for your DataApache Atlas: Governance for your Data
Apache Atlas: Governance for your Data
 
ORC File & Vectorization - Improving Hive Data Storage and Query Performance
ORC File & Vectorization - Improving Hive Data Storage and Query PerformanceORC File & Vectorization - Improving Hive Data Storage and Query Performance
ORC File & Vectorization - Improving Hive Data Storage and Query Performance
 
Using Apache Hive with High Performance
Using Apache Hive with High PerformanceUsing Apache Hive with High Performance
Using Apache Hive with High Performance
 
Transactional operations in Apache Hive: present and future
Transactional operations in Apache Hive: present and futureTransactional operations in Apache Hive: present and future
Transactional operations in Apache Hive: present and future
 
Hive 3 - a new horizon
Hive 3 - a new horizonHive 3 - a new horizon
Hive 3 - a new horizon
 
Introduction to Apache NiFi 1.11.4
Introduction to Apache NiFi 1.11.4Introduction to Apache NiFi 1.11.4
Introduction to Apache NiFi 1.11.4
 
Apache Ignite vs Alluxio: Memory Speed Big Data Analytics
Apache Ignite vs Alluxio: Memory Speed Big Data AnalyticsApache Ignite vs Alluxio: Memory Speed Big Data Analytics
Apache Ignite vs Alluxio: Memory Speed Big Data Analytics
 
Designing Event-Driven Applications with Apache NiFi, Apache Flink, Apache Sp...
Designing Event-Driven Applications with Apache NiFi, Apache Flink, Apache Sp...Designing Event-Driven Applications with Apache NiFi, Apache Flink, Apache Sp...
Designing Event-Driven Applications with Apache NiFi, Apache Flink, Apache Sp...
 
Hive Does ACID
Hive Does ACIDHive Does ACID
Hive Does ACID
 
A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0 A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0
 
Data Ingest Self Service and Management using Nifi and Kafka
Data Ingest Self Service and Management using Nifi and KafkaData Ingest Self Service and Management using Nifi and Kafka
Data Ingest Self Service and Management using Nifi and Kafka
 
Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...
Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...
Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...
 
Transactional SQL in Apache Hive
Transactional SQL in Apache HiveTransactional SQL in Apache Hive
Transactional SQL in Apache Hive
 

Viewers also liked

Viewers also liked (20)

Apache NiFi- MiNiFi meetup Slides
Apache NiFi- MiNiFi meetup SlidesApache NiFi- MiNiFi meetup Slides
Apache NiFi- MiNiFi meetup Slides
 
Hortonworks DataFlow & Apache Nifi @Oslo Hadoop Big Data
Hortonworks DataFlow & Apache Nifi @Oslo Hadoop Big DataHortonworks DataFlow & Apache Nifi @Oslo Hadoop Big Data
Hortonworks DataFlow & Apache Nifi @Oslo Hadoop Big Data
 
Taking DataFlow Management to the Edge with Apache NiFi/MiNiFi
Taking DataFlow Management to the Edge with Apache NiFi/MiNiFiTaking DataFlow Management to the Edge with Apache NiFi/MiNiFi
Taking DataFlow Management to the Edge with Apache NiFi/MiNiFi
 
Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Tog...
Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Tog...Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Tog...
Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Tog...
 
Real-Time Data Flows with Apache NiFi
Real-Time Data Flows with Apache NiFiReal-Time Data Flows with Apache NiFi
Real-Time Data Flows with Apache NiFi
 
Aliran sesat
Aliran sesatAliran sesat
Aliran sesat
 
Apache NiFi Toronto Meetup
Apache NiFi Toronto MeetupApache NiFi Toronto Meetup
Apache NiFi Toronto Meetup
 
Beyond Messaging Enterprise Dataflow powered by Apache NiFi
Beyond Messaging Enterprise Dataflow powered by Apache NiFiBeyond Messaging Enterprise Dataflow powered by Apache NiFi
Beyond Messaging Enterprise Dataflow powered by Apache NiFi
 
Hbase trabalho final
Hbase trabalho finalHbase trabalho final
Hbase trabalho final
 
HDF Powered by Apache NiFi Introduction
HDF Powered by Apache NiFi IntroductionHDF Powered by Apache NiFi Introduction
HDF Powered by Apache NiFi Introduction
 
Talendビッグデータインテグレーション製品ご紹介
Talendビッグデータインテグレーション製品ご紹介Talendビッグデータインテグレーション製品ご紹介
Talendビッグデータインテグレーション製品ご紹介
 
Spark streaming
Spark streamingSpark streaming
Spark streaming
 
Discover HDP2.1: Apache Storm for Stream Data Processing in Hadoop
Discover HDP2.1: Apache Storm for Stream Data Processing in HadoopDiscover HDP2.1: Apache Storm for Stream Data Processing in Hadoop
Discover HDP2.1: Apache Storm for Stream Data Processing in Hadoop
 
Nifi workshop
Nifi workshopNifi workshop
Nifi workshop
 
Introduction to Apache NiFi - Seattle Scalability Meetup
Introduction to Apache NiFi - Seattle Scalability MeetupIntroduction to Apache NiFi - Seattle Scalability Meetup
Introduction to Apache NiFi - Seattle Scalability Meetup
 
Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark Streaming
 
Apache NiFi 1.0 in Nutshell
Apache NiFi 1.0 in NutshellApache NiFi 1.0 in Nutshell
Apache NiFi 1.0 in Nutshell
 
Design a Dataflow in 7 minutes with Apache NiFi/HDF
Design a Dataflow in 7 minutes with Apache NiFi/HDFDesign a Dataflow in 7 minutes with Apache NiFi/HDF
Design a Dataflow in 7 minutes with Apache NiFi/HDF
 
HDF: Hortonworks DataFlow: Technical Workshop
HDF: Hortonworks DataFlow: Technical WorkshopHDF: Hortonworks DataFlow: Technical Workshop
HDF: Hortonworks DataFlow: Technical Workshop
 
Reference architecture for Internet of Things
Reference architecture for Internet of ThingsReference architecture for Internet of Things
Reference architecture for Internet of Things
 

Similar to Apache NiFi in the Hadoop Ecosystem

Dataflow Management From Edge to Core with Apache NiFi
Dataflow Management From Edge to Core with Apache NiFiDataflow Management From Edge to Core with Apache NiFi
Dataflow Management From Edge to Core with Apache NiFi
DataWorks Summit
 

Similar to Apache NiFi in the Hadoop Ecosystem (20)

State of the Apache NiFi Ecosystem & Community
State of the Apache NiFi Ecosystem & CommunityState of the Apache NiFi Ecosystem & Community
State of the Apache NiFi Ecosystem & Community
 
Future of Data New Jersey - HDF 3.0 Deep Dive
Future of Data New Jersey - HDF 3.0 Deep DiveFuture of Data New Jersey - HDF 3.0 Deep Dive
Future of Data New Jersey - HDF 3.0 Deep Dive
 
Mission to NARs with Apache NiFi
Mission to NARs with Apache NiFiMission to NARs with Apache NiFi
Mission to NARs with Apache NiFi
 
Data Governance in Apache Falcon - Hadoop Summit Brussels 2015
Data Governance in Apache Falcon - Hadoop Summit Brussels 2015 Data Governance in Apache Falcon - Hadoop Summit Brussels 2015
Data Governance in Apache Falcon - Hadoop Summit Brussels 2015
 
Driving Enterprise Data Governance for Big Data Systems through Apache Falcon
Driving Enterprise Data Governance for Big Data Systems through Apache FalconDriving Enterprise Data Governance for Big Data Systems through Apache Falcon
Driving Enterprise Data Governance for Big Data Systems through Apache Falcon
 
Big Data Day LA 2016/ Big Data Track - Building scalable enterprise data flow...
Big Data Day LA 2016/ Big Data Track - Building scalable enterprise data flow...Big Data Day LA 2016/ Big Data Track - Building scalable enterprise data flow...
Big Data Day LA 2016/ Big Data Track - Building scalable enterprise data flow...
 
Dataflow Management From Edge to Core with Apache NiFi
Dataflow Management From Edge to Core with Apache NiFiDataflow Management From Edge to Core with Apache NiFi
Dataflow Management From Edge to Core with Apache NiFi
 
The Avant-garde of Apache NiFi
The Avant-garde of Apache NiFiThe Avant-garde of Apache NiFi
The Avant-garde of Apache NiFi
 
The Avant-garde of Apache NiFi
The Avant-garde of Apache NiFiThe Avant-garde of Apache NiFi
The Avant-garde of Apache NiFi
 
Data Con LA 2018 - Streaming and IoT by Pat Alwell
Data Con LA 2018 - Streaming and IoT by Pat AlwellData Con LA 2018 - Streaming and IoT by Pat Alwell
Data Con LA 2018 - Streaming and IoT by Pat Alwell
 
Connecting the Drops with Apache NiFi & Apache MiNiFi
Connecting the Drops with Apache NiFi & Apache MiNiFiConnecting the Drops with Apache NiFi & Apache MiNiFi
Connecting the Drops with Apache NiFi & Apache MiNiFi
 
Running Apache NiFi with Apache Spark : Integration Options
Running Apache NiFi with Apache Spark : Integration OptionsRunning Apache NiFi with Apache Spark : Integration Options
Running Apache NiFi with Apache Spark : Integration Options
 
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache ZeppelinIntro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
 
Integrating Apache NiFi and Apache Flink
Integrating Apache NiFi and Apache FlinkIntegrating Apache NiFi and Apache Flink
Integrating Apache NiFi and Apache Flink
 
Integrating Apache NiFi and Apache Flink
Integrating Apache NiFi and Apache FlinkIntegrating Apache NiFi and Apache Flink
Integrating Apache NiFi and Apache Flink
 
Integrating Apache NiFi and Apache Flink
Integrating Apache NiFi and Apache FlinkIntegrating Apache NiFi and Apache Flink
Integrating Apache NiFi and Apache Flink
 
Integrating Apache NiFi and Apache Flink
Integrating Apache NiFi and Apache FlinkIntegrating Apache NiFi and Apache Flink
Integrating Apache NiFi and Apache Flink
 
Curb your insecurity with HDP - Tips for a Secure Cluster
Curb your insecurity with HDP - Tips for a Secure ClusterCurb your insecurity with HDP - Tips for a Secure Cluster
Curb your insecurity with HDP - Tips for a Secure Cluster
 
Hadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in ProductionHadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in Production
 
Apache Argus - How do I secure my entire Hadoop cluster? Olivier Renault @ Ho...
Apache Argus - How do I secure my entire Hadoop cluster? Olivier Renault @ Ho...Apache Argus - How do I secure my entire Hadoop cluster? Olivier Renault @ Ho...
Apache Argus - How do I secure my entire Hadoop cluster? Olivier Renault @ Ho...
 

More from Bryan Bende

More from Bryan Bende (8)

Apache NiFi SDLC Improvements
Apache NiFi SDLC ImprovementsApache NiFi SDLC Improvements
Apache NiFi SDLC Improvements
 
Apache NiFi Meetup - Introduction to NiFi Registry
Apache NiFi Meetup - Introduction to NiFi RegistryApache NiFi Meetup - Introduction to NiFi Registry
Apache NiFi Meetup - Introduction to NiFi Registry
 
Devnexus 2018 - Let Your Data Flow with Apache NiFi
Devnexus 2018 - Let Your Data Flow with Apache NiFiDevnexus 2018 - Let Your Data Flow with Apache NiFi
Devnexus 2018 - Let Your Data Flow with Apache NiFi
 
You Can't Search Without Data
You Can't Search Without DataYou Can't Search Without Data
You Can't Search Without Data
 
Apache NiFi Record Processing
Apache NiFi Record ProcessingApache NiFi Record Processing
Apache NiFi Record Processing
 
Integrating NiFi and Apex
Integrating NiFi and ApexIntegrating NiFi and Apex
Integrating NiFi and Apex
 
Document Similarity with Cloud Computing
Document Similarity with Cloud ComputingDocument Similarity with Cloud Computing
Document Similarity with Cloud Computing
 
Real-Time Inverted Search NYC ASLUG Oct 2014
Real-Time Inverted Search NYC ASLUG Oct 2014Real-Time Inverted Search NYC ASLUG Oct 2014
Real-Time Inverted Search NYC ASLUG Oct 2014
 

Recently uploaded

Breaking Down the Flutterwave Scandal What You Need to Know.pdf
Breaking Down the Flutterwave Scandal What You Need to Know.pdfBreaking Down the Flutterwave Scandal What You Need to Know.pdf
Breaking Down the Flutterwave Scandal What You Need to Know.pdf
UK Journal
 
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo DiehlFuture Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Peter Udo Diehl
 
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
panagenda
 

Recently uploaded (20)

Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
 
Breaking Down the Flutterwave Scandal What You Need to Know.pdf
Breaking Down the Flutterwave Scandal What You Need to Know.pdfBreaking Down the Flutterwave Scandal What You Need to Know.pdf
Breaking Down the Flutterwave Scandal What You Need to Know.pdf
 
What's New in Teams Calling, Meetings and Devices April 2024
What's New in Teams Calling, Meetings and Devices April 2024What's New in Teams Calling, Meetings and Devices April 2024
What's New in Teams Calling, Meetings and Devices April 2024
 
Using IESVE for Room Loads Analysis - UK & Ireland
Using IESVE for Room Loads Analysis - UK & IrelandUsing IESVE for Room Loads Analysis - UK & Ireland
Using IESVE for Room Loads Analysis - UK & Ireland
 
AI revolution and Salesforce, Jiří Karpíšek
AI revolution and Salesforce, Jiří KarpíšekAI revolution and Salesforce, Jiří Karpíšek
AI revolution and Salesforce, Jiří Karpíšek
 
Syngulon - Selection technology May 2024.pdf
Syngulon - Selection technology May 2024.pdfSyngulon - Selection technology May 2024.pdf
Syngulon - Selection technology May 2024.pdf
 
Where to Learn More About FDO _ Richard at FIDO Alliance.pdf
Where to Learn More About FDO _ Richard at FIDO Alliance.pdfWhere to Learn More About FDO _ Richard at FIDO Alliance.pdf
Where to Learn More About FDO _ Richard at FIDO Alliance.pdf
 
ECS 2024 Teams Premium - Pretty Secure
ECS 2024   Teams Premium - Pretty SecureECS 2024   Teams Premium - Pretty Secure
ECS 2024 Teams Premium - Pretty Secure
 
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
 
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdf
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdfSimplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdf
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdf
 
The Metaverse: Are We There Yet?
The  Metaverse:    Are   We  There  Yet?The  Metaverse:    Are   We  There  Yet?
The Metaverse: Are We There Yet?
 
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT
 
How we scaled to 80K users by doing nothing!.pdf
How we scaled to 80K users by doing nothing!.pdfHow we scaled to 80K users by doing nothing!.pdf
How we scaled to 80K users by doing nothing!.pdf
 
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo DiehlFuture Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
 
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdfIntroduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
 
Easier, Faster, and More Powerful – Notes Document Properties Reimagined
Easier, Faster, and More Powerful – Notes Document Properties ReimaginedEasier, Faster, and More Powerful – Notes Document Properties Reimagined
Easier, Faster, and More Powerful – Notes Document Properties Reimagined
 
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
 
Portal Kombat : extension du réseau de propagande russe
Portal Kombat : extension du réseau de propagande russePortal Kombat : extension du réseau de propagande russe
Portal Kombat : extension du réseau de propagande russe
 
Extensible Python: Robustness through Addition - PyCon 2024
Extensible Python: Robustness through Addition - PyCon 2024Extensible Python: Robustness through Addition - PyCon 2024
Extensible Python: Robustness through Addition - PyCon 2024
 
Overview of Hyperledger Foundation
Overview of Hyperledger FoundationOverview of Hyperledger Foundation
Overview of Hyperledger Foundation
 

Apache NiFi in the Hadoop Ecosystem

  • 1. Apache NiFi in the Hadoop Ecosystem Bryan Bende – Member of Technical Staff Hadoop Summit 2016 Dublin
  • 2. 2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Outline • What is Apache NiFi? • Hadoop Ecosystem Integrations • Use-Case/Architecture Discussion • Future Work
  • 3. 3 © Hortonworks Inc. 2011 – 2016. All Rights Reserved About Me • Member of Technical Staff on Hortonworks DataFlow Team • Apache NiFi Committer & PMC Member • Integrations for HBase, Solr, Syslog, Stream Processing • Twitter: @bbende / Blog: bryanbende.com
  • 4. 4 © Hortonworks Inc. 2011 – 2016. All Rights Reserved What is Apache NiFi?
  • 5. 5 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Simplistic View of Enterprise Data Flow Data Flow Process and Analyze Data Acquire Data Store Data
  • 6. 6 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Different organizations/business units across different geographic locations… Realistic View of Enterprise Data Flow
  • 7. 7 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Interacting with different business partners and customers Realistic View of Enterprise Data Flow
  • 8. 8 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Apache NiFi • Created to address the challenges of global enterprise dataflow • Key features: – Visual Command and Control – Data Lineage (Provenance) – Data Prioritization – Data Buffering/Back-Pressure – Control Latency vs. Throughput – Secure Control Plane / Data Plane – Scale Out Clustering – Extensibility
  • 9. 9 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Apache NiFi What is Apache NiFi used for? • Reliable and secure transfer of data between systems • Delivery of data from sources to analytic platforms • Enrichment and preparation of data: – Conversion between formats – Extraction/Parsing – Routing decisions What is Apache NiFi NOT used for? • Distributed Computation • Complex Event Processing • Joins / Complex Rolling Window Operations
  • 10. 10 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Hadoop Ecosystem Integrations
  • 11. 11 © Hortonworks Inc. 2011 – 2016. All Rights Reserved HDFS Ingest MergeContent • Merges into appropriately sized files for HDFS • Based on size, number of messages, and time UpdateAttribute • Sets the HDFS directory and filename • Use expression language to dynamically bin by date: /data/${now():format('yyyy/MM/dd/HH')}/ PutHDFS • Writes FlowFile content to HDFS • Supports Conflict Resolution Strategy and Kerberos authentication
  • 12. 12 © Hortonworks Inc. 2011 – 2016. All Rights Reserved HDFS Retrieval ListHDFS • Periodically perform listing on HDFS directory • Produces FlowFile per HDFS file • Flow only contains HDFS path & filename FetchHDFS • Retrieves a file from HDFS • Use incoming FlowFiles to dynamically fetch: HDFS Filename: ${path}/${filename}
  • 13. 13 © Hortonworks Inc. 2011 – 2016. All Rights Reserved HDFS Retrieval in a Cluster NCM Node 1 (Primary) ListHDFS HDFS FetchHDFS RPG Input Port Node 2 ListHDFS FetchHDFS RPG Input Port • Perform “list” operation on primary node • Send results to Remote Process Group pointing back to same cluster • Redistributes results to all nodes to perform “fetch” in parallel • Same approach for ListFile + FetchFile and ListSFTP + FetchSFTP
  • 14. 14 © Hortonworks Inc. 2011 – 2016. All Rights Reserved HBase Integration • ControllerService wrapping HBase Client • Implementation provided for HBase 1.1.2 Client • Other implementations could be built as an extension
  • 15. 15 © Hortonworks Inc. 2011 – 2016. All Rights Reserved HBase Ingest – Single Cell • Table, Row Id, Col Family, and Col Qualifier provided in processor, or dynamically from attributes • FlowFile content becomes the cell value • Batch Size to specify maximum number of cells for a single ‘put’ operation
  • 16. 16 © Hortonworks Inc. 2011 – 2016. All Rights Reserved HBase Ingest – Full Row • Table and Column Family provided in processor, or dynamically from attributes • Row ID can be a field in JSON, or a FlowFile attribute • JSON Field/Values become Column Qualifiers and Values • Complex Field Strategy – Fail – Warn – Ignore – Text
  • 17. 17 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Kafka Integration PutKafka • Provide Broker and Topic Name • Publishes FlowFile content as one or more messages • Ability to send large delimited content, split into messages by NiFi GetKafka • Provide ZK Connection String and Topic Name • Produces a FlowFile for each message consumed
  • 18. 18 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Stream Processing Integration • Stream processing systems generally pull data, then push results • NiFi Site-To-Site pushes and pulls between NiFi instances • The Site-To-Site Client can be used from a stream processing platform https://github.com/apache/nifi/tree/master/nifi -commons/nifi-site-to-site-client Stream Processing Site-To-Site Client NiFi Output Port Stream Processing Site-To-Site Client NiFi Input Port Pull Push
  • 19. 19 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Site-to-Site Client Overview • Push to Input Port, or Pull from Output Port • Communicate with NiFi clusters, or standalone instances • Handles load balancing and reliable delivery • Secure connections using certificates (optional) SiteToSiteClientConfig clientConfig = new SiteToSiteClient.Builder() .url(“http://localhost:8080/nifi”) .portName(”My Port”) .buildConfig();
  • 20. 20 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Site-To-Site Client Pulling SiteToSiteClient client = ... Transaction transaction = client.createTransaction(TransferDirection.RECEIVE); DataPacket dataPacket = transaction.receive(); while (dataPacket != null) { ... } transaction.confirm(); transaction.complete();
  • 21. 21 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Site-To-Site Client Pushing SiteToSiteClient client = ... Transaction transaction = client.createTransaction(TransferDirection.SEND); NiFiDataPacket data = ... transaction.send(data.getContent(), data.getAttributes()); transaction.confirm(); transaction.complete();
  • 22. 22 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Current Stream Processing Integrations Spark Streaming - NiFi Spark Receiver • https://github.com/apache/nifi/tree/master/nifi-external/nifi-spark-receiver Storm – NiFi Spout • https://github.com/apache/nifi/tree/master/nifi-external/nifi-storm-spout Flink – NiFi Source & Sink • https://github.com/apache/flink/tree/master/flink-streaming-connectors/flink-connector-nifi Apex - NiFi Input Operators & Output Operators • https://github.com/apache/incubator-apex- malhar/tree/master/contrib/src/main/java/com/datatorrent/contrib/nifi
  • 23. 23 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Other Relevant Integrations • GetSolr, PutSolrContentStream • FetchElasticSearch, PutElasticSearch • GetMongo, PutMongo • QueryCassandra, PutCassandraQL • GetCouchbaseKey, PutCouchbaseKey • QueryDatabaseTable, ExecuteSQL, PutSQL • GetSplunk, PutSplunk • And more! https://nifi.apache.org/docs.html
  • 24. 24 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Use-Case/Architecture Discussion
  • 25. 25 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Drive Data to Core for Analysis NiFi Stream Processing NiFi NiFi • Drive data from sources to central data center for analysis • Tiered collection approach at various locations, think regional data centers Edge Edge Core Batch Analytics
  • 26. 26 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Dynamically Adjusting Data Flows • Push analytic results back to core NiFi • Push results back to edge locations/devices to change behavior NiFi NiFi NiFi Edge Edge Core Batch Analytics Stream Processing
  • 27. 27 © Hortonworks Inc. 2011 – 2016. All Rights Reserved 1. Logs filtered by level and sent from Edge -> Core 2. Stream processing produces new filters based on rate & sends back to core 3. Edge polls core for new filter levels & updates filtering Example: Dynamic Log Collection Core NiFi Streaming Processing Edge NiFi Logs Logs New Filters Logs Output Log Input Log Output Result Input Store Result Service Fetch ResultPoll Service Filter New Filters New Filters Poll Analytic
  • 28. 28 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Dynamic Log Collection – Edge NiFi
  • 29. 29 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Dynamic Log Collection – Core NiFi
  • 30. 30 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Dynamic Log Collection Summary NiFi NiFi NiFi Edge Edge Core Stream Processing Logs Logs Logs New FiltersNew Filters New Filters
  • 31. 31 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Future Work
  • 32. 32 © Hortonworks Inc. 2011 – 2016. All Rights Reserved The Future – Ecosystem Integrations • Ambari – Support a fully managed NiFi Cluster through Ambari – Monitoring, management, upgrades, etc. • Ranger – Ability to delegate authorization decisions to Ranger – Manage authorization controls through Ranger • Atlas – Track lineage from the source to destination – Apply tags to data as its acquired
  • 33. 33 © Hortonworks Inc. 2011 – 2016. All Rights Reserved The Future – Apache NiFi • HA Control Plane – Zero Master cluster, Web UI accessible from any node – Auto-Election of “Cluster Coordinator” and “Primary Node” through ZooKeeper • HA Data Plane – Ability to replicate data across nodes in a cluster • Multi-Tenancy – Restrict Access to portions of a flow – Allow people/groups with in an organization to only access their portions of the flow • Extension Registry – Create a central repository of NARs and Templates – Move most NARs out of Apache NiFi distribution, ship with a minimal set
  • 34. 34 © Hortonworks Inc. 2011 – 2016. All Rights Reserved The Future – Apache NiFi • Variable Registry – Define environment specific variables through the UI, reference through EL – Make templates more portable across environments/instances • Redesign of User Interface – Modernize look & feel, improve usability, support multi-tenancy • Continued Development of Integration Points – New processors added continuously! • MiNiFi – Complimentary data collection agent to NiFi’s current approach – Small, lightweight, centrally managed agent that integrates with NiFi for follow-on dataflow management
  • 35. 35 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Thanks! • Questions? • Contact Info: – Email: bbende@hortonworks.com – Twitter: @bbende
  • 36. 36 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Thank you