Data Ingestion and Distribution with Apache NiFi

In this session, we will cover our experience working with Apache NiFi, an easy-to-use, powerful, and reliable system for processing and distributing large volumes of data. The first part of the session is an introduction to Apache NiFi: we will go over NiFi's main components, building blocks, and functionality.
In the second part, we will show our use case for Apache NiFi and how it is being used inside our data-processing infrastructure.

Published in: Data & Analytics


  1. Data Ingestion & Distribution with Apache NiFi
  2. Agenda ● Introduction to NiFi ● Our use case for NiFi ● Demo ● Q&A
  3. Introduction to NiFi
  4. History & Facts ● Created by: NSA ● Incubating: 2014 ● Available: 2015 ● Main contributors: Hortonworks ● Current stable version: 1.1.1 ● Delivery guarantees: at least once ● Out-of-order processing: no ● Windowing: no ● Back pressure: yes ● Latency: configurable ● Resource management: native ● API: REST (GUI)
  5. Ecosystem ● Stream Processing ● Data Moving
  6. Architecture
  7. Flow Files: the basic abstraction ● Pointer to content ● Content attributes (key/value) ● Connection to provenance events
  8. Repositories ● FlowFile ● Content ● Provenance ● Immutable ● Copy-on-write
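The FlowFile and repository model on the two slides above can be sketched as a toy in Python. All class and method names here are illustrative stand-ins, not NiFi's actual Java API; the point is that a FlowFile is only metadata plus a pointer into an immutable, copy-on-write content store:

```python
import uuid

class ContentRepository:
    """Toy content repository: claims are immutable, and every write
    creates a new claim (copy-on-write), so existing FlowFiles keep
    pointing at their original content."""
    def __init__(self):
        self._claims = {}

    def write(self, data: bytes) -> str:
        claim_id = str(uuid.uuid4())
        self._claims[claim_id] = data   # never mutated afterwards
        return claim_id

    def read(self, claim_id: str) -> bytes:
        return self._claims[claim_id]

class FlowFile:
    """A FlowFile is metadata: key/value attributes plus a pointer
    (a claim id) into the content repository, not the bytes themselves."""
    def __init__(self, claim_id: str, attributes: dict):
        self.claim_id = claim_id
        self.attributes = dict(attributes)

repo = ContentRepository()
claim = repo.write(b"original payload")
ff = FlowFile(claim, {"filename": "data.csv", "path": "/in"})

# "Modifying" content writes a new claim; the old claim is untouched,
# which is what makes provenance replay possible.
new_claim = repo.write(b"transformed payload")
ff2 = FlowFile(new_claim, {**ff.attributes, "stage": "transformed"})
```

Because old claims survive, replaying `ff` later still yields the original payload even after the flow has produced `ff2`.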
  9. Processor. Processors perform the actual work of data routing, transformation, or mediation between systems. A processor has access to the attributes of a given FlowFile and to its content stream. It can operate on zero or more FlowFiles in a given unit of work and either commit that work or roll it back.
  10. Processor ● Basic work unit ● State ● Statistics ● Settings ● Input/Output ● Provenance ● Scheduling ● Logging (bulletins)
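The commit-or-rollback unit of work described above can be sketched as a toy session in Python (a hypothetical mini-model of the idea, not NiFi's real `ProcessSession` API): staged work only becomes visible on commit, and on failure the taken FlowFiles go back on the queue so nothing is lost.

```python
class Session:
    """Toy process session: get() takes from the input queue, transfer()
    stages output, and only commit() makes the work permanent."""
    def __init__(self, queue):
        self._queue = queue
        self._taken = []
        self._staged = []
        self.committed = []

    def get(self):
        if not self._queue:
            return None
        ff = self._queue.pop(0)
        self._taken.append(ff)
        return ff

    def transfer(self, ff):
        self._staged.append(ff)

    def commit(self):
        self.committed.extend(self._staged)
        self._taken, self._staged = [], []

    def rollback(self):
        # Put taken FlowFiles back at the front of the queue; drop output.
        self._queue[:0] = self._taken
        self._taken, self._staged = [], []

def run_processor(session, transform):
    """Processor body: take a FlowFile, do the work, then commit,
    or roll back so the input is not lost on failure."""
    ff = session.get()
    if ff is None:
        return
    try:
        session.transfer(transform(ff))
        session.commit()
    except Exception:
        session.rollback()

queue = ["hello", "boom"]
s = Session(queue)
run_processor(s, str.upper)   # succeeds: "HELLO" is committed

def failing(ff):
    raise RuntimeError("transform failed")

run_processor(s, failing)     # fails: "boom" returns to the queue
```

This at-least-once behavior (failed work is retried because the input survives rollback) matches the delivery guarantee listed on the History & Facts slide.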
  11. Connection. Connections provide the actual linkage between processors. They act as queues and allow processes to interact at differing rates. These queues can be prioritized dynamically and can have upper bounds on load, which enables back pressure.
  12. Connection ● Queue ● Statistics ● Settings ● Prioritization ● Details
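The bounded, prioritized queue behavior of a connection can be sketched as follows (an illustrative model, not NiFi code; the `priority_key` callable stands in for NiFi's pluggable prioritizers):

```python
import heapq

class Connection:
    """Toy connection: a prioritized queue with an upper bound.
    When the bound is reached, offer() refuses the FlowFile, which is
    how back pressure propagates to the upstream processor."""
    def __init__(self, max_queued, priority_key):
        self._heap = []
        self._max = max_queued
        self._key = priority_key
        self._seq = 0   # insertion counter keeps ties stable

    def offer(self, ff) -> bool:
        if len(self._heap) >= self._max:
            return False          # back pressure: upstream must wait
        heapq.heappush(self._heap, (self._key(ff), self._seq, ff))
        self._seq += 1
        return True

    def poll(self):
        return heapq.heappop(self._heap)[2] if self._heap else None

# Prioritize smaller content first (strings stand in for FlowFiles).
conn = Connection(max_queued=2, priority_key=len)
assert conn.offer("a long flowfile") is True
assert conn.offer("tiny") is True
assert conn.offer("rejected") is False   # bound hit: back pressure
assert conn.poll() == "tiny"             # smallest content first
```

In real NiFi the bound is configured per connection (object count and data size thresholds), and the scheduler simply stops running the upstream processor while the downstream queue is full.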
  13. Process Group. A specific set of processes and their connections, which can receive data via input ports and send data out via output ports. In this manner, process groups allow the creation of entirely new components simply by composing other components.
  14. Templates. Templates tend to be highly pattern-oriented, and while there are often many different ways to solve a problem, it helps greatly to be able to share those best practices. Templates allow subject-matter experts to build and publish their flow designs and let others benefit from and collaborate on them. ● XML-based ● Reusable unit ● Versioning (e.g. with Git)
  15. Data Provenance. NiFi automatically records, indexes, and makes available provenance data as objects flow through the system, even across fan-in, fan-out, transformations, and more. This information becomes extremely critical in supporting compliance, troubleshooting, optimization, and other scenarios.
  16. Data Provenance ● Details ● Attributes ● Content
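The provenance mechanism described above amounts to an append-only event log indexed by FlowFile. A minimal sketch (event-type names such as RECEIVE and SEND mirror NiFi's vocabulary, but the classes themselves are hypothetical):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ProvenanceEvent:
    event_type: str          # e.g. RECEIVE, ATTRIBUTES_MODIFIED, FORK, SEND
    flowfile_id: str
    details: dict = field(default_factory=dict)
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))

class ProvenanceRepository:
    """Append-only event store: the full lineage of any FlowFile can be
    reconstructed by filtering on its id, including fan-in and fan-out."""
    def __init__(self):
        self._events = []

    def record(self, event):
        self._events.append(event)

    def lineage(self, flowfile_id):
        return [e for e in self._events if e.flowfile_id == flowfile_id]

prov = ProvenanceRepository()
prov.record(ProvenanceEvent("RECEIVE", "ff-1", {"source": "sftp://host/in"}))
prov.record(ProvenanceEvent("ATTRIBUTES_MODIFIED", "ff-1"))
prov.record(ProvenanceEvent("SEND", "ff-1", {"target": "hdfs:///landing"}))
```

Querying `prov.lineage("ff-1")` then reconstructs the receive-transform-send history of that FlowFile, which is the troubleshooting and compliance story from the slide.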
  17. Controller Service. A Controller Service allows developers to share functionality and state across the JVM in a clean and consistent manner. ● No scheduling ● No connections ● Used by processors, reporting tasks, and other controller services
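The shared-state idea can be sketched as a registry holding a single configured service instance that many processors look up by name. Both classes below are hypothetical illustrations (NiFi's real controller services, such as its SSL context service, are Java components resolved through processor properties):

```python
class SSLContextService:
    """Hypothetical controller service: configured once, holds shared
    state, and is never scheduled and has no connections of its own."""
    def __init__(self, keystore_path):
        self.keystore_path = keystore_path
        self.lookups = 0

    def get_context(self):
        self.lookups += 1
        return f"ssl-context({self.keystore_path})"

class ControllerRegistry:
    """Maps service names to single shared instances."""
    def __init__(self):
        self._services = {}

    def register(self, name, service):
        self._services[name] = service

    def get(self, name):
        return self._services[name]

registry = ControllerRegistry()
registry.register("ssl", SSLContextService("/etc/nifi/keystore.jks"))

# Two independent "processors" resolve the same shared instance.
ctx_a = registry.get("ssl").get_context()
ctx_b = registry.get("ssl").get_context()
```

Because every consumer resolves the same instance, expensive state (keystores, connection pools, caches) is configured once and reused across the flow.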
  18. Reporting Tasks. Provide a capability for reporting status, statistics, metrics, and monitoring information to external services. ● Examples: ElasticSearchProvenanceReporter and DataDogReportingTask
  19. Extensibility ● Ready-to-use Maven template ● Well-defined interface for each component ● Classloader isolation (.nar files) ● Great documentation for developers
  20. Statistics ● 200+ built-in processors ● 10+ built-in controller services ● 10+ built-in reporting tasks
  21. Introduction Summary ● Processor ● Connection ● Process Group ● Template ● Controller Service ● Reporting Task
  22. Our use case for NiFi
  23. What was there before ● In-house-built file collector ● Footprint of 10 servers ● Hard to manage, scale, and extend
  24. DWH Real Time
  25. DWH Batch
  26. Reports Distribution
  27. Statistics ● 20 TB data ingested daily ● 250K files ingested daily ● Near-real-time data availability (minimum interval: 1 min) ● 1 TB data distributed ● Reports: 1 TB, 30K files exported daily
  28. AWS - Hadoop Ingestion
  29. AWS - Hadoop Ingestion
  30. Kafka Reprocessing
  31. sFTP - HDFS Ingestion
  32. Let’s break something ;)
  33. Use Cases Summary ● Web user interface ● Configurable ● Scalable ● Easy to manage ● Designed for extension
  34. Q & A
  35. THANK YOU
