Data Ingestion and Distribution with Apache NiFi

In this session, we will cover our experience working with Apache NiFi, an easy-to-use, powerful, and reliable system for processing and distributing large volumes of data. The first part of the session is an introduction to Apache NiFi: we will go over NiFi's main components, building blocks, and functionality.
In the second part, we will show our use case for Apache NiFi and how it is being used inside our data-processing infrastructure.

Published in: Data & Analytics


  1. Data Ingestion & Distribution with Apache NiFi
  2. Agenda ● Introduction to NiFi ● Our use case for NiFi ● Demo ● Q&A
  3. Introduction to NiFi
  4. History & Facts ● Created by: NSA ● Incubating: 2014 ● Available: 2015 ● Main contributors: Hortonworks ● Current stable version: 1.1.1 ● Delivery guarantees: at least once ● Out-of-order processing: no ● Windowing: no ● Back pressure: yes ● Latency: configurable ● Resource management: native ● API: REST (GUI)
  5. Ecosystem ● Stream Processing ● Data Moving
  6. Architecture
  7. Flow Files: the basic abstraction ● Pointer to content ● Content attributes (key/value) ● Connection to provenance events
  8. Repositories ● FlowFile ● Content ● Provenance ● Immutable ● Copy-on-write
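The FlowFile and repository model on the two slides above can be sketched as a toy in Python. All class and method names here are illustrative stand-ins, not NiFi's actual Java API; the point is that a FlowFile is only metadata plus a pointer into an immutable, copy-on-write content store:

```python
import uuid

class ContentRepository:
    """Toy content repository: claims are immutable, and every write
    creates a new claim (copy-on-write), so existing FlowFiles keep
    pointing at their original content."""
    def __init__(self):
        self._claims = {}

    def write(self, data: bytes) -> str:
        claim_id = str(uuid.uuid4())
        self._claims[claim_id] = data   # never mutated afterwards
        return claim_id

    def read(self, claim_id: str) -> bytes:
        return self._claims[claim_id]

class FlowFile:
    """A FlowFile is metadata: key/value attributes plus a pointer
    (a claim id) into the content repository, not the bytes themselves."""
    def __init__(self, claim_id: str, attributes: dict):
        self.claim_id = claim_id
        self.attributes = dict(attributes)

repo = ContentRepository()
claim = repo.write(b"original payload")
ff = FlowFile(claim, {"filename": "data.csv", "path": "/in"})

# "Modifying" content writes a new claim; the old claim is untouched,
# which is what makes provenance replay possible.
new_claim = repo.write(b"transformed payload")
ff2 = FlowFile(new_claim, {**ff.attributes, "stage": "transformed"})
```

Because old claims survive, replaying `ff` later still yields the original payload even after the flow has produced `ff2`.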
  9. Processor. Processors perform the actual work of data routing, transformation, or mediation between systems. A processor has access to the attributes of a given FlowFile and to its content stream. It can operate on zero or more FlowFiles in a given unit of work and either commit that work or roll it back.
  10. Processor ● Basic work unit ● State ● Statistics ● Settings ● Input/Output ● Provenance ● Scheduling ● Logging (bulletins)
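The commit-or-rollback unit of work described above can be sketched as a toy session in Python (a hypothetical mini-model of the idea, not NiFi's real `ProcessSession` API): staged work only becomes visible on commit, and on failure the taken FlowFiles go back on the queue so nothing is lost.

```python
class Session:
    """Toy process session: get() takes from the input queue, transfer()
    stages output, and only commit() makes the work permanent."""
    def __init__(self, queue):
        self._queue = queue
        self._taken = []
        self._staged = []
        self.committed = []

    def get(self):
        if not self._queue:
            return None
        ff = self._queue.pop(0)
        self._taken.append(ff)
        return ff

    def transfer(self, ff):
        self._staged.append(ff)

    def commit(self):
        self.committed.extend(self._staged)
        self._taken, self._staged = [], []

    def rollback(self):
        # Put taken FlowFiles back at the front of the queue; drop output.
        self._queue[:0] = self._taken
        self._taken, self._staged = [], []

def run_processor(session, transform):
    """Processor body: take a FlowFile, do the work, then commit,
    or roll back so the input is not lost on failure."""
    ff = session.get()
    if ff is None:
        return
    try:
        session.transfer(transform(ff))
        session.commit()
    except Exception:
        session.rollback()

queue = ["hello", "boom"]
s = Session(queue)
run_processor(s, str.upper)   # succeeds: "HELLO" is committed

def failing(ff):
    raise RuntimeError("transform failed")

run_processor(s, failing)     # fails: "boom" returns to the queue
```

This at-least-once behavior (failed work is retried because the input survives rollback) matches the delivery guarantee listed on the History & Facts slide.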
  11. Connection. Connections provide the actual linkage between processors. They act as queues and allow processes to interact at differing rates. These queues can be prioritized dynamically and can have upper bounds on load, which enables back pressure.
  12. Connection ● Queue ● Statistics ● Settings ● Prioritization ● Details
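The bounded, prioritized queue behavior of a connection can be sketched as follows (an illustrative model, not NiFi code; the `priority_key` callable stands in for NiFi's pluggable prioritizers):

```python
import heapq

class Connection:
    """Toy connection: a prioritized queue with an upper bound.
    When the bound is reached, offer() refuses the FlowFile, which is
    how back pressure propagates to the upstream processor."""
    def __init__(self, max_queued, priority_key):
        self._heap = []
        self._max = max_queued
        self._key = priority_key
        self._seq = 0   # insertion counter keeps ties stable

    def offer(self, ff) -> bool:
        if len(self._heap) >= self._max:
            return False          # back pressure: upstream must wait
        heapq.heappush(self._heap, (self._key(ff), self._seq, ff))
        self._seq += 1
        return True

    def poll(self):
        return heapq.heappop(self._heap)[2] if self._heap else None

# Prioritize smaller content first (strings stand in for FlowFiles).
conn = Connection(max_queued=2, priority_key=len)
assert conn.offer("a long flowfile") is True
assert conn.offer("tiny") is True
assert conn.offer("rejected") is False   # bound hit: back pressure
assert conn.poll() == "tiny"             # smallest content first
```

In real NiFi the bound is configured per connection (object count and data size thresholds), and the scheduler simply stops running the upstream processor while the downstream queue is full.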
  13. Process Group. A specific set of processes and their connections, which can receive data via input ports and send data out via output ports. In this manner, process groups allow the creation of entirely new components simply by composing other components.
  14. Templates. Templates tend to be highly pattern-oriented, and while there are often many different ways to solve a problem, it helps greatly to be able to share those best practices. Templates allow subject-matter experts to build and publish their flow designs and let others benefit from and collaborate on them. ● XML-based ● Reusable unit ● Versioning (e.g. with Git)
  15. Data Provenance. NiFi automatically records, indexes, and makes available provenance data as objects flow through the system, even across fan-in, fan-out, transformations, and more. This information becomes extremely critical in supporting compliance, troubleshooting, optimization, and other scenarios.
  16. Data Provenance ● Details ● Attributes ● Content
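The provenance mechanism described above amounts to an append-only event log indexed by FlowFile. A minimal sketch (event-type names such as RECEIVE and SEND mirror NiFi's vocabulary, but the classes themselves are hypothetical):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ProvenanceEvent:
    event_type: str          # e.g. RECEIVE, ATTRIBUTES_MODIFIED, FORK, SEND
    flowfile_id: str
    details: dict = field(default_factory=dict)
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))

class ProvenanceRepository:
    """Append-only event store: the full lineage of any FlowFile can be
    reconstructed by filtering on its id, including fan-in and fan-out."""
    def __init__(self):
        self._events = []

    def record(self, event):
        self._events.append(event)

    def lineage(self, flowfile_id):
        return [e for e in self._events if e.flowfile_id == flowfile_id]

prov = ProvenanceRepository()
prov.record(ProvenanceEvent("RECEIVE", "ff-1", {"source": "sftp://host/in"}))
prov.record(ProvenanceEvent("ATTRIBUTES_MODIFIED", "ff-1"))
prov.record(ProvenanceEvent("SEND", "ff-1", {"target": "hdfs:///landing"}))
```

Querying `prov.lineage("ff-1")` then reconstructs the receive-transform-send history of that FlowFile, which is the troubleshooting and compliance story from the slide.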
  17. Controller Service. A Controller Service allows developers to share functionality and state across the JVM in a clean and consistent manner. ● No scheduling ● No connections ● Used by processors, reporting tasks, and other controller services
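The shared-state idea can be sketched as a registry holding a single configured service instance that many processors look up by name. Both classes below are hypothetical illustrations (NiFi's real controller services, such as its SSL context service, are Java components resolved through processor properties):

```python
class SSLContextService:
    """Hypothetical controller service: configured once, holds shared
    state, and is never scheduled and has no connections of its own."""
    def __init__(self, keystore_path):
        self.keystore_path = keystore_path
        self.lookups = 0

    def get_context(self):
        self.lookups += 1
        return f"ssl-context({self.keystore_path})"

class ControllerRegistry:
    """Maps service names to single shared instances."""
    def __init__(self):
        self._services = {}

    def register(self, name, service):
        self._services[name] = service

    def get(self, name):
        return self._services[name]

registry = ControllerRegistry()
registry.register("ssl", SSLContextService("/etc/nifi/keystore.jks"))

# Two independent "processors" resolve the same shared instance.
ctx_a = registry.get("ssl").get_context()
ctx_b = registry.get("ssl").get_context()
```

Because every consumer resolves the same instance, expensive state (keystores, connection pools, caches) is configured once and reused across the flow.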
  18. Reporting Tasks. Provide a capability for reporting status, statistics, metrics, and monitoring information to external services. ● Examples: ElasticSearchProvenanceReporter and DataDogReportingTask
  19. Extensibility ● Ready-to-use Maven template ● Well-defined interface for each component ● Classloader isolation (.nar files) ● Great documentation for developers
  20. Statistics ● 200+ built-in processors ● 10+ built-in controller services ● 10+ built-in reporting tasks
  21. Introduction Summary ● Processor ● Connection ● Process Group ● Template ● Controller Service ● Reporting Task
  22. Our use case for NiFi
  23. What was there before ● In-house-built file collector ● Footprint of 10 servers ● Hard to manage, scale, and extend
  24. DWH Real Time
  25. DWH Batch
  26. Reports Distribution
  27. Statistics ● 20 TB data ingested daily ● 250K files ingested daily ● Near-real-time data availability (minimum interval: 1 min) ● 1 TB data distributed ● Reports: 1 TB, 30K files exported daily
  28. AWS - Hadoop Ingestion
  29. AWS - Hadoop Ingestion
  30. Kafka Reprocessing
  31. sFTP - HDFS Ingestion
  32. Let’s break something ;)
  33. Use Cases Summary ● Web user interface ● Configurable ● Scalable ● Easy to manage ● Designed for extension
  34. Q & A
  35. THANK YOU
