Real-Time Data Flows
with Apache NiFi
June 2016
Manish Gupta
1. Data Flow Challenges in an Enterprise
2. Introduction to Apache NiFi
3. Core Features
4. Architecture
5. Demo – Simple Lambda Architecture
6. Use Cases
7. Q & A
Data Flow Challenges in an Enterprise
Connected Enterprises in a Distributed World
Evolution of Data Projects in an Enterprise
Simple VB/C++ applications working directly on spreadsheets / Access databases.
OLTP applications (mostly web) using an RDBMS at the backend, with some operational reporting capabilities.
Multiple OLTP applications exchanging data between themselves and trying to do BI reporting, but failing miserably at providing a single version of truth.
Emergence of EDW, ETL, BI, OLAP, MDM, Data Governance. Issues in scalability, availability, maintainability, performance etc. started showing up at a totally different level.
Emergence of MPP data warehouses, grid computing, self-service BI tools and technology.
Big Data, Cloud/Hybrid Architecture, Real-Time Analytics / Stream Processing, Document Indexing/Search, Log Capture & Analysis, Data Virtualization, NoSQL (Key-Value, Document Oriented, Column Family, Graph), In-memory databases.
Application Architecture Pattern Silos
[Diagram: separate architecture pattern silos. Labels: Apps & Services, RDBMS, OLTP Workload, Spreadsheets / Access, Caches & Derived Store, Cache, Poll For Changes, NFS, Logs, Log Aggregator / Flume, Hadoop, Transform, ODS, Data Guard / Golden Gate / Log Shipping etc., EDW, ETL, ELT, Splunk, CSV Dump / Sqoop, NoSQL / Key-Value Store, ActiveMQ, Stream Processor, Kafka, Cloud.]
And we end up in a “Giant Hairball” Arch…
[Diagram: the same silos wired together into one tangled architecture. Labels: Apps & Services, RDBMS, OLTP Workload, Caches & Derived Store, Cache, Poll For Changes, ODS, Data Guard / Golden Gate / Log Shipping etc., Hadoop, EDW, ELT, ETL, CSV Dump / Sqoop, NFS, Logs, Log Aggregator / Flume, Transform, NoSQL / Key-Value Store, Load / Refresh, Splunk, Spreadsheets / Access, ActiveMQ, Cloud, Stream Processor, MemSQL, Elastic / Solr, Logstash, REST, Graph DB.]
Data Ingestion Frameworks (excluding Traditional ETL Tools)
Amazon Kinesis
Apache Chukwa
Apache Flume
Apache Kafka
Apache Sqoop
Cloudera Morphlines
Facebook Scribe
Fluentd
Google Photon
Mozilla Heka
Hadoop In, Hadoop Out
Twitter Kestrel
LinkedIn Databus
Elastic Logstash
Netflix Suro
LinkedIn Gobblin
InfoSphere Streams
Apache Beam
- Purpose-built (not designed for universal applicability)
- Induce a lot of complexity in project architecture
- Hard to extend
Common Data Integration Problems
Size and Velocity
Messages in a streaming manner.
Tiny to small files in micro-batches.
Small files in mini-batches.
Medium to large files in batches.
Formats
CSV, TSV, PSV, TEXT, JSON, XML, XLS, XLSX, PDF, BMP, PRN
Avro, Protocol Buffer, Parquet, RC, ORC, Sequence File
Zip, GZIP, TAR, LZIP, 7z, RAR
Mediums
File Share, FTP, REST, HTTP, TCP, UDP
Schedule
Once, every day, every hour, every minute, every second, continuous.
Mode
• Push / Pull / Poll
Asynchronous Operation Challenges
• Fast edge consumers + slow processors = everything breaks
• Process Message A first; all others can take a backseat.
Security
• Data should be secure, not just at rest but in motion too.
Miscellaneous
• Can you route a copy of this to our NoSQL store as well, after converting it to JSON?
• Ability to recover from failure (checkpoint / rerun / replay)
• Merge small files into large files for Hadoop
• Break large files into smaller, manageable chunks for NoSQL
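The last two items (merging small files for Hadoop, splitting large files for NoSQL) can be illustrated with a minimal sketch in plain Python. The function names `split_into_chunks` and `merge_into_bundles` and the byte thresholds are made up for illustration; they are not NiFi APIs.

```python
from typing import Iterable, Iterator, List


def split_into_chunks(payload: bytes, chunk_size: int) -> List[bytes]:
    """Break one large payload into chunks of at most chunk_size bytes."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [payload[i:i + chunk_size] for i in range(0, len(payload), chunk_size)]


def merge_into_bundles(files: Iterable[bytes], min_bundle_size: int) -> Iterator[bytes]:
    """Concatenate small payloads until each bundle reaches min_bundle_size bytes."""
    buffer = bytearray()
    for content in files:
        buffer.extend(content)
        if len(buffer) >= min_bundle_size:
            yield bytes(buffer)
            buffer.clear()
    if buffer:  # flush the final, possibly undersized, bundle
        yield bytes(buffer)


chunks = split_into_chunks(b"abcdefg", 3)
bundles = list(merge_into_bundles([b"aa", b"bb", b"cc"], 4))
```

A real flow would also need to carry metadata (original file name, chunk index) so the pieces can be traced or reassembled; that is left out here.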
Introduction
What is Apache NiFi, its history, and some terminology.
What is Apache NiFi
NiFi (short for “Niagara Files”) is a powerful, enterprise-grade dataflow tool that
can collect, route, enrich, transform, and process data in a scalable manner.
NiFi is based on the concepts of flow-based programming (FBP). FBP is a
programming paradigm that defines applications as networks of "black box"
processes, which exchange data across predefined connections by message
passing, where the connections are specified externally to the processes.
Single combined platform for
• Data acquisition
• Simple event processing
• Transport and delivery
• Designed to accommodate highly diverse and complicated dataflows
• It has a visual command and control interface that lets you define and
manipulate data flows in real time and with great agility.
Short History
• Developed by the National Security Agency (NSA) for over 8 years
• Open sourced in Nov 2014 (Apache). Major contributors were ex-NSA engineers
who formed a company named Onyara. Lead: Joe Witt.
• Became an Apache Top-Level Project in July 2015
• In August 2015, Hortonworks acquired Onyara
• In September 2015, Hortonworks released HDF 1.0, powered by NiFi.
Current version is HDF 1.2
• HDF has the solid backing of Hortonworks.
Terminology
FlowFile (Information Packet)
Unit of data (each object) moving through the system
Content + Attributes (key/value pairs)
Processor (Black Box)
Performs the work, can access FlowFiles
Currently there are 135 different processors
Connection (Bounded Buffer)
Links between processors
Queues that can be dynamically prioritized
Process Group (Subnet)
Set of processors and their connections
Receive data via input ports, send data via output ports
Flow Controller (Scheduler)
Maintains the knowledge of how processes are connected and
manages the threads and their allocation.
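To make the terminology concrete, here is a minimal sketch in plain Python of a FlowFile, a Connection as a bounded buffer, and a Processor as a black box. It only models the concepts: NiFi itself is a Java application, and the class and method names here are illustrative rather than NiFi's API.

```python
import queue
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class FlowFile:
    """Unit of data moving through the flow: content plus key/value attributes."""
    content: bytes
    attributes: Dict[str, str] = field(default_factory=dict)


class Connection:
    """Bounded buffer that links two processors."""

    def __init__(self, capacity: int) -> None:
        self._queue: "queue.Queue[FlowFile]" = queue.Queue(maxsize=capacity)

    def put(self, flowfile: FlowFile) -> None:
        self._queue.put_nowait(flowfile)  # raises queue.Full when the buffer is full

    def get(self) -> FlowFile:
        return self._queue.get_nowait()

    def is_empty(self) -> bool:
        return self._queue.empty()


class Processor:
    """Black box that reads from one connection and writes to another."""

    def __init__(self, work: Callable[[FlowFile], FlowFile],
                 source: Connection, destination: Connection) -> None:
        self.work = work
        self.source = source
        self.destination = destination

    def run_once(self) -> None:
        """Process everything currently queued on the source connection."""
        while not self.source.is_empty():
            self.destination.put(self.work(self.source.get()))


def to_upper(flowfile: FlowFile) -> FlowFile:
    """Example work function: upper-case the content and tag the FlowFile."""
    return FlowFile(flowfile.content.upper(),
                    {**flowfile.attributes, "transformed": "true"})


inbound, outbound = Connection(capacity=10), Connection(capacity=10)
inbound.put(FlowFile(b"hello", {"filename": "a.txt"}))
Processor(to_upper, inbound, outbound).run_once()
result = outbound.get()
```

The processor knows nothing about what is upstream or downstream of its connections, which is the flow-based programming idea: the wiring is defined outside the processes.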
Flow File
Header: UUID, Name, Size, In time, Attribute Map
Content
Types (Events, Files, Objects, Messages etc.), Formats (JSON, XML, AVRO, Text, Proprietary etc.), Size (B to GBs)
Processor
Routing (Context/Content), Transformation (enrich, filter, convert, split, aggregate, custom), Mediation (push / pull), Scheduling
Connection
Queuing, back-pressure, expiration, prioritization
Apache NiFi is not
• A distributed computation engine
• An engine to do CEP (complex event processing)
• A computational framework to do distributed joins or rolling window aggregations the way
Spark/Storm/Flink do.
• Hence it’s not based on MapReduce / Spark or any other framework.
• NiFi doesn’t depend on any big data tool such as Hadoop or ZooKeeper. All it needs is Java.
• It’s not a full-fledged ETL tool like Informatica / Pentaho / Talend / SSIS as of now, but it will be eventually.
• It’s not a long-term data storage tool. It only holds data temporarily for re-run / data provenance purposes.
• It’s not a document indexer. Its indexing capabilities are only there to help in troubleshooting / debugging.
Core Features
What are the core features and benefits of Apache NiFi?
Guaranteed Data Delivery
• Even at very high scale, delivery is guaranteed
• A persistent write-ahead log (FlowFile Repository) and data partitioning
(Content Repository) ensure this. Together they are designed to allow:
• Very high transaction rates
• Effective load spreading
• Copy-on-write scheme (for every change in data)
• Pass-by-reference
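The copy-on-write and pass-by-reference ideas can be sketched as follows. This is an in-memory illustration under simplifying assumptions: the class names mirror the repositories named above, but the methods, the integer claim ids, and the JSON log format are invented for the example, and nothing is actually persisted to disk.

```python
import itertools
import json
from typing import Dict, List


class ContentRepository:
    """Stores immutable content blobs; callers hold only a claim id."""

    def __init__(self) -> None:
        self._blobs: Dict[int, bytes] = {}
        self._next_claim = itertools.count(1)

    def write(self, content: bytes) -> int:
        claim = next(self._next_claim)
        self._blobs[claim] = content  # existing claims are never overwritten
        return claim

    def read(self, claim: int) -> bytes:
        return self._blobs[claim]


class FlowFileRepository:
    """Append-only log of FlowFile state changes (stand-in for a write-ahead log)."""

    def __init__(self) -> None:
        self.log: List[str] = []

    def record(self, flowfile_id: str, claim: int, attributes: Dict[str, str]) -> None:
        entry = {"id": flowfile_id, "claim": claim, "attributes": attributes}
        self.log.append(json.dumps(entry))

    def recover(self) -> Dict[str, dict]:
        """Replay the log; the last entry per FlowFile wins."""
        state: Dict[str, dict] = {}
        for line in self.log:
            entry = json.loads(line)
            state[entry["id"]] = entry
        return state


content_repo = ContentRepository()
flowfile_repo = FlowFileRepository()

original_claim = content_repo.write(b"raw,csv,row")
flowfile_repo.record("ff-1", original_claim, {"format": "csv"})

# A transformation writes new content under a new claim; the original stays intact.
converted_claim = content_repo.write(b'["raw", "csv", "row"]')
flowfile_repo.record("ff-1", converted_claim, {"format": "json"})

recovered = flowfile_repo.recover()
```

Because the log holds only small records that point at content, replaying it after a restart is cheap, and the untouched original content remains available for replay.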
Data Buffering w/ Back Pressure and Pressure Release
• Supports buffering of all queued data.
• Ability to apply back-pressure: even if there is no load balancing, nodes can say “Back off”
and other nodes in the pipeline pick up the slack.
• When backpressure is applied to a connection, it will cause the processor that is the
source of the connection to stop being scheduled to run until the queue clears out.
However, data will still queue up in that processor's incoming connections.
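The behaviour described above can be simulated in a few lines. This is a toy model, not NiFi code: the `simulate` function, the tick-based scheduling, and the object-count threshold are assumptions chosen to show the source processor being skipped while its outgoing connection is full.

```python
from collections import deque
from typing import Deque, List


def simulate(items: List[str], threshold: int, consumer_period: int) -> dict:
    """Run a source and a slow consumer joined by a connection with a threshold."""
    incoming: Deque[str] = deque(items)  # data queued upstream of the source processor
    connection: Deque[str] = deque()     # queue between the source and the slow consumer
    delivered: List[str] = []
    skipped_runs = 0
    max_depth = 0
    tick = 0
    while len(delivered) < len(items):
        tick += 1
        # The source is scheduled only while the connection is below its threshold.
        if incoming:
            if len(connection) < threshold:
                connection.append(incoming.popleft())
            else:
                skipped_runs += 1  # back-pressure: the source does not run this tick
        max_depth = max(max_depth, len(connection))
        # The slow consumer drains one item every consumer_period ticks.
        if tick % consumer_period == 0 and connection:
            delivered.append(connection.popleft())
    return {"delivered": delivered, "max_depth": max_depth, "skipped_runs": skipped_runs}


messages = [f"msg-{i}" for i in range(10)]
report = simulate(messages, threshold=3, consumer_period=2)
```

Nothing is dropped: the queue depth never exceeds the threshold, and the items that could not move forward simply wait upstream until the consumer catches up.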
Prioritized Queuing
• NiFi allows the setting of one or more prioritization schemes for how data is
retrieved from a queue.
• Oldest first, newest first, largest first, smallest first, or a custom scheme
• The default is oldest first
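A prioritization scheme is essentially a sort key applied when data is pulled from a queue. The sketch below models that with a heap; the `PrioritizedQueue` class, the `PRIORITIZERS` table, and the logical timestamps are illustrative assumptions, not NiFi's prioritizer interface.

```python
import heapq
import itertools
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class QueuedItem:
    name: str
    size: int         # content size in bytes
    enqueued_at: int  # logical timestamp; smaller means older


# Each scheme maps an item to a sort key; the smallest key is retrieved first.
PRIORITIZERS: Dict[str, Callable[[QueuedItem], int]] = {
    "oldest_first": lambda item: item.enqueued_at,
    "newest_first": lambda item: -item.enqueued_at,
    "largest_first": lambda item: -item.size,
    "smallest_first": lambda item: item.size,
}


class PrioritizedQueue:
    """Queue whose retrieval order is decided by a pluggable prioritizer."""

    def __init__(self, prioritizer: Callable[[QueuedItem], int]) -> None:
        self._key = prioritizer
        self._heap: list = []
        self._tiebreak = itertools.count()  # keeps insertion order for equal keys

    def put(self, item: QueuedItem) -> None:
        heapq.heappush(self._heap, (self._key(item), next(self._tiebreak), item))

    def get(self) -> QueuedItem:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


def drain_order(scheme: str, items: List[QueuedItem]) -> List[str]:
    """Return item names in the order the given scheme would retrieve them."""
    q = PrioritizedQueue(PRIORITIZERS[scheme])
    for item in items:
        q.put(item)
    return [q.get().name for _ in range(len(q))]


items = [QueuedItem("a", 50, 1), QueuedItem("b", 5, 2), QueuedItem("c", 500, 3)]
orders = {scheme: drain_order(scheme, items) for scheme in PRIORITIZERS}
```

A custom scheme is just another key function, for example one that reads a priority attribute from the item.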
Designed for Extension
• NiFi is highly extensible by design.
• One can write custom:
  • Processors
  • Controller Services
  • Reporting Tasks
  • Prioritizers
  • User Interfaces
• These extensions are bundled in NAR files (NiFi Archives).
Visual Interface for Command and Control
• Drag and drop processors to build a flow
• Start, stop, and configure components in real time
• View errors and corresponding error messages
• View statistics and health of data flow
• Create templates of common processors & connections
Data Provenance (Not just Lineage)
• View attributes and content at given points in time
(before and after each processor) !!!
• Records, indexes, and makes events available for
display
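What separates provenance from plain lineage is that each event keeps a snapshot of attributes and content at that point in the flow. A minimal sketch, assuming a hypothetical in-memory `ProvenanceLog` (the class, its methods, and the event type strings are illustrative, not NiFi's provenance API):

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ProvenanceEvent:
    sequence: int
    flowfile_id: str
    component: str
    event_type: str
    attributes: Dict[str, str]  # snapshot of attributes at this point
    content: bytes              # snapshot of content at this point


class ProvenanceLog:
    """Records events and makes them searchable by FlowFile id."""

    def __init__(self) -> None:
        self._events: List[ProvenanceEvent] = []

    def record(self, flowfile_id: str, component: str, event_type: str,
               attributes: Dict[str, str], content: bytes) -> ProvenanceEvent:
        event = ProvenanceEvent(len(self._events) + 1, flowfile_id, component,
                                event_type, dict(attributes), content)
        self._events.append(event)
        return event

    def history(self, flowfile_id: str) -> List[ProvenanceEvent]:
        """All events for one FlowFile, in the order they happened."""
        return [e for e in self._events if e.flowfile_id == flowfile_id]


log = ProvenanceLog()
log.record("ff-1", "GetFile", "RECEIVE", {"format": "csv"}, b"a,b")
log.record("ff-1", "ConvertToJson", "CONTENT_MODIFIED", {"format": "json"}, b'["a", "b"]')
log.record("ff-2", "GetFile", "RECEIVE", {"format": "csv"}, b"c,d")

history = log.history("ff-1")
before, after = history[0], history[1]
```

Comparing `before` and `after` shows exactly what a given step changed, which is the "before and after each processor" view described above.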
Benefits of Apache NiFi
Single data-source agnostic collection platform
Intuitive, real-time visual user interface with drag-and-drop capabilities
Powerful Data security capabilities from source to storage
Highly granular data sharing policies
Ability to react in real time by leveraging bi-directional data flows and prioritized data feeds
Extremely scalable, extensible platform
Architecture
High level architecture (single machine), Primary components
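[Diagram: a single NiFi node running in a JVM on one host.]
Web Server: hosts NiFi's HTTP-based command and control API.
Flow Controller: the real brain. Provides and manages threads; handles scheduling.
Extensions: run within the JVM. Processor / Controller Service / Reporting Task / UI / Prioritizer.
FlowFile Repository: state about a given FlowFile that is presently active in the flow. Write-ahead log (WAL).
Content Repository: actual content bytes of a given FlowFile. Blocks of data in the file system; can span more than one file system (partitions).
Provenance Repository: where all provenance event data is stored. Saved on the file system. Indexed / searchable.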
Demo
Simple λ Architecture
Demo (Simple λ Architecture using NiFi)
1. Start NiFi (or HDF). Start ElasticSearch, Kibana and MongoDB on Docker. Create the HDFS destination table.
2. Explore NiFi UI
3. Pull data from Twitter
4. Route & Deliver to ElasticSearch, Mongo and Hadoop in real time
5. Explore NiFi capabilities
6. Design Dashboard on real-time data
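Step 4 is a fan-out: every incoming record is copied to each destination. The sketch below models that with in-memory lists standing in for ElasticSearch, MongoDB and Hadoop; the `fan_out` function and the sample tweet fields are hypothetical, and no real client libraries are involved.

```python
import copy
from typing import Dict, Iterable, List


def fan_out(records: Iterable[dict], destinations: Dict[str, List[dict]]) -> None:
    """Deliver an independent copy of every record to every destination."""
    for record in records:
        for store in destinations.values():
            store.append(copy.deepcopy(record))


# In-memory stand-ins for the three delivery targets used in the demo.
destinations: Dict[str, List[dict]] = {"elasticsearch": [], "mongodb": [], "hadoop": []}

tweets = [
    {"id": 1, "text": "hello nifi", "lang": "en"},
    {"id": 2, "text": "hola nifi", "lang": "es"},
]
fan_out(tweets, destinations)

# Each destination owns its copy, so a change in one does not leak into the others.
destinations["mongodb"][0]["text"] = "edited"
```

In the demo itself this routing is drawn on the NiFi canvas by connecting one processor's output to several downstream processors rather than written as code.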
[Diagram: Lambda architecture demo. Layers: Serving Layer, Speed Layer, Batch Layer. Labels: Twitter, NiFi, Hadoop, ElasticSearch, MongoDB, Spark SQL, Kibana, Docker, Query.]
WYSIWYG…!!!
Use Cases
Some Scenarios
Some Use Cases
Building Ingestion and Delivery layers in IoT Solutions
Ingestion tier in Lambda Architecture (for feeding both speed and batch layers)
Ingestion tier in Data Lake Architectures
Cross Geography Data Replication in a secure manner
Integrating on-premises systems with cloud systems (Hybrid Cloud Architecture)
Simplifying existing Big Data architectures that currently use Flume, Kafka, Logstash, Scribe etc. or
custom connectors.
Developing Edge nodes for Trade repositories.
Enterprise Data Integration platform
And many more…
Q & A
Reference
https://nifi.apache.org/
http://hortonworks.com/products/data-center/hdf/
https://github.com/apache/nifi
https://twitter.com/apachenifi
Thank You
@manishpedia
https://in.linkedin.com/in/manishgforce

Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 

Real-Time Data Flows with Apache NiFi

  • 1. Real-Time Data Flows with Apache NiFi, June 2016, Manish Gupta
  • 2. 1. Data Flow Challenges in an Enterprise 2. Introduction to Apache NiFi 3. Core Features 4. Architecture 5. Demo – Simple Lambda Architecture 6. Use Cases 7. Q & A
  • 3. Data Flow Challenges in an Enterprise Connected Enterprises in a Distributed World
  • 4. Evolution of Data Projects in an Enterprise. Simple VB/C++ applications working directly on spreadsheets / Access databases. OLTP applications (mostly web) using an RDBMS at the backend, with some operational reporting capabilities. Multiple OLTP applications exchanging data among themselves and trying to do BI reporting, but failing miserably to provide a single version of the truth. Emergence of EDW, ETL, BI, OLAP, MDM, Data Governance. Issues in scalability, availability, maintainability, performance, etc. started showing up at a totally different level. Emergence of MPP data warehouses, grid computing, self-service BI tools and technology. Big Data, Cloud/Hybrid Architecture, Real-Time Analytics / Stream Processing, Document Indexing/Search, Log Capture & Analysis, Data Virtualization, NoSQL (Key-Value, Document Oriented, Column Family, Graph), In-memory databases.
  • 5. Application Architecture Pattern Silos. Apps & Services RDBMS OLTP Workload Spreadsheets / Access RDBMS App App App Caches & Derived Store Cache Poll For Changes App App App NFS Logs Log Aggregator / Flume H A D O O P Transform RDBMS ODS Data Guard / Golden Gate / Log Shipping Etc. EDW ETL ELT App App App Splunk H A D O O P EDW CSV Dump / Sqoop App App App NoSQL / Key-Value Store App ActiveMQ Stream Processor Kafka NoSQL / Key-Value Store Cloud App
  • 6. And we end up in a “Giant Hairball” Arch… Apps & Services RDBMS OLTP Workload App App App Caches & Derived Store Cache Poll For Changes ODS Data Guard / Golden Gate / Log Shipping Etc. H A D O O P EDW ELT ETL CSV Dump / Sqoop NFS Logs Log Aggregator / Flume Transform App App App NoSQL / Key-Value Store ETL Load / Refresh Splunk Spreadsheets / Access App ActiveMQ App Cloud Stream Processor MemSQL Log Aggregator / Flume Elastic/ Solr Logstash REST Graph DB
  • 7. Data Ingestion Frameworks (excluding Traditional ETL Tools): Amazon Kinesis, Apache Chukwa, Apache Flume, Apache Kafka, Apache Sqoop, Cloudera Morphlines, Facebook Scribe, Fluentd, Google Photon, Mozilla Heka, Hadoop In, Hadoop Out, Twitter Kestrel, LinkedIn Databus, Elastic Logstash, Netflix Suro, LinkedIn Gobblin, InfoSphere Streams, Apache Beam. Drawbacks: purpose-built (not designed with universal applicability); introduce a lot of complexity into the project architecture; hard to extend.
  • 8. Common Data Integration Problems. Size and Velocity: messages in a streaming manner; tiny to small files in micro-batches; small files in mini-batches; medium to large files in batches. Formats: CSV, TSV, PSV, TEXT, JSON, XML, XLS, XLSX, PDF, BMP, PRN; Avro, Protocol Buffer, Parquet, RC, ORC, Sequence File; Zip, GZIP, TAR, LZIP, 7z, RAR. Mediums: File Share, FTP, REST, HTTP, TCP, UDP. Schedule: once, every day, every hour, every minute, every second, continuous. Mode: push / pull / poll. Asynchronous Operation Challenges: fast edge consumers + slow processors = everything breaks; process Message A first, all others can take a backseat. Security: data should be secure, not just at rest but in motion too. Miscellaneous: can you route a copy of this to our NoSQL store as well after converting it to JSON? Ability to recover from failure (checkpoint / rerun / replay). Merge small files into large files for Hadoop. Break large files into smaller, manageable chunks for NoSQL.
  • 9. Introduction: What is Apache NiFi, its history, and some terminology.
  • 10. What is Apache NiFi? NiFi (short for “Niagara Files”) is a powerful, enterprise-grade dataflow tool that can collect, route, enrich, transform, and process data in a scalable manner. NiFi is based on the concepts of flow-based programming (FBP). FBP is a programming paradigm that defines applications as networks of "black box" processes, which exchange data across predefined connections by message passing, where the connections are specified externally to the processes. A single combined platform for: data acquisition; simple event processing; transport and delivery. Designed to accommodate highly diverse and complicated dataflows. • It has a visual command and control interface that lets you define and manipulate data flows in real time and with great agility.
  • 11. Short History. Developed by the National Security Agency (NSA) for over 8 years. Open sourced in Nov 2014 (Apache). Major contributors were ex-NSA staff who formed a company named Onyara. Lead: Joe Witt. Became an Apache Top-Level Project in July 2015. In August 2015, Hortonworks acquired Onyara. In September 2015, Hortonworks released HDF 1.0, powered by NiFi. The current version is HDF 1.2. HDF has the solid backing of Hortonworks.
  • 12. Terminology. FlowFile (Information Packet): the unit of data (each object) moving through the system; content plus attributes (key/value pairs). Processor (Black Box): performs the work and can access FlowFiles; currently there are 135 different processors. Connection (Bounded Buffer): links between processors; queues that can be dynamically prioritized. Process Group (Subnet): a set of processors and their connections; receives data via input ports and sends data via output ports. Flow Controller (Scheduler): maintains the knowledge of how processes are connected and manages the threads and their allocation. FlowFile structure: header << UUID, Name, Size, In time, Attribute Map >> plus content. FlowFile: types (events, files, objects, messages, etc.), formats (JSON, XML, Avro, text, proprietary, etc.), size (bytes to GBs). Processor: routing (context/content), transformation (enrich, filter, convert, split, aggregate, custom), mediation (push / pull), scheduling. Connection: queuing, back-pressure, expiration, prioritization.
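The FlowFile and Processor terms above can be illustrated with a small sketch. This is not NiFi's actual API (NiFi processors are written in Java); the `FlowFile` class, the `flowfile_id` field, and the `tag_json_processor` function are hypothetical names chosen only to show a header of attributes travelling alongside content bytes, and a processor that returns a new FlowFile instead of mutating its input.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class FlowFile:
    """Hypothetical model of a FlowFile: content bytes plus a header."""
    content: bytes
    attributes: Dict[str, str] = field(default_factory=dict)
    flowfile_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    entry_time: float = field(default_factory=time.time)

    @property
    def size(self) -> int:
        # Size is derived from the content, as in the slide's header fields.
        return len(self.content)


def with_attribute(ff: FlowFile, key: str, value: str) -> FlowFile:
    """Return a copy of ff with one attribute added; ff itself is untouched."""
    return FlowFile(
        content=ff.content,
        attributes={**ff.attributes, key: value},
        flowfile_id=ff.flowfile_id,
        entry_time=ff.entry_time,
    )


def tag_json_processor(ff: FlowFile) -> FlowFile:
    """Toy 'black box' processor: guesses a MIME type from the first byte."""
    looks_like_json = ff.content.lstrip().startswith((b"{", b"["))
    mime = "application/json" if looks_like_json else "text/plain"
    return with_attribute(ff, "mime.type", mime)
```

A processor in this sketch only sees a FlowFile coming in and hands a FlowFile back; how processors are wired together stays outside the function, which is the flow-based programming idea the slides describe.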
  • 13. Apache NiFi is not • A distributed computation engine • An engine to do CEP (Complex Event Processing) • A computational framework to do distributed joins or rolling window aggregations the way Spark/Storm/Flink do. • Hence it’s not based on MapReduce / Spark or any other framework. • NiFi doesn’t have any dependency on any big data tool like Hadoop or ZooKeeper. All it needs is Java. • It’s not a full-fledged ETL tool like Informatica / Pentaho / Talend / SSIS as of now. But it will be, eventually. • It’s not a long-term data storage tool. It only holds data temporarily for re-run / data provenance purposes. • It’s not a document indexer. Its indexing capabilities are only there to help in troubleshooting / debugging.
  • 14. Core Features What are the core features and benefits of Apache NiFi?
  • 15. Guaranteed Data Delivery • Even at very high scale, delivery is guaranteed • A persistent write-ahead log (FlowFile Repository) and data partitioning (Content Repository) ensure this. Together they are designed to allow: • Very high transaction rates • Effective load spreading • Copy-on-write scheme (for every change in data) • Pass-by-reference. Data Buffering with Back Pressure and Pressure Release • Supports buffering of all queued data. • Ability to apply back pressure (even if there is no load balancing, nodes can say “Back-Off” and other nodes in the pipeline pick up the slack). • When back pressure is applied to a connection, it causes the processor that is the source of the connection to stop being scheduled to run until the queue clears out. However, data will still queue up in that processor's incoming connections.
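The back-pressure behaviour described above can be sketched in a few lines. This is a simplified, single-threaded model under stated assumptions, not NiFi's scheduler: `Connection`, `backpressure_threshold`, and `run_source_once` are hypothetical names. The point it shows is that a full outgoing connection stops the source processor from running, while data keeps waiting in that processor's incoming connection.

```python
from collections import deque


class Connection:
    """Hypothetical bounded queue between two processors."""

    def __init__(self, backpressure_threshold: int):
        self.queue = deque()
        self.backpressure_threshold = backpressure_threshold

    def is_full(self) -> bool:
        # Back pressure applies once the queue reaches its threshold.
        return len(self.queue) >= self.backpressure_threshold


def run_source_once(incoming: Connection, outgoing: Connection) -> bool:
    """Try to schedule the source processor for one item.

    Returns False (processor not run) when the outgoing connection is
    applying back pressure or when there is nothing to process.
    """
    if outgoing.is_full() or not incoming.queue:
        return False
    outgoing.queue.append(incoming.queue.popleft())
    return True


# Usage: five items wait upstream, but the downstream queue holds only two.
incoming = Connection(backpressure_threshold=10)
incoming.queue.extend(range(5))
outgoing = Connection(backpressure_threshold=2)

results = [run_source_once(incoming, outgoing) for _ in range(5)]
# After two successful runs the source stops being scheduled;
# the remaining three items stay queued in `incoming`.
```

Once a downstream consumer removes an item from `outgoing`, the next call to `run_source_once` succeeds again, which corresponds to the "until the queue clears out" part of the slide.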
  • 16. Prioritized Queuing • NiFi allows the setting of one or more prioritization schemes for how data is retrieved from a queue. • Oldest first, newest first, largest first, smallest first, or a custom scheme • The default is oldest first. Designed for Extension • NiFi is highly extensible by design. • One can write a custom: Processor; Controller Service; Reporting Task; Prioritizer; User Interface • These extensions are bundled in NAR files (NiFi Archives).
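As a rough illustration of the prioritization schemes listed above, the sketch below implements a queue whose retrieval order depends on a pluggable key function. The names `PRIORITIZERS` and `PrioritizedQueue`, and the dictionary shape of each queued item, are assumptions made for this example and do not correspond to NiFi classes.

```python
import heapq
from itertools import count

# Each scheme maps a queued item to a sort key; the smallest key is served first.
PRIORITIZERS = {
    "oldest_first": lambda item: item["entry_time"],
    "newest_first": lambda item: -item["entry_time"],
    "largest_first": lambda item: -item["size"],
    "smallest_first": lambda item: item["size"],
}


class PrioritizedQueue:
    """Hypothetical connection queue with a selectable prioritization scheme."""

    def __init__(self, scheme: str = "oldest_first"):
        self._key = PRIORITIZERS[scheme]
        self._heap = []
        self._tie_breaker = count()  # keeps insertion order for equal keys

    def put(self, item: dict) -> None:
        heapq.heappush(self._heap, (self._key(item), next(self._tie_breaker), item))

    def get(self) -> dict:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


def drain(queue: PrioritizedQueue) -> list:
    """Return the names of all queued items in the order they are served."""
    return [queue.get()["name"] for _ in range(len(queue))]
```

A custom scheme in this model is just another entry in `PRIORITIZERS`, for example a key that reads a priority attribute from the item.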
  • 17. Visual Interface for Command and Control • Drag and drop processors to build a flow • Start, stop, and configure components in real time • View errors and corresponding error messages • View statistics and health of the data flow • Create templates of common processors & connections
  • 18. Data Provenance (Not Just Lineage) • View attributes and content at given points in time (before and after each processor) • Records, indexes, and makes events available for display
  • 19. Benefits of Apache NiFi: a single, data-source-agnostic collection platform; an intuitive, real-time visual user interface with drag-and-drop capabilities; powerful data security capabilities from source to storage; highly granular data sharing policies; the ability to react in real time by leveraging bi-directional data flows and prioritized data feeds; an extremely scalable, extensible platform.
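The idea of recording attributes before and after each processor can be sketched as follows. This is an in-memory toy under stated assumptions, not NiFi's Provenance Repository: `ProvenanceLog`, `run_with_provenance`, and the event dictionary layout are hypothetical names used only for illustration.

```python
import time
from typing import Callable, Dict, List, Tuple

Attributes = Dict[str, str]


class ProvenanceLog:
    """Hypothetical in-memory record of what each processor did to a FlowFile."""

    def __init__(self):
        self.events: List[dict] = []

    def record(self, flowfile_id: str, processor: str,
               before: Attributes, after: Attributes) -> None:
        self.events.append({
            "flowfile_id": flowfile_id,
            "processor": processor,
            "before": before,
            "after": after,
            "timestamp": time.time(),
        })

    def lineage(self, flowfile_id: str) -> List[dict]:
        """All recorded events for one FlowFile, in processing order."""
        return [e for e in self.events if e["flowfile_id"] == flowfile_id]


def run_with_provenance(
    log: ProvenanceLog,
    flowfile_id: str,
    attributes: Attributes,
    processors: List[Tuple[str, Callable[[Attributes], Attributes]]],
) -> Attributes:
    """Run processors in order, snapshotting attributes around each step."""
    for name, fn in processors:
        before = dict(attributes)
        attributes = fn(dict(attributes))  # each processor works on its own copy
        log.record(flowfile_id, name, before, dict(attributes))
    return attributes
```

Because every event keeps its own snapshot, a later query can show the state of a FlowFile at any step, which is the difference from plain lineage that the slide points out.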
  • 20. Architecture High level architecture (single machine), Primary components
  • 21. Single Node (runs within a JVM on the host). Web Server: hosts NiFi's HTTP-based command and control API. Flow Controller: the real brain; provides and manages threads; handles scheduling. Extensions: Processor / Controller Service / Reporting Task / UI / Prioritizer. FlowFile Repository: state about a given FlowFile that is presently active in the flow; a write-ahead log (WAL). Content Repository: the actual content bytes of a given FlowFile; blocks of data in the file system; can span more than one file system (partitions). Provenance Repository: where all provenance event data is stored; saved on the file system; indexed and searchable.
  • 23. Demo (Simple λ Architecture using NiFi) 1. Start NiFi (or HDF), plus ElasticSearch, Kibana and MongoDB on Docker. Create HDFS destination table. 2. Explore the NiFi UI 3. Pull data from Twitter 4. Route & deliver to ElasticSearch, Mongo and Hadoop in real time 5. Explore NiFi capabilities 6. Design a dashboard on real-time data. Diagram labels: Serving Layer, Speed Layer, Batch Layer, NiFi, Twitter, Hadoop, ElasticSearch, MongoDB, Spark SQL, Kibana, Docker, Query.
  • 26. Some Use Cases: building ingestion and delivery layers in IoT solutions; ingestion tier in a Lambda Architecture (for feeding both speed and batch layers); ingestion tier in Data Lake architectures; cross-geography data replication in a secure manner; integrating on-premise systems with cloud systems (Hybrid Cloud Architecture); simplifying existing Big Data architectures that currently use Flume, Kafka, Logstash, Scribe, etc. or custom connectors; developing edge nodes for trade repositories; enterprise data integration platform; and many more…
  • 27. Q & A

Editor's Notes

  1. Big Data Ingestion and Streaming

What is data ingestion?
Data ingestion is the process of obtaining, importing, and processing data for later use or storage in a database. This process often involves altering individual files by editing their content and/or formatting them to fit into a larger document. An effective data ingestion methodology begins by validating the individual files, then prioritizes the sources for optimum processing, and finally validates the results. When numerous data sources exist in diverse formats (the sources may number in the hundreds and the formats in the dozens), maintaining reasonable speed and efficiency can become a major challenge. To that end, several vendors offer programs tailored to the task of data ingestion in specific applications or environments.

Data Ingestion Frameworks
Amazon Kinesis – real-time processing of streaming data at massive scale.
Apache Chukwa – data collection system.
Apache Flume – service to manage large amounts of log data.
Apache Kafka – distributed publish-subscribe messaging system.
Apache Sqoop – tool to transfer data between Hadoop and a structured datastore.
Cloudera Morphlines – framework that helps ETL to Solr, HBase and HDFS.
Facebook Scribe – streamed log data aggregator.
Fluentd – tool to collect events and logs.
Google Photon – geographically distributed system for joining multiple continuously flowing streams of data in real time with high scalability and low latency.
Heka – open source stream processing software system.
HIHO – framework for connecting disparate data sources with Hadoop.
Kestrel – distributed message queue system.
LinkedIn Databus – stream of change capture events for a database.
LinkedIn Kamikaze – utility package for compressing sorted integer arrays.
LinkedIn White Elephant – log aggregator and dashboard.
Logstash – a tool for managing events and logs.
Netflix Suro – log aggregator like Storm and Samza based on Chukwa.
Pinterest Secor – a service implementing Kafka log persistence.
LinkedIn Gobblin – LinkedIn’s universal data ingestion framework.

Flume
A Flume agent is a Java virtual machine (JVM) process that hosts the components through which events flow. Each agent contains at a minimum a source, a channel and a sink. An agent can also run multiple sets of channels and sinks through a flow multiplexer that either replicates or selectively routes an event. Agents can cascade to form a multihop tiered collection topology until the final datastore is reached. A Flume event is a unit of data flow that contains a payload and an optional set of string attributes. An event is transmitted from its point of origination, normally called a client, to the source of an agent. When the source receives the event, it sends it to a channel that is a transient store for events within the agent. The associated sink can then remove the event from the channel and deliver it to the next agent or the event’s final destination.
Flume is a distributed system for collecting log data from many sources, aggregating it, and writing it to HDFS. It is designed to be reliable and highly available, while providing a simple, flexible, and intuitive programming model based on streaming data flows. Flume provides extensibility for online analytic applications that process data streams in situ. Flume and Chukwa share similar goals and features. However, there are some notable differences. Flume maintains a central list of ongoing data flows, stored redundantly in ZooKeeper. In contrast, Chukwa distributes this information more broadly among its services. Flume adopts a “hop-by-hop” model, while in Chukwa the agents on each machine are responsible for deciding what data to send.
Resources: Flume Tutorial How to Refine and Visualize Server Log Data How To Refine and Visualize Sentiment Data

Chukwa
Chukwa is a Hadoop subproject devoted to large-scale log collection and analysis.
Chukwa is built on top of the Hadoop distributed filesystem (HDFS) and MapReduce framework and inherits Hadoop’s scalability and robustness. Chukwa also includes a flexible and powerful toolkit for displaying monitoring and analysis results, in order to make the best use of this collected data. Log processing was one of the original purposes of MapReduce. Unfortunately, Hadoop is hard to use for this purpose. Writing MapReduce jobs to process logs is somewhat tedious, and the batch nature of MapReduce makes it difficult to use with logs that are generated incrementally across many machines. Furthermore, HDFS still does not support appending to existing files. Chukwa is a Hadoop subproject that bridges the gap between log handling and MapReduce. It provides a scalable distributed system for monitoring and analysis of log-based data. Some of the durability features include agent-side replaying of data to recover from errors.
Resources: Chukwa quick start guide

Sqoop
Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. It offers two-way replication with both snapshots and incremental updates. Sqoop provides a pluggable mechanism for optimal connectivity to external systems. The Sqoop extension API provides a convenient framework for building new connectors which can be dropped into Sqoop installations to provide connectivity to various systems. Sqoop itself comes bundled with various connectors that can be used for popular database and data warehousing systems.
Resources: Import from Microsoft SQL Server into the Hortonworks Sandbox using Sqoop

Kafka
Apache Kafka is a distributed publish-subscribe messaging system. It is designed to provide high-throughput persistent messaging that’s scalable and allows for parallel data loads into Hadoop.
Its features include the use of compression to optimize IO performance and mirroring to improve availability and scalability and to optimize performance in multiple-cluster scenarios.
Fast: A single Kafka broker can handle hundreds of megabytes of reads and writes per second from thousands of clients.
Scalable: Kafka is designed to allow a single cluster to serve as the central data backbone for a large organization. It can be elastically and transparently expanded without downtime. Data streams are partitioned and spread over a cluster of machines to allow data streams larger than the capability of any single machine and to allow clusters of coordinated consumers.
Durable: Messages are persisted on disk and replicated within the cluster to prevent data loss. Each broker can handle terabytes of messages without performance impact.
Distributed by Design: Kafka has a modern cluster-centric design that offers strong durability and fault-tolerance guarantees.
Resources: Kafka Tutorial Storm and Kafka Simulating and Transporting Real-Time Event

Storm
Hadoop is ideal for batch-mode processing over massive data sets, but it doesn’t support event-stream (a.k.a. message-stream) processing, i.e., responding to individual events within a reasonable time frame. (For limited scenarios, you could use a NoSQL database like HBase to capture incoming data in the form of append updates.) Storm is a general-purpose, event-processing system that is growing in popularity for addressing this gap in Hadoop. Like Hadoop, Storm uses a cluster of services for scalability and reliability. In Storm terminology you create a topology that runs continuously over a stream of incoming data, which is analogous to a Hadoop job that runs as a batch process over a fixed data set and then terminates. An apt analogy is a continuous stream of water flowing through plumbing. The data sources for the topology are called spouts and each processing node is called a bolt.
Bolts can perform arbitrarily sophisticated computations on the data, including output to data stores and other services. It is common for organizations to run a combination of Hadoop and Storm services to gain the best features of both platforms.

How Storm Works
A Storm cluster has three sets of nodes:
Nimbus node (master node, similar to the Hadoop JobTracker): uploads computations for execution; distributes code across the cluster; launches workers across the cluster; monitors computation and reallocates workers as needed.
ZooKeeper nodes – coordinate the Storm cluster.
Supervisor nodes – communicate with Nimbus through ZooKeeper, and start and stop workers according to signals from Nimbus.
Five key abstractions help to understand how Storm processes data:
Tuples – an ordered list of elements. For example, a “4-tuple” might be (7, 1, 3, 7).
Streams – an unbounded sequence of tuples.
Spouts – sources of streams in a computation (e.g. a Twitter API).
Bolts – process input streams and produce output streams. They can run functions; filter, aggregate, or join data; or talk to databases.
Topologies – the overall calculation, represented visually as a network of spouts and bolts.
Storm users define topologies for how to process the data when it comes streaming in from the spout. When the data comes in, it is processed and the results are passed into Hadoop.
Resources: Ingesting and processing Real-time events with Apache Storm Process real-time big data with Twitter Storm Real time Data Ingestion in HBase & Hive using Storm Bolt

Elasticsearch
Elasticsearch is a highly scalable open-source full-text search and analytics engine. It allows you to store, search, and analyze big volumes of data quickly and in near real time. It is generally used as the underlying engine/technology that powers applications that have complex search features and requirements.
Here are a few sample use cases that Elasticsearch could be used for:
You run an online web store where you allow your customers to search for products that you sell. In this case, you can use Elasticsearch to store your entire product catalog and inventory and provide search and autocomplete suggestions for them.
You want to collect log or transaction data and you want to analyze and mine this data to look for trends, statistics, summarizations, or anomalies. In this case, you can use Logstash (part of the Elasticsearch/Logstash/Kibana stack) to collect, aggregate, and parse your data, and then have Logstash feed this data into Elasticsearch. Once the data is in Elasticsearch, you can run searches and aggregations to mine any information that is of interest to you.
You run a price alerting platform which allows price-savvy customers to specify a rule like “I am interested in buying a specific electronic gadget and I want to be notified if the price of the gadget falls below $X from any vendor within the next month”. In this case you can scrape vendor prices, push them into Elasticsearch and use its reverse-search (Percolator) capability to match price movements against customer queries and eventually push the alerts out to the customer once matches are found.
You have analytics/business-intelligence needs and want to quickly investigate, analyze, visualize, and ask ad-hoc questions on a lot of data (think millions or billions of records). In this case, you can use Elasticsearch to store your data and then use Kibana (part of the Elasticsearch/Logstash/Kibana stack) to build custom dashboards that can visualize aspects of your data that are important to you. Additionally, you can use the Elasticsearch aggregations functionality to perform complex business intelligence queries against your data.
Resources: Exploring Elasticsearch Logstash + Elasticsearch + Kibana

Spring XD
Spring XD is a unified, distributed, and extensible service for data ingestion, real-time analytics, batch processing, and data export. Spring XD is an open source project, licensed under the Apache 2 License, whose goal is to tackle big data complexity. Much of the complexity in building real-world big data applications is related to integrating many disparate systems into one cohesive solution across a range of use cases. Common use cases encountered in creating a comprehensive big data solution are:
High-throughput distributed data ingestion from a variety of input sources into a big data store such as HDFS or Splunk.
Real-time analytics at ingestion time, e.g. gathering metrics and counting values.
Workflow management via batch jobs. The jobs combine interactions with standard enterprise systems (e.g. RDBMS) as well as Hadoop operations (e.g. MapReduce, HDFS, Pig, Hive or HBase).
High-throughput data export, e.g. from HDFS to an RDBMS or NoSQL database.
Resources: SpringXD Guide SpringXD Quick Start HDP + SpringXD

Solr
Apache Solr is a powerful search server which supports a REST-like API. Solr is powered by Lucene, which enables powerful matching capabilities like phrases, wildcards, joins, grouping and many more across various data types. It is highly optimized for high traffic using Apache ZooKeeper. Apache Solr comes with a wide set of features; a subset of the high-impact features is listed here:
Advanced full-text search capabilities.
Standards-based open interfaces – XML, JSON and HTTP.
Highly scalable and fault tolerant.
Supports both schema and schemaless configuration.
Faceted search and filtering.
Supports major languages like English, German, Chinese, Japanese, French and many more.
Rich document parsing.
Resources: Solr for beginners Solr Tutorial HDP + Solr
Additional Resources: Distcp Guide Real time Data Ingestion in HBase & Hive Elasticsearch Vs Solr