Beyond Messaging
Enterprise Dataflow powered by Apache NiFi
© Hortonworks Inc. 2011 – 2015. All Rights Reserved
Aldrin Piri
3 November 2015
About me
Senior Member of Technical Staff
Project Management Committee and Committer
@aldrinpiri
Simplistic View of Enterprise Data Flow
The Data Flow Thing
Process and Analyze Data
Acquire Data
Store Data
Realistic View of Data Flow
Global interactions with customers, business partners, and things spanning different volume, velocity, bandwidth, and latency needs
Meeting Edge Requirements
GATHER
DELIVER
PRIORITIZE
Track from the edge through to the datacenter
Small Footprints operate with very little power
Limited Bandwidth can create high latency
Data Availability exceeds transmission bandwidth
Data Must Be Secured throughout its journey
Where do we find data flow?
• Remote sensor delivery (Internet of Things - IoT)
• Intra-site / Inter-site / global distribution (Enterprise)
• Ingest for driving analytics (Big Data)
• Data Processing (Simple Event Processing)
Basics of Connecting Systems
For every connection, these must agree:
1. Protocol
2. Format
3. Schema
4. Priority
5. Size of event
6. Frequency of event
7. Authorization access
8. Relevance
P1: Producer
C1: Consumer
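The eight points of agreement listed above can be sketched as a simple comparison between what a producer offers and what a consumer expects. This is only an illustration of the idea: the Endpoint class, its field names, and every value below are hypothetical and do not come from the talk or from any NiFi API.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class Endpoint:
    """What one side of a connection offers or expects (illustrative values only)."""
    protocol: str
    data_format: str
    schema: str
    priority: str
    event_size: str
    event_frequency: str
    authorization: str
    relevance: str


def mismatches(producer: Endpoint, consumer: Endpoint) -> List[str]:
    """Return the names of the properties on which the two sides disagree."""
    return [
        f.name
        for f in fields(Endpoint)
        if getattr(producer, f.name) != getattr(consumer, f.name)
    ]


# Hypothetical producer P1 and consumer C1.
p1 = Endpoint("HTTP", "JSON", "sensor-v1", "newest-first",
              "small", "per-second", "site-level", "all-readings")
c1 = Endpoint("SFTP", "CSV", "sensor-v1", "newest-first",
              "batch", "hourly", "site-level", "temperature-only")

# Every name printed here is a gap that something between P1 and C1 must bridge.
print(mismatches(p1, c1))
```

Each property reported as a mismatch is work that has to happen somewhere between the two systems, which is the gap the rest of the talk is concerned with.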
Challenges of dataflow in the enterprise
• Messaging addresses only a small subset of the problem space
• Needed to understand the big picture
• Needed the ability to make immediate changes
• Must maintain chain of custody for data
• Rigorous security and compliance requirements
Messaging Systems as Dataflow
Great options exist, including:
• Kafka
• ActiveMQ
• Tibco
Let us consider the perfect messaging system for this talk:
• It has zero latency
• It has perfect data durability
• It supports unlimited consumers and producers
Dataflow as a messaging problem
“But my system needs…”
• A different format and/or schema
• To use a different protocol
• The highest priority information first
• Large objects (event batches) / Small Objects (streams)
• Authorization to the data level
• Only a subset of the data on a topic
• Data to be enriched/sanitized before it arrives
Using Messaging
Only a subset agree using messaging
1. Protocol
2. Format
3. Schema
4. Priority
5. Size of event
6. Frequency of event
7. Authorization access
8. Relevance
Diagram: P1, Messaging, C1, CN
More issues to consider:
• How do you know what the data flow looks like?
• How is it managed?
• How is it working – today, yesterday?
How these issues are typically solved
• Add new systems to handle the protocol differences
• Add new systems to convert the data
• Add new systems to reorder the data
• Add new systems to filter the unauthorized data
• Add new topics to represent ‘stages of the flow’
Which leads to latency, complexity, and limited retention
Ultimately, the operations teams who handle data at flow boundaries become responsible for managing these additions.
Real-time Data Flow
It’s not just how quickly you move data; it’s about how quickly you can change behavior and seize new opportunities
Introducing Apache NiFi
• Guaranteed delivery
• Data buffering
- Backpressure
- Pressure release
• Prioritized queuing
• Flow specific QoS
- Latency vs. throughput
- Loss tolerance
• Data provenance
• Recovery/recording a rolling log of fine-grained history
• Visual command and control
• Flow templates
• Pluggable/multi-role security
• Designed for extension
• Clustering
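Three of the features above (data buffering with backpressure, pressure release, and prioritized queuing) can be illustrated with a small bounded priority queue. This is a minimal sketch, not NiFi's implementation: the Connection class, its method names, and the thresholds are all hypothetical, and pressure release is modeled here as dropping items that have aged past a limit, which is an assumption about what the slide means.

```python
import heapq
import itertools


class Connection:
    """A bounded, prioritized queue between two processing steps (illustrative only)."""

    def __init__(self, max_items, max_age_seconds):
        self.max_items = max_items              # backpressure threshold
        self.max_age_seconds = max_age_seconds  # pressure-release threshold
        self._heap = []
        self._counter = itertools.count()       # tie-breaker keeps insertion order

    def offer(self, item, priority, now):
        """Try to enqueue; return False to signal backpressure to the producer."""
        self._release_pressure(now)
        if len(self._heap) >= self.max_items:
            return False
        heapq.heappush(self._heap, (priority, next(self._counter), now, item))
        return True

    def poll(self, now):
        """Return the highest-priority (lowest number) item that has not aged out, or None."""
        self._release_pressure(now)
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[3]

    def _release_pressure(self, now):
        """Drop items that have waited longer than max_age_seconds."""
        kept = [entry for entry in self._heap if now - entry[2] <= self.max_age_seconds]
        if len(kept) != len(self._heap):
            heapq.heapify(kept)
            self._heap = kept


conn = Connection(max_items=2, max_age_seconds=10)
print(conn.offer("routine reading", priority=5, now=0))  # accepted
print(conn.offer("alarm", priority=1, now=1))            # accepted
print(conn.offer("another reading", priority=5, now=2))  # refused: queue is full
print(conn.poll(now=3))                                  # the alarm comes out first
```

Timestamps are passed in explicitly rather than read from a clock so that the behavior is easy to follow; a real system would choose thresholds per flow, which is what the slide's "flow specific QoS" bullet points at.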
A Brief History
2006
NiagaraFiles (NiFi) was first conceived by Joe Witt at the National Security Agency (NSA)
November 2014
NiFi is donated to the Apache Software Foundation (ASF) through NSA’s Technology Transfer Program and enters ASF’s incubator.
July 2015
NiFi reaches ASF top-level project status
Flow Based Programming (FBP)
FBP Term | NiFi Term | Description
Information Packet | FlowFile | Each object moving through the system.
Black Box | FlowFile Processor | Performs the work, doing some combination of data routing, transformation, or mediation between systems.
Bounded Buffer | Connection | The linkage between processors, acting as queues and allowing various processes to interact at differing rates.
Scheduler | Flow Controller | Maintains the knowledge of how processes are connected, and manages the threads and allocations thereof which all processes use.
Subnet | Process Group | A set of processes and their connections, which can receive and send data via ports. A process group allows creation of an entirely new component simply by composition of its components.
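The Flow Based Programming terms in the table can be made concrete with a toy model. The sketch below is not NiFi code and uses none of its APIs: the class names simply mirror the NiFi terms, and the two processors (DropEmpty, ToUpper) are hypothetical examples of routing and transformation. Process groups are left out to keep it short.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class FlowFile:
    """FBP 'information packet': content plus key/value attributes."""
    content: bytes
    attributes: dict = field(default_factory=dict)


class Connection:
    """FBP 'bounded buffer': a queue linking two processors."""

    def __init__(self, capacity=100):
        self.capacity = capacity
        self.queue = deque()

    def is_full(self):
        return len(self.queue) >= self.capacity


class Processor:
    """FBP 'black box': takes one FlowFile, returns zero or more FlowFiles."""

    def __init__(self, name, work, inbound, outbound):
        self.name = name
        self.work = work
        self.inbound = inbound
        self.outbound = outbound

    def run_once(self):
        """Process one queued FlowFile; return True if any work was done."""
        if not self.inbound.queue or self.outbound.is_full():
            return False  # nothing to do, or downstream is applying backpressure
        flow_file = self.inbound.queue.popleft()
        self.outbound.queue.extend(self.work(flow_file))
        return True


class FlowController:
    """FBP 'scheduler': knows the processors and decides when each one runs."""

    def __init__(self, processors):
        self.processors = processors

    def run_until_idle(self):
        # Give every processor a turn per round; stop when a round does no work.
        while any([p.run_once() for p in self.processors]):
            pass


def drop_empty(flow_file):
    """Routing/filtering: discard FlowFiles with no content."""
    return [flow_file] if flow_file.content else []


def to_upper(flow_file):
    """Transformation: upper-case the content and record that it was changed."""
    attributes = dict(flow_file.attributes, transformed="true")
    return [FlowFile(flow_file.content.upper(), attributes)]


source, middle, sink = Connection(), Connection(), Connection()
source.queue.extend([FlowFile(b"hello"), FlowFile(b""), FlowFile(b"nifi")])

controller = FlowController([
    Processor("DropEmpty", drop_empty, source, middle),
    Processor("ToUpper", to_upper, middle, sink),
])
controller.run_until_idle()
print([ff.content for ff in sink.queue])
```

Because each processor only sees its inbound and outbound connections, processors can be rearranged or replaced without touching one another, which is the property the FBP model is meant to provide.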
Architecture
Master, the NiFi Cluster Manager (NCM): OS/Host, JVM, Web Server, NiFi Cluster Manager – Request Replicator
Slaves, the NiFi Nodes (each): OS/Host, JVM, Web Server, Flow Controller (Processor 1 through Extension N), FlowFile Repository, Content Repository, Provenance Repository, Local Storage
Live Demonstration
Learn more and join us!
Apache NiFi site
http://nifi.apache.org
Subscribe to and collaborate at
dev@nifi.apache.org
users@nifi.apache.org
Submit Ideas or Issues
https://issues.apache.org/jira/browse/NIFI
Follow us on Twitter
@apachenifi
Thank you!

BigData Techcon - Beyond Messaging with Apache NiFi


Editor's Notes

  • #15  ----- Meeting Notes (18Sep15 13:08) ----- Take a pause part way through.
  • #17 Introduce Flow Based Programming fundamentals, why they matter, and how NiFi adopts them