Apache NiFi
Crash Course Intro
Rafael Coss - @racoss
Hadoop Summit – Tokyo
Oct 2016
2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Agenda
Data Flow & Streaming Fundamentals
What is dataflow and what are the challenges?
Apache NiFi
Architecture
Lab
Data Flow & Streaming Fundamentals
Connected Data World
• Internet of Anything (IoAT)
– Wind Turbines, Oil Rigs, Cars
– Weather Stations, Smart Grids
– RFID Tags, Beacons, Wearables
• User Generated Content (Web & Mobile)
– Twitter, Facebook, Snapchat, YouTube
– Clickstream, Ads, User Engagement
– Payments: PayPal, Venmo
44ZB in 2020
Let’s Connect A to B
Producers A.K.A Things
Anything
AND
Everything
Internet!
Consumers
• User
• Storage
• System
• …More Things
What is Stream Processing?
Batch Processing
• Ability to process and analyze data at-rest (stored data)
• Request-based, bulk evaluation and short-lived processing
• Enabler for Retrospective, Reactive and On-demand Analytics
Stream Processing
• Ability to ingest, process and analyze data in-motion in real- or near-real-time
• Event or micro-batch driven, continuous evaluation and long-lived processing
• Enabler for real-time Prospective, Proactive and Predictive Analytics for Next Best
Action
Stream Processing + Batch Processing = All Data Analytics
real-time (now) historical (past)
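The batch/stream contrast can be sketched in a few lines of Python. This is a minimal illustration, not anything from NiFi itself: the readings and both function names are assumptions made up for the example. The batch function answers one request over stored data and exits; the streaming function stays alive and emits an updated answer per event.

```python
from typing import Iterable, Iterator, List


def batch_average(stored: List[float]) -> float:
    """Batch: request-based, evaluates the whole data set at rest, then exits."""
    return sum(stored) / len(stored)


def streaming_average(events: Iterable[float]) -> Iterator[float]:
    """Stream: long-lived, emits an updated result as each event arrives."""
    count = 0
    total = 0.0
    for value in events:
        count += 1
        total += value
        yield total / count


# Hypothetical sensor readings used for both styles.
readings = [10.0, 20.0, 30.0, 40.0]

historical = batch_average(readings)             # one answer about the past
live = list(streaming_average(iter(readings)))   # one answer per event, as it happens
```

The batch result only exists once all data has landed, while the streaming result is usable after the first event, which is what makes it suitable for "next best action" decisions.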
Modern Data Applications
Custom or Off the Shelf
Real-Time Cyber Security: protects systems with superior threat detection
Smart Manufacturing: dramatically improves yields by managing more variables in greater detail
Connected, Autonomous Cars: drive themselves and improve road safety
Future Farming: optimizing soil, seeds and equipment to measured conditions on each square foot
Automatic Recommendation Engines: match products to preferences in milliseconds
[Diagram labels: Data at Rest, Data in Motion, Actionable Intelligence, Modern Data Applications, Hortonworks DataFlow, Hortonworks Data Platform]
Simplistic View of DataFlows: Easy, Definitive
[Diagram labels: Acquire Data, Store Data, Process and Analyze Data, Dataflow]
The Unassuming Line: A Case Study
We’ve seen a few lines show up in the wild thus far
Internet! Inter- & Intra- connections in
our global courier enterprise
Spotlight: Arthur Lacôte, https://thenounproject.com/turo/
Dataflow Line Anatomy 101
Let’s dissect what this line typically represents
Fig 1. Lineus Worldwidewebus. Common Name: Internet!
[Diagram labels: Script or Application, Data, Disparate Transport Mechanisms, Data, Script or Application]
Dataflow Line Anatomy 201
Sometimes that transport is just more lines
Fig 1. Lineus Worldwidewebus. Common Name: Internet!
[Diagram labels: Script or Application, Data, Line Inception, Data, Script or Application]
Realistic View of Dataflows: Complex, Convoluted
[Diagram labels: Acquire Data (x4), Store Data (x5), Process and Analyze Data, Dataflow]
Streaming Architecture
Ingestion
Simple Event Processing
Engine
Stream Processing
Destination
Data Bus
High-Level Overview
IoT Edge
(single node)
IoT Edge
(single node)
IoT Devices
IoT Devices
NiFi Hub Data Broker
Column
DB
Data
Store
Live Dashboard
Data Center
(on premises/cloud)
HDFS/S3 HBase/Cassandra
Agenda
What is dataflow and what are the challenges?
Apache NiFi
Architecture
Live Demo
Community
Moving data effectively is hard
Standards: http://xkcd.com/927/
Why is moving data effectively hard?
• Standards
• Formats
• “Exactly Once” Delivery
• Protocols
• Veracity of Information
• Validity of Information
• Ensuring Security
• Overcoming Security
• Compliance
• Schemas
• Consumers Change
• Credential Management
• “That [person|team|group]”
• Network
• “Exactly Once” Delivery
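“Exactly Once” Delivery shows up twice in the list above, which is the point: a sender that retries will deliver duplicates. One common way to approximate exactly-once processing is at-least-once delivery plus a receiver that ignores message IDs it has already seen. The sketch below is a generic illustration under that assumption; the class name, message IDs, and payloads are hypothetical and not part of NiFi.

```python
from typing import List, Set, Tuple


class IdempotentReceiver:
    """Drops redelivered messages by remembering the IDs it has processed."""

    def __init__(self) -> None:
        self.seen: Set[str] = set()
        self.processed: List[str] = []

    def receive(self, message_id: str, payload: str) -> bool:
        """Return True if the payload was processed, False if it was a duplicate."""
        if message_id in self.seen:
            return False
        self.seen.add(message_id)
        self.processed.append(payload)
        return True


receiver = IdempotentReceiver()

# The sender retried "m1" because the first acknowledgement was lost.
deliveries: List[Tuple[str, str]] = [
    ("m1", "scan A"),
    ("m2", "scan B"),
    ("m1", "scan A"),
]
results = [receiver.receive(mid, body) for mid, body in deliveries]
```

The network delivered three messages, but only two were processed. The cost is that the receiver has to keep state about what it has seen.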
Let’s Connect Lots of As to Bs to As to Cs to Bs to Δs to Cs to ϕs
Let’s consider the needs of a courier service
Physical Store
Gateway
Server
Mobile Devices
Registers
Server Cluster
Distribution Center Core Data Center at HQ
Server Cluster
On Delivery Routes
Trucks Deliverers
Delivery Truck: Creative Stall, https://thenounproject.com/creativestall/
Deliverer: Rigo Peter, https://thenounproject.com/rigo/
Cash Register: Sergey Patutin, https://thenounproject.com/bdesign.by/
Hand Scanner: Eric Pearson, https://thenounproject.com/epearson001/
Great! I am collecting all this data! Let’s use it!
Finding our needles in the haystack
Physical Store
Gateway
Server
Mobile Devices
Registers
Server Cluster
Distribution Center
Kafka
Core Data Center at HQ
Server Cluster
Others
Storm / Spark /
Flink / Apex
Kafka
Storm / Spark / Flink / Apex
On Delivery Routes
Trucks Deliverers
Let’s Connect Lots of As to Bs to As to Cs to Bs to Δs to Cs to ϕs
Oh, that courier service is global
Agenda
What is dataflow and what are the challenges?
Apache NiFi
Architecture
Live Demo
Community
Capabilities/Gaps
Use cases collected from the field since last release (HDF 1.2)
Major business drivers behind the use case
Problems, challenges and major pain points
How does NiFi help solve the problems
What are the remaining gaps
Use Cases
Apache NiFi High Level Capabilities
• Web-based user interface
– Design, control, feedback & monitoring
• Highly configurable
– Loss tolerant vs guaranteed delivery
– Low latency vs high throughput
– Dynamic prioritization
– Flow can be modified at runtime
– Back pressure
• Data provenance
– Track dataflow from beginning to end
• Designed for extension
– Build your own processors
• Secure
– SSL, SSH, HTTPS, etc.
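Back pressure is easiest to see with a bounded queue between two steps. The sketch below is a simplified stand-in, not NiFi's API: the `Connection` class, its `back_pressure_threshold` parameter, and the event names are assumptions for illustration. Once the queue reaches its threshold, the upstream side is refused until the downstream side drains something.

```python
from collections import deque
from typing import Deque, Optional


class Connection:
    """A bounded queue between two processors (hypothetical, not NiFi's API)."""

    def __init__(self, back_pressure_threshold: int) -> None:
        self.threshold = back_pressure_threshold
        self.queue: Deque[str] = deque()

    def is_full(self) -> bool:
        return len(self.queue) >= self.threshold

    def offer(self, item: str) -> bool:
        """Upstream side: refuse new work while the threshold is reached."""
        if self.is_full():
            return False
        self.queue.append(item)
        return True

    def poll(self) -> Optional[str]:
        """Downstream side: take the oldest item, or None when empty."""
        return self.queue.popleft() if self.queue else None


conn = Connection(back_pressure_threshold=2)
accepted = [conn.offer(f"event-{i}") for i in range(3)]  # third offer is refused
first = conn.poll()                                      # consumer frees one slot
accepted_after_drain = conn.offer("event-3")             # producer may continue
```

The design choice is that a slow consumer slows the producer down instead of letting the queue grow until memory or disk runs out.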
Apache NiFi
Key Features
• Guaranteed delivery
• Data buffering
- Backpressure
- Pressure release
• Prioritized queuing
• Flow specific QoS
- Latency vs. throughput
- Loss tolerance
• Data provenance
• Supports push and pull models
• Recovery/recording a rolling log of fine-grained history
• Visual command and control
• Flow templates
• Pluggable/multi-role security
• Designed for extension
• Clustering
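Prioritized queuing from the feature list can be sketched with a heap: items leave the queue by priority first and by arrival order second. This is a generic illustration; the `PrioritizedQueue` class, the numeric priorities, and the item names are assumptions, and NiFi's own prioritizers are configured per connection rather than coded like this.

```python
import heapq
import itertools
from typing import List, Tuple


class PrioritizedQueue:
    """Lower priority number leaves first; ties keep arrival (FIFO) order."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, str]] = []
        self._counter = itertools.count()  # arrival sequence used as tie-breaker

    def put(self, priority: int, item: str) -> None:
        heapq.heappush(self._heap, (priority, next(self._counter), item))

    def get(self) -> str:
        return heapq.heappop(self._heap)[2]


q = PrioritizedQueue()
q.put(5, "routine telemetry")
q.put(1, "delivery exception alert")
q.put(5, "routine heartbeat")

order = [q.get() for _ in range(3)]
```

Here the alert overtakes the routine items even though it arrived second, and the two routine items keep their original order.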
Deeper Ecosystem Integration: 170+ Processors
HTTP, Syslog, Email, HTML, Image, Hash, Encrypt, Extract, Tail, Merge, Evaluate, Duplicate, Execute, Scan, GeoEnrich, Replace, Convert, Split, Translate, HL7, FTP, UDP, XML, SFTP, Route Content, Route Context, Route Text, Control Rate, Distribute Load, AMQP
Revisit: Courier service from the perspective of NiFi
Physical Store
Gateway
Server
Mobile Devices
Registers
Server Cluster
Distribution Center Core Data Center at HQ
Server Cluster
Trucks Deliverers
NiFi NiFi NiFi NiFi NiFi NiFi
On Delivery Routes
Courier service from the perspective of NiFi & MiNiFi
Physical Store
Gateway
Server
Mobile Devices
Registers
Server Cluster
Distribution Center Core Data Center at HQ
Server Cluster
Trucks Deliverers
Client
Libraries
Client
Libraries
MiNiFi
MiNiFi
NiFi NiFi NiFi NiFi NiFi NiFi
Client
Libraries
On Delivery Routes
Apache NiFi Subproject: MiNiFi
• Let me get the key parts of NiFi close to where data begins and provide bi-directional communication
• NiFi lives in the data center. Give it an enterprise server or a cluster of them.
• MiNiFi lives as close as possible to where data is born and is a guest on that device or system
Visual Command and Control
vs.
Design and Deploy
Apache NiFi Managed Dataflow
SOURCES
REGIONAL
INFRASTRUCTURE
CORE
INFRASTRUCTURE
NiFi is based on Flow Based Programming (FBP)
FBP Term | NiFi Term | Description
Information Packet | FlowFile | Each object moving through the system.
Black Box | FlowFile Processor | Performs the work, doing some combination of data routing, transformation, or mediation between systems.
Bounded Buffer | Connection | The linkage between processors, acting as queues and allowing various processes to interact at differing rates.
Scheduler | Flow Controller | Maintains the knowledge of how processes are connected, and manages the threads and allocations thereof which all processes use.
Subnet | Process Group | A set of processes and their connections, which can receive and send data via ports. A process group allows creation of an entirely new component simply by composition of its components.
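The FBP terms map onto a very small amount of code. The sketch below is a toy model written for this deck, not NiFi's Java API: the class names mirror the NiFi terms, but every method, the two example processors, and the scheduling loop are assumptions. A plain deque stands in for the bounded buffer.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, Dict, List


@dataclass
class FlowFile:
    """Information packet: attributes plus opaque content."""
    attributes: Dict[str, str]
    content: bytes


# Bounded buffer stand-in; a real connection would also enforce a size limit.
Connection = Deque[FlowFile]


@dataclass
class Processor:
    """Black box: takes one FlowFile from inbound, puts the result on outbound."""
    name: str
    work: Callable[[FlowFile], FlowFile]
    inbound: Connection
    outbound: Connection

    def on_trigger(self) -> bool:
        if not self.inbound:
            return False
        self.outbound.append(self.work(self.inbound.popleft()))
        return True


class FlowController:
    """Scheduler: knows the processors and keeps triggering them until idle."""

    def __init__(self, processors: List[Processor]) -> None:
        self.processors = processors

    def run_until_idle(self) -> None:
        progressed = True
        while progressed:
            # Build the full list so every processor is triggered on each pass.
            progressed = any([p.on_trigger() for p in self.processors])


def upper(ff: FlowFile) -> FlowFile:
    return FlowFile(ff.attributes, ff.content.upper())


def tag(ff: FlowFile) -> FlowFile:
    return FlowFile({**ff.attributes, "stage": "tagged"}, ff.content)


source: Connection = deque([FlowFile({"filename": "a.txt"}, b"hello")])
middle: Connection = deque()
sink: Connection = deque()

controller = FlowController([
    Processor("UpperCase", upper, source, middle),
    Processor("Tag", tag, middle, sink),
])
controller.run_until_idle()
```

Because processors only talk through connections, either one can be replaced or run at a different rate without the other knowing, which is the property the table's "Bounded Buffer" row describes.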
FlowFiles & Data Agnosticism
• NiFi is data agnostic!
• But NiFi was designed with the understanding that users can care about specifics, and it provides tooling to interact with specific formats, protocols, etc.
ISO 8601 - http://xkcd.com/1179/
Robustness principle
“Be conservative in what you do, be liberal in what you accept from others”
FlowFiles are like HTTP data
HTTP Data
Header
HTTP/1.1 200 OK
Date: Sun, 10 Oct 2010 23:26:07 GMT
Server: Apache/2.2.8 (CentOS) OpenSSL/0.9.8g
Last-Modified: Sun, 26 Sep 2010 22:04:35 GMT
ETag: "45b6-834-49130cc1182c0"
Accept-Ranges: bytes
Content-Length: 13
Connection: close
Content-Type: text/html
Content
Hello world!
FlowFile
Header: Standard FlowFile Attributes
Key: 'entryDate' Value: 'Fri Jun 17 17:15:04 EDT 2016'
Key: 'lineageStartDate' Value: 'Fri Jun 17 17:15:04 EDT 2016'
Key: 'fileSize' Value: '23609'
Header: FlowFile Attribute Map
Key: 'filename' Value: '15650246997242'
Key: 'path' Value: './'
Content
Binary Content *
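The header/content split is what makes data agnosticism practical: routing decisions can be made from the attributes alone, the way a proxy can route on HTTP headers without reading the body. The sketch below is a simplified model, not NiFi's API; the `route` function, the relationship names, and the sample attribute values are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class FlowFile:
    attributes: Dict[str, str]  # "header": key/value metadata
    content: bytes              # "body": opaque bytes, never inspected here


def route(flow_file: FlowFile) -> str:
    """Pick a relationship from attributes only; the content is not parsed."""
    if flow_file.attributes.get("filename", "").endswith(".csv"):
        return "tabular"
    return "unmatched"


# The content is arbitrary binary data; routing still works.
csv_file = FlowFile({"filename": "counts.csv", "path": "./"}, b"\x00\x01\x02\x03")
other_file = FlowFile({"filename": "15650246997242", "path": "./"}, b"\xff")

csv_route = route(csv_file)
other_route = route(other_file)
```

Only processors that actually care about a format need to open the content, so the rest of the flow keeps working when a new format shows up.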
The need for data provenance
For Operators
• Traceability, lineage
• Recovery and replay
For Compliance
• Audit trail
• Remediation
For Business / Mission
• Value sources
• Value IT investment
BEGIN
END
LINEAGE
Data Provenance: Improved Navigation and Clearer Interaction
• Tracks data at each point as it flows through the system
• Records, indexes, and makes events available for display
• Handles fan-in/fan-out, i.e. merging and splitting data
• View attributes and content at given points in time
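A provenance store is essentially an append-only event log plus parent links, so that a split (fan-out) can be traced back to its origin. The sketch below is a toy model, not NiFi's provenance repository: the classes, the ID scheme, and the event and component labels are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ProvenanceEvent:
    event_type: str           # illustrative labels such as "RECEIVE", "FORK", "SEND"
    flowfile_id: str
    parent_id: Optional[str]  # set when this FlowFile was split off another one
    component: str            # which step recorded the event


class ProvenanceLog:
    def __init__(self) -> None:
        self.events: List[ProvenanceEvent] = []

    def record(self, event_type: str, flowfile_id: str, component: str,
               parent_id: Optional[str] = None) -> None:
        self.events.append(ProvenanceEvent(event_type, flowfile_id, parent_id, component))

    def lineage(self, flowfile_id: str) -> List[str]:
        """Follow parent links back to the origin; return matching events oldest first."""
        parents = {e.flowfile_id: e.parent_id for e in self.events if e.parent_id}
        chain = [flowfile_id]
        while chain[-1] in parents:
            chain.append(parents[chain[-1]])
        wanted = set(chain)
        return [
            f"{e.event_type}:{e.flowfile_id}@{e.component}"
            for e in self.events
            if e.flowfile_id in wanted
        ]


log = ProvenanceLog()
log.record("RECEIVE", "ff-1", "ListenHTTP")
log.record("FORK", "ff-1a", "SplitText", parent_id="ff-1")
log.record("FORK", "ff-1b", "SplitText", parent_id="ff-1")
log.record("SEND", "ff-1b", "PutFile")

trail = log.lineage("ff-1b")
```

The trail for `ff-1b` includes the original receipt of `ff-1` but not its sibling `ff-1a`, which is the kind of answer an operator or auditor needs when asking where one piece of data came from.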
Agenda
What is dataflow and what are the challenges?
Apache NiFi
Architecture
Live Demo
Community
Zero-master Clustering
Framework
NiFi vs MiNiFi Java Processes
NiFi Framework
Components
MiNiFi
NiFi Framework
User Interface
Components
NiFi
MiNiFi
MiNiFi Agent
Java agent
• Java implementation
• Availability
– GA HDF 2.0 (built from scratch, ~ 10MB)
Native agent
• C++ implementation
• Availability
– TP HDF 2.0
– GA post HDF 2.0
• Resource efficient (focus on memory and disk)
MiNiFi Management
Near term (HDF 2.0)
• Design & deploy
– Push updates
– Config file driven/REST API access (MiNiFi API: post configurations and receive information, etc.)
Long term
• Centralized command and control
Why NiFi?
• Moving data is multifaceted in its challenges, and these are present in different contexts at varying scopes
– Think of our courier example and organizations like it: inter vs intra, domestically, internationally
• Provide common tooling and extensions that are commonly needed but be flexible for extension
– Leverage existing libraries and expansive Java ecosystem for functionality
– Allow organizations to integrate with their existing infrastructure
• Empower folks managing your infrastructure to make changes and reason about issues that are occurring
– Data Provenance to show context and data’s journey
– User Interface/Experience a key component
NiFi Traffic Patterns Demo
NiFi Traffic Patterns Lab
Smart Cities: Traffic Congestion
• Monitor:
• Public transportation vehicles
• Pedestrian levels
• Optimize public transit duration and walking routes
Our Lab for Today
• We will be exploring some examples to work through creating a dataflow with Apache NiFi
• Use Case: An urban planning board is evaluating the need for a new highway, dependent on current traffic patterns, particularly as other roadwork initiatives are under way. Integrating live data poses a problem because traffic analysis has traditionally been done using historical, aggregated traffic counts. To improve traffic analysis, the city planner wants to leverage real-time data to get a deeper understanding of traffic patterns. NiFi was selected for this real-time data integration.
• Labs are available at http://tinyurl.com/nificrashcourse
Getting Started Resources
Connected Data Architecture with HDC for AWS
CLOUD
Ideal Use Cases:
Data Science and Exploration (Spark, Zeppelin)
ETL and Data Preparation (Hive, Spark)
Analytics and Reporting (Hive2 w/LLAP, Zeppelin)
Cloud Data Processing (HDC for AWS)
Technical Preview
hortonworks.github.io/hdp-aws
Learn more and join us!
Apache NiFi site
http://nifi.apache.org
Subproject MiNiFi site
http://nifi.apache.org/minifi/
Subscribe to and collaborate at
dev@nifi.apache.org
users@nifi.apache.org
Submit Ideas or Issues
https://issues.apache.org/jira/browse/NIFI
Follow us on Twitter
@apachenifi
Big Data Tutorials
• Get Started
– hortonworks.com/tutorials
– Apache Hadoop & Ecosystem
• tinyurl.com/hello-hdp
– Apache Spark
• tinyurl.com/hwx-spark-intro
– Apache NiFi
• tinyurl.com/nifi-intro
– Use Case
• IoT
• Social Media
Hortonworks Nourishes the Community
Hortonworks Community Connection
Hortonworks Partnerworks
Want to continue the technical introduction?
• Hadoop Summit Crash Courses
– Replays
– Free
• hadoopsummit.org/san-jose/agenda
– Apache Hadoop
– Apache Spark
– Apache NiFi
– IoT & Streaming
– Data Science
Questions?
rafael@hortonworks.com
@racoss
Thank you!
Agenda
What is dataflow and what are the challenges?
Apache NiFi
Architecture
Demo
Community
Brief history of the Apache NiFi Community
Matured at NSA 2006-2014
• Contributors from Government and several commercial industries
• Releases on a 6-8 week schedule
Timeline
• 2006: Code developed at NSA
• November 2014: Code available open source, ASL v2
• July 2015: Achieved TLP status in just 7 months
• Today
MiNiFi Prospective Plans - Centralized Command and Control
• Design at a centralized place, deploy on the edge
– Flow deployment
– NAR deployment
– Agent deployment
• Version control of flows
• Agent status monitoring
• Bi-directional command and control
Centralized management console with a UI

Editor's Notes

  • #13 In reality, dataflows move all over. Data is moved and stored in multiple places, sometimes interim, sometimes long-term. Data is processed in different places, and then moved again. Complicated, convoluted, messy.
  • #14 Kafka: reads events in memory and writes them to a distributed log
  • #15 NiFi: simple event processing Spark: complex event processing Build predictive model from Historical insights. Deploy predictive model for real-time insights.
  • #28 Can put NiFi on a gateway server, but probably don’t want to mess with a UI on every single one. Maybe not the best fit.
  • #29 Let me get the key parts of NiFi close to where data begins and provide bidirectional communication. NiFi lives in the data center. Give it an enterprise server or a cluster of them. MiNiFi lives close to where data is born and may be a guest on that device or system.
  • #40 Framework – put a new wrapper on the framework, or in maven terms, we kept the underlying modules and wrote minifi-framework-core replacing nifi-framework-core. Talking about MiNiFi-Java; a Cpp version also exists.
  • #42 Initiates with ./bin/nifi.sh start
  • #43 As a user, you only need bootstrap and config.yml; nifi.properties and flow.xml are implementation details
  • #47 Smart Cities Monitor: Public transportation vehicles Pedestrian levels Optimize public transit duration and walking routes Source: http://www.libelium.com/resources/top_50_iot_sensor_applications_ranking/
  • #59 Think NiFi Process Groups but MiNiFi Agents