Hadoop Summit Tokyo Apache NiFi Crash Course

Apache NiFi
Crash Course Intro
Rafael Coss - @racoss
Hadoop Summit – Tokyo
Oct 2016

2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Agenda
Data Flow & Streaming Fundamentals
What is dataflow and what are the challenges?
Apache NiFi
Architecture
Lab

Data Flow & Streaming Fundamentals

Connected Data World
 Internet of Anything (IoAT)
– Wind Turbines, Oil Rigs, Cars
– Weather Stations, Smart Grids
– RFID Tags, Beacons, Wearables
 User Generated Content (Web & Mobile)
– Twitter, Facebook, Snapchat, YouTube
– Clickstream, Ads, User Engagement
– Payments: Paypal, Venmo
44ZB in 2020

Let’s Connect A to B
Producers A.K.A Things
Anything
AND
Everything
Internet!
Consumers
• User
• Storage
• System
• …More Things

What is Stream Processing?
Batch Processing
• Ability to process and analyze data at-rest (stored data)
• Request-based, bulk evaluation and short-lived processing
• Enabler for Retrospective, Reactive and On-demand Analytics
Stream Processing
• Ability to ingest, process and analyze data in-motion in real- or near-real-time
• Event or micro-batch driven, continuous evaluation and long-lived processing
• Enabler for real-time Prospective, Proactive and Predictive Analytics for Next Best
Action
Stream Processing + Batch Processing = All Data Analytics
real-time (now) historical (past)

Modern Data Applications
Custom or Off the Shelf
Real-Time Cyber Security
protects systems with superior threat
detection
Smart Manufacturing
dramatically improves yields by managing
more variables in greater detail
Connected, Autonomous Cars
drive themselves and improve road safety
Future Farming
optimizing soil, seeds and equipment to
measured conditions on each square foot
Automatic Recommendation Engines
match products to preferences in milliseconds
DATA AT
REST
DATA IN
MOTION
ACTIONABLE
INTELLIGENCE
Modern Data Applications
Hortonworks
DataFlow
Hortonworks
Data Platform

Store Data
Process and Analyze
Data
Acquire Data
Simplistic View of DataFlows: Easy, Definitive
Dataflow

The Unassuming Line: A Case Study
We’ve seen a few lines show up in the wild thus far
Internet! Inter- & Intra- connections in
our global courier enterprise
Spotlight: Arthur Lacôte, https://thenounproject.com/turo/

Dataflow Line Anatomy 101
Let’s dissect what this line typically represents
Fig 1. Lineus Worldwidewebus. Common Name: Internet!
Script or
Application
Script or
Application
Data Data
Disparate Transport
Mechanisms

Dataflow Line Anatomy 201
Sometimes that transport is just more lines
Fig 1. Lineus Worldwidewebus. Common Name: Internet!
Script or
Application
Script or
Application
Line Inception
Data Data

Realistic View of Dataflows: Complex, Convoluted
Store Data
Process and Analyze
Data
Acquire Data
Store DataStore Data
Store Data
Store Data
Acquire Data
Acquire Data
Acquire Data
Dataflow

Streaming Architecture
Ingestion
Simple Event Processing
Engine
Stream Processing
DestinationData Bus

High-Level Overview
IoT Edge
(single node)
IoT Edge
(single node)
IoT Devices
IoT Devices
NiFi Hub Data Broker
Column
DB
Data
Store
Live Dashboard
Data Center
(on premises/cloud)
HDFS/S3 HBase/Cassandra

Agenda
Apache NiFi
Architecture
Live Demo
Community

Moving data effectively is hard
Standards: http://xkcd.com/927/

Why is moving data effectively hard?
 Standards
 Formats
 “Exactly Once” Delivery
 Protocols
 Veracity of Information
 Validity of Information
 Ensuring Security
 Overcoming Security
 Compliance
 Schemas
 Consumers Change
 Credential Management
 “That [person|team|group]”
 Network
 “Exactly Once” Delivery

Let’s Connect Lots of As to Bs to As to Cs to Bs to Δs to Cs to ϕs
Let’s consider the needs of a courier service
Physical Store
Gateway
Server
Mobile Devices
Registers
Server Cluster
Distribution Center Core Data Center at HQ
Server Cluster
On Delivery Routes
Trucks Deliverers
Delivery Truck: Creative Stall, https://thenounproject.com/creativestall/
Deliverer: Rigo Peter, https://thenounproject.com/rigo/
Cash Register: Sergey Patutin, https://thenounproject.com/bdesign.by/
Hand Scanner: Eric Pearson, https://thenounproject.com/epearson001/

Great! I am collecting all this data! Let’s use it!
Finding our needles in the haystack
Physical Store
Gateway
Server
Mobile Devices
Registers
Server Cluster
Distribution Center
Kafka
Core Data Center at HQ
Server Cluster
Others
Storm / Spark /
Flink / Apex
Kafka
Storm / Spark / Flink / Apex
On Delivery Routes
Trucks Deliverers

Let’s Connect Lots of As to Bs to As to Cs to Bs to Δs to Cs to ϕs
Oh, that courier service is global

Agenda
Apache NiFi
Architecture
Live Demo
Community

Capabilities
/Gaps
Use cases collected from the field since last release (HDF 1.2)
Major business drivers behind the use case
Problems, challenges and major pain points
How does NiFi help solve the problems
What are the remaining gaps
Use Cases

Apache NiFi High Level Capabilities
 Web-based user interface
– Design, control, feedback & monitoring
 Highly configurable
– Loss tolerant vs guaranteed delivery
– Low latency vs high throughput
– Dynamic prioritization
– Flow can be modified at runtime
– Back pressure
 Data provenance
– Track dataflow from beginning to end
 Designed for extension
– Build your own processors
 Secure
– SSL, SSH, HTTPS, etc.

Apache NiFi
Key Features
• Guaranteed delivery
• Data buffering
- Backpressure
- Pressure release
• Prioritized queuing
• Flow specific QoS
- Latency vs. throughput
- Loss tolerance
• Data provenance
• Supports push and pull
models
• Recovery/recording
a rolling log of fine-
grained history
• Visual command and
control
• Flow templates
• Pluggable/multi-role
security
• Designed for extension
• Clustering

Deeper Ecosystem Integration: 170+ Processors
HTTP
Syslog
Email
HTML
Image
Hash Encrypt
Extract
TailMerge
Evaluate
Duplicate Execute
Scan
GeoEnrich
Replace
ConvertSplit
Translate
HL7
FTP
UDP
XML
SFTP
Route Content
Route Context
Route Text
Control Rate
Distribute Load
AMQP

Revisit: Courier service from the perspective of NiFi
Physical Store
Gateway
Server
Mobile Devices
Registers
Server Cluster
Server Cluster
Trucks Deliverers
NiFi NiFi NiFi NiFi NiFi NiFi
On Delivery Routes

Courier service from the perspective of NiFi & MiNiFi
Physical Store
Gateway
Server
Mobile Devices
Registers
Server Cluster
Server Cluster
Trucks Deliverers
Client
Libraries
Client
Libraries
MiNiFi
MiNiFi
NiFi NiFi NiFi NiFi NiFi NiFi
Client
Libraries
On Delivery Routes

Apache NiFi Subproject: MiNiFi
 Let me get the key parts of NiFi close to where data begins and provide bi-directional
communication
 NiFi lives in the data center. Give it an enterprise server or a cluster of them.
 MiNiFi lives as close to where data is born and is a guest on that device or system

Visual Command and Control
vs.
Design and Deploy

Apache NiFi Managed Dataflow
SOURCES
REGIONAL
INFRASTRUCTURE
CORE
INFRASTRUCTURE

NiFi is based on Flow Based Programming (FBP)
FBP Term NiFi Term Description
Information
Packet
FlowFile Each object moving through the system.
Black Box FlowFile
Processor
Performs the work, doing some combination of data routing, transformation,
or mediation between systems.
Bounded
Buffer
Connection The linkage between processors, acting as queues and allowing various
processes to interact at differing rates.
Scheduler Flow
Controller
Maintains the knowledge of how processes are connected, and manages the
threads and allocations thereof which all processes use.
Subnet Process
Group
A set of processes and their connections, which can receive and send data via
ports. A process group allows creation of entirely new component simply by
composition of its components.

FlowFiles & Data Agnosticism
 NiFi is data agnostic!
 But, NiFi was designed understanding that users
can care about specifics and provides tooling
to interact with specific formats, protocols, etc.
ISO 8601 - http://xkcd.com/1179/
Robustness principle
Be conservative in what you do,
be liberal in what you accept from others“

FlowFiles are like HTTP data
HTTP Data FlowFile
HTTP/1.1 200 OK
Date: Sun, 10 Oct 2010 23:26:07 GMT
Server: Apache/2.2.8 (CentOS) OpenSSL/0.9.8g
Last-Modified: Sun, 26 Sep 2010 22:04:35 GMT
ETag: "45b6-834-49130cc1182c0"
Accept-Ranges: bytes
Content-Length: 13
Connection: close
Content-Type: text/html
Hello world!
Standard FlowFile Attributes
Key: 'entryDate’ Value: 'Fri Jun 17 17:15:04 EDT 2016'
Key: 'lineageStartDate’ Value: 'Fri Jun 17 17:15:04 EDT 2016'
Key: 'fileSize’ Value: '23609'
FlowFile Attribute Map Content
Key: 'filename’ Value: '15650246997242'
Key: 'path’ Value: './’
Binary Content *
Header
Content

The need for data provenance
For Operators
• Traceability, lineage
• Recovery and replay
For Compliance
• Audit trail
• Remediation
For Business / Mission
• Value sources
• Value IT investment
BEGIN
END
LINEAGE

Data Provenance– Improved Navigation and Clearer Interaction
• Tracks data at each point as it flows
through the system
• Records, indexes, and makes events
available for display
• Handles fan-in/fan-out, i.e. merging
and splitting data
• View attributes and content at
given points in time

Agenda
Apache NiFi
Architecture
Live Demo
Community

Zero-master Clustering
Framework

NiFi vs MiNiFi Java Processes
NiFi Framework
Components
MiNiFi
NiFi Framework
User Interface
Components
NiFi

MiNiFi
Java agent
 Java implementation
 Availability
– GA HDF 2.0 (built from scratch, ~ 10MB)
Native agent
 C++ implementation
 Availability
– TP HDF 2.0
– GA post HDF 2.0
 Resource efficient (focus on memory and disk)
Near term (HDF 2.0)
 Design & deploy
– Push updates
– Config file driven/REST API (MiNiFi API – post
configurations and receive information, etc.) access
Long term
 Centralized command and control
MiNiFi Agent MiNiFi Management

Why NiFi?
 Moving data is multifaceted in its challenges and these are present in different contexts
at varying scopes
– Think of our courier example and organizations like it: inter vs intra, domestically, internationally
 Provide common tooling and extensions that are commonly needed but be flexible for
extension
– Leverage existing libraries and expansive Java ecosystem for functionality
– Allow organizations to integrate with their existing infrastructure
 Empower folks managing your infrastructure to make changes and reason about issues
that are occurring
– Data Provenance to show context and data’s journey
– User Interface/Experience a key component

Smart Cities: Traffic Congestion
 Monitor:
 Public transportation vehicles
 Pedestrian levels
 Optimize public transit duration
and walking routes

Our Lab for Today
 We will be exploring some examples to work through creating a dataflow with Apache
NiFi
 Use Case: An urban planning board is evaluating the need for a new highway,
dependent on current traffic patterns, particularly as other roadwork initiatives are
under way. Integrating live data poses a problem because traffic analysis has
traditionally been done using historical, aggregated traffic counts. To improve traffic
analysis, the city planner wants to leverage real-time data to get a deeper understanding
of traffic patterns. NiFi was selected for for this real-time data integration.
 Labs are available at http://tinyurl.com/nificrashcourse

Connected Data Architecture with HDC for AWS
C L O U D
Ideal Use Cases:
Data Science and Exploration
(Spark, Zeppelin)
ETL and Data Preparation
(Hive, Spark)
Analytics and Reporting
(Hive2 w/LLAP, Zeppelin)
Cloud Data
Processing
(HDC for AWS)
Technical Preview
hortonworks.github.io/hdp-aws

Learn more and join us!
Apache NiFi site
http://nifi.apache.org
Subproject MiNiFi site
http://nifi.apache.org/minifi/
Subscribe to and collaborate at
dev@nifi.apache.org
users@nifi.apache.org
Submit Ideas or Issues
https://issues.apache.org/jira/browse/NIFI
Follow us on Twitter
@apachenifi

Big Data Tutorials
 Get Started
– hortonworks.com/tutorials
– Apache Hadoop & Ecosystem
• tinyurl.com/hello-hdp
– Apache Spark
• tinyurl.com/hwx-spark-intro
– Apache NiFi
• tinyurl.com/nifi-intro
– Use Case
• IoT
• Social Media

Hortonworks Nourishes the Community
H ORTONWOR KS
COMMUNITY CONNEC TION
HORTONWOR KS
PARTNE RWOR KS

Want to continue the technical Introduction?
 Hadoop Summit Crash Courses
– Replays
– Free
 hadoopsummit.org/san-jose/agenda
– Apache Hadoop
– Apache Spark
– Apache NiFi
– IoT & Streaming
– Data Science

Questions?
rafael@hortonworks.com
@racoss

Thank you!

Agenda
Apache NiFi
Architecture
Demo
Community

Matured at NSA 2006-2014
Brief history of the Apache NiFi Community
• Contributors from Government and several commercial industries
• Releases on a 6-8 week schedule
Code developed
at NSA
2006
Today
Achieved TLP
status in just
7 months
July 2015
Code available
open source
ASL v2
November 2014

MiNiFi Prospective Plans - Centralized Command and Control
 Design at a centralized place, deploy on the edge
– Flow deployment
– NAR deployment
– Agent deployment
 Version control of flows
 Agent status monitoring
 Bi-directional command and control
Centralized management console with a UI

Hadoop Summit Tokyo Apache NiFi Crash Course

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (7)

Similar to Hadoop Summit Tokyo Apache NiFi Crash Course

Similar to Hadoop Summit Tokyo Apache NiFi Crash Course (20)

More from DataWorks Summit/Hadoop Summit

More from DataWorks Summit/Hadoop Summit (20)

Recently uploaded

Recently uploaded (20)

Hadoop Summit Tokyo Apache NiFi Crash Course

Editor's Notes