Hortonworks DataFlow & Apache NiFi @Oslo Hadoop Big Data

Hortonworks DataFlow
Enterprise Data Flow powered by Apache NiFi
Mats Johansson
Solutions Engineer - EMEA
© Hortonworks Inc. 2011 – 2015. All Rights Reserved

Disclaimer
This document may contain product features and technology directions that are under
development, may be under development in the future or may ultimately not be
developed.
Project capabilities are based on information that is publicly available within the Apache
Software Foundation project websites ("Apache"). Progress of the project capabilities
can be tracked from inception to release through Apache, however, technical feasibility,
market demand, user feedback and the overarching Apache Software Foundation
community development process can all affect timing and final delivery.
This document’s description of these features and technology directions does not
represent a contractual commitment, promise or obligation from Hortonworks to deliver
these features in any generally available product.
Product features and technology directions are subject to change, and must not be
included in contracts, purchase orders, or sales agreements of any kind.
Since this document contains an outline of general product development plans,
customers should not rely upon it when making purchasing decisions.

IoAT Data Grows Faster Than We Consume It
Much of the new data exists in-flight, between systems and devices, as part of the Internet of Anything.
NEW
TRADITIONAL
The Opportunity
Unlock transformational business value from a full fidelity of data and analytics for all data.
Geolocation
Server logs
Files & emails
ERP, CRM, SCM
Traditional Data Sources
Internet of Anything
Sensors and machines
Clickstream
Social media

Internet of Anything is Driving New Requirements
Need trusted insights from data at the very edge to the data lake in real time with full fidelity
– Data generated by sensors, machines, geo-location devices, logs, clickstreams, social feeds, etc.
Modern applications need access to both data-in-motion and data-at-rest
IoAT data flows are multi-directional and point-to-point
– Very different from existing ETL, data movement, and streaming technologies, which are generally one-directional
The perimeter is outside the data center and can be very jagged
– This “Jagged Edge” creates new opportunities for security, data protection, data governance and provenance

Architectural Limitations Today
• Traditional data movement software has been built for the world of standardized data and one-way flows
• Tools built for newer types of data tend to be custom, difficult to manage, and architecturally disjoint
• Businesses cannot easily collect, conduct, and curate secure multi-directional and point-to-point IoAT data flows
• IoAT data flows are not optimized, use costly or limited bandwidth, and cannot dynamically prioritize the most valuable data
• It is difficult to gain actionable insights from the combination of data-in-motion and data-at-rest

The IoAT Data Flow
Diagram labels: Internet of Anything; Hortonworks DataFlow, powered by Apache NiFi (Perishable Insights); Hortonworks Data Platform, powered by Apache Hadoop (Historical Insights); Store Data and Metadata; Enrich Context
Introducing Hortonworks DataFlow
Hortonworks DataFlow and the Hortonworks Data Platform
deliver the industry’s most complete solution for management of Big Data.

Simplistic View of IoAT & Data Flow
The Data Flow Thing
Process and Analyze Data
Acquire Data
Store Data

Realistic View of IoAT and Data Flow
Global interactions with customers, business partners, and things spanning different volume, velocity, bandwidth, and latency needs

Meeting IoAT Edge Requirements
GATHER
DELIVER
PRIORITIZE
Track from the edge through to the datacenter
Small Footprints operate with very little power
Limited Bandwidth can create high latency
Data Availability exceeds transmission bandwidth
Data Must Be Secured throughout its journey

Dataflow requirements within the Data Center
Understanding
Ability to observe precisely how systems exchange data in real-time and historically
Agility
Ability to interact with and alter live flows and iterate on new ones
Dynamic Access Controls
The entitlements of users and systems and sensitivity of data can change frequently
Cross Cutting Concerns
Address common needs once like enrichment, filtering, transformation
Enable architecture transition
Legacy vs modern is an ‘always’ event. Format, schema, protocol conversion is routine

Apache NiFi: Collect, Conduct, Curate
Collect: Bring Together
Aggregate all IoAT data from sensors, geo-location devices, machines, logs, files, and feeds via a highly secure lightweight agent
• Logs
• Files
• Feeds
• Sensors
Conduct: Mediate the Data Flow
Mediate point-to-point and bi-directional data flows, delivering data reliably to real-time applications and storage platforms such as HDP
• Deliver
• Secure
• Govern
• Audit
Curate: Gain Insights
Parse, filter, join, transform, fork, and clone data in motion to empower analytics and perishable insights
• Parse
• Filter
• Transform
• Fork
• Clone

A Brief History of Apache NiFi
2006
NiagaraFiles (NiFi) is first conceived by Joe Witt at the National Security Agency (NSA)
November 2014
NiFi is donated to the Apache Software Foundation (ASF) through NSA’s Technology Transfer Program and enters ASF’s incubator.
July 2015
NiFi reaches ASF top-level project status

Apache NiFi: Three key concepts
• Manage the flow of information
• Data Provenance
• Secure the control plane and data plane

Apache NiFi – Key Features
• Guaranteed delivery
• Data buffering
- Backpressure
- Pressure release
• Prioritized queuing
• Flow specific QoS
- Latency vs. throughput
- Loss tolerance
• Data provenance
• Recovery/recording a rolling log of fine-grained history
• Visual command and control
• Flow templates
• Pluggable/multi-role security
• Designed for extension
• Clustering
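
Data buffering with backpressure, pressure release, and prioritized queuing can be pictured with a small sketch. This is not NiFi code: the class `PrioritizedConnection`, its method names, and the reading of "pressure release" as ageing off items older than a maximum age are all assumptions made for illustration.

```python
import heapq
import itertools


class PrioritizedConnection:
    """Hypothetical bounded, prioritized queue between two processing steps."""

    def __init__(self, backpressure_threshold):
        self.backpressure_threshold = backpressure_threshold
        self._heap = []
        self._order = itertools.count()  # tie-breaker: FIFO within one priority

    def is_full(self):
        """True when the upstream step should stop producing (backpressure)."""
        return len(self._heap) >= self.backpressure_threshold

    def offer(self, item, priority, now):
        """Enqueue an item; lower priority numbers are delivered first.

        Returns False, without enqueuing, while backpressure is applied.
        """
        if self.is_full():
            return False
        heapq.heappush(self._heap, (priority, next(self._order), now, item))
        return True

    def poll(self):
        """Dequeue the most urgent item, or None when the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[3]

    def expire(self, now, max_age):
        """Pressure release (assumed meaning): drop items older than max_age."""
        dropped = [e[3] for e in self._heap if now - e[2] > max_age]
        self._heap = [e for e in self._heap if now - e[2] <= max_age]
        heapq.heapify(self._heap)
        return dropped


conn = PrioritizedConnection(backpressure_threshold=2)
conn.offer("routine reading", priority=5, now=0.0)
conn.offer("alarm", priority=1, now=1.0)
refused = not conn.offer("late arrival", priority=1, now=2.0)  # queue is full
first_out = conn.poll()  # the alarm overtakes the older routine reading
```

The timestamps are passed in explicitly so the behaviour is deterministic; a real system would read a clock instead.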

Common Apache NiFi Use Cases
Predictive Analytics
Ensure the highest value data is captured and available for analysis
Compliance
Gain full transparency into provenance and flow of data
IoT Optimization
Secure, Prioritize, Enrich and Trace data at the edge
Fraud Detection
Move sales transaction data in real time to analyze on demand
Big Data Ingest
Easily and efficiently ingest data into Hadoop
Value Resources
Gain visibility into how data sources are used to determine value

Flow Based Programming (FBP)
FBP Term | NiFi Term | Description
Information Packet | FlowFile | Each object moving through the system.
Black Box | FlowFile Processor | Performs the work, doing some combination of data routing, transformation, or mediation between systems.
Bounded Buffer | Connection | The linkage between processors, acting as a queue and allowing various processes to interact at differing rates.
Scheduler | Flow Controller | Maintains the knowledge of how processes are connected, and manages the threads and allocations thereof which all processes use.
Subnet | Process Group | A set of processes and their connections, which can receive and send data via ports. A process group allows creation of an entirely new component simply by composition of its components.
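
The mapping above can be made concrete with a toy model. The sketch below is an illustration only, not NiFi's Java API: the class and method names are hypothetical, and the `attributes` dictionary on a FlowFile is an addition beyond what the table states.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class FlowFile:
    """Information packet: one object moving through the system."""
    content: bytes
    attributes: dict = field(default_factory=dict)  # assumption, not in the table


class Connection:
    """Bounded buffer: a queue linking two processors."""

    def __init__(self, capacity=100):
        self.capacity = capacity
        self.queue = deque()

    def has_room(self):
        return len(self.queue) < self.capacity


class Processor:
    """Black box: takes one FlowFile, applies `work`, passes the result on."""

    def __init__(self, name, work, inbound, outbound):
        self.name = name
        self.work = work          # function FlowFile -> FlowFile or None
        self.inbound = inbound
        self.outbound = outbound

    def on_trigger(self):
        """Process one FlowFile if possible; return True if work was done."""
        if not self.inbound.queue or not self.outbound.has_room():
            return False
        result = self.work(self.inbound.queue.popleft())
        if result is not None:    # None means the FlowFile was filtered out
            self.outbound.queue.append(result)
        return True


class FlowController:
    """Scheduler: knows the processors and triggers them until all are idle."""

    def __init__(self, processors):
        self.processors = processors

    def run_until_idle(self):
        while any([p.on_trigger() for p in self.processors]):
            pass


inbox, middle, outbox = Connection(), Connection(), Connection()
upper = Processor("upper", lambda ff: FlowFile(ff.content.upper(), ff.attributes),
                  inbox, middle)
drop_empty = Processor("drop_empty", lambda ff: ff if ff.content else None,
                       middle, outbox)
inbox.queue.extend([FlowFile(b"hello"), FlowFile(b""), FlowFile(b"nifi")])
FlowController([upper, drop_empty]).run_until_idle()
delivered = [ff.content for ff in outbox.queue]
```

A Process Group would correspond to wrapping `upper`, `drop_empty`, and `middle` behind a single inbound and outbound connection, so that the pair can be reused as one component.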

Hortonworks DataFlow
Powered by Apache NiFi
Visual User Interface
HTML 5, drag and drop, for agile execution
High Throughput, Low Bandwidth
for any data, big or small
Provenance Metadata
for governance and compliance
Secure End-to-End Data Routing
with encryption and compression

Basics of Connecting Systems
For every connection,
these must agree:
1. Protocol
2. Format
3. Schema
4. Priority
5. Size of event
6. Frequency of event
7. Authorization access
8. Relevance
P1
Producer
C1
Consumer
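
One way to picture the eight points of agreement is as fields that a producer and a consumer each declare, with a check that lists where they differ. The sketch below is purely illustrative: the `Endpoint` type, its field names, and the sample values are hypothetical.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class Endpoint:
    """What one side of a connection expects, one field per point on the slide."""
    protocol: str
    data_format: str
    schema: str
    priority: str
    event_size: str
    event_frequency: str
    authorization: str
    relevance: str


def mismatches(producer, consumer):
    """Return the names of the fields on which the two sides disagree."""
    return [f.name for f in fields(Endpoint)
            if getattr(producer, f.name) != getattr(consumer, f.name)]


p1 = Endpoint("HTTP", "JSON", "orders-v1", "high", "1 KB", "10/s", "token", "all")
c1 = replace(p1, data_format="Avro", schema="orders-v2")
disagreements = mismatches(p1, c1)  # the connection works only if this is empty
```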

Using Messaging
Only a subset agree using messaging:
1. Protocol
2. Format
3. Schema
4. Priority
5. Size of event
8. Relevance
P1
CN
C1
Messaging
More issues to consider:
• How do you know what the data flow looks like?
• How is it managed?
• How is it working – today, yesterday?

Using an Enterprise Service Bus (ESB)
Still, only a subset agree using an ESB:
1. Protocol
2. Format
3. Schema
4. Priority
5. Size of event
8. Relevance
P1
Broker
CN
C1
Messaging
Even more issues to consider:
• Remote procedure calls (RPC) and throughput issues are introduced
• Design and deploy management – slow setup, not interactive
• You can scale out, but not up or down
• You still don’t know what the data flow looks like

Architecture
Single NiFi node (diagram labels)
OS/Host
JVM
Web Server
Flow Controller
Processor 1, Extension N
FlowFile Repository
Content Repository
Provenance Repository
Local Storage
NiFi cluster (diagram labels)
Master: NiFi Cluster Manager (NCM)
OS/Host
JVM
Web Server
NiFi Cluster Manager – Request Replicator
Slaves: NiFi Nodes
OS/Host
JVM
Web Server
Flow Controller
FlowFile Repository
Content Repository
Provenance Repository
Local Storage
High Availability: Control plane vs Data plane…

Define A Hortonworks DataFlow
• Easy to use drag and drop UI
• Flexible to define the Data Flow

HDF – Powered by Apache NiFi

Add processor for data intake
1 Drag and drop processor icon from the top menu

Choose the specific processor
2 Choose one of the processors – currently 90 available – designed for extension

Example: Pick Twitter Processor

Configure the processor
3 Select processor and choose option to Configure
4 Adjust parameters as required

Another processor for data output
5 Drag and drop processor icon from the top menu
6 Example: choose PutHDFS processor

Configure second processor
7 Configure 2nd processor

Connect processors, configure connection
8

Click Start to begin processing
9

See processors update with real time changes
10 As data flows, the GUI updates in real time.

Dynamically adjust and tune data flow as needed
11 Dynamically adjust and tune dataflow as needed, in real time. Can also replicate data for testing and comparison.

Understand the data path with Data Provenance
14 Select Data Provenance

Trace lineage of a particular piece of data
15 Icon for Data Lineage

Every change to data is tracked: processing, views
16 Provenance event is tracked

Updates as changes happen
17 Updates as data flows

Easily access and trace changes to dataflow

Audit trail of Hortonworks DataFlow User Actions

NiFi is complementary to Hadoop
Operations
Deployment flexibility from devices to data center. Delivers data flow QoS across dimensions such as loss-tolerant vs. guaranteed delivery, low latency vs. high throughput, and priority-based queuing.
Governance
Starting at the source, captures fine-grained metadata regarding all data received, forked, joined, cloned, modified, sent, and ultimately dropped as data reaches its configured end-state, delivering comprehensive governance (aka provenance, chain of custody).
Security
Secures the data movement from beginning to end. Allows fine-grained data authorization policies to be enforced at the flow level.

Operations
• Reporting tasks (push)
• Statistics / status (pull)
• Dynamic flow changes
- Push new business rules via REST API (closed loop)
- Pull updates periodically from web services
• Site-to-site
- Stay at the ‘flow level’ rather than suddenly dropping down to file transfer protocols
• Extensible
• Optimized user experience – log hunts should be the exception
Scale down, up, and out – in containers and on virtual machines

The Need for Data Provenance
For Operators
• Traceability, lineage
• Recovery and replay
For Compliance
• Audit trail
For Business
• Value sources
• Value IT investment
BEGIN
END
LINEAGE
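
Traceability and lineage can be sketched as an append-only log of events in which each derived item points back to its parent. The code below is a minimal illustration, not NiFi's provenance repository: `ProvenanceEvent`, `ProvenanceLog`, and the event type strings are hypothetical names chosen to echo the verbs used in this deck (received, cloned, sent).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ProvenanceEvent:
    event_type: str                  # illustrative labels, e.g. "RECEIVE", "CLONE"
    flowfile_id: str
    parent_id: Optional[str] = None  # set when this item was derived from another
    details: str = ""


class ProvenanceLog:
    """Append-only record of what happened to each piece of data."""

    def __init__(self):
        self._events = []

    def record(self, event):
        self._events.append(event)

    def lineage(self, flowfile_id):
        """Events for an item and all of its ancestors, oldest ancestor first."""
        chain, seen, current = [], set(), flowfile_id
        while current is not None and current not in seen:
            seen.add(current)
            events = [e for e in self._events if e.flowfile_id == current]
            chain = events + chain
            current = next((e.parent_id for e in events if e.parent_id), None)
        return chain


log = ProvenanceLog()
log.record(ProvenanceEvent("RECEIVE", "a", details="from sensor gateway"))
log.record(ProvenanceEvent("CLONE", "b", parent_id="a"))
log.record(ProvenanceEvent("RECEIVE", "c", details="unrelated item"))
log.record(ProvenanceEvent("SEND", "b", details="to HDFS"))
trail = [e.event_type for e in log.lineage("b")]
```

Because the log keeps every event, the same structure serves operators (trace and replay), compliance (audit trail), and the business (which sources actually feed valuable outputs).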

Extending Data Governance from the Edge to Hadoop
Data Governance Requirements
Transparent
Governance standards and protocols must be clearly defined and available to all
Reproducible
Recreate the relevant data landscape at a given point in time
Auditable
Trace all relevant events and assets with appropriate historical lineage
Consistent
Compliance practices must be consistent
Hadoop Data Platform
Must snap into existing data governance frameworks and openly exchange metadata
Diagram labels: Internet of Anything; Traditional Data Systems (SCM, CRM, ERP); ETL / DQ; MDM; ARCHIVE; Holistic Data Governance; Business Analytics; Visualization & Dashboards

The Need for Fine-grained Security and Compliance
It’s not enough to say you have encrypted communications
• Enterprise authorization services – entitlements change often
• People and systems with different roles require different access levels
• Tagged/classified data

Security
Administration
Central management and
consistent security
• NiFi Cluster Manager
Authentication
Authenticate users and systems
• 2-Way SSL support out of the box; additional types coming
Authorization
Provision access to data
• Pluggable authorization designed to fit any Identity and Access Management (IAM) scheme
• File-based authority provider out of the box
• Multi-role
Audit
Maintain a record of data access
• Detailed logging of all user actions
• Detailed logging of key system behaviors
• Data Provenance enables unparalleled tracking from the edge through the Lake
Data Protection
Protect data at rest and in motion
• Support a variety of SSL/encrypted protocols
• Tag and utilize tags on data for fine grained access controls
• Encrypt/decrypt content using pre-shared key mechanisms
Administrator | Configure system threads, user accounts, and flow audit history
Data Flow Manager | Manipulate the dataflow
Read Only | View the dataflow only
NiFi | Configure system threads, user accounts, and flow audit history
Proxy | Manipulate the dataflow
Provenance | Query the provenance repository and download content

Operations: Planned

Planned Apache NiFi Enhancements
IN PROGRESS: Enhanced configuration management of flows
STARTED: Extension and template registry
TARGETED TO NIFI 0.4.0 RELEASE: First-class Avro support¹
STARTED: Interactive queue management
STARTED: Multi-tenant data flow
FUTURE: Pluggable authentication
FUTURE: Reference-able process groups
FUTURE: Variable registry
https://cwiki.apache.org/confluence/display/NIFI/NiFi+Feature+Proposals

Try It Yourself
Download NiFi and the HDP Sandbox from hortonworks.com/sandbox
Tweet: #hadooproadshow

Thank you!
Mats Johansson
mjohansson@hortonworks.com
@matsjo66
https://se.linkedin.com/in/matsjo66
