HDF Powered by Apache NiFi
Intro
Milind Pandit
Solutions Engineer
2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Agenda HDF 2.0: Flow Management
– NiFi basics
– NiFi use cases
– NiFi demos
Simplistic View of Enterprise Data Flow
Data Flow
Process and Analyze
Data
Acquire Data
Store Data
Interacting with different business partners and customers
Realistic View of Enterprise Data Flow
Connected Data Platforms
Stream Processing
Flow Management
Enterprise Services
At the edge
Security
Visualization
On premises In the cloud
Registries/Catalogs Governance (Security/Compliance) Operations
HDF 2.0 – Data in Motion Platform
Hortonworks DataFlow (HDF)
 Constrained
 High-latency
 Localized context
 Hybrid – cloud/on-premises
 Low-latency
 Global context
SOURCES
REGIONAL
INFRASTRUCTURE
CORE
INFRASTRUCTURE
• Visual Command and Control: for agile and immediate creation, configuration, and control of dataflows
• Data Lineage (Provenance): ensures trust in your data
• Data Prioritization: because not all data is of equal importance
• Data Buffering/Back-Pressure: because not all senders, receivers, and connections work perfectly all the time
• Control Latency vs. Throughput: adapt to different situations with different requirements
• Secure Control Plane/Data Plane: security of data and of data access
• Scale-out Clustering: scalability
• Extensibility: ecosystem flexibility and growth
Apache NiFi: Designed for 8 challenges of global enterprise dataflow
Apache NiFi: Three key concepts
• Manage the flow of information
• Data Provenance
• Secure the control plane and data plane
Apache NiFi – Key Features
• Guaranteed delivery
• Data buffering
- Backpressure
- Pressure release
• Prioritized queuing
• Flow specific QoS
- Latency vs. throughput
- Loss tolerance
• Data provenance
• Recovery/recording: a rolling log of fine-grained history
• Visual command and control
• Flow templates
• Pluggable/multi-role security
• Designed for extension
• Clustering
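The buffering, backpressure, and prioritized-queuing features listed above can be illustrated with a small sketch. This is not NiFi's implementation: the class name `BoundedPriorityQueue`, the `backpressure_threshold` parameter, and the numeric priorities are all hypothetical, chosen only to show the idea that a full queue refuses new data while higher-priority data leaves first.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count


@dataclass(order=True)
class QueuedItem:
    priority: int                        # lower number is delivered first
    seq: int                             # tie-breaker: FIFO within one priority
    payload: bytes = field(compare=False)


class BoundedPriorityQueue:
    """Hypothetical connection queue with an object-count backpressure threshold."""

    def __init__(self, backpressure_threshold: int):
        self.backpressure_threshold = backpressure_threshold
        self._heap = []
        self._seq = count()

    def is_full(self) -> bool:
        return len(self._heap) >= self.backpressure_threshold

    def offer(self, payload: bytes, priority: int) -> bool:
        """Return False (apply backpressure) instead of growing without bound."""
        if self.is_full():
            return False
        heapq.heappush(self._heap, QueuedItem(priority, next(self._seq), payload))
        return True

    def poll(self):
        """Remove and return the highest-priority payload, or None if empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap).payload


queue = BoundedPriorityQueue(backpressure_threshold=2)
accepted = [
    queue.offer(b"bulk log line", priority=5),
    queue.offer(b"alarm", priority=1),
    queue.offer(b"more bulk", priority=5),   # refused: threshold reached
]
first_out = queue.poll()
```

In this sketch the upstream producer is expected to pause and retry when `offer` returns False; a "pressure release" policy would instead age out or drop old entries, which is left out here.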
Common Apache NiFi Use Cases
Predictive Analytics
Ensure the highest value data is captured and available for analysis
Compliance
Gain full transparency into provenance and flow of data
IoT Optimization
Secure, Prioritize, Enrich and Trace data at the edge
Fraud Detection
Move sales transaction data in real time to analyze on demand
Big Data Ingest
Easily and efficiently ingest data into Hadoop
Value Resources
Gain visibility into how data sources are used to determine value
What is Apache NiFi used for?
• Reliable and secure transfer of data between systems
• Delivery of data from sources to analytic platforms
• Enrichment and preparation of data:
– Conversion between formats
– Extraction/Parsing
– Routing decisions
What is Apache NiFi NOT used for?
• Distributed Computation
• Complex Event Processing
• Joins / Complex Rolling Window Operations
Use Cases for Apache NiFi
FlowFile
• Unit of data moving through the system
• Content + Attributes (key/value pairs)
Processor
• Performs the work, can access FlowFiles
Connection
• Links between processors
• Queues that can be dynamically prioritized
Terminology
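The three terms above can be modeled in a few lines. This is a toy model under stated assumptions, not the NiFi API: the Python classes `FlowFile`, `Connection`, and `Processor`, the `run_once` method, and the `add_size_attribute` function are hypothetical names used only to show how content, attributes, and queues relate.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, Dict, Optional


@dataclass
class FlowFile:
    """Unit of data: opaque content plus key/value attributes."""
    content: bytes
    attributes: Dict[str, str] = field(default_factory=dict)


class Connection:
    """Link between processors, backed by a simple FIFO queue."""

    def __init__(self) -> None:
        self._queue: Deque[FlowFile] = deque()

    def put(self, flowfile: FlowFile) -> None:
        self._queue.append(flowfile)

    def take(self) -> Optional[FlowFile]:
        return self._queue.popleft() if self._queue else None

    def __len__(self) -> int:
        return len(self._queue)


class Processor:
    """Performs the work: takes a FlowFile from one connection, emits to another."""

    def __init__(self, work: Callable[[FlowFile], FlowFile],
                 source: Connection, destination: Connection) -> None:
        self.work = work
        self.source = source
        self.destination = destination

    def run_once(self) -> bool:
        flowfile = self.source.take()
        if flowfile is None:
            return False                 # nothing queued, nothing to do
        self.destination.put(self.work(flowfile))
        return True


def add_size_attribute(flowfile: FlowFile) -> FlowFile:
    # Work on attributes only; the content passes through unchanged.
    attrs = dict(flowfile.attributes, fileSize=str(len(flowfile.content)))
    return FlowFile(flowfile.content, attrs)


inbound, outbound = Connection(), Connection()
inbound.put(FlowFile(b"Hello world", {"filename": "a.txt"}))
Processor(add_size_attribute, inbound, outbound).run_once()
result = outbound.take()
```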
HTTP Data FlowFile
HTTP/1.1 200 OK
Date: Sun, 10 Oct 2010 23:26:07 GMT
Server: Apache/2.2.8 (CentOS) OpenSSL/0.9.8g
Last-Modified: Sun, 26 Sep 2010 22:04:35 GMT
Content-Type: text/html
Hello world XXXXXXXXXXXXXXXXXXXXXXXXXXXX
Key: 'entryDate' Value: 'Fri Jun 17 17:15:04 EDT 2016'
Key: 'fileSize' Value: '23609'
Key: 'filename' Value: '15650246997242'
Key: 'path' Value: './'
0101010101110101010101010101 (Binary)
Header
Content
Analogy: FlowFiles are like HTTP Data
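To make the analogy concrete, the sketch below splits a raw HTTP response into its header and body and maps them onto an attributes dictionary and a content byte string. The function names and the `http.status` attribute key are hypothetical, picked for illustration; they are not attribute names defined by NiFi.

```python
from typing import Dict, Tuple


def split_http_response(raw: bytes) -> Tuple[str, Dict[str, str], bytes]:
    """Split a raw HTTP/1.1 response into status line, headers, and body."""
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    status_line, header_lines = lines[0], lines[1:]
    headers = {}
    for line in header_lines:
        # Split on the first colon only, so values such as dates stay intact.
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return status_line, headers, body


def to_flowfile_parts(raw: bytes) -> Tuple[Dict[str, str], bytes]:
    """Header becomes attributes (key/value pairs); body becomes content."""
    status_line, headers, body = split_http_response(raw)
    attributes = {"http.status": status_line, **headers}
    return attributes, body


raw_response = (
    b"HTTP/1.1 200 OK\r\n"
    b"Date: Sun, 10 Oct 2010 23:26:07 GMT\r\n"
    b"Content-Type: text/html\r\n"
    b"\r\n"
    b"Hello world"
)
attributes, content = to_flowfile_parts(raw_response)
```

As with an HTTP header, the attributes can be read and routed on without touching the content bytes.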
1. Drag and drop processors to build a flow
2. Start, stop, and configure components in real time
3. View errors and corresponding error messages
4. View statistics and health of data flow
5. Create templates of common processors and connections
Create, Run, View, Start, Stop, Change, and Fix Dataflows in Real Time
Apache NiFi Demo: Tail Logs, Route on Content, Buffer in Kafka,
Deliver to HDFS
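The "route on content" step of the demo can be sketched in plain Python. This only illustrates the routing idea; it does not use NiFi, Kafka, or HDFS. The route names, the regular expressions, and the sample log lines are assumptions made for the example, and the returned lists stand in for whatever queue or store each route would feed.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Pattern

# Hypothetical routing rules: route name -> pattern matched against each line.
ROUTES: Dict[str, Pattern[str]] = {
    "errors": re.compile(r"\bERROR\b"),
    "warnings": re.compile(r"\bWARN\b"),
}


def route_on_content(lines: Iterable[str],
                     routes: Dict[str, Pattern[str]] = ROUTES) -> Dict[str, List[str]]:
    """Send each line to the first route whose pattern matches, else 'unmatched'."""
    routed = defaultdict(list)
    for line in lines:
        for name, pattern in routes.items():
            if pattern.search(line):
                routed[name].append(line)
                break
        else:
            routed["unmatched"].append(line)
    return dict(routed)


tailed_lines = [
    "2016-06-17 17:15:04 INFO  service started",
    "2016-06-17 17:15:09 ERROR disk full",
    "2016-06-17 17:15:11 WARN  retrying write",
]
routed = route_on_content(tailed_lines)
```

Because the rules live in a dictionary, they can be swapped while the loop keeps running, which mirrors the point that routing rules change at run time.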
What is Data Provenance and Why is it Important?
BEGIN
END
LINEAGE
IT and Cloud Operators
• Understand traceability, lineage
• Enable recovery and replay
Compliance Regulations
• Provide an audit trail
• Remediation capabilities
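A lineage trace from BEGIN to END can be pictured as a walk over recorded events. The sketch below is a simplified model, not NiFi's provenance repository: `ProvenanceEvent`, `ProvenanceLog`, the single `parent_id` field, and the identifiers `ff1`/`ff2` are hypothetical, and the event type and component strings are used only as illustrative labels.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ProvenanceEvent:
    flowfile_id: str
    event_type: str                  # illustrative labels, e.g. RECEIVE, FORK, SEND
    component: str                   # which processor recorded the event
    parent_id: Optional[str] = None  # set when this FlowFile was derived from another


class ProvenanceLog:
    """Append-only event log with a backwards lineage walk."""

    def __init__(self) -> None:
        self._events: List[ProvenanceEvent] = []

    def record(self, event: ProvenanceEvent) -> None:
        self._events.append(event)

    def lineage(self, flowfile_id: str) -> List[ProvenanceEvent]:
        """Return events from the original source down to flowfile_id, oldest first."""
        chain: List[ProvenanceEvent] = []
        current: Optional[str] = flowfile_id
        seen = set()
        while current is not None and current not in seen:
            seen.add(current)
            events = [e for e in self._events if e.flowfile_id == current]
            chain = events + chain       # ancestors go in front
            parents = [e.parent_id for e in events if e.parent_id is not None]
            current = parents[0] if parents else None
        return chain


log = ProvenanceLog()
log.record(ProvenanceEvent("ff1", "RECEIVE", "TailFile"))
log.record(ProvenanceEvent("ff2", "FORK", "SplitText", parent_id="ff1"))
log.record(ProvenanceEvent("ff2", "SEND", "PutHDFS"))
trail = [(e.flowfile_id, e.event_type) for e in log.lineage("ff2")]
```

The same log supports both audiences on the slide: operators can replay from any recorded point, and auditors get an ordered trail of who touched the data.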
Provenance Enables Easy Access and Traceability of Changes
Need Fine-Grained Security and Compliance?
Security
• Secured authentication
• Enterprise authorization services –
entitlements change often
• Encrypted content, encrypted
communications
• People and systems with different roles
require different access levels
• Tagged/classified data
Repositories - Pass by reference
Repositories – Copy on Write
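The two repository slides carry only their titles here, so the following is a guess at the idea they illustrate, offered as a sketch rather than a description of NiFi internals. It assumes that a FlowFile record holds a reference (a "claim") to immutable content, that moving a record between queues copies only the reference, and that changing content writes a new blob instead of overwriting the old one. `ContentRepository`, `FlowFileRecord`, and `transform` are hypothetical names.

```python
from dataclasses import dataclass, replace
from itertools import count
from typing import Callable, Dict


class ContentRepository:
    """Stores content blobs that are never modified once written."""

    def __init__(self) -> None:
        self._blobs: Dict[int, bytes] = {}
        self._ids = count(1)

    def write(self, data: bytes) -> int:
        claim = next(self._ids)
        self._blobs[claim] = data
        return claim

    def read(self, claim: int) -> bytes:
        return self._blobs[claim]


@dataclass(frozen=True)
class FlowFileRecord:
    claim: int                       # pass by reference: a pointer, not the bytes


def transform(repo: ContentRepository, record: FlowFileRecord,
              fn: Callable[[bytes], bytes]) -> FlowFileRecord:
    """Copy on write: store the changed content under a new claim."""
    new_claim = repo.write(fn(repo.read(record.claim)))
    return replace(record, claim=new_claim)


repo = ContentRepository()
original = FlowFileRecord(claim=repo.write(b"hello"))
routed = original                    # routing copies the reference only
modified = transform(repo, original, bytes.upper)
```

Keeping the original blob intact is what would make replay from an earlier point possible in such a design.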
NiFi Architecture
Edge Intelligence with Apache MiNiFi
 Guaranteed delivery
 Data buffering
‒ Backpressure
‒ Pressure release
 Prioritized queuing
 Flow specific QoS
‒ Latency vs. throughput
‒ Loss tolerance
 Data provenance
 Recovery / recording a rolling log
of fine-grained history
 Designed for extension
Different from Apache NiFi
 Design and Deploy
 Warm re-deploys
Key Features
NiFi vs. MiNiFi Java Agent
NiFi Framework
Components
MiNiFi
NiFi Framework
User Interface
Components
NiFi
Example: Company X provides alerting services when a user's resting heart rate is higher
than a threshold
Real-Time Insights Require DataFlow Mgmt and Stream Processing
Acquire
Data
Company X Cloud
Instance 1
Acquire
Data
Company X Cloud
Instance 2
Acquire
Data
Company X Cloud
Instance 3
Acquire Data
Across Cloud
Instances
Parse, Filter,
Validate, Enrich
and Route
Core Data Center
Analytics/Pattern
Match
Data
Store
Alerts
Dashboards/Visualization
Legend: Flow Management, Stream Processing
Data in Motion Needs Dataflow Management and Stream Processing
• Acquire data from the cloud instances of various wearable devices
• Move data from customer cloud instances to the on-premises instance
• Perform intelligent routing and filtering of data. The routing and filtering rules will often
change at run time.
• Deliver the data to various downstream systems. New downstream apps will keep
appearing, and the data should be fed to each one when it comes online.
• Parse the device data into a standardized format that downstream systems can understand
• Enrich the data with contextual information, including patient/customer info (age, sex, etc.)
• Recognize the pattern when the resting heart rate exceeds a certain threshold (the insight),
and then create an alert/notification.
• Run an outlier detection model on the incoming heart-rate stream. If the score is above
a certain threshold, alert on the heart rate.
Flow
Management
(NiFi, MiNiFi)
Stream
Processing
(Storm, Kafka)
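The two stream-processing tasks above (threshold alerting and outlier scoring) can be sketched without any streaming framework. The slides do not say which outlier model is used, so the z-score below is an assumption chosen for simplicity; the function names, the 100 bpm threshold, and the sample readings are likewise hypothetical.

```python
from statistics import mean, pstdev
from typing import Iterable, List, Sequence, Tuple


def threshold_alerts(readings: Iterable[Tuple[str, int]],
                     threshold_bpm: int) -> List[Tuple[str, int]]:
    """Return (user, bpm) pairs whose resting heart rate exceeds the threshold."""
    return [(user, bpm) for user, bpm in readings if bpm > threshold_bpm]


def outlier_score(history: Sequence[float], value: float) -> float:
    """Z-score of value against a user's recent history (assumed model)."""
    sigma = pstdev(history)
    if sigma == 0:
        return 0.0                   # flat history: no basis for scoring
    return abs(value - mean(history)) / sigma


readings = [("u1", 62), ("u2", 110), ("u3", 75)]
alerts = threshold_alerts(readings, threshold_bpm=100)

history = [60, 62, 58, 61, 59]
score = outlier_score(history, 95)
is_outlier = score > 3.0             # assumed cut-off for raising an alert
```

In the architecture on the slide, the flow-management layer would deliver parsed and enriched readings, and logic of this kind would run in the stream-processing layer.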
Data in
Motion
(Cloud)
Data in
Motion
(on-premises)
Data at
Rest
(on-premises)
Edge
Data
Data in
Motion
Edge
Analytics
Data at
Rest
(Cloud)
Edge
Data
Data at
Rest
(on-premises)
Closed Loop
Analytics
Machine
Learning
Deep
Historical
Analysis
The Future of Data
Architectural Transformation Enabled By Connected Data Platforms
On Prem | Cloud
Use Cases for Data in Motion
Use Cases for Data-in-Motion Using DataFlow Mgmt
• Data Ingestion
• Edge Intelligence
• First Mile Problem
• Physical Data Movement
• Simple event processing such as Route, Filter, Enrich,
Transform, etc.
When Only DataFlow
Management is
Required
Use Cases for Data-in-Motion Using DataFlow Mgmt and
Stream Processing
• Flow Management to deliver data for Stream Processing
• PLUS: Complex pattern matching on unbounded streams of
data.
When Both DataFlow
Management and
Stream Processing
Flow management
[Diagram labels: DATA IN MOTION; DATA AT REST; IoT Data Sources; AWS; Azure; Google Cloud; Hadoop; NiFi; Kafka; Storm; Others…; multiple MiNiFi agents and NiFi instances]
HDF 2.0: Data-in-Motion Platform
Enterprise Services
Ambari Ranger Other services
Flow management + Stream Processing
New Stream Processing Features HDF 2.0
 New Storm Connectors
 Storm-Kafka Spout using new
client APIs
 Storm Distributed Log Search
 Storm Dynamic Worker
Profiling
 Kafka Grafana Integration
 Storm Grafana Integration
 Improved Nimbus HA
 Storm Automatic Back
Pressure
 Storm Distributed cache
 Storm Windowing and State
Management
 Storm Performance
improvements
 Improved Kafka SASL
 Storm Topology Event inspector
 Storm Resource Aware
Scheduling
 Storm Dynamic Log Levels
 Pacemaker Storm Daemon
 Kafka Rack Awareness
Developer Productivity | Enterprise Readiness | Operational Simplicity
More Information, Resources
Hortonworks Community Connection:
Data Ingestion and Streaming
https://community.hortonworks.com
 Partnerworks: http://hortonworks.com/partners/
 HDF Certification:
http://hortonworks.com/partners/product-integration-certification/
 Webinars: http://hortonworks.com/events-webcasts/
 Sandbox: http://hortonworks.com/events-webcasts/
 HDF: http://hortonworks.com/hdf/
 HDP: http://hortonworks.com/hdp/


Editor's Notes

  • #6 Hortonworks: Powering the Future of Data
  • #8 TALK TRACK Hortonworks DataFlow is powered by Apache NiFi, Kafka, and Storm, all key components of any streaming data architecture. MiNiFi/NiFi: dynamic, configurable data pipelines. Kafka: adapts to differing rates of data creation and delivery. Storm: real-time streaming analytics that create immediate insights at massive scale. Only Hortonworks offers all of this as part of a Connected Data Platform that optimizes for delivery into HDP (HDFS, Hive, Spark, HBase, etc.). There are scenarios where NiFi will provide all that you need, but you will notice that the orange and blue horizontal triangles show a continuum of capability from edge to core, indicating varying degrees of need for the different products.
  • #12 Focus on predictive analytics case – use the uptake/cat/etc.. Case but generified.
  • #23 Introduce the architecture of NiFi, describe major system components, and describe the single-node and clustering models. For each component, describe its available (and potential) deployment models (relate it to Hadoop). Focus on the two deployment models (single node and cluster); roughly think of this as 'edge' vs. 'data center'.