Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Together

© Hortonworks Inc. 2011 – 2016. All Rights Reserved
Harnessing Data-in-Motion
with Hortonworks DataFlow
Apache NiFi, Kafka and Storm
Better Together
Bryan Bende
Sr. Software Engineer
Haimo Liu
Product Manager

Agenda
• Introduction to Hortonworks DataFlow
• Introduction to Apache projects
• Better together
• Best Practices
• Demo

Connected Data Platforms

HDF 2.0 – Data in Motion Platform
• Stream Processing
• Flow Management
• Enterprise Services
[Diagram labels: at the edge, on premises, in the cloud; security, visualization, registries/catalogs, governance (security/compliance), operations]

HDF 2.0 – Data in Motion Platform
[Diagram. Data in motion: IoT data sources, MiNiFi agents, and NiFi (flow management); NiFi, Kafka, and Storm (flow management + stream processing). Data at rest: AWS, Azure, Google Cloud, Hadoop, others. Enterprise services: Ambari, Ranger, other services.]

Introduction to
Apache Projects

What is Apache NiFi?
• Created to address the challenges of global enterprise dataflow
• Key features:
– Visual Command and Control
– Data Lineage (Provenance)
– Data Prioritization
– Data Buffering/Back-Pressure
– Control Latency vs. Throughput
– Secure Control Plane / Data Plane
– Scale Out Clustering
– Extensibility

Apache NiFi
What is Apache NiFi used for?
• Reliable and secure transfer of data between systems
• Delivery of data from sources to analytic platforms
• Enrichment and preparation of data:
– Conversion between formats
– Extraction/Parsing
– Routing decisions
What is Apache NiFi NOT used for?
• Distributed Computation
• Complex Event Processing
• Complex Rolling Window Operations

NiFi Terminology
FlowFile
• Unit of data moving through the system
• Content + Attributes (key/value pairs)
Processor
• Performs the work, can access FlowFiles
Connection
• Links between processors
• Queues that can be dynamically prioritized
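
To make the terminology concrete, here is a minimal Python sketch of a FlowFile (content plus key/value attributes) and a Connection acting as a prioritized queue. It is a toy model, not NiFi code: the class names, the `priority` attribute rule, and its default of 100 are all assumptions chosen for illustration.

```python
import heapq
import itertools
from dataclasses import dataclass, field


@dataclass
class FlowFile:
    """Toy stand-in for a FlowFile: content bytes plus string attributes."""
    content: bytes
    attributes: dict = field(default_factory=dict)


class Connection:
    """Toy queue between two processors, ordered by a priority function."""

    def __init__(self, priority_of):
        self._priority_of = priority_of    # lower value is dequeued first
        self._heap = []
        self._arrival = itertools.count()  # tie-breaker keeps arrival order

    def enqueue(self, flow_file):
        entry = (self._priority_of(flow_file), next(self._arrival), flow_file)
        heapq.heappush(self._heap, entry)

    def dequeue(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


def by_priority_attribute(flow_file):
    # Hypothetical rule: a numeric "priority" attribute, defaulting to 100.
    return int(flow_file.attributes.get("priority", "100"))


queue = Connection(by_priority_attribute)
queue.enqueue(FlowFile(b"routine reading", {"filename": "a.txt"}))
queue.enqueue(FlowFile(b"engine alarm", {"filename": "b.txt", "priority": "1"}))
first = queue.dequeue()
print(first.attributes["filename"])  # b.txt, because it carries the lower priority value
```

Swapping the priority function changes the dequeue order without touching the queued FlowFiles, which is the sense in which a connection can be prioritized dynamically.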

What is Apache Kafka?
• Distributed streaming platform that allows publishing and subscribing to streams of records
• Streams of records are organized into categories called topics
• Topics can be partitioned and/or replicated
• Records consist of a key, value, and timestamp
http://kafka.apache.org/intro
[Diagram: producers write to a Kafka cluster; consumers read from it.]

Kafka: Anatomy of a Topic
[Diagram: a topic with three partitions (0, 1, 2). Each partition is an ordered sequence of records numbered from 0 (old) upward (new); the partitions hold different numbers of records, and writes are appended at the new end.]
• Partitioning allows topics to scale beyond a single machine/node
• Topics can also be replicated for high availability
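
The partition layout above can be sketched in a few lines of Python. This is a toy in-memory model, not the Kafka client API; the `Topic` class and its method names are hypothetical. It shows only that each partition is an independent append-only log whose offsets start at 0 and grow with each write.

```python
class Topic:
    """Toy model of a topic: a fixed set of append-only partition logs."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, partition, record):
        """Append a record at the new end of a partition and return its offset."""
        log = self.partitions[partition]
        log.append(record)
        return len(log) - 1

    def read(self, partition, offset):
        """Read the record stored at a given offset of a partition."""
        return self.partitions[partition][offset]


topic = Topic(num_partitions=3)
offsets = [topic.append(0, f"event-{i}") for i in range(3)]
other = topic.append(1, "event-x")
print(offsets, other)  # [0, 1, 2] 0
```

Offsets are per partition, so partition 1 starts again at 0 even though partition 0 already holds three records.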

NiFi and Kafka Are Complementary
NiFi
Provides a dataflow solution
• Centralized management, from edge to core
• Great traceability, event level data provenance starting when data is born
• Interactive command and control – real time operational visibility
• Dataflow management, including prioritization, back pressure, and edge intelligence
• Visual representation of global dataflow
Kafka
Provides a durable stream store
• Low latency
• Distributed data durability
• Decentralized management of producers & consumers

What is Apache Storm?
• Distributed, low-latency, fault-tolerant, Stream Processing platform.
• Provides processing guarantees.
• Key concepts include:
– Tuples
– Streams
– Spouts
– Bolts
– Topology

Storm - Tuples and Streams
• What is a Tuple?
– Fundamental data structure in Storm
– Named list of values that can be of any data type
• What is a Stream?
– An unbounded sequence of tuples
– The core abstraction in Storm; streams are what you “process” in Storm

Storm - Spouts
• What is a Spout?
– Source of data
– E.g.: JMS, Twitter, Log, Kafka Spout
– Can spin up multiple instances of a Spout and dynamically adjust as needed

Storm - Bolts
• What is a Bolt?
– Processes any number of input streams and produces output streams
– Common processing in bolts includes functions, aggregations, joins, reads/writes to data stores, and alerting logic
– Can spin up multiple instances of a Bolt and dynamically adjust as needed

Storm - Topology
• What is a Topology?
– A network of spouts and bolts wired together into a workflow
[Diagram: Truck-Event-Processor Topology. A Kafka Spout and four bolts (HBase Bolt, Monitoring Bolt, HDFS Bolt, WebSocket Bolt) connected by streams.]
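
As a rough illustration of how tuples, streams, spouts, and bolts fit together, the sketch below models them with plain Python generators. It does not use Storm's API and runs in a single process; the function names, the sample data, and the speed limit of 80 are assumptions made up for the example.

```python
import itertools
from collections import namedtuple

# Tuple: a named list of values.
SpeedTuple = namedtuple("SpeedTuple", ["driver", "speed"])


def speed_spout():
    """Spout: an unbounded source of tuples (cycles through sample data forever)."""
    samples = [("driver-a", 70), ("driver-b", 95), ("driver-a", 82)]
    for driver, speed in itertools.cycle(samples):
        yield SpeedTuple(driver, speed)


def speeding_bolt(stream, limit):
    """Bolt: consumes one stream and emits a filtered output stream."""
    for tup in stream:
        if tup.speed > limit:
            yield tup


def alert_bolt(stream):
    """Bolt: turns each incoming tuple into an alert message."""
    for tup in stream:
        yield f"ALERT {tup.driver} at {tup.speed}"


# Topology: spout and bolts wired together into a workflow.
topology = alert_bolt(speeding_bolt(speed_spout(), limit=80))
first_alerts = list(itertools.islice(topology, 2))
print(first_alerts)
```

Because the stream is unbounded, the example pulls only the first two results with `islice`; a real topology would keep running and would distribute spout and bolt instances across a cluster.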

NiFi and Storm Are Complementary
NiFi
Simple event processing
• Manages flow of data between producers and consumers across the enterprise
• Data enrichment, splitting, aggregation, format conversion, schema translation…
• Scale out to handle gigabytes per second, or scale down to a Raspberry Pi handling tens of thousands of events per second
Storm
Complex and distributed processing
• Complex processing from multiple streams (JOIN operations)
• Analyzing data across time windows (rolling window aggregation, standard deviation, etc.)
• Scale out to thousands of nodes if needed

Key Integration Points
• NiFi - Kafka
– NiFi Kafka Producer
– NiFi Kafka Consumer
• Storm - Kafka
– Storm Kafka Consumer
– Storm Kafka Producer

Key Integration Points – NiFi & Kafka
[Diagram: MiNiFi agents → NiFi → Kafka → Consumer 1, Consumer 2, … Consumer N]
• Producer Processors
– PutKafka (0.8 Kafka Client)
– PublishKafka (0.9 Kafka Client)
– PublishKafka_0_10 (0.10 Kafka Client)

Key Integration Points – NiFi & Kafka
[Diagram: Producer 1, Producer 2, … Producer N → Kafka → NiFi → Destination 1, Destination 2, Destination 3]
• Consumer Processors
– GetKafka (0.8 Kafka Client)
– ConsumeKafka (0.9 Kafka Client)
– ConsumeKafka_0_10 (0.10 Kafka Client)

Key Integration Points – Storm & Kafka
• storm-kafka module
– KafkaSpout (Core & Trident) & KafkaBolt
– Compatible with Kafka 0.8 and 0.9 client
– Kafka client declared by topology developer
• storm-kafka-client module
– KafkaSpout & KafkaSpoutTuplesBuilder
– Compatible with Kafka 0.9 and 0.10 client
– Kafka client declared by topology developer
[Diagram: in Storm, KafkaSpout reads from a Kafka incoming topic and KafkaBolt writes to a Kafka results topic.]

Better Together
[Diagram: MiNiFi agents → NiFi → (PublishKafka) Kafka incoming topic → Storm → Kafka results topic → (ConsumeKafka) NiFi → destinations]
• MiNiFi – Collection, filtering, and prioritization at the edge
• NiFi – Central data flow management, routing, enriching, and transformation
• Kafka – Central messaging bus for subscription by downstream consumers
• Storm – Streaming analytics focused on complex event processing

NiFi PublishKafka
[Diagram: Apache NiFi Node 1 and Node 2 each run PublishKafka and send to Apache Kafka (Topic 1, Partition 1); legend: each marker is a concurrent task.]
• Each NiFi node runs an instance of PublishKafka
• Each instance has one or more concurrent tasks (threads)
• Each concurrent task is an independent producer that sends data round-robin to the partitions of a topic
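
The producer model described on this slide can be sketched as follows. This is a toy simulation rather than the Kafka client: `ToyProducer` is a hypothetical class, and the choice of two nodes, two tasks per node, and three partitions is arbitrary. With a real client the partition chosen also depends on the client version, the message key, and the configured partitioner.

```python
import itertools


class ToyProducer:
    """Stand-in for one concurrent task; not a real Kafka client."""

    def __init__(self, num_partitions):
        self._next_partition = itertools.cycle(range(num_partitions))
        self.sent = []

    def send(self, message):
        """Record the message against the next partition in round-robin order."""
        partition = next(self._next_partition)
        self.sent.append((partition, message))
        return partition


# Two NiFi nodes with two concurrent tasks each: four independent producers.
producers = {(node, task): ToyProducer(num_partitions=3)
             for node in (1, 2) for task in (1, 2)}

for i in range(4):
    producers[(1, 1)].send(f"msg-{i}")
producers[(2, 1)].send("other")

print([p for p, _ in producers[(1, 1)].sent])  # [0, 1, 2, 0]
print([p for p, _ in producers[(2, 1)].sent])  # [0]
```

Each task keeps its own position in the rotation, so tasks do not coordinate with one another when picking partitions.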

NiFi ConsumeKafka – Nodes = Partitions
[Diagram: Apache Kafka partitions feeding two ConsumeKafka instances, both in consumer group 1; legend: each marker is a concurrent task.]
• Each NiFi node runs an instance of ConsumeKafka
• Each instance has one or more concurrent tasks (threads)
• Each concurrent task is a consumer assigned to a single partition
• Kafka Client ensures a given partition can only have one consumer/thread in a consumer group

NiFi ConsumeKafka – Nodes > Partitions
[Diagram: Apache Kafka partitions feeding three ConsumeKafka instances, all in consumer group 1; legend: each marker is a concurrent task.]
• Remember… each partition can only have one consumer from the same group
• When there are more NiFi nodes than partitions, some nodes won’t consume anything

NiFi ConsumeKafka – Nodes < Partitions
[Diagram: Apache Kafka partitions feeding two ConsumeKafka instances, both in consumer group 1; legend: each marker is a concurrent task.]
• When there are fewer NiFi nodes/tasks than partitions, multiple partitions will be assigned to each node/task

NiFi ConsumeKafka – Tasks = Partitions
[Diagram: Apache Kafka partitions feeding two ConsumeKafka instances, both in consumer group 1; legend: each marker is a concurrent task.]
• When there are fewer NiFi nodes than partitions, we can increase the concurrent tasks on each node
• Kafka Client will automatically rebalance partition assignment
• Improves throughput

NiFi ConsumeKafka – Tasks > Partitions
[Diagram: Apache Kafka partitions feeding two ConsumeKafka instances, both in consumer group 1; legend: each marker is a concurrent task.]
• Increasing concurrent tasks only makes sense when the number of partitions is greater than the number of nodes
• Otherwise we end up with some tasks not consuming anything
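
The ConsumeKafka scenarios above (nodes equal to, greater than, or fewer than partitions, and tasks equal to or greater than partitions) can be explored with a small simulation. It is a sketch under assumptions: the function names are hypothetical, and the simple round-robin spread used here only approximates what happens in practice, where the Kafka group coordinator and the configured assignment strategy decide the mapping. The one rule it does preserve is that each partition goes to exactly one consumer in the group.

```python
def assign_partitions(num_partitions, num_nodes, tasks_per_node):
    """Spread partitions over (node, task) consumers in one consumer group."""
    # Order consumers so that task 0 on every node is filled before task 1.
    consumers = [(node, task)
                 for task in range(tasks_per_node)
                 for node in range(num_nodes)]
    assignment = {consumer: [] for consumer in consumers}
    for partition in range(num_partitions):
        consumer = consumers[partition % len(consumers)]
        assignment[consumer].append(partition)
    return assignment


def idle_consumers(assignment):
    """Return the consumers that received no partition."""
    return [c for c, parts in assignment.items() if not parts]


nodes_eq = assign_partitions(num_partitions=2, num_nodes=2, tasks_per_node=1)
nodes_gt = assign_partitions(num_partitions=2, num_nodes=3, tasks_per_node=1)
nodes_lt = assign_partitions(num_partitions=4, num_nodes=2, tasks_per_node=1)
tasks_eq = assign_partitions(num_partitions=4, num_nodes=2, tasks_per_node=2)
tasks_gt = assign_partitions(num_partitions=2, num_nodes=2, tasks_per_node=2)

print(idle_consumers(nodes_gt))  # [(2, 0)]: the third node consumes nothing
print(nodes_lt)                  # each node's single task gets two partitions
print(idle_consumers(tasks_gt))  # [(0, 1), (1, 1)]: the extra tasks sit idle
```

Running the five cases side by side shows why the useful upper bound on total consumer tasks is the partition count.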

Kafka Processors & Batching Messages
• PublishKafka – ‘Message Demarcator’
– If not specified, flow file content sent as a single message
– If specified, flow file content separated into multiple messages based on demarcator
– Ex: Sending 1 million messages to Kafka – significantly better performance with 1 flow file containing 1 million demarcated messages vs. 1 million flow files with a single message
• ConsumeKafka – ‘Message Demarcator’
– If not specified, a flow file is produced for each message consumed
– If specified, multiple messages written to a single flow file separated by the demarcator
– Maximum # of messages written to a single flow file equals ‘Max Poll Records’
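
The demarcator behavior on both sides can be sketched in Python as below. This is an illustration of the rules on this slide, not the processors' implementation: the function names are hypothetical, dropping empty segments on the publish side is an assumption, and the value 2 for the poll limit is chosen only to keep the example small.

```python
def to_messages(flow_file_content, demarcator=None):
    """Publish side: turn one flow file's content into message payloads."""
    if demarcator is None:
        return [flow_file_content]          # whole content is one message
    # Assumption: empty segments (e.g. from a trailing demarcator) are dropped.
    return [m for m in flow_file_content.split(demarcator) if m]


def to_flow_files(messages, demarcator, max_poll_records):
    """Consume side: turn polled messages into flow file contents."""
    if demarcator is None:
        return list(messages)               # one flow file per message
    batches = []
    for start in range(0, len(messages), max_poll_records):
        batch = messages[start:start + max_poll_records]
        batches.append(demarcator.join(batch))
    return batches


content = b"a\nb\nc"
print(to_messages(content))                              # [b'a\nb\nc']
print(to_messages(content, b"\n"))                       # [b'a', b'b', b'c']
print(to_flow_files([b"a", b"b", b"c"], b"\n", 2))       # [b'a\nb', b'c']
```

The performance gain comes from handling far fewer flow files for the same number of Kafka messages.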

Best Practice Summary
• PublishKafka
– Each concurrent task is an independent producer
– Scale the number of concurrent tasks according to data flow
• ConsumeKafka
– Kafka client assigns one thread per partition within a consumer group
– Create optimal alignment between # of partitions and # of consumer tasks
– Avoid having more tasks than partitions
• Batching
– Message Demarcator property on PublishKafka and ConsumeKafka
– Can achieve significantly better performance

Summary of the Demo Scenario
[Diagram: Truck Sensors → MiNiFi → NiFi → (PublishKafka) Kafka “Speed Events” → Storm (windowed avg. speed) → Kafka “Average Speed” → (ConsumeKafka) NiFi → Dashboard]
• MiNiFi – Collects data from truck sensors
• NiFi – Filter/enrich truck data, deliver to Kafka, consume results
• Kafka – Central messaging bus, Storm consumes from and publishes to
• Storm – Computes average speed over a time window per driver & route
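
The Storm step in this scenario, an average speed per driver and route over a time window, can be approximated with the Python sketch below. It is not the demo's topology code: it uses fixed (tumbling) windows, whereas the demo may use a different window type, and the function name, the 10-second window, and the sample speeds of 80 and 60 are assumptions for illustration.

```python
from collections import defaultdict


def windowed_average_speed(events, window_seconds):
    """Average speed per (window start, driver, route).

    events: iterable of (epoch_seconds, driver, route, speed) tuples.
    """
    sums = defaultdict(lambda: [0, 0])  # key -> [total speed, event count]
    for ts, driver, route, speed in events:
        window_start = int(ts // window_seconds) * window_seconds
        bucket = sums[(window_start, driver, route)]
        bucket[0] += speed
        bucket[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


driver, route = "George Vetticaden", "Saint Louis to Tulsa"
events = [(0, driver, route, 70), (5, driver, route, 80), (12, driver, route, 60)]
averages = windowed_average_speed(events, window_seconds=10)
print(averages[(0, driver, route)])   # 75.0
print(averages[(10, driver, route)])  # 60.0
```

Keying by driver and route keeps each average separate, which mirrors the "per driver & route" grouping in the scenario.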

Demo – Data Generator
• Geo Event
2016-11-07 10:34:52.922|truck_geo_event|73|10|George Vetticaden|1390372503|Saint Louis to Tulsa|Normal|38.14|-91.3|1|
• Speed Event
2016-11-07 10:34:52.922|truck_speed_event|73|10|George Vetticaden|1390372503|Saint Louis to Tulsa|70|
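
A pipe-delimited speed event like the one above could be parsed along these lines. The slide does not label the fields, so every name in `SPEED_FIELDS` is an assumption inferred from the sample values, and `parse_speed_event` is a hypothetical helper rather than part of the demo.

```python
from datetime import datetime

# Assumed field order; the slide shows values only, not names.
SPEED_FIELDS = ["event_time", "event_type", "truck_id", "driver_id",
                "driver_name", "route_id", "route_name", "speed"]


def parse_speed_event(line):
    """Split one speed event line into a dict keyed by the assumed field names."""
    values = line.rstrip("|\n").split("|")  # drop the trailing delimiter
    if len(values) != len(SPEED_FIELDS):
        raise ValueError(f"expected {len(SPEED_FIELDS)} fields, got {len(values)}")
    event = dict(zip(SPEED_FIELDS, values))
    event["event_time"] = datetime.strptime(event["event_time"],
                                            "%Y-%m-%d %H:%M:%S.%f")
    event["speed"] = int(event["speed"])
    return event


sample = ("2016-11-07 10:34:52.922|truck_speed_event|73|10|George Vetticaden|"
          "1390372503|Saint Louis to Tulsa|70|")
event = parse_speed_event(sample)
print(event["driver_name"], event["route_name"], event["speed"])
```

A geo event has more fields than a speed event, so it would need its own field list; the length check makes that mismatch fail loudly instead of silently mislabeling values.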

Demo – MiNiFi
Processors:
- name: TailFile
  class: org.apache.nifi.processors.standard.TailFile
  ...
  Properties:
    File Location: Local
    File to Tail: /tmp/truck-sensor-data/truck-1.txt
    ...
Connections:
- name: TailFile/success/2042214b-0158-1000-353d-654ef72c7307
  source name: TailFile
  ...
Remote Processing Groups:
- name: http://localhost:9090/nifi
  url: http://localhost:9090/nifi
  ...
  Input Ports:
  - id: 2042214b-0158-1000-353d-654ef72c7307
    name: Truck Events
    ...

Demo - NiFi

Demo - Storm

Demo - Dashboard

Questions?
Hortonworks Community Connection:
Data Ingestion and Streaming
https://community.hortonworks.com/

HDF Kafka Processor Compatibility

Kerberized interaction w/ Kafka       GetKafka         PutKafka
Kafka broker 0.8 (HDP 2.3.2)          Supported        Supported
Kafka broker 0.9 (HDP 2.3.4 +)        Supported        Supported
Kafka broker 0.8 (Apache)             N/A              N/A
Kafka broker 0.9 (Apache)             Not Supported    Not Supported

Non-Kerberized interaction w/ Kafka   GetKafka         PutKafka
Kafka broker 0.8 (HDP 2.3.2)          Supported        Supported
Kafka broker 0.9 (HDP 2.3.4 +)        Supported        Supported
Kafka broker 0.8 (Apache)             Supported        Supported
Kafka broker 0.9 (Apache)             Supported        Supported

SSL interaction w/ Kafka              GetKafka         PutKafka
Kafka broker 0.8 (HDP 2.3.2)          N/A              N/A
Kafka broker 0.9 (HDP 2.3.4 +)        Not Supported    Not Supported

HDF Kafka Processor Compatibility

Kerberized interaction w/ Kafka       ConsumeKafka (2 sets)   PublishKafka (2 sets)
Kafka broker 0.8 (HDP 2.3.2)          Not Supported           Not Supported
Kafka broker 0.9/0.10 (HDP 2.3.4 +)   Supported               Supported
Kafka broker 0.9/0.10 (Apache)        Supported               Supported

Non-Kerberized interaction w/ Kafka   ConsumeKafka (2 sets)   PublishKafka (2 sets)
Kafka broker 0.8 (HDP 2.3.2)          Not Supported           Not Supported

SSL interaction w/ Kafka              ConsumeKafka (2 sets)   PublishKafka (2 sets)
Kafka broker 0.8 (HDP 2.3.2)          N/A                     N/A
