Hortonworks Data in Motion Webinar Series, Part 7: Apache Kafka and NiFi Better Together
© Hortonworks Inc. 2011 – 2016. All Rights Reserved
Harnessing Data-in-Motion
with Hortonworks DataFlow
Apache NiFi, Kafka and Storm
Better Together
Bryan Bende
Sr. Software Engineer
Haimo Liu
Product Manager
Agenda
• Introduction to Hortonworks DataFlow
• Introduction to Apache projects
• Better together
• Best Practices
• Demo
Connected Data Platforms
HDF 2.0 – Data in Motion Platform
[Diagram: Flow Management, Stream Processing, and Enterprise Services, with Security and Visualization; deployable at the edge, on premises, and in the cloud. Enterprise services cover Registries/Catalogs, Governance (Security/Compliance), and Operations.]
HDF 2.0 – Data in Motion Platform
[Diagram: data in motion flows from IoT data sources through MiNiFi agents into NiFi (flow management), then through NiFi, Kafka, Storm, and others (flow management + stream processing), to data at rest in AWS, Azure, Google Cloud, and Hadoop. Enterprise services: Ambari, Ranger, other services.]
What is Apache NiFi?
• Created to address the challenges of global enterprise dataflow
• Key features:
– Visual Command and Control
– Data Lineage (Provenance)
– Data Prioritization
– Data Buffering/Back-Pressure
– Control Latency vs. Throughput
– Secure Control Plane / Data Plane
– Scale Out Clustering
– Extensibility
Apache NiFi
What is Apache NiFi used for?
• Reliable and secure transfer of data between systems
• Delivery of data from sources to analytic platforms
• Enrichment and preparation of data:
– Conversion between formats
– Extraction/Parsing
– Routing decisions
What is Apache NiFi NOT used for?
• Distributed Computation
• Complex Event Processing
• Complex Rolling Window Operations
NiFi Terminology
FlowFile
• Unit of data moving through the system
• Content + Attributes (key/value pairs)
Processor
• Performs the work, can access FlowFiles
Connection
• Links between processors
• Queues that can be dynamically prioritized
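The three terms above can be pictured with a small Python sketch. This is not NiFi's API: the `FlowFile` and `Connection` classes, the `priority` attribute, and `uppercase_processor` are hypothetical names chosen only to illustrate content plus attributes, a processor doing work, and a prioritized queue between processors.

```python
import heapq
import itertools
from dataclasses import dataclass, field


@dataclass
class FlowFile:
    """Illustrative stand-in for a FlowFile: content plus key/value attributes."""
    content: bytes
    attributes: dict = field(default_factory=dict)


class Connection:
    """A queue between two processors, ordered by a pluggable prioritizer."""

    def __init__(self, prioritizer):
        self._prioritizer = prioritizer      # lower value is dequeued first
        self._heap = []
        self._counter = itertools.count()    # tie-breaker keeps FIFO order

    def enqueue(self, flow_file):
        key = (self._prioritizer(flow_file), next(self._counter))
        heapq.heappush(self._heap, (key, flow_file))

    def dequeue(self):
        return heapq.heappop(self._heap)[1]

    def __len__(self):
        return len(self._heap)


def uppercase_processor(flow_file):
    """A toy processor: transforms the content and records an attribute."""
    attributes = {**flow_file.attributes, "processed": "true"}
    return FlowFile(flow_file.content.upper(), attributes)


# Hypothetical "priority" attribute: smaller numbers leave the queue first.
queue = Connection(prioritizer=lambda ff: int(ff.attributes.get("priority", "5")))
queue.enqueue(FlowFile(b"routine reading", {"priority": "5"}))
queue.enqueue(FlowFile(b"engine alarm", {"priority": "1"}))
first = uppercase_processor(queue.dequeue())
print(first.content, first.attributes)
```

The alarm is dequeued before the routine reading even though it arrived later, which is the effect a prioritizer on a connection is meant to have.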
What is Apache Kafka?
• Distributed streaming platform that allows publishing and subscribing to streams of records
• Streams of records are organized into categories called topics
• Topics can be partitioned and/or replicated
• Records consist of a key, value, and timestamp
http://kafka.apache.org/intro
[Diagram: three producers publish to the Kafka cluster; three consumers read from it]
Kafka: Anatomy of a Topic
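[Diagram: a topic with Partition 0, Partition 1, and Partition 2. Each partition is an ordered sequence of records numbered from 0 (old) upward (new); the partitions hold different numbers of records, and writes are appended at the new end.]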
• Partitioning allows topics to scale beyond a single machine/node
• Topics can also be replicated for high availability
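A toy in-memory model can make the anatomy concrete. The sketch below is an assumption-laden illustration, not Kafka's implementation: the `Topic` and `Record` classes are hypothetical, replication is left out, and the CRC32-based partition choice merely stands in for whatever partitioner a real client uses.

```python
import time
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    key: bytes
    value: bytes
    timestamp: float


class Topic:
    """A topic as a list of append-only partition logs."""

    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, value):
        # Illustrative partition choice: hash the key, modulo partition count.
        partition = zlib.crc32(key) % len(self.partitions)
        log = self.partitions[partition]
        log.append(Record(key, value, time.time()))
        return partition, len(log) - 1       # offset = position in that log

    def read(self, partition, offset):
        return self.partitions[partition][offset]


topic = Topic("truck_events", num_partitions=3)   # hypothetical topic name
p1, o1 = topic.append(b"truck-73", b"speed=70")
p2, o2 = topic.append(b"truck-73", b"speed=72")
print(p1 == p2, o1, o2)
```

Records with the same key land in the same partition, and each partition numbers its records independently, which matches the per-partition offsets shown on the slide.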
NiFi and Kafka Are Complementary
NiFi
Provides a dataflow solution
• Centralized management, from edge to core
• Great traceability: event-level data provenance starting when data is born
• Interactive command and control with real-time operational visibility
• Dataflow management, including prioritization, back pressure, and edge intelligence
• Visual representation of global dataflow
Kafka
Provides a durable stream store
• Low latency
• Distributed data durability
• Decentralized management of producers and consumers
What is Apache Storm?
• Distributed, low-latency, fault-tolerant, Stream Processing platform.
• Provides processing guarantees.
• Key concepts include:
– Tuples
– Streams
– Spouts
– Bolts
– Topology
Storm - Tuples and Streams
• What is a Tuple?
–Fundamental data structure in Storm
–Named list of values that can be of any data type
• What is a Stream?
–An unbounded sequence of tuples
–The core abstraction in Storm; streams are what you “process”
Storm - Spouts
• What is a Spout?
–Source of data
–E.g.: JMS, Twitter, Log, Kafka Spout
–Can spin up multiple instances of a Spout and dynamically adjust as needed
Storm - Bolts
• What is a Bolt?
–Processes any number of input streams and produces output streams
–Common processing in bolts includes functions, aggregations, joins, reads/writes to data stores, and alerting logic
–Can spin up multiple instances of a Bolt and dynamically adjust as needed
Storm - Topology
• What is a Topology?
–A network of spouts and bolts wired together into a workflow
[Diagram: Truck-Event-Processor Topology. A Kafka Spout emits streams to an HBase Bolt, a Monitoring Bolt, an HDFS Bolt, and a WebSocket Bolt.]
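The relationship between tuples, streams, spouts, bolts, and a topology can be sketched with plain Python generators. This is not the Storm API, and it runs in a single process; the `SpeedTuple` fields, the function names, and the speed limit are hypothetical and only mirror the roles described above.

```python
from collections import namedtuple

# A tuple is a named list of values; these field names are hypothetical.
SpeedTuple = namedtuple("SpeedTuple", ["driver", "speed"])


def speed_spout(raw_events):
    """Spout: the source of a stream, emitting one tuple per raw event."""
    for driver, speed in raw_events:
        yield SpeedTuple(driver, speed)


def speeding_filter_bolt(stream, limit):
    """Bolt: consumes an input stream and emits a filtered output stream."""
    for tup in stream:
        if tup.speed > limit:
            yield tup


def alert_bolt(stream):
    """Bolt: terminal step that turns tuples into alert messages."""
    return [f"{tup.driver} exceeded the limit at speed {tup.speed}" for tup in stream]


def run_topology(raw_events, limit=65):
    """Topology: the spout and bolts wired together into one workflow."""
    return alert_bolt(speeding_filter_bolt(speed_spout(raw_events), limit))


alerts = run_topology([("driver-10", 70), ("driver-11", 55)])
print(alerts)
```

A real topology would run many instances of each spout and bolt across a cluster; the generator chain only shows how streams connect the pieces.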
NiFi and Storm Are Complementary
NiFi
Simple event processing
• Manages flow of data between producers and consumers across the enterprise
• Data enrichment, splitting, aggregation, format conversion, schema translation…
• Scale out to handle gigabytes per second, or scale down to a Raspberry Pi handling tens of thousands of events per second
Storm
Complex and distributed processing
• Complex processing from multiple streams (JOIN operations)
• Analyzing data across time windows (rolling window aggregation, standard deviation, etc.)
• Scale out to thousands of nodes if needed
Key Integration Points
• NiFi - Kafka
– NiFi Kafka Producer
– NiFi Kafka Consumer
• Storm - Kafka
– Storm Kafka Consumer
– Storm Kafka Producer
Key Integration Points – NiFi & Kafka
[Diagram: MiNiFi agents feed NiFi, which publishes to Kafka; Consumer 1 through Consumer N read from Kafka]
• Producer Processors
– PutKafka (0.8 Kafka Client)
– PublishKafka (0.9 Kafka Client)
– PublishKafka_0_10 (0.10 Kafka Client)
Key Integration Points – NiFi & Kafka
[Diagram: Producer 1 through Producer N write to Kafka; NiFi consumes from Kafka and delivers to Destination 1, Destination 2, and Destination 3]
• Consumer Processors
– GetKafka (0.8 Kafka Client)
– ConsumeKafka (0.9 Kafka Client)
– ConsumeKafka_0_10 (0.10 Kafka Client)
Key Integration Points – Storm & Kafka
• storm-kafka module
– KafkaSpout (Core & Trident) & KafkaBolt
– Compatible with Kafka 0.8 and 0.9 client
– Kafka client declared by topology developer
• storm-kafka-client module
– KafkaSpout & KafkaSpoutTuplesBuilder
– Compatible with Kafka 0.9 and 0.10 client
– Kafka client declared by topology developer
[Diagram: in Storm, the KafkaSpout reads from the Incoming Topic in Kafka, and the KafkaBolt writes to the Results Topic]
Better Together
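[Diagram: MiNiFi agents feed NiFi; PublishKafka writes to the Incoming Topic in Kafka; Storm reads it and writes to the Results Topic; ConsumeKafka reads the results back into NiFi for delivery to destinations]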
• MiNiFi – Collection, filtering, and prioritization at the edge
• NiFi - Central data flow management, routing, enriching, and transformation
• Kafka - Central messaging bus for subscription by downstream consumers
• Storm - Streaming analytics focused on complex event processing
NiFi PublishKafka
[Diagram: Apache NiFi Node 1 and Node 2 each run PublishKafka with its concurrent tasks, writing to Topic 1 Partition 1 and Partition 2 in Apache Kafka]
• Each NiFi node runs an instance of PublishKafka
• Each instance has one or more concurrent tasks (threads)
• Each concurrent task is an independent producer that sends data round-robin to the partitions of a topic
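The fan-out described above can be simulated in a few lines. This is a simplified model rather than the Kafka producer: every task simply rotates through the partitions, which is an assumption about keyless messages, and `publish_round_robin` and `simulate_cluster` are hypothetical helper names.

```python
from collections import Counter
from itertools import cycle


def publish_round_robin(messages, num_partitions):
    """One concurrent task: cycle through partitions, one message per step."""
    partitions = cycle(range(num_partitions))
    return [(next(partitions), message) for message in messages]


def simulate_cluster(nodes, tasks_per_node, messages_per_task, num_partitions):
    """Every task on every node is its own producer with its own rotation."""
    per_partition = Counter()
    for node in range(nodes):
        for task in range(tasks_per_node):
            batch = [f"n{node}-t{task}-m{i}" for i in range(messages_per_task)]
            for partition, _ in publish_round_robin(batch, num_partitions):
                per_partition[partition] += 1
    return dict(per_partition)


counts = simulate_cluster(nodes=2, tasks_per_node=2,
                          messages_per_task=10, num_partitions=2)
print(counts)
```

Because each task rotates independently, adding nodes or concurrent tasks adds producers without any coordination between them, and the load still spreads evenly across partitions.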
NiFi ConsumeKafka – Nodes = Partitions
[Diagram: Apache NiFi Node 1 and Node 2 each run ConsumeKafka (consumer group 1) against Topic 1, which has Partition 1 and Partition 2 in Apache Kafka]
• Each NiFi node runs an instance of ConsumeKafka
• Each instance has one or more concurrent tasks (threads)
• Each concurrent task is a consumer assigned to a single partition
• Kafka Client ensures a given partition can only have one consumer/thread in a consumer group
NiFi ConsumeKafka – Nodes > Partitions
[Diagram: Apache NiFi Node 1, Node 2, and Node 3 each run ConsumeKafka (consumer group 1) against Topic 1, which has only Partition 1 and Partition 2]
• Remember… each partition can only have one consumer from the same group
• When there are more NiFi nodes than partitions, some nodes won’t consume anything
NiFi ConsumeKafka – Nodes < Partitions
[Diagram: Apache NiFi Node 1 and Node 2 each run ConsumeKafka (consumer group 1) against Topic 1, which has Partitions 1 through 4]
• When there are fewer NiFi nodes/tasks than partitions, multiple partitions will be assigned to each node/task
NiFi ConsumeKafka – Tasks = Partitions
[Diagram: Apache NiFi Node 1 and Node 2 each run ConsumeKafka (consumer group 1) with additional concurrent tasks against Topic 1, which has Partitions 1 through 4]
• When there are fewer NiFi nodes than partitions, we can increase the concurrent tasks on each node
• Kafka Client will automatically rebalance partition assignment
• Improves throughput
NiFi ConsumeKafka – Tasks > Partitions
[Diagram: Apache NiFi Node 1 and Node 2 each run ConsumeKafka (consumer group 1) with multiple concurrent tasks against Topic 1, which has only Partition 1 and Partition 2]
• Increasing concurrent tasks only makes sense when the number of partitions is greater than the number of nodes
• Otherwise we end up with some tasks not consuming anything
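The ConsumeKafka scenarios above (nodes equal to, greater than, or fewer than partitions, and tasks equal to or greater than partitions) can be reproduced with a small simulation. It uses a plain round-robin spread as a stand-in for the Kafka client's real assignment strategy, so which particular task ends up idle is an assumption; `assign_partitions` and `idle_tasks` are hypothetical helpers.

```python
def assign_partitions(num_partitions, nodes, tasks_per_node):
    """Spread partitions over consumer tasks; each partition gets exactly one task."""
    # Interleave tasks across nodes so the first task on every node is served first.
    tasks = [f"node{n}-task{t}"
             for t in range(1, tasks_per_node + 1)
             for n in range(1, nodes + 1)]
    assignment = {task: [] for task in tasks}
    for partition in range(1, num_partitions + 1):
        owner = tasks[(partition - 1) % len(tasks)]
        assignment[owner].append(partition)
    return assignment


def idle_tasks(assignment):
    """Tasks in the consumer group that were given no partition."""
    return [task for task, owned in assignment.items() if not owned]


scenarios = {
    "nodes = partitions": (2, 2, 1),
    "nodes > partitions": (2, 3, 1),
    "nodes < partitions": (4, 2, 1),
    "tasks = partitions": (4, 2, 2),
    "tasks > partitions": (2, 2, 2),
}
for label, (partitions, nodes, tasks) in scenarios.items():
    result = assign_partitions(partitions, nodes, tasks)
    print(label, result, "idle:", idle_tasks(result))
```

The idle lists are empty only when the total number of tasks does not exceed the number of partitions, which is the alignment the slides recommend.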
Kafka Processors & Batching Messages
• PublishKafka - ‘Message Demarcator’
– If not specified, flow file content sent as a single message
– If specified, flow file content separated into multiple messages based on demarcator
– Ex: Sending 1 million messages to Kafka – significantly better performance with 1 flow file containing 1 million demarcated messages vs. 1 million flow files with a single message
• ConsumeKafka - ‘Message Demarcator’
– If not specified, a flow file is produced for each message consumed
– If specified, multiple messages written to a single flow file separated by the demarcator
– Maximum # of messages written to a single flow file equals ‘Max Poll Records’
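The demarcator behavior on both sides can be sketched as two small functions. These are illustrative helpers, not the processors' code: `split_on_demarcator`, `bundle_messages`, the default cap of 10000, and the choice to drop empty segments are all assumptions made for the example.

```python
def split_on_demarcator(flow_file_content, demarcator=None):
    """Publish side: one flow file becomes one message, or many if a demarcator is set."""
    if demarcator is None:
        return [flow_file_content]
    # Dropping empty segments (e.g. after a trailing demarcator) is a choice of this sketch.
    return [m for m in flow_file_content.split(demarcator) if m]


def bundle_messages(messages, demarcator=None, max_poll_records=10000):
    """Consume side: one flow file per message, or demarcated bundles capped per poll."""
    if demarcator is None:
        return list(messages)
    return [
        demarcator.join(messages[i:i + max_poll_records])
        for i in range(0, len(messages), max_poll_records)
    ]


print(split_on_demarcator(b"a\nb\nc", demarcator=b"\n"))
print(bundle_messages([b"a", b"b", b"c"], demarcator=b"\n", max_poll_records=2))
```

With a demarcator, the number of flow files moving through the flow drops from one per message to one per batch, which is where the performance gain on the slide comes from.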
Best Practice Summary
• PublishKafka
– Each concurrent task is an independent producer
– Scale number of concurrent tasks according to data flow
• ConsumeKafka
– Kafka client assigns each partition to one thread within a consumer group
– Create optimal alignment between # of partitions and # of consumer tasks
– Avoid having more tasks than partitions
• Batching
– Message Demarcator property on PublishKafka and ConsumeKafka
– Can achieve significantly better performance
Summary of the Demo Scenario
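[Diagram: truck sensors feed MiNiFi and then NiFi; PublishKafka writes Speed Events to Kafka; Storm computes the windowed average speed and writes Average Speed back to Kafka; ConsumeKafka reads it into NiFi, which feeds a dashboard]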
• MiNiFi – Collects data from truck sensors
• NiFi – Filter/enrich truck data, deliver to Kafka, consume results
• Kafka - Central messaging bus, Storm consumes from and publishes to
• Storm – Computes average speed over a time window per driver & route
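The Storm step of the demo, an average speed per driver and route over a time window, can be approximated in plain Python. The slides do not say which window type the demo topology uses, so the fixed, non-overlapping window here is an assumption, as are the function name and the event tuple layout.

```python
from collections import defaultdict


def windowed_average_speed(events, window_seconds):
    """Average speed per (driver, route) within fixed, non-overlapping windows.

    Each event is (timestamp_seconds, driver, route, speed).
    """
    sums = defaultdict(lambda: [0.0, 0])   # (window_start, driver, route) -> [total, count]
    for timestamp, driver, route, speed in events:
        window_start = int(timestamp // window_seconds) * window_seconds
        bucket = sums[(window_start, driver, route)]
        bucket[0] += speed
        bucket[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


events = [
    (0, "George Vetticaden", "Saint Louis to Tulsa", 70),
    (30, "George Vetticaden", "Saint Louis to Tulsa", 80),
    (65, "George Vetticaden", "Saint Louis to Tulsa", 60),
]
for key, average in windowed_average_speed(events, window_seconds=60).items():
    print(key, average)
```

The first two events fall into the window starting at 0 seconds and the third into the window starting at 60 seconds, so two averages are produced for the same driver and route.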
Demo – Data Generator
Geo Event
2016-11-07 10:34:52.922|truck_geo_event|73|10|George Vetticaden|1390372503|Saint Louis to Tulsa|Normal|38.14|-91.3|1|
Speed Event
2016-11-07 10:34:52.922|truck_speed_event|73|10|George Vetticaden|1390372503|Saint Louis to Tulsa|70|
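A minimal parser for these pipe-delimited events might look like the sketch below. The slides do not name the fields, so every name in `COMMON_FIELDS` is a guess based on the sample values, and fields after the route name are kept as an uninterpreted `payload` list.

```python
from datetime import datetime

# Hypothetical field names inferred from the sample events; the deck does not define them.
COMMON_FIELDS = ["event_time", "event_type", "truck_id", "driver_id",
                 "driver_name", "route_id", "route_name"]


def parse_event(line):
    """Split one pipe-delimited event into named common fields plus a raw payload."""
    parts = line.rstrip("\n").split("|")
    if parts and parts[-1] == "":
        parts = parts[:-1]                 # drop the empty field after the trailing pipe
    event = dict(zip(COMMON_FIELDS, parts))
    event["event_time"] = datetime.strptime(event["event_time"], "%Y-%m-%d %H:%M:%S.%f")
    event["payload"] = parts[len(COMMON_FIELDS):]   # event-specific values, left as strings
    return event


speed = parse_event("2016-11-07 10:34:52.922|truck_speed_event|73|10|"
                    "George Vetticaden|1390372503|Saint Louis to Tulsa|70|")
print(speed["event_type"], speed["driver_name"], speed["payload"])
```

Routing on `event_type` is then enough to separate geo events from speed events before they are published to Kafka.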
Demo – MiNiFi
Processors:
  - name: TailFile
    class: org.apache.nifi.processors.standard.TailFile
    ...
    Properties:
      File Location: Local
      File to Tail: /tmp/truck-sensor-data/truck-1.txt
      ...
Connections:
  - name: TailFile/success/2042214b-0158-1000-353d-654ef72c7307
    source name: TailFile
    ...
Remote Processing Groups:
  - name: http://localhost:9090/nifi
    url: http://localhost:9090/nifi
    ...
    Input Ports:
      - id: 2042214b-0158-1000-353d-654ef72c7307
        name: Truck Events
        ...
Questions?
Hortonworks Community Connection:
Data Ingestion and Streaming
https://community.hortonworks.com/
HDF Kafka Processor Compatibility

Kerberized interaction w/Kafka        GetKafka        PutKafka
Kafka broker 0.8 (HDP 2.3.2)          Supported       Supported
Kafka broker 0.9 (HDP 2.3.4 +)        Supported       Supported
Kafka broker 0.8 (Apache)             N/A             N/A
Kafka broker 0.9 (Apache)             Not Supported   Not Supported

Non-Kerberized interaction w/Kafka    GetKafka        PutKafka
Kafka broker 0.8 (HDP 2.3.2)          Supported       Supported
Kafka broker 0.9 (HDP 2.3.4 +)        Supported       Supported
Kafka broker 0.8 (Apache)             Supported       Supported
Kafka broker 0.9 (Apache)             Supported       Supported

SSL Interaction w/ Kafka              GetKafka        PutKafka
Kafka broker 0.8 (HDP 2.3.2)          N/A             N/A
Kafka broker 0.9 (HDP 2.3.4 +)        Not Supported   Not Supported
Kafka broker 0.8 (Apache)             N/A             N/A
Kafka broker 0.9 (Apache)             Not Supported   Not Supported
HDF Kafka Processor Compatibility

Kerberized interaction w/Kafka         ConsumeKafka (2 sets)   PublishKafka (2 sets)
Kafka broker 0.8 (HDP 2.3.2)           Not Supported           Not Supported
Kafka broker 0.9/0.10 (HDP 2.3.4 +)    Supported               Supported
Kafka broker 0.8 (Apache)              N/A                     N/A
Kafka broker 0.9/0.10 (Apache)         Supported               Supported

Non-Kerberized interaction w/Kafka     ConsumeKafka (2 sets)   PublishKafka (2 sets)
Kafka broker 0.8 (HDP 2.3.2)           Not Supported           Not Supported
Kafka broker 0.9/0.10 (HDP 2.3.4 +)    Supported               Supported
Kafka broker 0.8 (Apache)              Not Supported           Not Supported
Kafka broker 0.9/0.10 (Apache)         Supported               Supported

SSL Interaction w/ Kafka               ConsumeKafka (2 sets)   PublishKafka (2 sets)
Kafka broker 0.8 (HDP 2.3.2)           N/A                     N/A
Kafka broker 0.9/0.10 (HDP 2.3.4 +)    Supported               Supported
Kafka broker 0.8 (Apache)              N/A                     N/A
Kafka broker 0.9/0.10 (Apache)         Supported               Supported
Editor's Notes
• Hortonworks: Powering the Future of Data
• Since each ConsumeKafka is part of the same group, and there are more ConsumeKafka instances than partitions, one of them doesn’t have anything to do.
• If we increase the concurrent tasks beyond the number of partitions, then some tasks have nothing to do.