Flume & Kafka
A 360-degree overview
Tanuj Mehta
What is Flume?
Apache Flume is a data-ingestion mechanism for transporting large amounts of
streaming data, such as log data and events, from various sources (web servers,
Kafka, etc.) to a centralized data store such as HDFS.
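As an illustration, a minimal Flume agent configuration is a standard Java-properties file wiring a source to a channel to a sink. This is a sketch: the agent name `a1`, the log path, and the HDFS path are hypothetical examples.

```properties
# Hypothetical agent "a1": tail a web-server log into HDFS.
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1

# Source: follow a log file (path is an example)
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/httpd/access.log
a1.sources.r1.channels = c1

# Channel: buffer events in memory between source and sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# Sink: write events to HDFS
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode/flume/events/%Y-%m-%d
a1.sinks.k1.channel = c1
```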
[Diagram: data generators send log/event data to Flume, which forwards it to centralized storage such as HDFS or HBase]
Flume Event
An event is the basic unit of data transported inside Flume. It contains a
byte-array payload that is carried from the source to the destination,
accompanied by optional headers.
[Diagram: an event consists of optional headers plus a byte payload]
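Conceptually, an event is just optional string headers plus a byte payload. The Python sketch below only models that structure; Flume's real event API is Java, and the class and field names here are invented for illustration.

```python
# Illustrative model of a Flume event: optional headers + byte payload.
# (Flume's actual API is Java; this merely mirrors the structure described above.)

class Event:
    def __init__(self, body: bytes, headers=None):
        self.body = body              # the byte-array payload
        self.headers = headers or {}  # optional metadata headers

# A log line wrapped as an event, tagged with its origin host
evt = Event(b"GET /index.html 200", headers={"host": "web01"})
print(evt.headers["host"], len(evt.body))
```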
Flume Agent
An agent is an independent daemon process (a JVM) in Flume. It receives data
(events) from clients or other agents and forwards them to the next destination
(a sink or another agent).
[Diagram: Source → Channel → Sink]
Flume Components
Source → Interceptor → Channel Selector → Channel → Sink
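To make the component roles concrete, here is a Python sketch (not Flume's real Java interfaces; all function and channel names are invented) of how an interceptor decorates events in flight and how a multiplexing-style channel selector routes each event by a header value:

```python
# Illustrative pipeline: source -> interceptor -> selector -> channel(s).

def timestamp_interceptor(event):
    """Interceptor: enrich the event's headers in flight (fixed value for the sketch)."""
    event["headers"]["timestamp"] = "1700000000"
    return event

def multiplexing_selector(event, channels):
    """Selector: route by a header value, falling back to a default channel."""
    key = event["headers"].get("type", "default")
    return channels.get(key, channels["default"])

channels = {"error": ["error_channel"], "default": ["main_channel"]}
evt = {"headers": {"type": "error"}, "body": b"disk full"}
evt = timestamp_interceptor(evt)
print(multiplexing_selector(evt, channels))  # routes to the error channel
```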
Multi-hop Flow
[Diagram: Sources 1–3 fan in through Channels 1–3, then fan out to centralized storage]
Sink Processor
● Default
● Load Balancing (Multi-Threading)
● Failover
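The three sink-processor strategies above can be sketched as follows. This is a Python model under invented names; in practice Flume selects these strategies declaratively in the agent configuration rather than in code.

```python
import itertools

# Illustrative sink-processor strategies (the real ones are Flume config options).

def default_processor(sinks):
    """Default: a single sink, used as-is."""
    return sinks[0]

def load_balancing_processor(sinks):
    """Load balancing: rotate events across sinks round-robin."""
    cycle = itertools.cycle(sinks)
    return lambda: next(cycle)

def failover_processor(sinks, alive):
    """Failover: pick the highest-priority sink that is still healthy."""
    for sink in sinks:  # sinks assumed ordered by priority
        if alive.get(sink, False):
            return sink
    raise RuntimeError("no live sinks")

pick = load_balancing_processor(["s1", "s2"])
print(pick(), pick(), pick())  # s1 s2 s1
print(failover_processor(["primary", "backup"], {"primary": False, "backup": True}))  # backup
```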
Kafka-Flume Integration
[Diagram: a Kafka broker (Topic 1, Topic 2) feeds a Flume agent; Kafka Channel 1 sinks to HBase and SOLR, Kafka Channel 2 sinks to HDFS. This setup guarantees no data loss, but data duplication is possible.]
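A sketch of wiring a Kafka channel into an agent: because the channel's events live in a Kafka topic, they survive agent restarts (no data loss), though at-least-once delivery means duplicates are possible on retry. The agent name, broker address, and topic below are hypothetical; the channel type is Flume's built-in Kafka channel.

```properties
# Hypothetical agent "a1" using Kafka itself as the channel.
a1.channels = kc1
a1.sinks = hdfs1

a1.channels.kc1.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.kc1.kafka.bootstrap.servers = broker1:9092
a1.channels.kc1.kafka.topic = flume-channel-1

a1.sinks.hdfs1.type = hdfs
a1.sinks.hdfs1.hdfs.path = hdfs://namenode/flume/events
a1.sinks.hdfs1.channel = kc1
```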
Benefits
● Can store data in any centralized store (HDFS, HBase)
● Contextual routing
● Effectively handles peak-hour loads
● Allows reads and writes to operate at different rates
● Reliable, fault-tolerant, scalable, manageable, and customizable
Q&A
