1. Apache Flume
Loading Big Data into Hadoop Cluster using Flume
Swapnil Dubey
Big Data Hacker
GoDataDriven
2. Agenda
➢ What is Apache Flume?
➢ Problem statement
➢ Use Case : Collecting web server logs
➢ Overview/Architecture of Flume
➢ Demos
3. What is Flume?
Collection & Aggregation of Streaming Data
- Typically used for log data.
Advantages over other solutions:
➢ Scalable, Reliable, Customizable
➢ Declarative and Dynamic Configuration
➢ Contextual Routing
➢ Feature Rich and Fully Extensible
➢ Open source
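Because configuration is declarative, an entire agent is described in a plain Java properties file. A minimal single-agent pipeline might look like this (component names are illustrative; the netcat source, memory channel, and logger sink are stock Flume components):

```properties
# Name the components of agent "a1"
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Netcat source: listens on a TCP port, one event per line
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1

# In-memory channel buffering up to 1000 events
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# Logger sink: writes events to the agent's log
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
```

The agent is then started with `flume-ng agent --conf conf --conf-file example.conf --name a1`, and the configuration can be reloaded dynamically when the file changes.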
15. Core Concept: Event
An Event is the basic unit of data transported by Flume from
source to destination.
➢ Payload is opaque to Flume.
➢ Events are accompanied by optional headers.
Headers:
- Headers are a collection of unique key-value pairs.
- Headers are used for contextual routing.
16. Core Concept: Client
Entity that generates events and passes them to one or more
agents.
➢ Example: Flume log4j Appender
➢ Decouples Flume from the system where event data is generated.
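As a concrete client, the Flume log4j appender lets an application ship its log events directly to an agent's Avro source. A sketch of the application-side `log4j.properties` (host, port, and logger name are illustrative, and an Avro source must be listening on the given port):

```properties
# Route application logs to a Flume agent via the Log4jAppender
log4j.appender.flume = org.apache.flume.clients.log4jappender.Log4jAppender
log4j.appender.flume.Hostname = localhost
log4j.appender.flume.Port = 41414

# Attach the appender to the application's logger
log4j.logger.com.example.webapp = INFO, flume
```

This keeps the application decoupled from Flume internals: it only logs through log4j, and the appender handles delivery to the agent.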
18. Core Concepts: Source
Component that receives events and places them onto one or more
channels.
➢ Different types of sources:
- Specialized sources for integrating with well-known systems,
e.g. Syslog, Netcat
- Auto-generating sources: Exec, Seq
- IPC sources for agent-to-agent communication: Avro, Thrift
➢ Requires at least one Channel to function.
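For the web-server log use case, an exec source tailing the access log is a natural fit. A sketch (agent, source, and channel names plus the log path are illustrative):

```properties
# Exec source: run a command and turn each output line into an event
a1.sources = tail1
a1.sources.tail1.type = exec
a1.sources.tail1.command = tail -F /var/log/httpd/access_log
a1.sources.tail1.channels = c1
```

Note that the exec source offers no delivery guarantee if the agent dies mid-read; for stronger guarantees a spooling-directory or IPC source is preferred.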
19. Core Concept: Channel
Component that buffers incoming events which are ultimately
consumed by Sinks.
➢ Different channel types: Memory, File, Database
➢ Channels are fully transactional.
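A durable file channel trades some throughput for persistence across agent restarts. A sketch (directories are illustrative):

```properties
# File channel: events survive agent restarts
a1.channels = c1
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /var/flume/checkpoint
a1.channels.c1.dataDirs = /var/flume/data
```

The memory channel is faster but loses buffered events if the agent process dies, so the choice of channel is a durability/throughput trade-off.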
20. Core Concepts: Sink
Component that removes events from a channel and transmits them to
the next-hop destination.
Different types of sinks:
- Terminal sinks: HDFS, HBase
- Auto-consuming sinks: Null sink
- IPC sinks for agent-to-agent communication: Avro, Thrift
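A terminal HDFS sink, the common endpoint when loading logs into a Hadoop cluster, might be configured like this (path and namenode are illustrative):

```properties
# HDFS sink: write events into date-bucketed directories
a1.sinks = k1
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode/flume/events/%Y-%m-%d
a1.sinks.k1.hdfs.fileType = DataStream
```

The `%Y-%m-%d` escape sequences are resolved from event headers, which is why a timestamp header (e.g. from the timestamp interceptor) is typically required.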
21. Core Concepts: Interceptor
Interceptors are applied to sources in a predetermined fashion to
enable adding information to and filtering of events.
➢ Built-in interceptors: allow adding headers such as timestamps,
static markers, etc.
➢ Custom interceptors: create headers by inspecting the event.
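The built-in timestamp and static interceptors can be chained on a source like this (names and the key/value are illustrative):

```properties
# Interceptor chain: applied in the listed order
a1.sources.r1.interceptors = ts static1

# Timestamp interceptor: adds a "timestamp" header to each event
a1.sources.r1.interceptors.ts.type = timestamp

# Static interceptor: adds a fixed marker header
a1.sources.r1.interceptors.static1.type = static
a1.sources.r1.interceptors.static1.key = datacenter
a1.sources.r1.interceptors.static1.value = NYC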
22. Channel Selector
It facilitates selection of one or more Channels, based on preset
criteria.
➢ Built in Channel Selectors:
- Replicating: for duplicating events
- Multiplexing: for routing based on headers.
➢ Custom selectors can be written for dynamic criteria.
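A multiplexing selector routes each event by a header value, which is how contextual routing is wired up in practice. A sketch (header name and mappings are illustrative, e.g. set by a static interceptor upstream):

```properties
# Source feeds two channels; the selector picks one per event
a1.sources.r1.channels = c1 c2
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = datacenter
a1.sources.r1.selector.mapping.NYC = c1
a1.sources.r1.selector.mapping.AMS = c2
a1.sources.r1.selector.default = c1
```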
23. Sink Processor
Sink Processor is responsible for invoking one sink from a
specified group of Sinks.
➢ Built in Sink Processors:
- Load Balancing Sink Processor.
- Failover Sink Processor
- Default Sink Processor.
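The failover sink processor, for example, is configured over a sink group with per-sink priorities (group and sink names are illustrative):

```properties
# Sink group with failover: highest-priority live sink is used
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 10
a1.sinkgroups.g1.processor.priority.k2 = 5
# Max back-off (ms) before a failed sink is retried
a1.sinkgroups.g1.processor.maxpenalty = 10000
```

The default sink processor handles the common single-sink case and requires no sink group at all.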
25. Data Drain
➢ Event Removal from Channel is transactional.
[Diagram: the Sink Runner invokes the Sink Processor, which selects and
invokes a sink; the sink takes events from its channel and sends them
to the next hop.]
27. Assured Delivery
Agents use transactional exchange to guarantee
delivery across hops.
[Diagram: at each hop, the sink starts a transaction on its channel,
takes events, and sends them to the next hop's source; the transaction
is ended (committed) only after the downstream channel has accepted the
events, otherwise the events remain in the upstream channel.]
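A multi-hop chain is typically built by pairing an Avro sink on the upstream agent with an Avro source on the downstream agent; each handoff is wrapped in the channel transactions described above. A sketch (hostnames and ports are illustrative):

```properties
# Upstream agent a1: Avro sink pointing at the next hop
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = collector.example.com
a1.sinks.k1.port = 4545

# Downstream agent a2: Avro source receiving from upstream
a2.sources.r1.type = avro
a2.sources.r1.bind = 0.0.0.0
a2.sources.r1.port = 4545
a2.sources.r1.channels = c1
```

If the downstream channel rejects the batch, the upstream sink rolls back its transaction and the events stay buffered in the upstream channel, which is what makes delivery across hops assured.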