Loading Big Data into Hadoop Cluster using Flume
Big Data Hacker
➢ What is Apache Flume?
➢ Problem statement
➢ Use Case: Collecting web server logs
➢ Overview/Architecture of Flume
What is Flume?
Collection & Aggregation of Streaming Data
- Typically used for log data.
Advantages over other solutions:
➢ Scalable, Reliable, Customizable
➢ Declarative and Dynamic Configuration
➢ Contextual Routing
➢ Feature Rich and Fully Extensible
➢ Open source
Core Concept: Event
An Event is the basic unit of data transported by Flume from
source to destination.
➢ Payload is opaque to Flume.
➢ Events are accompanied by optional headers.
- Headers are collections of unique key-value pairs
- Headers are used for contextual routing
Core Concept: Client
Entity that generates events and sends them to one or more Agents.
➢ Example: Flume log4j Appender
➢ Decouples Flume from the system where event data is generated.
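As a sketch, an application can be pointed at a Flume agent through the log4j appender that ships with Flume; the hostname and port below are placeholders for wherever the agent's Avro source is listening:

```properties
# log4j.properties (client side): route application logs to a Flume agent.
# Hostname and Port are placeholders for the receiving agent's Avro source.
log4j.rootLogger = INFO, flume
log4j.appender.flume = org.apache.flume.clients.log4jappender.Log4jAppender
log4j.appender.flume.Hostname = flume-agent.example.com
log4j.appender.flume.Port = 41414
```

With this appender in place, the application itself needs no Flume-specific code, which is exactly the decoupling described above.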
Core Concept: Agent
Container for hosting Sources, Channels, Sinks, and other
components that enable the movement of events from one place to
another.
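A minimal agent definition can be sketched as a properties file; the names `a1`, `r1`, `c1`, and `k1` are arbitrary placeholders:

```properties
# agent.conf: agent "a1" hosting one source, one channel, and one sink.
a1.sources = r1
a1.channels = c1
a1.sinks = k1
# Wire the source and sink to the channel.
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```

An agent configured this way would typically be launched with something like `flume-ng agent --conf conf --conf-file agent.conf --name a1`.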
Core Concept: Source
Component that receives events and places them onto one or more
Channels.
➢ Different types of sources:
- Specialized sources for integrating with well-known systems;
for example, Syslog and Netcat
- Auto-generating sources: Exec, SEQ
- IPC Sources for Agent to Agent communication: Avro, Thrift
➢ Requires at least one Channel to function.
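As one concrete sketch, a Netcat source listening on a TCP port (the port number and channel name are placeholders):

```properties
# Specialized source example: accept newline-terminated events over TCP.
a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 44444
# A source requires at least one channel to function.
a1.sources.r1.channels = c1
```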
Core Concept: Channel
Component that buffers incoming events which are ultimately
consumed by Sinks.
➢ Different channel types: Memory, File, JDBC (database)
➢ Channels are fully transactional.
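A memory channel trades durability for speed; a file channel survives agent restarts. A hedged sketch of both (capacities and paths are illustrative):

```properties
# Memory channel: fast, but events are lost if the agent dies.
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 100

# File channel alternative: durable, backed by local disk.
# a1.channels.c1.type = file
# a1.channels.c1.checkpointDir = /var/flume/checkpoint
# a1.channels.c1.dataDirs = /var/flume/data
```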
Core Concept: Sink
Component that takes events from channel and transmits them to
next hop destination.
➢ Different types of Sinks:
- Terminal Sinks: HDFS, HBase
- Auto-consuming Sinks: Null Sink
- IPC Sinks for Agent-to-Agent communication: Avro, Thrift
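A terminal HDFS sink might look like the following sketch; the NameNode address and roll interval are placeholders, and the date escapes in the path assume each event carries a timestamp header (e.g. added by an interceptor):

```properties
# Terminal sink example: write events into HDFS, bucketed by date.
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/logs/%Y-%m-%d
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.rollInterval = 300
a1.sinks.k1.channel = c1
```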
Core Concept: Interceptor
Interceptors are applied to Sources in a predetermined order to
enable decorating and filtering of events.
➢ Built-in Interceptors: allow adding headers such as timestamps and static markers
➢ Custom Interceptors: can create headers by inspecting the Event
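Chaining two built-in interceptors on a source could be sketched like this; the header key and value on the static interceptor are illustrative:

```properties
# Interceptors on source r1: a timestamp header plus a static marker.
a1.sources.r1.interceptors = i1 i2
a1.sources.r1.interceptors.i1.type = timestamp
a1.sources.r1.interceptors.i2.type = static
a1.sources.r1.interceptors.i2.key = datacenter
a1.sources.r1.interceptors.i2.value = us-east-1
```

Headers added here become available downstream, e.g. for contextual routing by a channel selector.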
Core Concept: Channel Selector
A Channel Selector facilitates selection of one or more Channels,
based on preset criteria.
➢ Built-in Channel Selectors:
- Replicating: for duplicating events
- Multiplexing: for routing based on headers
➢ Custom selectors can be written for dynamic criteria.
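A multiplexing selector routing on a hypothetical `datacenter` header might be configured along these lines:

```properties
# Multiplexing selector: route events by the value of a "datacenter" header.
a1.sources.r1.channels = c1 c2
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = datacenter
a1.sources.r1.selector.mapping.us-east-1 = c1
a1.sources.r1.selector.mapping.us-west-2 = c2
a1.sources.r1.selector.default = c1
```

Events whose header matches a mapping go to that channel; everything else falls through to the default.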
Core Concept: Sink Processor
A Sink Processor is responsible for invoking one Sink from a
specified group of Sinks.
➢ Built in Sink Processors:
- Load Balancing Sink Processor.
- Failover Sink Processor
- Default Sink Processor.
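A failover sink processor over two sinks can be sketched as follows; the priorities and penalty value are placeholders:

```properties
# Failover sink processor: prefer k1; fall back to k2 if k1 fails.
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 10
a1.sinkgroups.g1.processor.priority.k2 = 5
a1.sinkgroups.g1.processor.maxpenalty = 10000
```

Swapping `processor.type` to `load_balance` would instead spread events across the group.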
➢ Event removal from a Channel is transactional.
➢ Agents use a transactional exchange (Source → Channel → Sink) to
guarantee delivery across hops to the next-hop destination.
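A multi-hop pipeline pairs an Avro sink on one agent with an Avro source on the next; the agent names, host, and port below are placeholders:

```properties
# Hop 1 (agent "a1"): Avro sink forwards events to the next agent.
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = collector.example.com
a1.sinks.k1.port = 4545
a1.sinks.k1.channel = c1

# Hop 2 (agent "collector"): Avro source receives events from upstream.
collector.sources.r1.type = avro
collector.sources.r1.bind = 0.0.0.0
collector.sources.r1.port = 4545
collector.sources.r1.channels = c1
```

Each hop commits its channel transaction only after the downstream agent has accepted the batch, which is what provides the delivery guarantee above.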