This document provides an overview of large scale data ingestion using Apache Flume. It discusses why event streaming with Flume is useful, including its scalability, event routing capabilities, and declarative configuration. It also covers Flume concepts like sources, channels, sinks, and how they connect agents together reliably in a topology. The document dives into specific source, channel, and sink types including examples and configuration details. It also discusses interceptors, channel selectors, sink processors, and ways to integrate Flume into applications using client SDKs and embedded agents.
This is the talk I gave at the Big Data Meetup in Seattle in March. In this talk, I discuss the fundamentals of Spark Streaming and Flume, and how they integrate with each other.
Deploying Apache Flume to enable low-latency analytics – DataWorks Summit
The driving question behind redesigns of countless data collection architectures has often been, "How can we make the data available to our analytical systems faster?" Increasingly, the go-to solution for this data collection problem is Apache Flume. In this talk, architectures and techniques for designing a low-latency Flume-based data collection and delivery system to enable Hadoop-based analytics are explored. Techniques for getting the data into Flume, getting the data onto HDFS and HBase, and making the data available as quickly as possible are discussed. Best practices for scaling up collection, addressing de-duplication, and utilizing a combined streaming/batch model are described in the context of Flume and Hadoop ecosystem components.
Apache Flume is a simple yet robust data collection and aggregation framework which allows easy declarative configuration of components to pipeline data from upstream source to backend services such as Hadoop HDFS, HBase and others.
Whether you are developing a greenfield data project or migrating a legacy system, there are many critical design decisions to be made. Often, it is advantageous to not only consider immediate requirements, but also the future requirements and technologies you may want to support. Your project may start out supporting batch analytics with the vision of adding realtime support. Or your data pipeline may feed data to one technology today, but tomorrow an entirely new system needs to be integrated. Apache Kafka can help decouple these decisions and provide a flexible core to your data architecture. This talk will show how building Kafka into your pipeline can provide the flexibility to experiment, evolve and grow. It will also cover a brief overview of Kafka, its architecture, and terminology.
Large scale near real-time log indexing with Flume and SolrCloud – DataWorks Summit
Apache Flume's extensible architecture allows Cisco to stream system and application logs from worldwide production data centers to a central Hadoop cluster and Solr. This architecture enables a new level of scalable indexing so that a larger volume of logs is searchable within seconds. Using Solr 4.0's near real time features together with Hadoop, we can execute mission critical tasks much quicker, improving our ability to meet tight SLAs. At the same time, using the same infrastructure, we can perform large-scale historical analysis and pattern extraction to help further improve our services. This talk will explore our infrastructure and decisions we've made to meet key requirements, i.e., high indexing load, high availability and disaster recovery. We will further explore other uses of Flume and SolrCloud within Cisco including dynamic event routing, parsing and multi-tenancy.
Many architectures include both real-time and batch processing components. This often results in two separate pipelines performing similar tasks, which can be challenging to maintain and operate. We'll show how a single, well designed ingest pipeline can be used for both real-time and batch processing, making the desired architecture feasible for scalable production use cases.
Apache Kafka is becoming the message bus used to transfer huge volumes of data from various sources into Hadoop.
It is also enabling many real-time system frameworks and use cases.
Managing and building clients around Apache Kafka can be challenging. In this talk, we will go through the best practices for deploying Apache Kafka
in production: how to secure a Kafka cluster, how to pick topic partitions, upgrading to newer versions, and migrating to the new Kafka producer and consumer APIs.
We will also talk about the best practices involved in running producers and consumers.
In the Kafka 0.9 release, SSL wire encryption, SASL/Kerberos user authentication, and pluggable authorization were added. Kafka now allows authentication of users and access control on who can read and write to a Kafka topic. Apache Ranger also uses a pluggable authorization mechanism to centralize security for Kafka and other Hadoop ecosystem projects.
We will showcase an open-sourced Kafka REST API and an Admin UI that help users create topics, reassign partitions, issue Kafka ACLs, and monitor consumer offsets.
AMIS SIG - Introducing Apache Kafka - Scalable, reliable Event Bus & Message ... – Lucas Jellema
Introduction to Apache Kafka - the open source platform for real-time message queuing and reliable, scalable, distributed event handling and high-volume pub/sub implementation.
see GitHub https://github.com/MaartenSmeets/kafka-workshop for the workshop resources.
Reducing Microservice Complexity with Kafka and Reactive Streams – jimriecken
My talk from ScalaDays 2016 in New York on May 11, 2016:
Transitioning from a monolithic application to a set of microservices can help increase performance and scalability, but it can also drastically increase complexity. Layers of inter-service network calls add latency and an increasing risk of failure where previously only local function calls existed. In this talk, I'll speak about how to tame this complexity using Apache Kafka and Reactive Streams to:
- Extract non-critical processing from the critical path of your application to reduce request latency
- Provide back-pressure to handle both slow and fast producers/consumers
- Maintain high availability, high performance, and reliable messaging
- Evolve message payloads while maintaining backwards and forwards compatibility.
Nozomi from Yahoo! Japan gave a presentation on how Yahoo! Japan uses Apache Pulsar to build their internal messaging platform for processing tens of billions of messages every day. He explains why Yahoo! Japan chose Pulsar, what the use cases of Apache Pulsar are, and their best practices.
#PulsarBeijingMeetup
Emerging technologies/frameworks in Big Data – Rahul Jain
A short overview presentation on emerging technologies/frameworks in Big Data covering Apache Parquet, Apache Flink, and Apache Drill, with basic concepts of Columnar Storage and Dremel.
Apache Kafka lies at the heart of the largest data pipelines, handling trillions of messages and petabytes of data every day. Learn the right approach for getting the most out of Kafka from the experts at LinkedIn and Confluent. Todd Palino and Gwen Shapira demonstrate how to monitor, optimize, and troubleshoot performance of your data pipelines—from producer to consumer, development to production—as they explore some of the common problems that Kafka developers and administrators encounter when they take Apache Kafka from a proof of concept to production usage. Too often, systems are overprovisioned and underutilized and still have trouble meeting reasonable performance agreements.
Topics include:
- What latencies and throughputs you should expect from Kafka
- How to select hardware and size components
- What you should be monitoring
- Design patterns and antipatterns for client applications
- How to go about diagnosing performance bottlenecks
- Which configurations to examine and which ones to avoid
If you want to stay up to date, subscribe to our newsletter here: https://bit.ly/3tiw1I8
An introduction to Apache Flume that comes from Hadoop Administrator Training delivered by GetInData.
Apache Flume is a distributed, reliable, and available service for collecting, aggregating, and moving large amounts of log data. By reading these slides, you will learn about Apache Flume, its motivation, the most important features, architecture of Flume, its reliability guarantees, Agent's configuration, integration with the Apache Hadoop Ecosystem and more.
"Analyzing Twitter Data with Hadoop - Live Demo", presented at Oracle Open World 2014. The repository for the slides is in https://github.com/cloudera/cdh-twitter-example
Building an ETL pipeline for Elasticsearch using Spark – Itai Yaffe
How we, at eXelate, built an ETL pipeline for Elasticsearch using Spark, including :
* Processing the data using Spark.
* Indexing the processed data directly into Elasticsearch using the elasticsearch-hadoop plug-in for Spark.
* Managing the flow using some of the services provided by AWS (EMR, Data Pipeline, etc.).
The presentation includes some tips and discusses some of the pitfalls we encountered while setting up this process.
Real Time Data Processing using Spark Streaming | Data Day Texas 2015 – Cloudera, Inc.
Speaker: Hari Shreedharan
Data Day Texas 2015
Apache Spark has emerged over the past year as the imminent successor to Hadoop MapReduce. Spark can process data in memory at very high speed, while still being able to spill to disk if required. Spark's powerful, yet flexible API allows users to write complex applications very easily without worrying about the internal workings and how the data gets processed on the cluster.
Spark comes with an extremely powerful Streaming API to process data as it is ingested. Spark Streaming integrates with popular data ingest systems like Apache Flume, Apache Kafka, Amazon Kinesis, etc., allowing users to process data as it comes in.
In this talk, Hari will discuss the basics of Spark Streaming, its API and its integration with Flume, Kafka and Kinesis. Hari will also discuss a real-world example of a Spark Streaming application, and how code can be shared between a Spark application and a Spark Streaming application. Each stage of the application execution will be presented, which can help in understanding good practices for writing such an application. Hari will finally discuss how to write a custom application and a custom receiver to receive data from other systems.
Introduction to Akka 2. Explains what Akka's actors are all about and how to utilize them to write scalable and fault-tolerant systems.
Talk given at JavaZone 2012.
Combine Apache Hadoop and Elasticsearch to Get the Most of Your Big Data – Hortonworks
Hadoop is a great platform for storing and processing massive amounts of data. Elasticsearch is the ideal solution for Searching and Visualizing the same data. Join us to learn how you can leverage the full power of both platforms to maximize the value of your Big Data.
In this webinar we'll walk you through:
How Elasticsearch fits in the Modern Data Architecture.
A demo of Elasticsearch and Hortonworks Data Platform.
Best practices for combining Elasticsearch and Hortonworks Data Platform to extract maximum insights from your data.
Distributed Real-Time Stream Processing: Why and How: Spark Summit East talk ... – Spark Summit
The demand for stream processing is increasing a lot these days. Immense amounts of data have to be processed fast from a rapidly growing set of disparate data sources. This pushes the limits of traditional data processing infrastructures. These stream-based applications include trading, social networks, Internet of things, system monitoring, and many other examples.
A number of powerful, easy-to-use open source platforms have emerged to address this. But the same problem can be solved differently, various but sometimes overlapping use-cases can be targeted or different vocabularies for similar concepts can be used. This may lead to confusion, longer development time or costly wrong decisions.
Everyone in the Scala world is using or looking into using Akka for low-latency, scalable, distributed or concurrent systems. I'd like to share my story of developing and productionizing multiple Akka apps, including low-latency ingestion and real-time processing systems, and Spark-based applications.
When does one use actors vs futures?
Can we use Akka with, or in place of, Storm?
How did we set up instrumentation and monitoring in production?
How does one use VisualVM to debug Akka apps in production?
What happens if the mailbox gets full?
What is our Akka stack like?
I will share best practices for building Akka and Scala apps, pitfalls and things we'd like to avoid, and a vision of where we would like to go for ideal Akka monitoring, instrumentation, and debugging facilities. Plus backpressure and at-least-once processing.
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E... – Spark Summit
Elasticsearch provides native integration with Apache Spark through ES-Hadoop. However, especially during development, it is at best cumbersome to have Elasticsearch running in a separate machine/instance. Leveraging Spark Cluster with Elasticsearch Inside it is possible to run an embedded instance of Elasticsearch in the driver node of a Spark Cluster. This opens up new opportunities to develop cutting-edge applications. One such application is Dataset Search.
Oscar will give a demo of a Dataset Search Engine built on Spark Cluster with Elasticsearch Inside. Motivation is that once Elasticsearch is running on Spark it becomes possible and interesting to have the Elasticsearch in-memory instance join an (existing) Elasticsearch cluster. And this in turn enables indexing of Datasets that are processed as part of Data Pipelines running on Spark. Dataset Search and Data Management are R&D topics that should be of interest to Spark Summit East attendees who are looking for a way to organize their Data Lake and make it searchable.
Gemini Mobile Technologies ("Gemini") released a Real-Time Log Processing System based on Flume and Cassandra ("Flume-Cassandra Log Processor") as open source. The Flume-Cassandra Log Processor enables massive volumes of production system logs to be collected and processed into graphical reports, in real-time. In addition, logs from multiple data centers can be simultaneously aggregated and analyzed in a single database.
Realtime Analytical Query Processing and Predictive Model Building on High Di... – Spark Summit
Spark SQL and Mllib are optimized for running feature extraction and machine learning algorithms on row based columnar datasets through full scan but does not provide constructs for column indexing and time series analysis. For dealing with document datasets with timestamps where the features are represented as variable number of columns in each document and use-cases demand searching over columns and time to retrieve documents to generate learning models in realtime, a close integration within Spark and Lucene was needed. We introduced LuceneDAO in Spark Summit Europe 2016 to build distributed lucene shards from data frame but the time series attributes were not part of the data model. In this talk we present our extension to LuceneDAO to maintain time stamps with document-term view for search and allow time filters. Lucene shards maintain the time aware document-term view for search and vector space representation for machine learning pipelines. We used Spark as our distributed query processing engine where each query is represented as boolean combination over terms with filters on time. LuceneDAO is used to load the shards to Spark executors and power sub-second distributed document retrieval for the queries.
Our synchronous API uses Spark-as-a-Service to power analytical queries while our asynchronous API uses Kafka, Spark Streaming and HBase to power time series prediction algorithms. In this talk we will demonstrate LuceneDAO write and read performance on millions of documents with 1M+ terms and configurable time stamp aggregate columns. We will demonstrate the latency of APIs on a suite of queries generated from terms. Key takeaways from the talk will be a thorough understanding of how to make Lucene powered time aware search a first class citizen in Spark to build interactive analytical query processing and time series prediction algorithms.
Part of the core Hadoop project, YARN is the architectural center of Hadoop that allows multiple data processing engines such as interactive SQL, real-time streaming, data science and batch processing to handle data stored in a single platform, unlocking an entirely new approach to analytics. It is the foundation of the new generation of Hadoop and is enabling organizations everywhere to realize a Modern Data Architecture.
Design a data pipeline to gather log events and transform them into queryable data with Hive DDL.
This covers Java applications using log4j and non-Java Unix applications using rsyslog.
Introduction to the management of data persistence in FIWARE and the different approaches adopted by the FIWARE Community: what a time series database is, and what the differences between the adopted solutions are.
First slide
1) Apache Flume is a distributed, available service that can collect and move large amounts of streaming data from one location to another.
2) Most frequently, it delivers the log data into HDFS.
Second slide
1) Event and Client are the logical components of Flume.
2) An Event is a singular unit of data that can be transported by Flume NG from its source to its destination.
3) Typically an Event is composed of zero or more headers and a body. The headers are used for contextual routing: by using the header values, we can route the data to the next eligible destination.
4) A Client is an Event generator. It generates events and sends them to one or more agents.
E.g., Apache web servers, which continuously generate huge amounts of log data.
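As a hedged sketch of such header-based routing, Flume's multiplexing channel selector can route on a header value (the header name datacenter and the channel names here are illustrative, not from the original slides):
a1.sources.s1.selector.type = multiplexing
a1.sources.s1.selector.header = datacenter
a1.sources.s1.selector.mapping.us-east = ch1
a1.sources.s1.selector.mapping.eu-west = ch2
a1.sources.s1.selector.default = ch1
Events whose datacenter header matches a mapping go to the corresponding channel; everything else falls through to the default.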
Third slide
1) A Flume agent is a JVM daemon process that holds all the Flume NG components: sources, channels, sinks, etc.
2) The source sends events to the channel, the channel stores them, and later the sink drains the events from the channel.
Fourth slide
1) A Source is an active component which receives data from different locations and places it on one or more channels.
2) The declaration of a source component in the ".conf" file of agent "a1" is listed here; s1 is the source component and a1 is the agent.
a1.sources=s1
a1.sources.s1.type=netcat (netcat is one of the source types)
3) Different source types are available: pollable (auto-generating, like the 'tail -F' command or a sequence generator), event-driven, and netcat.
4) We can even write our own source type and specify that custom class name in the source's type parameter.
Fifth slide
1) A channel is a bridge between a source and a sink.
2) The channel stores the source's events and hands them to the sink.
3) There are three different channel types: the memory channel, which is very fast but offers no guarantee against data loss; the file channel, which stores the events on the file system before sending them to the sink; and the database (JDBC) channel, which stores the events in a database.
4) A single channel can be connected to any number of sources and sinks.
Sixth slide
1) A sink receives events from one channel only.
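Putting these pieces together, a minimal end-to-end agent configuration consistent with the notes above might look like this (names, bind address, and port are illustrative):
a1.sources = s1
a1.channels = ch1
a1.sinks = k1
a1.sources.s1.type = netcat
a1.sources.s1.bind = 127.0.0.1
a1.sources.s1.port = 44444
a1.sources.s1.channels = ch1
a1.channels.ch1.type = memory
a1.sinks.k1.type = logger
a1.sinks.k1.channel = ch1
The agent would then be started with something like: flume-ng agent --conf conf --conf-file example.conf --name a1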
Intro to Apache Apex (next gen Hadoop) & comparison to Spark Streaming – Apache Apex
Presenter: Devendra Tagare - DataTorrent Engineer, Contributor to Apex, Data Architect experienced in building high scalability big data platforms.
Apache Apex is a next generation native Hadoop big data platform. This talk will cover details about how it can be used as a powerful and versatile platform for big data.
Apache Apex is a native Hadoop data-in-motion platform. We will discuss the architectural differences between Apache Apex and Spark Streaming, and how these differences affect use cases like ingestion, fast real-time analytics, data movement, ETL, fast batch, very low latency SLAs, high throughput and large scale ingestion.
We will cover fault tolerance, low latency, connectors to sources/destinations, smart partitioning, processing guarantees, computation and scheduling model, state management and dynamic changes. We will also discuss how these features affect time to market and total cost of ownership.
Talk presented by Aarón Fas & Andrés Viedma at the JBcnConf 2015.
'Microservices' is one of the most popular buzzwords in the industry now, but are they really a step forward? Or they might be more a problem than a solution? When are they really helpful? How should they be addressed? What challenges will we face if we decide to implement a microservices based architecture?
One year ago, Tuenti moved from a monolithic PHP backend to a Java + PHP microservices architecture. In this talk, we'll share our experiences so far: how we addressed the change, how we implemented it, why we think it's been valuable for us (and how is that related to the company culture), why it might not be a good idea for your company / application and, mostly, what lessons we have learned from this experience.
With more and more companies adopting microservices and service-oriented architectures, it becomes clear that the HTTP/RPC synchronous communication (while great) is not always the best option for every use case.
In this presentation, I discuss two approaches to an asynchronous event-based architecture. The first is a "classic" style protocol (Python services driven by callbacks with decorators communicating using a messaging layer) that we've been implementing at Demonware (Activision) for Call of Duty back-end services. The second is an actor-based approach (Scala/Akka based microservices communicating using a messaging layer and a centralized router) in place at Bench Accounting.
Both systems, while event based, take different approaches to building asynchronous, reactive applications. This talk explores the benefits, challenges, and lessons learned architecting both Actor and Non-Actor systems.
Data Stream Processing with Apache Flink – Fabian Hueske
This talk is an introduction into Stream Processing with Apache Flink. I gave this talk at the Madrid Apache Flink Meetup at February 25th, 2016.
The talk discusses Flink's features, shows its DataStream API and explains the benefits of event-time stream processing. It gives an outlook on some features that will be added after the 1.0 release.
Stream Processing is emerging as a popular paradigm for data processing architectures, because it handles the continuous nature of most data and computation and gets rid of artificial boundaries and delays. In this talk, we are going to look at some of the most common misconceptions about stream processing and debunk them.
- Myth 1: Streaming is approximate and exactly-once is not possible.
- Myth 2: Streaming is for real-time only.
- Myth 3: You need to choose between latency and throughput.
- Myth 4: Streaming is harder to learn than batch processing.
We will look at these and other myths and debunk them at the example of Apache Flink. We will discuss Apache Flink's approach to high performance stream processing with state, strong consistency, low latency, and sophisticated handling of time. With such building blocks, Apache Flink can handle classes of problems previously considered out of reach for stream processing. We also take a sneak preview at the next steps for Flink.
"Data Provenance: Principles and Why it matters for BioMedical Applications"Pinar Alper
Tutorial given at the Informatics for Health 2017 Conference. These slides are for the second part of the tutorial, describing provenance capture and management tools.
Introduction to Apache Apex and writing a big data streaming application – Apache Apex
Introduction to Apache Apex - The next generation native Hadoop platform, and writing a native Hadoop big data Apache Apex streaming application.
This talk will cover details about how Apex can be used as a powerful and versatile platform for big data. Apache Apex is being used in production by customers for both streaming and batch use cases. Common usage of Apache Apex includes big data ingestion, streaming analytics, ETL, fast batch, alerts, real-time actions, threat detection, etc.
Presenter: Pramod Immaneni, Apache Apex PPMC member and senior architect at DataTorrent Inc, where he works on Apex and specializes in big data applications. Prior to DataTorrent he was a co-founder and CTO of Leaf Networks LLC, eventually acquired by Netgear Inc, where he built products in the core networking space and was granted patents in peer-to-peer VPNs. Before that he was a technical co-founder of a mobile startup where he was an architect of a dynamic content rendering engine for mobile devices.
This is a video of the webcast of an Apache Apex meetup event organized by Guru Virtues at 267 Boston Rd no. 9, North Billerica, MA, on May 7th 2016, and broadcast from San Jose, CA. If you are interested in helping organize the Apache Apex community (i.e., hosting, presenting, community leadership), please email apex-meetup@datatorrent.com.
Key Concepts
Endpoints and Addresses
Deployment Units
Mediation Support APIs
Error Handling
Interceptors
Configuration Externalization
Invoke an Asynchronous Flow
Usage of Message Files
UltraESB Clustering
Metrics and Alerting
Monitoring and Management
EMW Framework
https://www.adroitlogic.com
https://developer.adroitlogic.com
Feb 2013 HUG: Large Scale Data Ingest Using Apache Flume
1. Large Scale Data Ingest Using Apache Flume
Hari Shreedharan
Software Engineer, Cloudera
Apache Flume PMC member / committer
February 2013
2. Why event streaming with Flume is awesome
• Couldn’t I just do this with a shell script?
• What year is this, 2001? There is a better way!
• Scalable collection, aggregation of event data (i.e. logs)
• Dynamic, contextual event routing
• Low latency, high throughput
• Declarative configuration
• Productive out of the box, yet powerfully extensible
• Open source software
3. Lessons learned from Flume OG
• Hard to get predictable performance without decoupling tier impedance
• Hard to scale out without multiple threads at the sink level
• A lot of functionality doesn’t work well as a decorator
• People need a system that keeps the data flowing when there is a network partition (or downed host in the critical path)
6. Basic Concepts
• Client
• Log4j Appender
• Client SDK
• Clientless Operation
• Agent
• Source
• Channel
• Sink
• Valid Configuration
• Must have at least one Channel
• Must have at least one Source or Sink
• Any number of Sources
• Any number of Channels
• Any number of Sinks
7. Concepts in Action
• Source: Puts events into the Channel
• Sink: Drains events from the Channel
• Channel: Stores the events until drained
8. Flow Reliability
Reliability based on:
• Transactional Exchange between Agents
• Persistence Characteristics of Channels in the Flow
Also Available:
• Built-in Load Balancing Support
• Built-in Failover Support
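As a hedged sketch of the built-in load balancing support, a sink group can spread events across two sinks (group and sink names are illustrative):
agent1.sinkgroups = g1
agent1.sinkgroups.g1.sinks = sink1 sink2
agent1.sinkgroups.g1.processor.type = load_balance
agent1.sinkgroups.g1.processor.selector = round_robin
Switching processor.type to failover (with per-sink priorities) gives the built-in failover behavior instead.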
9. Reliability
• Transactional guarantees from the channel
• External clients need to handle retries
• Built-in avro-client to read streams
• Avro source for multi-hop flows
• Use the Flume Client SDK for customization
12. Basic Configuration Rules
# Active components
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1

# Define and configure src1
agent1.sources.src1.type = netcat
agent1.sources.src1.channels = ch1
agent1.sources.src1.bind = 127.0.0.1
agent1.sources.src1.port = 10112

# Define and configure sink1
agent1.sinks.sink1.type = logger
agent1.sinks.sink1.channel = ch1

# Define and configure ch1
agent1.channels.ch1.type = memory

# Some other Agent's configuration
agent2.sources = src1 src2

Rules:
• Only the named agent's configuration is loaded
• Only active components' configuration is loaded within the agent's configuration
• Every Agent must have at least one channel
• Every Source must have at least one channel
• Every Sink must have exactly one channel
• Every component must have a type
13. Deployment
• Steady state: inflow == outflow
• 4 Tier 1 agents at 100 events/sec each (batch size)
• 1 Tier 2 agent at 400 events/sec
14. Source
• Event Driven
• Supports Batch Processing
• Source Types:
• AVRO – RPC source – other Flume agents can send data to this source port
• THRIFT – RPC source (available in next Flume release)
• SPOOLDIR – pick up rotated log files
• HTTP – post to a REST service (extensible)
• JMS – ingest from Java Message Service
• SYSLOGTCP, SYSLOGUDP
• NETCAT
• EXEC
15. How Does a Source Work?
• Read data from external clients/other sinks
• Stores events in configured channel(s)
• Asynchronous to the other end of channel
• Transactional semantics for storing data
21. RPC Sources – Avro and Thrift
• Reading events from external client
• Only TCP
• Connecting two agents in a distributed flow
• Based on IPC thus failure notification is enabled
• Configuration
agent_foo.sources.rpcsource-1.type = avro/thrift
agent_foo.sources.rpcsource-1.bind = <host>
agent_foo.sources.rpcsource-1.port = <port>
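To illustrate the external-client side, here is a minimal sketch using the Flume Client SDK to send one event to an Avro source (the hostname and port are placeholders for the values configured above; error handling is trimmed):
import java.nio.charset.StandardCharsets;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class FlumeRpcExample {
  public static void main(String[] args) throws EventDeliveryException {
    // Connect to the agent's Avro source (placeholder host/port)
    RpcClient client = RpcClientFactory.getDefaultInstance("agent-host.example.com", 41414);
    try {
      // Build an event from a byte payload and send it
      Event event = EventBuilder.withBody("hello flume", StandardCharsets.UTF_8);
      client.append(event); // throws EventDeliveryException on failure, so the caller can retry
    } finally {
      client.close();
    }
  }
}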
22. Spooling Directory Source
• Parses rotated log files out of a “spool” directory
• Watches for new files, renames or deletes them when done
• The files must be immutable before being placed into the
watched directory
agent.sources.spool.type = spooldir
agent.sources.spool.spoolDir = /var/log/spooled-files
agent.sources.spool.deletePolicy = never OR immediate
23. HTTP Source
• Runs a web server that handles HTTP requests
• The handler is pluggable (can roll your own)
• Out of the box, an HTTP client posts a JSON array of events to
the server. Server parses the events and puts them on the
channel.
agent.sources.http.type = http
agent.sources.http.port = 8081
24. HTTP Source, cont’d.
• Default handler supports events that look like this:
[{
  "headers" : {
    "timestamp" : "434324343",
    "host" : "host1.example.com"
  },
  "body" : "arbitrary data in body string"
},
{
  "headers" : {
    "namenode" : "nn01.example.com",
    "datanode" : "dn102.example.com"
  },
  "body" : "some other arbitrary data in body string"
}]
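As a usage sketch (not in the original deck), posting one event in this format to the HTTP source configured earlier on port 8081, using only the Java standard library:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class HttpSourcePostExample {
  public static void main(String[] args) throws Exception {
    // One-event JSON array in the default handler’s format
    String json = "[{\"headers\":{\"host\":\"host1.example.com\"},"
                + "\"body\":\"arbitrary data in body string\"}]";
    URL url = new URL("http://localhost:8081"); // port from the config above
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("POST");
    conn.setRequestProperty("Content-Type", "application/json");
    conn.setDoOutput(true);
    OutputStream out = conn.getOutputStream();
    out.write(json.getBytes("UTF-8"));
    out.close();
    System.out.println("HTTP " + conn.getResponseCode()); // 200 on success
  }
}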
25. Exec Source
• Reads data from the output of a command
• Can be used for ‘tail -F …’
• Doesn’t handle failures (e.g., if the channel is full, events are dropped)
Configuration:
agent_foo.sources.execSource.type = exec
agent_foo.sources.execSource.command = tail -F /var/log/weblog.out
26. JMS Source
• Reads messages from a JMS queue or topic, converts them to Flume events
and puts those events onto the channel.
• Pluggable converter that by default converts Bytes, Text, and Object
messages into Flume Events.
• So far, tested with ActiveMQ. We’d like to hear about experiences with any
other JMS implementations.
agent.sources.jms.type = jms
agent.sources.jms.initialContextFactory =
org.apache.activemq.jndi.ActiveMQInitialContextFactory
agent.sources.jms.providerURL = tcp://mqserver:61616
agent.sources.jms.destinationName = BUSINESS_DATA
agent.sources.jms.destinationType = QUEUE
27. Interceptor
• Applied to Source configuration element
• One source can have many interceptors
• Chain-of-responsibility
• Can be used for tagging, filtering, and routing (a config sketch follows below)
• Built-in interceptors:
• TIMESTAMP
• HOST
• STATIC
• REGEX EXTRACTOR
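For example, a minimal sketch chaining two of the built-in interceptors on a source (the i1/i2 names are arbitrary):

agent.sources.src1.interceptors = i1 i2
# TIMESTAMP: adds a "timestamp" header with the event’s arrival time
agent.sources.src1.interceptors.i1.type = timestamp
# HOST: adds a "host" header with the agent’s host name or IP
agent.sources.src1.interceptors.i2.type = host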
31. Channel
• Passive Component
• Determines the reliability of a flow
• “Stock” channels that ship with Flume
• FILE – provides durability; most people use this
• MEMORY – lower latency for small writes, but not durable
• JDBC – provides full ACID support, but has performance issues
32. File Channel
• Write Ahead Log implementation
• Configuration:
agent1.channels.ch1.type = FILE
agent1.channels.ch1.checkpointDir = <dir>
agent1.channels.ch1.dataDirs = <dir1> <dir2>…
agent1.channels.ch1.capacity = N (100k)
agent1.channels.ch1.transactionCapacity = n
agent1.channels.ch1.checkpointInterval = n (30000)
agent1.channels.ch1.maxFileSize = N (1.52G)
agent1.channels.ch1.write-timeout = n (10s)
agent1.channels.ch1.checkpoint-timeout = n (600s)
33. File Channel
Flume Event Queue
• In-memory representation of the channel
• Maintains a queue of pointers to the data on disk in various log
files; reference-counts the log files
• Memory-mapped to a checkpoint file
Log Files
• On-disk representation of actions
(Puts/Takes/Commits/Rollbacks)
• Contain the actual data
• Log files with 0 references get deleted
34. Sink
• Polling Semantics
• Supports Batch Processing
• Specialized Sinks
• HDFS (Write to HDFS – highly configurable)
• HBASE, ASYNCHBASE (Write to HBase)
• AVRO (IPC Sink – Avro Source as IPC source at next hop)
• THRIFT (IPC Sink – Thrift Source as IPC source at next hop)
• FILE_ROLL (Local disk, roll files based on size, # of events etc)
• NULL, LOGGER (For Testing Purposes)
• ElasticSearch
• IRC
35. HDFS Sink
• Writes events to HDFS (what!)
• Configuring – a minimal sketch follows (see the Flume User Guide for the full option list):
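A minimal sketch, assuming a channel named ch1 and an illustrative HDFS path; rollInterval/rollSize/rollCount are covered on the next slide:

agent.sinks.hdfs1.type = hdfs
agent.sinks.hdfs1.channel = ch1
agent.sinks.hdfs1.hdfs.path = hdfs://namenode/flume/events
agent.sinks.hdfs1.hdfs.filePrefix = events
agent.sinks.hdfs1.hdfs.fileType = DataStream
agent.sinks.hdfs1.hdfs.batchSize = 1000
agent.sinks.hdfs1.hdfs.rollInterval = 300
agent.sinks.hdfs1.hdfs.rollSize = 67108864
agent.sinks.hdfs1.hdfs.rollCount = 0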
36. HDFS Sink
• Supports dynamic directory naming using tags
• Use event headers: %{header}
• E.g.: hdfs://namenode/flume/%{header}
• Time-based escape sequences use the timestamp from the event header
• E.g.: hdfs://namenode/flume/%{header}/%Y-%m-%d/
• Use roundValue and roundUnit to round the timestamp down, bucketing
events into separate directories (see the sketch below)
• Within a directory, files are rolled based on:
• rollInterval – seconds before the current file is rolled
• rollSize – max size of the file
• rollCount – max # of events per file
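A sketch of time-bucketed directories with rounding; the %{topic} header is illustrative, and events need a timestamp header (e.g., from the timestamp interceptor):

agent.sinks.hdfs1.hdfs.path = hdfs://namenode/flume/%{topic}/%Y-%m-%d/%H%M
agent.sinks.hdfs1.hdfs.round = true
agent.sinks.hdfs1.hdfs.roundValue = 10
agent.sinks.hdfs1.hdfs.roundUnit = minute

With this, the %H%M portion is rounded down so events land in 10-minute buckets under each day’s directory.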
37. AsyncHBase Sink
• Inserts events and increments into HBase
• Writes events asynchronously at a very high rate
• Easy to configure (see the sketch below):
• table
• columnFamily
• batchSize – # events per txn
• timeout – how long to wait for the success callback
• serializer/serializer.* – a custom serializer can decide how and where the events
are written out
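A minimal sketch, assuming a channel ch1 and an existing HBase table (the table and column family names are made up):

agent.sinks.hbase1.type = asynchbase
agent.sinks.hbase1.channel = ch1
agent.sinks.hbase1.table = web_events
agent.sinks.hbase1.columnFamily = d
agent.sinks.hbase1.batchSize = 100
# Milliseconds to wait for the success callback
agent.sinks.hbase1.timeout = 60000
agent.sinks.hbase1.serializer = org.apache.flume.sink.hbase.SimpleAsyncHbaseEventSerializer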
38. IPC Sinks (Avro/Thrift)
• Sends events to the next hop’s IPC Source (configured like the Tier 1
Avro sink in the Deployment sketch above)
• Configuring:
• hostname
• port
• batch-size – # events per txn/batch sent to the next hop
• request-timeout – how long to wait for success of a batch
39. Serializers
• Supported by the HDFS, HBase, and FILE_ROLL sinks
• Converts the event into a format of the user’s choice
• In the case of HBase, converts an event into Puts and Increments
40. Sink Group
• Top-level element, needed to declare sink processors
• A sink can be in at most one group at any time
• By default, all sinks are in their own individual default sink groups
• The default sink group is a pass-through
• Deactivating a sink group does not deactivate the sink!
41. Sink Processor
• Acts as a Sink Proxy
• Can work with multiple Sinks
• Built-in Sink Processors:
• DEFAULT
• FAILOVER
• LOAD_BALANCE
• Applied via Sink Groups (see the sketch below)
• A Top-Level Component
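A minimal sketch of a load-balancing sink group over two sinks (group and sink names are arbitrary):

agent.sinkgroups = group1
agent.sinkgroups.group1.sinks = sink1 sink2
agent.sinkgroups.group1.processor.type = load_balance
agent.sinkgroups.group1.processor.selector = round_robin
# Back off from a failed sink instead of retrying it immediately
agent.sinkgroups.group1.processor.backoff = true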
43. Clients: Embedded agent
• More advanced RPC client. Integrates a channel.
• Minimal example:
properties.put("channel.type", "memory");
properties.put("channel.capacity", "200");
properties.put("sinks", "sink1");
properties.put("sink1.type", "avro");
properties.put("sink1.hostname", "collector1.example.com");
properties.put("sink1.port", "5564");
EmbeddedAgent agent = new EmbeddedAgent("myagent");
agent.configure(properties);
agent.start();
List<Event> events = new ArrayList<Event>();
events.add(event);
agent.putAll(events);
agent.stop();
• See Flume Developer Guide for more details and examples.
44. General Caveats
• Reliability = function of channel type, capacity, and system
redundancy
• Carefully size the channels for the needed capacity
• Set batch sizes based on projected drain requirements
• Provision roughly one core for every two sources and sinks
combined in an agent
46. Summary
• Clients send Events to Agents
• Each agent hosts Flume components: Source, Interceptors, Channel
Selectors, Channels, Sink Processors & Sinks
• Sources & Sinks are active components, Channels are passive
• Source accepts Events, passes them through Interceptor(s), and if not
filtered, puts them on channel(s) selected by the configured Channel
Selector
• The Sink Processor identifies a sink to invoke, which takes Events from a
Channel and sends them to its next-hop destination
• Channel operations are transactional to guarantee one-hop delivery
semantics
• Channel persistence provides end-to-end reliability
47. Reference docs (1.3.1 release)
User Guide:
flume.apache.org/FlumeUserGuide.html
Dev Guide:
flume.apache.org/FlumeDeveloperGuide.html
48. Blog posts
• Flume performance tuning
https://blogs.apache.org/flume/entry/flume_performance_tuning_part_1
• Flume and Hbase
https://blogs.apache.org/flume/entry/streaming_data_into_apache_hbase
• File Channel Innards
https://blogs.apache.org/flume/entry/apache_flume_filechannel
• Architecture of Flume NG
https://blogs.apache.org/flume/entry/flume_ng_architecture
49. Contributing: How to get involved!
• Join the mailing lists:
• user-subscribe@flume.apache.org
• dev-subscribe@flume.apache.org
• Look at the code
• github.com/apache/flume – Mirror of the Apache Flume git repo
• File or fix a JIRA
• issues.apache.org/jira/browse/FLUME
• More on how to contribute:
• cwiki.apache.org/confluence/display/FLUME/How+to+Contribute
51. Thank you
Reach out on the mailing lists!
Follow me on Twitter: @harisr1234
Editor's Notes
If you have a server farm that emits log data in GB/min, you could hack together a very simple aggregator, but chances are it won't provide reliability, manageability, or scalability. This is why many use Flume: an out-of-the-box, open-source, high-performing, reliable, and scalable aggregator for streaming data. You don't want to risk outages, or failing scripts overloading your spindles. Flume is declarative in that you don't have to write code. Flume is extensible in that you can write your own components on top of Flume, which let you modify its behavior and feature set. Flume has one-hop delivery; if you want end-to-end reliability, use the file channel, which we'll talk about later. There are no acknowledgements from the terminal destination to the client, because the client would then be forced to hold all events until the ack is received. You want these systems to occupy a small disk footprint. Set up redundant flows if you're concerned about hardware failures; Flume doesn't support splicing or RAID out of the box.
With Flume NG, there is built-in buffering capacity at every hop, so data and events will be preserved. As for single-hop reliability, the degree of reliability is based on the channel: the memory channel and recoverable memory channel are best-effort, whereas the file channel and JDBC channel are reliable because they write to disk. OG: a garden hose connected from faucet to sprinkler; contiguous flow, except when you pinch the hose in the middle. NG: the hose connects multiple water tanks (i.e., channels/passive buffers) from faucet to sprinkler; if you pinch the hose, the flow doesn't stop. 1. Decouples impedance between producers and consumers. 2. Dynamic routing capabilities (you can shut down one tank to re-route traffic). 3. Unrestricted capacity (a consumer's input is no longer restricted by a producer's output, as one tank can feed into multiple downstream tanks).
Flume flow: the simplest individual component is the agent; agents can talk to each other and to HDFS, HBase, etc. Clients talk to agents.
Clientless operation: the agent loads up data using specialized sources. An agent is a collection of sources, channels, and sinks. A source captures events from the outside; only the exec source can generate events on its own. A channel is a buffer between source and sink. A sink has the responsibility of draining the channel out to another agent or to a terminal point like HDFS. You can't have a source with no place to write events.
In the upper diagram, the 3 agents' flow is healthy. In the lower diagram, a sink fails to communicate with its downstream source, so the reservoir fills up, and the filling cascades upstream, buffering against the downstream hardware failure. No events are lost until all channels in that flow fill up, at which point the sources report failure to the client. Steady-state flow is restored when the link becomes active again.
WHAT MAKES IT ACTIVE? Src2 is inactive because it's not in the active set. Define multiple sources for the same agent with space-separated lists. Fan out: a source writes to two channels. Multiple sinks can drain the same channel for increased throughput. A channel is implemented as a queue: the source appends data to the end of the queue and the sink drains from the head. The config file is checked at startup, and changes are checked for every 30 seconds, so you don't have to restart agents when the config file changes. What use case would need multiple sinks draining the same channel? Sources are multi-threaded and greedily implemented (for improved throughput); sinks are single-threaded and have a fixed capacity on what they can drain. There is an impedance mismatch between sources and sinks: sources will expand to accommodate load and bursty traffic so downstream won't be affected, while sinks drain steadily. Add another sink to the same channel to meet the steady-state requirement.
Four Tier 1 agents drain into one Tier 2 agent, which then distributes its load over two Tier 3 agents. You can have a single config file for all 3 agents, pass that around your deployment, and you're done. At any node, the ingest rate must equal the exit rate.
Avro is the standard. Channels support transactions. Flume sources: avro, exec, syslog, spooling directory, http, embedded agent, JMS.
Transactional semantics for storing data: if a sink takes data out, it commits only once the source on the next hop has committed its data.
Use cases: you want the same data to go into HDFS and into HBase; priority-based routing; any contextual routing.
JMS – the client talks to a broker, which handles failures.
On Avro, once the source commits the events to its channel via a put transaction, it sends a success message to the previous hop, and the sink on the previous hop deletes those events once it commits its take transaction.
Takes a command as a config parameter and executes it; each line the command writes to stdout is written out as an event to the channel. If the channel is full, data is dropped and lost. During file rotation, if an event fails, data is lost.
An interceptor is a transparent component that gets applied to the flow and can do filtering and minor modification of events, but it can't multiply events – e.g., it can't do decompression of an event, because batching and compression are framework-level concerns that Flume should address. The overall number of events emitted by an interceptor cannot exceed the number of events that came in – you can drop events but can't add them (which would go over the transaction capacity).
An interceptor never returns null, because its output is passed to the next interceptor or to the channel.
The file channel is the recommended channel: it is reliable (no data loss in an outage), scales linearly with additional spindles (more disks, better performance), and has better durability guarantees than the memory channel. The memory channel can't scale to large capacity because it is bounded by memory. JDBC is not recommended due to slow performance (don't mention deadlock).
It is recommended to use three disks: one disk for checkpointing and two disks for data. keep-alive – wait 3 seconds for the blocks to free up; usually only used in high-stress environments.
Three files: the checkpoint file (memory-mapped by the Flume Event Queue), log1, and log2. Checkpoint file = FEQ. If you lose the FEQ you don't lose data, since it's in the log files, but it takes a long time to remap the data into memory. The channel's main operations are done on top of the Flume Event Queue, which is a queue of pointers into different locations in different log files. The FEQ is the queue of active data that exists within the file channel and contains the reference counts of files. Each log file contains its own metadata – it is a write-ahead log, not a direct serialization of data. The FEQ doesn't store data, so the size of your events doesn't impact the FEQ.
Polling semantics – the sink continually polls to see if events are available. The AsyncHBase sink is recommended over the HBase sink (which uses the synchronous HBase API) for better performance. The null sink drops events on the floor.
Groups active sinks together and then adds a processor. load_balance ships with round-robin and random distribution plus back-off, but you can write your own selection algorithm and plug it into the sink processor. Failover supports back-off (it won't try a failed sink until the back-off time period is over).
The client interface exposes isActive, which can be used for testing. This is a way of getting data into Flume: the client can talk to Flume's Avro/Thrift source.