This document discusses Flume and Camus, two tools for ingesting event data into HDFS. Flume is described as an event streaming tool that uses sources to pull data, channels to transport it, and sinks to output the data. Camus is a LinkedIn project that uses MapReduce jobs to periodically pull Kafka data in batches and write it to HDFS. The document provides configuration examples and compares the two in terms of ease of setup, performance, customizability, and deployment. It notes that Flume is easier to maintain but has a more complex configuration, while Camus has poor documentation and performance can be impacted by its use of MapReduce.
10. #backdaybyxebia
Basics : Hadoop Distributed File System (HDFS)
Distributed & scalable
Highly fault-tolerant
The standard platform for big-data jobs to run on
“Moving computation is cheaper than moving data”
19. #backdaybyxebia
Flume
1. An “item” exists somewhere (initially, “items” were log files)
2. A source is a way to pull this data and transform it into Flume events
3. A channel is a way to transport data (memory, file)
4. A sink is a way to put a Flume event somewhere
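The four concepts above fit together in a single agent definition. A minimal sketch (all names — a1, r1, c1, k1 — and paths are illustrative):

```properties
# Hypothetical agent "a1": tail a log file, buffer events
# in memory, and write them to HDFS
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: turn new lines of a log file into Flume events
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app.log
a1.sources.r1.channels = c1

# Channel: transport events through memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# Sink: put each Flume event somewhere (here, HDFS)
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events
a1.sinks.k1.channel = c1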
26. #backdaybyxebia
Flume Interceptors
# Mandatory config
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = ...
...
➔ Transformation executed
◆ After the event is generated
◆ Before sending it to the channel
➔ Some predefined interceptors
◆ Timestamp
◆ UUID
◆ Filtering
◆ Morphline
◆ ...
➔ You can write your own (pure Java)
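Filling in the mandatory config above with the predefined timestamp interceptor gives, as a sketch (agent/source names a1/r1 are illustrative):

```properties
# Attach a "timestamp" header to every event from source r1
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = timestamp

# Interceptors can be chained, e.g. timestamp then host:
# a1.sources.r1.interceptors = i1 i2
# a1.sources.r1.interceptors.i2.type = host
```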
27. #backdaybyxebia
Flume : how to run it ?
Included in the main distributions; from the command line:
flume-ng agent
-n a1
-c /usr/lib/flume-ng/conf/
-f /usr/lib/flume-ng/conf/flume-kafka.conf &
37. #backdaybyxebia
Round 1 : getting started
Flume:
- A simple configuration file makes it work
- The Morphline interceptor’s syntax is quite complex
Camus:
- Needs a complete dev environment (including Maven) to use it
- Devs should understand MapReduce concepts
38. #backdaybyxebia
Round 2 : running time
Flume:
- Events are ingested with no delay
Camus:
- MapReduce setup adds incompressible time: 31 sec fixed + ~1 sec / 500 messages / node
- Measured: 111 sec for 1 message, 117 sec for 50 messages, 116 sec for 1,000 messages, 127 sec for 10,000 messages
39. #backdaybyxebia
Round 3 : maintainability
Flume:
- When managed by CM (Cloudera Manager), the server is easy to maintain, but the config is not
Camus:
- A full Maven project: just use a version control system (Git, SVN, etc.)
40. #backdaybyxebia
Round 4 : customization
Flume:
- Interceptors are fully customizable
- Event headers make the HDFS path highly configurable
Camus:
- Morphing data can be done easily
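As a sketch of how event headers drive the HDFS path: the HDFS sink expands %{header} and timestamp escapes in its target path (the sink name k1 and the "topic" header are illustrative; the date escapes require a timestamp header, e.g. from the timestamp interceptor):

```properties
# Route events into per-topic, per-day directories using
# the "topic" event header and the event timestamp
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /data/%{topic}/%Y-%m-%d
```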
41. #backdaybyxebia
Round 5 : deployment
Flume:
- When managed by CM, just include your conf, that’s it
- Without a manager, everything needs to be done manually
Camus:
- MapReduce jobs may be included in any MR orchestrator (Oozie, for instance)
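Since Camus builds to an ordinary Hadoop job, scheduling it (via cron, Oozie, etc.) amounts to launching the jar periodically. A sketch of the usual invocation (the jar and properties file names are illustrative):

```shell
# Run one Camus batch: pull new Kafka messages and write them to HDFS
hadoop jar camus-example-0.1.0-SNAPSHOT-shaded.jar \
  com.linkedin.camus.etl.kafka.CamusJob \
  -P /etc/camus/camus.properties
```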
42. #backdaybyxebia
Round 6 : state of the project
Flume:
- Released 1.0.0 in 2012
- Included by default in Hadoop distributions
Camus:
- Currently at v0.1.0-SNAPSHOT
- Heavily used by LinkedIn in production
- Almost no documentation
47. #backdaybyxebia
Camus & M/R
Camus suffers from its use of MapReduce: every batch pays the fixed job-setup cost. Using another engine, such as Spark, might yield better performance.
48. #backdaybyxebia
Flume quantity of files
Flume needs a very precise configuration to avoid generating a flood of files. It is easy to make it produce lots of little files, which is problematic for HDFS: each file consumes NameNode memory and adds overhead to downstream jobs.
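One way to keep file counts down, as a sketch: the HDFS sink's roll settings decide when Flume closes the current file and opens a new one (values below are illustrative; sink name k1 is assumed):

```properties
# Roll on size rather than on event count or short timers,
# so Flume produces fewer, larger files on HDFS.
# Roll at most once per hour (seconds)
a1.sinks.k1.hdfs.rollInterval = 3600
# Roll when the file reaches ~128 MB (bytes)
a1.sinks.k1.hdfs.rollSize = 134217728
# Never roll on event count (0 disables it)
a1.sinks.k1.hdfs.rollCount = 0
```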