This document discusses Flume and Camus, two tools for ingesting event data into HDFS. Flume is described as an event streaming tool that uses sources to pull data, channels to transport it, and sinks to output the data. Camus is a LinkedIn project that uses MapReduce jobs to periodically pull Kafka data in batches and write it to HDFS. The document provides configuration examples and compares the two in terms of ease of setup, performance, customizability, and deployment. It notes that Flume is easier to maintain but has a more complex configuration, while Camus has poor documentation and performance can be impacted by its use of MapReduce.
10. #backdaybyxebia
Basics : Hadoop Distributed File System (HDFS)
Distributed & scalable
Highly fault-tolerant
The standard platform for big-data jobs to run on
“Moving computation is cheaper than moving data”
19. #backdaybyxebia
Flume
1. An “item” exists somewhere (initially, “items” were log files)
2. A source is a way to pull this data and transform it into Flume events
3. A channel is a way to transport data (memory, file)
4. A sink is a way to put a Flume event somewhere
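The four concepts above fit together in a single agent definition. A minimal sketch (all names — a1, r1, c1, k1 — and paths are illustrative):

```properties
# Hypothetical agent "a1": tail a log file, buffer events
# in memory, and write them to HDFS
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: turn new lines of a log file into Flume events
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app.log
a1.sources.r1.channels = c1

# Channel: transport events through memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# Sink: put each Flume event somewhere (here, HDFS)
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events
a1.sinks.k1.channel = c1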
26. #backdaybyxebia
Flume Interceptors
# Mandatory config
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = ...
...
➔ Transformation executed
◆ After the event is generated
◆ Before sending it to the channel
➔ Some predefined interceptors
◆ Timestamp
◆ UUID
◆ Filtering
◆ Morphline
◆ ...
➔ You can write your own (pure Java)
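Filling in the mandatory config above with the predefined timestamp interceptor gives, as a sketch (agent/source names a1/r1 are illustrative):

```properties
# Attach a "timestamp" header to every event from source r1
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = timestamp

# Interceptors can be chained, e.g. timestamp then host:
# a1.sources.r1.interceptors = i1 i2
# a1.sources.r1.interceptors.i2.type = host
```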
27. #backdaybyxebia
Flume : how to run it ?
Included in the main distributions; from the command line:
flume-ng agent
-n a1
-c /usr/lib/flume-ng/conf/
-f /usr/lib/flume-ng/conf/flume-kafka.conf &
37. #backdaybyxebia
Round 1 : getting started
Flume:
- A simple configuration file makes it work
- The Morphline interceptor’s syntax is quite complex
Camus:
- Needs a complete dev environment (including Maven) to use it
- Devs should understand MapReduce concepts
38. #backdaybyxebia
Round 2 : running time
Flume:
- Events are ingested with no delay
Camus:
- MapReduce setup adds incompressible time: 31 sec fixed + ~1 sec / 500 messages / node
- Measured: 111 sec for 1 message, 117 sec for 50 messages, 116 sec for 1,000 messages, 127 sec for 10,000 messages
39. #backdaybyxebia
Round 3 : maintainability
Flume:
- When managed by CM (Cloudera Manager), the server is easy to maintain, but the config is not
Camus:
- A full Maven project: just use a version control system (Git, SVN, etc.)
40. #backdaybyxebia
Round 4 : customization
Flume:
- Interceptors are fully customizable
- Event headers make the HDFS path highly configurable
Camus:
- Morphing data can be done easily
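As a sketch of how event headers drive the HDFS path: the HDFS sink expands %{header} and timestamp escapes in its target path (the sink name k1 and the "topic" header are illustrative; the date escapes require a timestamp header, e.g. from the timestamp interceptor):

```properties
# Route events into per-topic, per-day directories using
# the "topic" event header and the event timestamp
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /data/%{topic}/%Y-%m-%d
```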
41. #backdaybyxebia
Round 5 : deployment
Flume:
- When managed by CM, just include your conf, that’s it
- Without a manager, everything needs to be done manually
Camus:
- MapReduce jobs may be included in any MR orchestrator (Oozie, for instance)
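Since Camus builds to an ordinary Hadoop job, scheduling it (via cron, Oozie, etc.) amounts to launching the jar periodically. A sketch of the usual invocation (the jar and properties file names are illustrative):

```shell
# Run one Camus batch: pull new Kafka messages and write them to HDFS
hadoop jar camus-example-0.1.0-SNAPSHOT-shaded.jar \
  com.linkedin.camus.etl.kafka.CamusJob \
  -P /etc/camus/camus.properties
```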
42. #backdaybyxebia
Round 6 : state of the project
Flume:
- Released 1.0.0 in 2012
- Included by default in Hadoop distributions
Camus:
- Currently at v0.1.0-SNAPSHOT
- Heavily used by LinkedIn in production
- Almost no documentation
47. #backdaybyxebia
Camus & M/R
Camus suffers from its use of MapReduce: every batch pays the fixed job-setup cost. Using another engine, such as Spark, might yield better performance.
48. #backdaybyxebia
Flume quantity of files
Flume needs a very precise configuration to avoid generating a flood of files. It is easy to make it produce lots of little files, which is problematic for HDFS: each file consumes NameNode memory and adds overhead to downstream jobs.
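One way to keep file counts down, as a sketch: the HDFS sink's roll settings decide when Flume closes the current file and opens a new one (values below are illustrative; sink name k1 is assumed):

```properties
# Roll on size rather than on event count or short timers,
# so Flume produces fewer, larger files on HDFS.
# Roll at most once per hour (seconds)
a1.sinks.k1.hdfs.rollInterval = 3600
# Roll when the file reaches ~128 MB (bytes)
a1.sinks.k1.hdfs.rollSize = 134217728
# Never roll on event count (0 disables it)
a1.sinks.k1.hdfs.rollCount = 0
```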