How to collect Big Data into Hadoop

Big Data processing to collect Big Data

Transcript

  • 1. How to collect Big Data into Hadoop. Big Data processing to collect Big Data. fluentd.org. Sadayuki Furuhashi
  • 2. Self-introduction > Sadayuki Furuhashi > Treasure Data, Inc. Founder & Software Architect > Open source projects: MessagePack - efficient serializer (original author); Fluentd - event collector (original author)
  • 3. We're Hiring! sf@treasure-data.com
  • 4. Today’s topic
  • 5. Big Data: Report & Monitor
  • 6. Big Data: Collect → Store → Process → Visualize → Report & Monitor
  • 7. Easier & shorter time: Collect → Store → Process → Visualize (Store/Process: Cloudera, Hortonworks, MapR; Visualize: Excel, Tableau, R)
  • 8. How to shorten the Collect stage? Easier & shorter time: Collect → Store → Process → Visualize (Store/Process: Cloudera, Hortonworks, MapR; Visualize: Excel, Tableau, R)
  • 9. Problems to collect data
  • 10. Poor man's data collection:
        1. Copy files from servers using rsync
        2. Create a RegExp to parse the files
        3. Parse the files and generate a 10GB CSV file
        4. Put it into HDFS
  • 11. Problems of collecting "big data":
        > Includes broken values → needs error handling & retrying
        > Time-series data are changing and unclean → parse logs before storing
        > Takes time to read/write → tools have to be optimized and parallelized
        > Takes time for trial & error
        > Causes network traffic spikes
  • 12. Problems of poor man's data collection:
        > Wastes time implementing error handling
        > Wastes time maintaining a parser
        > Wastes time debugging the tool
        > Not reliable
        > Not efficient
  • 13. Basic theories to collect big data
  • 14. Divide & Conquer (diagram: data is split into chunks, so an error affects only an individual chunk)
  • 15. Divide & Conquer & Retry (diagram: a failed chunk is retried individually; see the sketch below)
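    A minimal sketch of the divide & conquer & retry idea, assuming a
    hypothetical upload function and pre-split chunks (this is not
    Fluentd's actual implementation):

        # Each chunk is transferred independently, so one bad chunk does
        # not force re-sending everything; a failed chunk is retried
        # with an exponentially growing wait.
        def send_with_retry(chunk, max_retries = 5)
          wait = 1.0
          retries = 0
          begin
            upload(chunk)        # hypothetical transfer function
          rescue
            raise if retries >= max_retries
            retries += 1
            sleep(wait)
            wait *= 2.0          # exponential backoff
            retry
          end
        end

        chunks.each { |chunk| send_with_retry(chunk) }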
  • 16. Streaming (diagram: "Don't handle big files here" vs. "Do it here": stream small chunks continuously instead of moving big files at once)
  • 17. Apache Flume and Fluentd
  • 18. Apache Flume
  • 19. Apache Flume (diagram: access logs, app logs, and system logs each flow through an Agent to a Collector, ...)
  • 20. Apache Flume - network topology. Flume OG: a central Master coordinates Agents and Collectors (Agents send, acks go through the Master). Flume NG: Agents send/ack directly to Collectors.
  • 21. Apache Flume - pipeline. Flume OG: Source → Sink (plugin). Flume NG: Source → Channel → Sink.
  • 22. Apache Flume - configuration (Flume NG). A Master manages all configuration of Agents and Collectors (the Master is optional).
  • 23. Apache Flume - configuration:
        # source
        host1.sources = avro-source1
        host1.sources.avro-source1.type = avro
        host1.sources.avro-source1.bind = 0.0.0.0
        host1.sources.avro-source1.port = 41414
        host1.sources.avro-source1.channels = ch1
        # channel
        host1.channels = ch_avro_log
        host1.channels.ch_avro_log.type = memory
        # sink
        host1.sinks = log-sink1
        host1.sinks.log-sink1.type = logger
        host1.sinks.log-sink1.channel = ch1
  • 24. Fluentd
  • 25. Fluentd - network topology. Flume NG: Agents send/ack to Collectors (two kinds of nodes). Fluentd: identical fluentd processes send/ack to each other (one kind of node for every role).
  • 26. Fluentd - pipeline. Flume NG: Source → Channel → Sink (plugin). Fluentd: Input → Buffer → Output.
  • 27. Fluentd - configuration. No central node - keep things simple. Use Chef, Puppet, etc. for configuration (they do things better).
  • 28. Fluentd - configuration:
        <source>
          type forward
          port 24224
        </source>
        <match **>
          type file
          path /var/log/logs
        </match>
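    With this configuration, fluentd accepts events over the forward
    protocol on port 24224 and appends them to files under /var/log/logs.
    As a quick test, the fluent-cat utility that ships with fluentd can
    inject an event into the forward input (the tag debug.test is just an
    example):

        $ echo '{"message":"hello"}' | fluent-cat debug.test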
  • 29. Fluentd - configuration, side by side with Flume NG. Fluentd:
        <source>
          type forward
          port 24224
        </source>
        <match **>
          type file
          path /var/log/logs
        </match>
        versus Flume NG:
        # source
        host1.sources = avro-source1
        host1.sources.avro-source1.type = avro
        host1.sources.avro-source1.bind = 0.0.0.0
        host1.sources.avro-source1.port = 41414
        host1.sources.avro-source1.channels = ch1
        # channel
        host1.channels = ch_avro_log
        host1.channels.ch_avro_log.type = memory
        # sink
        host1.sinks = log-sink1
        host1.sinks.log-sink1.type = logger
        host1.sinks.log-sink1.channel = ch1
  • 30. Fluentd - Users
  • 31. Fluentd - plugin distribution platform:
        $ fluent-gem search -rd fluent-plugin
        $ fluent-gem install fluent-plugin-mongo
  • 32. Fluentd - plugin distribution platform:
        $ fluent-gem search -rd fluent-plugin
        $ fluent-gem install fluent-plugin-mongo
        94 plugins!
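    Once installed, a plugin is enabled purely through configuration. A
    sketch of routing events tagged apache.access into MongoDB with
    fluent-plugin-mongo (database and collection names are examples):

        <match apache.access>
          type mongo
          host localhost
          port 27017
          database apache
          collection access
        </match>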
  • 33. Concept of Fluentd. Customization is essential → small core + many plugins. The Fluentd core helps to implement plugins → common features are already implemented.
  • 34. Fluentd core vs. plugins. Core: Divide & Conquer, retrying, parallelization, error handling, message routing. Plugins: read / receive data, write / send data.
  • 35. Fluentd plugins
  • 36. in_tail (diagram: apache writes access.log; fluentd tails it): ✓ read a log file ✓ custom regexp ✓ custom parser in Ruby
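    A sketch of an in_tail source for an Apache access log (paths and tag
    are examples; pos_file remembers the read position across restarts):

        <source>
          type tail
          path /var/log/apache2/access.log
          pos_file /var/log/fluentd/apache.access.pos
          tag apache.access
          format apache2
        </source>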
  • 37. out_mongo (diagram: apache → access.log → fluentd in_tail → buffer → MongoDB)
  • 38. out_mongo (diagram as above): ✓ retry automatically ✓ exponential retry wait ✓ persistent on a file
  • 39. out_s3 (diagram: apache → access.log → fluentd in_tail → buffer → Amazon S3, written as 2013-01-01/01/access.log.gz, 2013-01-01/02/access.log.gz, 2013-01-01/03/access.log.gz, ...): ✓ slice files based on time ✓ retry automatically ✓ exponential retry wait ✓ persistent on a file
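    A sketch of an out_s3 match block (credentials, bucket, and paths are
    placeholders); time_slice_format produces the time-sliced object
    names shown above:

        <match apache.access>
          type s3
          aws_key_id YOUR_AWS_KEY_ID
          aws_sec_key YOUR_AWS_SECRET_KEY
          s3_bucket your-log-bucket
          path logs/
          buffer_path /var/log/fluentd/s3
          time_slice_format %Y-%m-%d/%H
          time_slice_wait 10m
        </match>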
  • 40. out_hdfs (diagram: apache → access.log → fluentd in_tail → buffer → HDFS, written as 2013-01-01/01/access.log.gz, 2013-01-01/02/access.log.gz, 2013-01-01/03/access.log.gz, ...): ✓ custom text formatter ✓ slice files based on time ✓ retry automatically ✓ exponential retry wait ✓ persistent on a file
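    Writing to HDFS is typically done with fluent-plugin-webhdfs, which
    talks to the NameNode's WebHDFS API. A sketch (hostname and path are
    examples; the %Y%m%d_%H placeholders in the path yield hourly files):

        <match apache.access>
          type webhdfs
          host namenode.example.com
          port 50070
          path /log/access/%Y%m%d_%H.access.log
        </match>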
  • 41. out_hdfs with multiple fluentd nodes (diagram: apache → access.log → in_tail fluentd → downstream fluentd nodes → HDFS): ✓ automatic fail-over ✓ load balancing ✓ slice files based on time ✓ retry automatically ✓ exponential retry wait ✓ persistent on a file
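    Fail-over and load balancing come from out_forward: listing several
    <server> blocks balances load across them, and a server marked
    standby is used only when the others are down (addresses are
    examples):

        <match **>
          type forward
          <server>
            host 192.168.1.10
            port 24224
          </server>
          <server>
            host 192.168.1.11
            port 24224
            standby
          </server>
        </match>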
  • 42. Fluentd examples
  • 43. Fluentd at Treasure Data - REST API logs. API servers (Rails app + fluentd, connected by fluent-logger-ruby + in_forward) → out_forward → fluentd on a watch server.
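    A sketch of posting an event from a Rails app with fluent-logger-ruby
    to the local fluentd's in_forward (tag and record are examples):

        require 'fluent-logger'

        # connect once, e.g. in an initializer
        Fluent::Logger::FluentLogger.open(nil, :host => 'localhost', :port => 24224)

        # emit an event with tag "myapp.access"
        Fluent::Logger.post("myapp.access", {"agent" => "foo", "status" => 200})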
  • 44. Fluentd at Treasure Data - backend logs. API servers (Rails app + fluentd) and worker servers (Ruby app + fluentd), both via fluent-logger-ruby + in_forward → out_forward → fluentd on a watch server.
  • 45. Fluentd at Treasure Data - monitoring. API servers (Rails app + fluentd) and worker servers (Ruby app + fluentd) via fluent-logger-ruby + in_forward; the queue (PerfectQueue) is polled by a script via in_exec → out_forward → fluentd on the watch server.
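    A sketch of the in_exec side of such monitoring, assuming a
    hypothetical script that prints tab-separated queue statistics
    (the script path and key names are made up for illustration):

        <source>
          type exec
          command ruby /opt/monitor/queue_stats.rb
          keys queue,length
          tag monitor.queue
          run_interval 60s
        </source>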
  • 46. Fluentd at Treasure Data - Hadoop logs. A script calls the Hadoop JobTracker thrift API and feeds the results to fluentd on the watch server via in_exec: ✓ resource consumption statistics for each user ✓ capacity monitoring
  • 47. Fluentd at Treasure Data - store & analyze. fluentd on the watch server → out_tdlog → Treasure Data (for historical analysis); → out_metricsense → Librato Metrics (for realtime analysis; ✓ streaming aggregation).
  • 48. Plugin development
  • 49. Writing an input plugin:
        class SomeInput < Fluent::Input
          Fluent::Plugin.register_input('myin', self)
          config_param :tag, :string

          def start
            Thread.new {
              while true
                time = Engine.now
                record = {"user" => 1, "size" => 1}
                Engine.emit(@tag, time, record)
              end
            }
          end

          def shutdown
            ...
          end
        end

        <source>
          type myin
          tag myapp.api.heartbeat
        </source>
  • 50. Writing a buffered output plugin:
        class SomeOutput < Fluent::BufferedOutput
          Fluent::Plugin.register_output('myout', self)
          config_param :myparam, :string

          def format(tag, time, record)
            [tag, time, record].to_json + "\n"
          end

          def write(chunk)
            puts chunk.read
          end
        end

        <match **>
          type myout
          myparam foobar
        </match>
  • 51. Extending in_tail with a custom parser:
        class MyTailInput < Fluent::TailInput
          Fluent::Plugin.register_input('mytail', self)

          def configure_parser(conf)
            ...
          end

          def parse_line(line)
            array = line.split("\t")
            time = Engine.now
            record = {"user" => array[0], "item" => array[1]}
            return time, record
          end
        end

        <source>
          type mytail
        </source>
  • 52. Fluentd v11: error stream, streaming processing, better DSL, multiprocess.