Fluentd and Embulk Game Server 4

Masahiro Nakagawa
Apr 18, 2015
Game Server meetup #4
Fluentd / Embulk
For reliable transfer

Who are you?
> Masahiro Nakagawa
> github/twitter: @repeatedly
> Treasure Data, Inc.
> Senior Software Engineer
> Fluentd / td-agent developer
> Living at OSS :)
> D language - Phobos committer
> Fluentd - Main maintainer
> MessagePack / RPC - D and Python (only RPC)
> The organizer of several meetups (Presto, DTM, etc…)
> etc…

Structured logging
Reliable forwarding
Pluggable architecture
http://fluentd.org/

What’s Fluentd?
> Data collector for unified logging layer
> Streaming data transfer based on JSON
> Written in Ruby
> Gem based various plugins
> http://www.fluentd.org/plugins
> Working in production
> http://www.fluentd.org/testimonials

Data Analytics Flow
Data source → Collect → Store → Process → Visualize → Reporting / Monitoring

Data Analytics Flow
> Store / Process: Cloudera, Hortonworks, Treasure Data
> Visualize: Tableau, Excel, R
> easier & shorter time
> Collect: ???

TD Service Architecture
> Time to Value: Acquire → Store → Analyze → Result Push (send query result)
> Data sources: Web Log, App Log, Sensor, CRM, ERP, RDBMS, POS
> Acquire
> Streaming Collector: Treasure Agent (Server), SDK (JS, Android, iOS, Unity)
> Bulk Uploader: Embulk, TD Toolbelt
> Store
> Plazma DB: Flexible, Scalable, Columnar Storage
> @AWS or @IDCF
> Analyze
> SQL-based query (SQL, Pig)
> Batch / Reliability
> Ad-hoc / Low latency
> Result Push
> KPI Dashboard
> BI Tools: Metric Insights, Tableau, Motion Board etc.
> Other Products: RDBMS, Google Docs, AWS S3, FTP Server, etc.
> Connectivity: REST API, ODBC / JDBC
> Economy & Flexibility, Simple & Supported

Divide & Conquer & Retry
[Diagram: a Batch, a Stream, and an Other stream, each divided into pieces; individual pieces are marked "error" and "retry"]
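
The idea can be sketched in a few lines of Ruby: divide the events into small chunks and retry only the chunk that failed, with a growing wait between attempts. This is a minimal sketch, not Fluentd's buffer implementation. The method name `flush_in_chunks`, its parameters, and the injectable `sleeper` are all hypothetical.

```ruby
# Sketch only: flush_in_chunks and its parameters are hypothetical,
# not Fluentd's internal API.
def flush_in_chunks(events, chunk_size:, max_retries:, sleeper: ->(sec) { sleep(sec) })
  failed = []
  events.each_slice(chunk_size) do |chunk|
    attempts = 0
    begin
      yield chunk                        # try to write one chunk
    rescue StandardError
      attempts += 1
      if attempts <= max_retries
        sleeper.call(2**(attempts - 1))  # wait 1s, 2s, 4s, ... before retrying
        retry                            # re-run only this chunk
      end
      failed << chunk                    # give up on this chunk only
    end
  end
  failed
end

# Usage: the chunk [3, 4] fails twice, then succeeds on the third attempt.
delivered = []
calls = Hash.new(0)
waits = []
failed = flush_in_chunks((1..6).to_a, chunk_size: 2, max_retries: 3,
                         sleeper: ->(sec) { waits << sec }) do |chunk|
  calls[chunk] += 1
  raise IOError, "temporary failure" if chunk == [3, 4] && calls[chunk] <= 2
  delivered.concat(chunk)
end
```

The other chunks are written once each. Only the failing chunk pays the retry cost.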

Before…
> Applications on Server1, Server2, Server3, ･･･ send logs to a Log Server
> High Latency!
> must wait for a day...

After…
> Applications on Server1, Server2, Server3, ･･･ each send logs to a local Fluentd
> Local Fluentd instances forward to aggregating Fluentd instances
> In streaming!

Why JSON / MessagePack? (1)
> Schema on Write (Traditional MPP DB)
> Writing data using a schema to improve query performance
> Pros
> Minimum query overhead
> Cons
> Need to design schema and workload beforehand
> Data load is an expensive operation

Why JSON / MessagePack? (2)
> Schema on Read (Hadoop)
> Writing data without a schema and mapping the schema at query time
> Pros
> Robust against schema and workload changes
> Data load is a cheap operation
> Cons
> High overhead at query time
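
A small Ruby sketch of schema on read: records are stored as raw JSON lines with no schema, and a schema is applied only when the data is queried. The sample records, the `SCHEMA` map, and `read_with_schema` are hypothetical and exist only for illustration.

```ruby
require 'json'

# Load step: raw JSON lines are stored as-is; no schema is needed up front.
raw_log = [
  '{"user": 38, "action": "login"}',
  '{"user": "41", "action": "purchase", "price": "120"}', # shape changed later
]

# Query step: a hypothetical column => converter map, applied at read time.
SCHEMA = { "user" => :to_i, "price" => :to_i }.freeze

def read_with_schema(lines, schema)
  lines.map do |line|
    record = JSON.parse(line)
    # Missing columns become nil; present ones are cast by the schema.
    schema.map { |column, cast| [column, record[column]&.public_send(cast)] }.to_h
  end
end

rows = read_with_schema(raw_log, SCHEMA)
```

Loading stays cheap even though the second record changed shape. The casting cost is paid on every read, which is the query-time overhead listed under Cons.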

Core
> Divide & Conquer
> Buffering & Retrying
> Error handling
> Message routing
> Parallelism
Plugins
> Read / receive data
> Parse data
> Filter data
> Buffer data
> Format data
> Write / send data
Core Plugins
> Core: Common Concerns
> Plugins: Use Case Specific

Event structure (log message)
✓ Time
> second resolution by default
> from data source
✓ Tag
> for message routing
> where is it from?
✓ Record
> JSON format
> MessagePack internally
> schema-free
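
As a rough illustration, an event can be modeled as a (tag, time, record) triple. The `Event` struct and `format_event` helper below are hypothetical stand-ins, not Fluentd's internal classes. The record is rendered as JSON here for readability, whereas Fluentd uses MessagePack internally.

```ruby
require 'json'

# Hypothetical stand-in for a Fluentd event; not Fluentd's internal class.
Event = Struct.new(:tag, :time, :record)

# Time is kept as integer epoch seconds (second resolution).
event = Event.new("myapp.login", Time.utc(2014, 12, 11, 7, 56, 1).to_i, { "user" => 38 })

# Render an event as one line: time, tag, JSON record.
def format_event(event)
  time = Time.at(event.time).utc.strftime("%Y-%m-%d %H:%M:%S")
  "#{time} #{event.tag} #{JSON.generate(event.record)}"
end

line = format_event(event)
```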

Architecture (v0.12 or later)
> Engine (not pluggable): Input → Filter → Output
> Input: Forward, File tail, ...
> Parser
> Filter: grep, record_transformer, …
> Output: Forward, File, ...
> Formatter
> Buffer: File, Memory

Configuration and operation
> No central / master node
> @include helps configuration sharing
> Operation depends on your environment
> Use your daemon / deploy tools
> Use Chef in Treasure Data
> Apache-like syntax

Setup fluentd (e.g. Ubuntu)
$ apt-get install ruby
$ gem install fluentd
$ edit fluent.conf
$ fluentd -c fluent.conf
http://docs.fluentd.org/articles/faq#w-what-version-of-ruby-does-fluentd-support

Treasure Agent (td-agent)
> Treasure Data distribution of Fluentd
> Includes Ruby, popular plugins, etc.
> Treasure Agent 2 is current stable
> v2 is recommended over v1
> rpm, deb and dmg
> Latest version is 2.2.0 with fluentd v0.12

Setup td-agent
$ curl -L http://toolbelt.treasuredata.com/sh/install-redhat-td-agent2.sh | sh
$ edit /etc/td-agent/td-agent.conf
$ sudo service td-agent start
See: http://docs.fluentd.org/categories/installation

Apache to Mongo
Web Server → (tail) → Fluentd (event buffering, routing) → (insert) → MongoDB

Access log lines read by tail:
127.0.0.1 - - [11/Dec/2014:07:26:27] "GET / ...
127.0.0.1 - - [11/Dec/2014:07:26:30] "GET / ...
127.0.0.1 - - [11/Dec/2014:07:26:32] "GET / ...
127.0.0.1 - - [11/Dec/2014:07:26:40] "GET / ...
127.0.0.1 - - [11/Dec/2014:07:27:01] "GET / ...
...

Resulting event (time, tag, record):
2014-02-04 01:33:51
apache.log
{
  "host": "127.0.0.1",
  "method": "GET",
  ...
}

Plugins - use rubygems
$ fluent-gem search -rd fluent-plugin
$ fluent-gem search -rd fluent-mixin
$ fluent-gem install fluent-plugin-mongo
In td-agent:
$ /usr/sbin/td-agent-gem install fluent-plugin-mongo

# receive events via HTTP
<source>
  @type http
  port 8888
</source>

# read logs from a file
<source>
  @type tail
  path /var/log/httpd.log
  format apache
  tag apache.access
</source>

# save access logs to MongoDB
<match apache.access>
  @type mongo
  database apache
  collection log
</match>

# save alerts to a file
<match alert.**>
  @type file
  path /var/log/fluent/alerts
</match>

# forward other logs to servers
<match **>
  @type forward
  <server>
    host 192.168.0.11
    weight 20
  </server>
  <server>
    host 192.168.0.12
    weight 60
  </server>
</match>

@include http://example.com/conf
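
To make the tag patterns in this configuration concrete, here is a rough Ruby sketch of tag-based routing. It assumes the documented matching rules: `*` matches exactly one tag part, `**` matches zero or more parts, and match sections are tried from top to bottom. The helper names `tag_match?`, `match_parts`, and `route` are hypothetical, and this is not Fluentd's own matcher.

```ruby
# Sketch only: hypothetical helpers, not Fluentd's implementation.
def tag_match?(pattern, tag)
  match_parts(pattern.split('.'), tag.split('.'))
end

def match_parts(pattern_parts, tag_parts)
  return tag_parts.empty? if pattern_parts.empty?
  head, *rest = pattern_parts
  case head
  when '**'
    # '**' may consume zero or more tag parts.
    (0..tag_parts.size).any? { |n| match_parts(rest, tag_parts.drop(n)) }
  when '*'
    # '*' consumes exactly one tag part.
    !tag_parts.empty? && match_parts(rest, tag_parts.drop(1))
  else
    tag_parts.first == head && match_parts(rest, tag_parts.drop(1))
  end
end

# Patterns in the same order as the <match> sections above.
RULES = [
  ["apache.access", "mongo"],
  ["alert.**",      "file"],
  ["**",            "forward"],
].freeze

# Return the output of the first rule whose pattern matches the tag.
def route(tag)
  rule = RULES.find { |pattern, _output| tag_match?(pattern, tag) }
  rule && rule.last
end
```

With these rules, `route("apache.access")` picks the mongo output, `route("alert.disk.full")` picks file, and any other tag falls through to forward. Because order matters, the catch-all `**` goes last.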

Filter
> Apply filtering routine to event stream
> No more tag tricks!

v0.10:
<match access.**>
  @type record_reformer
  tag reformed.${tag}
</match>

<match reformed.**>
  @type growthforecast
</match>

v0.12:
<filter access.**>
  @type record_transformer
  …
</filter>

<match access.**>
  @type growthforecast
</match>

M x N → M + N
> Inputs: Access logs (Apache), App logs (Frontend, Backend), System logs (syslogd), Databases
> Fluentd: buffering / processing / routing
> Outputs: Alerting (Nagios), Analysis (MongoDB, MySQL, Hadoop), Archiving (Amazon S3)
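
A quick worked example of the M x N → M + N claim. The counts are hypothetical, chosen only to show the arithmetic: wiring every source directly to every destination needs M x N integrations, while putting a collector in the middle needs M + N.

```ruby
# Hypothetical counts, for illustration only.
sources      = 4  # M
destinations = 5  # N

point_to_point = sources * destinations  # every source wired to every destination
via_collector  = sources + destinations  # every system wired once to the collector
```

Here that is 20 integrations versus 9, and each new destination adds one integration instead of four.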

Roadmap
> v0.10 (old stable)
> v0.12 (current stable)
> Filter / Label / At-least-once
> v0.14 (spring - early summer, 2015)
> New plugin APIs, ServerEngine, Time…
> v1 (summer - fall, 2015)
> Fix new features / APIs
https://github.com/fluent/fluentd/wiki/V1-Roadmap

# logs from a file
<source>
  type tail
  pos_file /tmp/pos_file
  format apache2
  tag backend.apache
</source>

# logs from client libraries
<source>
  type forward
  port 24224
</source>

# store logs to MongoDB
<match backend.*>
  type mongo
  database fluent
  collection test
</match>

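Client libraries
> Ruby
> Java
> Perl
> PHP
> Python
> D
> Scala
> ...

# Ruby (simplified; with the fluent-logger gem the calls are
# Fluent::Logger::FluentLogger.open and Fluent::Logger.post)
Fluent.open("myapp")
Fluent.event("login", {"user" => 38})
#=> 2014-12-11 07:56:01 myapp.login {"user":38}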

Less Simple Forwarding
- At-most-once / At-least-once 
- HA (failover)

- Load-balancing
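
A rough Ruby sketch of weighted load-balancing with failover. The hosts and weights are borrowed from the earlier `<server>` example. The deterministic round-robin schedule is an assumption made for illustration and is not Fluentd's actual selection algorithm. `Server` and `build_schedule` are hypothetical names.

```ruby
# Sketch only: a deterministic weighted round-robin with failover.
# The scheduling is an assumption, not Fluentd's actual algorithm.
Server = Struct.new(:host, :weight, :alive)

# Build a repeating schedule in which each live server appears
# in proportion to its weight.
def build_schedule(servers)
  alive = servers.select(&:alive)
  return [] if alive.empty?
  unit = alive.map(&:weight).reduce(:gcd)  # 20 and 60 share a unit of 20
  alive.flat_map { |s| [s.host] * (s.weight / unit) }
end

servers = [
  Server.new("192.168.0.11", 20, true),
  Server.new("192.168.0.12", 60, true),
]

schedule = build_schedule(servers)         # one slot for .11, three for .12
picks = Array.new(8) { |i| schedule[i % schedule.size] }

servers[1].alive = false                   # simulate a node going down
failover = build_schedule(servers)         # traffic moves to the survivor
```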

Near realtime and batch combo!
> Hot data
> All data

# logs from a file
<source>
  type tail
  pos_file /tmp/pos_file
  format apache2
  tag web.access
</source>

# logs from client libraries
<source>
  type forward
  port 24224
</source>

# store logs to ES and HDFS
<match web.*>
  type copy
  <store>
    type elasticsearch
    logstash_format true
  </store>
  <store>
    type webhdfs
    host namenode
    port 50070
    path /path/on/hdfs/
  </store>
</match>

CEP for Stream Processing
Norikra is a SQL based CEP engine: http://norikra.github.io/

Fluentd on Kubernetes / GCE
> Kubernetes
> Google Compute Engine
> https://cloud.google.com/logging/docs/install/compute_install

Treasure Data
> Frontend, Job Queue, Worker, Hadoop, Presto
> Applications push metrics to Fluentd (via local Fluentd)
> Fluentd sums up data per minute (partial aggregation)
> Datadog for realtime monitoring
> Treasure Data for historical analysis
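
Partial aggregation can be pictured as summing metric values per minute before sending them on, so the monitoring backend receives one data point per metric per minute instead of every raw event. This is a minimal Ruby sketch. The metric names, the `[epoch_seconds, name, value]` layout, and `sum_per_minute` are hypothetical.

```ruby
# Sketch only: metric names and the tuple layout are hypothetical.
# Each metric is [epoch_seconds, name, value].
def sum_per_minute(metrics)
  metrics.each_with_object(Hash.new(0)) do |(time, name, value), sums|
    minute = time - (time % 60)  # truncate to the start of the minute
    sums[[minute, name]] += value
  end
end

metrics = [
  [1_429_315_200, "jobs.finished", 3],
  [1_429_315_230, "jobs.finished", 2],
  [1_429_315_260, "jobs.finished", 4],
  [1_429_315_215, "jobs.failed",   1],
]

per_minute = sum_per_minute(metrics)
```

The four raw metrics collapse into three per-minute sums.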

Cookpad
✓ Over 100 RoR servers (2012/2/4)
> Hundreds of app servers: each Rails app sends event logs to its td-agent
> td-agent → Treasure Data: logs are available after several mins.
> Daily/Hourly Batch → MySQL, Google Spreadsheet
> KPI visualization, Feedback rankings
> Unlimited scalability
> Flexible schema
> Realtime
> Less performance impact

Slideshare
http://engineering.slideshare.net/2014/04/skynet-project-monitor-scale-and-auto-heal-a-system-in-the-cloud/

Log Analysis System And its designs in LINE Corp. 2014 early

Line BusinessConnect
http://developers.linecorp.com/blog/?p=3386

fluent-bit
> Made for Embedded Linux
> OpenEmbedded & Yocto Project
> Intel Edison, RasPi & BeagleBone Black boards
> https://github.com/fluent/fluent-bit
> Standalone application or Library mode
> Built-in plugins
> input: cpu, kmsg, output: fluentd
> First release at the end of Mar 2015

fluentd-forwarder
> Forwarding agent written in Go
> Focused on forwarding logs to Fluentd
> Works on Windows
> Bundles TCP input/output and TD output
> No flexible plugin mechanism
> We have a plan to add some input/output
> Similar products
> fluent-agent-lite, fluent-agent-hydra, ik

fluentd-ui
> Manage Fluentd instance via Web UI
> https://github.com/fluent/fluentd-ui

Bulk loading
Parallel processing
Pluggable architecture
http://embulk.org/

The problems at Treasure Data
> Treasure Data Service on the Cloud
> Customers want to try Treasure Data, but
> SEs write scripts to bulk load their data. Hard work :(
> Customers want to migrate their big data, but
> Hard work :(
> Fluentd solved streaming data collection, but
> bulk data loading is another problem.

Embulk
> Bulk Loader version of Fluentd
> Pluggable architecture
> JRuby, JVM languages
> High performance parallel processing
> Share your script as a plugin
> https://github.com/embulk

The problems of bulk load
> Data cleaning (normalization)
> How to normalize broken records?
> Error handling
> How to remove broken records?
> Idempotent retrying
> How to retry without duplicated loading?
> Performance optimization

Embulk: bulk load
> Input / output plugins: HDFS, MySQL, Amazon S3, CSV Files, SequenceFile, Salesforce.com, Elasticsearch, Cassandra, Hive, Redis
✓ Parallel execution
✓ Data validation
✓ Error recovery
✓ Deterministic behaviour
✓ Idempotent retrying
http://www.embulk.org/plugins/

Setup embulk (e.g. Linux/Mac)
$ curl --create-dirs -o ~/.embulk/bin/embulk -L "http://dl.embulk.org/embulk-latest.jar"
$ chmod +x ~/.embulk/bin/embulk
$ echo 'export PATH="$HOME/.embulk/bin:$PATH"' >> ~/.bashrc
$ source ~/.bashrc

Try example
$ embulk example ./try1
$ embulk guess ./example.yml -o config.yml
$ embulk preview config.yml
$ embulk run config.yml

Guess format & schema

# install
$ wget http://dl.embulk.org/embulk-latest.jar -O embulk.jar
$ chmod 755 embulk.jar

# guess
$ vi example.yml
$ ./embulk guess example.yml -o config.yml

example.yml:
in:
  type: file
  path_prefix: /path/to/sample_
out:
  type: stdout

config.yml (guessed by guess plugins):
in:
  type: file
  decoders:
  - {type: gzip}
  parser:
    charset: UTF-8
    newline: CRLF
    type: csv
    delimiter: ','
    quote: '"'
    skip_header_lines: 1
    columns:
    - {name: id, type: long}
    - {name: account, type: long}
    - {name: time, type: timestamp, format: '%Y-%m-%d %H:%M:%S'}
    - {name: purchase, type: timestamp, format: '%Y%m%d'}
    - {name: comment, type: string}
out:
  type: stdout
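
To give a feel for what guessing involves, here is a toy Ruby heuristic that infers a column type from sample values. It is a hypothetical sketch and not the logic of Embulk's guess plugins. `TIMESTAMP_PATTERNS`, `guess_column`, and the sample values are made up for illustration.

```ruby
# Sketch only: a hypothetical heuristic, not Embulk's guess plugins.
# Each candidate timestamp format is paired with a regexp that recognizes it.
TIMESTAMP_PATTERNS = {
  '%Y-%m-%d %H:%M:%S' => /\A\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\z/,
  '%Y%m%d'            => /\A\d{8}\z/,
}.freeze

def guess_column(name, samples)
  # Timestamp wins if every sample fits one known format.
  format, _pattern = TIMESTAMP_PATTERNS.find do |_format, pattern|
    samples.all? { |s| s.match?(pattern) }
  end
  return { name: name, type: 'timestamp', format: format } if format
  return { name: name, type: 'long' } if samples.all? { |s| s.match?(/\A-?\d+\z/) }
  { name: name, type: 'string' }
end

# Hypothetical sample values per column.
samples = {
  'id'       => %w[1 2 3],
  'time'     => ['2015-01-27 19:23:49', '2015-01-27 19:01:23'],
  'purchase' => %w[20150127 20150128],
  'comment'  => %w[embulk jruby],
}

columns = samples.map { |name, values| guess_column(name, values) }
```

A guess like this can be wrong: an eight-digit ID would be taken for a date. That is one reason to preview the result and adjust the config before running a load.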

Preview & fix config

# install
# guess
# preview
$ ./embulk preview config.yml
$ vi config.yml # if necessary

+-------------------------+----------+-------------+
|          time:timestamp | uid:long | word:string |
+-------------------------+----------+-------------+
| 2015-01-27 19:23:49 UTC |   32,864 |      embulk |
| 2015-01-27 19:01:23 UTC |   14,824 |       jruby |
| 2015-01-28 02:20:02 UTC |   27,559 |      plugin |
| 2015-01-29 11:54:36 UTC |   11,270 |     fluentd |
+-------------------------+----------+-------------+

Deterministic run

# install
# guess
# preview
# run
$ ./embulk run config.yml -o config.yml

config.yml written by the run:
exec: {}
in:
  type: file
  decoders:
  - {type: gzip}
  parser:
    charset: UTF-8
    newline: CRLF
    type: csv
    delimiter: ','
    quote: '"'
    columns:
    - {name: time, type: timestamp, format: '%Y%m%d'}
  last_path: /path/to/sample_001.csv.gz
out:
  type: stdout

Repeat

# install
# guess
# preview
# run
# repeat

exec: {}
in:
  type: file
  decoders:
  - {type: gzip}
  parser:
    charset: UTF-8
    newline: CRLF
    type: csv
    delimiter: ','
    quote: '"'
    columns:
    - {name: time, type: timestamp, format: '%Y%m%d'}
  last_path: /path/to/sample_01.csv.gz
out:
  type: stdout
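
The `last_path` entry is what makes repeated runs safe. A rough Ruby sketch of the idea follows, assuming files are processed in lexicographic order and `last_path` names the last file loaded by the previous run. `next_run` and the file names are hypothetical, and this is not Embulk's implementation.

```ruby
# Sketch only: next_run is a hypothetical helper, not Embulk's implementation.
# It assumes files are processed in lexicographic order and that
# last_path names the last file loaded by the previous run.
def next_run(paths, path_prefix, last_path)
  todo = paths.select { |p| p.start_with?(path_prefix) }.sort
  todo = todo.select { |p| p > last_path } if last_path
  { files: todo, last_path: todo.last || last_path }
end

paths = %w[
  /path/to/sample_01.csv.gz
  /path/to/sample_02.csv.gz
  /path/to/other.csv.gz
]

first = next_run(paths, '/path/to/sample_', nil)                  # loads 01 and 02
again = next_run(paths, '/path/to/sample_', first[:last_path])    # nothing new
paths << '/path/to/sample_03.csv.gz'
third = next_run(paths, '/path/to/sample_', again[:last_path])    # loads only 03
```

Running the same command again loads nothing twice, and a later run picks up only the files that arrived after the recorded `last_path`.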

Other cases
> Treasure Data
> Embulk worker for automatic import
> Web services
> Send existing logs to Elasticsearch
> Business / Batch systems
> Database to Database
> etc…

Check: treasuredata.com
Cloud service for the entire data pipeline
