Big Data Day LA 2016/ Big Data Track - Fluentd and Embulk: Collect More Data, Grow Faster, Kazuki Ohta, CTO, Treasure Data

Since Doug Cutting invented Hadoop and Amazon Web Services released S3 ten years ago, we've seen quite a bit of innovation in large-scale data storage and processing. Although these innovations have enabled engineers to build data infrastructure at scale, many engineers fail to fill their scalable systems with useful data, struggling to unify data silos or failing to collect logs from thousands of servers and millions of containers. Fluentd and Embulk are two projects that I've been involved in to solve the unsexy yet critical problem of data collection and transport. In this talk, I will give an overview of Fluentd and Embulk and survey how they are used at companies like Microsoft and Atlassian and in projects like Docker and Kubernetes.

Published in: Technology

  1. FLUENTD: COLLECT MORE DATA, GROW FASTER July 9, 2016 Kazuki Ohta CTO & Co-Founder of Treasure Data
  2. WHO AM I? Kazuki Ohta CTO/Co-founder@Treasure Data Founder and Chair@Hadoop User Group Japan (2000+ members) High-Performance Computing researcher GitHub: @kzk Grew up on Open Source Linux KDE Committer at age 18 Maintainer of Fluentd and related plugins. Early Hadoop community contributor. Hired several open source committers@Treasure Data (we are hiring!)
  3. FIRST 10 YEARS: STORE AND PROCESS AT SCALE
  4. NEXT 10 YEARS: EFFECTIVE DATA COLLECTION Total Data volume What’s collected effectively Data volume Time
  5. SIMPLIFY MESSY DATA PIPELINES: M X N → M + N
  6. WHAT’S FLUENTD? Data collector for Unified Logging Layer. Streaming data transfer based on MessagePack. Robust core and plugins written in Ruby and C. 300+ community contributed plugins (a certification process in the works): www.fluentd.org/plugins. Apache License, Version 2.0. https://github.com/fluent/fluentd and 1000s of companies!
  7. HOW DOES FLUENTD DO THIS? HOW DOES FLUENTD WORK?
  8. WHAT’S FLUENTD? AN EXTENSIBLE & RELIABLE DATA COLLECTION TOOL. Simple core + plugins. Buffering, HA (failover), load balancing, etc. Like syslogd.
  9. CORE (common concerns): Divide & Conquer, Buffering & Retries, Error Handling, Message Routing, Parallelism. PLUGINS (use case specific): Read Data, Parse Data, Buffer Data, Write Data, Format Data.
  10. INTERNAL ARCHITECTURE “input-ish” / “output-ish”: Input, Parser, Filter, Buffer, Output, Formatter
  11. key:foo key:bar key:baz
  12. USE CASES
  13. SIMPLE FORWARDING
  14. # logs from a file
      <source>
        @type tail
        path /var/log/httpd.log
        format apache2
        tag backend.apache
      </source>
      # logs from client libraries
      <source>
        @type forward
        port 24224
      </source>
      # store logs to MongoDB
      <match backend.*>
        @type mongo
        database fluent
        collection test
      </match>
  15. LESS SIMPLE FORWARDING - At-most-once / At-least-once - HA (failover) - Load-balancing
  16. LAMBDA ARCHITECTURE
  17. # logs from a file
      <source>
        @type tail
        path /var/log/httpd.log
        format apache2
        tag web.access
      </source>
      # logs from client libraries
      <source>
        @type forward
        port 24224
      </source>
      # store logs to ES and HDFS
      <match *.*>
        @type copy
        <store>
          @type elasticsearch
          logstash_format true
        </store>
        <store>
          @type webhdfs
          host namenode
          port 50070
          path /path/on/hdfs/
        </store>
      </match>
  18. LOGGING TO GCP
  19. # logs from a file
      <source>
        @type tail
        path /var/log/httpd.log
        format apache2
        tag web.access
      </source>
      # logs from client libraries
      <source>
        @type forward
        port 24224
      </source>
      # store logs to GCP
      <match **>
        @type google_cloud
        detect_subservice false
        buffer_chunk_limit 2M
        buffer_queue_limit 8
        flush_interval 5s
        max_retry_wait 30
        # retry forever
        disable_retry_limit
      </match>
  20. CONTAINER LOGGING
  21. LOGGING FOR THE CONTAINER AGE 200+ outputs
  22. WHAT’S NEXT FOR DATA COLLECTION?
      Treasure Data joins Cloud Native Computing Foundation (CNCF)
        1. Improving Kubernetes and Docker integration
      Fluentd v0.14 released!
        1. Windows support
        2. Nanosecond time precision (asked for by Google and others)
        3. New API to simplify writing plugins
      Embulk: Fluentd’s Batch Friend
        1. Created by Fluentd creator
        2. Dozens of plugins already
        3. Contributions welcome!
  23. www.treasuredata.com www.fluentd.org THANKS!
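
The <match> patterns in the configurations on slides 14, 17, and 19 (backend.*, *.*, **) route events by tag. As a rough illustration only, and not Fluentd's actual implementation, the Python sketch below models the documented wildcard behavior: * matches exactly one dot-separated tag part, and ** matches zero or more parts. The function names tag_matches and _match are hypothetical, and other pattern forms (such as {a,b} alternatives) are left out.

```python
def tag_matches(pattern, tag):
    """Return True if a dot-separated tag matches a wildcard pattern.

    Simplified model (an assumption, not Fluentd's own code):
      *   matches exactly one tag part
      **  matches zero or more tag parts
    """
    return _match(pattern.split("."), tag.split("."))


def _match(pat, parts):
    # Pattern exhausted: succeed only if the tag is exhausted too.
    if not pat:
        return not parts
    head, rest = pat[0], pat[1:]
    if head == "**":
        # Let ** consume 0, 1, 2, ... remaining tag parts.
        return any(_match(rest, parts[i:]) for i in range(len(parts) + 1))
    if not parts:
        return False
    if head == "*" or head == parts[0]:
        return _match(rest, parts[1:])
    return False


if __name__ == "__main__":
    for pattern, tag in [
        ("backend.*", "backend.apache"),
        ("backend.*", "backend.apache.error"),
        ("*.*", "web.access"),
        ("**", "web.access"),
    ]:
        print(pattern, tag, tag_matches(pattern, tag))
```

Under this model, the backend.* pattern on slide 14 accepts the backend.apache tag set by the tail source but would not accept a three-part tag, while the ** pattern on slide 19 accepts every tag.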

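Slide 19 sets max_retry_wait 30 together with disable_retry_limit, so a failed flush is retried indefinitely while the wait between attempts is capped at 30 seconds. The Python sketch below shows how such a capped backoff schedule could look. It assumes a wait that starts at 1 second and doubles after each failure; the slide does not state the initial wait or the growth factor, and any randomization of the wait is ignored here. retry_waits is a hypothetical helper, not a Fluentd API.

```python
from itertools import islice


def retry_waits(initial_wait=1.0, max_wait=30.0):
    """Yield an endless sequence of retry waits in seconds.

    Assumptions (not taken from the slides): the wait starts at
    `initial_wait` and doubles after every failed attempt. The cap
    `max_wait` plays the role of max_retry_wait; the generator never
    stops, which mirrors disable_retry_limit.
    """
    wait = initial_wait
    while True:
        yield min(wait, max_wait)
        wait *= 2


if __name__ == "__main__":
    # First seven waits under the assumed schedule.
    print(list(islice(retry_waits(), 7)))
```

With these assumed defaults the first seven waits are 1, 2, 4, 8, 16, 30, and 30 seconds: the schedule grows quickly at first and then settles at the cap, which keeps an at-least-once pipeline retrying without hammering an unavailable destination.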