How to collect Big Data into Hadoop

Big Data processing to collect Big Data


  1. 1. How to collect Big Data into Hadoop. Big Data processing to collect Big Data. fluentd.org. Sadayuki Furuhashi
  2. 2. Self-introduction > Sadayuki Furuhashi > Treasure Data, Inc. Founder & Software Architect > Open source projects: MessagePack - efficient serializer (original author); Fluentd - event collector (original author)
  3. 3. We’re Hiring! sf@treasure-data.com
  4. 4. Today’s topic
  5. 5. Big Data: Report & Monitor
  6. 6. Big Data: Collect, Store, Process, Visualize, Report & Monitor
  7. 7. easier & shorter time: Collect, Store, Process, Visualize. Store & process: Cloudera, Hortonworks, MapR. Visualize: Excel, Tableau, R
  8. 8. How to shorten here? easier & shorter time: Collect, Store, Process, Visualize. Store & process: Cloudera, Hortonworks, MapR. Visualize: Excel, Tableau, R
  9. 9. Problems to collect data
  10. 10. Poor man’s data collection: 1. Copy files from servers using rsync. 2. Create a RegExp to parse the files. 3. Parse the files and generate a 10GB CSV file. 4. Put it into HDFS.
  11. 11. Problems to collect “big data” > Includes broken values (needs error handling & retrying) > Time-series data are changing and unclear (parse logs before storing) > Takes time to read/write (tools have to be optimized and parallelized) > Takes time for trial & error > Causes network traffic spikes
  12. 12. Problem of poor man’s data collection: > Wastes time to implement error handling > Wastes time to maintain a parser > Wastes time to debug the tool > Not reliable > Not efficient
  13. 13. Basic theories to collect big data
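The four steps above can be sketched in Ruby. The log format, regexp, and field names here are illustrative assumptions, not from the talk; they only show why this approach is fragile (one hand-written regexp, silent drops, one giant output file):

```ruby
# A minimal sketch of the "poor man's" pipeline: parse rsync'd log
# files with a regexp and emit one big CSV (steps 2-3 of the slide).
# The common-log-style regexp and field names are hypothetical.
require 'csv'

LINE = /^(?<host>\S+) \S+ \S+ \[(?<time>[^\]]+)\] "(?<method>\S+) (?<path>\S+)/

def logs_to_csv(lines)
  CSV.generate do |csv|
    csv << %w[host time method path]
    lines.each do |line|
      m = LINE.match(line) or next  # broken lines are silently dropped
      csv << [m[:host], m[:time], m[:method], m[:path]]
    end
  end
end
```

Note that every concern the next slides raise (broken values, retries, parallelism) is simply absent here: a malformed line vanishes without a trace.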
  14. 14. Divide & Conquer: split the data into chunks, so an error affects only one chunk
  15. 15. Divide & Conquer & Retry: failed chunks are retried individually
  16. 16. Streaming: don’t handle big files at the source - do it at the destination
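A minimal sketch of "Divide & Conquer & Retry" in Ruby: each chunk is sent independently, and a failed send is retried with exponentially growing waits. The retry limits and the fake send below are illustrative assumptions, not Fluentd's actual internals:

```ruby
# Retry a block with exponential backoff; give up after max_retries.
def with_retry(max_retries: 5, base_wait: 1.0)
  retries = 0
  begin
    yield
  rescue
    raise if retries >= max_retries    # give up: re-raise the last error
    sleep(base_wait * (2 ** retries))  # waits grow: 1s, 2s, 4s, 8s, ...
    retries += 1
    retry
  end
end

# Each chunk succeeds or fails on its own; one bad chunk never forces
# resending the whole data set.
["chunk-a", "chunk-b"].each do |chunk|
  with_retry { puts "sending #{chunk}" }
end
```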
  17. 17. Apache Flume and Fluentd
  18. 18. Apache Flume
  19. 19. Apache Flume: access logs, app logs, system logs -> Agent -> Collector -> ...
  20. 20. Apache Flume - network topology. Flume OG: Agents send to Collectors; a central Master handles acks. Flume NG: Agents send/ack directly to Collectors.
  21. 21. Apache Flume - pipeline. Flume OG: Source -> Sink (plugin). Flume NG: Source -> Channel -> Sink.
  22. 22. Apache Flume - configuration. The Master manages all configuration (optional); Agents -> Collectors (Flume NG).
  23. 23. Apache Flume - configuration:
     # source
     host1.sources = avro-source1
     host1.sources.avro-source1.type = avro
     host1.sources.avro-source1.bind = 0.0.0.0
     host1.sources.avro-source1.port = 41414
     host1.sources.avro-source1.channels = ch_avro_log
     # channel
     host1.channels = ch_avro_log
     host1.channels.ch_avro_log.type = memory
     # sink
     host1.sinks = log-sink1
     host1.sinks.log-sink1.type = logger
     host1.sinks.log-sink1.channel = ch_avro_log
  24. 24. Fluentd
  25. 25. Fluentd - network topology. Flume NG: Agent -> Collector (send/ack). Fluentd: fluentd -> fluentd (send/ack) - every node runs the same fluentd daemon.
  26. 26. Fluentd - pipeline. Flume NG: Source -> Channel -> Sink (plugin). Fluentd: Input -> Buffer -> Output.
  27. 27. Fluentd - configuration: no central node - keep things simple. Use chef, puppet, etc. for configuration (they do things better).
  28. 28. Fluentd - configuration:
     <source>
       type forward
       port 24224
     </source>
     <match **>
       type file
       path /var/log/logs
     </match>
  29. 29. Fluentd - configuration, side by side with the equivalent Flume NG configuration.
     Fluentd:
     <source>
       type forward
       port 24224
     </source>
     <match **>
       type file
       path /var/log/logs
     </match>
     Flume NG:
     # source
     host1.sources = avro-source1
     host1.sources.avro-source1.type = avro
     host1.sources.avro-source1.bind = 0.0.0.0
     host1.sources.avro-source1.port = 41414
     host1.sources.avro-source1.channels = ch_avro_log
     # channel
     host1.channels = ch_avro_log
     host1.channels.ch_avro_log.type = memory
     # sink
     host1.sinks = log-sink1
     host1.sinks.log-sink1.type = logger
     host1.sinks.log-sink1.channel = ch_avro_log
  30. 30. Fluentd - Users
  31. 31. Fluentd - plugin distribution platform:
     $ fluent-gem search -rd fluent-plugin
     $ fluent-gem install fluent-plugin-mongo
  32. 32. Fluentd - plugin distribution platform:
     $ fluent-gem search -rd fluent-plugin
     $ fluent-gem install fluent-plugin-mongo
     94 plugins!
  33. 33. Concept of Fluentd: Customization is essential > small core + many plugins. Fluentd core helps to implement plugins > common features are already implemented.
  34. 34. Fluentd core vs Plugins. Core: Divide & Conquer, Retrying, Parallelize, Error handling, Message routing. Plugins: read / receive data, write / send data.
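One of the core features listed above, message routing, matches event tags such as `myapp.access` against glob-like `<match>` patterns. A rough Ruby sketch of such matching, as a simplification for illustration; this handles only `*` and `**`, while Fluentd's real matcher supports more pattern forms:

```ruby
# Translate a glob-like tag pattern into a regexp and test a tag.
def match_tag?(pattern, tag)
  src = Regexp.escape(pattern)
  src = src.gsub('\.\*\*') { '(\..+)?' } # ".**" = zero or more trailing parts
  src = src.gsub('\*\*')   { '.*' }      # bare "**" = match everything
  src = src.gsub('\*')     { '[^.]+' }   # "*" = exactly one tag part
  Regexp.new("\\A#{src}\\z").match?(tag)
end

match_tag?('**', 'any.tag.here')        # true
match_tag?('myapp.**', 'myapp.access')  # true
match_tag?('myapp.*', 'myapp.access')   # true
match_tag?('myapp.*', 'other.access')   # false
```

This is why the `<match **>` block in the configuration slides catches every event regardless of tag.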
  35. 35. Fluentd plugins
  36. 36. in_tail: apache -> access.log -> fluentd. ✓ read a log file ✓ custom regexp ✓ custom parser in Ruby
  37. 37. out_mongo: apache -> access.log -> in_tail -> fluentd (buffer)
  38. 38. out_mongo: apache -> access.log -> in_tail -> fluentd (buffer). ✓ retry automatically ✓ exponential retry wait ✓ persistent on a file
  39. 39. out_s3: apache -> access.log -> in_tail -> fluentd (buffer) -> Amazon S3. ✓ slice files based on time ✓ retry automatically ✓ exponential retry wait ✓ persistent on a file. Output files: 2013-01-01/01/access.log.gz, 2013-01-01/02/access.log.gz, 2013-01-01/03/access.log.gz, ...
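The "slice files based on time" feature above can be sketched in Ruby: buffered events are grouped into chunks keyed by the hour of their timestamp, so each chunk flushes to one path like `2013-01-01/01/access.log.gz`. The key format and the sample records below are modeled on the slide, not the plugin's actual defaults:

```ruby
# Key a chunk by the event hour: "2013-01-01/01".
def slice_key(time)
  time.strftime('%Y-%m-%d/%H')
end

# Group (time, record) events into per-hour chunks.
def slice_events(events)
  chunks = Hash.new { |h, k| h[k] = [] }
  events.each { |time, record| chunks[slice_key(time)] << record }
  chunks
end

events = [
  [Time.utc(2013, 1, 1, 1, 5),  { 'path' => '/a' }],
  [Time.utc(2013, 1, 1, 1, 59), { 'path' => '/b' }],
  [Time.utc(2013, 1, 1, 2, 0),  { 'path' => '/c' }],
]
slice_events(events).each_key { |k| puts "#{k}/access.log.gz" }
# prints 2013-01-01/01/access.log.gz and 2013-01-01/02/access.log.gz
```

Because each hourly chunk is an independent unit, a failed upload retries only that hour's file, not the whole day.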
  40. 40. out_hdfs: apache -> access.log -> in_tail -> fluentd (buffer) -> HDFS. ✓ custom text formatter ✓ slice files based on time ✓ retry automatically ✓ exponential retry wait ✓ persistent on a file. Output files: 2013-01-01/01/access.log.gz, 2013-01-01/02/access.log.gz, 2013-01-01/03/access.log.gz, ...
  41. 41. out_hdfs: apache -> access.log -> in_tail -> fluentd -> multiple fluentd nodes (buffer). ✓ automatic fail-over ✓ load balancing ✓ slice files based on time ✓ retry automatically ✓ exponential retry wait ✓ persistent on a file. Output files: 2013-01-01/01/access.log.gz, 2013-01-01/02/access.log.gz, 2013-01-01/03/access.log.gz, ...
  42. 42. Fluentd examples
  43. 43. Fluentd at Treasure Data - REST API logs. API servers: Rails app -> fluentd (fluent-logger-ruby + in_forward) -> out_forward -> fluentd on the watch server.
  44. 44. Fluentd at Treasure Data - backend logs. API servers (Rails app -> fluentd) and worker servers (Ruby app -> fluentd), each via fluent-logger-ruby + in_forward -> out_forward -> fluentd on the watch server.
  45. 45. Fluentd at Treasure Data - monitoring. API servers (Rails app -> fluentd) and worker servers (Ruby app -> fluentd) via fluent-logger-ruby + in_forward; PerfectQueue monitored by a script via in_exec -> out_forward -> fluentd on the watch server.
  46. 46. Fluentd at Treasure Data - Hadoop logs. ✓ resource consumption statistics for each user ✓ capacity monitoring. A script calls the JobTracker thrift API; fluentd on the watch server collects its output via in_exec.
  47. 47. Fluentd at Treasure Data - store & analyze. fluentd on the watch server -> out_tdlog -> Treasure Data (for historical analysis); -> out_metricsense -> Librato Metrics (for realtime analysis). ✓ streaming aggregation
  48. 48. Plugin development
  49. 49. Input plugin example (the slide’s `Engine.new` is corrected to `Engine.now`):
     class SomeInput < Fluent::Input
       Fluent::Plugin.register_input('myin', self)
       config_param :tag, :string

       def start
         Thread.new {
           while true
             time = Engine.now
             record = {"user" => 1, "size" => 1}
             Engine.emit(@tag, time, record)
           end
         }
       end

       def shutdown
         ...
       end
     end

     <source>
       type myin
       tag myapp.api.heartbeat
     </source>
  50. 50. Buffered output plugin example:
     class SomeOutput < Fluent::BufferedOutput
       Fluent::Plugin.register_output('myout', self)
       config_param :myparam, :string

       def format(tag, time, record)
         [tag, time, record].to_json + "\n"
       end

       def write(chunk)
         puts chunk.read
       end
     end

     <match **>
       type myout
       myparam foobar
     </match>
  51. 51. Tail input plugin example:
     class MyTailInput < Fluent::TailInput
       Fluent::Plugin.register_input('mytail', self)

       def configure_parser(conf)
         ...
       end

       def parse_line(line)
         array = line.split("\t")
         time = Engine.now
         record = {"user" => array[0], "item" => array[1]}
         return time, record
       end
     end

     <source>
       type mytail
     </source>
  52. 52. Fluentd v11: Error stream, Streaming processing, Better DSL, Multiprocess
