How to collect Big Data into Hadoop

Big Data processing to collect Big Data

Transcript

  • 1. How to collect Big Data into Hadoop. Big Data processing to collect Big Data. fluentd.org. Sadayuki Furuhashi
  • 2. Self-introduction > Sadayuki Furuhashi > Treasure Data, Inc. Founder & Software Architect > Open source projects: MessagePack - efficient serializer (original author); Fluentd - event collector (original author)
  • 3. We're Hiring! sf@treasure-data.com
  • 4. Today’s topic
  • 5. Big Data: Report & Monitor
  • 6. Big Data: Collect → Store → Process → Visualize → Report & Monitor
  • 7. Easier & shorter time: Collect → Store → Process → Visualize (Store/Process: Cloudera, Hortonworks, MapR; Visualize: Excel, Tableau, R)
  • 8. How to shorten the Collect stage? Easier & shorter time: Collect → Store → Process → Visualize (Store/Process: Cloudera, Hortonworks, MapR; Visualize: Excel, Tableau, R)
  • 9. Problems to collect data
  • 10. Poor man's data collection:
        1. Copy files from servers using rsync
        2. Create a RegExp to parse the files
        3. Parse the files and generate a 10GB CSV file
        4. Put it into HDFS
  • 11. Problems of collecting "big data":
        > Includes broken values → needs error handling & retrying
        > Time-series data are changing and unclean → parse logs before storing
        > Takes time to read/write → tools have to be optimized and parallelized
        > Takes time for trial & error
        > Causes network traffic spikes
  • 12. Problems of poor man's data collection:
        > Wastes time implementing error handling
        > Wastes time maintaining a parser
        > Wastes time debugging the tool
        > Not reliable
        > Not efficient
  • 13. Basic theories to collect big data
  • 14. Divide & Conquer (diagram: data is split into chunks, so an error affects only an individual chunk)
  • 15. Divide & Conquer & Retry (diagram: a failed chunk is retried individually; see the sketch below)
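    A minimal sketch of the divide & conquer & retry idea, assuming a
    hypothetical upload function and pre-split chunks (this is not
    Fluentd's actual implementation):

        # Each chunk is transferred independently, so one bad chunk does
        # not force re-sending everything; a failed chunk is retried
        # with an exponentially growing wait.
        def send_with_retry(chunk, max_retries = 5)
          wait = 1.0
          retries = 0
          begin
            upload(chunk)        # hypothetical transfer function
          rescue
            raise if retries >= max_retries
            retries += 1
            sleep(wait)
            wait *= 2.0          # exponential backoff
            retry
          end
        end

        chunks.each { |chunk| send_with_retry(chunk) }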
  • 16. Streaming (diagram: "Don't handle big files here" vs. "Do it here": stream small chunks continuously instead of moving big files at once)
  • 17. Apache Flume and Fluentd
  • 18. Apache Flume
  • 19. Apache Flume (diagram: access logs, app logs, and system logs each flow through an Agent to a Collector, ...)
  • 20. Apache Flume - network topology. Flume OG: a central Master coordinates Agents and Collectors (Agents send, acks go through the Master). Flume NG: Agents send/ack directly to Collectors.
  • 21. Apache Flume - pipeline. Flume OG: Source → Sink (plugin). Flume NG: Source → Channel → Sink.
  • 22. Apache Flume - configuration (Flume NG). A Master manages all configuration of Agents and Collectors (the Master is optional).
  • 23. Apache Flume - configuration:
        # source
        host1.sources = avro-source1
        host1.sources.avro-source1.type = avro
        host1.sources.avro-source1.bind = 0.0.0.0
        host1.sources.avro-source1.port = 41414
        host1.sources.avro-source1.channels = ch1
        # channel
        host1.channels = ch_avro_log
        host1.channels.ch_avro_log.type = memory
        # sink
        host1.sinks = log-sink1
        host1.sinks.log-sink1.type = logger
        host1.sinks.log-sink1.channel = ch1
  • 24. Fluentd
  • 25. Fluentd - network topology. Flume NG: Agents send/ack to Collectors (two kinds of nodes). Fluentd: identical fluentd processes send/ack to each other (one kind of node for every role).
  • 26. Fluentd - pipeline. Flume NG: Source → Channel → Sink (plugin). Fluentd: Input → Buffer → Output.
  • 27. Fluentd - configuration. No central node - keep things simple. Use Chef, Puppet, etc. for configuration (they do things better).
  • 28. Fluentd - configuration:
        <source>
          type forward
          port 24224
        </source>
        <match **>
          type file
          path /var/log/logs
        </match>
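    With this configuration, fluentd accepts events over the forward
    protocol on port 24224 and appends them to files under /var/log/logs.
    As a quick test, the fluent-cat utility that ships with fluentd can
    inject an event into the forward input (the tag debug.test is just an
    example):

        $ echo '{"message":"hello"}' | fluent-cat debug.test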
  • 29. Fluentd - configuration, side by side with Flume NG. Fluentd:
        <source>
          type forward
          port 24224
        </source>
        <match **>
          type file
          path /var/log/logs
        </match>
        versus Flume NG:
        # source
        host1.sources = avro-source1
        host1.sources.avro-source1.type = avro
        host1.sources.avro-source1.bind = 0.0.0.0
        host1.sources.avro-source1.port = 41414
        host1.sources.avro-source1.channels = ch1
        # channel
        host1.channels = ch_avro_log
        host1.channels.ch_avro_log.type = memory
        # sink
        host1.sinks = log-sink1
        host1.sinks.log-sink1.type = logger
        host1.sinks.log-sink1.channel = ch1
  • 30. Fluentd - Users
  • 31. Fluentd - plugin distribution platform:
        $ fluent-gem search -rd fluent-plugin
        $ fluent-gem install fluent-plugin-mongo
  • 32. Fluentd - plugin distribution platform:
        $ fluent-gem search -rd fluent-plugin
        $ fluent-gem install fluent-plugin-mongo
        94 plugins!
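    Once installed, a plugin is enabled purely through configuration. A
    sketch of routing events tagged apache.access into MongoDB with
    fluent-plugin-mongo (database and collection names are examples):

        <match apache.access>
          type mongo
          host localhost
          port 27017
          database apache
          collection access
        </match>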
  • 33. Concept of Fluentd. Customization is essential → small core + many plugins. The Fluentd core helps to implement plugins → common features are already implemented.
  • 34. Fluentd core vs. plugins. Core: Divide & Conquer, retrying, parallelization, error handling, message routing. Plugins: read / receive data, write / send data.
  • 35. Fluentd plugins
  • 36. in_tail (diagram: apache writes access.log; fluentd tails it): ✓ read a log file ✓ custom regexp ✓ custom parser in Ruby
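    A sketch of an in_tail source for an Apache access log (paths and tag
    are examples; pos_file remembers the read position across restarts):

        <source>
          type tail
          path /var/log/apache2/access.log
          pos_file /var/log/fluentd/apache.access.pos
          tag apache.access
          format apache2
        </source>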
  • 37. out_mongo (diagram: apache → access.log → fluentd in_tail → buffer → MongoDB)
  • 38. out_mongo (diagram as above): ✓ retry automatically ✓ exponential retry wait ✓ persistent on a file
  • 39. out_s3 (diagram: apache → access.log → fluentd in_tail → buffer → Amazon S3, written as 2013-01-01/01/access.log.gz, 2013-01-01/02/access.log.gz, 2013-01-01/03/access.log.gz, ...): ✓ slice files based on time ✓ retry automatically ✓ exponential retry wait ✓ persistent on a file
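    A sketch of an out_s3 match block (credentials, bucket, and paths are
    placeholders); time_slice_format produces the time-sliced object
    names shown above:

        <match apache.access>
          type s3
          aws_key_id YOUR_AWS_KEY_ID
          aws_sec_key YOUR_AWS_SECRET_KEY
          s3_bucket your-log-bucket
          path logs/
          buffer_path /var/log/fluentd/s3
          time_slice_format %Y-%m-%d/%H
          time_slice_wait 10m
        </match>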
  • 40. out_hdfs (diagram: apache → access.log → fluentd in_tail → buffer → HDFS, written as 2013-01-01/01/access.log.gz, 2013-01-01/02/access.log.gz, 2013-01-01/03/access.log.gz, ...): ✓ custom text formatter ✓ slice files based on time ✓ retry automatically ✓ exponential retry wait ✓ persistent on a file
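    Writing to HDFS is typically done with fluent-plugin-webhdfs, which
    talks to the NameNode's WebHDFS API. A sketch (hostname and path are
    examples; the %Y%m%d_%H placeholders in the path yield hourly files):

        <match apache.access>
          type webhdfs
          host namenode.example.com
          port 50070
          path /log/access/%Y%m%d_%H.access.log
        </match>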
  • 41. out_hdfs with multiple fluentd nodes (diagram: apache → access.log → in_tail fluentd → downstream fluentd nodes → HDFS): ✓ automatic fail-over ✓ load balancing ✓ slice files based on time ✓ retry automatically ✓ exponential retry wait ✓ persistent on a file
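    Fail-over and load balancing come from out_forward: listing several
    <server> blocks balances load across them, and a server marked
    standby is used only when the others are down (addresses are
    examples):

        <match **>
          type forward
          <server>
            host 192.168.1.10
            port 24224
          </server>
          <server>
            host 192.168.1.11
            port 24224
            standby
          </server>
        </match>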
  • 42. Fluentd examples
  • 43. Fluentd at Treasure Data - REST API logs. API servers (Rails app + fluentd, connected by fluent-logger-ruby + in_forward) → out_forward → fluentd on a watch server.
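    A sketch of posting an event from a Rails app with fluent-logger-ruby
    to the local fluentd's in_forward (tag and record are examples):

        require 'fluent-logger'

        # connect once, e.g. in an initializer
        Fluent::Logger::FluentLogger.open(nil, :host => 'localhost', :port => 24224)

        # emit an event with tag "myapp.access"
        Fluent::Logger.post("myapp.access", {"agent" => "foo", "status" => 200})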
  • 44. Fluentd at Treasure Data - backend logs. API servers (Rails app + fluentd) and worker servers (Ruby app + fluentd), both via fluent-logger-ruby + in_forward → out_forward → fluentd on a watch server.
  • 45. Fluentd at Treasure Data - monitoring. API servers (Rails app + fluentd) and worker servers (Ruby app + fluentd) via fluent-logger-ruby + in_forward; the queue (PerfectQueue) is polled by a script via in_exec → out_forward → fluentd on the watch server.
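    A sketch of the in_exec side of such monitoring, assuming a
    hypothetical script that prints tab-separated queue statistics
    (the script path and key names are made up for illustration):

        <source>
          type exec
          command ruby /opt/monitor/queue_stats.rb
          keys queue,length
          tag monitor.queue
          run_interval 60s
        </source>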
  • 46. Fluentd at Treasure Data - Hadoop logs. A script calls the Hadoop JobTracker thrift API and feeds the results to fluentd on the watch server via in_exec: ✓ resource consumption statistics for each user ✓ capacity monitoring
  • 47. Fluentd at Treasure Data - store & analyze. fluentd on the watch server → out_tdlog → Treasure Data (for historical analysis); → out_metricsense → Librato Metrics (for realtime analysis; ✓ streaming aggregation).
  • 48. Plugin development
  • 49. Writing an input plugin:
        class SomeInput < Fluent::Input
          Fluent::Plugin.register_input('myin', self)
          config_param :tag, :string

          def start
            Thread.new {
              while true
                time = Engine.now
                record = {"user" => 1, "size" => 1}
                Engine.emit(@tag, time, record)
              end
            }
          end

          def shutdown
            ...
          end
        end

        <source>
          type myin
          tag myapp.api.heartbeat
        </source>
  • 50. Writing a buffered output plugin:
        class SomeOutput < Fluent::BufferedOutput
          Fluent::Plugin.register_output('myout', self)
          config_param :myparam, :string

          def format(tag, time, record)
            [tag, time, record].to_json + "\n"
          end

          def write(chunk)
            puts chunk.read
          end
        end

        <match **>
          type myout
          myparam foobar
        </match>
  • 51. Extending in_tail with a custom parser:
        class MyTailInput < Fluent::TailInput
          Fluent::Plugin.register_input('mytail', self)

          def configure_parser(conf)
            ...
          end

          def parse_line(line)
            array = line.split("\t")
            time = Engine.now
            record = {"user" => array[0], "item" => array[1]}
            return time, record
          end
        end

        <source>
          type mytail
        </source>
  • 52. Fluentd v11: error stream, streaming processing, better DSL, multiprocess.