Distributed Stream Processing in the real [Perl] world



  1. Distributed Stream Processing in the real [Perl] world
     YAPC::Asia 2012 Day 1 (2012/09/28)
     TAGOMORI Satoshi (@tagomoris), NHN Japan
     Slide footer date: Saturday, September 29, 2012
  2. tagomoris
     • TAGOMORI Satoshi (@tagomoris)
     • Working at NHN Japan
  3. What this talk contains
     • What "Stream Processing" is
     • Why we want "Stream Processing"
     • What features we should write for "Stream Processing"
     • Frameworks and tools for "Distributed Stream Processing"
     • Implementations in the Perl world
  4. What "Stream Processing" is
  5. Stream
  6. Stream?
     • Continuously increasing data
       • access logs, trace logs, sales checks, ...
       • typically written to a file line by line
     • Example: tail -f
  7. Stream Processing
     • Convert, select, and aggregate data as it passes through
     • Does NOT wait for EOF (in many cases)
     • Example: tail -f | grep ^hit | sed -e 's/hit/miss/g'
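The shell pipeline on slide 7 can be mirrored with generators. This is only a sketch: the function names and the sample lines are made up for illustration, and an in-memory `io.StringIO` stands in for a log file that keeps growing.

```python
import io
from typing import Iterable, Iterator


def select_hits(lines: Iterable[str]) -> Iterator[str]:
    """Like `grep ^hit`: pass through only lines that start with "hit"."""
    for line in lines:
        if line.startswith("hit"):
            yield line


def hit_to_miss(lines: Iterable[str]) -> Iterator[str]:
    """Like `sed -e 's/hit/miss/g'`: rewrite every "hit" in each line."""
    for line in lines:
        yield line.replace("hit", "miss")


def process(stream: Iterable[str]) -> Iterator[str]:
    # Each stage pulls one line at a time, so output starts before EOF.
    return hit_to_miss(select_hits(stream))


# Hypothetical sample data standing in for a tailed log file.
sample = io.StringIO("hit /index.html\nmiss /a.png\nhit /hit-counter\n")
result = list(process(sample))
```

Because every stage is lazy, the same `process` function works on an endless input such as a socket or a tailed file.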
  8. Stream Processing over network
     • Data is collected from many nodes
       • to search/query/store
     • Separate heavy processing from edge nodes
     • edge: tail -f | nc
     • backend: nc -l | grep | sed | tee | ...
  9. Why we want "Stream Processing"
  10. Batch file copy & convert
      • Diagram: access.0928.16.log holds log lines from 16:00 to 16:59
      • Stages: 60 min. to fill the hourly file, flush wait 3 min., copy over network ? min., convert into query-friendly structure ? min.
      • Latency for a 16:00 log line: 62+ minutes
  11. Stream data copy & convert
      • Diagram: access.0928.16.log holds log lines from 16:00 to 16:59
      • Copy over network and convert line after line in real time
      • Very low latency for each log line (if traffic does not exceed capacity)
  12. Case of data size explosion (batch)
      • Diagram: serviceA, serviceB, serviceC, serviceD
      • A casual batch over multiple nodes/services may be blocked by one service whose unbalanced data size needs a long transfer time
      • Asynchronous batch is a very good problem...
  13. Case of data size explosion (stream)
      • Diagram: serviceA, serviceB (heavy traffic), serviceC, serviceD
      • Streams are mixed and not blocked by heavy traffic (if traffic does not exceed capacity)
  14. What features we should write for "Stream Processing"
  15. One-by-one input/process/output
  16. One-by-one input/process/output
      • Diagram: one record -> convert format / select -> one record (or none)
      • Basic feature
      • I/O call overhead is relatively heavy
  17. Burst transfer/read/write and process
  18. Burst transfer/read/write and process
      • Diagram: read and temporarily store many records from input -> convert format / select -> temporarily store many records (or few or none) to output
      • fewer input/output calls
      • better performance with async I/O and multiple processes
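A minimal sketch of the burst style from slide 18, assuming records are plain strings. The helper names (`chunks`, `process_chunk`, `upper_unless_debug`) are hypothetical. Each element appended to `writes` stands for one output call, so five input records cost three writes instead of one per record.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Optional


def chunks(records: Iterable[str], size: int) -> Iterator[List[str]]:
    """Group an input stream into lists of up to `size` records."""
    it = iter(records)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def process_chunk(chunk: List[str],
                  convert: Callable[[str], Optional[str]]) -> List[str]:
    """Convert each record; a converter returning None drops the record."""
    out = []
    for record in chunk:
        converted = convert(record)
        if converted is not None:
            out.append(converted)
    return out


def upper_unless_debug(record: str) -> Optional[str]:
    # Hypothetical convert/select rule used only for this example.
    return None if record.startswith("debug") else record.upper()


writes: List[List[str]] = []  # one element per output call
for chunk in chunks(["a", "debug b", "c", "d", "e"], size=2):
    out = process_chunk(chunk, upper_unless_debug)
    if out:
        writes.append(out)
```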
  19. Control buffer flush intervals
  20. Control buffer flush intervals
      • Diagram: read and temporarily store many records from input (buffer) -> read inputs / write records -> temporarily store many records (or few or none) to output (buffer)
      • Flush interval: 0.5sec? 1sec? 3sec? 30sec?
      • Control flushing by buffer size and latency
      • (Semi-)real-time control flow arguments
      • Max size of lost data when the process crashes
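The size/latency trade-off on slide 20 can be sketched as a buffer that flushes when it holds enough records or when enough time has passed. `FlushBuffer` and its parameters are hypothetical names, and the clock is injectable so the example runs without sleeping.

```python
import time
from typing import Callable, List


class FlushBuffer:
    """Buffer records; flush on size or on elapsed time (illustrative only)."""

    def __init__(self, sink: Callable[[List[str]], None],
                 max_records: int = 1000, flush_interval: float = 1.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.sink = sink
        self.max_records = max_records
        self.flush_interval = flush_interval
        self.clock = clock
        self.records: List[str] = []
        self.last_flush = clock()

    def append(self, record: str) -> None:
        self.records.append(record)
        self.maybe_flush()

    def maybe_flush(self) -> None:
        too_big = len(self.records) >= self.max_records
        too_old = self.clock() - self.last_flush >= self.flush_interval
        if self.records and (too_big or too_old):
            self.flush()

    def flush(self) -> None:
        self.sink(self.records)
        self.records = []
        self.last_flush = self.clock()


now = [0.0]                      # fake clock for the demonstration
flushed: List[List[str]] = []
buf = FlushBuffer(flushed.append, max_records=3, flush_interval=1.0,
                  clock=lambda: now[0])
for record in ["a", "b", "c"]:
    buf.append(record)           # third append reaches max_records: flush
now[0] = 0.2
buf.append("d")                  # small and fresh: stays buffered
at_risk = list(buf.records)      # what a crash at this moment would lose
now[0] = 1.5
buf.maybe_flush()                # interval elapsed: flush
```

A real daemon would call `maybe_flush()` from a timer; the contents of `buf.records` at any moment are the data lost if the process dies, which is why a shorter interval bounds the loss.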
  21. Buffering/Queueing
  22. Buffering/Queueing
      • Diagram (four stages, each showing records -> output buffer -> send to next node -> next node):
        • STOP: the next node stops
        • buffer: records accumulate in the output buffer
        • recover: the next node recovers and the buffer is sent
        • streaming: records stream to the next node again
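The four stages on slide 22 can be sketched as an output buffer that keeps chunks while sending fails and drains them in order once the next node is back. `OutputBuffer` and the fake `send` function are hypothetical, and `ConnectionError` stands in for whatever error a real transport raises.

```python
from collections import deque
from typing import Callable, Deque, List


class OutputBuffer:
    """Queue chunks for the next node; keep them while it is down."""

    def __init__(self, send: Callable[[List[str]], None]) -> None:
        self.send = send
        self.pending: Deque[List[str]] = deque()

    def emit(self, chunk: List[str]) -> None:
        self.pending.append(chunk)
        self.try_flush()

    def try_flush(self) -> bool:
        """Send queued chunks in order; stop at the first failure."""
        while self.pending:
            try:
                self.send(self.pending[0])
            except ConnectionError:
                return False         # next node is down: keep the chunk
            self.pending.popleft()   # drop only after a successful send
        return True


node_up = [True]
received: List[List[str]] = []


def send(chunk: List[str]) -> None:
    # Fake transport used only for this example.
    if not node_up[0]:
        raise ConnectionError("next node is down")
    received.append(chunk)


buf = OutputBuffer(send)
buf.emit(["r1"])                     # streaming
node_up[0] = False                   # STOP
buf.emit(["r2"])
buf.emit(["r3"])                     # buffer
buffered_while_down = len(buf.pending)
node_up[0] = True                    # recover
buf.try_flush()                      # streaming again
```

A production buffer would also need an upper bound on `pending`; that policy is left out here.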
  23. Connection keepalive / connection pooling
  24. Connection keepalive / connection pooling
      • Diagram: node A keeps connections to node B, node C, and node D
      • Keep connections and select one to use
        • TCP connection establishment is costly
      • Manage node status (alive/down) at the same time
      • Not only inter-node, but also inter-process connections
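A sketch of the keepalive idea on slide 24, assuming a `connect` callable is supplied by the caller. `ConnectionPool` and `fake_connect` are hypothetical; the fake returns a string where a real program would return a socket.

```python
from typing import Callable, Dict, List, Optional


class ConnectionPool:
    """Keep one connection per node and track alive/down status."""

    def __init__(self, nodes: List[str],
                 connect: Callable[[str], object]) -> None:
        self.connect = connect
        self.conns: Dict[str, Optional[object]] = {n: None for n in nodes}
        self.alive: Dict[str, bool] = {n: True for n in nodes}

    def get(self, node: str) -> object:
        """Return the kept connection, opening one only when needed."""
        if self.conns[node] is None:
            try:
                self.conns[node] = self.connect(node)
                self.alive[node] = True
            except OSError:
                self.alive[node] = False
                raise
        return self.conns[node]

    def mark_down(self, node: str) -> None:
        """Forget a broken connection so the next get() reconnects."""
        self.conns[node] = None
        self.alive[node] = False


opened: List[str] = []


def fake_connect(node: str) -> object:
    # Stand-in for a real connect call; nodeC refuses connections.
    if node == "nodeC":
        raise ConnectionRefusedError(node)
    opened.append(node)
    return "conn-to-" + node


pool = ConnectionPool(["nodeB", "nodeC", "nodeD"], fake_connect)
first = pool.get("nodeB")
second = pool.get("nodeB")           # reused, no second connect
try:
    pool.get("nodeC")
except OSError:
    pass                             # nodeC is now marked down
```

With real sockets, `connect` could wrap `socket.create_connection((host, port))`, with nodes given as address tuples instead of names.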
  25. Distribution
  26. Distribution: Load balancing (cpu/node)
      • Diagram: records -> load balancer -> several processors, each sending to the next node
      • Distribute large scale data to many nodes
        • nodes: servers, or processor processes
      • to make total performance high
  27. Distribution: High availability (process/node)
      • Diagram: records -> load balancer -> several processors, each sending to the next node
      • Distribute large scale data to N+1 (or 2 or more) nodes
      • to make the system tolerant of node trouble
        • without any failover (and takeback) operations
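Slides 26 and 27 can be combined in one small sketch: a round-robin balancer that skips nodes marked down, so traffic keeps flowing without an explicit failover step. `RoundRobin` and the node names are hypothetical, and round-robin is just one possible distribution policy.

```python
from itertools import cycle
from typing import List, Set


class RoundRobin:
    """Pick nodes in turn, skipping any that are marked down."""

    def __init__(self, nodes: List[str]) -> None:
        self.nodes = list(nodes)
        self.down: Set[str] = set()
        self._cycle = cycle(self.nodes)

    def pick(self) -> str:
        # Look at each node at most once per call.
        for _ in range(len(self.nodes)):
            node = next(self._cycle)
            if node not in self.down:
                return node
        raise RuntimeError("no node is alive")


balancer = RoundRobin(["p1", "p2", "p3"])
before = [balancer.pick() for _ in range(3)]
balancer.down.add("p2")              # p2 fails
after = [balancer.pick() for _ in range(4)]
balancer.down.discard("p2")          # p2 comes back: no takeback step needed
```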
  28. Routing
      • Diagram: records -> router -> process A or process B -> router -> records for service A to output A, records for service B to output B, records for service C to output C
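A minimal sketch of the routing on slide 28, assuming each record is a dict with a `service` key. The `Router` class, the key name, and the sample records are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, object]


class Router:
    """Send each record to the first output whose predicate accepts it."""

    def __init__(self) -> None:
        self.routes: List[Tuple[Callable[[Record], bool],
                                Callable[[Record], None]]] = []

    def add_route(self, predicate: Callable[[Record], bool],
                  output: Callable[[Record], None]) -> None:
        self.routes.append((predicate, output))

    def dispatch(self, record: Record) -> bool:
        for predicate, output in self.routes:
            if predicate(record):
                output(record)
                return True
        return False                 # no route matched


out_a: List[Record] = []
out_b: List[Record] = []
unrouted: List[Record] = []

router = Router()
router.add_route(lambda r: r.get("service") == "A", out_a.append)
router.add_route(lambda r: r.get("service") == "B", out_b.append)

for record in [{"service": "A", "n": 1},
               {"service": "B", "n": 2},
               {"service": "Z", "n": 3}]:
    if not router.dispatch(record):
        unrouted.append(record)
```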
  29. TOO MANY FEATURES TO IMPLEMENT !!!!!
  30. Frameworks and tools for "Distributed Stream Processing"
  31. Frameworks and tools
      • Apache Kafka
        • written in Scala (... with JVM!)
      • Twitter Storm
        • written in Clojure (... with JVM!)
      • Fluentd
  32. Fluentd
  33. Fluentd
      • Mainly written by @frsyuki at Treasure Data
      • Apache License v2 (APLv2) software on GitHub
      • Log read/transfer/write daemon based on MessagePack
        • structured data (Hash: key:value pairs)
      • Plugin mechanism for input/output/buffer features
        • many plugins have been published by now
  34. Fluentd features: input/output
      • File tailing, network, and other input plugins
        • tail and parse line by line
        • receive records from an app logger or another fluentd
        • in_syslog, in_exec, in_dstat, .....
      • Output to many, many storages/systems
        • other fluentd, file, S3, mongodb, mysql, HDFS, .....
  35. Fluentd features: buffers
      • Pluggable buffers
        • output plugin buffers are swappable (by configuration)
      • In-memory buffers: fast, but lost when fluentd goes down
      • File buffers: slow, but always saved
      • Buffer plugins can also be added by users
        • No public buffer plugin exists yet....
  36. Fluentd features: routing
      • Tag based routing
        • all records have a tag and a time
        • Fluentd uses the tag to decide which plugin a record is sent to next
      • configurations are:
        • tag matcher pattern + plugin configuration
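Tag-based routing from slide 36 can be sketched as follows. This is not Fluentd's implementation; it is a simplified matcher assuming the wildcard rules in Fluentd's documentation (`*` matches exactly one dot-separated tag part, `**` matches zero or more parts, and the first matching rule in configuration order wins). The rule list and plugin names are made up.

```python
from typing import List, Optional, Tuple


def _match(pat: List[str], parts: List[str]) -> bool:
    if not pat:
        return not parts
    head, rest = pat[0], pat[1:]
    if head == "**":
        # "**" may absorb zero or more tag parts.
        return any(_match(rest, parts[i:]) for i in range(len(parts) + 1))
    if not parts:
        return False
    if head == "*" or head == parts[0]:
        return _match(rest, parts[1:])
    return False


def tag_matches(pattern: str, tag: str) -> bool:
    """Simplified tag matcher: literal parts, "*" and "**" only."""
    return _match(pattern.split("."), tag.split("."))


def find_output(rules: List[Tuple[str, str]], tag: str) -> Optional[str]:
    """Return the plugin name of the first rule whose pattern matches."""
    for pattern, plugin in rules:
        if tag_matches(pattern, tag):
            return plugin
    return None


# Hypothetical rule list: (tag matcher pattern, output plugin name).
rules = [("access.*", "out_hdfs"), ("access.**", "out_file"), ("**", "out_null")]
```

For example, `find_output(rules, "access.web")` picks the first rule, while `find_output(rules, "access.web.01")` falls through to the second because `*` covers only one tag part.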
  37. Fluentd features: exec_filter
      • Output records to a specified (and forked) command
      • And get records from the command's STDOUT
      • We can specify our own stream processor as the command
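A command driven this way only has to read records from STDIN and write records to STDOUT. The sketch below assumes one JSON object per line in both directions, which is an assumption about how the plugin is configured; the field names `status` and `level` are invented for the example.

```python
import io
import json
from typing import TextIO


def run_filter(infile: TextIO, outfile: TextIO) -> None:
    """Read one JSON record per line, select and convert, write JSON lines."""
    for line in infile:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record.get("status", 0) < 400:
            continue                 # select: keep only error responses
        # convert: add a derived field
        record["level"] = "error" if record["status"] >= 500 else "warn"
        outfile.write(json.dumps(record, sort_keys=True) + "\n")
        outfile.flush()              # do not hold records back in the pipe


# In-memory demonstration; as a real command this would be
# run_filter(sys.stdin, sys.stdout).
src = io.StringIO('{"path": "/", "status": 200}\n'
                  '{"path": "/x", "status": 404}\n'
                  '{"path": "/y", "status": 503}\n')
dst = io.StringIO()
run_filter(src, dst)
output_records = [json.loads(l) for l in dst.getvalue().splitlines()]
```

Flushing after every record matters here: a filter that buffers its STDOUT would add latency that the parent process cannot control.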
  38. I'm very sorry that....
  39. Fluentd is written in Ruby. Fluentd plugins are released as rubygems.
  40. Problems about Fluentd (for stream processing)
      • Eager buffering
        • The default buffering configuration is eager and does not flush under 8MB
      • Performance
        • Many, many features for data protection hurt performance
      • Doesn't work on Windows
  41. Implementations in the Perl world
  42. fluent-agent-lite (Fluent::AgentLite)
      • Log collection agent tool (in Perl) by tagomoris
      • fast and low load
      • gets logs from file/STDIN, and sends them to other nodes
      • minimal features for a log collector agent
        • doesn't parse log lines (sends 1 attribute holding the whole log line)
        • supports load balancing and failover of destinations
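The "1 attribute with whole log line" idea can be sketched as below. The `[tag, time, record]` shape follows the tag/time/record description on slide 36; the attribute name `message`, the tag, and the function name are assumptions for illustration, and a real agent would serialize the entries (with MessagePack, per slide 33) before sending them.

```python
import time
from typing import Callable, Dict, Iterable, Iterator, List, Union

Entry = List[Union[str, int, Dict[str, str]]]


def wrap_lines(tag: str, lines: Iterable[str], field: str = "message",
               clock: Callable[[], float] = time.time) -> Iterator[Entry]:
    """Wrap each raw log line, unparsed, into a [tag, time, record] entry."""
    for line in lines:
        yield [tag, int(clock()), {field: line.rstrip("\n")}]


# Fixed clock so the example output is reproducible.
events = list(wrap_lines("apache.access",
                         ["GET / 200\n", "GET /x 404\n"],
                         clock=lambda: 1348790400))
```

Leaving parsing to a later node keeps the agent on the service host cheap, which matches the "fast and low load" goal.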
  43. fluent-agent (Fluent::Agent)
      • Fluentd feature subset tool by tagomoris
      • written in Perl
      • libuv and the UV module as the async I/O library (for Windows)
      • Goal: simple, fast and easy deployment
      • UNDER CONSTRUCTION
        • 60% of features and many bugs, not on CPAN yet
  44. Features of Fluent::Agent
      • 1 input, 1 output and 0/1 filter
      • Network I/O: protocol compatible with Fluentd
        • and a simple load balancing/failover feature
      • File input/output: superset of Fluentd's features (planned)
      • Filter with any command: compatible with Fluentd's exec_filter
      • Diagram: data/records -> input -> filter (any program you want) -> output -> data/records
  45. Pros of Fluent::Agent (in plan)
      • Simple and fast software for stream processing
      • Stateless nodes
        • fluent-agent works without any configuration files
        • fluent-agent works with only command-line options
      • Simple buffering and load balancing
        • less memory usage
  46. Cons of Fluent::Agent (in fact)
      • Poor input/output methods
        • fluent-agent doesn't have a plugin architecture (currently)
        • in future, a CPAN based plugin system?
      • Lack of data protection against process death
        • fluent-agent has only a memory buffer
  47. Fluentd and fluent-agent
  48. Fluentd and fluent-agent and fluent-agent-lite
      • Diagram: each service node runs fluent-agent-lite, which sends logs to fluent-agent processes in deliver, processor, and aggregator roles; fluentd acts as the writer for storages
  49. Conclusion
      • Distributed Stream Processing is:
        • a way to give more power to our applications
        • a very hard (and interesting) problem
        • supported by frameworks/tools like Fluentd and/or fluent-agent
  50. Let's try to improve your application with stream processing instead of many, many batches
      Thanks!
      CAST: crouton & luke & chacha
      Thanks to @kbysmnr