Distributed Stream Processing in the real [Perl] world

  • 1. Distributed Stream Processing in the real [Perl] world. YAPC::Asia 2012 Day 1 (2012/09/28). TAGOMORI Satoshi (@tagomoris), NHN Japan. Slide footer date: Saturday, 2012/09/29.
  • 2. tagomoris
      • TAGOMORI Satoshi (@tagomoris)
      • Working at NHN Japan
  • 3. What this talk contains
      • What "Stream Processing" is
      • Why we want "Stream Processing"
      • What features we should write for "Stream Processing"
      • Frameworks and tools for "Distributed Stream Processing"
      • Implementations in the Perl world
  • 4. What "Stream Processing" is
  • 5. Stream
  • 6. Stream?
      • Continuously increasing data
          • access logs, trace logs, sales checks, ...
          • typically written to a file line by line
      • Example: tail -f
  • 7. Stream Processing
      • Convert, select, aggregate passed data
      • Does NOT wait for EOF (in many cases)
      • Example: tail -f | grep ^hit | sed -e s/hit/miss/g
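The shell pipeline on slide 7 can also be sketched in a general-purpose language. The Python below is only an illustration (the talk itself shows no such code, and the function names `select_hits`, `hit_to_miss` and `pipeline` are hypothetical). It chains generators so that each line is selected and converted as soon as it arrives, without waiting for EOF.

```python
from typing import Iterable, Iterator


def select_hits(lines: Iterable[str]) -> Iterator[str]:
    """Like `grep ^hit`: pass through only lines that start with 'hit'."""
    for line in lines:
        if line.startswith("hit"):
            yield line


def hit_to_miss(lines: Iterable[str]) -> Iterator[str]:
    """Like `sed -e s/hit/miss/g`: rewrite every 'hit' in each line."""
    for line in lines:
        yield line.replace("hit", "miss")


def pipeline(lines: Iterable[str]) -> Iterator[str]:
    """Compose the two stages; nothing is read until a result is requested."""
    return hit_to_miss(select_hits(lines))
```

Because every stage is a generator, the same `pipeline` works on a finite list and on an endless source such as lines read from a growing log file.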
  • 8. Stream Processing over network
      • Data are collected from many nodes
          • to search/query/store
      • Separate heavy processes from edge nodes
      • Example:
          • edge: tail -f | nc
          • backend: nc -l | grep | sed | tee | ...
  • 9. Why we want "Stream Processing"
  • 10. Batch file copy & convert
      • [Diagram: access.0928.16.log collects lines from 16:00 to 16:59 (60 min.), followed by flush wait (3 min.), copy over network (? min.), and convert into query-friendly structure (? min.)]
      • Latency for a 16:00 log line: 62+ minutes
  • 11. Stream data copy & convert
      • [Diagram: lines of access.0928.16.log (16:00 ... 16:59) are copied over the network and converted one after another in real time]
      • Very low latency for each log line (if traffic is not larger than capacity)
  • 12. Case of data size explosion (batch)
      • [Diagram: serviceA, serviceB, serviceC, serviceD]
      • Casual batch over multiple nodes/services may be blocked by unbalanced data size (needs long transfer time)
      • Asynchronous batch is a very good problem...
  • 13. Case of data size explosion (stream)
      • [Diagram: serviceA, serviceB, serviceC, serviceD, one of them with heavy traffic]
      • Streams are mixed and not blocked by heavy traffic (if traffic is not larger than capacity)
  • 14. What features we should write for "Stream Processing"
  • 15. One-by-one input/process/output
  • 16. One-by-one input/process/output
      • [Diagram: input one record -> convert format / select -> output one record (or none)]
      • Basic feature
      • I/O call overhead is relatively heavy
  • 17. Burst transfer/read/write and process
  • 18. Burst transfer/read/write and process
      • [Diagram: read many records from input and store them temporarily -> convert format / select -> store many records (or few or none) temporarily and write them to output]
      • fewer input/output calls
      • more performance with async I/O and multiple processes
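The burst pattern on slide 18 can be sketched as follows. This is a minimal illustration, not code from the talk: `chunks`, `process_chunk` and `run` are hypothetical names, and the convert/select step (drop empty records, upper-case the rest) is a stand-in for real processing. The point is that input is consumed many records at a time and the output is called once per chunk rather than once per record.

```python
from itertools import islice


def chunks(records, size):
    """Yield lists of up to `size` records from any iterable."""
    it = iter(records)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def process_chunk(chunk):
    """Stand-in convert/select step: drop empty records, upper-case the rest."""
    return [record.upper() for record in chunk if record]


def run(records, write, size=1000):
    """Read in bursts and issue one write call per non-empty result chunk."""
    for chunk in chunks(records, size):
        out = process_chunk(chunk)
        if out:  # many records in may produce few or none out
            write(out)
```

With `size=2`, six input records cause at most three `write` calls instead of six.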
  • 19. Control buffer flush intervals
  • 20. Control buffer flush intervals
      • [Diagram: the burst pipeline with an input buffer and an output buffer; flush interval 0.5 sec? 1 sec? 3 sec? 30 sec?]
      • Control flushing by buffer size and latency
      • (Semi-)real-time control flow arguments
      • Max size of lost data when the process crashes
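One way to picture the trade-off on slide 20 is a buffer that flushes when it is either full or too old. The sketch below is an assumption-laden illustration (the class `FlushBuffer` and its parameters `max_records` and `max_wait` are hypothetical, not from Fluentd or fluent-agent). `max_records` bounds how much data can be lost if the process crashes, and `max_wait` bounds the added latency.

```python
import time


class FlushBuffer:
    """Collect records and flush them by size or by age (hypothetical sketch)."""

    def __init__(self, write, max_records=1000, max_wait=1.0, clock=time.monotonic):
        self.write = write              # called with a list of records
        self.max_records = max_records  # upper bound on unflushed records
        self.max_wait = max_wait        # upper bound on buffering time, in seconds
        self.clock = clock              # injectable for testing
        self.records = []
        self.first_at = None            # arrival time of the oldest buffered record

    def add(self, record):
        if not self.records:
            self.first_at = self.clock()
        self.records.append(record)
        self.flush_if_due()

    def flush_if_due(self):
        """Flush when the buffer is full or its oldest record is too old."""
        if not self.records:
            return
        full = len(self.records) >= self.max_records
        stale = self.clock() - self.first_at >= self.max_wait
        if full or stale:
            self.flush()

    def flush(self):
        batch, self.records = self.records, []
        if batch:
            self.write(batch)
```

In a real daemon `flush_if_due` would also be called from a periodic timer, so that a quiet stream is still flushed within `max_wait`.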
  • 21. Buffering/Queueing
  • 22. Buffering/Queueing
      • [Diagram: records pass through an output buffer and are sent to the next node. When the next node STOPs, records accumulate in the buffer; on recovery the buffered records are sent, and streaming resumes.]
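The stop / buffer / recover cycle on slide 22 could look roughly like this. It is a sketch under stated assumptions: `OutputQueue` is a hypothetical name, `send` is any callable that raises `ConnectionError` while the next node is down, and the `limit` parameter is an invented safety bound.

```python
from collections import deque


class OutputQueue:
    """Keep chunks in order while the next node is down (hypothetical sketch)."""

    def __init__(self, send, limit=10000):
        self.send = send        # assumed to raise ConnectionError when the node is down
        self.limit = limit      # maximum number of chunks held in memory
        self.pending = deque()

    def push(self, records):
        if len(self.pending) >= self.limit:
            raise OverflowError("output buffer is full")
        self.pending.append(records)
        self.drain()

    def drain(self):
        """Send buffered chunks oldest first; stop at the first failure."""
        while self.pending:
            try:
                self.send(self.pending[0])
            except ConnectionError:
                return False    # keep the chunk and retry on a later call
            self.pending.popleft()
        return True
```

A chunk is removed from the queue only after `send` returns, so an outage delays records but does not reorder or drop them (as long as the process itself stays alive and the limit is not reached).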
  • 23. Connection keepalive / Connection pooling
  • 24. Connection keepalive / connection pooling
      • [Diagram: node A connected to node B, node C and node D]
      • Keep connections open and select one to use
      • TCP connection establishment is costly
      • Manage node status (alive/down) at the same time
      • Applies not only to inter-node connections, but also to inter-process connections
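The points on slide 24 can be sketched as a small pool that reuses open connections and remembers which nodes are down. Everything here is hypothetical: the class `ConnectionPool`, the `connect` factory (assumed to raise `OSError` on failure) and the `retry_after` interval are illustration only, not the API of any tool named in the talk.

```python
import time


class ConnectionPool:
    """Reuse connections and track alive/down nodes (hypothetical sketch)."""

    def __init__(self, nodes, connect, retry_after=30.0, clock=time.monotonic):
        self.nodes = list(nodes)
        self.connect = connect          # node -> connection; raises OSError on failure
        self.retry_after = retry_after  # seconds to leave a failed node alone
        self.clock = clock
        self.conns = {}                 # node -> open connection
        self.down_until = {}            # node -> time when it may be retried

    def get(self):
        """Return (node, connection) for the first usable node."""
        now = self.clock()
        for node in self.nodes:
            if self.down_until.get(node, float("-inf")) > now:
                continue                # still considered down
            conn = self.conns.get(node)
            if conn is None:
                try:
                    conn = self.connect(node)
                except OSError:
                    self.down_until[node] = now + self.retry_after
                    continue
                self.conns[node] = conn
            return node, conn
        raise ConnectionError("no node available")

    def mark_down(self, node):
        """Call this when a send on an existing connection fails."""
        self.conns.pop(node, None)
        self.down_until[node] = self.clock() + self.retry_after
```

The expensive `connect` call happens once per node while the node stays healthy, and a failed node is skipped until `retry_after` seconds have passed.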
  • 25. Distribution
  • 26. Distribution: Load balancing (cpu/node)
      • [Diagram: records -> load balancer -> three processors, each sending to the next node]
      • Distribute large scale data to many nodes
          • nodes: servers, or processor processes
      • to make total performance high
  • 27. Distribution: High availability (process/node)
      • [Diagram: records -> load balancer -> three processors, each sending to the next node]
      • Distribute large scale data to N+1 (or N+2 or more) nodes
      • to make the system tolerant of node trouble
          • without any failover (and takeback) operations
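Slides 26 and 27 describe the same mechanism from two angles: spreading load over several processors, and continuing to work when one of them is down. A minimal round-robin sketch, assuming each destination is a callable that raises `ConnectionError` when its node is unavailable (the class name `RoundRobinSender` is hypothetical):

```python
class RoundRobinSender:
    """Rotate over destinations and skip the ones that fail (hypothetical sketch)."""

    def __init__(self, senders):
        self.senders = list(senders)  # callables; raise ConnectionError when down
        self.cursor = 0               # index of the destination to try first

    def send(self, records):
        """Deliver to the next healthy destination and return its index."""
        count = len(self.senders)
        for attempt in range(count):
            index = (self.cursor + attempt) % count
            try:
                self.senders[index](records)
            except ConnectionError:
                continue              # try the following destination
            self.cursor = (index + 1) % count
            return index
        raise ConnectionError("all destinations are down")
```

With N+1 destinations, losing one only shifts its share to the others; no separate failover or takeback step is needed because a recovered destination is simply tried again on a later rotation.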
  • 28. Routing
      • [Diagram: records -> router -> process A / process B -> router -> records for service A to output A, records for service B to output B, records for service C to output C]
  • 29. TOO MANY FEATURES TO IMPLEMENT !!!!!
  • 30. Frameworks and tools for "Distributed Stream Processing"
  • 31. Frameworks and tools
      • Apache Kafka
          • written in Scala (... with JVM!)
      • Twitter Storm
          • written in Clojure (... with JVM!)
      • Fluentd
  • 32. Fluentd
  • 33. Fluentd
      • Mainly written by @frsyuki at Treasure Data
      • Apache License 2.0 (APLv2) software on GitHub
      • Log read/transfer/write daemon based on MessagePack
          • structured data (Hash: key:value pairs)
      • Plugin mechanism for input/output/buffer features
          • many plugins are now published
  • 34. Fluentd features: input/output
      • File tailing, network, and other input plugins
          • tail and parse line by line
          • receive records from an app logger or another fluentd
          • in_syslog, in_exec, in_dstat, .....
      • Output to many storages/systems
          • other fluentd, file, S3, mongodb, mysql, HDFS, .....
  • 35. Fluentd features: buffers
      • Pluggable buffers
          • output plugin buffers are swappable (by configuration)
          • In-memory buffers: fast, but lost when fluentd goes down
          • File buffers: slow, but always saved
      • Buffer plugins can also be added by users
          • No public plugin exists yet....
  • 36. Fluentd features: routing
      • Tag based routing
          • all records have a tag and a time
          • Fluentd uses the tag to decide which plugin each record is sent to next
      • configurations are:
          • tag matcher pattern + plugin configuration
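Tag based routing can be illustrated with a small matcher. This is a simplified sketch, not Fluentd's implementation: it supports only the two wildcards `*` (exactly one dot-separated tag part) and `**` (zero or more parts), and the names `tag_matches` and `Router` are hypothetical. Rules are tried in order and the first matching rule receives the record.

```python
def tag_matches(pattern, tag):
    """Return True if a dot-separated tag matches a simplified pattern."""
    return _match(pattern.split("."), tag.split("."))


def _match(pattern_parts, tag_parts):
    if not pattern_parts:
        return not tag_parts
    head, rest = pattern_parts[0], pattern_parts[1:]
    if head == "**":
        # '**' may absorb zero or more tag parts.
        return any(_match(rest, tag_parts[i:]) for i in range(len(tag_parts) + 1))
    if not tag_parts:
        return False
    return (head == "*" or head == tag_parts[0]) and _match(rest, tag_parts[1:])


class Router:
    """Send each record to the first output whose pattern matches its tag."""

    def __init__(self, rules):
        self.rules = list(rules)  # (pattern, output) pairs, in priority order

    def route(self, tag, record):
        for pattern, output in self.rules:
            if tag_matches(pattern, tag):
                output(tag, record)
                return True
        return False              # no rule matched
```

For example, with the rules `("access.*", store)` and `("**", archive)`, a record tagged `access.web` goes to `store`, while `trace.app.db` falls through to `archive`.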
  • 37. Fluentd features: exec_filter
      • Output records to a specified (and forked) command
      • And get records back from the command's STDOUT
      • We can specify our own stream processor as the command
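A stream processor used as such a command is just a program that reads records on STDIN and writes records on STDOUT. The sketch below assumes one JSON object per line in both directions; that format is an assumption made for this example, and the actual wire format depends on how exec_filter is configured. The `status` field and the functions `transform` and `run_filter` are likewise hypothetical.

```python
import json
import sys


def transform(record):
    """Select server errors and mark them; return None to drop a record."""
    if int(record.get("status", 0)) < 500:  # 'status' is a hypothetical field
        return None
    converted = dict(record)
    converted["error"] = True
    return converted


def run_filter(inp=sys.stdin, out=sys.stdout):
    """Read one JSON record per line, write the surviving records back out."""
    for line in inp:
        line = line.strip()
        if not line:
            continue
        result = transform(json.loads(line))
        if result is not None:
            out.write(json.dumps(result) + "\n")
            out.flush()  # do not hold records back in the stdio buffer
```

A script would call `run_filter()` at the end. Flushing after every record keeps latency low at the cost of more write calls, which is the same trade-off discussed on the flush interval slides.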
  • 38. I'm very sorry that....
  • 39. Fluentd is written in Ruby. Fluentd plugins are released as rubygems.
  • 40. Problems about Fluentd (for stream processing)
      • Eager buffering
          • Eager default buffering config, not to flush under 8MB
      • Performance
          • Many features for data protection hurt performance
      • Doesn't work on Windows
  • 41. Implementations in the Perl world
  • 42. fluent-agent-lite (Fluent::AgentLite)
      • Log collection agent tool (in Perl) by tagomoris
      • fast and low load
      • gets logs from file/STDIN, and sends them to other nodes
      • minimal features for a log collector agent
          • doesn't parse log lines (sends 1 attribute with the whole log line)
      • supports load balancing and failover of destinations
  • 43. fluent-agent (Fluent::Agent)
      • Fluentd feature subset tool by tagomoris
          • written in Perl
          • libuv and the UV module as async I/O lib (for Windows)
      • Goal: simple, fast and easy deployment
      • UNDER CONSTRUCTION
          • 60% of features and many bugs, not on CPAN yet
  • 44. Features of Fluent::Agent
      • 1 input, 1 output and 0/1 filter
      • Network I/O: protocol compatible with Fluentd
          • and a simple load balancing/failover feature
      • File input/output: superset of Fluentd's features (planned)
      • Filter with any command: compatible with Fluentd's exec_filter
      • [Diagram: data/records -> input -> filter (any program you want) -> output -> data/records]
  • 45. Pros of Fluent::Agent (in plan)
      • Simple and fast software for stream processing
      • Stateless nodes
          • fluent-agent works without any configuration files
          • fluent-agent works with only command-line options
      • Simple buffering and load balancing
          • less memory usage
  • 46. Cons of Fluent::Agent (in fact)
      • Poor input/output methods
          • fluent-agent doesn't have a plugin architecture (currently)
          • in future, a CPAN based plugin system?
      • Lack of data protection against death of the process
          • fluent-agent has only a memory buffer
  • 47. Fluentd and fluent-agent
  • 48. Fluentd and fluent-agent and fluent-agent-lite
      • [Diagram: many service nodes each run fluent-agent-lite, which sends to tiers of fluent-agent processes and then to fluentd. Tier labels: deliver, processor, aggregator, writer for storages.]
  • 49. Conclusion
      • Distributed Stream Processing:
          • provides more power to our applications
          • is a very hard (and interesting) problem
          • has some supporting frameworks/tools like Fluentd and/or fluent-agent
  • 50. Let's try to improve your application with stream processing instead of many many batches. Thanks! CAST: crouton & luke & chacha. Thanks to @kbysmnr