Development of a Distributed Stream Processing System
Development of a Distributed Stream Processing System (DSPS) in node.js and ZeroMQ and demonstration of an application of trending topics with a dataset from Twitter.


Presentation Transcript

  • 1. Development of a Distributed Stream Processing System. Maycon Viana Bordin. Final Assignment, Instituto de Informática, Universidade Federal do Rio Grande do Sul. CMP157 – PDP 2013/2, Claudio Geyer
  • 2. What’s Stream Processing?
  • 3. Stream Source: emits data continuously and sequentially
  • 4. Operators: count, join, filter, map
  • 5. Data streams
  • 6. Sink
  • 7. Data Stream
  • 8. Tuple -> (“word”, 55)
  • 9. Tuples are ordered by a timestamp or another attribute
  • 10. Data from the stream source may or may not be structured
  • 11. The amount of data is usually unbounded in size
  • 12. The input rate is variable and typically unpredictable
  • 13. Operators
  • 14-15. An operator (OP) receives one or more data streams and sends one or more data streams
  • 16. Operator Classification
  • 17-21. Operator taxonomy: operators are either Stateless (map, filter) or Stateful; stateful operators are further divided into Non-Blocking (count, sum) and Blocking (join, frequent itemset)
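To make the distinction concrete, here is a minimal sketch of one operator of each kind, assuming a simple process(tuple, emit) interface; the interface is illustrative, not the framework's actual API:

```js
// Stateless operator: the output depends only on the current tuple.
const toLower = {
  process(tuple, emit) {
    emit([tuple[0].toLowerCase(), tuple[1]]);
  }
};

// Stateful, non-blocking operator: keeps a running count per key and can
// emit an updated result for every input tuple, without waiting for the
// (unbounded) stream to end.
const wordCount = {
  counts: {},
  process(tuple, emit) {
    const word = tuple[0];
    this.counts[word] = (this.counts[word] || 0) + 1;
    emit([word, this.counts[word]]);
  }
};
```

A blocking operator such as join, by contrast, cannot emit anything until it has seen all of its input, which is what motivates the windows discussed next.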
  • 22. Blocking operators need all input in order to generate a result
  • 23. but that’s not possible since data streams are unbounded
  • 24. To solve this issue, tuples are grouped into windows
  • 25. A window has a range in time units or number of tuples, delimited by a window start (ws) and a window end (we)
  • 26. As the window advances, the old (ws, we) boundaries slide forward to new ones
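A count-based sliding window fits in a few lines; this is a minimal sketch, assuming range and advance are given in numbers of tuples (a time-based window would compare timestamps instead), with illustrative names:

```js
// Count-based sliding window: emits the buffered tuples whenever the
// window is full, then advances the (ws, we) boundaries.
class SlidingWindow {
  constructor(range, advance, onWindow) {
    this.range = range;       // window size, in tuples
    this.advance = advance;   // how far the window moves on each slide
    this.onWindow = onWindow; // callback invoked with each full window
    this.buffer = [];
  }

  push(tuple) {
    this.buffer.push(tuple);
    if (this.buffer.length === this.range) {
      this.onWindow(this.buffer.slice());            // emit the full window
      this.buffer = this.buffer.slice(this.advance); // drop the oldest tuples
    }
  }
}

// With range=4 and advance=2, windows cover tuples [1..4], [3..6], [5..8], ...
```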
  • 27. Implementation: Architecture
  • 28. Architecture diagram: a client submits, starts and stops applications through the master; the master coordinates the slaves, each of which runs worker threads and reports its status via a heartbeat
  • 29-30. The heartbeat carries the status of each worker in the slave: tuples processed, throughput and latency
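A sketch of what such a heartbeat message could look like; all field names here are assumptions for illustration, not taken from the source:

```js
// Illustrative heartbeat payload sent from a slave to the master.
const heartbeat = {
  slaveId: 'slave-1',
  timestamp: Date.now(),
  workers: [
    { operator: 'extract',  instance: 0, tuplesProcessed: 12840, throughput: 4280, latencyMs: 3.2 },
    { operator: 'countmin', instance: 0, tuplesProcessed: 12790, throughput: 4263, latencyMs: 5.1 }
  ]
};
// The slave would serialize and send it periodically, e.g.:
// socket.send(JSON.stringify(heartbeat));
```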
  • 31. Implementation: Application
  • 32. Applications are composed as a DAG (Directed Acyclic Graph)
  • 33. To illustrate, let’s look at the graph of a Trending Topics application
  • 34. Application graph: stream → extract → hashtags → countmin sketch → File Sink
  • 35. stream: the stream source emits tweets in JSON format
  • 36. extract: extracts the text from the tweet and adds a timestamp to each tuple
  • 37. hashtags: extracts and emits each #hashtag in the tweet
  • 38. countmin sketch: approximate frequent itemset in constant time and space [Cormode and Muthukrishnan, 2005]
  • 39. Without a window, it emits all top-k items each time a hashtag is received
  • 40. With a window, the number of tuples emitted is reduced, but the latency is increased
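For reference, a minimal count-min sketch in the spirit of Cormode and Muthukrishnan (2005): d rows of w counters, where update increments one counter per row and query takes the minimum across rows. The hash function is a simple illustrative choice, not necessarily the one used in the presentation:

```js
class CountMinSketch {
  constructor(width, depth) {
    this.width = width; // counters per row; larger width -> smaller error
    this.depth = depth; // number of rows; more rows -> lower failure odds
    this.rows = Array.from({ length: depth }, () => new Array(width).fill(0));
  }

  // Simple seeded string hash (any pairwise-independent family works).
  hash(key, seed) {
    let h = seed;
    for (let i = 0; i < key.length; i++) {
      h = (h * 31 + key.charCodeAt(i)) >>> 0;
    }
    return h % this.width;
  }

  update(key, count = 1) {
    for (let d = 0; d < this.depth; d++) {
      this.rows[d][this.hash(key, d + 1)] += count;
    }
  }

  query(key) {
    let min = Infinity;
    for (let d = 0; d < this.depth; d++) {
      min = Math.min(min, this.rows[d][this.hash(key, d + 1)]);
    }
    return min; // may overestimate, never underestimates
  }
}

// const cms = new CountMinSketch(2048, 5);
// cms.update('#node'); cms.query('#node'); // -> 1
```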
  • 41. The second step in building an application is to set the number of instances of each operator:
  • 42. Diagram: one stream instance feeding five extract instances, which feed five countmin instances, which feed a single File Sink
  • 43. But the user has to choose how tuples are partitioned among the operator instances
  • 44. All-to-All Partitioning
  • 45-46. Diagram: with all-to-all partitioning, every tuple emitted by an upstream instance is delivered to all downstream instances
  • 47. Round-Robin Partitioning
  • 48-53. Diagram: with round-robin partitioning, successive tuples are delivered to the downstream instances one at a time, in circular order
  • 54. Field Partitioning
  • 55-58. Diagram: with field partitioning, tuples with the same value in the partitioning field, e.g. (“foo”, 1), are always delivered to the same downstream instance
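The three strategies fit in a few lines each; a sketch, where the tuple layout, hash function and return convention (a list of destination instance indexes) are assumptions:

```js
// All-to-all: every downstream instance receives the tuple.
function allToAll(tuple, numInstances) {
  return Array.from({ length: numInstances }, (_, i) => i);
}

// Round-robin: instances receive successive tuples in circular order.
let next = 0;
function roundRobin(tuple, numInstances) {
  return [next++ % numInstances];
}

// Field partitioning: hash a key field so that equal keys always land on
// the same instance, letting stateful operators like countmin keep
// consistent per-key state.
function fieldPartition(tuple, numInstances) {
  const key = tuple[0]; // e.g. "foo" in ("foo", 1)
  let h = 0;
  for (let i = 0; i < key.length; i++) {
    h = (h * 31 + key.charCodeAt(i)) >>> 0;
  }
  return [h % numInstances];
}
```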
  • 59. The communication between operators is done with the pub/sub pattern
  • 60-61. Each operator subscribes to all of its upstream operators, using its own ID as a filter
  • 62. As a result, an operator only receives tuples that carry its ID as a prefix
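A sketch of this pattern with the classic node.js zmq bindings; the operator IDs, endpoints and message layout are illustrative, so treat this as an approximation of the mechanism rather than the framework's actual code:

```js
const zmq = require('zmq');

// Upstream instance: publish each tuple with the destination operator's
// ID as the first frame, so subscribers can filter on it.
const pub = zmq.socket('pub');
pub.bindSync('tcp://*:5556');
pub.send(['countmin-2', JSON.stringify(['#node', 1])]);

// Downstream instance: connect to every upstream publisher and subscribe
// with its own ID; ZeroMQ then delivers only messages with that prefix.
const sub = zmq.socket('sub');
sub.connect('tcp://upstream-host:5556');
sub.subscribe('countmin-2');
sub.on('message', (envelope, payload) => {
  const tuple = JSON.parse(payload.toString());
  // process(tuple) ...
});
```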
  • 63. The last step is to get each operator instance from the graph and assign it to a node
  • 64. Diagram: the twelve operator instances (one stream, five extract, five countmin, one File Sink) spread across node-0, node-1 and node-2
  • 65. Currently the scheduler is static and only balances the number of operators per node
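A sketch of what such a static scheduler might do, assigning instances to nodes in round-robin order so that only the counts are balanced; the names and data layout are assumptions:

```js
// Static scheduler: balances the number of instances per node and
// nothing else; co-location of communicating operators is accidental.
function schedule(instances, nodes) {
  const assignment = new Map(nodes.map((node) => [node, []]));
  instances.forEach((instance, i) => {
    assignment.get(nodes[i % nodes.length]).push(instance);
  });
  return assignment;
}

// e.g. the twelve instances above over ['node-0', 'node-1', 'node-2']
// gives four per node, regardless of the network cost between them.
```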
  • 66. Implementation: Framework
  • 67. trending-topics.js
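The transcript does not capture the code shown on this slide, so the following is only a hypothetical sketch of how the application might be declared; the actual trending-topics.js lives in the repository at github.com/mayconbordin/tempest:

```js
// Hypothetical topology declaration (names and structure are guesses).
const app = {
  operators: {
    stream:   { type: 'twitter-json-source', instances: 1 },
    extract:  { type: 'extract-hashtags',    instances: 5 },
    countmin: { type: 'countmin-sketch',     instances: 5, window: 20 },
    sink:     { type: 'file-sink',           instances: 1 }
  },
  streams: [
    { from: 'stream',   to: 'extract',  partitioning: 'round-robin' },
    { from: 'extract',  to: 'countmin', partitioning: 'field' },
    { from: 'countmin', to: 'sink',     partitioning: 'all-to-all' }
  ]
};
```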
  • 68. Tests: Specification
  • 69. Application: Trending Topics, with a 40 GB dataset from Twitter
  • 70. Test environment: GridRS (PUCRS), 3 nodes, each with 4 x 3.52 GHz Intel Xeon CPUs, 2 GB RAM, Linux 2.6.32-5-amd64, Gigabit Ethernet
  • 71. Metrics: runtime; latency (the time for a tuple to traverse the graph); throughput (number of tuples processed per second); loss of tuples. Methodology: 5 runs per test; every 3 s each operator sends its status with the number of tuples processed; the PerfMon sink collects a tuple every 100 ms and sends the average latency every 3 s (clearing the collected tuples afterwards). Variables: number of nodes, number of operator instances, window size
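A sketch of the PerfMon sampling described above, assuming each tuple carries the timestamp added by the extract operator; report is a stand-in for sending the result to the master:

```js
const report = (msg) => console.log(msg); // stand-in for sending to the master
const samples = [];
let lastSample = 0;

// Called for every tuple reaching the PerfMon sink.
function onTuple(tuple) {
  const now = Date.now();
  if (now - lastSample >= 100) {         // collect one sample every 100 ms
    samples.push(now - tuple.timestamp); // latency across the whole graph
    lastSample = now;
  }
}

// Every 3 s, send the average latency and clear the collected samples.
setInterval(() => {
  if (samples.length === 0) return;
  const avg = samples.reduce((a, b) => a + b, 0) / samples.length;
  report({ avgLatencyMs: avg });
  samples.length = 0;
}, 3000);
```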
  • 72. Tests: Number of Nodes
  • 73. Chart: runtime (minutes) and latency (ms) for 1, 2 and 3 nodes
  • 74. Chart: runtime (minutes) and stream rate (MB/s) for 1, 2 and 3 nodes
  • 75. Chart: throughput (tuples per second) of the stream, extractor, countmin, filesink and perfmon operators for 1, 2 and 3 nodes
  • 76. Chart: lost tuples per operator (stream, extractor, countmin, filesink, perfmon) for 1, 2 and 3 nodes
  • 77. Chart: throughput (tuples per second) and latency (ms) over time for the stream, extractor, countmin and filesink operators (nodes=3, instances=5, window=20)
  • 78. Tests: Window Size
  • 79. Chart: runtime (minutes) and latency (ms) for window sizes of 20, 80, 120 and 200
  • 80. Tests: No. of Instances
  • 81. Chart: runtime (minutes) and stream rate (MB/s) for 1 and 5 operator instances
  • 82. Conclusions
  • 83. The system was able to process more data with the inclusion of more nodes
  • 84. On the other hand, distributing the load increased the latency
  • 85. The scheduler should be improved to reduce network communication
  • 86. Communication between workers on the same node should happen through main memory rather than the network
  • 87. References
    Chakravarthy, Sharma. Stream Data Processing: A Quality of Service Perspective: Modeling, Scheduling, Load Shedding, and Complex Event Processing. Vol. 36. Springer, 2009.
    Cormode, Graham, and S. Muthukrishnan. "An Improved Data Stream Summary: The Count-Min Sketch and Its Applications." Journal of Algorithms 55.1 (2005): 58-75.
    Gulisano, Vincenzo Massimiliano, Ricardo Jiménez Peris, and Patrick Valduriez. StreamCloud: An Elastic Parallel-Distributed Stream Processing Engine. Diss. Informatica, 2012.
    Source code @ github.com/mayconbordin/tempest