Development of a Distributed Stream Processing System
Maycon Viana Bordin
Final Assignment
Instituto de Informática
Universidade Federal do Rio Grande do Sul
CMP157 – PDP 2013/2, Claudio Geyer
What's Stream Processing?

A stream source emits data continuously and sequentially. Operators (count, join, filter, map) transform the data streams, and a sink consumes the resulting data stream.

The unit of data is the tuple, e.g. (“word”, 55). Tuples are ordered by a timestamp or other attribute.

Data from the stream source may or may not be structured. The amount of data is usually unbounded in size, and the input rate is variable and typically unpredictable.
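As a concrete illustration (the system described later is implemented in node.js, per the published description; the names here are illustrative, not the thesis API), a tuple and a stream source might look like:

```javascript
// A tuple pairs a timestamp with its values, e.g. ("word", 55).
function makeTuple(word, count) {
  return { ts: Date.now(), values: [word, count] };
}

// A stream source emits tuples continuously and sequentially; a
// generator stands in for an unbounded, variable-rate source here.
function* streamSource(words) {
  for (const w of words) {
    yield makeTuple(w, 1);
  }
}

const tuples = [...streamSource(["foo", "bar", "foo"])];
```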
Operators

An operator (OP) receives one or more data streams and sends one or more data streams.
Operators: Classification

- Stateless (map, filter)
- Stateful
  - Non-Blocking (count, sum)
  - Blocking (join, frequent itemset)
Blocking operators need all of their input in order to generate a result, but that's not possible since data streams are unbounded. To solve this issue, tuples are grouped in windows.

A window has a range, defined in time units or number of tuples, delimited by a window start (ws) and a window end (we). When the window slides, ws and we move forward by the advance: the old ws/we are replaced by a new ws and a new we.
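A count-based window can be sketched as follows; this is a minimal illustration of the ws/we/advance mechanics, not the thesis implementation:

```javascript
// Count-based sliding window: emits the tuples in [ws, we) each time
// the window fills, then advances by `advance` tuples.
class SlidingWindow {
  constructor(size, advance) {
    this.size = size;       // window range, in number of tuples
    this.advance = advance; // how far ws/we move on each slide
    this.buffer = [];
  }
  // Returns the window contents when it closes, otherwise null.
  push(tuple) {
    this.buffer.push(tuple);
    if (this.buffer.length < this.size) return null;
    const result = this.buffer.slice();            // tuples in [ws, we)
    this.buffer = this.buffer.slice(this.advance); // old ws -> new ws
    return result;
  }
}

const w = new SlidingWindow(3, 2);
const out = [1, 2, 3, 4, 5].map(t => w.push(t)).filter(r => r !== null);
// out: [[1, 2, 3], [3, 4, 5]]
```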
Implementation: Architecture

A client submits, starts and stops the app through the master. The master coordinates a set of slaves; each slave runs worker threads and periodically sends a heartbeat to the master.

The heartbeat carries the status of each worker in the slave: tuples processed, throughput and latency.
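The slides do not show the wire format; as an assumption, a heartbeat message bundling per-worker status might look like this (all field names are hypothetical):

```javascript
// Hypothetical heartbeat payload: one entry per worker thread on the slave.
function makeHeartbeat(slaveId, workers) {
  return {
    slave: slaveId,
    ts: Date.now(),
    workers: workers.map(w => ({
      id: w.id,
      tuplesProcessed: w.tuplesProcessed, // total tuples so far
      throughput: w.throughput,           // tuples/s since last heartbeat
      latency: w.latency                  // avg ms per tuple
    }))
  };
}

const hb = makeHeartbeat("slave-1", [
  { id: "extract-0", tuplesProcessed: 1200, throughput: 400, latency: 2.5 }
]);
```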
Implementation: Application

Applications are composed as a DAG (Directed Acyclic Graph). To illustrate, let's look at the graph of a Trending Topics application:

stream -> extract hashtags -> countmin sketch -> File Sink

- stream: emits tweets in JSON format; the text is extracted from the tweet and a timestamp is added to each tuple.
- extract hashtags: extracts and emits each #hashtag in the tweet.
- countmin sketch: constant time and space approximate frequent itemset [Cormode and Muthukrishnan, 2005]. Without a window, it will emit all top-k items each time a hashtag is received; with a window, the number of tuples emitted is reduced, but the latency is increased.
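The count-min sketch keeps d rows of w counters and answers frequency queries with the minimum over an item's d counters, so it never underestimates a count. A compact sketch of the idea (the hash function here is a simple FNV-style illustration, not the original's):

```javascript
// Minimal count-min sketch: d rows of w counters; estimate = min over rows.
class CountMinSketch {
  constructor(w = 256, d = 4) {
    this.w = w;
    this.d = d;
    this.rows = Array.from({ length: d }, () => new Array(w).fill(0));
  }
  // Simple per-row string hash (illustrative only).
  hash(item, row) {
    let h = 2166136261 ^ row;
    for (let i = 0; i < item.length; i++) {
      h = Math.imul(h ^ item.charCodeAt(i), 16777619);
    }
    return (h >>> 0) % this.w;
  }
  add(item) {
    for (let r = 0; r < this.d; r++) this.rows[r][this.hash(item, r)]++;
  }
  // Never underestimates the true count; may overestimate on collisions.
  estimate(item) {
    let min = Infinity;
    for (let r = 0; r < this.d; r++) {
      min = Math.min(min, this.rows[r][this.hash(item, r)]);
    }
    return min;
  }
}

const cms = new CountMinSketch();
["#a", "#b", "#a", "#a"].forEach(t => cms.add(t));
// cms.estimate("#a") is at least 3
```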
The second step in building an application is to set the number of instances of each operator, e.g.:

stream -> 5x extract -> 5x countmin -> File Sink

But the user has to choose the way tuples are going to be partitioned among the operator instances.
All-to-All Partitioning: every upstream instance sends each tuple to all downstream instances.
Round-Robin Partitioning: each upstream instance spreads its tuples evenly, sending each successive tuple to the next downstream instance in turn.
Field Partitioning: tuples are routed by the value of a chosen field, so every tuple with the same value, e.g. (“foo”, 1), always reaches the same downstream instance.
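The three strategies can be modeled as routing functions from a tuple to a set of downstream instance indices; a minimal sketch (not the thesis code):

```javascript
// Each partitioner maps a tuple to downstream instance indices (0..n-1).

// All-to-All: every tuple goes to every downstream instance.
const allToAll = n => tuple => [...Array(n).keys()];

// Round-Robin: tuples are spread evenly, one instance at a time.
function roundRobin(n) {
  let next = 0;
  return tuple => [next++ % n];
}

// Field: route by a hash of one field, so equal keys share an instance.
function fieldPartition(n, field) {
  const hash = s => {
    let h = 0;
    for (let i = 0; i < s.length; i++) h = (h * 31 + s.charCodeAt(i)) | 0;
    return Math.abs(h);
  };
  return tuple => [hash(String(tuple[field])) % n];
}

const byKey = fieldPartition(5, 0);
// Same key always routes to the same instance, e.g. ["foo", 1] and
// ["foo", 7] both land on byKey(["foo", 1])[0].
```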
The communication between operators is done with the pub/sub pattern: each operator subscribes to all of its upstream operators, with its own ID as a filter, and will only receive tuples that carry its ID as a prefix.
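This matches ZeroMQ's SUB-socket semantics, where a subscription is a byte prefix and only messages beginning with it are delivered. The filtering can be simulated without the library:

```javascript
// ZeroMQ-style prefix filtering: a subscriber with filter `id` only
// receives messages whose payload starts with that id.
function deliver(message, subscribers) {
  return subscribers
    .filter(sub => message.startsWith(sub.id))
    .map(sub => sub.id);
}

const subs = [{ id: "countmin-0" }, { id: "countmin-1" }];
// Only countmin-1 receives this tuple:
const got = deliver('countmin-1 ("#foo", 1)', subs);
```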
The last step is to get each operator instance from the graph and assign it to a node.

[Figure: the twelve operator instances (stream, 5x extract, 5x countmin, File Sink) distributed across node-0, node-1 and node-2]

Currently the scheduler is static and only balances the number of operators per node.
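A static scheduler that only balances instance counts per node can be as simple as round-robin assignment; a sketch under that assumption:

```javascript
// Assign operator instances to nodes so per-node counts stay balanced.
function schedule(instances, nodes) {
  const assignment = {};
  nodes.forEach(n => (assignment[n] = []));
  instances.forEach((inst, i) => {
    assignment[nodes[i % nodes.length]].push(inst);
  });
  return assignment;
}

const plan = schedule(
  ["stream", "extract-0", "extract-1", "countmin-0", "countmin-1", "sink"],
  ["node-0", "node-1", "node-2"]
);
// plan["node-0"]: ["stream", "countmin-0"]
```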
Implementation: Framework

trending-topics.js
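The actual trending-topics.js is not reproduced in the slides; composing the DAG with such a framework might look roughly like this (every API name below is hypothetical):

```javascript
// Hypothetical DAG-builder API, for illustration only.
class Topology {
  constructor() { this.ops = []; }
  addOperator(name, { instances = 1, partitioning = "round-robin" } = {}) {
    this.ops.push({ name, instances, partitioning });
    return this;
  }
}

// stream -> extract hashtags -> countmin sketch -> file sink
const app = new Topology()
  .addOperator("stream")
  .addOperator("extract-hashtags", { instances: 5 })
  .addOperator("countmin-sketch", { instances: 5, partitioning: "field" })
  .addOperator("file-sink");
```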
Tests: Specification

Application: Trending Topics, with a dataset of 40 GB from Twitter.

Test Environment: GridRS - PUCRS; 3 nodes; 4 x 3.52 GHz (Intel Xeon); 2 GB RAM; Linux 2.6.32-5-amd64; Gigabit Ethernet.

Metrics:
- Runtime
- Latency: time for a tuple to traverse the graph
- Throughput: no. of tuples processed per second
- Loss of Tuples

Methodology: 5 runs per test. Every 3 s each operator sends its status with the no. of tuples processed. The PerfMon sink collects a tuple every 100 ms and sends the average latency every 3 s (then cleans up the collected tuples).

Variables: number of nodes, number of operator instances, window size.
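The PerfMon behaviour described above amounts to a sample-and-flush loop; a sketch of that logic (not the actual sink):

```javascript
// PerfMon-style latency aggregation: collect sampled latencies, then
// report the average and clear the samples (every 3 s in the tests).
class PerfMon {
  constructor() { this.samples = []; }
  collect(latencyMs) { this.samples.push(latencyMs); }
  flush() {
    if (this.samples.length === 0) return 0;
    const avg = this.samples.reduce((a, b) => a + b, 0) / this.samples.length;
    this.samples = []; // clean up the collected tuples
    return avg;
  }
}

const pm = new PerfMon();
[10, 20, 30].forEach(l => pm.collect(l));
const avg = pm.flush(); // 20, and the samples are cleared
```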
Tests: Number of Nodes

[Chart: Runtime vs Latency; runtime (min) and latency (ms) for 1, 2 and 3 nodes]
[Chart: Runtime vs Stream Rate; runtime (min) and stream rate (MB/s) for 1, 2 and 3 nodes]
[Chart: Throughput; tuples per second for the stream, extractor, countmin, filesink and perfmon operators, for 1, 2 and 3 nodes]
[Chart: Loss of Tuples; lost tuples per operator for 1, 2 and 3 nodes]
[Chart: Throughput and Latency Over Time (nodes=3, instances=5, window=20)]
Tests: Window Size

[Chart: Runtime vs Latency; runtime (min) and latency (ms) for window sizes 20, 80, 120 and 200]
Tests: No. of Instances

[Chart: Runtime vs Stream Rate; runtime (min) and stream rate (MB/s) for 1 and 5 instances]
Conclusions

The system was able to process more data with the inclusion of more nodes. On the other hand, distributing the load increased the latency. The scheduler has to reduce the network communication, and the communication between workers in the same node has to happen through main memory.
References

Chakravarthy, Sharma. Stream Data Processing: A Quality of Service Perspective: Modeling, Scheduling, Load Shedding, and Complex Event Processing. Vol. 36. Springer, 2009.

Cormode, Graham, and S. Muthukrishnan. "An improved data stream summary: the count-min sketch and its applications." Journal of Algorithms 55.1 (2005): 58-75.

Gulisano, Vincenzo Massimiliano, Ricardo Jiménez Peris, and Patrick Valduriez. StreamCloud: An Elastic Parallel-Distributed Stream Processing Engine. Diss. Informatica, 2012.

Source code @ github.com/mayconbordin/tempest
Development of a Distributed Stream Processing System (DSPS) in node.js and ZeroMQ, and demonstration of an application of trending topics with a dataset from Twitter.
