Development of a Distributed
Stream Processing System
Maycon Viana Bordin
Final Assignment

Instituto de Informática
Universidade Federal do Rio Grande do Sul

CMP157 – PDP 2013/2, Claudio Geyer
What’s Stream Processing?

[Diagram: a stream source emits a sequence of numbered tuples (…, 3, 2, 1) that flow through operators to a sink]

Stream Source: emits data continuously and sequentially
Operators: count, join, filter, map
Sink: consumes the resulting data stream
Tuple -> (“word”, 55)
Tuples are ordered by a timestamp or other attribute

Data from the stream source may or may not be structured
The amount of data is usually unbounded in size
The input rate is variable and typically unpredictable
Operators

An operator receives one or more data streams and sends one or more data streams.
Operators
Classification

OPERATORS
  Stateless (map, filter)
  Stateful
    Non-Blocking (count, sum)
    Blocking (join, freq. itemset)
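The distinction can be illustrated with two minimal operators (an illustrative sketch, not the system’s actual API): a stateless map transforms each tuple independently, while a stateful non-blocking count keeps running totals and can emit an updated result for every incoming tuple.

```javascript
// Stateless operator: output depends only on the current tuple.
function mapOperator(fn) {
  return (tuple) => [fn(tuple)];
}

// Stateful, non-blocking operator: keeps a running count per key and
// emits the updated count on every tuple, without waiting for more input.
function countOperator() {
  const counts = new Map();
  return ([key]) => {
    counts.set(key, (counts.get(key) || 0) + 1);
    return [[key, counts.get(key)]];
  };
}

const upper = mapOperator(([word, n]) => [word.toUpperCase(), n]);
const count = countOperator();

// upper(["foo", 1])   -> [["FOO", 1]]
// count(["foo"]) twice -> [["foo", 1]], then [["foo", 2]]
```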
Blocking operators need all of their input in order to generate a result,
but that’s not possible since data streams are unbounded.
To solve this issue, tuples are grouped in windows,
with a range in time units or number of tuples.

[Diagram: a window spans from the window start (ws) to the window end (we); on each advance, the old ws/we slide forward to a new ws/we]
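The windowing idea can be sketched as a count-based sliding window (names here are illustrative, not the system’s API): the window holds `size` tuples, a blocking operator runs over each full window, and the window then advances by `advance` tuples.

```javascript
// Count-based sliding window: buffers tuples until the window is full,
// applies a blocking operator over the whole window, then slides forward.
function slidingWindow(size, advance, operator) {
  const buffer = [];
  return (tuple) => {
    buffer.push(tuple);
    if (buffer.length < size) return null;   // window not full yet
    const result = operator(buffer.slice()); // run over the full window
    buffer.splice(0, advance);               // slide: drop the oldest tuples
    return result;
  };
}

// A blocking sum over the window contents (size 3, advance 1).
const sumWindow = slidingWindow(3, 1, (win) => win.reduce((a, b) => a + b, 0));
// Feeding 1, 2, 3, 4: the first two tuples produce nothing, then the
// windows [1,2,3] and [2,3,4] produce 6 and 9.
```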
Implementation
Architecture

[Diagram: a client submits/starts/stops the app through the master; the master coordinates the slaves, each slave runs worker threads, and the slaves send heartbeats back to the master]

The heartbeat carries the status of each worker in the slave:
  Tuples processed
  Throughput
  Latency
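A heartbeat of that shape might look like the sketch below (field names are illustrative, not the actual wire format): one message per slave, with per-worker tuples processed, throughput, and latency.

```javascript
// Sketch of the heartbeat a slave could send to the master every few
// seconds, summarizing the status of each worker thread.
function buildHeartbeat(slaveId, workers) {
  return {
    slave: slaveId,
    timestamp: Date.now(),
    workers: workers.map((w) => ({
      operator: w.operator,
      tuplesProcessed: w.tuplesProcessed,
      throughput: w.tuplesProcessed / w.uptimeSec, // tuples per second
      latencyMs: w.latencyMs,
    })),
  };
}

const hb = buildHeartbeat("slave-1", [
  { operator: "countmin", tuplesProcessed: 9000, uptimeSec: 3, latencyMs: 12 },
]);
// hb.workers[0].throughput -> 3000 tuples per second
```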
Implementation
Application

Applications are composed as a DAG (Directed Acyclic Graph).
To illustrate, let’s look at the graph of a Trending Topics application:

stream -> extract hashtags -> countmin sketch -> File Sink

stream: the source emits tweets in JSON format; the text is extracted from each tweet and a timestamp is added to each tuple
extract hashtags: extracts and emits each #hashtag in the tweet
countmin sketch: constant time and space approximate frequent itemset [Cormode and Muthukrishnan, 2005]
  Without a window, it will emit all top-k items each time a hashtag is received
  With a window, the number of tuples emitted is reduced, but the latency is increased
File Sink: writes the results to a file
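The core of the countmin operator is the count-min sketch itself [Cormode and Muthukrishnan, 2005]: d rows of w counters, one hash function per row, where the minimum of an item’s d counters is its frequency estimate (never an underestimate). The code below is an illustrative sketch of the data structure, not the implementation used in the system.

```javascript
// Minimal count-min sketch: add() increments one counter per row,
// estimate() takes the minimum over the rows.
class CountMinSketch {
  constructor(width, depth) {
    this.width = width;
    this.depth = depth;
    this.table = Array.from({ length: depth }, () => new Array(width).fill(0));
  }
  // Simple FNV-style per-row string hash; the row index seeds the hash.
  hash(item, row) {
    let h = 2166136261 ^ row;
    for (let i = 0; i < item.length; i++) {
      h = Math.imul(h ^ item.charCodeAt(i), 16777619);
    }
    return (h >>> 0) % this.width;
  }
  add(item) {
    for (let r = 0; r < this.depth; r++) this.table[r][this.hash(item, r)]++;
  }
  estimate(item) {
    let min = Infinity;
    for (let r = 0; r < this.depth; r++) {
      min = Math.min(min, this.table[r][this.hash(item, r)]);
    }
    return min;
  }
}

const cms = new CountMinSketch(1024, 4);
for (let i = 0; i < 55; i++) cms.add("#hashtag");
// cms.estimate("#hashtag") -> 55 (with collisions it can only overestimate)
```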
The second step in building an application is to set the number of instances of each operator:

[Diagram: the same graph with five instances of extract and five instances of countmin between the stream and the File Sink]

But the user has to choose how tuples are going to be partitioned among the operator instances.
All-to-All Partitioning

[Diagram: each extract instance sends its tuples to every countmin instance]
Round-Robin Partitioning

[Diagram: each extract instance cycles through the countmin instances, sending each successive tuple to the next instance in turn]
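A round-robin partitioner reduces to a cycling counter; the sketch below (illustrative, not the system’s code) returns, for each tuple, the index of the next downstream instance.

```javascript
// Round-robin partitioner: successive calls cycle through the
// downstream instance indices 0, 1, ..., numInstances - 1, 0, 1, ...
function roundRobin(numInstances) {
  let next = 0;
  return () => {
    const target = next;
    next = (next + 1) % numInstances;
    return target;
  };
}

const pick = roundRobin(5);
// Six calls visit instances 0, 1, 2, 3, 4, then wrap back to 0.
```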
Field Partitioning

[Diagram: a tuple such as (“foo”, 1) is routed by hashing one of its fields, so all tuples with the same key arrive at the same countmin instance]
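Field partitioning hashes the chosen field modulo the number of instances, which is what guarantees that every tuple with the same key (e.g. the same hashtag) lands on the same countmin instance. A sketch, assuming the first tuple element is the partitioning field:

```javascript
// Field partitioner: hash the key field of the tuple and map it onto
// a downstream instance index; equal keys always map to the same index.
function fieldPartitioner(numInstances) {
  return (tuple) => {
    const key = String(tuple[0]);
    let h = 0;
    for (let i = 0; i < key.length; i++) {
      h = (Math.imul(h, 31) + key.charCodeAt(i)) | 0;
    }
    return Math.abs(h) % numInstances;
  };
}

const route = fieldPartitioner(5);
// route(["foo", 1]) === route(["foo", 7]) -> same instance for the same key
```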
The communication between operators is done with the pub/sub pattern.

[Diagram: the countmin instances subscribe to the streams published by the extract instances]

The operator subscribes to all upstream operators, with its ID as a filter.
The operator will only receive tuples with its ID as prefix.
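The prefix-based filtering can be shown with an in-memory stand-in for the real transport (illustrative only): each downstream instance subscribes with its own ID, and a message is delivered only when its topic starts with that ID.

```javascript
// In-memory pub/sub with prefix filtering: publish() delivers a tuple
// only to subscribers whose filter is a prefix of the message topic.
class PubSub {
  constructor() { this.subs = []; }
  subscribe(prefix, handler) { this.subs.push({ prefix, handler }); }
  publish(topic, tuple) {
    for (const { prefix, handler } of this.subs) {
      if (topic.startsWith(prefix)) handler(tuple);
    }
  }
}

const bus = new PubSub();
const received = [];
bus.subscribe("countmin-2", (t) => received.push(t)); // instance ID as filter

bus.publish("countmin-2", ["#foo", 1]); // delivered: topic has the ID prefix
bus.publish("countmin-3", ["#bar", 1]); // filtered out: different instance
// received -> [["#foo", 1]]
```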
The last step is to get each operator instance from the graph and assign it to a node:

[Diagram: the stream, extract, countmin, and File Sink instances distributed across node-0, node-1, and node-2]
Currently the scheduler is static and
only balances the number of
operators per node
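A scheduler that only balances the operator count per node amounts to dealing instances out round-robin; the sketch below (illustrative, not the system’s code) shows that policy.

```javascript
// Static scheduler: deal the operator instances out to the nodes in
// round-robin order, so each node gets roughly the same number.
function schedule(instances, nodes) {
  const assignment = new Map(nodes.map((n) => [n, []]));
  instances.forEach((inst, i) => {
    assignment.get(nodes[i % nodes.length]).push(inst);
  });
  return assignment;
}

const plan = schedule(
  ["stream", "extract-1", "extract-2", "countmin-1", "countmin-2", "filesink"],
  ["node-0", "node-1", "node-2"]
);
// Each of the three nodes receives two of the six instances.
```

Note that this placement ignores the graph’s edges, which is why co-located operators may still communicate over the network (see Conclusions).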
Implementation
Framework
trending-topics.js
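The original slide showed the contents of trending-topics.js; that code was not preserved here. The sketch below is only a guess at its general shape (all names and fields are hypothetical) to show how the DAG, instance counts, window, and partitioning choices from the previous slides fit together in one application description.

```javascript
// Hypothetical shape of a topology description for the Trending Topics
// app: operators with instance counts, and edges with partitioning modes.
const topology = {
  operators: [
    { name: "stream",   instances: 1 },                      // tweet source
    { name: "extract",  instances: 5 },                      // hashtag extraction
    { name: "countmin", instances: 5, window: { size: 20 } },// frequent itemset
    { name: "filesink", instances: 1 },                      // result output
  ],
  edges: [
    { from: "stream",   to: "extract",  partitioning: "round-robin" },
    { from: "extract",  to: "countmin", partitioning: "field" },
    { from: "countmin", to: "filesink", partitioning: "all-to-all" },
  ],
};
// A client would submit a description like this to the master.
```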
Tests
Specification

Application
  Trending Topics
  Dataset of 40GB from Twitter

Test Environment
  GridRS - PUCRS
  3 nodes
  4 x 3.52 GHz (Intel Xeon)
  2 GB RAM
  Linux 2.6.32-5-amd64
  Gigabit Ethernet

Metrics
  Runtime
  Latency: time for a tuple to traverse the graph
  Throughput: no. of tuples processed per sec.
  Loss of Tuples

Methodology
  5 runs per test.
  Every 3s each operator sends its status with the no. of tuples processed.
  The PerfMon sink collects a tuple every 100ms and sends the average latency every 3s (then discards the collected tuples).

Variables
  Number of nodes
  Number of operator instances
  Window size
Tests
Number of Nodes

Runtime vs Latency

[Chart: runtime (minutes) and latency (ms) for 1, 2, and 3 nodes]
Runtime vs Stream Rate

[Chart: runtime (minutes) and stream rate (MB/s) for 1, 2, and 3 nodes]
Throughput

[Chart: throughput in tuples per second (tps) of the stream, extractor, countmin, filesink, and perfmon operators for 1, 2, and 3 nodes]
Loss of Tuples

[Chart: lost tuples of the stream, extractor, countmin, filesink, and perfmon operators for 1, 2, and 3 nodes]
Throughput and Latency Over Time
(nodes=3, instances=5, window=20)

[Chart: throughput in tuples per second of the stream, extractor, countmin, and filesink operators, and latency (ms), over roughly 2,000 seconds of execution]
Tests
Window Size

Runtime vs Latency

[Chart: runtime (minutes) and latency (ms) for window sizes 20, 80, 120, and 200]
Tests
No. of Instances

Runtime vs Stream Rate

[Chart: runtime (minutes) and stream rate (MB/s) for 1, 5, and 25 instances]
Conclusions
The system was able to process more data with the inclusion of more nodes.
On the other hand, distributing the load increased the latency.
The scheduler has to reduce network communication.
Communication between workers on the same node has to happen through main memory.
References
Chakravarthy, Sharma. Stream data processing: a quality of
service perspective: modeling, scheduling, load shedding, and
complex event processing. Vol. 36. Springer, 2009.
Cormode, Graham, and S. Muthukrishnan. "An improved data
stream summary: the count-min sketch and its applications."
Journal of Algorithms 55.1 (2005): 58-75.
Gulisano, Vincenzo Massimiliano, Ricardo Jiménez Peris, and
Patrick Valduriez. StreamCloud: An Elastic Parallel-Distributed
Stream Processing Engine. Diss. Informatica, 2012.

Source code @ github.com/mayconbordin/tempest
