This tutorial presents an in-depth overview of the streaming analytics landscape: applications, algorithms, and platforms. We walk through how the field has evolved over the last decade and then discuss the current challenges, in particular the impact of the other three Vs of Big Data (Volume, Variety, and Veracity) on streaming analytics.
Rainbird: Realtime Analytics at Twitter (Strata 2011), by Kevin Weil
Introducing Rainbird, Twitter's high volume distributed counting service for realtime analytics, built on Cassandra. This presentation looks at the motivation, design, and uses of Rainbird across Twitter.
Unique ID generation in distributed systems, by Dave Gardner
The document discusses different strategies for generating unique IDs in a distributed system. It covers using auto-incrementing numeric IDs in MySQL, which are not resilient, and various solutions like UUIDs, Twitter Snowflake IDs, and Flickr ticket servers that generate IDs in a distributed and ordered way without coordination between data centers. It also provides code examples of generating Twitter Snowflake-like IDs in PHP without coordination using ZeroMQ.
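The deck's examples are in PHP; for illustration, here is a minimal Python sketch of the same Snowflake-style scheme (41-bit millisecond timestamp, 10-bit worker ID, 12-bit sequence, which is the commonly documented layout). The class name and epoch handling are illustrative, not Twitter's actual code.

```python
import time

# Snowflake-style 64-bit layout (commonly documented convention):
# 41 bits: milliseconds since a custom epoch
# 10 bits: worker ID
# 12 bits: per-millisecond sequence
EPOCH_MS = 1288834974657  # Twitter's published custom epoch

class SnowflakeGenerator:
    def __init__(self, worker_id):
        assert 0 <= worker_id < 1024, "worker ID must fit in 10 bits"
        self.worker_id = worker_id
        self.last_ms = -1
        self.sequence = 0

    def next_id(self, now_ms=None):
        now = int(time.time() * 1000) if now_ms is None else now_ms
        if now == self.last_ms:
            self.sequence = (self.sequence + 1) & 0xFFF  # wrap at 4096
            if self.sequence == 0:
                # sequence exhausted within this millisecond: spin to the next
                while now <= self.last_ms:
                    now = int(time.time() * 1000)
        else:
            self.sequence = 0
        self.last_ms = now
        return ((now - EPOCH_MS) << 22) | (self.worker_id << 12) | self.sequence

gen = SnowflakeGenerator(worker_id=7)
a, b = gen.next_id(), gen.next_id()
assert a < b  # IDs remain time- and sequence-ordered
```

Because the timestamp occupies the high bits, IDs sort by creation time without any coordination between workers, which is the property that makes this scheme attractive for distributed counters and event logs.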
InfluxDB is an open source time series database written in Go that stores metric data and performs real-time analytics. It has no external dependencies. InfluxDB stores data as time series with measurements, tags, and fields. Data is written using a line protocol and can be visualized using Grafana, an open source metrics dashboard.
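To make the write format concrete, here is a small Python sketch that renders one point in line protocol. The `i` suffix for integers and the `measurement,tags fields timestamp` shape follow the documented format; escaping of special characters is omitted for brevity, and the helper name is invented.

```python
def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Format one point as InfluxDB line protocol:
    measurement,tag=val field=val timestamp
    (escaping of spaces/commas in values is omitted for brevity)."""
    tag_part = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))

    def fmt(v):
        if isinstance(v, bool):          # bool must be checked before int
            return "true" if v else "false"
        if isinstance(v, int):
            return f"{v}i"               # integer fields carry an 'i' suffix
        if isinstance(v, float):
            return repr(v)
        return f'"{v}"'                  # string fields are double-quoted

    field_part = ",".join(f"{k}={fmt(v)}" for k, v in sorted(fields.items()))
    return f"{measurement},{tag_part} {field_part} {timestamp_ns}"

line = to_line_protocol("cpu", {"host": "server01", "region": "us-west"},
                        {"usage_idle": 87.2, "core": 0}, 1465839830100400200)
# cpu,host=server01,region=us-west core=0i,usage_idle=87.2 1465839830100400200
```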
This document provides an overview of patterns for scalability, availability, and stability in distributed systems. It discusses general recommendations like immutability and referential transparency. It covers scalability trade-offs around performance vs scalability, latency vs throughput, and availability vs consistency. It then describes various patterns for scalability including managing state through partitioning, caching, sharding databases, and using distributed caching. It also covers patterns for managing behavior through event-driven architecture, compute grids, load balancing, and parallel computing. Availability patterns like fail-over, replication, and fault tolerance are discussed. The document provides examples of popular technologies that implement many of these patterns.
Aljoscha Krettek is the PMC chair of Apache Flink and Apache Beam, and co-founder of data Artisans. Apache Flink is an open-source platform for distributed stream and batch data processing. It allows for stateful computations over data streams in real-time and historically. Flink supports batch and stream processing using APIs like DataSet and DataStream. Data Artisans originated Flink and provides an application platform powered by Flink and Kubernetes for building stateful stream processing applications.
Building a Data Pipeline using Apache Airflow (on AWS / GCP), by Yohei Onishi
This is the slide deck I presented at PyCon SG 2019. I gave an overview of Airflow and discussed how Airflow and other data engineering services on AWS and GCP can be used to build data pipelines.
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...), by Amazon Web Services
"This is a technical architect's case study of how Loggly has employed the latest social-media-scale technologies as the backbone ingestion processing for our multi-tenant, geo-distributed, and real-time log management system. This presentation describes design details of how we built a second-generation system fully leveraging AWS services including Amazon Route 53 DNS with heartbeat and latency-based routing, multi-region VPCs, Elastic Load Balancing, Amazon Relational Database Service, and a number of proactive and reactive approaches to scaling computational and indexing capacity.
The talk includes lessons learned in our first generation release, validated by thousands of customers; speed bumps and the mistakes we made along the way; various data models and architectures previously considered; and success at scale: speeds, feeds, and an unmeltable log processing engine."
This document discusses Apache Airflow and Google Cloud Composer. It begins by providing background on Apache Airflow, including that it is an open source workflow engine contributed by Airbnb. It then discusses how Codementor uses Airflow for ETL pipelines and machine learning workflows. The document mainly focuses on comparing self-hosting Airflow versus using Google Cloud Composer. Cloud Composer reduces efforts around hosting, permissions management, and monitoring. However, it has some limitations like occasional zombie tasks and higher costs. Overall, Cloud Composer allows teams to focus more on data logic and performance versus infrastructure maintenance.
Snowflake: The Good, the Bad, and the Ugly, by Tyler Wishnoff
Learn how to solve the top 3 challenges Snowflake customers face, and what you can do to ensure high-performance, intelligent analytics at any scale. Ideal for those currently using Snowflake and those considering it. Learn more at: https://kyligence.io/
A Thorough Comparison of Delta Lake, Iceberg and Hudi, by Databricks
Recently, a set of modern table formats such as Delta Lake, Hudi, and Iceberg has sprung up. Along with the Hive Metastore, these table formats aim to solve long-standing problems in traditional data lakes with features like ACID transactions, schema evolution, upserts, time travel, and incremental consumption.
Presentation for the Pervasive Systems class, lectured by Prof. Ioannis Chatzigiannakis, a.y. 2015-16, about the NoSQL database InfluxDB. The course is intended for students of the MS in Engineering in Computer Science at Sapienza - University of Rome.
The complete code for the demo is available on GitHub:
https://github.com/RobGaud/PervasiveSystemsPersonal
You can also find me on LinkedIn:
https://www.linkedin.com/in/roberto-gaudenzi-4b0422116
SQL Extensions to Support Streaming Data With Fabian Hueske | Current 2022, by HostedbyConfluent
For 40 years SQL has been the dominant language for data access and manipulation. Now that an increasing proportion of data is being processed in a streaming way, tool vendors (commercial and open source) have begun using SQL-like syntax in their event stream processing tools.
Over the last couple of years, several of these vendors - including AWS, Confluent, Google, IBM, Microsoft, Oracle, Snowflake and SQLstream - have got together with the Data Management group at INCITS (who maintain the SQL standard) to work on streaming extensions.
INCITS -- the InterNational Committee for Information Technology Standards -- is the central U.S. forum dedicated to creating technology standards for the next generation of innovation. INCITS is accredited by the American National Standards Institute (ANSI).
This talk will look at:
o Why is this happening?
o Who is involved?
o How does the process work?
o What progress has been made?
o When can we expect to see a standard?
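To make the streaming-SQL idea concrete: most of these proposals add window functions (such as a tumbling window) to ordinary GROUP BY aggregation. The following Python sketch mimics the semantics of tumbling-window grouping outside SQL; it illustrates the concept, not the syntax under standardization.

```python
from collections import defaultdict

def tumble(events, window_ms):
    """Group (timestamp_ms, key, value) events into fixed, non-overlapping
    windows and sum values per (window_start, key) -- the semantics a
    streaming GROUP BY over a tumbling window expresses in SQL."""
    out = defaultdict(int)
    for ts, key, value in events:
        window_start = (ts // window_ms) * window_ms  # align to window boundary
        out[(window_start, key)] += value
    return dict(out)

events = [(1000, "clicks", 1), (1500, "clicks", 1), (2200, "clicks", 1)]
tumble(events, window_ms=1000)
# {(1000, 'clicks'): 2, (2000, 'clicks'): 1}
```

The hard part the standardization effort wrestles with is not this arithmetic but when a window's result may be emitted over an unbounded stream, i.e. watermarks and late data.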
This document provides an overview of a talk on Apache Spark. It introduces the speaker and their background. It acknowledges inspiration from a previous Spark training. It then outlines the structure of the talk, which will include: a brief history of big data; a tour of Spark including its advantages over MapReduce; and explanations of Spark concepts like RDDs, transformations, and actions. The document serves to introduce the topics that will be covered in the talk.
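The transformation/action distinction the talk covers can be illustrated without Spark itself: transformations only record lineage, and an action executes it. A toy sketch follows (the MiniRDD class is invented for illustration; real RDDs are partitioned and distributed):

```python
class MiniRDD:
    """Toy illustration of Spark's RDD evaluation model: transformations
    (map, filter) only record lineage; an action (collect, count) triggers
    the actual computation."""
    def __init__(self, data, lineage=()):
        self._data = data
        self._lineage = lineage          # recorded, not yet executed

    def map(self, f):                    # transformation: lazy
        return MiniRDD(self._data, self._lineage + (("map", f),))

    def filter(self, pred):              # transformation: lazy
        return MiniRDD(self._data, self._lineage + (("filter", pred),))

    def collect(self):                   # action: runs the whole lineage
        items = self._data
        for op, f in self._lineage:
            items = map(f, items) if op == "map" else filter(f, items)
        return list(items)

    def count(self):                     # action
        return len(self.collect())

rdd = MiniRDD(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
rdd.collect()   # nothing ran until this call
# [0, 4, 16, 36, 64]
```

Because only lineage is stored, a lost partition can be recomputed from its source, which is how Spark gets fault tolerance without replicating intermediate data.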
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ..., by Simplilearn
This presentation about Apache Spark covers all the basics a beginner needs to know to get started with Spark. It covers the history of Apache Spark, what Spark is, and the difference between Hadoop and Spark. You will learn the different components in Spark and how Spark works, with the help of its architecture. You will understand the different cluster managers on which Spark can run. Finally, you will see the various applications of Spark and a use case on Conviva. Now, let's get started with what Apache Spark is.
Below topics are explained in this Spark presentation:
1. History of Spark
2. What is Spark
3. Hadoop vs Spark
4. Components of Apache Spark
5. Spark architecture
6. Applications of Spark
7. Spark use case
What is this Big Data Hadoop training course about?
The Big Data Hadoop and Spark developer course has been designed to impart an in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.
What are the course objectives?
Simplilearn’s Apache Spark and Scala certification training is designed to:
1. Advance your expertise in the Big Data Hadoop Ecosystem
2. Help you master essential Apache Spark skills, such as Spark Streaming, Spark SQL, machine learning programming, GraphX programming, and Spark shell scripting
3. Help you land a Hadoop developer job requiring Apache Spark expertise by giving you a real-life industry project coupled with 30 demos
What skills will you learn?
By completing this Apache Spark and Scala course you will be able to:
1. Understand the limitations of MapReduce and the role of Spark in overcoming these limitations
2. Understand the fundamentals of the Scala programming language and its features
3. Explain and master the process of installing Spark as a standalone cluster
4. Develop expertise in using Resilient Distributed Datasets (RDD) for creating applications in Spark
5. Master Structured Query Language (SQL) using SparkSQL
6. Gain a thorough understanding of Spark streaming features
7. Master and describe the features of Spark ML programming and GraphX programming
Who should take this Scala course?
1. Professionals aspiring for a career in the field of real-time big data analytics
2. Analytics professionals
3. Research professionals
4. IT developers and testers
5. Data scientists
6. BI and reporting professionals
7. Students who wish to gain a thorough understanding of Apache Spark
Learn more at https://www.simplilearn.com/big-data-and-analytics/apache-spark-scala-certification-training
A microservices architecture involves many services distributed over the network, which introduces many more ways to fail. This session covers the available tools that can help you when designing and building such distributed systems in Go.
In this training webinar, we will walk you through the basics of InfluxDB – the purpose-built time series database. InfluxDB has everything you need from a time series platform in a single binary – a multi-tenanted time series database, UI and dashboarding tools, background processing and monitoring agent. This one-hour session will include the training and time for live Q&A.
What you will learn
Core concepts of time series databases
An overview of the InfluxDB platform
How to ingest and query data in InfluxDB
Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy..., by Databricks
Building data products requires a Lambda architecture to bridge batch and streaming processing. AirStream is a framework built on top of Apache Spark that allows users to easily build data products at Airbnb. It proved that Spark is impactful and useful in production for mission-critical data products.
On the streaming side, hear how AirStream integrates multiple ecosystems with Spark Streaming, such as HBase, Elasticsearch, MySQL, DynamoDB, Memcache and Redis. On the batch side, learn how to apply the same computation logic in Spark over large data sets from Hive and S3. The speakers will also go through a few production use cases, and share several best practices on how to manage Spark jobs in production.
Apache Cassandra is a free, distributed, open source, and highly scalable NoSQL database that is designed to handle large amounts of data across many commodity servers. It provides high availability with no single point of failure, linear scalability, and tunable consistency. Cassandra's architecture allows it to spread data across a cluster of servers and replicate across multiple data centers for fault tolerance. It is used by many large companies for applications that require high performance, scalability, and availability.
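Tunable consistency is largely arithmetic: with replication factor N, read consistency level R, and write consistency level W, a read is guaranteed to overlap the replicas of the latest write whenever R + W > N. A minimal sketch of that rule:

```python
def is_strongly_consistent(n, r, w):
    """With replication factor n, reading r replicas is guaranteed to
    intersect the w replicas of the latest write whenever r + w > n,
    so the read cannot miss the newest value."""
    return r + w > n

# Typical Cassandra-style settings with replication factor 3:
assert is_strongly_consistent(3, 2, 2)      # QUORUM reads + QUORUM writes
assert not is_strongly_consistent(3, 1, 1)  # ONE + ONE may read stale data
```

Lowering R or W trades this guarantee away for lower latency and higher availability, which is exactly the tuning knob the summary above refers to.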
Tutorial - Modern Real Time Streaming Architectures, by Karthik Ramasamy
Across diverse segments in industry, there has been a shift in focus from big data to fast data, stemming, in part, from the deluge of high-velocity data streams as well as the need for instant data-driven insights, and there has been a proliferation of messaging and streaming frameworks that enterprises utilize to satisfy the needs of various applications.
Drawing on their experience operating streaming systems at Twitter scale, Karthik Ramasamy, Sanjeev Kulkarni, Arun Kejariwal, and Sijie Guo walk you through state-of-the-art streaming architectures, streaming frameworks, and streaming algorithms, covering the typical challenges in modern real-time big data platforms and offering insights on how to address them. They also discuss how advances in technology might impact the streaming architectures and applications of the future. Along the way, they explore the interplay between storage and stream processing and speculate about future developments.
Topics include:
Basic requirements of stream processing
Streaming and one-pass algorithms
Different types of streaming architectures
An in-depth review of streaming frameworks
Deploying and operating stream processing applications
Lessons learned from building a real-time stack using Apache Pulsar and Apache Heron at Twitter Scale
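As one example of the one-pass algorithms listed above, Welford's method computes the mean and variance of a stream in constant memory, without ever storing the elements:

```python
class RunningStats:
    """Welford's one-pass algorithm: mean and variance over a stream in
    O(1) memory, updated once per arriving element."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0   # sum of squared deviations from the current mean

    def push(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)  # uses the updated mean

    def variance(self):
        return self.m2 / self.n if self.n else 0.0

stats = RunningStats()
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.push(x)
# stats.mean is 5.0 and stats.variance() is 4.0 for this stream
```

Unlike the naive sum-of-squares formula, this update is numerically stable, which matters when a stream runs for days and the counters grow large.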
Building Better Data Pipelines using Apache Airflow, by Sid Anand
Apache Airflow is a platform for authoring, scheduling, and monitoring workflows or directed acyclic graphs (DAGs). It allows users to programmatically author DAGs in Python without needing to bundle many XML files. The UI provides a tree view to see DAG runs over time and Gantt charts to see performance trends. Airflow is useful for ETL pipelines, machine learning workflows, and general job scheduling. It handles task dependencies and failures, monitors performance, and enforces service level agreements. Behind the scenes, the scheduler distributes tasks from the metadata database to Celery workers via RabbitMQ.
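The scheduling behavior described above boils down to resolving task dependencies in a DAG before dispatching work. A small pure-Python sketch of that resolution (Kahn-style topological ordering; the function and task names are illustrative, not Airflow's API):

```python
def run_order(dag):
    """Return one valid execution order for a DAG given as
    {task: [upstream_tasks]} -- the dependency resolution an Airflow-style
    scheduler performs before handing tasks to workers."""
    pending = {t: set(ups) for t, ups in dag.items()}
    order = []
    while pending:
        # tasks whose upstreams have all completed are ready to run
        ready = sorted(t for t, ups in pending.items() if not ups)
        if not ready:
            raise ValueError("cycle detected: not a DAG")
        for t in ready:
            order.append(t)
            del pending[t]
        for ups in pending.values():
            ups.difference_update(ready)
    return order

# extract >> transform >> load, plus an audit task downstream of extract
etl = {"extract": [], "transform": ["extract"],
       "load": ["transform"], "audit": ["extract"]}
run_order(etl)
# ['extract', 'audit', 'transform', 'load']
```

In Airflow itself the same shape is declared with operators and `>>` dependencies in a Python DAG file; the scheduler then repeatedly computes which tasks are runnable and pushes them to workers, retrying failures per task rather than per pipeline.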
This document summarizes Netflix's use of Kafka in their data pipeline. It discusses how Netflix evolved from using S3 and EMR to introducing Kafka and Kafka producers and consumers to handle 400 billion events per day. It covers challenges of scaling Kafka clusters and tuning Kafka clients and brokers. Finally, it outlines Netflix's roadmap which includes contributing to open source projects like Kafka and testing failure resilience.
Grafana is an open source analytics and monitoring tool that uses InfluxDB to store time series data and provide visualization dashboards. It collects metrics like application and server performance from Telegraf every 10 seconds, stores the data in InfluxDB using the line protocol format, and allows users to build dashboards in Grafana to monitor and get alerts on metrics. An example scenario is using it to collect and display load time metrics from a QA whitelist VM.
Applying DevOps to Databricks can be a daunting task. In this talk it will be broken down into bite-size chunks. Common DevOps subject areas will be covered, including CI/CD (Continuous Integration/Continuous Deployment), IaC (Infrastructure as Code), and build agents.
We will explore how to apply DevOps to Databricks (in Azure), primarily using Azure DevOps tooling. As a lot of Spark/Databricks users are Python users, we will focus on the Databricks REST API (using Python) to perform our tasks.
Apache Spark is an in-memory data processing solution that can work with existing data sources like HDFS and can make use of your existing computation infrastructure, such as YARN or Mesos. This talk covers a basic introduction to Apache Spark and its various components like MLlib, Shark, and GraphX, with a few examples.
This session is recommended for anyone interested in understanding how to use AWS big data services to develop real-time analytics applications. In this session, you will get an overview of a number of Amazon's big data and analytics services that enable you to build highly scalable cloud applications that immediately and continuously analyze large sets of distributed data. We'll explain how services like Amazon Kinesis, EMR and Redshift can be used for data ingestion, processing and storage to enable real-time insights and analysis into customer, operational and machine-generated data and log files. We'll explore system requirements, design considerations, and walk through a specific customer use case to illustrate the power of real-time insights on their business.
Building a Data Pipeline using Apache Airflow (on AWS / GCP)Yohei Onishi
This is the slide I presented at PyCon SG 2019. I talked about overview of Airflow and how we can use Airflow and the other data engineering services on AWS and GCP to build data pipelines.
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...Amazon Web Services
"This is a technical architect's case study of how Loggly has employed the latest social-media-scale technologies as the backbone ingestion processing for our multi-tenant, geo-distributed, and real-time log management system. This presentation describes design details of how we built a second-generation system fully leveraging AWS services including Amazon Route 53 DNS with heartbeat and latency-based routing, multi-region VPCs, Elastic Load Balancing, Amazon Relational Database Service, and a number of pro-active and re-active approaches to scaling computational and indexing capacity.
The talk includes lessons learned in our first generation release, validated by thousands of customers; speed bumps and the mistakes we made along the way; various data models and architectures previously considered; and success at scale: speeds, feeds, and an unmeltable log processing engine."
This document discusses Apache Airflow and Google Cloud Composer. It begins by providing background on Apache Airflow, including that it is an open source workflow engine contributed by Airbnb. It then discusses how Codementor uses Airflow for ETL pipelines and machine learning workflows. The document mainly focuses on comparing self-hosting Airflow versus using Google Cloud Composer. Cloud Composer reduces efforts around hosting, permissions management, and monitoring. However, it has some limitations like occasional zombie tasks and higher costs. Overall, Cloud Composer allows teams to focus more on data logic and performance versus infrastructure maintenance.
Snowflake: The Good, the Bad, and the UglyTyler Wishnoff
Learn how to solve the top 3 challenges Snowflake customers face, and what you can do to ensure high-performance, intelligent analytics at any scale. Ideal for those currently using Snowflake and those considering it. Learn more at: https://kyligence.io/
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
Recently, a set of modern table formats such as Delta Lake, Hudi, Iceberg spring out. Along with Hive Metastore these table formats are trying to solve problems that stand in traditional data lake for a long time with their declared features like ACID, schema evolution, upsert, time travel, incremental consumption etc.
Presentation for Pervasive Systems class lectured by prof. Ioannis Chatzigiannakis, a.y. 2015-16, about the No-SQL database InfluxDB. The course is intended for students of MS in Engineering in Computer Science at Sapienza - University of Rome.
The complete code for the demo is available on Github:
https://github.com/RobGaud/PervasiveSystemsPersonal
You can also find me on LinkedIn:
https://www.linkedin.com/in/roberto-gaudenzi-4b0422116
SQL Extensions to Support Streaming Data With Fabian Hueske | Current 2022HostedbyConfluent
SQL Extensions to Support Streaming Data With Fabian Hueske | Current 2022
For 40 years SQL has been the dominant language for data access and manipulation. Now that an increasing proportion of data is being processed in a streaming way, tool vendors (commercial and open source) have begun using SQL-like syntax in their event stream processing tools.
Over the last couple of years, several of these vendors - including AWS, Confluent, Google, IBM, Microsoft, Oracle, Snowflake and SQLstream - have got together with the Data Management group at INCITS (who maintain the SQL standard) to work on streaming extensions.
INCITS -- the InterNational Committee for Information Technology Standards -- is the central U.S. forum dedicated to creating technology standards for the next generation of innovation. INCITS is accredited by the American National Standards Institute (ANSI).
This talk will look at:
o Why is this happening?
o Who is involved?
o How does the process work?
o What progress has been made?
o When can we expect to see a standard?
This document provides an overview of a talk on Apache Spark. It introduces the speaker and their background. It acknowledges inspiration from a previous Spark training. It then outlines the structure of the talk, which will include: a brief history of big data; a tour of Spark including its advantages over MapReduce; and explanations of Spark concepts like RDDs, transformations, and actions. The document serves to introduce the topics that will be covered in the talk.
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...Simplilearn
This presentation about Apache Spark covers all the basics that a beginner needs to know to get started with Spark. It covers the history of Apache Spark, what is Spark, the difference between Hadoop and Spark. You will learn the different components in Spark, and how Spark works with the help of architecture. You will understand the different cluster managers on which Spark can run. Finally, you will see the various applications of Spark and a use case on Conviva. Now, let's get started with what is Apache Spark.
Below topics are explained in this Spark presentation:
1. History of Spark
2. What is Spark
3. Hadoop vs Spark
4. Components of Apache Spark
5. Spark architecture
6. Applications of Spark
7. Spark usecase
What is this Big Data Hadoop training course about?
The Big Data Hadoop and Spark developer course have been designed to impart an in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.
What are the course objectives?
Simplilearn’s Apache Spark and Scala certification training are designed to:
1. Advance your expertise in the Big Data Hadoop Ecosystem
2. Help you master essential Apache and Spark skills, such as Spark Streaming, Spark SQL, machine learning programming, GraphX programming and Shell Scripting Spark
3. Help you land a Hadoop developer job requiring Apache Spark expertise by giving you a real-life industry project coupled with 30 demos
What skills will you learn?
By completing this Apache Spark and Scala course you will be able to:
1. Understand the limitations of MapReduce and the role of Spark in overcoming these limitations
2. Understand the fundamentals of the Scala programming language and its features
3. Explain and master the process of installing Spark as a standalone cluster
4. Develop expertise in using Resilient Distributed Datasets (RDD) for creating applications in Spark
5. Master Structured Query Language (SQL) using SparkSQL
6. Gain a thorough understanding of Spark streaming features
7. Master and describe the features of Spark ML programming and GraphX programming
Who should take this Scala course?
1. Professionals aspiring for a career in the field of real-time big data analytics
2. Analytics professionals
3. Research professionals
4. IT developers and testers
5. Data scientists
6. BI and reporting professionals
7. Students who wish to gain a thorough understanding of Apache Spark
Learn more at https://www.simplilearn.com/big-data-and-analytics/apache-spark-scala-certification-training
Microservices architecture involves many services that are being distributed over the network resulting in many more ways of failure. This session will try to cover the available tools that can help you when designing/building such distributed system in Go
In this training webinar, we will walk you through the basics of InfluxDB – the purpose-built time series database. InfluxDB has everything you need from a time series platform in a single binary – a multi-tenanted time series database, UI and dashboarding tools, background processing and monitoring agent. This one-hour session will include the training and time for live Q&A.
What you will learn
Core concepts of time series databases
An overview of the InfluxDB platform
How to ingesting and query data in InfluxDB
Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...Databricks
Building data product requires having Lambda Architecture to bridge the batch and streaming processing. AirStream is a framework built on top of Apache Spark to allow users to easily build data products at Airbnb. It proved Spark is impactful and useful in the production for mission-critical data products.
On the streaming side, hear how AirStream integrates multiple ecosystems with Spark Streaming, such as HBase, Elasticsearch, MySQL, DynamoDB, Memcache and Redis. On the batch side, learn how to apply the same computation logic in Spark over large data sets from Hive and S3. The speakers will also go through a few production use cases, and share several best practices on how to manage Spark jobs in production.
Apache Cassandra is a free, distributed, open source, and highly scalable NoSQL database that is designed to handle large amounts of data across many commodity servers. It provides high availability with no single point of failure, linear scalability, and tunable consistency. Cassandra's architecture allows it to spread data across a cluster of servers and replicate across multiple data centers for fault tolerance. It is used by many large companies for applications that require high performance, scalability, and availability.
Tutorial - Modern Real Time Streaming ArchitecturesKarthik Ramasamy
Across diverse segments in industry, there has been a shift in focus from big data to fast data, stemming, in part, from the deluge of high-velocity data streams as well as the need for instant data-driven insights, and there has been a proliferation of messaging and streaming frameworks that enterprises utilize to satisfy the needs of various applications.
Drawing on their experience operating streaming systems at Twitter scale, Karthik Ramasamy, Sanjeev Kulkarni, Arun Kejariwal, and Sijie Guo walk you through state-of-the-art streaming architectures, streaming frameworks, and streaming algorithms, covering the typical challenges in modern real-time big data platforms and offering insights on how to address them. They also discuss how advances in technology might impact the streaming architectures and applications of the future. Along the way, they explore the interplay between storage and stream processing and speculate about future developments.
Topics include:
Basic requirements of stream processing
Streaming and one-pass algorithms
Different types of streaming architectures
An in-depth review of streaming frameworks
Deploying and operating stream processing applications
Lessons learned from building a real-time stack using Apache Pulsar and Apache Heron at Twitter Scale
Building Better Data Pipelines using Apache AirflowSid Anand
Apache Airflow is a platform for authoring, scheduling, and monitoring workflows or directed acyclic graphs (DAGs). It allows users to programmatically author DAGs in Python without needing to bundle many XML files. The UI provides a tree view to see DAG runs over time and Gantt charts to see performance trends. Airflow is useful for ETL pipelines, machine learning workflows, and general job scheduling. It handles task dependencies and failures, monitors performance, and enforces service level agreements. Behind the scenes, the scheduler distributes tasks from the metadata database to Celery workers via RabbitMQ.
This document summarizes Netflix's use of Kafka in their data pipeline. It discusses how Netflix evolved from using S3 and EMR to introducing Kafka and Kafka producers and consumers to handle 400 billion events per day. It covers challenges of scaling Kafka clusters and tuning Kafka clients and brokers. Finally, it outlines Netflix's roadmap which includes contributing to open source projects like Kafka and testing failure resilience.
Grafana is an open source analytics and monitoring tool that uses InfluxDB to store time series data and provide visualization dashboards. It collects metrics like application and server performance from Telegraf every 10 seconds, stores the data in InfluxDB using the line protocol format, and allows users to build dashboards in Grafana to monitor and get alerts on metrics. An example scenario is using it to collect and display load time metrics from a QA whitelist VM.
Applying DevOps to Databricks can be a daunting task. In this talk this will be broken down into bite size chunks. Common DevOps subject areas will be covered, including CI/CD (Continuous Integration/Continuous Deployment), IAC (Infrastructure as Code) and Build Agents.
We will explore how to apply DevOps to Databricks (in Azure), primarily using Azure DevOps tooling. As a lot of Spark/Databricks users are Python users, will will focus on the Databricks Rest API (using Python) to perform our tasks.
Apache Spark is an in-memory data processing solution that can work with existing data sources like HDFS and can make use of your existing computation infrastructure like YARN or Mesos. This talk will cover a basic introduction to Apache Spark and its various components, like MLlib, Shark, and GraphX, with a few examples.
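The RDD programming model Spark is built on can be illustrated in plain Python (a sketch of the flatMap/reduceByKey shape, not PySpark itself; in Spark the same operations run partitioned across a cluster):

```python
# Pure-Python sketch of the RDD word-count pattern
# (flatMap -> map to (word, 1) -> reduceByKey). In PySpark the same
# shape runs distributed; here Counter plays the role of reduceByKey.
from collections import Counter

lines = ["to be or not to be", "to stream or to batch"]

# flatMap: split each line into words
words = [w for line in lines for w in line.split()]
# reduceByKey: sum the per-word counts
counts = Counter(words)
print(counts["to"])  # 4
```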
This session is recommended for anyone interested in understanding how to use AWS big data services to develop real-time analytics applications. In this session, you will get an overview of a number of Amazon's big data and analytics services that enable you to build highly scalable cloud applications that immediately and continuously analyze large sets of distributed data. We'll explain how services like Amazon Kinesis, EMR and Redshift can be used for data ingestion, processing and storage to enable real-time insights and analysis into customer, operational and machine generated data and log files. We'll explore system requirements, design considerations, and walk through a specific customer use case to illustrate the power of real-time insights on their business.
This document discusses real-time big data applications and provides a reference architecture for search, discovery, and analytics. It describes combining analytical and operational workloads using a unified data model and operational database. Examples are given of organizations using this approach for real-time search, analytics and continuous adaptation of large and diverse datasets.
Many safety leading indicators fall short of helping you account for areas of risk within your organization. The best leading indicator of the health of your safety program, and predictor of safety risk, is safety culture. Since measuring safety culture is an intricate matter, we will discuss during this presentation an approach to measure safety culture through the Safety Culture Index (SCI). This presentation will elaborate on how we identify the aspects that measure the SCI in your organization, as well as how to help the safety professional understand, interpret, and influence SCI trends.
Virtual Gov Day - Introduction & Keynote - Alan Webber, IDC Government InsightsSplunk
The document outlines an agenda for a Virtual Gov Day event hosted by Splunk. The agenda includes a welcome and keynote presentation, customer use case presentations on security and business analytics, and concurrent breakout sessions on Splunk for security, IT operations, and application delivery. It also includes a presentation by an IDC analyst on challenges governments face with big data and how operational intelligence can help address issues around data management, timely decision-making, and use cases in security, IT operations, and industrial/IoT applications.
Streaming and Visual Data Discovery for the Internet of ThingsDatawatchCorporation
Sensor devices and their associated data streams are rapidly becoming a big source of differentiation for organizations that can effectively harness this information to drive new insights and take action. The breakthrough is enabled by new solutions for applying visual data discovery to streaming data in motion. This session will focus on industrial analytics and how best to apply new technologies that drive synergies between IT and OT.
Democratization - New Wave of Data Science (홍운표 상무, DataRobot) :: AWS Techfor...Amazon Web Services Korea
This document discusses the democratization of data science and machine learning using automated machine learning tools. It provides examples of how DataRobot has helped customers in various industries build predictive models faster and with less coding than traditional approaches. Specifically, it summarizes how DataRobot has helped customers in banking, insurance, retail, and other industries with use cases like predictive maintenance, sales forecasting, fraud detection, customer churn prediction, and insurance underwriting.
IoT: How Data Science Driven Software is Eating the Connected WorldDataWorks Summit
The document discusses how data science can be used to improve operations in the oil and gas industry through the Internet of Things. Large amounts of sensor data are generated during drilling operations that can be used to build predictive models to optimize drilling and predict equipment failures. Examples of opportunities include using models to predict drill rate of penetration to lower costs and failure prediction to allow for early warning and reduce downtime. The challenges of working with large sensor datasets and building accurate models at scale are also covered.
Smarter Analytics: Supporting the Enterprise with AutomationInside Analysis
The Briefing Room with Barry Devlin and WhereScape
Live Webcast on June 10, 2014
Watch the archive:
https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=5230c31ab287778c73b56002bc2c51a
The data warehouse is intended to support analysis by making the right data available to the right people in a timely fashion. But conditions change all the time, and when data doesn’t keep up with the business, analysts quickly turn to workarounds. This leads to ungoverned and largely un-managed side projects, which trade short-term wins for long-term trouble. One way to keep everyone happy is by creating an integrated environment that pulls data from all sources, and is capable of automating both the model development and delivery of analyst-ready data.
Register for this episode of The Briefing Room to hear data warehousing pioneer and Analyst Barry Devlin as he explains the critical components of a successful data warehouse environment, and how traditional approaches must be augmented to keep up with the times. He’ll be briefed by WhereScape CEO Michael Whitehead, who will showcase his company’s data warehousing automation solutions. He’ll discuss how a fast, well-managed and automated infrastructure is the key to empowering faster, smarter, repeatable decision making.
Visit InsideAnalysis.com for more information.
“Real Time Machine Learning Architecture and Sentiment Analysis Applied to Fi...Quantopian
1. Infotrie provides real-time machine learning architecture and sentiment analysis services for financial news and signals.
2. Their platform FinSentS processes millions of news sources in multiple languages through APIs or as SaaS/on-premises software. It provides real-time alerts, signals, and historical sentiment data.
3. Infotrie also offers consultancy and training in areas like trading technology, algorithmic trading, big data, natural language processing, and machine learning.
Internet Of Things: How Data Science Driven Software is Eating the Connected ...Sarah Aerni
Hadoop Summit Talk on IoT and Data Science:
The Internet of Things (IoT) will forever change the way businesses interact with consumers and other businesses. Pivotal will present a series of use cases illustrating how such devices and the data from these devices drive real impact across industries. From smart sensors to connected hospitals, each example will highlight the fundamental concepts to success.
* Starting with the basics: How data science drives action and outcomes
* Avoiding the obstacles: How to avoid the pitfalls that prevent models from driving real action
* Building your toolbox: What tools are available
Internet Of Things: How Data Science Driven Software is Eating the Connected ...VMware Tanzu
Hadoop Summit Talk on IoT and Data Science:
The Internet of Things (IoT) will forever change the way businesses interact with consumers and other businesses. Pivotal will present a series of use cases illustrating how such devices and the data from these devices drive real impact across industries. From smart sensors to connected hospitals, each example will highlight the fundamental concepts to success.
* Starting with the basics: How data science drives action and outcomes
* Avoiding the obstacles: How to avoid the pitfalls that prevent models from driving real action
* Building your toolbox: What tools are available
Creating an IT Revolution within your Organization - QuickBase, Inc. at CIO V...QuickBase, Inc.
CIO Visions presentation by QuickBase, Inc. highlighting the major market trends of Digital Transformation, Citizen Development, Unification of Business & IT, and the Broadening App Ecosystem. Ankit Shah, Senior Manager, Product Marketing
Technology trends are rapidly changing, and the public sector is set to experience a more disruptive future. Along with many other high priority items, such as repairing or replacing infrastructure, new housing developments to meet the needs of the growing population, and first responder initiatives, technology trends should be on your list of priorities in order to save money, secure data, and improve productivity. Here is a quick summary of 5 key public sector technology trends in 2018.
Microservices And Fast Data: Industry And Architecture Trends [with 451 Resea...Lightbend
This document discusses trends in microservices and fast data architectures. It summarizes Lightbend's Fast Data Platform, which provides streaming engines, pluggable machine learning libraries, a reactive platform, operational tooling, and infrastructure support to enable fast data applications. The platform aims to support real-time use cases like predictive analytics, personalization, financial processes, and industrial IoT by moving from batch to streaming data processing.
1) In-memory computing is growing rapidly, with the total data market expected to grow from $69 billion in 2015 to $132 billion in 2020.
2) In-memory databases are gaining popularity for applications that require fast response times, like telecommunications and mobile advertising, as memory access is faster than disk access.
3) Modern applications are driving adoption of in-memory solutions as they generate more data from more users and transactions and require faster performance to handle growing traffic.
4) Two examples presented were DellEMC using MemSQL for a real-time customer 360 application and an IoT logistics application called MemEx that processes sensor data from warehouses for predictive analytics.
This document provides an agenda for a presentation on harnessing the power of big data with Oracle. The agenda includes introductions to big data and market trends, defining big data, an overview of the Oracle Big Data Appliance, Oracle's integrated software solution, and a demonstration. The presentation aims to show how Oracle can help organizations access and analyze large, diverse datasets to drive innovation.
Benchmarking Digital Readiness: Moving at the Speed of the MarketApigee | Google Cloud
This document discusses how companies can benchmark their digital readiness and move faster in the digital market. It finds that digital leaders who adopt apps, APIs, and data analytics outperform digital laggards. To move up, companies need business and technology leadership. They should think strategically about customer experience, operations, data, and innovation to access new revenue channels beyond direct monetization. Technologically, companies should take a "cloud first" and "outside in" approach to deliver fast, differentiated customer experiences through systems of engagement built on APIs and backends.
This document summarizes a presentation by PwC on data and analytics in the digital age. PwC consists of data professionals who help clients leverage their data and manage risks. Recent projects include analyzing payroll, designing websites, and building systems to visualize customer orders. The presentation covers how digital transformation allows companies to use analytics to stay competitive. It also demonstrates a data visualization tool to support digital transformations.
Agile Data Science is a lean methodology that is adopted from Agile Software Development. At the core it centers around people, interactions, and building minimally viable products to ship fast and often to solicit customer feedback. In this presentation, I describe how this work was done in the past with examples. Get started today with our help by visiting http://www.alpinenow.com
Similar to Real Time Analytics: Algorithms and Systems (20)
In the wake of IoT becoming ubiquitous, there has been a large interest in the industry to develop novel techniques for anomaly detection at the Edge. Example applications include, but are not limited to, smart cities/grids of sensors, industrial process control in manufacturing, smart home, wearables, connected vehicles, and agriculture (sensing for soil moisture and nutrients). What makes anomaly detection at the Edge different? The following constraints, be it due to the sensors or the applications, necessitate the development of new algorithms for AD:
* Very low power and low compute/memory resources
* High data volume making centralized AD infeasible owing to the communication overhead
* Need for low latency to drive fast action taking
* Guaranteeing privacy
In this talk we shall throw light on the above in detail. Subsequently, we shall walk through the algorithm design process for anomaly detection at the Edge. Specifically, we shall dive into the need to build small models/ensembles owing to limited memory on the sensors. Further, we discuss how to train models in an online fashion, as long-term historical data is not available due to limited storage. Given the need for data compression to contain the communication overhead, can one carry out anomaly detection on compressed data? We shall throw light on building small models, sequential and one-shot learning algorithms, compressing the data with the models, and limiting the communication to only the data corresponding to the anomalies and the model description. We shall illustrate the above with concrete examples from the wild!
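A concrete example of an Edge-friendly detector under the constraints above is Welford's online algorithm: O(1) state per stream (count, mean, second moment), no sample buffer, single pass. A minimal sketch with a 3-sigma rule (the threshold and data are illustrative):

```python
# Low-memory online anomaly detection suited to the Edge: Welford's
# running mean/variance (constant state, no stored history) with a
# simple k-sigma rule.
import math

class OnlineDetector:
    def __init__(self, threshold=3.0):
        self.n, self.mean, self.m2, self.k = 0, 0.0, 0.0, threshold

    def update(self, x):
        """Return True if x is anomalous relative to the history so far."""
        anomalous = False
        if self.n > 1:
            std = math.sqrt(self.m2 / (self.n - 1))
            anomalous = std > 0 and abs(x - self.mean) > self.k * std
        # Welford's update: numerically stable, keeps no sample buffer
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return anomalous

d = OnlineDetector()
flags = [d.update(x) for x in [10, 11, 10, 12, 11, 10, 50]]
print(flags[-1])  # the spike at 50 is flagged
```

Because the state is three floats per stream, thousands of sensor channels fit comfortably in the memory budget of a constrained device.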
Serverless Streaming Architectures and Algorithms for the EnterpriseArun Kejariwal
In recent years, serverless has gained momentum in the realm of cloud computing. Broadly speaking, it comprises function as a service (FaaS) and backend as a service (BaaS). The distinction between the two is that under FaaS, one writes and maintains the code (e.g., the functions) for serverless compute; in contrast, under BaaS, the platform provides the functionality and manages the operational complexity behind it. Serverless provides a great means to boost development velocity. With greatly reduced infrastructure costs, more agile and focused teams, and faster time to market, enterprises are increasingly adopting serverless approaches to gain a key advantage over their competitors.
Example early use cases of serverless include, for example, data transformation in batch and ETL scenarios and data processing using MapReduce patterns. As a natural extension, serverless is being used in the streaming context such as, but not limited to, real-time bidding, fraud detection, intrusion detection. Serverless is, arguably, naturally suited to extracting insights from fast data, that is, high-volume, high-velocity data. Example tasks in this regard include filtering and reducing noise in the data and leveraging machine learning and deep learning models to provide continuous insights about business operations.
We walk the audience through the landscape of streaming systems for each stage of an end-to-end data processing pipeline—messaging, compute, and storage. We overview the inception and growth of the serverless paradigm. Further, we deep dive into Apache Pulsar, which provides native serverless support in the form of Pulsar functions, and paint a bird’s-eye view of the application domains where Pulsar functions can be leveraged.
Baking in intelligence in a serverless flow is paramount from a business perspective. To this end, we detail different serverless patterns—event processing, machine learning, and analytics—for different use cases and highlight the trade-offs. We present perspectives on how advances in hardware technology and the emergence of new applications will impact the evolution of serverless streaming architectures and algorithms. The topics covered include an introduction to streaming, an introduction to serverless, serverless and streaming requirements, Apache Pulsar, application domains, serverless event processing patterns, serverless machine learning patterns, and serverless analytics patterns.
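The shape of a serverless stream function described above can be sketched in plain Python (this is not the Pulsar Functions API; the event fields and the fraud rule are hypothetical):

```python
# Sketch (not the Pulsar API) of the serverless event-processing shape:
# a stateless function takes one event and returns a transformed event
# or None (filtered); the platform handles wiring, scaling, and retries.
def fraud_filter(event):
    """Drop obviously invalid events, enrich the rest."""
    if event["amount"] <= 0:
        return None                                   # filtered out
    return {**event, "flagged": event["amount"] > 10_000}

stream = [{"amount": 50}, {"amount": -1}, {"amount": 25_000}]
out = [e for e in (fraud_filter(ev) for ev in stream) if e is not None]
print([e["flagged"] for e in out])  # [False, True]
```

In Pulsar the equivalent function would be deployed against input/output topics, with the broker invoking it per message rather than the application looping over a list.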
Sequence-to-Sequence Modeling for Time SeriesArun Kejariwal
In this talk we overview Sequence-2-Sequence (S2S) and explore its early use cases. We walk the audience through how to leverage S2S modeling for several use cases, particularly with regard to real-time anomaly detection and forecasting.
Sequence-to-Sequence Modeling for Time SeriesArun Kejariwal
This document provides an overview of time series forecasting using deep learning techniques. It discusses recurrent neural networks (RNNs) and their application to time series forecasting, including different RNN architectures like LSTMs and attention mechanisms. It also summarizes various approaches to training RNNs, such as backpropagation through time, and regularization techniques. Finally, it lists several examples of time series forecasting applications and provides references for further reading on the topic.
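The LSTM cells underlying the architectures above follow a small set of gate equations. A single scalar step in pure Python (weights and inputs are made up for illustration; real models use matrices and learned parameters):

```python
# One LSTM cell step with scalar state, to show the gate mechanics
# behind the RNN forecasters discussed above. Weights are arbitrary.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h, c, w):
    """One time step; w holds (input, hidden, bias) weights per gate."""
    f = sigmoid(w["f"][0] * x + w["f"][1] * h + w["f"][2])   # forget gate
    i = sigmoid(w["i"][0] * x + w["i"][1] * h + w["i"][2])   # input gate
    o = sigmoid(w["o"][0] * x + w["o"][1] * h + w["o"][2])   # output gate
    g = math.tanh(w["g"][0] * x + w["g"][1] * h + w["g"][2]) # candidate
    c = f * c + i * g        # cell state: keep some memory, admit some new
    h = o * math.tanh(c)     # hidden state, also the step's output
    return h, c

w = {k: (0.5, 0.5, 0.0) for k in "fiog"}
h = c = 0.0
for x in [0.1, 0.2, 0.3]:    # feed a short input sequence
    h, c = lstm_step(x, h, c, w)
print(round(h, 3))
```

In a sequence-to-sequence model, an encoder runs this recurrence over the input series and a decoder runs it again to emit the forecast horizon step by step.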
In this talk we walk through an architecture in which models are served in real time and updated, using Apache Pulsar, without restarting the application at hand. We then describe how to apply Pulsar functions to support two example uses—sampling and filtering—and explore a concrete case study of the same.
Designing Modern Streaming Data ApplicationsArun Kejariwal
Many industry segments have been grappling with fast data (high-volume, high-velocity data). The enterprises in these industry segments need to process this fast data just in time to derive insights and act upon it quickly. Such tasks include but are not limited to enriching data with additional information, filtering and reducing noisy data, enhancing machine learning models, providing continuous insights on business operations, and sharing these insights just in time with customers. In order to realize these results, an enterprise needs to build an end-to-end data processing system, from data acquisition, data ingestion, data processing, and model building to serving and sharing the results. This presents a significant challenge, due to the presence of multiple messaging frameworks and several streaming computing frameworks and storage frameworks for real-time data.
In this tutorial we lead a journey through the landscape of state-of-the-art systems for each stage of an end-to-end data processing pipeline, messaging frameworks, streaming computing frameworks, storage frameworks for real-time data, and more. We also share case studies from the IoT, gaming, and healthcare, as well as our experience operating these systems at internet scale at Twitter and Yahoo. We conclude by offering our perspectives on how advances in hardware technology and the emergence of new applications will impact the evolution of messaging systems, streaming systems, storage systems for streaming data, and reinforcement learning-based systems that will power fast processing and analysis of a large (potentially of the order of hundreds of millions) set of data streams.
Topics include:
* An introduction to streaming
* Common data processing patterns
* Different types of end-to-end stream processing architectures
* How to seamlessly move data across different frameworks
* Case studies: Healthcare and the IoT
* Data sketches for mining insights from data streams
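The data sketches mentioned in the topics above trade exactness for fixed memory. A minimal count-min sketch, one of the classic examples, approximating item frequencies over a stream (the width/depth values are illustrative):

```python
# A tiny count-min sketch: approximate frequency counts over a stream
# in fixed memory. Estimates never undercount; hash collisions can
# only inflate them.
import hashlib

class CountMin:
    def __init__(self, width=64, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, item):
        for row in range(self.depth):
            # a different salt per row gives independent-ish hash functions
            h = hashlib.blake2b(item.encode(), salt=bytes([row] * 8)).digest()
            yield row, int.from_bytes(h[:8], "big") % self.width

    def add(self, item):
        for row, col in self._buckets(item):
            self.table[row][col] += 1

    def estimate(self, item):
        # the minimum across rows is the least-inflated count
        return min(self.table[row][col] for row, col in self._buckets(item))

cm = CountMin()
for w in ["a", "b", "a", "c", "a"]:
    cm.add(w)
print(cm.estimate("a"))  # >= 3 (exact unless hashes collide)
```

The memory footprint is width × depth counters regardless of stream length, which is what makes sketches viable for mining insights from unbounded streams.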
There has been a shift from big data to live streaming data to facilitate faster data-driven decision making. As the number of live data streams grow—partly a result of the expanding IoT—it is critical to develop techniques to better extract actionable insights.
One current application, anomaly detection, is a necessary but insufficient step, due to the fact that anomaly detection over a set of live data streams may result in an anomaly fatigue, limiting effective decision making. One way to address the above is to carry out anomaly detection in a multidimensional space. However, this is typically very expensive computationally and hence not suitable for live data streams. Another approach is to carry out anomaly detection on individual data streams and then leverage correlation analysis to minimize false positives, which in turn helps in surfacing actionable insights faster.
In this talk, we explain how marrying correlation analysis with anomaly detection can help and share techniques to guide effective decision making.
Topics include:
* An overview of correlation analysis
* Robust correlation analysis
* Overview of alternative measures, such as co-median
* Trade-offs between speed and accuracy
* Correlation analysis in large dimensions
In this talk we walk the audience through how to marry correlation analysis with anomaly detection, discuss how the topics are intertwined, and detail the challenges one may encounter based on production data. We also showcase how deep learning can be leveraged to learn nonlinear correlation, which in turn can be used to further contain the false positive rate of an anomaly detection system. Further, we provide an overview of how correlation can be leveraged for common representation learning.
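The marriage of per-stream anomaly detection with correlation analysis can be sketched as follows: flag z-score anomalies in one stream, then keep only the flags that a strongly correlated peer stream corroborates (the data and thresholds are made up for illustration; production systems use robust correlation measures, as the talk discusses):

```python
# Sketch: z-score anomalies per stream, gated by correlation with a
# peer stream to reduce false positives.
import math

def zscores(xs):
    n, m = len(xs), sum(xs) / len(xs)
    s = math.sqrt(sum((v - m) ** 2 for v in xs) / n)
    return [(v - m) / s for v in xs]

def pearson(xs, ys):
    return sum(a * b for a, b in zip(zscores(xs), zscores(ys))) / len(xs)

a = [1, 1, 2, 1, 9, 1, 2]          # spike at index 4
b = [1, 2, 1, 1, 8, 2, 1]          # correlated peer stream spikes too

flagged = [i for i, z in enumerate(zscores(a)) if abs(z) > 2]
if pearson(a, b) > 0.8:            # only trust a strongly correlated peer
    peer = {i for i, z in enumerate(zscores(b)) if abs(z) > 2}
    flagged = [i for i in flagged if i in peer]
print(flagged)  # [4]
```

A corroborated anomaly is far more likely to reflect a real upstream event than sensor noise, which is the mechanism by which correlation analysis contains the false positive rate.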
There has been a shift from big data to live streaming data to facilitate faster data-driven decision making. As the number of live data streams grow—partly a result of the expanding IoT—it is critical to develop techniques to better extract actionable insights.
One current application, anomaly detection, is a necessary but insufficient step, due to the fact that anomaly detection over a set of live data streams may result in an anomaly fatigue, limiting effective decision making. One way to address the above is to carry out anomaly detection in a multidimensional space. However, this is typically very expensive computationally and hence not suitable for live data streams. Another approach is to carry out anomaly detection on individual data streams and then leverage correlation analysis to minimize false positives, which in turn helps in surfacing actionable insights faster.
In this talk we explain how marrying correlation analysis with anomaly detection can help and share techniques to guide effective decision making.
Topics include:
* An overview of correlation analysis
* Robust correlation analysis
* Trade-offs between speed and accuracy
* Multi-modal correlation analysis
Detection and filtering of anomalies in live data is of paramount importance for robust decision making. To this end, in this talk we share techniques for anomaly detection in live data.
In this tutorial we walk through state-of-the-art streaming systems, algorithms, and deployment architectures, covering the typical challenges in modern real-time big data platforms and offering insights on how to address them. We also discuss how advances in technology might impact the streaming architectures and applications of the future. Along the way, we explore the interplay between storage and stream processing and discuss future developments.
Anomaly detection in real-time data streams using HeronArun Kejariwal
Twitter has become the de facto medium for consumption of news in real time, and billions of events are generated and analyzed on a daily basis. To analyze these events, Twitter designed its own next-generation streaming system, Heron. Arun Kejariwal and Karthik Ramasamy walk you through how Heron is used to detect anomalies in real-time data streams. Although there’s been over 75 years of prior work in anomaly detection, most of the techniques cannot be used off the shelf because they’re not suitable for high-velocity data streams. Arun and Karthik explain how to make trade-offs between accuracy and speed and discuss incremental approaches that marry sampling with robust measures such as median and MCD for anomaly detection.
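The robust measures mentioned above (median, and MCD in higher dimensions) resist exactly the contamination that breaks mean/standard-deviation detectors: a large spike inflates the mean and standard deviation enough to mask itself. A minimal median + MAD sketch (the data and the k threshold are illustrative):

```python
# Robust anomaly detection: median + MAD instead of mean + std dev,
# so the spike cannot mask its own detection. MCD generalizes this
# robustness idea to multidimensional data.
from statistics import median

def mad_anomalies(xs, k=3.0):
    med = median(xs)
    mad = median(abs(x - med) for x in xs)
    scale = 1.4826 * mad       # consistency factor vs. the std dev
    return [i for i, x in enumerate(xs) if abs(x - med) > k * scale]

data = [12, 11, 13, 12, 90, 12, 11, 13]
print(mad_anomalies(data))  # [4]
```

The speed/accuracy trade-off the talk describes comes in when these robust statistics must be maintained incrementally over high-velocity streams, e.g., via sampling, rather than recomputed over full windows.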
Data Data Everywhere: Not An Insight to Take Action UponArun Kejariwal
The big data era is characterized by ever-increasing velocity and volume of data. Over the last two or three years, several talks at Velocity have explored how to analyze operations data at scale, focusing on anomaly detection, performance analysis, and capacity planning, to name a few topics. Knowledge sharing of the techniques for the aforementioned problems helps the community to build highly available, performant, and resilient systems.
A key aspect of operations data is that data may be missing—referred to as “holes”—in the time series. This may happen for a wide variety of reasons, including (but not limited to):
* Packets being dropped due to unresponsive downstream services
* A network hiccup
* Transient hardware or software failure
* An issue with the data collection service
“Holes” in the time series can potentially skew the analysis of data. This in turn can materially impact decision making. Arun Kejariwal presents approaches for analyzing operations data in the presence of “holes” in the time series. He highlights how missing data impacts common data analyses such as anomaly detection and forecasting, discusses the implications of missing data on time series of different granularities, such as minutely and hourly, and explores a gamut of techniques that can be used to address the missing data issue (e.g., approximating the data using interpolation, regression, ensemble methods, etc.). Arun then walks you through how the techniques can be leveraged using real data.
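The simplest of the gap-filling techniques mentioned, linear interpolation across holes in a regular time series, can be sketched as follows (holes are represented as `None`; the series must contain at least one observed value):

```python
# Fill "holes" (None values) in a regularly spaced time series by
# linear interpolation between the nearest observed neighbors.
# Leading/trailing holes are filled by extending the nearest value.
def fill_holes(series):
    out = list(series)
    i = 0
    while i < len(out):
        if out[i] is None:
            j = i
            while j < len(out) and out[j] is None:
                j += 1                       # find end of the hole
            left = out[i - 1] if i > 0 else out[j]
            right = out[j] if j < len(out) else out[i - 1]
            gap = j - i + 1
            for k in range(i, j):
                out[k] = left + (right - left) * (k - i + 1) / gap
            i = j
        else:
            i += 1
    return out

print(fill_holes([10, None, None, 16, None, 18]))
```

Interpolation is cheap but assumes smoothness; as the abstract notes, regression and ensemble methods are the heavier alternatives when that assumption fails, e.g., across a hole spanning a seasonal peak.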
Finding bad apples early: Minimizing performance impactArun Kejariwal
The big data era is characterized by the ever-increasing velocity and volume of data. In order to store and analyze the ever-growing data, the operational footprint of data stores and Hadoop have also grown over time. (As per a recent report from IDC, the spending on big data infrastructure is expected to reach $41.5 billion by 2018.) The clusters comprise several thousands of nodes. The high performance of such clusters is vital for delivering the best user experience and productivity of teams.
The performance of such clusters is often limited by slow/bad nodes. Finding slow nodes in large clusters is akin to finding a needle in a haystack; hence, manual identification of slow/bad nodes is not practical. To this end, we developed a novel statistical technique to automatically detect slow/bad nodes in clusters comprising hundreds to thousands of nodes. We modeled the problem as a classification problem and employed a simple, yet very effective, distance measure to determine slow/bad nodes. The key highlights of the proposed technique are the following:
* Robustness against anomalies (note that anomalies may occur, for example, due to an ad-hoc heavyweight job on a Hadoop cluster)
* Given the varying data characteristics of different services, no one model fits all. Consequently, we parameterized the threshold used for classification
The proposed technique works well with both hourly and daily data, and has been in use in production by multiple services. This has not only eliminated manual investigation efforts, but has also mitigated the impact of slow nodes, which used to get detected after several weeks/months of lag!
We shall walk the audience through how the techniques are being used with REAL data.
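The classification idea described above, a simple distance measure with a parameterized threshold, can be sketched as scoring each node by its distance from the cluster median in robust (MAD) units (the node names, metric, and k value here are illustrative, not the production technique's exact parameters):

```python
# Sketch of slow-node detection: score each node by its one-sided
# distance from the cluster median in MAD units. The threshold k is a
# parameter, since no single value fits every service.
from statistics import median

def slow_nodes(latencies, k=5.0):
    """latencies: node -> median task latency. Returns flagged nodes."""
    vals = list(latencies.values())
    med = median(vals)
    mad = median(abs(v - med) for v in vals) or 1e-9  # avoid divide-by-zero
    return sorted(n for n, v in latencies.items() if (v - med) / mad > k)

cluster = {"n01": 100, "n02": 104, "n03": 98, "n04": 101, "n05": 390}
print(slow_nodes(cluster))  # ['n05']
```

Using the median and MAD rather than the mean gives the robustness against anomalies called out above: one ad-hoc heavyweight job cannot shift the baseline enough to hide a genuinely slow node.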
This document discusses stream processing and anomaly detection. It covers real-time analytics using streaming systems like Storm. Storm provides a framework for processing streaming data reliably and at scale. The document describes Storm's architecture and data model. It also discusses how Twitter uses Storm to process billions of messages daily. The document then covers anomaly detection in Storm systems, including identifying performance bottlenecks, anomalous nodes, and input traffic spikes in real-time. Statistical and correlation techniques are used to detect anomalies while minimizing false positives.
Statistical Learning Based Anomaly Detection @ TwitterArun Kejariwal
This document discusses Twitter's approach to statistical learning based anomaly detection. It begins with an overview of anomaly detection challenges at scale given Twitter's massive time series data. It then reviews traditional approaches and their limitations, particularly in dealing with seasonality. The document proposes addressing seasonality through time series decomposition before applying a robust statistical approach like ESD on the residual. It provides an example and discusses applications and production deployment at Twitter. In closing, it promotes joining Twitter's efforts in open sourcing their anomaly detection work.
Days In Green (DIG): Forecasting the life of a healthy serviceArun Kejariwal
This document describes Twitter's Days In Green (DIG) methodology for forecasting the lifespan of a healthy service before it exceeds a predefined capacity threshold. It involves collecting time series data on a service's key performance metric, detecting anomalies and breakouts, fitting an ARIMA model to capture trends and seasonality, and forecasting the number of days before the threshold is breached to determine capacity needs. The methodology has been deployed at Twitter to help plan capacity for hundreds of services and detect those nearing disaster recovery thresholds.
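The core projection step can be sketched with a plain linear trend fit (a simplification: as noted above, the deployed methodology uses ARIMA with anomaly and breakout handling; the utilization numbers are made up):

```python
# Sketch of the "days in green" projection: fit a linear trend to
# daily utilization and compute the days until a capacity threshold
# is crossed. (Production uses ARIMA to capture trend + seasonality.)
def days_until_threshold(daily, threshold):
    n = len(daily)
    xs = range(n)
    mx, my = (n - 1) / 2, sum(daily) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, daily)) / \
            sum((x - mx) ** 2 for x in xs)
    intercept = my - slope * mx
    if slope <= 0:
        return None                   # not trending toward the threshold
    crossing = (threshold - intercept) / slope
    return max(0, round(crossing) - (n - 1))  # days from "today"

usage = [40, 42, 41, 44, 46, 45, 48]  # % utilization per day
print(days_until_threshold(usage, 80))  # 26 days of headroom
```

Services whose projected headroom drops below a planning horizon are the ones flagged for capacity action, which is how the forecast becomes a leading indicator rather than a post-mortem.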
Gimme More! Supporting User Growth in a Performant and Efficient FashionArun Kejariwal
This document discusses capacity planning approaches for supporting user growth at Twitter. It describes the need to plan capacity proactively through forecasting to ensure good user experience without overprovisioning resources. The document evaluates several forecasting models like linear regression, splines, Holt-Winters, and ARIMA and their suitability for Twitter's data based on characteristics like outliers, seasonality, and boundary conditions. It emphasizes that accurate forecasting requires continuous refinement of models as the data stream evolves over time.
This document summarizes Twitter's approach to capacity planning for large events like the Super Bowl. It discusses using historical traffic patterns to predict capacity needs, analyzing key metrics like tweets per second, and planning for potential traffic spikes through statistical analysis and scenario modeling. For Super Bowl 2013, Twitter's models predicted a traffic spike could push tweets per second into the 20,000+ range, higher than previous years, and the company was able to maintain high availability during the game despite the brief blackout.
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...sameer shah
"Join us for STATATHON, a dynamic 2-day event dedicated to exploring statistical knowledge and its real-world applications. From theory to practice, participants engage in intensive learning sessions, workshops, and challenges, fostering a deeper understanding of statistical methodologies and their significance in various fields."
Open Source Contributions to Postgres: The Basics POSETTE 2024ElizabethGarrettChri
Postgres is the most advanced open-source database in the world and it's supported by a community, not a single company. So how does this work? How does code actually get into Postgres? I recently had a patch submitted and committed and I want to share what I learned in that process. I’ll give you an overview of Postgres versions and how the underlying project codebase functions. I’ll also show you the process for submitting a patch and getting that tested and committed.
End-to-end pipeline agility - Berlin Buzzwords 2024Lars Albertsson
We describe how we achieve high change agility in data engineering by eliminating the fear of breaking downstream data pipelines through end-to-end pipeline testing, and by using schema metaprogramming to safely eliminate boilerplate involved in changes that affect whole pipelines.
A quick poll on agility in changing pipelines from end to end indicated a huge span in capabilities. For the question "How long does it take for all downstream pipelines to be adapted to an upstream change," the median response was 6 months, but some respondents could do it in less than a day. When quantitative data engineering differences between the best and worst are measured, the span is often 100x-1000x, sometimes even more.
A long time ago, we suffered at Spotify from fear of changing pipelines due to not knowing what the impact might be downstream. We made plans for a technical solution to test pipelines end-to-end to mitigate that fear, but the effort failed for cultural reasons. We eventually solved this challenge, but in a different context. In this presentation we will describe how we test full pipelines effectively by manipulating workflow orchestration, which enables us to make changes in pipelines without fear of breaking downstream.
Making schema changes that affect many jobs also involves a lot of toil and boilerplate. Using schema-on-read mitigates some of it, but has drawbacks since it makes it more difficult to detect errors early. We will describe how we have rejected this tradeoff by applying schema metaprogramming, eliminating boilerplate but keeping the protection of static typing, thereby further improving agility to quickly modify data pipelines without fear.
Codeless Generative AI Pipelines
(GenAI with Milvus)
https://ml.dssconf.pl/user.html#!/lecture/DSSML24-041a/rate
Discover the potential of real-time streaming in the context of GenAI as we delve into the intricacies of Apache NiFi and its capabilities. Learn how this tool can significantly simplify the data engineering workflow for GenAI applications, allowing you to focus on the creative aspects rather than the technical complexities. I will guide you through practical examples and use cases, showing the impact of automation on prompt building. From data ingestion to transformation and delivery, witness how Apache NiFi streamlines the entire pipeline, ensuring a smooth and hassle-free experience.
Timothy Spann
https://www.youtube.com/@FLaNK-Stack
https://medium.com/@tspann
https://www.datainmotion.dev/
milvus, unstructured data, vector database, zilliz, cloud, vectors, python, deep learning, generative ai, genai, nifi, kafka, flink, streaming, iot, edge
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...Kaxil Naik
Navigating today's data landscape isn't just about managing workflows; it's about strategically propelling your business forward. Apache Airflow has stood out as the benchmark in this arena, driving data orchestration forward since its early days. As we dive into the complexities of our current data-rich environment, where the sheer volume of information and its timely, accurate processing are crucial for AI and ML applications, the role of Airflow has never been more critical.
In my journey as the Senior Engineering Director and a pivotal member of Apache Airflow's Project Management Committee (PMC), I've witnessed Airflow transform data handling, making agility and insight the norm in an ever-evolving digital space. At Astronomer, our collaboration with leading AI & ML teams worldwide has not only tested but also proven Airflow's mettle in delivering data reliably and efficiently—data that now powers not just insights but core business functions.
This session is a deep dive into the essence of Airflow's success. We'll trace its evolution from a budding project to the backbone of data orchestration it is today, constantly adapting to meet the next wave of data challenges, including those brought on by Generative AI. It's this forward-thinking adaptability that keeps Airflow at the forefront of innovation, ready for whatever comes next.
The ever-growing demands of AI and ML applications have ushered in an era where sophisticated data management isn't a luxury—it's a necessity. Airflow's innate flexibility and scalability are what makes it indispensable in managing the intricate workflows of today, especially those involving Large Language Models (LLMs).
This talk isn't just a rundown of Airflow's features; it's about harnessing these capabilities to turn your data workflows into a strategic asset. Together, we'll explore how Airflow remains at the cutting edge of data orchestration, ensuring your organization is not just keeping pace but setting the pace in a data-driven future.
Session in https://budapestdata.hu/2024/04/kaxil-naik-astronomer-io/ | https://dataml24.sessionize.com/session/667627
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Aggregage
This webinar will explore cutting-edge, less familiar but powerful experimentation methodologies which address well-known limitations of standard A/B Testing. Designed for data and product leaders, this session aims to inspire the embrace of innovative approaches and provide insights into the frontiers of experimentation!
6. 6
Challenge: Surfacing Relevant Content
Explosive Content Creation
Large variety of media: blogs, reviews, news articles, streaming content
> 500M Tweets every day
> 300 hrs of video uploaded every minute
> 1.8B photos uploaded online in 2014
[1] http://www.kpcb.com/blog/2014-internet-trends
7. 7
High Volume
Content Consumption
WhatsApp messages per day: >30B [1]
Pandora listener hours (Q2 2015): 5.3B [3]
Skype calls per month: 4.76B
E-mails per second: >2.2M
Google searches per year: >1T [2]
Netflix hours streamed per month: >1B
[1] https://www.facebook.com/jan.koum/posts/10152994719980011?pnref=story
[2] http://searchengineland.com/google-1-trillion-searches-per-year-212940
[3] http://press.pandora.com/phoenix.zhtml?c=251764&p=irol-newsArticle&ID=2070623
8. 8
A New World
Mobile, Mobile, Mobile
Anywhere, Anytime, Any Device
5.4B mobile phone users [1]
2.1B smartphone subscriptions in 2014 [1]
69% Y/Y growth in data traffic
55% mobile video traffic
34% of global e-commerce [2]
AVAILABILITY, PERFORMANCE, RELIABILITY
[1] http://www.kpcb.com/blog/2015-internet-trends
[2] http://www.criteo.com/media/1894/criteo-state-of-mobile-commerce-q1-2015-ppt.pdf
9. 9
Market pulse
Finance/Investing
One-minute bids and offers, March 8, 2011 [1]
Mobile trading on the rise [2]:
NSE: 48% increase in turnover, Jan'14 -> Dec'14
BSE: 0.25% (Jan'14) -> 0.5% (Nov'14) of total volume
[1] Image borrowed from http://www.bloomberg.com/bw/articles/2013-06-06/how-the-robots-lost-high-frequency-tradings-rise-and-fall
[2] http://articles.economictimes.indiatimes.com/2014-12-26/news/57420480_1_ravi-varanasi-mobile-platform-nse
10. 10
Entertainment: MMOs
Game of War
Largest single-world concurrent mobile game in the world
"Real-time Many-to-Many is Tomorrow's Internet" - Francois Orsini
Global scale
Collaborative: make alliances
Real-time messaging
Chat translation in multiple languages
11. 11
On the rise
Cybersecurity
2014 breaches: Michaels Jan'14, PF Changs June'14, New York July'14, UPS Aug'14, Home Depot Sept'14, JP Morgan Oct'14, Sony Nov'14, Staples Dec'14
2015: OPM, Anthem, UCLA
$400B: estimated annual economic impact of cybercrime [1]
[1] http://www.mcafee.com/us/resources/reports/rp-economic-impact-cybercrime2.pdf
12. 12
Supporting higher volume and speed
Hardware Innovations
Massively parallel: Intel's "Knights Landing" Xeon Phi - 72 cores [1]
High speed, low power
Intel and Micron's 3D XPoint Technology [2]: 1000x faster than NAND
"... quickly identify fraud detection patterns in financial transactions; healthcare researchers could process and analyze larger data sets in real time, accelerating complex tasks such as genetic analysis and disease tracking." [3]
[1] http://www.anandtech.com/show/9436/quick-note-intel-knights-landing-xeon-phi-omnipath-100-isc-2015
[2] Intel IDS'15
[3] http://newsroom.intel.com/community/intel_newsroom/blog/2015/07/28/intel-and-micron-produce-breakthrough-memory-technology
13. 13
Hardware support for apps
Hardware Innovations
Image and touch processing support in Intel's Skylake [1]
[1] Images borrowed from Julius Madelblat's and Andy Vargas, Rajeev Nalawadi and Shane Abreu's Technology Insight at IDF'15.
15. 15
Real time
User Experience, Productivity
Real-time video streams: news
Drones - delivery/monitoring: $1.7B for 2015 [1]
Robotics [2] - industry: $40B by 2020 [3]
[1] http://www.kpcb.com/blog/2015-internet-trends
[2] http://www.bostondynamics.com/robot_Atlas.html
[3] http://www.marketsandmarkets.com/Market-Reports/Industrial-Robotics-Market-643.html
16. 16
Internet of Things
Large Market Potential
$1.9T in value by 2020 - Mfg (15%), Health Care (15%), Insurance (11%)
26B - 75B units [2, 3, 4, 5]
Improve operational efficiencies, customer experience, new business models
Beacons: retailers and bank branches - 60M units market by 2019 [6]
Smart buildings: reduce energy costs, cut maintenance costs, increase safety & security
[1] Background image taken from https://www.uspsoig.gov/sites/default/files/document-library-files/2015/rarc-wp-15-013.pdf
[2] http://www.gartner.com/newsroom/id/2636073
[3] https://www.abiresearch.com/press/more-than-30-billion-devices-will-wirelessly-conne
[4] http://newsroom.cisco.com/feature-content?type=webcontent&articleId=1208342
[5] http://www.businessinsider.com/75-billion-devices-will-be-connected-to-the-internet-by-2020-2013-10
[6] https://www.abiresearch.com/press/ibeaconble-beacon-shipments-to-break-60-million-by/
17. 17
The Future
Biostamps [2]
Mobile: exponential growth [1]
Sensor networks
[1] http://opensignal.com/assets/pdf/reports/2015_08_fragmentation_report.pdf
[2] http://www.ericsson.com/thinkingahead/networked_society/stories/#/film/mc10-biostamp
18. 18
Continuous Monitoring
Intelligent Health Care
Tracking movements: measure the effect of social influences
Google Lens: measure glucose level in tears
Watch/Wristband
Smart textiles: skin temperature, perspiration
Ingestible sensors: medication compliance, heart function [1]
[1] http://www.hhnmag.com/Magazine/2015/Apr/cover-medical-technology
19. 19
Connected World
Internet of Things: 30B connected devices by 2020
Health Care: 153 Exabytes (2013) -> 2314 Exabytes (2020)
Machine Data: 40% of the digital universe by 2020
Connected Vehicles: data transferred per vehicle per month, 4 MB -> 5 GB
Digital Assistants (Predictive Analytics): $2B (2012) -> $6.5B (2019) [1]; Siri/Cortana/Google Now
Augmented/Virtual Reality: $150B by 2020 [2]; Oculus/HoloLens/Magic Leap
[1] http://www.siemens.com/innovation/en/home/pictures-of-the-future/digitalization-and-software/digital-assistants-trends.html
[2] http://techcrunch.com/2015/04/06/augmented-and-virtual-reality-to-hit-150-billion-by-2020/#.7q0heh:oABw
21. 21
What is Analytics?
According to Wikipedia:
DISCOVERY - the ability to identify patterns in data
COMMUNICATION - provide insights in a meaningful way
22. 22
Types of Analytics
CUBE ANALYTICS: Business Intelligence
PREDICTIVE ANALYTICS: Statistics and Machine Learning
23. 23
What is Real-Time Analytics?
It's contextual
BATCH: high throughput, > 1 hour - monthly active users, relevance for ads, ad hoc queries
NEAR REAL TIME: approximate, > 1 sec - ad impressions count, hashtag trends
REAL TIME (online non-transactional): latency sensitive, < 500 ms - fanout Tweets, search for Tweets, deterministic workflows
REAL TIME (online transactional): low latency, < 1 ms - financial trading
28. 28
It's different
Key Characteristics
FAULT TOLERANCE [1] - availability
SCALE OUT - high performance
ROBUST - incomplete data
[1] Byzantine failures are described in the following journal paper: Driscoll, Kevin; Hall, Brendan; Sivencrona, Håkan; Zumsteg, Phil (2003). "Byzantine Fault Tolerance, from Theory to Reality". LNCS 2788, pp. 235-248.
33. 33
Sampling
Obtain a representative sample from a data stream
Maintain a dynamic sample: a data stream is a continuous process, and it is not known in advance how many points may elapse before an analyst may need a representative sample
Reservoir sampling [1]
Probabilistic insertions and deletions on arrival of new stream points
The probability of successive insertion of new points reduces with the progression of the stream
An unbiased sample contains a larger and larger fraction of points from the distant history of the stream
Practical perspective: the data stream may evolve, and hence the majority of the points in the sample may represent stale history
[1] J. S. Vitter. Random Sampling with a Reservoir. ACM Transactions on Mathematical Software, Vol. 11(1):37-57, March 1985.
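The classic reservoir scheme above fits in a few lines. This is a minimal illustrative sketch (the function name and seed are mine, not from the tutorial): the first k points fill the reservoir, after which the ith arriving point is kept with probability k/i, replacing a random resident.

```python
import random

def reservoir_sample(stream, k, seed=42):
    """Vitter-style reservoir sampling: a uniform random sample of k
    items from a stream of unknown length, in a single pass."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)          # fill the reservoir first
        else:
            j = rng.randrange(i + 1)     # insertion probability k/(i+1)
            if j < k:                    # shrinks as the stream progresses
                sample[j] = item
    return sample

print(reservoir_sample(range(10_000), 10))  # 10 items, uniform over the stream
```

Note how the shrinking insertion probability matches the slide: each resident point, old or new, ends up in the final sample with equal probability.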
34. 34
Sampling
Sliding window approach (sample size k, window width n) [1]
Sequence-based
  Replace the expired element with the newly arrived element; disadvantage: highly periodic
  Chain-sample approach: select the ith element with probability Min(i,n)/n; select uniformly at random an index from [i+1, i+n] for the element which will replace the ith item; maintain k independent chain samples
Timestamp-based
  # elements in a moving window may vary over time
  Priority-sample approach
[1] B. Babcock. Sampling From a Moving Window Over Streaming Data. In Proceedings of SODA, 2002.
(Illustration: a window sliding over the stream 3 5 1 4 6 2 8 5 2 3 5 4 2 2 5 0 9 8 4 6 7 3.)
35. 35
Sampling
Biased Reservoir Sampling [1]
Use a temporal bias function - recent points have a higher probability of being represented in the sample reservoir
Memory-less bias functions: the future probability of retaining a current point in the reservoir is independent of its past history or arrival time
The probability of the rth point belonging to the reservoir at time t is proportional to the bias function
Exponential bias function for the rth data point at time t: f(r, t) = e^(-λ(t-r)), where r ≤ t and λ ∈ [0, 1] is the bias rate
The maximum reservoir requirement R(t) is bounded
[1] C. C. Aggarwal. On Biased Reservoir Sampling in the presence of Stream Evolution. In Proceedings of VLDB, 2006.
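One way to realize a memory-less exponential bias is the replacement scheme from the cited paper, sketched below under my reading of it (names and seed are illustrative, not from the tutorial): with reservoir capacity n = 1/λ, every arriving point is inserted; with probability equal to the current fill fraction it replaces a random resident, otherwise it is appended.

```python
import random

def biased_reservoir(stream, capacity, seed=7):
    """Sketch of exponentially biased reservoir sampling: capacity
    n = 1/lambda; recent points dominate the reservoir because every
    arrival is inserted and victims are chosen uniformly at random."""
    rng = random.Random(seed)
    reservoir = []
    for item in stream:
        fill = len(reservoir) / capacity
        if reservoir and rng.random() < fill:
            # replace a random resident; old points decay geometrically
            reservoir[rng.randrange(len(reservoir))] = item
        else:
            reservoir.append(item)
    return reservoir

res = biased_reservoir(range(100_000), capacity=100)
print(len(res), sum(1 for x in res if x >= 99_000))
```

Each slot is overwritten with probability 1/n per step once full, so a point of age Δ survives with probability about e^(-Δ/n): exactly the exponential bias on the slide, with the reservoir size bounded by construction.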
36. 36
Sampling
General problem
Input: tuples of n components; a subset are key components - the basis for sampling
Sample of size a/b:
  Hash the key to b buckets
  Accept a tuple if its hash value < a
Space constraint: a <- a - 1; remove tuples whose keys hash to a
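The bucket-hash scheme above can be sketched as follows (class and helper names are hypothetical). Because acceptance depends only on the key's bucket, all tuples sharing a key are kept or dropped together, and shrinking the sample evicts exactly one bucket at a time.

```python
import hashlib

def bucket_of(key, b):
    """Deterministically map a key to one of b buckets."""
    digest = hashlib.sha1(str(key).encode()).hexdigest()
    return int(digest, 16) % b

class KeySample:
    """Keep an a/b sample of a tuple stream, sampled by key component:
    accept a tuple iff its key hashes below threshold a; under space
    pressure, decrement a and evict the newly rejected bucket."""
    def __init__(self, a, b, key=lambda t: t[0]):
        self.a, self.b, self.key = a, b, key
        self.kept = []
    def offer(self, tup):
        if bucket_of(self.key(tup), self.b) < self.a:
            self.kept.append(tup)
    def shrink(self):
        # a <- a - 1; drop tuples whose keys hash to the removed bucket
        self.a -= 1
        self.kept = [t for t in self.kept
                     if bucket_of(self.key(t), self.b) < self.a]

sample = KeySample(a=3, b=10)          # a 3/10 sample keyed on t[0]
for user in range(100):
    for q in range(3):
        sample.offer((user, q))        # e.g., (user_id, query_id) tuples
```

For a sampled user, all three of their tuples are retained, which is what makes key-based sampling useful for per-key statistics.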
37. 37
Set Membership
Filtering
Determine, with some false probability, if an item in a data stream has been seen before
Databases (e.g., speed up semi-join operations), caches, routers, storage systems
  Reduce space requirements in probabilistic routing tables
  Speed up longest-prefix matching of IP addresses
  Encode multicast forwarding information in packets
  Summarize content to aid collaborations in overlay and peer-to-peer networks
  Improve network state management and monitoring
38. 38
Set Membership
Filtering
Application to hyphenation programs; early UNIX spell checkers
[1] Illustration borrowed from http://www.eecs.harvard.edu/~michaelm/postscripts/im2005b.pdf
39. 39
Set Membership
Filtering
The Bloom filter: a natural generalization of hashing
False positives are possible; no false negatives; no deletions allowed
For false positive rate ε, # hash functions = log2(1/ε)
where n = # elements, k = # hash functions, m = # bits in the array
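A minimal Bloom filter makes these parameters concrete. In this sketch (class name and sizing helper are mine), m is derived from the standard formula m = -n·ln(ε)/(ln 2)², and k = ln 2·(m/n) hash functions are simulated by salting one cryptographic hash.

```python
import hashlib
import math

class BloomFilter:
    """Minimal Bloom filter: m bits, k salted hash functions.
    False positives possible; no false negatives; no deletions."""
    def __init__(self, n_items, fp_rate):
        self.m = math.ceil(-n_items * math.log(fp_rate) / math.log(2) ** 2)
        self.k = max(1, round(math.log(2) * self.m / n_items))
        self.bits = bytearray((self.m + 7) // 8)
    def _positions(self, item):
        for i in range(self.k):          # k hashes via salting
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m
    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)
    def __contains__(self, item):
        return all(self.bits[p // 8] >> (p % 8) & 1
                   for p in self._positions(item))

bf = BloomFilter(n_items=1000, fp_rate=0.01)
for w in ("storm", "heron", "flink"):
    bf.add(w)
print("storm" in bf)  # True; unseen items are false with high probability
```

Membership queries on inserted items always return True, while a query on an unseen item is wrong only with probability about ε.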
40. 40
Set Membership
Filtering
Minimizing the false positive rate ε w.r.t. k [1]:
  k = ln 2 * (m/n)
  ε = (1/2)^k ≈ (0.6185)^(m/n)
1.44 * log2(1/ε) bits per item, independent of item size or # items
Information-theoretic minimum: log2(1/ε) bits per item, so 44% overhead
X = # 0 bits
[1] A. Broder and M. Mitzenmacher. Network Applications of Bloom Filters: A Survey. In Internet Mathematics Vol. 1, No. 4, 2005.
41. 41
Set Membership
Filtering
Derivatives:
  Counting Bloom filters: support deletion; each bit becomes a small counter (typically 4 bits per counter suffice) with increment/decrement
  Blocked Bloom filters
  d-left Counting Bloom filters
  Quotient filters
  Rank-Indexed Hashing
42. 42
Set Membership
Filtering
Cuckoo Filter [1]
Key highlights:
  Add and remove items dynamically
  For false positive rate ε < 3%, more space efficient than a Bloom filter
  Higher performance than a Bloom filter for many real workloads
  Asymptotically worse performance than a Bloom filter: minimum fingerprint size ∝ log(# entries in the table)
Overview:
  Stores only a fingerprint of each inserted item; the original key and value bits of an item are not retrievable
  Set membership query for item x: search the hash table for the fingerprint of x
[1] Fan et al., Cuckoo Filter: Practically Better Than Bloom. In Proceedings of the 10th ACM International on Conference on Emerging Networking Experiments and Technologies, 2014.
43. 43
Set Membership
Filtering
Cuckoo Hashing [1]
  High space occupancy
  Practical implementations: multiple items per bucket
  Example uses: software-based Ethernet switches
Cuckoo Filter
  Uses a multi-way associative Cuckoo hash table
  Employs partial-key cuckoo hashing: relocate existing fingerprints to their alternative locations [2]
[1] R. Pagh and F. Rodler. Cuckoo hashing. Journal of Algorithms, 51(2):122-144, 2004.
[2] Illustration borrowed from "Fan et al., Cuckoo Filter: Practically Better Than Bloom. In Proceedings of the 10th ACM International on Conference on Emerging Networking Experiments and Technologies, 2014."
44. 44
Set Membership
Filtering
Cuckoo Filter
Partial-key cuckoo hashing
  Fingerprint hashing ensures uniform distribution of items in the table
  Length of fingerprint << size of h1 or h2
  Possible to have multiple entries of a fingerprint in a bucket
Deletion: the item must have been previously inserted
Comparison
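Partial-key cuckoo hashing can be illustrated with a toy filter (all names and parameters below are mine; real implementations pack fingerprints into raw bits). The key trick from the paper is that the alternate bucket is i2 = i1 XOR hash(fingerprint), so an entry can be relocated, or deleted, without ever consulting the original item.

```python
import hashlib
import random

def _h(x):
    return int.from_bytes(hashlib.sha256(str(x).encode()).digest()[:8], "big")

class CuckooFilter:
    """Toy partial-key cuckoo filter: stores only fingerprints; the
    alternate bucket is derived from bucket index and fingerprint alone."""
    def __init__(self, n_buckets=1024, bucket_size=4, fp_bits=12, max_kicks=500):
        assert n_buckets & (n_buckets - 1) == 0   # power of two for XOR trick
        self.bs, self.kicks = bucket_size, max_kicks
        self.mask, self.fp_mask = n_buckets - 1, (1 << fp_bits) - 1
        self.buckets = [[] for _ in range(n_buckets)]
        self.rng = random.Random(1)
    def _fp_index(self, item):
        fp = (_h(("fp", item)) & self.fp_mask) or 1   # nonzero fingerprint
        return fp, _h(item) & self.mask
    def _alt(self, i, fp):
        return (i ^ _h(fp)) & self.mask               # involutive: alt(alt(i)) == i
    def add(self, item):
        fp, i1 = self._fp_index(item)
        for i in (i1, self._alt(i1, fp)):
            if len(self.buckets[i]) < self.bs:
                self.buckets[i].append(fp)
                return True
        i = self.rng.choice((i1, self._alt(i1, fp)))
        for _ in range(self.kicks):                   # relocate existing fingerprints
            j = self.rng.randrange(self.bs)
            fp, self.buckets[i][j] = self.buckets[i][j], fp
            i = self._alt(i, fp)
            if len(self.buckets[i]) < self.bs:
                self.buckets[i].append(fp)
                return True
        return False                                  # table considered full
    def __contains__(self, item):
        fp, i1 = self._fp_index(item)
        return fp in self.buckets[i1] or fp in self.buckets[self._alt(i1, fp)]
    def remove(self, item):
        fp, i1 = self._fp_index(item)
        for i in (i1, self._alt(i1, fp)):
            if fp in self.buckets[i]:
                self.buckets[i].remove(fp)
                return True
        return False
```

Deletion simply removes one copy of the fingerprint from either candidate bucket, which is why the slide insists the item must actually have been inserted first.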
45. 45
Estimating Cardinality
# Distinct Elements
Large set of real-world applications
  Database systems/search engines: # distinct queries
  Network monitoring applications
  Natural language processing
  # distinct motifs in a DNA sequence
  # distinct elements in RFID/sensor networks
46. 46
Estimating Cardinality
# Distinct Elements
Historical context (N ≤ 10^9):
  Probabilistic counting [Flajolet and Martin, 1983]
  LogLog counting [Durand and Flajolet, 2003]
  HyperLogLog [Flajolet et al., 2007]
  Sliding HyperLogLog [Chabchoub and Hebrail, 2010]
  HyperLogLog in Practice [Heule et al., 2013]
  Self-Organizing Bitmap [Chen and Cao, 2009]
  Discrete Max-Count [Ting, 2014]: the sequence of sketches forms a Markov chain when h is a strong universal hash; estimate cardinality using a martingale
47. 47
Estimating Cardinality
# Distinct Elements
HyperLogLog
Apply a hash function h to every element in a multiset
The cardinality of the multiset is estimated as 2^max(ϱ), where 0^(ϱ-1)1 is the bit pattern observed at the beginning of a hash value
The above suffers from high variance; employ stochastic averaging:
Partition the input stream into m sub-streams S_i using the first p bits of the hash values (m = 2^p)
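The two ideas above, leading-zero ranks plus stochastic averaging over m = 2^p registers, fit in a short sketch. This is a bare-bones illustration (function name is mine; the small- and large-range corrections of the full algorithm are deliberately omitted), combining registers with the harmonic mean and the usual α_m constant.

```python
import hashlib

def hll_estimate(stream, p=12):
    """Minimal HyperLogLog: route each item to one of m = 2**p registers
    by the first p hash bits; each register keeps the max rank (position
    of the leftmost 1-bit) seen in the remaining bits."""
    m = 1 << p
    registers = [0] * m
    for item in stream:
        x = int.from_bytes(
            hashlib.sha256(str(item).encode()).digest()[:8], "big")
        idx = x >> (64 - p)                     # first p bits pick the register
        w = x & ((1 << (64 - p)) - 1)
        rank = (64 - p) - w.bit_length() + 1    # leading zeros + 1
        registers[idx] = max(registers[idx], rank)
    alpha = 0.7213 / (1 + 1.079 / m)            # bias-correction constant
    return alpha * m * m / sum(2.0 ** -r for r in registers)

print(hll_estimate(range(100_000)))  # close to 100,000
```

With p = 12 the sketch uses 4096 registers and the relative error is around 1.04/sqrt(m), roughly 1.6%, regardless of the true cardinality.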
48. 48
Estimating Cardinality
# Distinct Elements
HyperLogLog in Practice: Optimizations
Use of a 64-bit hash function: total memory requirement 5 * 2^p -> 6 * 2^p, where p is the precision
Empirical bias correction: uses empirically determined data for cardinalities smaller than 5m, and the unmodified raw estimate otherwise
Sparse representation: for n ≪ m, store an integer obtained by concatenating the bit patterns for idx and ϱ(w)
  Use variable-length encoding for integers (a variable number of bytes per integer)
  Use difference encoding - store the difference between successive elements
Other optimizations [1, 2]
[1] http://druid.io/blog/2014/02/18/hyperloglog-optimizations-for-real-world-systems.html
[2] http://antirez.com/news/75
49. 49
Estimating Cardinality
# Distinct Elements
Self-Learning Bitmap (S-bitmap) [1]
Achieves constant relative estimation error for unknown cardinalities over a wide range, say from 10s to > 10^6
Bitmap obtained via an adaptive sampling process
  Bits corresponding to sampled items are set to 1
  Sampling rates are learned from the # distinct items already passed and reduced sequentially as more bits are set to 1
For given input parameters Nmax and estimation precision ε, the size of the bit mask m is chosen; with r = 1 - 2ε²(1+ε²)^(-1) and sampling probability p_k = m(m+1-k)^(-1)(1+ε²)r^k, where k ∈ [1, m], the relative error ≈ ε
[1] Chen et al. "Distinct counting with a self-learning bitmap". Journal of the American Statistical Association, 106(495):879-890, 2011.
50. 50
Estimating Quantiles
Quantiles, Histograms, Icebergs
Large set of real-world applications: database applications, sensor networks, operations
Properties:
  Provide tunable and explicit guarantees on the precision of approximation
  Single pass
Early work:
  [Greenwald and Khanna, 2001] - worst-case space requirement
  [Arasu and Manku, 2004] - sliding-window-based model, worst-case space requirement
51. 51
Estimating Quantiles
Quantiles, Histograms, Icebergs
q-digest [1]
Groups values into variable-size buckets of almost equal weights; unlike a traditional histogram, buckets can overlap
Key features:
  Detailed information about frequent values is preserved; less frequent values are lumped into larger buckets
  Using a message of size m, answers a quantile query within an error of 3 log(σ)/m
Structure: a complete binary tree over the value range, where σ = max signal value, n = # elements, k = compression factor
Except for root and leaf nodes, a node v ∈ q-digest iff count(v) ≤ ⌊n/k⌋ and count(v) + count(v_parent) + count(v_sibling) > ⌊n/k⌋
[1] Shrivastava et al., Medians and Beyond: New Aggregation Techniques for Sensor Networks. In Proceedings of SenSys, 2004.
52. 52
Estimating Quantiles
Quantiles, Histograms, Icebergs
Building a q-digest
q-digests can be constructed in a distributed fashion by merging q-digests
53. 53
Frequent Elements
A core streaming problem
Applications:
  Track bandwidth hogs
  Determine popular tourist destinations
  Itemset mining
  Entropy estimation
  Compressed sensing
  Search log mining
  Network data analysis
  DBMS optimization
54. 54
Frequent Elements
A core streaming problem
Count-Min Sketch [1]
A two-dimensional array of counts with w columns and d rows; each entry of the array is initially zero
d hash functions h_1, ..., h_d : {1 ... n} -> {1 ... w} are chosen uniformly at random from a pairwise independent family
Update: for a new element i, for each row j and k = h_j(i), increment the kth column by one
Point query: est(i) = min_j sketch[j, h_j(i)], where sketch is the table
Parameters (ε, δ): w = ⌈e/ε⌉ and d = ⌈ln(1/δ)⌉
[1] Cormode, Graham; S. Muthukrishnan (2005). "An Improved Data Stream Summary: The Count-Min Sketch and its Applications". J. Algorithms 55: 29-38.
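The update and point-query rules above translate directly into code. This is an illustrative sketch (class name is mine; the d pairwise-independent hash functions are simulated by salting one hash): estimates never undercount, and overcount by at most ε·N with probability 1-δ.

```python
import hashlib
import math

class CountMinSketch:
    """Count-min sketch with w = ceil(e/eps) columns and
    d = ceil(ln(1/delta)) rows of counters."""
    def __init__(self, eps=0.001, delta=0.01):
        self.w = math.ceil(math.e / eps)
        self.d = math.ceil(math.log(1 / delta))
        self.table = [[0] * self.w for _ in range(self.d)]
    def _cols(self, item):
        for j in range(self.d):              # one salted hash per row
            h = hashlib.sha256(f"{j}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.w
    def update(self, item, count=1):
        for j, k in enumerate(self._cols(item)):
            self.table[j][k] += count        # increment column k in row j
    def query(self, item):
        # min over rows bounds the overcount from hash collisions
        return min(self.table[j][k] for j, k in enumerate(self._cols(item)))

cms = CountMinSketch()
for i in range(10_000):
    cms.update(i % 100)      # 100 distinct items, each with true frequency 100
print(cms.query(7))          # >= 100, and close to it
```

Taking the minimum across rows is what makes the estimator one-sided: every row overcounts or is exact, never undercounts.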
55. 55
Frequent Elements
A core streaming problem
Variants of the Count-Min Sketch [1]
Count-Min sketch with conservative update (CU sketch)
  When updating an item with frequency c, avoid unnecessary updating of counter values => reduces over-estimation error
  Still prone to over-estimation error on low-frequency items
Lossy Conservative Update (LCU) - SWS
  Divide the stream into windows
  At window boundaries, ∀ 1 ≤ i ≤ w, 1 ≤ j ≤ d, decrement sketch[i,j] if 0 < sketch[i,j] ≤ …
[1] Cormode, G. 2009. Encyclopedia entry on 'Count-Min Sketch'. In Encyclopedia of Database Systems. Springer, 511-516.
56. 56
Anomaly Detection
Researched over > 50 yrs
Large set of real-world applications:
  Social media: trending analysis
  Fraud detection: insurance, e-commerce, marketing
  Network intrusion detection
  Health care
  Sensor networks: anomalous state detection (e.g., wind turbines)
  Operations
    Metric space: system, application, data center
    Anomalies potentially impact performance, availability, reliability
57. 57
Anomaly Detection
Researched over > 50 yrs
Anomaly is contextual; studied across many fields:
  Manufacturing
  Statistics
  Econometrics, financial engineering
  Signal processing
  Control systems, autonomous systems - fault detection [1]
  Networking
  Computational biology (e.g., microarray analysis)
  Computer vision
[1] A. S. Willsky, "A survey of design methods for failure detection systems," Automatica, vol. 12, pp. 601-611, 1976.
59. 59
Anomaly Detection
Researched over > 50 yrs
Traditional approaches
Rule based: μ ± σ
  Manufacturing, Statistical Process Control [1]
Moving averages: SMA, EWMA, PEWMA
Assumption: normal distribution - mostly does not hold in real life
[1] W. A. Shewhart. Economic Quality Control of Manufactured Product, The Bell Labs Technical Journal, 9(2):364-389, 1930.
60. 60
Anomaly Detection
Researched over > 50 yrs
In practice
Robustness: μ and σ are not robust in the presence of anomalies; use the median and MAD (Median Absolute Deviation)
Seasonality, trend, multi-modal distributions: use time series decomposition
AnomalyDetection R package [1]
[1] https://github.com/twitter/AnomalyDetection
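The median/MAD idea above can be shown in a few lines. This sketch (function name and threshold are illustrative, using the common modified z-score with its 0.6745 constant) flags points far from the median in MAD units; unlike μ and σ, these statistics are not dragged toward the anomalies themselves.

```python
import statistics

def mad_anomalies(series, threshold=3.5):
    """Flag indices whose modified z-score
    |0.6745 * (x - median) / MAD| exceeds the threshold."""
    med = statistics.median(series)
    mad = statistics.median(abs(x - med) for x in series)
    if mad == 0:
        return []                      # degenerate: over half the points identical
    return [i for i, x in enumerate(series)
            if abs(0.6745 * (x - med) / mad) > threshold]

data = [10, 11, 9, 10, 12, 10, 95, 11, 10, 9]   # one obvious spike
print(mad_anomalies(data))  # [6]
```

On the same data, a μ ± 3σ rule is weakened because the spike inflates both the mean and the standard deviation; the median and MAD ignore it, which is exactly the robustness argument on the slide.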
61. 61
Anomaly Detection
Researched over > 50 yrs
Marrying Time Series Decomposition and Robust Statistics
Trend smoothing distortion creates "phantom" anomalies; the median is free from distortion
62. 62
Anomaly Detection
Researched over > 50 yrs
Real-Time Challenges
  Adaptive learning
  Automated modeling
  Marrying theory with contextual relevance
  Operations: a large set of different services in a technology stack; different stacks use different services
  Promising products such as OpsClarity
63. 63
Anomaly Detection
Researched over > 50 yrs
Anomalies in operational data: Challenges
Evolving Needs of Modern Operations
Contextual Application Topology Map
Hierarchical: Datacenter -> Applications -> Services -> Hosts
• Automatically discover the Developer/Architect's view of the application - for the Operations team
  - Framework for system config and context
• Real-time, streaming architecture
  - Keeps up with today's elastic infrastructure
• Scale to 1000s of hosts, 100s of (micro)services
• Present evolution of system state over time
  - DVR-like replay of health, system changes, failures
64. 64
Anomaly Detection
Researched over > 50 yrs
Anomalies in operational data: Challenges
  Automatically learn baselines for metrics
  Data variety requires advanced statistical approaches
  Detect issues earlier; proactive alerting
Example: Detecting Disk Full Issues Early
66. 66
The Key Aspects
Requirements of Stream Processing
In-stream: process data as it passes by
Handle imperfections: delayed, missing and out-of-order data
Predictable: deterministic and repeatable outcomes
Performance and scalability
"8 Requirements of Stream Processing," Mike Stonebraker et al., SIGMOD Record 2005
67. 67
The Key Aspects
Requirements of Stream Processing
High-level languages: SQL or a DSL
Integrate stored and streaming data: for comparing the present with the past
Data safety and availability
Process and respond: the application should keep up at high volumes
"8 Requirements of Stream Processing," Mike Stonebraker et al., SIGMOD Record 2005
68. 68
Window Processing
Stream Processing
T. Akidau et al., The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing. In VLDB, 2015.
69. 69
Three Generations
First Generation
  Extensions to existing database engines, or simplistic engines
  Dedicated to specific applications or use cases
Second Generation
  Enhanced methods regarding language expressiveness
  Distributed processing, load balancing and fault tolerance
Third Generation
  Massive parallelization for processing large data sets
  Dedicated towards cloud computing
http://www.slideshare.net/zbigniew.jerzak/cloudbased-data-stream-processing
73. 73
Notable features
1st Generation Systems
Early: Active DBs, ECA (Event-Condition-Action) rules, triggers, publish-subscribe
Rule pipeline: Event Source -> Signaling (event occurrences) -> Triggering (triggered rules) -> Evaluation (evaluated rules) -> Scheduling (selected rules) -> Execution
Systems: HiPAC, Starburst, Postgres, ODE
"Active Database Systems", Paton and Diaz, ACM Computing Surveys, 1999
Slide from Mike Franklin, VLDB 2015 BIRTE Talk on Real Time Analytics
74. 74
Notable features
1st Generation Applications
  Finance
  Actuation (also IoT?)
  Enforcing database integrity constraints
  Monitoring the physical world (IoT?)
  Supply chain
  News and update dissemination
  Battlefield awareness
  Health monitoring
Slide from Mike Franklin, VLDB 2015 BIRTE Talk on Real Time Analytics
75. 75
Issues
1st Generation Systems
  Rules were (are) hard to program or understand
  Smart engineering of traditional approaches can get you close enough?!
  Little commercial activity
Slide from Mike Franklin, VLDB 2015 BIRTE Talk on Real Time Analytics
77. 77
2nd Generation Systems
Early 2000s - Late 2000s
NiagaraCQ [Jianjun Chen et al., 2000]
Telegraph, TelegraphCQ [Hellerstein et al., 2000] [Chandrasekaran et al., 2003]
80. 80
The basic idea
Stream Query Processing
Repeatedly apply generic SQL to the results of window operators
Support the full SQL language and ecosystem
A table is a set of records; a stream is an unbounded sequence of records
Window operators convert streams to tables: Streams -> Window Operators -> Tables; each window outputs a set of records
Rstream semantics in CQL, Arvind Arasu et al., VLDB Journal 2006
Slide from Mike Franklin, VLDB 2015 BIRTE Talk on Real Time Analytics
81. 81
TelegraphCQ
Developed at University of California, Berkeley
01 Data stream query processor
02 Continuous and adaptive query processing
03 Built by modifying PostgreSQL
Slide from Mike Franklin, VLDB 2015 BIRTE Talk on Real Time Analytics
82. 82
NiagaraCQ
Developed at UW-Madison
A distributed database system for continuous queries over changing data sets, using a query language like XML-QL
01 Incremental group optimization strategy; incremental evaluation of continuous queries
02 Query grouping: allows sharing common parts of two or more queries
03 Caching: for performance
04 Push/pull data ingestion for detected changes in data; change-based and timer CQs: continuous queries triggered on data changes and on regular timers
84. 84
Borealis
Developed at MIT, Brown and Brandeis
A low-latency distributed stream processing engine with a focus on fault tolerance and distribution
01 Load-aware distribution
02 Fine-grained high availability; load shedding mechanisms
03 Dynamic query modification
04 Dynamic system optimization; dynamic revision of results
85. 85
Summary
2nd Generation Systems
  Streams can be processed using relational operators; many relational operators can be reused
  Historical comparison becomes a join of a stream and its history table
  Views on streams can be created
  Can leverage an RDBMS: streams and stream results can be stored in tables for later querying
Slide from Mike Franklin, VLDB 2015 BIRTE Talk on Real Time Analytics
86. 86
Issues
2nd Generation Systems
  Despite significant commercial activity, no real breakout
  No standardization and no comprehensive benchmarks
  Value proposition for learning new concepts was not clear
Slide from Mike Franklin, VLDB 2015 BIRTE Talk on Real Time Analytics
88. 88
The last decade
Streaming Platforms
  S4 (Yahoo!)
  Storm (Twitter)
  Heron (Twitter)
  Spark (Databricks)
  Samza (LinkedIn)
  MillWheel (Google)
  Pulsar (eBay)
  Flink (Apache)
  S-Store (ISTC, Intel, MIT, Brown, CMU, Portland State)
  Trill (Microsoft)
89. 89
Earliest distributed stream system
Apache S4
Scalable: throughput is linear as additional nodes are added
Cluster management: hides management using a layer on ZooKeeper
Decentralized: all nodes are symmetric; no centralized service
Extensible: building blocks of the platform can be replaced by custom implementations
Fault tolerance: standby servers take over when a node fails
Proven: deployed in Yahoo!, processing thousands of search queries per second
91. 91
Storm Terminology
Topology: a directed acyclic graph - vertices = computation, edges = streams of data tuples
Spouts: sources of data tuples for the topology; examples - Kafka/Kestrel/MySQL/Postgres
Bolts: process incoming tuples and emit outgoing tuples; examples - filtering/aggregation/join/any function
93. 93
Tweet Word Count Topology
Tweet Spout -> Parse Tweet Bolt -> Word Count Bolt
Live stream of Tweets; e.g., #worldcup: 1M, soccer: 400K, ...
94. 94
Tweet Word Count Topology
Tweet Spout -> Parse Tweet Bolt -> Word Count Bolt
When a parse tweet bolt task emits a tuple, which word count bolt task should it send it to?
95. 95
Storm Groupings
01 Shuffle Grouping: random distribution of tuples
02 Fields Grouping: group tuples by a field or multiple fields
03 All Grouping: replicate tuples to all tasks
04 Global Grouping: send the entire stream to one task
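Fields grouping answers the question posed two slides back: route by a hash of the grouping field so every tuple with the same word lands on the same word-count task. A minimal stand-in (function name and the CRC32 choice are illustrative, not Storm's actual hashing):

```python
import zlib

def fields_grouping(values, fields, num_tasks):
    """Pick a downstream task by hashing the grouping field(s): tuples
    sharing the same field values always go to the same task, whereas
    shuffle grouping would pick a task at random."""
    key = "|".join(str(values[f]) for f in fields)
    return zlib.crc32(key.encode()) % num_tasks

# hypothetical parse-tweet output routed to 4 word-count tasks
counts = [{} for _ in range(4)]
for word in "world cup world soccer cup world".split():
    task = fields_grouping({"word": word}, ["word"], num_tasks=4)
    counts[task][word] = counts[task].get(word, 0) + 1
print(counts)  # each word is counted on exactly one task
```

Because routing is deterministic in the word, each task holds the complete count for the words assigned to it, so no cross-task merging is needed.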
108. 108
Some experiments
Storm Overheads
Java program: read from a Kafka cluster and serialize in a loop; sustains input rates of 300K msgs/sec from a Kafka topic
1-stage topology: no acks (acks are needed for at-least-once semantics); Storm processes co-located using the isolation scheduler
1-stage topology with acks: acks enabled for at-least-once semantics
115. 115
MillWheel
Core Concepts
Computations: arbitrary user logic; per-key operation
Persistent state: key/value API, backed by BigTable
Streams: identified by names
Keys: unbounded; per-key operation is serial, different keys run in parallel
116. 116
MillWheel
Low Watermark: The Concept of Time
Caught-up time: defined per computation
Discard late data: ~0.001% of data at Google
Seeded by injectors: the input sources
Monotonic: makes life easy for users
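The low watermark of a computation can be thought of as the oldest event timestamp still in flight anywhere upstream: the minimum over its own pending work and the watermarks of the stages feeding it. A simplified one-stage sketch (MillWheel computes this recursively across the whole pipeline):

```python
def low_watermark(oldest_pending_work, input_low_watermark):
    """Low watermark of one computation:
    min(timestamps of work not yet completed here,
        low watermark of the inputs feeding this stage).
    Everything older than this value has almost certainly been seen,
    which is what lets the system discard (rare) late data."""
    pending = min(oldest_pending_work) if oldest_pending_work else float("inf")
    return min(pending, input_low_watermark)

# The injector reports inputs complete up to t=100, but a record with
# timestamp 40 is still buffered locally: the watermark holds at 40.
print(low_watermark([40, 75], 100))  # 40
# Once pending work drains, the watermark advances (monotonically).
print(low_watermark([], 100))        # 100
```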
117. 117
MillWheel
Strong and Weak Productions
Checkpoint: at the same time as user state (strong productions)
Double count: possible without dedup (weak productions)
No checkpoint: simpler API (weak productions)
Seeded by injectors: the input sources
125. 125
One Size Fits All
Apache Flink
General purpose analytics engine
Open source and community driven
Works well with the Hadoop ecosystem
Came out of Stratosphere
126. 126
Apache Flink
Ambitious Goal: One Size Fits All
Fast runtime: complex DAG operators; data streamed to operators
Iterative algorithms: much faster with in-memory operations
Intuitive APIs: Java/Scala/Python; concise queries, coming from the OLTP world
129. 129
One system to replace them all!
General purpose compute engine
Open source, big community
MapReduce, Streaming, SQL, ...
Integrates well with the Hadoop ecosystem
130. 130
Core Concept: Lots of RDDs
Lots: a huge collection, with lineage info
Resilient: lost datasets are re-computed
Distributed: across the cluster
DataSet: input data divided into batches
Streaming
134. 134
Streaming: With DStreams
Series of RDDs: a lines DStream (batches T0 to T1, T1 to T2, T2 to T3) is transformed by flatMap into a words DStream (T0 to T1, T1 to T2, T2 to T3)
Window functions
Can create other DStreams
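A DStream is just a series of RDDs, one per batch interval, and transforming the DStream means transforming every underlying RDD. A plain-Python sketch of the lines-to-words picture above, with lists standing in for RDDs (illustrative only, not the Spark API):

```python
# Each element of the DStream is one micro-batch (one RDD) covering
# an interval such as T0-T1, T1-T2, T2-T3.
lines_dstream = [
    ["to be or", "not to be"],   # batch T0 to T1
    ["the quick brown fox"],     # batch T1 to T2
    ["hello world"],             # batch T2 to T3
]

def flat_map(batch, fn):
    """flatMap over one RDD: apply fn to each element and flatten."""
    return [item for element in batch for item in fn(element)]

# A DStream transformation is applied batch-by-batch, yielding a new
# series of RDDs: the 'words' DStream.
words_dstream = [flat_map(batch, str.split) for batch in lines_dstream]
print(words_dstream[0])  # ['to', 'be', 'or', 'not', 'to', 'be']
```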
136. 136
Input DStreams: Sources of Data
Basic sources: HDFS, S3, ...
Reliability: ack vs no-ack sources
Custom: implement the interface
Advanced: Kafka, Twitter utils
137. 137
Basic Premise: One Size Fits All
Exactly once: confident about results
Ecosystem: Hadoop, YARN, Kafka, ...
Scalable: RDDs as the unit of scale
Single system: batch + streaming
138. 138
Stream Processing: With SQL
Processing logic in SQL
Annotation plugin framework to extend SQL
Clustering with elastic scaling
No downtime during upgrades
141. 141
Messaging Models
Push (at most once): used for low latency; the producer pushes data directly to the consumer; events are written to Kafka for later replay if the consumer is down or unable to keep up
Pull (at least once): the producer writes events to Kafka; the consumer consumes from Kafka; storing to Kafka allows for replay
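The pull model gets its at-least-once guarantee from when the consumer commits: only after processing a batch. A toy sketch, with the broker modeled as an in-memory list (this is not the Kafka client API; `ToyLog` and its methods are illustrative):

```python
class ToyLog:
    """A minimal stand-in for a Kafka-like partition log."""
    def __init__(self):
        self.events = []
        self.committed_offset = 0

    def append(self, event):
        """Producer side: write an event to the log."""
        self.events.append(event)

    def poll(self):
        """Consumer side: pull everything after the last commit."""
        return self.events[self.committed_offset:]

    def commit(self, offset):
        """Advance the committed offset -- only AFTER processing."""
        self.committed_offset = offset

log = ToyLog()
for e in ["e1", "e2", "e3"]:
    log.append(e)

processed = []
batch = log.poll()
for event in batch:
    processed.append(event)   # process first...
# ...a crash here, before the commit, means the batch is replayed on
# restart: possible duplicates, never loss -- the at-least-once contract.
log.commit(log.committed_offset + len(batch))
print(log.poll())  # [] -- nothing left to replay after the commit
```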
142. 142
Deployment Architecture
Events are partitioned: all events with the same key are routed to the same cell
Scaling: more cells are added to the pipeline; Pulsar automatically detects new cells and rebalances traffic
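Routing every event with the same key to the same cell only needs a deterministic hash. A sketch of the idea, and of why rebalancing onto new cells needs care (this is not Pulsar's actual routing algorithm):

```python
import zlib

def route(key, num_cells):
    """Deterministic key -> cell routing: same key, same cell."""
    return zlib.crc32(key.encode()) % num_cells

keys = [f"user-{i}" for i in range(1000)]

# Every event for a key lands on one cell, so per-key state stays local.
assert route("user-42", 3) == route("user-42", 3)

# Adding a cell (3 -> 4) remaps most keys under plain modulo routing;
# this is why rebalancing traffic is non-trivial (consistent hashing,
# for example, limits how many keys have to move).
moved = sum(route(k, 3) != route(k, 4) for k in keys)
print(f"{moved} of {len(keys)} keys changed cells")
```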
147. 147
Heron
Design: Goals
Fully API compatible with Storm: directed acyclic graphs; topologies, spouts and bolts
Task isolation: ease of debug-ability, isolation and profiling
Batching of tuples: amortizing the cost of transferring tuples
Support for back pressure: topologies should be self-adjusting
Use of mainstream languages: C++, Java and Python
Efficiency: reduce resource consumption
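Back pressure is what makes a topology self-adjusting: a fast producer is slowed to the consumer's pace instead of overwhelming it or forcing drops. A minimal sketch using a bounded queue between two threads (illustrative only; Heron's actual mechanism works at the stream-manager/TCP level):

```python
import queue
import threading

buf = queue.Queue(maxsize=4)   # small bounded channel: spout -> bolt
results = []

def spout():
    for i in range(20):
        buf.put(i)             # blocks when the queue is full: the
                               # producer is throttled to consumer speed
    buf.put(None)              # sentinel: end of stream

def bolt():
    while True:
        item = buf.get()
        if item is None:
            break
        results.append(item * 2)

consumer = threading.Thread(target=bolt)
producer = threading.Thread(target=spout)
consumer.start()
producer.start()
producer.join()
consumer.join()
print(len(results))  # 20 -- nothing dropped despite the tiny buffer
```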
168. 168
3rd Generation Systems: Issues
Bit early to tell
Still no standardization and too many systems
Slide from Mike Franklin's VLDB 2015 BIRTE talk on real-time analytics
172. 172
Lambda Architecture - The Good
Message Broker -> Collection Pipeline -> Analytics Pipeline -> Results
173. 173
Lambda Architecture - The Bad
Have to write everything twice!
Have to fix everything (maybe twice)!
Subtle differences in semantics
What about Graphs, ML, SQL, etc.?
How much duct tape required?
176. 176
Technology Challenges
The Road Ahead
Auto scaling the system in the presence of unpredictability
Auto tuning of real-time analytics jobs/queries
Exploiting faster networks for efficiently moving data
177. 177
Applications
The Road Ahead
Real-time personalization: preferences, time, location and social
Wearable computing: screen size fragmentation
Analytics on image, video and touch: pattern recognition, anomaly detection