Across industries, the focus has been shifting from big data to fast data, driven in part by the deluge of high-velocity data streams and the need for instant data-driven insights. In response, a proliferation of messaging and streaming frameworks has emerged that enterprises use to satisfy the needs of various applications.
Drawing on their experience operating streaming systems at Twitter scale, Karthik Ramasamy, Sanjeev Kulkarni, Arun Kejariwal, and Sijie Guo walk you through state-of-the-art streaming architectures, streaming frameworks, and streaming algorithms, covering the typical challenges in modern real-time big data platforms and offering insights on how to address them. They also discuss how advances in technology might impact the streaming architectures and applications of the future. Along the way, they explore the interplay between storage and stream processing and speculate about future developments.
Topics include:
Basic requirements of stream processing
Streaming and one-pass algorithms
Different types of streaming architectures
An in-depth review of streaming frameworks
Deploying and operating stream processing applications
Lessons learned from building a real-time stack with Apache Pulsar and Apache Heron at Twitter scale
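As a concrete example of the one-pass algorithms covered in the tutorial, reservoir sampling maintains a uniform random sample over an unbounded stream in a single pass with bounded memory. This is a minimal generic sketch, not code from the tutorial itself:

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream in one pass."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Replace an existing element with probability k / (i + 1),
            # which keeps every item seen so far equally likely to survive.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir
```

Memory stays O(k) no matter how long the stream is, which is the defining property of one-pass streaming algorithms.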
Serverless Streaming Architectures and Algorithms for the Enterprise - Arun Kejariwal
In recent years, serverless has gained momentum in the realm of cloud computing. Broadly speaking, it comprises function as a service (FaaS) and backend as a service (BaaS). The distinction between the two is that under FaaS, one writes and maintains the code (e.g., the functions) for serverless compute; in contrast, under BaaS, the platform provides the functionality and manages the operational complexity behind it. Serverless provides a great means to boost development velocity. With greatly reduced infrastructure costs, more agile and focused teams, and faster time to market, enterprises are increasingly adopting serverless approaches to gain a key advantage over their competitors.
Early use cases of serverless include data transformation in batch and ETL scenarios and data processing using MapReduce patterns. As a natural extension, serverless is now being used in streaming contexts such as real-time bidding, fraud detection, and intrusion detection. Serverless is arguably a natural fit for extracting insights from fast data, that is, high-volume, high-velocity data. Example tasks include filtering and reducing noise in the data and leveraging machine learning and deep learning models to provide continuous insights about business operations.
We walk the audience through the landscape of streaming systems at each stage of an end-to-end data processing pipeline: messaging, compute, and storage. We survey the inception and growth of the serverless paradigm. Further, we dive deep into Apache Pulsar, which provides native serverless support in the form of Pulsar functions, and paint a bird’s-eye view of the application domains where Pulsar functions can be leveraged.
Building intelligence into a serverless flow is paramount from a business perspective. To this end, we detail different serverless patterns (event processing, machine learning, and analytics) for different use cases and highlight the trade-offs. We also present perspectives on how advances in hardware technology and the emergence of new applications will impact the evolution of serverless streaming architectures and algorithms. The topics covered include an introduction to streaming, an introduction to serverless, serverless and streaming requirements, Apache Pulsar, application domains, serverless event processing patterns, serverless machine learning patterns, and serverless analytics patterns.
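To make the Pulsar functions model concrete: in its simplest Python form, a Pulsar function is a module-level `process` callable that the platform invokes once per input message. The sketch below shows a noise-filtering function of the kind described above; the field names and the filtering rule are illustrative assumptions, not details from the talk:

```python
import json

def process(input):
    """Serverless filter sketch: drop noisy readings, forward the rest.

    In a Pulsar deployment this module would be submitted as a function and
    `input` would be the message payload; here it is plain Python so the
    logic can be exercised locally.
    """
    event = json.loads(input)
    # Treat readings outside a plausible range as noise and emit nothing.
    # The [0, 100] bound is a hypothetical example threshold.
    if not (0 <= event.get("value", -1) <= 100):
        return None
    return json.dumps({"sensor": event["sensor"], "value": event["value"]})
```

Returning `None` suppresses output, so the function acts as a stream filter between an input and output topic.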
Using the New Apache Flink Kubernetes Operator in a Production Deployment - Flink Forward
Flink Forward San Francisco 2022.
Running natively on Kubernetes, the new Apache Flink Kubernetes Operator is a great way to deploy and manage Flink application and session deployments. In this presentation, we:
- Give a brief overview of Kubernetes operators and their benefits
- Introduce the five levels of the operator maturity model
- Introduce the newly released Apache Flink Kubernetes Operator and FlinkDeployment CRs
- Show Dockerfile modifications you can make to swap out UBI images and Java in the underlying Flink Operator container
- Describe enhancements we're making in versioning, upgradeability, stability, and security
- Demo the Apache Flink Operator in action, with a technical preview of an upcoming product using the Flink Kubernetes Operator
- Share lessons learned, followed by Q&A
by
James Busche & Ted Chang
Making Apache Spark Better with Delta Lake - Databricks
Delta Lake is an open-source storage layer that brings reliability to data lakes. It offers ACID transactions and scalable metadata handling, and it unifies streaming and batch data processing. It runs on top of your existing data lake and is fully compatible with Apache Spark APIs.
In this talk, we will cover:
* What data quality problems Delta helps address
* How to convert your existing application to Delta Lake
* How the Delta Lake transaction protocol works internally
* The Delta Lake roadmap for the next few releases
* How to get involved!
Grafana Loki: like Prometheus, but for Logs - Marco Pracucci
Loki is a horizontally scalable, highly available log aggregation system inspired by Prometheus. It is designed to be very cost-effective and easy to operate, because it does not index the contents of the logs, only a set of labels for each log stream.
In this talk, we will introduce Loki, its architecture, and its design trade-offs in an approachable way. We’ll cover both Loki and Promtail, the agent used to scrape local logs and push them to Loki, including the Prometheus-style service discovery used to dynamically discover logs and attach metadata from applications running in a Kubernetes cluster.
Finally, we’ll show how to query logs with Grafana using LogQL - the Loki query language - and the latest Grafana features to easily build dashboards mixing metrics and logs.
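Loki's central trade-off (index the labels, grep the contents) can be sketched in a few lines of Python. The class and method names here are illustrative, not Loki's actual API; a call like `query({"app": "api"}, contains="500")` plays the role of the LogQL expression `{app="api"} |= "500"`:

```python
class TinyLoki:
    """Toy model of Loki's design: only label sets are indexed."""

    def __init__(self):
        # frozenset of (label, value) pairs -> list of raw log lines
        self.streams = {}

    def push(self, labels, line):
        """Append a log line to the stream identified by its label set."""
        self.streams.setdefault(frozenset(labels.items()), []).append(line)

    def query(self, selector, contains=None):
        """Select streams by label match (the cheap, indexed part),
        then brute-force scan the matching lines (the grep part)."""
        out = []
        for labels, lines in self.streams.items():
            if set(selector.items()) <= labels:
                out.extend(l for l in lines if contains is None or contains in l)
        return out
```

Because only the (small) label sets are indexed, ingestion stays cheap; the cost is paid at query time by scanning the selected streams.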
Pulsar is used by a portfolio of products at Splunk for stream processing of different types of data, including metrics and logs. In this talk, Karthik Ramasamy will share how Splunk helped a flagship customer scale a Pulsar deployment to handle 10 PB/day in a single cluster. He will talk about the journey, the challenges faced, and the trade-offs made to scale Pulsar and operate it reliably and stably in Google Cloud Platform (GCP).
Near real-time statistical modeling and anomaly detection using Flink! - Flink Forward
Flink Forward San Francisco 2022.
At ThousandEyes we receive billions of events every day that allow us to monitor the internet. The most important aspect of our platform is detecting outages and anomalies that have the potential to seriously impact customer applications and user experience, so automatic detection of such events at the lowest latency and highest accuracy is extremely important for our customers and their business. After launching several resilient, low-latency data pipelines in production using Flink, we decided to take it up a notch: we leveraged Flink to build statistical models in near real time and apply them to the incoming stream of events to detect anomalies. In this session we will deep dive into the design and discuss pitfalls and learnings from developing our real-time platform, which leverages Debezium, Kafka, Flink, ElastiCache, and DynamoDB to process events at scale.
by
Kunal Umrigar & Balint Kurnasz
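A minimal version of the kind of near-real-time statistical model this talk describes is a rolling z-score detector: maintain a window of recent values and flag points far from the window's mean. This generic sketch is an assumption about the general approach, not ThousandEyes' actual model:

```python
from collections import deque
import math

class ZScoreDetector:
    """Flag values far from the rolling mean of recent observations."""

    def __init__(self, window=50, threshold=3.0):
        self.window = deque(maxlen=window)  # bounded history of recent values
        self.threshold = threshold          # z-score beyond which we alert

    def observe(self, x):
        anomalous = False
        if len(self.window) >= 10:  # require some history before judging
            mean = sum(self.window) / len(self.window)
            var = sum((v - mean) ** 2 for v in self.window) / len(self.window)
            std = math.sqrt(var) or 1e-9  # avoid division by zero
            anomalous = abs(x - mean) / std > self.threshold
        self.window.append(x)
        return anomalous
```

In a streaming job the window would live in keyed state (one detector per monitored entity), with each incoming event scored against the model before updating it.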
Native Support of Prometheus Monitoring in Apache Spark 3.0 - Databricks
Every production environment requires monitoring and alerting, and Apache Spark has a configurable metrics system that allows users to report Spark metrics to a variety of sinks. Prometheus is a popular open-source monitoring and alerting toolkit that is commonly used together with Apache Spark.
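For reference, Spark 3.0 added a native Prometheus endpoint alongside the existing metrics sinks. The fragment below reflects the options documented for that release; verify the exact property names against the Spark monitoring docs for your version:

```properties
# spark-defaults.conf: expose executor metrics in Prometheus format
# at /metrics/executors/prometheus on the driver UI (Spark 3.0+)
spark.ui.prometheus.enabled  true

# metrics.properties: register the PrometheusServlet sink for all instances
*.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet
*.sink.prometheusServlet.path=/metrics/prometheus
```

With these set, a Prometheus server can scrape Spark directly instead of going through an exporter sidecar.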
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap... - Flink Forward
Flink Forward San Francisco 2022.
Being in the payments space, Stripe requires strict correctness and freshness guarantees. We rely on Flink as the natural solution for delivering on this in support of our Change Data Capture (CDC) infrastructure. We heavily rely on CDC as a tool for capturing data change streams from our databases without critically impacting database reliability, scalability, and maintainability. Data derived from these streams is used broadly across the business and powers many of our critical financial reporting systems totalling over $640 Billion in payment volume annually. We use many components of Flink’s flexible DataStream API to perform aggregations and abstract away the complexities of stream processing from our downstreams. In this talk, we’ll walk through our experience from the very beginning to what we have in production today. We’ll share stories around the technical details and trade-offs we encountered along the way.
by
Jeff Chao
In this tutorial we walk through state-of-the-art streaming systems, algorithms, and deployment architectures, covering the typical challenges in modern real-time big data platforms and offering insights on how to address them. We also discuss how advances in technology might impact the streaming architectures and applications of the future. Along the way, we explore the interplay between storage and stream processing and discuss future developments.
How Orange Financial combat financial frauds over 50M transactions a day usin... - StreamNative
You will learn how Orange Financial combats financial fraud across over 50M transactions a day using Apache Pulsar. This presentation was shared at the Strata Data Conference in New York, September 2019.
Prometheus - Intro, CNCF, TSDB, PromQL, Grafana - Sridhar Kumar N
https://www.youtube.com/playlist?list=PLAiEy9H6ItrKC5PbH7KiELiSEIKv3tuov
- What is Prometheus?
- Difference between Nagios and Prometheus
- Architecture
- Alertmanager
- Time series DB
- PromQL (Prometheus Query Language)
- Live Demo
- Grafana
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ... - Flink Forward
Flink Forward San Francisco 2022.
Probably everyone who has written stateful Apache Flink applications has used one of the fault-tolerant keyed state primitives ValueState, ListState, and MapState. With RocksDB, however, retrieving and updating items comes at an increased cost that you should be aware of. Sometimes, these costs cannot be avoided with the current API, e.g., for efficient event-time stream sorting or streaming joins where you need to iterate one or two buffered streams in the right order. With FLIP-220, we are introducing a new state primitive: BinarySortedMultiMapState. This new form of state allows you to (a) efficiently store lists of values for a user-provided key, and (b) iterate keyed state in a well-defined sort order. Both features can be backed efficiently by RocksDB with a 2x performance improvement over the current workarounds. This talk will go into the details of the new API and its implementation, present how to use it in your application, and talk about the process of getting it into Flink.
by
Nico Kruber
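The contract the talk describes (per-key value lists, iterable in key order) can be emulated in plain Python to illustrate the idea. This is a user-space sketch of the concept, not the Flink API; think of the keys as event timestamps in the stream-sorting use case:

```python
import bisect

class SortedMultiMap:
    """Illustrative model of a sorted multimap: lists of values per key,
    with iteration in ascending key order (e.g., keys as event timestamps)."""

    def __init__(self):
        self._keys = []    # kept sorted at all times
        self._values = {}  # key -> list of values for that key

    def add(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)  # insert key at its sorted position
            self._values[key] = []
        self._values[key].append(value)

    def items_in_order(self):
        """Yield (key, values) pairs in ascending key order."""
        for k in self._keys:
            yield k, self._values[k]
```

The point of the native Flink primitive is that RocksDB's key ordering provides this sorted iteration for free, instead of maintaining the order in user code as above.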
Flink Forward San Francisco 2022.
Resource elasticity is a frequently requested feature in Apache Flink: users want to be able to easily adjust their clusters to changing workloads for resource-efficiency and cost-saving reasons. In Flink 1.13, the initial implementation of Reactive Mode was introduced, and later releases added more improvements to make the feature production-ready. In this talk, we’ll explain scenarios for deploying Reactive Mode to various environments to achieve autoscaling and resource elasticity. We’ll discuss the constraints to consider when planning to use this feature, as well as potential improvements from the Flink roadmap. For those interested in the internals of Flink, we’ll also briefly explain how the feature is implemented and, if time permits, conclude with a short demo.
by
Robert Metzger
Beam + Pulsar: Powerful Stream Processing at Scale - Pulsar Summit SF 2022 - StreamNative
Pulsar Summit San Francisco is the event dedicated to Apache Pulsar. This one-day, action-packed event will include 5 keynotes, 12 breakout sessions, and 1 amazing happy hour. Speakers are from top companies, including Google, AWS, Databricks, Onehouse, StarTree, Intel, ScyllaDB, and more! It’s the perfect opportunity to network with Pulsar thought leaders in person.
Join developers, architects, data engineers, DevOps professionals, and anyone who wants to learn about messaging and event streaming for this one-day, in-person event. Pulsar Summit San Francisco brings the Apache Pulsar Community together to share best practices and discuss the future of streaming technologies.
Unlocking the Power of Lakehouse Architectures with Apache Pulsar and Apache ... - StreamNative
Lakehouses are quickly growing in popularity as a new approach to data platform architecture, bringing some of the long-established benefits of the OLTP world to OLAP, including transactions, record-level updates/deletes, and change streaming. In this talk, we will discuss Apache Hudi and how it unlocks the possibility of building your own fully open-source lakehouse featuring a rich set of integrations with existing technologies, including Apache Pulsar. In this session, we will:
- Explain what lakehouses are and why they are needed
- Describe what Apache Hudi is and how it works
- Provide a use case and demo that applies Apache Hudi’s DeltaStreamer tool to ingest data from Apache Pulsar
Building Cloud-Native App Series - Part 11 of 11
Microservices Architecture Series
Service Mesh - Observability
- Zipkin
- Prometheus
- Grafana
- Kiali
Evening out the uneven: dealing with skew in Flink - Flink Forward
Flink Forward San Francisco 2022.
When running Flink jobs, skew is a common problem that results in wasted resources and limited scalability. In the past years, we have helped our customers and users solve various skew-related issues in their Flink jobs or clusters. In this talk, we will present the different types of skew that users often run into: data skew, key skew, event time skew, state skew, and scheduling skew, and discuss solutions for each of them. We hope this will serve as a guideline to help you reduce skew in your Flink environment.
by
Jun Qin & Karl Friedrich
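One standard mitigation for the key skew mentioned above is two-stage aggregation with key salting: hot keys are spread across several partial aggregators and the partial results are merged by the original key afterwards. The sketch below is a generic illustration of the technique, not code from the talk:

```python
import random
from collections import Counter

def two_stage_count(events, hot_keys, fan_out=4, rng=None):
    """Count events per key, salting known hot keys so no single
    partition receives all of their traffic."""
    rng = rng or random.Random()
    partial = Counter()
    for key in events:
        # Stage 1: append a random salt to hot keys, spreading their
        # records over `fan_out` logical partitions.
        salt = rng.randrange(fan_out) if key in hot_keys else 0
        partial[(key, salt)] += 1
    final = Counter()
    for (key, _salt), n in partial.items():
        # Stage 2: merge partial counts back by the original key.
        final[key] += n
    return final
```

In a Flink job the two stages would be separate keyed operators (keyBy the salted key, then keyBy the original key), trading one extra shuffle for even load.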
Modern cloud-native applications are incredibly complex systems. Keeping these systems healthy and meeting SLAs for our customers is crucial for long-term success. In this session, we will dive into the three pillars of observability (metrics, logs, and tracing), the foundation of successful troubleshooting in distributed systems. You'll learn the gotchas and pitfalls of rolling out the OpenTelemetry stack on Kubernetes to effectively collect all your signals without worrying about vendor lock-in. Additionally, we will replace parts of the Prometheus stack, scraping metrics with the OpenTelemetry collector and operator.
Apache Iceberg - A Table Format for Huge Analytic Datasets - Alluxio, Inc.
Data Orchestration Summit
www.alluxio.io/data-orchestration-summit-2019
November 7, 2019
Apache Iceberg - A Table Format for Huge Analytic Datasets
Speaker:
Ryan Blue, Netflix
For more Alluxio events: https://www.alluxio.io/events/
Introducing the Apache Flink Kubernetes Operator - Flink Forward
Flink Forward San Francisco 2022.
The Apache Flink Kubernetes Operator provides a consistent approach to managing Flink applications automatically, without any human interaction, by extending the Kubernetes API. Given the increasing adoption of Kubernetes-based Flink deployments, the community has been working on a Kubernetes-native solution as part of Flink that can benefit from the rich experience of community members and ultimately make Flink easier to adopt. In this talk we give a technical introduction to the Flink Kubernetes Operator and demonstrate the core features and use cases through in-depth examples.
by
Thomas Weise
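A FlinkDeployment custom resource, as managed by the operator, looks roughly like this. Field names follow the operator's v1beta1 API; treat the image tag, resource sizes, and jar path as placeholder assumptions and check the operator docs for your version:

```yaml
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: basic-example        # placeholder name
spec:
  image: flink:1.15          # placeholder Flink image tag
  flinkVersion: v1_15
  flinkConfiguration:
    taskmanager.numberOfTaskSlots: "2"
  serviceAccount: flink
  jobManager:
    resource:
      memory: "2048m"
      cpu: 1
  taskManager:
    resource:
      memory: "2048m"
      cpu: 1
  job:
    jarURI: local:///opt/flink/examples/streaming/StateMachineExample.jar
    parallelism: 2
    upgradeMode: stateless
```

Applying this manifest is all that is needed; the operator reconciles the declared state into a running JobManager, TaskManagers, and job submission.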
Streaming Pipelines in Kubernetes Using Apache Pulsar, Heron and BookKeeper - Karthik Ramasamy
Enterprises are increasingly building applications to take advantage of real-time workloads, and Kubernetes is fast becoming the de facto scheduler and orchestration system for distributed systems. In this session, Karthik Ramasamy will show how easy it is to deploy Apache Pulsar for queuing and streaming and Twitter Heron for stream processing using Kubernetes, and how end-to-end data pipelines are built. Pulsar and Heron are at the core of Streamlio's end-to-end real-time platform, which is ideally suited to enterprises building next-generation real-time pipelines.
Real Time Processing Using Twitter Heron by Karthik RamasamyData Con LA
Abstract: Today's enterprises are not only producing data in high volume but also at high velocity. With velocity comes the need to process the data in real time. To meet these real-time needs, we developed and deployed Heron, the next generation streaming engine, at Twitter. Heron processes billions and billions of events per day at Twitter and has been in production for nearly 3 years. Heron provides unparalleled performance at large scale and has been successfully meeting Twitter's strict performance requirements for various streaming and IoT applications. Heron is an open source project with several major contributors from various institutions. As part of the project, we identified and implemented several optimizations that improved throughput by an additional 5x and reduced latency by 50-60%. In this talk, we will describe Heron in detail, including how detailed profiling revealed performance bottlenecks such as multiple serializations/deserializations and immutable data structures. After mitigating these costs, we were able to show much higher throughput and latencies as low as 12ms.
"Traffic Speed Control System in the Cloud using Machine Learning" by Albert ...DevClub_lv
This talk will be about the difficulties they have met along the road, and how they overcame some of them. It will cover both the technical and the management side.
Alberts Bušinskis (Garuda Dev Giri) - Yoga & Meditation teacher, IT enthusiast, software developer. Many years ago he left a very good job in Switzerland and went to the Indian Himalayas to practice Yoga. After realizing that software development is not so different from Yoga, he came back to his homeland, trying to be helpful and inspiring.
Big Data & Machine Learning Pipelines: A Tale of Lambdas, Kappas and PancakesOsama Khan
Data Lake, Business Intelligence, Enterprise Data Warehouse, Big Data Pipeline, Online Machine Learning, Lambda Architecture, Streaming, Spark, Kafka, Storm, Flink, Hadoop, Mesos and SMACK stack are some of the things you hear about when you want to dive into building a data pipeline. The Big Data Landscape cannot fit on a single screen as seen in the presentation. This is in addition to all the Big Data & Machine Learning offerings AWS has been introducing over the past few years which address many pain points highlighted by the various communities and help you get up and running faster.
The objective of this talk is to provide the audience with a framework which helps them define their pipeline problems, isolate components and pick the right tools for the right job.
We will talk about:
1. A consistent definition of BIG in big data
2. The lineage of fundamental tools in the ecosystem
3. First principles of a big data pipeline based on the lambda (not lambda functions) and kappa architectures
4. Distinguishing between big data and online machine learning pipelines
5. Technology choices based on first principles, open source solutions and AWS offerings
6. Demo: Serverless, Managed Big Data Pipeline and real-time dashboard on AWS (orchestrated via Terraform)
Presented at: https://www.meetup.com/Vancouver-Amazon-Web-Services-User-Group/events/245946651/
Data pipelines are hard to build and maintain. This is due to the complexity of the big data open source ecosystem, which has numerous software projects, each specializing in solving one piece of the puzzle. In this talk, we will focus on three key open source projects - Apache Pulsar, Apache Heron and Apache BookKeeper - and how they are integrated to make it easy to build data pipelines.
EDA Meets Data Engineering – What's the Big Deal?confluent
Presenter: Guru Sattanathan, Systems Engineer, Confluent
Event-driven architectures have been around for many years, much like Apache Kafka®, which was first open sourced in 2011. The reality is that the true potential of Kafka is only being realised now. Kafka is becoming the central nervous system of many of today's enterprises. It is bringing a profound paradigm shift to the way we think about enterprise IT. What has changed in Kafka to enable this paradigm shift? How is it more than just a message broker, and how are enterprises using it today? This session will explore these key questions.
Sydney: https://content.deloitte.com.au/20200221-tel-event-tech-community-syd-registration
Melbourne: https://content.deloitte.com.au/20200221-tel-event-tech-community-mel-registration
Big Data Day LA 2016/ Big Data Track - Twitter Heron @ Scale - Karthik Ramasa...Data Con LA
Twitter generates billions and billions of events per day. Analyzing these events in real time presents a massive challenge. Twitter designed and deployed a new streaming system called Heron. Heron has been in production nearly 2 years and is widely used by several teams for diverse use cases. This talk looks at Twitter's operating experiences and challenges of running Heron at scale and the approaches taken to solve those challenges.
My perspective on the evolution of big data from the viewpoint of a distributed systems researcher and engineer - the background of how it got started, the scale-out paradigm, industry use cases, the open source development paradigm, and interesting future challenges.
4 TeraGrid Sites Have Focal Points:
SDSC – The Data Place
Large-scale and high-performance data analysis/handling
Every Cluster Node is Directly Attached to SAN
NCSA – The Compute Place
Large-scale, Large Flops computation
Argonne – The Viz place
Scalable Viz walls
Caltech – The Applications place
Data and flops for applications – Especially some of the GriPhyN Apps
Specific machine configurations reflect this
Seeks to give you enough information about how NoSQL databases work to let you answer the question: "Does a NoSQL database make sense for my next project?"
Introduction to Artificial Intelligence. Not complex, and should be relatively easy to follow. Be aware that, due to its high-level nature (and no voice-over), some care should be taken with the simplified examples used.
The engineering teams within Splunk have been using several technologies - Kinesis, SQS, RabbitMQ and Apache Kafka - for enterprise-wide messaging for the past few years, but have recently made the decision to pivot toward Apache Pulsar, migrating both existing use cases and embedding it into new cloud-native service offerings such as the Splunk Data Stream Processor (DSP).
Data processing use cases, from transformation to analytics, perform tasks that require various combinations of queuing, streaming & lightweight processing steps. Until now, supporting all of those needs has required different systems for each task--stream processing engines, messaging queuing middleware, & streaming messaging systems. That has led to increased complexity for development & operations.
In this session, we'll discuss the need to unify these capabilities in a single system and how Apache Pulsar was designed to address that. Apache Pulsar is a next generation distributed pub-sub system that was developed and deployed at Yahoo. Streamlio's Karthik Ramasamy will explain how the architecture and design of Pulsar provide the flexibility to support developers and applications needing any combination of queuing, messaging, streaming and lightweight compute.
Presented at SF Big Analytics Meetup
Online event processing applications often require the ability to ingest, store, dispatch and process events. Until now, supporting all of these needs has required a different system for each task - stream processing engines, message queuing middleware, and pub/sub messaging systems. This has led to unnecessary complexity in both development and operations, raising the barrier to adoption in enterprises. In this talk, Karthik will outline the need to unify these capabilities in a single system and make them easy to develop and operate at scale. Karthik will delve into how Apache Pulsar was designed to address this need with an elegant architecture. Apache Pulsar is a next generation distributed pub-sub system that was originally developed and deployed at Yahoo and is running in production at more than 100 companies. Karthik will explain how the architecture and design of Pulsar provide the flexibility to support developers and applications needing any combination of queuing, messaging, streaming and lightweight compute for events. Furthermore, he will present real-life use cases showing how Apache Pulsar is used for event processing, ranging from data processing tasks to web processing applications.
Creating Data Fabric for #IOT with Apache PulsarKarthik Ramasamy
In this presentation, we go into detail about how a distributed data fabric using Apache Pulsar can be created in an IoT environment - acquiring, filtering, moving and distributing data from a myriad of devices to the cloud.
Linked In Stream Processing Meetup - Apache PulsarKarthik Ramasamy
Apache Pulsar is the next generation messaging system that uses a fundamentally different architecture to achieve durability, performance, scalability, efficiency, multi-tenancy and geo replication.
Twitter's Real Time Stack - Processing Billions of Events Using Distributed L...Karthik Ramasamy
Twitter generates billions and billions of events per day. Analyzing these events in real time presents a massive challenge. In order to meet this challenge, Twitter designed an end-to-end real-time stack consisting of DistributedLog, a distributed and replicated messaging system, and Heron, the streaming system for real-time computation. DistributedLog is a replicated log service that is built on top of Apache BookKeeper, providing infinite, ordered, append-only streams that can be used for building robust real-time systems. It is the foundation of Twitter's publish-subscribe system. Twitter Heron is the next generation streaming system built from the ground up to address our scalability and reliability needs. Both systems have been in production for nearly two years and are widely used at Twitter in a range of diverse applications such as the search ingestion pipeline, ad analytics, image classification and more. These slides describe Heron and DistributedLog in detail, covering a few use cases in-depth and sharing the operating experiences and challenges of running large-scale real-time systems at scale.
This paper describes the use of Storm at Twitter. Storm is a realtime fault-tolerant and distributed stream data processing system. Storm is currently being used to run various critical computations in Twitter at scale, and in real time. This paper describes the architecture of Storm and its methods for distributed scale-out and fault-tolerance. It also describes how queries (aka topologies) are executed in Storm, and presents some operational stories based on running Storm at Twitter. We also present results from an empirical evaluation demonstrating the resilience of Storm in dealing with machine failures. Storm is under active development at Twitter, and we also present some potential directions for future work.
Storm is a realtime fault-tolerant and distributed stream data processing system. Storm is used to run various critical computations in Twitter at scale, and in real time. This talk was presented at the SIGMOD 2014 conference, which accepted our paper Storm@Twitter. It describes the architecture of Storm and its methods for distributed scale-out and fault-tolerance. This presentation also describes how queries (aka topologies) are executed in Storm, and presents some operational stories based on running Storm at Twitter. We also present results from an empirical evaluation demonstrating the resilience of Storm in dealing with machine failures.
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...John Andrews
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Adjusting primitives for graph : SHORT REPORT / NOTESSubhajit Sahu
Graph algorithms, like PageRank, are often run over Compressed Sparse Row (CSR), an adjacency-list based graph representation.
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...pchutichetpong
M Capital Group (“MCG”) expects demand to grow and supply to evolve, facilitated by institutional investment rotating out of offices and into work from home (“WFH”), alongside the ever-expanding need for data storage as global internet usage expands, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as progressing cloud services and edge sites, allowing the industry to see strong expected annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, represented through the recent second bankruptcy filing of Sungard, which blames “COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services”, the industry has seen key adjustments, where MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment will be driving market momentum forward. The continuous injection of capital by alternative investment firms, as well as the growing infrastructural investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x larger by value in 2026, will likely help propel center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
4. 4
Increasingly Connected World
Internet of Things
30 B connected devices by 2020
Health Care
153 Exabytes (2013) -> 2314 Exabytes (2020)
Machine Data
40% of digital universe by 2020
Connected Vehicles
Data transferred per vehicle per month
4 MB -> 5 GB
Digital Assistants (Predictive Analytics)
$2B (2012) -> $6.5B (2019) [1]
Siri/Cortana/Google Now
Augmented/Virtual Reality
$150B by 2020 [2]
Oculus/HoloLens/Magic Leap
7. 7
The New Era: Streaming Data/Fast Data
Events are analyzed and processed in real-time as they arrive
Decisions are timely, contextual and based on fresh data
Decision latency is eliminated
Data in motion
13. 13
Real Time Use Cases
Algorithmic trading
Online fraud detection
Geo fencing
Proximity/location tracking
Intrusion detection systems
Traffic management
Real time recommendations
Churn detection
Internet of things
Social media/data analytics
Gaming data feed
24. 24
Pulsar Producer
PulsarClient client = PulsarClient.create(
    "http://broker.usw.example.com:8080");
Producer producer = client.createProducer(
    "persistent://my-property/us-west/my-namespace/my-topic");
// handles retries in case of failure
producer.send("my-message".getBytes());
// Async version:
producer.sendAsync("my-message".getBytes()).thenRun(() -> {
    // Message was persisted
});
25. 25
Pulsar Consumer
PulsarClient client = PulsarClient.create(
    "http://broker.usw.example.com:8080");
Consumer consumer = client.subscribe(
    "persistent://my-property/us-west/my-namespace/my-topic",
    "my-subscription-name");
while (true) {
    // Wait for a message
    Message msg = consumer.receive();
    // getData() returns byte[], so decode it before printing
    System.out.println("Received message: " + new String(msg.getData()));
    // Acknowledge the message so that it can be deleted by the broker
    consumer.acknowledge(msg);
}
28. 28
Broker Failure Recovery
• Topic is reassigned to an available broker based on load
• Can reconstruct the previous state consistently
• No data needs to be copied
• Failover handled transparently by client library
29. 29
Bookie Failure Recovery
• After a write failure, BookKeeper will immediately switch writes to a new bookie, within the same segment.
30. 30
Bookie Failure Recovery
• In the background, starts a many-to-many recovery process to regain the configured replication factor.
31. 31
Pulsar Architecture
Geo Replication
GEO REPLICATION
Asynchronous replication
Integrated in the broker message flow
Simple configuration to add/remove regions
(Diagram: topic T1 replicated across Data Center A, Data Center B and Data Center C; producers P1, P2 and P3 publish to their local copy of the topic, while consumers C1 and C2 consume through subscription S1 in their local regions.)
32. 32
Pulsar Use Cases - Message Queue
(Diagram: online events are published to a topic and consumed by Workers 1-3, decoupling online and offline processing.)
MESSAGE QUEUES
Decouple online or background processing
High availability
Reliable data transport
Notifications
Long running tasks
Low latency
33. 33
Pulsar Use Cases - Feedback System
(Diagram: serving systems publish feedback to an event topic; a controller consumes it and propagates state updates back to the serving systems through a second topic.)
FEEDBACK SYSTEM
Coordinate a large number of machines
Propagate states
Examples: state propagation, personalization, ad-systems
42. 42
Heron Groupings
Shuffle Grouping: random distribution of tuples
Fields Grouping: group tuples by a field or multiple fields
All Grouping: replicates tuples to all tasks
Global Grouping: send the entire stream to one task
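As a rough illustration (not the Heron API itself), each grouping can be viewed as a rule mapping a tuple to one or more downstream task indices; the class and method names below are hypothetical:

```java
import java.util.concurrent.ThreadLocalRandom;

// Illustrative sketch of how the four groupings route a tuple to
// downstream task indices in the range 0..numTasks-1.
class Groupings {
    // Shuffle grouping: pick a random task, spreading load evenly.
    static int shuffle(int numTasks) {
        return ThreadLocalRandom.current().nextInt(numTasks);
    }

    // Fields grouping: hash the grouping field, so the same field
    // value always lands on the same task.
    static int fields(Object fieldValue, int numTasks) {
        return Math.floorMod(fieldValue.hashCode(), numTasks);
    }

    // All grouping: replicate the tuple to every task.
    static int[] all(int numTasks) {
        int[] targets = new int[numTasks];
        for (int i = 0; i < numTasks; i++) targets[i] = i;
        return targets;
    }

    // Global grouping: the entire stream goes to a single task.
    static int global() {
        return 0;
    }
}
```

Fields grouping is what makes stateful operations like word count correct: every occurrence of the same word is routed to the same counting task.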
44. 44
Writing Heron Topologies
Procedural - Low Level API: directly write your spouts and bolts
Functional - Mid Level API: use of maps, flat maps, transform, windows
Declarative - SQL (coming): use of a declarative language - specify what you want, the system will figure it out
45. 45
Streamlet Functional API
Builder.newBuilder()
    .newSource(() -> StreamletUtils.randomFromList(SENTENCES))
    .flatMap(sentence -> Arrays.asList(sentence.toLowerCase().split("\\s+")))
    .reduceByKeyAndWindow(word -> word, word -> 1,
        WindowConfig.TumblingCountWindow(50),
        (x, y) -> x + y);
64. 64
Reacting to Failures - Brokers
Brokers don’t have durable state
Easily replaceable
Topics are immediately reassigned to healthy brokers
Expanding capacity
Simply add a new broker node
If other brokers are overloaded, traffic will be automatically reassigned
Load manager
Monitors traffic load on all brokers (CPU, memory, network, topics)
Initially places topics on the least loaded brokers
Reassigns topics when a broker is overloaded
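The placement rule in the last bullet can be sketched in a few lines; the class names and the scalar load metric here are illustrative, not Pulsar's actual load manager API:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of least-loaded topic placement: each broker reports a load
// score, and a new topic is assigned to the broker with the lowest score.
class LoadManagerSketch {
    private final Map<String, Double> brokerLoad = new HashMap<>();

    // Record (or refresh) the reported load for a broker.
    void reportLoad(String broker, double load) {
        brokerLoad.put(broker, load);
    }

    // Place a new topic on the least loaded broker; null if none known.
    String placeTopic() {
        String best = null;
        double min = Double.MAX_VALUE;
        for (Map.Entry<String, Double> e : brokerLoad.entrySet()) {
            if (e.getValue() < min) {
                min = e.getValue();
                best = e.getKey();
            }
        }
        return best;
    }
}
```

Reassignment on overload follows the same idea: when a broker's score crosses a threshold, its topics become candidates for placement again.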
65. 65
Reacting to Failures - Bookies
When a bookie fails, brokers will immediately continue on other bookies
The Auto-Recovery mechanism will re-establish the replication factor in the background
If a bookie keeps giving errors or timeouts, it will be “quarantined”
Not considered for new ledgers for some period of time
71. 71
Heron Happy Facts :)
No more pages during midnight for the Heron team
Very rare incidents for Heron customer teams
Easy to debug during incidents for quick turnaround
Reduced resource utilization, saving cost
79. 79
Self Regulating Streaming Systems
Tuning: the manual, time-consuming and error-prone task of tuning various system knobs to achieve SLOs
SLO Maintenance: maintaining SLOs in the face of unpredictable load variations and hardware or software performance degradation
Self Regulating Streaming Systems: systems that adjust themselves to environmental changes and continue to produce results
80. 80
Self Regulating Streaming Systems
Self tuning: several tuning knobs and a time-consuming tuning phase; the system should take as input an SLO and automatically configure the knobs
Self stabilizing: stream jobs are long running and load variations are common; the system should react to external shocks and automatically reconfigure itself
Self healing: system performance can be affected by hardware or software delivering degraded quality of service; the system should identify internal faults and attempt to recover from them
81. 81
Enter Dhalion
Dhalion is a policy-based framework integrated into Heron. Dhalion periodically executes well-specified policies that optimize execution based on some objective. We created policies that dynamically provision resources in the presence of load variations and auto-tune streaming applications so that a throughput SLO is met.
82. 82
Dhalion Policy Framework
(Diagram: metrics feed Symptom Detectors 1..N in the symptom detection stage; the detected symptoms flow into Diagnosers 1..M, which generate Diagnoses 1..M; resolver selection then picks among Resolvers 1..M, and the chosen resolver is invoked.)
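The policy pipeline (metrics → symptom detection → diagnosis generation → resolver selection and invocation) can be sketched as plain interfaces and a single evaluation pass; all names, metric keys and thresholds below are illustrative, not Dhalion's actual API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch of one evaluation cycle of a Dhalion-style policy.
class DhalionPolicySketch {
    interface SymptomDetector { List<String> detect(Map<String, Double> metrics); }
    interface Diagnoser { String diagnose(List<String> symptoms); } // null if no match
    interface Resolver { String resolve(); }

    static String runOnce(Map<String, Double> metrics,
                          List<SymptomDetector> detectors,
                          List<Diagnoser> diagnosers,
                          Map<String, Resolver> resolvers) {
        // 1. Symptom detection: every detector inspects the metrics.
        List<String> symptoms = new ArrayList<>();
        for (SymptomDetector d : detectors) symptoms.addAll(d.detect(metrics));
        // 2. Diagnosis generation + 3. resolver selection/invocation.
        for (Diagnoser d : diagnosers) {
            String diagnosis = d.diagnose(symptoms);
            if (diagnosis != null && resolvers.containsKey(diagnosis)) {
                return resolvers.get(diagnosis).resolve();
            }
        }
        return "no-op"; // nothing detected: leave the topology alone
    }
}
```

A scaling policy, for instance, would pair a backpressure detector with an "under-provisioned" diagnoser and a resolver that increases parallelism.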
92. 92
DATA SKETCHES
Early work
Counting [Morris, 1977]
Membership [Bloom, 1970]
Median of a sequence [Munro and Paterson, 1980]
Frequent Elements [Misra and Gries, 1982]
Counting distinct elements [Flajolet and Martin, 1985]
The space complexity of approximating the frequency moments [Alon et al. 1996]
Computing on Data Streams [Henzinger et al. 1998]
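The membership line refers to Bloom's 1970 filter; a minimal Java sketch (the double-hashing scheme used to derive the k probe positions is an illustrative choice, not prescribed by the paper):

```java
import java.util.BitSet;

// Minimal Bloom filter: k hash positions over an m-bit array.
// May return false positives, never false negatives.
class BloomFilter {
    private final BitSet bits;
    private final int m; // number of bits
    private final int k; // number of hash functions

    BloomFilter(int m, int k) {
        this.bits = new BitSet(m);
        this.m = m;
        this.k = k;
    }

    // Derive the i-th probe position from two base hashes (double hashing).
    private int hash(String item, int i) {
        int h1 = item.hashCode();
        int h2 = (h1 >>> 16) | (h1 << 16);
        return Math.floorMod(h1 + i * h2, m);
    }

    void add(String item) {
        for (int i = 0; i < k; i++) bits.set(hash(item, i));
    }

    boolean mightContain(String item) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(hash(item, i))) return false;
        }
        return true;
    }
}
```

With roughly 10 bits per inserted element and k around 7, the false positive rate stays near 1%, which is why the structure shows up throughout streaming systems.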
121. Count-min Sketch [1]
A two-dimensional array count with w columns and d rows
Each entry of the array is initially zero
d hash functions h1, …, hd : {1, …, n} → {1, …, w}, chosen uniformly at random from a pairwise independent family
Update
For a new element i, for each row j and k = hj(i), increment the kth column by one
Point query
The estimate for element i is the minimum over rows j of count[j, hj(i)]
Parameters
Given accuracy targets (ε, δ), set w = ⌈e/ε⌉ and d = ⌈ln(1/δ)⌉
[1] Cormode, Graham; S. Muthukrishnan (2005). "An Improved Data Stream Summary: The Count-Min Sketch and its Applications". J. Algorithms 55: 29-38.
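The update and point-query rules above can be sketched compactly in Java; the pairwise independent hash family ((a·x + b) mod p) mod w used here is one standard choice:

```java
import java.util.Random;

// Count-min sketch: width w = ceil(e/eps), depth d = ceil(ln(1/delta)).
// With nonnegative updates, estimate() never underestimates the true count.
class CountMinSketch {
    private static final long P = 2147483647L; // Mersenne prime 2^31 - 1
    private final int w, d;
    private final long[][] count;
    private final long[] a, b;

    CountMinSketch(double eps, double delta, long seed) {
        this.w = (int) Math.ceil(Math.E / eps);
        this.d = (int) Math.ceil(Math.log(1.0 / delta));
        this.count = new long[d][w];
        this.a = new long[d];
        this.b = new long[d];
        Random rnd = new Random(seed);
        for (int j = 0; j < d; j++) {
            a[j] = 1 + rnd.nextInt((int) P - 1);
            b[j] = rnd.nextInt((int) P);
        }
    }

    private int hash(int j, long x) {
        return (int) (((a[j] * x + b[j]) % P) % w);
    }

    // Update: for each row j, increment column h_j(item).
    void add(long item, long amount) {
        for (int j = 0; j < d; j++) count[j][hash(j, item)] += amount;
    }

    // Point query: minimum over rows; overestimates by at most eps*N
    // with probability at least 1 - delta.
    long estimate(long item) {
        long min = Long.MAX_VALUE;
        for (int j = 0; j < d; j++) min = Math.min(min, count[j][hash(j, item)]);
        return min;
    }
}
```

For ε = δ = 0.01 this needs only a 5 × 272 array of counters, regardless of stream length.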
Frequent Elements
125. 125
A very rich history - over 150 years
Anomalies are contextual in nature
“DISCORDANT observations may be defined as those which present the appearance of differing in respect of their law of frequency from other observations with which they are combined. In the treatment of such observations there is great diversity between authorities; but this discordance of methods may be reduced by the following reflection. Different methods are adapted to different hypotheses about the cause of a discordant observation; and different hypotheses are true, or appropriate, according as the subject-matter, or the degree of accuracy required, is different.”
F. Y. Edgeworth, “On Discordant Observations”, 1887.
Anomaly Detection
128. 128
Anomaly Detection
ROBUST MEASURES
MEDIAN
MAD [1] - Median Absolute Deviation
MCD [2] - Minimum Covariance Determinant
MVEE [3,4] - Minimum Volume Enclosing Ellipsoid
[1] P. J. Rousseeuw and C. Croux, “Alternatives to the Median Absolute Deviation”, 1993.
[2] http://onlinelibrary.wiley.com/wol1/doi/10.1002/wics.61/abstract
[3] P. J. Rousseeuw and A. M. Leroy, “Robust Regression and Outlier Detection”, 1987.
[4] M. J. Todd and E. A. Yıldırım, “On Khachiyan's algorithm for the computation of minimum-volume enclosing ellipsoids”, 2007.
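Median and MAD are simple to compute directly; below is a small sketch with a robust z-score anomaly rule, where the 3.0 cutoff is an illustrative choice rather than anything prescribed by the references:

```java
import java.util.Arrays;

// Median and MAD (Median Absolute Deviation), plus a robust
// outlier test: flag x when |x - median| / MAD exceeds a cutoff.
class RobustAnomaly {
    static double median(double[] xs) {
        double[] s = xs.clone();
        Arrays.sort(s);
        int n = s.length;
        return n % 2 == 1 ? s[n / 2] : (s[n / 2 - 1] + s[n / 2]) / 2.0;
    }

    static double mad(double[] xs) {
        double med = median(xs);
        double[] dev = new double[xs.length];
        for (int i = 0; i < xs.length; i++) dev[i] = Math.abs(xs[i] - med);
        return median(dev);
    }

    static boolean isAnomaly(double x, double[] xs) {
        double med = median(xs), m = mad(xs);
        if (m == 0) return x != med; // degenerate: all deviations zero
        return Math.abs(x - med) / m > 3.0;
    }
}
```

Unlike mean and standard deviation, both statistics tolerate up to 50% contaminated points, which is exactly why they are listed as robust measures.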
143. 143
Apache BookKeeper - Stream Storage
A storage system for log streams
Replicated, durable storage of log streams
Provides a fast tailing/streaming facility
Optimized for immutable data
Low-latency durability
Simple repeatable read consistency
High write and read availability
153. 153
What do we really mean by Compute?
Compute Unification
The aim is to react to events as they happen, in real-time
Where do events happen/arrive? On the message bus
What's a reaction? An action/transformation/function
161. 161
What's needed: Stream-Native Compute
Insight
Simplest possible API: method/procedure/function
Multi-language API: scale developers
Message bus native concepts: input/output/log as topics
Flexible runtime: simple standalone applications vs system managed applications
168. 168
Pulsar Functions
WordCount Topology
import org.apache.pulsar.functions.api.Context;
import org.apache.pulsar.functions.api.PulsarFunction;

public class CounterFunction implements PulsarFunction<String, Void> {
    @Override
    public Void process(String input, Context context) throws Exception {
        // split on whitespace; note that split(".") would treat "." as a
        // regex matching any character and yield no words
        for (String word : input.split("\\s+")) {
            context.incrCounter(word, 1);
        }
        return null;
    }
}
169. 169
Pulsar Functions
Running as a standalone application
bin/pulsar-admin functions localrun \
  --input persistent://sample/standalone/ns1/test_input \
  --output persistent://sample/standalone/ns1/test_result \
  --className org.mycompany.ExclamationFunction \
  --jar myjar.jar
Runs as a standalone process
Run as many instances as you want; the framework automatically balances data
Run and manage via Mesos/K8s/Nomad/your favorite tool
179. 179
BookKeeper - Use Cases
Combine messaging and storage
Stream storage combines the functionality of streaming and storage
WAL (Write-Ahead Log), Message Store, Object Store, Snapshots, Stream Processing
180. 180
Readings
FOCS’00: Clustering Data Streams
SIGMOD’02: Querying and mining data streams: You only get one look
SIAM Journal of Computing’09: Stream Order and Order Statistics: Quantile Estimation in Random-Order Streams
PODS’02: Models and Issues in Data Stream Systems
SIGMOD’07: Statistical Analysis of Sketch Estimators
PODS’10: An optimal algorithm for the distinct elements problem
181. 181
Readings
SODA’10: Coresets and Sketches for high dimensional subspace approximation problems
SIGMOD’16: Time Adaptive Sketches (Ada-Sketches) for Summarizing Data Streams
SOSR’17: Heavy-Hitter Detection Entirely in the Data Plane
PODS’12: Graph Sketches: Sparsification, Spanners, and Subgraphs
Arxiv’16: Coresets and Sketches
ACM Queue’17: Data Sketching: The approximate approach is often faster and more efficient
182.
183. 183
GET IN TOUCH
Contact Us
@arun_kejariwal
@kramasamy, @sanjeerk
@sijieg, @merlimat
@nlu90
karthik@streamlio.io
arun_kejariwal@acm.org
184.
185. Enjoy the Presentation
The End