A Tutorial
Modern Real-Time Streaming
Architectures
Karthik	Ramasamy*,	Sanjeev	Kulkarni*,	Arun	Kejariwal^	&	Sijie	Guo*
*Streamlio,	^MZ
2
OUTLINE
MESSAGING, STREAMING, DATA SKETCHES
LAMBDA, KAPPA, UNIFICATION
3
Information Age
Real time is key
4
Increasingly Connected World
Internet of Things
30	B	connected	devices	by	2020
Health Care
153	Exabytes	(2013)	->	2314	Exabytes	(2020)
Machine Data
40%	of	digital	universe	by	2020
Connected Vehicles
Data	transferred	per	vehicle	per	month	
4	MB	->	5	GB
Digital Assistants (Predictive Analytics)
$2B	(2012)	->	$6.5B	(2019)	[1]	
Siri/Cortana/Google	Now
Augmented/Virtual Reality
$150B	by	2020	[2]	
Oculus/HoloLens/Magic	Leap
5
TO STREAMING
6
Traditional Data Processing
Challenges	
Introduces too much "decision latency"
Responses are delivered "after the fact"
Maximum value of the identified situation is lost
Decisions are made on old and stale data
Data	at	Rest
Store Analyze Act
7
The New Era: Streaming Data/Fast Data
Events are analyzed and processed in real time as they arrive
Decisions are timely, contextual and based on fresh data
Decision latency is eliminated
Data in motion
Use Cases
9
Product Safety
Observations
Fight spammy content, engagements, and behaviors in Twitter
Spam campaigns come in large batches
Despite randomized tweaks, enough similarity among spammy entities is preserved
Requirement
Real time - a competition with spammers (i.e., "detect" vs "mutate")
Generic - need to support all common feature representations
10
Product Safety - System Overview
KV	store	for	
clustering
Messaging	
System	
(Event	Bus)
Similarity	Clustering	(Heron)
11
Real Time Ads
KV	store
Messaging	
System	
(Event	Bus)
Ads	Serving	(Heron)
Ads Prediction (Heron)
Impressions
Spend
Ads Analytics (Heron)
Engagements
Spend
Ads	Requests
Ads	Responses
Impressions
Spend
12
Connected Cars
KV	store	for	
clustering
Messaging	
System
Traffic Patterns
Messaging	
System Data	Capture/Filter
Fuel	Efficiency
13
Real Time Use Cases
Algorithmic	trading	
Online fraud detection
Geo fencing
Proximity/location tracking
Intrusion detection systems
Traffic management
Real time recommendations
Churn detection
Internet of things
Social media/data analytics
Gaming	data	feed
Architectural Pattern
15
Recurring Pattern
Messaging: Data Ingestion
Process: Data Processing
Storage: Results Storage, Data Storage
Data Serving
16
State of the World
Aggregation
Systems
Messaging
Systems
Result
Engine
HDFS
Queryable
Engines
17
Towards Unification and Simplification
Interactive
Querying
Storm API Streamlets SQL
Application
Builder
Pulsar
API
BK/
HDFS
API
Kubernetes
Metadata
Management
Operational
Monitoring
Chargeback
Security
Authentication
Quota
Management
Kafka
API
18
Real Time Stack
Collectors
Compute
Messaging
Storage
19
of MESSAGING FRAMEWORKS
An In-Depth
Apache Pulsar
21
Current Messaging Systems
ActiveMQ
RabbitMQ
Pulsar
RocketMQ
Azure Event Hub
Google Pub-Sub
Satori
Kafka
22
Why Apache Pulsar?
Ordering
Guaranteed ordering
Multi-tenancy
A single cluster can support many tenants and use cases
High throughput
Can reach 1.8 M messages/s in a single partition
Durability
Data replicated and synced to disk
Geo-replication
Out-of-the-box support for geographically distributed applications
Unified messaging model
Supports both topic and queue semantics in a single model
Delivery guarantees
At least once, at most once, and effectively once
Low latency
Low publish latency of 5 ms at the 99th percentile
Highly scalable
Can support millions of topics
23
Unified Messaging Model
Producer	(X)	
Producer	(Y)
Topic	(T)
Subscription (A)
Subscription (B)
Consumer	(A1)
Consumer	(B2)
Consumer	(B1)
Consumer	(B3)
24
Pulsar Producer
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;

// Connect to the broker's service URL
PulsarClient client = PulsarClient.create(
    "http://broker.usw.example.com:8080");
Producer producer = client.createProducer(
    "persistent://my-property/us-west/my-namespace/my-topic");
// Blocking send; the client handles retries in case of failure
producer.send("my-message".getBytes());
// Async version:
producer.sendAsync("my-message".getBytes()).thenRun(() -> {
    // Message was persisted
});
25
Pulsar Consumer
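import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.PulsarClient;

PulsarClient client = PulsarClient.create(
    "http://broker.usw.example.com:8080");
Consumer consumer = client.subscribe(
    "persistent://my-property/us-west/my-namespace/my-topic",
    "my-subscription-name");
while (true) {
    // Wait for a message
    Message msg = consumer.receive();
    // getData() returns a byte[], so decode it before printing
    System.out.println("Received message: " + new String(msg.getData()));
    // Acknowledge the message so that it can be deleted by the broker
    consumer.acknowledge(msg);
}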
26
Pulsar Architecture
Separation of Storage and Serving
SERVING
Brokers can be added independently
Traffic can be shifted quickly across brokers
STORAGE	
Bookies can be added independently
New bookies will ramp up traffic quickly
27
Segment Centric Architecture
28
Broker Failure Recovery
• Topic is reassigned to
an available broker
based on load
• Can reconstruct the
previous state
consistently
• No data needs to be
copied
• Failover handled
transparently by client
library
29
Bookie Failure Recovery
• After a write failure,
BookKeeper will
immediately switch write to
a new bookie, within the
same segment.
30
Bookie Failure Recovery
• In background, starts a
many-to-many recovery
process to regain the
configured replication
factor
31
Pulsar Architecture
Geo Replication
Asynchronous replication
Integrated in the broker message flow
Simple configuration to add/remove regions
[Diagram: Topic (T1) is replicated across Data Center A, Data Center B and Data Center C. Producers (P1), (P2) and (P3) publish to it; Consumers (C1) and (C2) read through Subscription (S1).]
32
Pulsar Use Cases - Message Queue
Online	Events Topic	(T)
Worker	1
Worker	2
Decouple	Online/Offline
Topic	(T)
Worker	3
MESSAGE QUEUES
Decouple online or background
High availability
Reliable data transport
Notifications
Long	running	tasks
Low	latency		
publish
33
Pulsar Use Cases - Feedback System
Event Topic	(T)
Propagate	States
Controller
Topic	(T)
Serving	System Serving	System Serving	System
FEEDBACK SYSTEM
Coordinate large number of machines
Propagate states
Examples
State propagation
Personalization
Ad-systems
Feedback
Updates
34
Pulsar in Production
3+	years	
Serves	2.3	million	topics	
100	billion	messages/day	
Average	latency	<	5	ms	
99%	15	ms	(strong	durability	guarantees)	
Zero	data	loss	
80+ applications
Self served provisioning
Full-mesh cross-datacenter replication - 8+ data centers
35
Companies using Pulsar
36
of STREAMING FRAMEWORKS
An In-Depth
37
Current Streaming Frameworks
Beam
S-Store
Spark
Flink
Heron
Storm
Apex
Kafka Streams
Apache Heron
39
Heron Terminology
Topology
Directed acyclic graph
vertices = computation, and
edges = streams of data tuples
Spouts
Sources of data tuples for the topology
Examples - Pulsar/Kafka/MySQL/Postgres
Bolts
Process incoming tuples, and emit outgoing tuples
Examples - filtering/aggregation/join/any function
40
Heron Topology
Spout 1
Spout 2
Bolt 1
Bolt 2
Bolt 3
Bolt 4
Bolt 5
41
Heron Topology - Physical Execution
Spout 1
Spout 2
Bolt 1
Bolt 2
Bolt 3
Bolt 4
Bolt 5
42
Heron Groupings
Shuffle Grouping
Random distribution of tuples
Fields Grouping
Group tuples by a field or
multiple fields
All Grouping
Replicates tuples to all tasks
Global Grouping
Send the entire stream to one
task
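The four groupings can be read as routing rules from one tuple to a set of downstream task ids. The sketch below illustrates those rules only; it is not Heron's internal routing code, and the class name `GroupingSketch`, its methods, and the choice of task 0 for global grouping are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Illustrative routing rules for the four groupings (hypothetical names). */
class GroupingSketch {
    private final int numTasks;
    private final Random random;

    GroupingSketch(int numTasks, long seed) {
        this.numTasks = numTasks;
        this.random = new Random(seed);
    }

    /** Shuffle grouping: pick one downstream task at random. */
    List<Integer> shuffle() {
        return Collections.singletonList(random.nextInt(numTasks));
    }

    /** Fields grouping: equal key fields always map to the same task. */
    List<Integer> fields(Object... keyFields) {
        int hash = Arrays.hashCode(keyFields);
        return Collections.singletonList(Math.floorMod(hash, numTasks));
    }

    /** All grouping: replicate the tuple to every task. */
    List<Integer> all() {
        List<Integer> tasks = new ArrayList<>();
        for (int i = 0; i < numTasks; i++) {
            tasks.add(i);
        }
        return tasks;
    }

    /** Global grouping: the whole stream goes to one task (task 0 assumed here). */
    List<Integer> global() {
        return Collections.singletonList(0);
    }

    public static void main(String[] args) {
        GroupingSketch g = new GroupingSketch(4, 7L);
        System.out.println("fields(user-1) -> " + g.fields("user-1"));
        System.out.println("all()          -> " + g.all());
    }
}
```

Fields grouping is the one that makes keyed aggregation possible: because every tuple with the same key lands on the same task, that task can hold the running count for the key locally.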
43
Heron Topology - Physical Execution
Spout 1
Spout 2
Bolt 1
Bolt 2
Bolt 3
Bolt 4
Bolt 5
Shuffle Grouping
Shuffle Grouping
Fields Grouping
Fields Grouping
Fields Grouping
Fields Grouping
44
Writing Heron Topologies
Procedural - Low Level API
Directly	write	your	spouts	and	
bolts
Functional - Mid Level API
Use	of	maps,	flat	maps,	transform,	windows
Declarative - SQL (coming)
Use of declarative language - specify what you want, system will figure it out.
45
Streamlet Functional API
Builder.newBuilder()
    .newSource(() -> StreamletUtils.randomFromList(SENTENCES))
    // Split each sentence into lowercase words on whitespace
    .flatMap(sentence -> Arrays.asList(sentence.toLowerCase().split("\\s+")))
    // Count each word over tumbling windows of 50 tuples
    .reduceByKeyAndWindow(word -> word, word -> 1,
        WindowConfig.TumblingCountWindow(50),
        (x, y) -> x + y);
46
Heron Design Goals
Efficiency
Reduce resource consumption
Support for diverse workloads
Throughput vs latency sensitive
Support for multiple semantics
At most once, at least once, effectively once
Native Multi-Language Support
C++, Java, Python
Task Isolation
Ease of debug-ability/isolation/profiling
Support for back pressure
Topologies should be self adjusting
Use of containers
Runs in schedulers - Kubernetes & DCOS & many more
Multi-level APIs
Procedural, Functional and Declarative for diverse applications
Diverse deployment models
Run as a service or pure library
47
Heron Architecture
Scheduler
Topology 1 Topology 2 Topology N
Topology
Submission
48
Topology Master
Monitoring of containers
Gateway for metrics
Assigns role
49
Topology Architecture
Topology Master
ZK
Cluster
Stream
Manager
I1 I2 I3 I4
Stream
Manager
I1 I2 I3 I4
Logical Plan,
Physical Plan and
Execution State
Sync Physical Plan
DATA CONTAINER DATA CONTAINER
Metrics
Manager
Metrics
Manager
Metrics
Manager
Health
Manager
MASTER
CONTAINER
50
Stream Manager
Routes tuples
Implements backpressure
Ack management
51
Stream Manager
Sample topology
S1, B2, B3, B4
52
Stream Manager
Physical execution
S1 B2
B3
Stream
Manager
Stream
Manager
Stream
Manager
Stream
Manager
S1 B2
B3 B4
S1 B2
B3
S1 B2
B3 B4
B4
53
Stream Manager Backpressure
TCP backpressure
Spout based back pressure
Stage by stage back pressure
54
Stream Manager Backpressure
TCP	based	backpressure
Slows upstream and downstream instances
S1 B2
B3
Stream
Manager
Stream
Manager
Stream
Manager
Stream
Manager
S1 B2
B3 B4
S1 B2
B3
S1 B2
B3 B4
B4
55
Stream Manager Backpressure
Spout	based	backpressure
S1 S1
S1S1S1 S1
S1S1 B2
B3
Stream
Manager
Stream
Manager
Stream
Manager
Stream
Manager
B2
B3 B4
B2
B3
B2
B3 B4
B4
56
Heron Instance
Runs only one task (spout/bolt)
Exposes Heron API
Collects several metrics
API
57
Heron Instance
Stream
Manager
Metrics
Manager
Gateway
Thread
Task Execution
Thread
data-in queue
data-out queue
metrics-out queue
58
Heron Deployment
Topology 1
Topology 2
Topology N
Heron
Tracker
Heron
VIZ
Heron
Web
ZK
Cluster
Aurora Services
Observability
59
Heron Visualization
60
Heron @Twitter
LARGEST CLUSTER
100's of TOPOLOGIES
BILLIONS OF MESSAGES
100's OF TERABYTES
REDUCED INCIDENTS
GOOD NIGHT SLEEP
3X - 5X reduction in resource usage
61
Companies using Heron
62
AND
of STREAM PROCESSING APPLICATIONS
63
Pulsar Operations
Reacting to Failures
Brokers
Bookies
Common Issues
Consumer Backlog
I/O Prioritization and Throttling
Multi-Tenancy
64
Reacting to Failures - Brokers
Brokers	don’t	have	durable	state	
Easily	replaceable	
Topics	are	immediately	reassigned	to	healthy	brokers	
Expanding	capacity	
Simply	add	new	broker	node	
If other brokers are overloaded, traffic will be automatically assigned
Load manager
Monitor traffic load on all brokers (CPU, memory, network, topics)
Initially place topics on the least loaded brokers
Reassign	topics	when	a	broker	is	overloaded
65
Reacting to Failures - Bookies
When a bookie fails, brokers will immediately continue on other bookies
Auto-Recovery mechanism will re-establish the replication factor in background
If a bookie keeps giving errors or timeouts, it will be "quarantined"
Not considered for new ledgers for some period of time
66
Consumer Backlog
Metrics	are	available	to	make	assessments:	
When	problem	started	
How	big	is	backlog?	Messages?	Disk	space?		
How	fast	is	draining?	
What’s	the	ETA	to	catch	up	with	publishers?	
Establish where the bottleneck is
Application is not fast enough
Disk read IO
67
Enforcing Multi-Tenancy
Ensure tenants don't cause performance issues on other tenants
Backlog quotas
Soft-isolation
Flow control
Throttling
In cases when user behavior is triggering performance degradation
Hard-isolation as a last resort for quick reaction while a proper fix is deployed
Isolate tenant on a subset of brokers
Can also be applied at the BookKeeper level
68
Heron Use Cases
Monitoring
Real	Time		
Machine	Learning
Ads
Real	Time	Trends
Product	Safety
Real	Time	Business	
Intelligence
69
Heron Topology Complexity
70
Heron Topology Scale
CONTAINERS - 1 TO 600
INSTANCES - 10 TO 6000
71
Heron Happy Facts :)
No more pages during midnight for Heron team
Very rare incidents for Heron customer teams
Easy to debug during incident for quick turn around
Reduced resource utilization saving cost
72
Heron Developer Issues
Container resource allocation
Parallelism tuning
73
Heron Operational Issues
Slow Hosts
Network Issues
Data Skew
Load Variations
SLA Violations
74
Slow Hosts
Memory Parity Errors
Impending Disk Failures
Lower GHz
75
Network Issues
Network Slowness
Network Partitioning
76
Network Slowness
Delays processing
Data is accumulating
Timeliness of results is affected
77
Data Skew
Multiple Keys
Several keys map into a single instance and their count is high
Single Key
A single key maps into an instance and its count is high
78
Load Variations
Spikes
Sudden surge of data - short-lived or lasting several minutes
Daily Patterns
Predictable change in traffic
79
Self Regulating Streaming Systems
Automate Tuning
SLO Maintenance
Self Regulating Streaming Systems
Tuning
Manual, time-consuming and error-prone task of tuning various system knobs to achieve SLOs
SLO
Maintenance of SLOs in the face of unpredictable load variations and hardware or software performance degradation
Self Regulating Streaming Systems
System that adjusts itself to environmental changes and continues to produce results
80
Self Regulating Streaming Systems
Self tuning
Several tuning knobs
Time consuming tuning phase
The system should take as input an SLO and automatically configure the knobs.
Self stabilizing
Stream jobs are long running
Load variations are common
The system should react to external shocks and automatically reconfigure itself
Self healing
System performance affected by hardware or software delivering degraded quality of service
The system should identify internal faults and attempt to recover from them
81
Enter Dhalion
Dhalion periodically executes
well-specified policies that
optimize execution based on
some objective.
We created policies that
dynamically provision resources
in the presence of load variations
and auto-tune streaming
applications so that a throughput
SLO is met.
Dhalion is a policy based
framework integrated into Heron
82
Dhalion Policy Framework
Symptom Detection
Metrics feed Symptom Detectors 1 to N, which emit Symptoms 1 to N
Diagnosis Generation
Diagnosers 1 to M consume the symptoms and emit Diagnoses 1 to M
Resolution
Resolver Selection picks among Resolvers 1 to M based on the diagnoses, followed by Resolver Invocation
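The three phases of the policy framework (symptom detection, diagnosis generation, resolution) can be sketched as one pass over a metrics snapshot. This is a hypothetical illustration and not Dhalion's actual API: the interfaces `Detector`, `Diagnoser` and `Resolver`, the metric name `backpressure.ms`, and the string labels are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** One pass of a detect -> diagnose -> resolve policy (hypothetical names). */
class PolicySketch {
    interface Detector { Optional<String> detect(Map<String, Double> metrics); }
    interface Diagnoser { Optional<String> diagnose(List<String> symptoms); }
    interface Resolver { Optional<String> resolve(String diagnosis); }

    static List<String> runOnce(Map<String, Double> metrics,
                                List<Detector> detectors,
                                List<Diagnoser> diagnosers,
                                List<Resolver> resolvers) {
        // Phase 1: symptom detection from metrics
        List<String> symptoms = new ArrayList<>();
        for (Detector d : detectors) {
            d.detect(metrics).ifPresent(symptoms::add);
        }
        // Phase 2: diagnosis generation from symptoms
        List<String> diagnoses = new ArrayList<>();
        for (Diagnoser d : diagnosers) {
            d.diagnose(symptoms).ifPresent(diagnoses::add);
        }
        // Phase 3: resolver selection and invocation (first matching resolver wins)
        List<String> actions = new ArrayList<>();
        for (String diagnosis : diagnoses) {
            for (Resolver r : resolvers) {
                Optional<String> action = r.resolve(diagnosis);
                if (action.isPresent()) {
                    actions.add(action.get());
                    break;
                }
            }
        }
        return actions;
    }

    /** Wires up a single backpressure-driven scale-up rule for demonstration. */
    static List<String> demo(double backpressureMs) {
        Map<String, Double> metrics = new HashMap<>();
        metrics.put("backpressure.ms", backpressureMs);

        Detector backpressure = m -> m.getOrDefault("backpressure.ms", 0.0) > 0
            ? Optional.of("BACKPRESSURE") : Optional.<String>empty();
        Diagnoser underProvisioned = s -> s.contains("BACKPRESSURE")
            ? Optional.of("UNDER_PROVISIONED") : Optional.<String>empty();
        Resolver scaleUp = d -> d.equals("UNDER_PROVISIONED")
            ? Optional.of("SCALE_UP_BOLT") : Optional.<String>empty();

        return runOnce(metrics,
            Collections.singletonList(backpressure),
            Collections.singletonList(underProvisioned),
            Collections.singletonList(scaleUp));
    }

    public static void main(String[] args) {
        System.out.println(demo(500.0));
        System.out.println(demo(0.0));
    }
}
```

Keeping the phases separate means a new symptom or a new remedy can be added without touching the others, which is presumably why the framework is organized around pluggable detectors, diagnosers and resolvers.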
83
Dynamic Resource Provisioning
Policy
This policy reacts to unexpected load variations (workload spikes)
Goal
Goal is to scale up and scale down the topology resources as needed, while keeping the topology in a steady state where back pressure is not observed
84
Dynamic Resource Provisioning
Implementation
Symptom Detection
Metrics feed the Pending Tuples Detector, Backpressure Detector and Processing Rate Skew Detector, which emit symptoms
Diagnosis Generation
Resource Over Provisioning Diagnoser
Resource Under Provisioning Diagnoser
Data Skew Diagnoser
Slow Instances Diagnoser
Resolution
Resolver Invocation
Bolt Scale Down Resolver
Bolt Scale Up Resolver
Data Skew Resolver
Restart Instances Resolver
Streaming and One-Pass
Algorithms
DATA	CHARACTERISTICS
UNBOUNDED
DATA	CHARACTERISTICS
UNORDERED	
    Varying Skew
        Event time
        Processing time
    Correctness
    Completeness
88
Netflix, HBO, Hulu, YouTube, Dailymotion, ESPN, Sling TV
Satori, Facebook Live, Periscope
Spotify, Pandora, Apple Music, Tidal
Amazon Twitch, YouTube Gaming, Microsoft Mixer
KEY	CHARACTERISTICS
LOW	LATENCY
HIGH	VELOCITY	DATA
90
KEY	CHARACTERISTICS
In-Order O(1) Storage O(n) Time
ONE PASS
91
KEY CHARACTERISTICS
DISTRIBUTED COMPUTATION
SCALE OUT
ROBUST
FAULT TOLERANCE
92
DATA SKETCHES
Early work
Membership [Bloom, 1970]
Counting [Morris, 1977]
Median of a sequence [Munro and Paterson, 1980]
Frequent Elements [Misra and Gries, 1982]
Counting [Flajolet and Martin, 1985]
The space complexity of approximating the frequency moments [Alon et al. 1996]
Computing on Data Streams [Henzinger et al. 1998]
DATA	SKETCHES
UNIQUE	
FILTER	
COUNT	
HISTOGRAM	
QUANTILE	
MOMENTS	
TOP-K
ADVANCED		DATA	SKETCHES
RANDOM PROJECTIONS, FREQUENT DIRECTIONS
    Dimensionality Reduction
RANDOMIZED NUMERICAL ALGEBRA
    Matrix multiply
GRAPHS
    Summarize adjacency
        Connectivity, k-connectivity, Spanners, Sparsification
GEOMETRIC
    Diameter, Lp distances, Min-cost matchings
    Information Distances, e.g., Hellinger distance
SKETCHING SKETCHES
    Testing independence
95
Applications
1. SAMPLING: A/B Testing
2. FILTERING: Set membership
3. CORRELATION: Fraud Detection
4. QUANTILES: Network Analysis
5. CARDINALITY: Site Audience Analysis
96
Applications
6. MOMENTS: Database
7. FREQUENT ELEMENTS: Trending Hashtags
8. CLUSTERING: Medical Imaging
9. ANOMALY DETECTION: Sensor Networks
10. SUBSEQUENCES: Traffic Analysis
97
Sampling
Obtain a representative sample from a data stream
Maintain a dynamic sample
A data stream is a continuous process
It is not known in advance how many points may elapse before an analyst needs to use a representative sample
Reservoir sampling [1]
Probabilistic insertions and deletions on arrival of new stream points
Probability of successive insertion of new points reduces with progression of the stream
An unbiased sample contains a larger and larger fraction of points from the distant history of the stream
Practical perspective
The data stream may evolve, and hence the majority of the points in the sample may represent stale history
[1] J. S. Vitter. Random Sampling with a Reservoir. ACM Transactions on Mathematical Software, Vol. 11(1):37–57, March 1985.
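A minimal sketch of unbiased reservoir sampling in the style of Vitter's Algorithm R is shown here. The class name `ReservoirSampler` and its method names are assumptions for illustration. The first k items fill the reservoir; afterwards the t-th item replaces a random slot with probability k/t, which is why insertions become rarer as the stream progresses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Keeps a uniform random sample of up to k items from a stream of unknown length. */
class ReservoirSampler<T> {
    private final int capacity;
    private final List<T> reservoir;
    private final Random random;
    private long seen = 0;

    ReservoirSampler(int capacity, long seed) {
        this.capacity = capacity;
        this.reservoir = new ArrayList<>(capacity);
        this.random = new Random(seed);
    }

    /** Offers the next stream item to the sampler. */
    void offer(T item) {
        seen++;
        if (reservoir.size() < capacity) {
            // Fill phase: the first k items are always kept
            reservoir.add(item);
            return;
        }
        // Draw a slot uniformly from [0, seen); the item is kept only if the
        // slot falls inside the reservoir, i.e. with probability k / seen
        long slot = (long) (random.nextDouble() * seen);
        if (slot < capacity) {
            reservoir.set((int) slot, item);
        }
    }

    /** Returns a copy of the current sample. */
    List<T> sample() {
        return new ArrayList<>(reservoir);
    }

    long seen() {
        return seen;
    }

    public static void main(String[] args) {
        ReservoirSampler<Integer> sampler = new ReservoirSampler<>(10, 42L);
        for (int i = 0; i < 1000; i++) {
            sampler.offer(i);
        }
        System.out.println(sampler.sample());
    }
}
```

Because every item ever seen has the same chance of being in the sample, old points dominate on a long stream, which is the staleness concern raised under the practical perspective.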
98
Sampling
Sliding window approach (sample size k, window width n)
Sequence-based
Replace expired element with newly arrived element
Disadvantage: highly periodic
Chain-sample approach
Select the ith element with probability 1/min(i, n)
Select uniformly at random an index from [i+1, i+n] for the element which will replace the ith item
Maintain k independent chain samples
Timestamp-based
# elements in a moving window may vary over time
Priority-sample approach
Example stream: 3 5 1 4 6 2 8 5 2 3 5 4 2 2 5 0 9 8 4 6 7 3
[1] B. Babcock. Sampling From a Moving Window Over Streaming Data. In Proceedings of SODA, 2002.
99
Sampling
Biased Reservoir Sampling [1]
Use a temporal bias function - recent points have higher probability of being represented in the sample reservoir
Memory-less bias functions
Future probability of retaining a current point in the reservoir is independent of its past history or arrival time
Probability of the rth point belonging to the reservoir at time t is proportional to the bias function $f(r, t)$
Exponential bias function for the rth data point at time t: $f(r, t) = e^{-\lambda (t - r)}$, where $r \le t$ and $\lambda \in [0, 1]$ is the bias rate
Maximum reservoir requirement R(t) is bounded
[1] C. C. Aggarwal. On Biased Reservoir Sampling in the Presence of Stream Evolution. In Proceedings of VLDB, 2006.
100
Filtering
Set Membership
    Determine, with some false positive probability, if an item in a data stream has been seen before
Databases (e.g., speed up semi-join operations), Caches, Routers, Storage Systems
Reduce space requirement in probabilistic routing tables
Speed up longest-prefix matching of IP addresses
Encode multicast forwarding information in packets
Summarize content to aid collaborations in overlay and peer-to-peer networks
Improve network state management and monitoring
101
Filtering
Set	Membership
Application to hyphenation programs
Early UNIX spell checkers
[1] Illustration borrowed from http://www.eecs.harvard.edu/~michaelm/postscripts/im2005b.pdf
102
Filtering
Set	Membership
Natural generalization of hashing
False positives are possible, with probability $\varepsilon \approx (1 - e^{-kn/m})^k$
where n = # elements,
      k = # hash functions,
      m = # bits in the array
No false negatives
No deletions allowed
For false positive rate ε, # hash functions = $\log_2(1/\varepsilon)$
103
Filtering
Set	Membership
Minimizing false positive rate ε w.r.t. k [1]
$k = \ln 2 \cdot (m/n)$
$\varepsilon = (1/2)^k \approx (0.6185)^{m/n}$
$1.44 \cdot \log_2(1/\varepsilon)$ bits per item
Independent of item size or # items
Information-theoretic minimum: $\log_2(1/\varepsilon)$ bits per item
44% overhead
X	=	#	0	bits
where
[1] A. Broder and M. Mitzenmacher. Network Applications of Bloom Filters: A Survey. In Internet Mathematics Vol. 1, No. 4, 2005.
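The sizing rules above translate directly into code. The following is a minimal Bloom filter sketch, not a production implementation: it derives m from the 1.44 · log2(1/ε) bits-per-item rule and k from ln 2 · (m/n). The class name `BloomFilterSketch`, the FNV-1a hash, and the double-hashing scheme used to derive the k indexes from one 64-bit hash are assumptions chosen for brevity.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

/** Bloom filter sized from the expected item count n and target false positive rate eps. */
class BloomFilterSketch {
    private final BitSet bits;
    private final int m; // number of bits in the array
    private final int k; // number of hash functions

    BloomFilterSketch(int n, double eps) {
        // m = n * log2(1/eps) / ln 2, i.e. about 1.44 * log2(1/eps) bits per item
        this.m = Math.max(1, (int) Math.ceil(-n * Math.log(eps) / (Math.log(2) * Math.log(2))));
        // k = ln 2 * (m / n) minimizes the false positive rate
        this.k = Math.max(1, (int) Math.round(Math.log(2) * m / n));
        this.bits = new BitSet(m);
    }

    void add(String item) {
        long h = fnv1a64(item);
        int h1 = (int) h;
        int h2 = (int) (h >>> 32) | 1; // odd step so the k probes differ
        for (int i = 0; i < k; i++) {
            bits.set(Math.floorMod(h1 + i * h2, m));
        }
    }

    /** False means definitely absent; true means present or a false positive. */
    boolean mightContain(String item) {
        long h = fnv1a64(item);
        int h1 = (int) h;
        int h2 = (int) (h >>> 32) | 1;
        for (int i = 0; i < k; i++) {
            if (!bits.get(Math.floorMod(h1 + i * h2, m))) {
                return false;
            }
        }
        return true;
    }

    int bitCount() { return m; }
    int hashCount() { return k; }

    // 64-bit FNV-1a; any well-mixed hash could be substituted
    private static long fnv1a64(String s) {
        long h = 0xcbf29ce484222325L;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;
        }
        return h;
    }

    public static void main(String[] args) {
        BloomFilterSketch filter = new BloomFilterSketch(1000, 0.01);
        filter.add("alice");
        System.out.println(filter.mightContain("alice") + " bits=" + filter.bitCount()
            + " hashes=" + filter.hashCount());
    }
}
```

For n = 1000 and ε = 1%, this comes to roughly 9.6 bits per item and 7 hash functions, regardless of how large the items themselves are.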
104
Filtering
Set Membership: Cuckoo Filter [1]
Key Highlights
Add and remove items dynamically
For false positive rate ε below 3%, more space efficient than Bloom filter
Higher performance than Bloom filter for many real workloads
Asymptotically worse performance than Bloom filter
Min fingerprint size ∝ log (# entries in table)
Overview
Stores only a fingerprint of an item inserted
Original key and value bits of each item not retrievable
Set membership query for item x: search hash table for fingerprint of x
[1] Fan et al., Cuckoo Filter: Practically Better Than Bloom. In Proceedings of the 10th ACM International on Conference on Emerging Networking Experiments and Technologies, 2014.
105
Filtering
Set	Membership
Cuckoo Hashing [1]
High space occupancy
Practical implementations: multiple items/bucket
Example uses: Software-based Ethernet switches
Illustration of Cuckoo hashing [2]
Cuckoo Filter [2]
Uses a multi-way associative Cuckoo hash table
Employs partial-key cuckoo hashing
Store fingerprint of an item
Relocate existing fingerprints to their alternative locations
Deletion
Item must have been previously inserted
[1] R. Pagh and F. Rodler. Cuckoo hashing. Journal of Algorithms, 51(2):122-144, 2004.
[2] Illustration borrowed from Fan et al., Cuckoo Filter: Practically Better Than Bloom. In Proceedings of the 10th ACM International on Conference on Emerging Networking Experiments and Technologies, 2014.
106
Filtering
Set	Membership
Cuckoo Filter
Partial-key cuckoo hashing
Fingerprint hashing ensures uniform distribution of items in the table
Length of fingerprint ≪ Size of h1 or h2
Possible to have multiple entries of a fingerprint in a bucket
Alternate bucket
Significantly shorter than h1 and h2
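In partial-key cuckoo hashing as described by Fan et al., the alternate bucket is computed from the current bucket and the stored fingerprint alone (current bucket XOR hash of the fingerprint), so an entry can be relocated without access to the original item. The sketch below illustrates only that index arithmetic; the class name, the 8-bit fingerprint, the table size, and the choice of mixing function are assumptions.

```java
/** Index arithmetic for partial-key cuckoo hashing (hypothetical names and sizes). */
class PartialKeyCuckoo {
    static final int BUCKETS = 1 << 16; // must be a power of two for the XOR trick

    // 32-bit finalizer from MurmurHash3, used here as a generic mixing function
    static int mix(int h) {
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    /** 8-bit fingerprint, never zero so that zero can mark an empty slot. */
    static int fingerprint(long item) {
        int f = mix(Long.hashCode(item) ^ 0x5bd1e995) & 0xff;
        return f == 0 ? 1 : f;
    }

    /** First candidate bucket, derived from the full item. */
    static int primaryBucket(long item) {
        return mix(Long.hashCode(item)) & (BUCKETS - 1);
    }

    /** Other candidate bucket, derived from a bucket index and the fingerprint only. */
    static int alternateBucket(int bucket, int fingerprint) {
        return (bucket ^ mix(fingerprint)) & (BUCKETS - 1);
    }

    public static void main(String[] args) {
        long item = 12345L;
        int fp = fingerprint(item);
        int i1 = primaryBucket(item);
        int i2 = alternateBucket(i1, fp);
        System.out.println("fp=" + fp + " i1=" + i1 + " i2=" + i2
            + " back=" + alternateBucket(i2, fp));
    }
}
```

Applying `alternateBucket` twice returns the starting bucket, which is what lets an insertion kick out an existing fingerprint and move it to its other location.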
107
Filtering
Set	Membership
Comparison
k	➛ # hash functions, d	➛ # partitions
108
Cardinality
Distinct Elements
Database systems/Search engines
# distinct queries
Network monitoring applications
Natural language processing
# distinct motifs in a DNA sequence
# distinct elements of RFID/sensor networks
109
Previous work
Probabilistic counting [Flajolet and Martin, 1985]
LogLog counting [Durand and Flajolet, 2003]
HyperLogLog [Flajolet et al., 2007]
Sliding HyperLogLog [Chabchoub and Hebrail, 2010]
HyperLogLog in Practice [Heule et al., 2013]
Self-Organizing Bitmap [Chen and Cao, 2009]
Discrete Max-Count [Ting, 2014]
Sequence of sketches forms a Markov chain when h is a strong universal hash
Estimate cardinality using a martingale
Cardinality
110
Comparison
$N \le 10^9$
Cardinality
111
HyperLogLog
Apply hash function h to every element in a multiset
Cardinality of the multiset is estimated as $2^{\max(\varrho)}$, where $0^{\varrho-1}1$ is the bit pattern observed at the beginning of a hash value
The above suffers from high variance
Employ stochastic averaging
Partition input stream into m sub-streams $S_i$ using first p bits of hash values ($m = 2^p$)
where
Cardinality
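The steps above (hash, split on the first p bits, track the maximum ϱ per sub-stream) can be sketched as a simplified HyperLogLog. It uses the standard raw estimator with the small-range linear-counting correction from Flajolet et al. and omits the later optimizations. The class name, the SplitMix64-style mixing function standing in for h, and the restriction of p to 7..16 are assumptions of the example.

```java
/** Simplified HyperLogLog: m = 2^p registers, each holding the max rho of its sub-stream. */
class HyperLogLogSketch {
    private final int p;            // precision: number of index bits
    private final int m;            // number of registers
    private final byte[] registers;

    HyperLogLogSketch(int p) {
        if (p < 7 || p > 16) {
            throw new IllegalArgumentException("p must be in [7, 16]");
        }
        this.p = p;
        this.m = 1 << p;
        this.registers = new byte[m];
    }

    void add(long item) {
        long hash = mix64(item);
        int idx = (int) (hash >>> (64 - p));   // first p bits pick the sub-stream
        long w = hash << p;                     // remaining bits
        // rho = position of the leftmost 1-bit in w, capped when w is all zeros
        int rho = Math.min(Long.numberOfLeadingZeros(w) + 1, 64 - p + 1);
        if (rho > registers[idx]) {
            registers[idx] = (byte) rho;
        }
    }

    double estimate() {
        double sum = 0.0;
        int zeros = 0;
        for (byte r : registers) {
            sum += Math.pow(2.0, -r);
            if (r == 0) {
                zeros++;
            }
        }
        double alpha = 0.7213 / (1.0 + 1.079 / m); // bias constant, valid for m >= 128
        double raw = alpha * m * m / sum;           // harmonic-mean based raw estimate
        if (raw <= 2.5 * m && zeros > 0) {
            return m * Math.log((double) m / zeros); // linear counting for small cardinalities
        }
        return raw;
    }

    // SplitMix64-style finalizer used as the hash function h (an assumption)
    static long mix64(long z) {
        z = (z ^ (z >>> 30)) * 0xbf58476d1ce4e5b9L;
        z = (z ^ (z >>> 27)) * 0x94d049bb133111ebL;
        return z ^ (z >>> 31);
    }

    public static void main(String[] args) {
        HyperLogLogSketch hll = new HyperLogLogSketch(12);
        for (long i = 0; i < 100_000; i++) {
            hll.add(i);
        }
        System.out.println("estimate = " + hll.estimate());
    }
}
```

With p = 12 the sketch holds 4096 one-byte registers, and the expected relative error is about 1.04/√m, or roughly 1.6%.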
112
HyperLogLog Optimizations
Use of 64-bit hash function
Total memory requirement $5 \cdot 2^p \to 6 \cdot 2^p$ bits, where p is the precision
Empirical bias correction
Uses empirically determined data for cardinalities smaller than 5m and uses the unmodified raw estimate otherwise
Sparse representation
For n ≪ m, store an integer obtained by concatenating the bit patterns for idx and ϱ(w)
Use variable length encoding for integers that uses variable number of bytes to represent integers
Use difference encoding - store the difference between successive elements
Other optimizations [1, 2]
[1] http://druid.io/blog/2014/02/18/hyperloglog-optimizations-for-real-world-systems.html
[2] http://antirez.com/news/75
Cardinality
113
Self-Learning Bitmap (S-bitmap) [1]
Achieve constant relative estimation errors for unknown cardinalities in a wide range, say from 10s to $> 10^6$
Bitmap obtained via adaptive sampling process
Bits corresponding to the sampled items are set to 1
Sampling rates are learned from # distinct items already passed and reduced sequentially as more bits are set to 1
For given input parameters $N_{max}$ and estimation precision ε, size of bit mask
For $r = 1 - 2\varepsilon^2 (1 + \varepsilon^2)^{-1}$ and sampling probability $p_k = m (m + 1 - k)^{-1} (1 + \varepsilon^2) r^k$, where $k \in [1, m]$
Relative error ≡ ε
[1] Chen et al. "Distinct counting with a self-learning bitmap". Journal of the American Statistical Association, 106(495):879–890, 2011.
Cardinality
114
Quantiles
Quantiles, Histograms
Large set of real-world applications
Database applications
Sensor networks
Operations
Properties
Provide tunable and explicit guarantees on the precision of approximation
Single pass
Early work
[Greenwald and Khanna, 2001] - worst case space requirement
[Arasu and Manku, 2004] - sliding window based model, worst case space requirement
115
q-digest [1]
Groups values in variable size buckets of almost equal weights
Unlike a traditional histogram, buckets can overlap
Key features
Detailed information about frequent values preserved
Less frequent values lumped into larger buckets
Using message of size m, answer within an error of
Except root and leaf nodes, a node v ∈ q-digest iff
Complete binary tree
Max signal value
# Elements
Compression Factor
[1] Shrivastava et al., Medians and Beyond: New Aggregation Techniques for Sensor Networks. In Proceedings of SenSys, 2004.
Quantiles
116
q-digest	
Building	a	q-digest	
q-digests	can	be	constructed	in	a	distributed	fashion	
	Merge	q-digests
Quantiles
117
t-digest [1]
Approximation of rank-based statistics
Compute quantile q with an accuracy relative to max(q, 1-q)
Compute hybrid statistics such as trimmed statistics
Key features
Robust with respect to highly skewed distributions
Independent of the range of input values (unlike q-digest)
Relative error is bounded
Non-equal bin sizes
Few samples contribute to the bins corresponding to the extreme quantiles
Merging independent t-digests
Reasonable accuracy
[1] T. Dunning and O. Ertl, "Computing Extremely Accurate Quantiles using t-digests", 2017. https://github.com/tdunning/t-digest/blob/master/docs/t-digest-paper/histo.pdf
Quantiles
118
t-digest
Group samples into sub-sequences
Smaller sub-sequences near the ends
Larger sub-sequences in the middle
Scaling function
Mapping k is monotonic
k(0) = 1 and k(1) = δ
k-size of each subsequence < 1
Notional index
Compression parameter
Quantile
Quantiles
119
t-digest
Estimating quantile via interpolation
Sub-sequences contain centroid of the samples
Estimate the boundaries of the sub-sequences
Error
Scales quadratically in # samples
Small # samples in the sub-sequences near q=0 and q=1 improves accuracy
Lower accuracy in the middle of the distribution
Larger sub-sequences in the middle
Two flavors
Progressive merging (buffering based) and clustering variant
Quantiles
120
Frequent Elements
Applications
Track bandwidth hogs
Determine popular tourist destinations
Itemset mining
Entropy estimation
Compressed sensing
Search log mining
Network data analysis
DBMS optimization
Count-min Sketch [1]
A two-dimensional array of counts with w columns and d rows
Each entry of the array is initially zero
d hash functions are chosen uniformly at random from a pairwise independent family
Update
For a new element i, for each row j and $k = h_j(i)$, increment the kth column by one
Point query: $\hat{a}_i = \min_j \mathrm{sketch}[j, h_j(i)]$, where sketch is the table
Parameters
121
$(\varepsilon, \delta)$
$h_1, \ldots, h_d : \{1 \ldots n\} \to \{1 \ldots w\}$
$w = \lceil e / \varepsilon \rceil$
$d = \lceil \ln(1/\delta) \rceil$
[1] Cormode, Graham; S. Muthukrishnan (2005). "An Improved Data Stream Summary: The Count-Min Sketch and its Applications". J. Algorithms 55: 29–38.
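A minimal Count-Min sketch following the description above is given here: d rows of w counters with w = ⌈e/ε⌉ and d = ⌈ln(1/δ)⌉, an update that increments one counter per row, and a point query that takes the minimum across rows. The class name and the specific pairwise-independent hash family ((a·x + b) mod p) mod w are assumptions made for illustration.

```java
import java.util.Random;

/** Count-Min sketch with d rows and w columns of counters. */
class CountMinSketch {
    private static final long PRIME = 2147483647L; // 2^31 - 1
    private final int w;
    private final int d;
    private final long[][] sketch;
    private final long[] a; // per-row hash multipliers
    private final long[] b; // per-row hash offsets

    CountMinSketch(double eps, double delta, long seed) {
        this.w = (int) Math.ceil(Math.E / eps);
        this.d = (int) Math.ceil(Math.log(1.0 / delta));
        this.sketch = new long[d][w];
        this.a = new long[d];
        this.b = new long[d];
        Random random = new Random(seed);
        for (int j = 0; j < d; j++) {
            a[j] = 1 + random.nextInt((int) PRIME - 1); // a in [1, p-1]
            b[j] = random.nextInt((int) PRIME);         // b in [0, p-1]
        }
    }

    // h_j(x) = ((a_j * x + b_j) mod p) mod w
    private int bucket(int row, long item) {
        long x = Math.floorMod(item, PRIME);
        return (int) (((a[row] * x + b[row]) % PRIME) % w);
    }

    /** Adds a non-negative count for the item: one counter per row is incremented. */
    void update(long item, long count) {
        for (int j = 0; j < d; j++) {
            sketch[j][bucket(j, item)] += count;
        }
    }

    /** Point query: the minimum counter over all rows; never below the true count. */
    long estimate(long item) {
        long min = Long.MAX_VALUE;
        for (int j = 0; j < d; j++) {
            min = Math.min(min, sketch[j][bucket(j, item)]);
        }
        return min;
    }

    int width() { return w; }
    int depth() { return d; }

    public static void main(String[] args) {
        CountMinSketch cms = new CountMinSketch(0.01, 0.01, 7L);
        cms.update(42L, 3);
        System.out.println(cms.width() + " x " + cms.depth() + ", estimate(42) = " + cms.estimate(42L));
    }
}
```

Collisions can only add to a counter, so the estimate never undercounts; taking the minimum across rows picks the row with the least collision noise.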
Frequent Elements
Variants of Count-min Sketch [1]
Count-Min sketch with conservative update (CU sketch)
Update an item with frequency c
Avoid unnecessary updating of counter values => Reduce over-estimation error
Prone to over-estimation error on low-frequency items
Lossy Conservative Update (LCU) - SWS
Divide stream into windows
At window boundaries, ∀ 1 ≤ i ≤ w, 1 ≤ j ≤ d, decrement sketch[i,j] if 0 < sketch[i,j] ≤
122
[1] Cormode, G. 2009. Encyclopedia entry on 'Count-Min Sketch'. In Encyclopedia of Database Systems. Springer, 511–516.
Frequent Elements
123
OPEN SOURCE
DATA SKETCHES* (Yahoo!)
Unique
Quantile
Histogram
Sampling
Theta Sketches
Tuple Sketches
Most Frequent
ALGEBIRD# (Twitter)
Filtering
Unique
Histogram
Most Frequent
streamDM^ (Huawei)
SGD Learner and Perceptron
Naive Bayes
CluStream
Hoeffding Decision Trees
Bagging
Stream KM++
StreamLib**
* https://datasketches.github.io/
# https://github.com/twitter/algebird
^ http://huawei-noah.github.io/streamDM/
** https://github.com/jiecchen/StreamLib
124
Anomaly Detection
Very rich - over 150 yrs - history
Manufacturing
Statistics
Econometrics, Financial engineering
Signal processing
Control systems, Autonomous systems - fault detection [1]
Networking
Computational biology (e.g., microarray analysis)
Computer vision
[1] A. S. Willsky, "A survey of design methods for failure detection systems," Automatica, vol. 12, pp. 601–611, 1976.
125
Very	rich	-	over	150	yrs	-	history	
Anomalies	are	contextual	in	nature
“DISCORDANT observations may be defined as those which
present the appearance of differing in respect of their law of
frequency from other observations with which they are
combined. In the treatment of such observations there is great
diversity between authorities ; but this discordance of methods
may be reduced by the following reflection. Different methods
are adapted to different hypotheses about the cause of a
discordant observation; and different hypotheses are true, or
appropriate, according as the subject-matter, or the degree of
accuracy required, is different.”
F. Y. Edgeworth, “On Discordant Observations”, 1887.
Anomaly Detection
126
Anomaly Detection
CHARACTERISTICS
DIRECTION: Positive, Negative
FREQUENCY: Reliability
WIDTH: Actionability
MAGNITUDE: Severity
Global
Local
127
Anomaly Detection
COMMON APPROACHES
DOMAINS
STATS
MFG
OPS
Rule Based
µ ± σ
Moving Averages
SMA, EWMA, PEWMA
PARAMS
WIDTH
DECAY
Assumption
Normal Distribution
NOT VALID in real-life
Stone 1868
Glaisher 1872
Edgeworth 1887
Stewart 1920
Irwin 1925
Jeffreys 1932
Rider 1933
128
Anomaly Detection
ROBUST	MEASURES
MEDIAN
MAD [1]: Median Absolute Deviation
MCD [2]: Minimum Covariance Determinant
MVEE [3,4]: Minimum Volume Enclosing Ellipsoid
[1] P. J. Rousseeuw and C. Croux, "Alternatives to the Median Absolute Deviation", 1993.
[2] http://onlinelibrary.wiley.com/wol1/doi/10.1002/wics.61/abstract
[3] P. J. Rousseeuw and A. M. Leroy, "Robust Regression and Outlier Detection", 1987.
[4] M. J. Todd and E. A. Yıldırım, "On Khachiyan's algorithm for the computation of minimum-volume enclosing ellipsoids", 2007.
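As a concrete instance of a robust measure, the sketch below flags points whose distance from the median exceeds a multiple of the scaled MAD. It is an illustrative batch version rather than a streaming detector. The class name, the threshold parameter, and the fallback used when the MAD is zero are assumptions; 1.4826 is the usual constant that makes the MAD comparable to a standard deviation under normality.

```java
import java.util.Arrays;

/** Outlier flagging with the median and the median absolute deviation (MAD). */
class MadDetector {
    static double median(double[] values) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        return n % 2 == 1 ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }

    /** MAD = median of the absolute deviations from the median. */
    static double mad(double[] values) {
        double med = median(values);
        double[] deviations = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            deviations[i] = Math.abs(values[i] - med);
        }
        return median(deviations);
    }

    /** Flags x when |x - median| / (1.4826 * MAD) exceeds the threshold. */
    static boolean[] flag(double[] values, double threshold) {
        double med = median(values);
        double scale = 1.4826 * mad(values);
        boolean[] flags = new boolean[values.length];
        for (int i = 0; i < values.length; i++) {
            double distance = Math.abs(values[i] - med);
            // If the MAD is zero, any point that differs from the median is flagged
            flags[i] = scale > 0 ? distance / scale > threshold : distance > 0;
        }
        return flags;
    }

    public static void main(String[] args) {
        double[] latencies = {10, 11, 9, 10, 12, 10, 11, 100};
        System.out.println(Arrays.toString(flag(latencies, 3.0)));
    }
}
```

Unlike a µ ± σ rule, the single extreme value here barely moves the median or the MAD, so it cannot mask itself by inflating the spread estimate.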
129
Anomaly Detection
Challenges
NOISE STATIONARITY
SEASONALITY TREND BREAKOUT
130
Anomaly Detection
Challenges
Live Data
Multi-dimensional
Low memory footprint
Accuracy vs. Speed trade-off
Encoding the context
Data types
Video, Audio, Text
Data veracity
Wearables
Smart cities, Connected Home, Internet of Things
TYING	IT	ALL	TOGETHER
132
Real Time Architectures
For the streaming world
Lambda
Run computation twice in different systems
Kappa
Run computation once
133
Lambda Architecture
Overview
134
Lambda Architecture
Batch Layer
Accurate but delayed
HDFS/MapReduce
Fast Layer
Inexact but fast
Storm/Kafka
Query Merge Layer
Merge results from batch and fast layers at query time
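The query merge layer can be sketched for the simple case of additive counts. This example assumes the batch view covers all events up to some cutoff and the fast view covers only events after it, so the two can be summed per key. The class `LambdaQueryMerge` and both view maps are hypothetical stand-ins, not part of any system named in the slides.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative query-time merge of a batch view and a fast view. */
final class LambdaQueryMerge {
    /**
     * Combines per-key counts from both layers. Assumes the views cover
     * disjoint time ranges, so adding them does not double count.
     */
    static Map<String, Long> merge(Map<String, Long> batchView,
                                   Map<String, Long> fastView) {
        Map<String, Long> merged = new HashMap<>(batchView);
        // Add the recent increments on top of the accurate batch totals.
        fastView.forEach((key, delta) -> merged.merge(key, delta, Long::sum));
        return merged;
    }
}
```

Aggregates that are not additive (distinct counts, percentiles) need a mergeable representation instead of a plain sum.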
135
Lambda Architecture
Characteristics
During ingestion, data is cloned into two copies
One goes to the batch layer
The other goes to the fast layer
Processing is done in both layers
Expressed as MapReduce jobs in the batch layer
Expressed as topologies in the fast layer
136
Lambda Architecture
Challenges
Inherently Inefficient
Data is duplicated
Computation is duplicated
Operationally Inefficient
Maintain both batch and streaming systems
Tune topologies for both systems
137
Kappa Architecture
Streaming is everything
Computation is expressed in a topology
Computation is mostly done only once, when the data arrives
Data moves into permanent storage
138
Kappa Architecture
139
Kappa Architecture
Challenges
Data reprocessing could be very expensive
Code/Logic Changes
Either data needs to be brought back from storage to the bus
Or computation needs to be expressed to run on bulk storage
Historic Analysis
How to do data analytics over all of last year's data
140
Unification
141
Observations
Computation across batch/real-time is similar
Expressed as DAGs
Run in parallel on the cluster
Intermediate results need not be materialized
Functional/Declarative APIs
Storage is the key
Messaging/Storage are two faces of the same coin
They serve the same data
142
Real-Time Storage Requirements
Requirements for a real-time storage platform
Be able to write and read streams of records with low latency and storage durability
Data storage should be durable, consistent and fault tolerant
Enable clients to stream or tail ledgers to propagate data as it is written
Store and provide access to both historic and real-time data
143
Apache BookKeeper - Stream Storage
A storage for log streams
Replicated, durable storage of log streams
Provide fast tailing/streaming facility
Optimized for immutable data
Low-latency durability
Simple repeatable read consistency
High write and read availability
144
Record
Smallest	I/O	and	Address	Unit
A sequence of indivisible records
A record is a sequence of bytes
The smallest I/O unit, as well as the unit of address
Each record contains sequence numbers for addressing
145
Logs
Two Storage Primitives
Ledger: A finite sequence of records.
Stream: An infinite sequence of records.
146
Ledger
Finite sequence of records
Ledger: A finite sequence of records that gets terminated when either
A client explicitly closes it
A writer that writes records into it has crashed
147
Stream
Infinite sequence of records
Stream: An unbounded, infinite sequence of records
Physically composed of multiple ledgers
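The relationship between the two primitives can be pictured with a toy in-memory model: a stream appends to its current ledger, closes it when it is full, and rolls over to a fresh one. This is not the BookKeeper client API. `ToyStream`, `ToyLedger` and the roll-over rule based on `maxRecordsPerLedger` are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a stream built from ledgers; not the BookKeeper API. */
final class ToyStream {
    /** A finite sequence of records that accepts no appends once closed. */
    static final class ToyLedger {
        private final List<byte[]> records = new ArrayList<>();
        private boolean closed;

        void append(byte[] record) {
            if (closed) {
                throw new IllegalStateException("ledger is closed");
            }
            records.add(record);
        }

        void close() { closed = true; }
        boolean isClosed() { return closed; }
        int size() { return records.size(); }
    }

    private final List<ToyLedger> ledgers = new ArrayList<>();
    private final int maxRecordsPerLedger; // hypothetical roll-over policy
    private long nextSequence;

    ToyStream(int maxRecordsPerLedger) {
        this.maxRecordsPerLedger = maxRecordsPerLedger;
        ledgers.add(new ToyLedger());
    }

    /**
     * Appends to the open ledger, rolling to a new one when the current
     * ledger is full. Returns the record's stream-wide sequence number.
     */
    long append(byte[] record) {
        ToyLedger current = ledgers.get(ledgers.size() - 1);
        if (current.size() >= maxRecordsPerLedger) {
            current.close();
            current = new ToyLedger();
            ledgers.add(current);
        }
        current.append(record);
        return nextSequence++;
    }

    int ledgerCount() { return ledgers.size(); }
    ToyLedger ledger(int index) { return ledgers.get(index); }
}
```

The stream itself never ends. Only the individual ledgers behind it are finite, and only the last one is open for writes.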
148
Bookies
Stores fragments of records
Bookie: A storage server that stores data records
Ensemble: A group of bookies storing the data records of a ledger
Individual bookies store fragments of ledgers
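One way to picture how each bookie ends up with only a fragment of a ledger is round-robin striping: each record goes to a subset of the ensemble, and the subset shifts by one bookie per record. The sketch below assumes that scheme. The class `EnsembleStriping`, the `writeSet` method and the `writeQuorum` parameter are illustrative names, not taken from the slides.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative round-robin placement of records across an ensemble. */
final class EnsembleStriping {
    /**
     * Returns the indexes (within the ensemble) of the bookies that would
     * store the given record, assuming round-robin striping with
     * writeQuorum copies per record.
     */
    static List<Integer> writeSet(long recordId, int ensembleSize, int writeQuorum) {
        if (writeQuorum > ensembleSize) {
            throw new IllegalArgumentException("writeQuorum exceeds ensembleSize");
        }
        List<Integer> bookies = new ArrayList<>();
        for (int i = 0; i < writeQuorum; i++) {
            bookies.add((int) ((recordId + i) % ensembleSize));
        }
        return bookies;
    }
}
```

With an ensemble of 3 and 2 copies per record, consecutive records land on bookies (0, 1), (1, 2), (2, 0), and so on, so every bookie holds roughly two thirds of the ledger.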
149
Bookies
Stores fragments of records
150
Tying it all together
A typical installation of Apache BookKeeper
151
BookKeeper in Production
Enterprise Grade Stream Storage
4+ years at Twitter and Yahoo, 2+ years at Salesforce
Multiple use cases from messaging to storage
Database replication, Message store, Stream computing …
600+ bookies in one single cluster
Data is stored from days to a year
Millions of log streams
1 trillion records/day, 17 PB/day
152
Companies using BookKeeper
Enterprise	Grade	Stream	Storage
153
What do we really mean by Compute?
Compute Unification
Aim is to react to events as they happen in real-time
Where do events happen/arrive?
Message Bus
What's a reaction?
An action/transformation/function
154
Compute Representation
Abstract	View
f(x)
Incoming	Messages Output	Messages
155
Traditional Compute representation
DAG
Source 1
Source 2
Action
Action
Action
Sink 1
Sink 2
156
Traditional Compute API
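Functional
// Word count over tumbling count windows of 50 elements
Builder.newBuilder()
    .newSource(() -> StreamletUtils.randomFromList(SENTENCES))
    // Split each lower-cased sentence on runs of whitespace
    .flatMap(sentence -> Arrays.asList(sentence.toLowerCase().split("\\s+")))
    // Key by the word itself, map each word to 1, and sum per key within the window
    .reduceByKeyAndWindow(word -> word, word -> 1,
        WindowConfig.TumblingCountWindow(50),
        (x, y) -> x + y);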
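What the pipeline above computes can be approximated in plain Java with no streaming framework. The sketch splits sentences into words, groups every `windowSize` words into a tumbling window, and counts words within each window. `TumblingWordCount` and `countPerWindow` are hypothetical names, and dropping the trailing partial window is a simplifying assumption.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java illustration of reduce-by-key over tumbling count windows. */
final class TumblingWordCount {
    /**
     * Returns one word-count map per full window of windowSize words.
     * A trailing window that never fills up is not emitted.
     */
    static List<Map<String, Integer>> countPerWindow(List<String> sentences,
                                                     int windowSize) {
        List<Map<String, Integer>> windows = new ArrayList<>();
        Map<String, Integer> current = new HashMap<>();
        int seen = 0;
        for (String sentence : sentences) {
            for (String word : sentence.toLowerCase().split("\\s+")) {
                if (word.isEmpty()) {
                    continue; // leading whitespace yields an empty token
                }
                current.merge(word, 1, Integer::sum); // the (x, y) -> x + y step
                if (++seen == windowSize) {
                    windows.add(current);             // window is full: emit it
                    current = new HashMap<>();
                    seen = 0;
                }
            }
        }
        return windows;
    }
}
```

A tumbling window never overlaps its neighbours, so each word contributes to exactly one emitted map.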
157
Traditional Compute Runtime
Separate
Messaging Compute
158
Traditional Compute Systems
Pitfalls
Powerful API but complicated
Does everyone really need to learn functional programming?
Configurable/scalable but with management overhead
Edge systems have resource/manageability constraints
159
Lessons learnt
Developer Experience
A significant percentage of transformations are simple
ETL
Reactive Services
Classification
Real-time Aggregation
Event Routing
Microservices
160
Lessons learnt
Operational Experience
Another system to operate is one too many
IoT deployments routinely have thousands of edge systems
Semantic difference
Mismatch/Duplication between systems
Creates developer and operator friction
161
What's needed: Stream-Native Compute
Insight
Simplest possible API
Method/Procedure/Function
Multi-language API
Scale developers
Message-bus-native concepts
Input/Output/Log as topics
Flexible runtime
Simple standalone applications vs system-managed applications
162
Introducing Pulsar Functions
163
Pulsar Functions
API
SDK-less API
import java.util.function.Function;

// A plain java.util.function.Function: no Pulsar dependency is needed
public class ExclamationFunction implements Function<String, String> {
    @Override
    public String apply(String input) {
        return input + "!";
    }
}
164
Pulsar Functions
API
SDK API
import org.apache.pulsar.functions.api.PulsarFunction;
import org.apache.pulsar.functions.api.Context;

// Same logic as the SDK-less version, with access to the function Context
public class ExclamationFunction implements PulsarFunction<String, String> {
    @Override
    public String process(String input, Context context) {
        return input + "!";
    }
}
165
Pulsar Functions
Input and Output
Function executed for every message of input topic
Supports multiple topics as inputs
Function output goes to the output topic
Function output can be void/null
SerDe takes care of serialization/deserialization of messages
Custom SerDe can be provided by the users
Integrates with Schema Registry
166
Pulsar Functions
Processing Guarantees
ATMOST_ONCE
Message is acked to Pulsar as soon as we receive it
ATLEAST_ONCE
Message is acked to Pulsar after the function completes
Default behaviour: few people want to lose data
EFFECTIVELY_ONCE
Uses Pulsar’s built-in effectively-once semantics
Controlled at runtime by the user
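The difference between the first two guarantees comes down to whether the ack happens before or after the user function runs. The sketch below models only that ordering. `AckOrdering`, its two methods and the `ack` callback are hypothetical and do not reflect how the Pulsar Functions runtime is implemented. EFFECTIVELY_ONCE is not modeled.

```java
import java.util.function.Consumer;
import java.util.function.Function;

/** Illustrates ack ordering for at-most-once versus at-least-once. */
final class AckOrdering {
    /** Ack first: if fn throws, the message is already acked and is lost. */
    static <I, O> O atMostOnce(String messageId, I payload,
                               Function<I, O> fn, Consumer<String> ack) {
        ack.accept(messageId);
        return fn.apply(payload);
    }

    /**
     * Process first: if fn throws, no ack is sent, so the message stays
     * eligible for redelivery and may be processed more than once.
     */
    static <I, O> O atLeastOnce(String messageId, I payload,
                                Function<I, O> fn, Consumer<String> ack) {
        O result = fn.apply(payload);
        ack.accept(messageId);
        return result;
    }
}
```

With a function that throws, `atMostOnce` leaves the message acked (and therefore dropped), while `atLeastOnce` leaves it unacked so it can be delivered again.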
167
Pulsar Functions
Built-in State
Functions can store state in StreamStore
Framework provides a simple library around this
Supports server-side operations like counters
Simplified application development
No need to stand up an extra system
168
Pulsar Functions
WordCount Topology
import org.apache.pulsar.functions.api.Context;
import org.apache.pulsar.functions.api.PulsarFunction;

public class CounterFunction implements PulsarFunction<String, Void> {
    @Override
    public Void process(String input, Context context) throws Exception {
        // split() takes a regex, so the literal dot must be escaped
        for (String word : input.split("\\.")) {
            // Increment the per-word counter kept in the function's built-in state
            context.incrCounter(word, 1);
        }
        return null;
    }
}
169
Pulsar Functions
Running as a standalone application
bin/pulsar-admin functions localrun \
  --input persistent://sample/standalone/ns1/test_input \
  --output persistent://sample/standalone/ns1/test_result \
  --className org.mycompany.ExclamationFunction \
  --jar myjar.jar
Runs as a standalone process
Run as many instances as you want. Framework automatically balances data
Run and manage via Mesos/K8s/Nomad/your favorite tool
170
Pulsar Functions
Running inside Pulsar cluster
‘Create’ and ‘Delete’ Functions in a Pulsar Cluster
Pulsar brokers run functions as either threads/processes/docker containers
Unifies Messaging and Compute cluster into one, significantly improving manageability
Ideal match for Edge or small startup environment
Serverless in a jar
171
Pulsar Functions
Running inside a Pulsar Cluster
bin/pulsar-admin functions create \
  --input persistent://sample/standalone/ns1/test_input \
  --output persistent://sample/standalone/ns1/test_result \
  --className org.mycompany.ExclamationFunction \
  --jar myjar.jar
bin/pulsar-admin functions getstatus \
  --name ExclamationFunction \
  --tenant sample --namespace ns1
bin/pulsar-admin functions delete \
  --name ExclamationFunction \
  --tenant sample --namespace ns1
172
Unification
Stepping Back
Messaging & Storage
With StreamStore
Messaging & Compute
Pulsar Functions
Compute & Storage
Built-in State Management
173
Unification
Observations
Lambda, Kappa or something else
Boundaries of streaming vs serverless
It's all about reducing barriers
APIs
Runtime
Operations
174
RESOURCES
Sketching Algorithms
https://www.cs.upc.edu/~gavalda/papers/portoschool.pdf
https://mapr.com/blog/some-important-streaming-algorithms-you-should-know-about/
https://gist.github.com/debasishg/8172796
Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches
G. Cormode, M. Garofalakis and P. J. Haas
Data Streams: Models and Algorithms
Charu Aggarwal
http://www.springer.com/us/book/9780387287591
Data Streams: Algorithms and Applications
Muthu Muthukrishnan
http://algo.research.googlepages.com/eight.ps
Graph Streaming Algorithms
A. McGregor
Sketching as a Tool for Numerical Linear Algebra
D. Woodruff
175
Readings
SIGMOD’15
Twitter Heron: Stream Processing at Scale
ICDE’17
Twitter Heron: Towards Extensible Streaming Engines
VLDB’17
Dhalion: Self-Regulating Stream Processing in Heron
VLDB’13
MillWheel: Fault-Tolerant Stream Processing at Internet Scale
VLDB’15
The Dataflow Model: A Practical Approach to Balancing Correctness, Latency and Cost in Massive-Scale, Unbounded Out-of-Order Data Processing
Strata San Jose’17
Anomaly Detection in Real-Time Data Streams Using Heron
176
Real Time is Messy and Unpredictable
Aggregation Systems
Messaging Systems
Result Engine
HDFS
Queryable Engines
177
Streamlio - Unified Architecture
Interactive Querying
Storm API
Trident/Apache Beam
SQL
Application Builder
Pulsar API
BK/HDFS API
Kubernetes
Metadata Management
Operational Monitoring
Chargeback
Security
Authentication
Quota Management
Rules Engine
Kafka API
178
BookKeeper in Real-Time Solution
Durable Messaging, Scalable Compute and Stream Storage
179
BookKeeper - Use Cases
Combine messaging and storage
Stream Storage combines the functionality of streaming and storage
WAL - Write Ahead Log
Message Store
Object Store
Snapshots
Stream Processing
180
Readings
FOCS’00	
Clustering Data Streams
SIGMOD’02	
Querying and mining data streams:
You only get one look
SIAM Journal of Computing’09	
Stream Order and Order Statistics: Quantile
Estimation in Random-Order Streams
PODS’02	
Models and Issues in Data Stream Systems
SIGMOD’07	
Statistical Analysis of Sketch Estimators
PODS’10	
An optimal algorithm for the distinct
elements problem
181
Readings
SODA’10	
Coresets and Sketches for high dimensional
subspace approximation problems
SIGMOD’16	
Time Adaptive Sketches (Ada-Sketches) for
Summarizing Data Streams
SOSR’17	
Heavy-Hitter Detection Entirely in the Data
Plane
PODS’12	
Graph Sketches: Sparsification, Spanners, and
Subgraphs
Arxiv’16	
Coresets and Sketches
ACM Queue’17	
Data Sketching: The approximate approach is
often faster and more efficient
183
GET	IN	TOUCH
CONTACT US
@arun_kejariwal	
@kramasamy,		@sanjeerk	
@sijieg,		@merlimat	
@nlu90
karthik@stremlio.io	
arun_kejariwal@acm.org
ENJOY THE PRESENTATION
The End

More Related Content

What's hot

Disaster Recovery Plans for Apache Kafka
Disaster Recovery Plans for Apache KafkaDisaster Recovery Plans for Apache Kafka
Disaster Recovery Plans for Apache Kafkaconfluent
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guideRyan Blue
 
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsRunning Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsDatabricks
 
One sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async SinkOne sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async SinkFlink Forward
 
Apache Druid 101
Apache Druid 101Apache Druid 101
Apache Druid 101Data Con LA
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesDatabricks
 
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Flink Forward
 
How Orange Financial combat financial frauds over 50M transactions a day usin...
How Orange Financial combat financial frauds over 50M transactions a day usin...How Orange Financial combat financial frauds over 50M transactions a day usin...
How Orange Financial combat financial frauds over 50M transactions a day usin...StreamNative
 
Running Apache NiFi with Apache Spark : Integration Options
Running Apache NiFi with Apache Spark : Integration OptionsRunning Apache NiFi with Apache Spark : Integration Options
Running Apache NiFi with Apache Spark : Integration OptionsTimothy Spann
 
Improving Kafka at-least-once performance at Uber
Improving Kafka at-least-once performance at UberImproving Kafka at-least-once performance at Uber
Improving Kafka at-least-once performance at UberYing Zheng
 
Tuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptxTuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptxFlink Forward
 
Stephan Ewen - Experiences running Flink at Very Large Scale
Stephan Ewen -  Experiences running Flink at Very Large ScaleStephan Ewen -  Experiences running Flink at Very Large Scale
Stephan Ewen - Experiences running Flink at Very Large ScaleVerverica
 
ClickHouse Deep Dive, by Aleksei Milovidov
ClickHouse Deep Dive, by Aleksei MilovidovClickHouse Deep Dive, by Aleksei Milovidov
ClickHouse Deep Dive, by Aleksei MilovidovAltinity Ltd
 
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...Databricks
 
Serverless Streaming Architectures and Algorithms for the Enterprise
Serverless Streaming Architectures and Algorithms for the EnterpriseServerless Streaming Architectures and Algorithms for the Enterprise
Serverless Streaming Architectures and Algorithms for the EnterpriseArun Kejariwal
 
Securing your Pulsar Cluster with Vault_Chris Kellogg
Securing your Pulsar Cluster with Vault_Chris KelloggSecuring your Pulsar Cluster with Vault_Chris Kellogg
Securing your Pulsar Cluster with Vault_Chris KelloggStreamNative
 
Modern real-time streaming architectures
Modern real-time streaming architecturesModern real-time streaming architectures
Modern real-time streaming architecturesArun Kejariwal
 

What's hot (20)

Disaster Recovery Plans for Apache Kafka
Disaster Recovery Plans for Apache KafkaDisaster Recovery Plans for Apache Kafka
Disaster Recovery Plans for Apache Kafka
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
Intro to Pinot (2016-01-04)
Intro to Pinot (2016-01-04)Intro to Pinot (2016-01-04)
Intro to Pinot (2016-01-04)
 
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsRunning Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
 
One sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async SinkOne sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async Sink
 
Apache Druid 101
Apache Druid 101Apache Druid 101
Apache Druid 101
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...
 
How Orange Financial combat financial frauds over 50M transactions a day usin...
How Orange Financial combat financial frauds over 50M transactions a day usin...How Orange Financial combat financial frauds over 50M transactions a day usin...
How Orange Financial combat financial frauds over 50M transactions a day usin...
 
Apache Airflow
Apache AirflowApache Airflow
Apache Airflow
 
Running Apache NiFi with Apache Spark : Integration Options
Running Apache NiFi with Apache Spark : Integration OptionsRunning Apache NiFi with Apache Spark : Integration Options
Running Apache NiFi with Apache Spark : Integration Options
 
Improving Kafka at-least-once performance at Uber
Improving Kafka at-least-once performance at UberImproving Kafka at-least-once performance at Uber
Improving Kafka at-least-once performance at Uber
 
Tuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptxTuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptx
 
Stephan Ewen - Experiences running Flink at Very Large Scale
Stephan Ewen -  Experiences running Flink at Very Large ScaleStephan Ewen -  Experiences running Flink at Very Large Scale
Stephan Ewen - Experiences running Flink at Very Large Scale
 
ClickHouse Deep Dive, by Aleksei Milovidov
ClickHouse Deep Dive, by Aleksei MilovidovClickHouse Deep Dive, by Aleksei Milovidov
ClickHouse Deep Dive, by Aleksei Milovidov
 
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
 
Deep Dive on Amazon Aurora
Deep Dive on Amazon AuroraDeep Dive on Amazon Aurora
Deep Dive on Amazon Aurora
 
Serverless Streaming Architectures and Algorithms for the Enterprise
Serverless Streaming Architectures and Algorithms for the EnterpriseServerless Streaming Architectures and Algorithms for the Enterprise
Serverless Streaming Architectures and Algorithms for the Enterprise
 
Securing your Pulsar Cluster with Vault_Chris Kellogg
Securing your Pulsar Cluster with Vault_Chris KelloggSecuring your Pulsar Cluster with Vault_Chris Kellogg
Securing your Pulsar Cluster with Vault_Chris Kellogg
 
Modern real-time streaming architectures
Modern real-time streaming architecturesModern real-time streaming architectures
Modern real-time streaming architectures
 

Similar to Tutorial - Modern Real Time Streaming Architectures

Petascale Analytics - The World of Big Data Requires Big Analytics
Petascale Analytics - The World of Big Data Requires Big AnalyticsPetascale Analytics - The World of Big Data Requires Big Analytics
Petascale Analytics - The World of Big Data Requires Big AnalyticsHeiko Joerg Schick
 
Streaming Pipelines in Kubernetes Using Apache Pulsar, Heron and BookKeeper
Streaming Pipelines in Kubernetes Using Apache Pulsar, Heron and BookKeeperStreaming Pipelines in Kubernetes Using Apache Pulsar, Heron and BookKeeper
Streaming Pipelines in Kubernetes Using Apache Pulsar, Heron and BookKeeperKarthik Ramasamy
 
Real Time Processing Using Twitter Heron by Karthik Ramasamy
Real Time Processing Using Twitter Heron by Karthik RamasamyReal Time Processing Using Twitter Heron by Karthik Ramasamy
Real Time Processing Using Twitter Heron by Karthik RamasamyData Con LA
 
"Traffic Speed Control System in the Cloud using Machine Learning" by Albert ...
"Traffic Speed Control System in the Cloud using Machine Learning" by Albert ..."Traffic Speed Control System in the Cloud using Machine Learning" by Albert ...
"Traffic Speed Control System in the Cloud using Machine Learning" by Albert ...DevClub_lv
 
Big Data & Machine Learning Pipelines: A Tale of Lambdas, Kappas and Pancakes
Big Data & Machine Learning Pipelines: A Tale of Lambdas, Kappas and PancakesBig Data & Machine Learning Pipelines: A Tale of Lambdas, Kappas and Pancakes
Big Data & Machine Learning Pipelines: A Tale of Lambdas, Kappas and PancakesOsama Khan
 
EDA Meets Data Engineering – What's the Big Deal?
EDA Meets Data Engineering – What's the Big Deal?EDA Meets Data Engineering – What's the Big Deal?
EDA Meets Data Engineering – What's the Big Deal?confluent
 
Agents In An Exponential World Foster
Agents In An Exponential World FosterAgents In An Exponential World Foster
Agents In An Exponential World FosterIan Foster
 
Big Data Analytics for Real Time Systems
Big Data Analytics for Real Time SystemsBig Data Analytics for Real Time Systems
Big Data Analytics for Real Time SystemsKamalika Dutta
 
Nikravesh australia long_versionkeynote2012
Nikravesh australia long_versionkeynote2012Nikravesh australia long_versionkeynote2012
Nikravesh australia long_versionkeynote2012Masoud Nikravesh
 
Big Data Day LA 2016/ Big Data Track - Twitter Heron @ Scale - Karthik Ramasa...
Big Data Day LA 2016/ Big Data Track - Twitter Heron @ Scale - Karthik Ramasa...Big Data Day LA 2016/ Big Data Track - Twitter Heron @ Scale - Karthik Ramasa...
Big Data Day LA 2016/ Big Data Track - Twitter Heron @ Scale - Karthik Ramasa...Data Con LA
 
Cyberinfrastructure and Applications Overview: Howard University June22
Cyberinfrastructure and Applications Overview: Howard University June22Cyberinfrastructure and Applications Overview: Howard University June22
Cyberinfrastructure and Applications Overview: Howard University June22marpierc
 
The Evolution of Data Architecture
The Evolution of Data ArchitectureThe Evolution of Data Architecture
The Evolution of Data ArchitectureWei-Chiu Chuang
 
TeraGrid Communication and Computation
TeraGrid Communication and ComputationTeraGrid Communication and Computation
TeraGrid Communication and ComputationTal Lavian Ph.D.
 
Sql or NoSql: that is the question...
Sql or NoSql: that is the question...Sql or NoSql: that is the question...
Sql or NoSql: that is the question...alikonweb
 
Artificial intelligence - A Teaser to the Topic.
Artificial intelligence - A Teaser to the Topic.Artificial intelligence - A Teaser to the Topic.
Artificial intelligence - A Teaser to the Topic.Dr. Kim (Kyllesbech Larsen)
 
Azure Databricks for Data Scientists
Azure Databricks for Data ScientistsAzure Databricks for Data Scientists
Azure Databricks for Data ScientistsRichard Garris
 

Similar to Tutorial - Modern Real Time Streaming Architectures (20)

Petascale Analytics - The World of Big Data Requires Big Analytics
Petascale Analytics - The World of Big Data Requires Big AnalyticsPetascale Analytics - The World of Big Data Requires Big Analytics
Petascale Analytics - The World of Big Data Requires Big Analytics
 
Streaming Pipelines in Kubernetes Using Apache Pulsar, Heron and BookKeeper
Streaming Pipelines in Kubernetes Using Apache Pulsar, Heron and BookKeeperStreaming Pipelines in Kubernetes Using Apache Pulsar, Heron and BookKeeper
Streaming Pipelines in Kubernetes Using Apache Pulsar, Heron and BookKeeper
 
Real Time Processing Using Twitter Heron by Karthik Ramasamy
Real Time Processing Using Twitter Heron by Karthik RamasamyReal Time Processing Using Twitter Heron by Karthik Ramasamy
Real Time Processing Using Twitter Heron by Karthik Ramasamy
 
"Traffic Speed Control System in the Cloud using Machine Learning" by Albert ...
"Traffic Speed Control System in the Cloud using Machine Learning" by Albert ..."Traffic Speed Control System in the Cloud using Machine Learning" by Albert ...
"Traffic Speed Control System in the Cloud using Machine Learning" by Albert ...
 
Big Data & Machine Learning Pipelines: A Tale of Lambdas, Kappas and Pancakes
Big Data & Machine Learning Pipelines: A Tale of Lambdas, Kappas and PancakesBig Data & Machine Learning Pipelines: A Tale of Lambdas, Kappas and Pancakes
Big Data & Machine Learning Pipelines: A Tale of Lambdas, Kappas and Pancakes
 
Modern Data Pipelines
Modern Data PipelinesModern Data Pipelines
Modern Data Pipelines
 
EDA Meets Data Engineering – What's the Big Deal?
EDA Meets Data Engineering – What's the Big Deal?EDA Meets Data Engineering – What's the Big Deal?
EDA Meets Data Engineering – What's the Big Deal?
 
Agents In An Exponential World Foster
Agents In An Exponential World FosterAgents In An Exponential World Foster
Agents In An Exponential World Foster
 
Big Data Analytics for Real Time Systems
Big Data Analytics for Real Time SystemsBig Data Analytics for Real Time Systems
Big Data Analytics for Real Time Systems
 
Nikravesh australia long_versionkeynote2012
Nikravesh australia long_versionkeynote2012Nikravesh australia long_versionkeynote2012
Nikravesh australia long_versionkeynote2012
 
Big Data Day LA 2016/ Big Data Track - Twitter Heron @ Scale - Karthik Ramasa...
Big Data Day LA 2016/ Big Data Track - Twitter Heron @ Scale - Karthik Ramasa...Big Data Day LA 2016/ Big Data Track - Twitter Heron @ Scale - Karthik Ramasa...
Big Data Day LA 2016/ Big Data Track - Twitter Heron @ Scale - Karthik Ramasa...
 
Microsoft Dryad
Microsoft DryadMicrosoft Dryad
Microsoft Dryad
 
Cyberinfrastructure and Applications Overview: Howard University June22
Cyberinfrastructure and Applications Overview: Howard University June22Cyberinfrastructure and Applications Overview: Howard University June22
Cyberinfrastructure and Applications Overview: Howard University June22
 
The Evolution of Data Architecture
The Evolution of Data ArchitectureThe Evolution of Data Architecture
The Evolution of Data Architecture
 
1. GRID COMPUTING
1. GRID COMPUTING1. GRID COMPUTING
1. GRID COMPUTING
 
TeraGrid Communication and Computation
TeraGrid Communication and ComputationTeraGrid Communication and Computation
TeraGrid Communication and Computation
 
Sql or NoSql: that is the question...
Sql or NoSql: that is the question...Sql or NoSql: that is the question...
Sql or NoSql: that is the question...
 
Artificial intelligence - A Teaser to the Topic.
Artificial intelligence - A Teaser to the Topic.Artificial intelligence - A Teaser to the Topic.
Artificial intelligence - A Teaser to the Topic.
 
Azure Databricks for Data Scientists
Azure Databricks for Data ScientistsAzure Databricks for Data Scientists
Azure Databricks for Data Scientists
 
S B Goyal
S B GoyalS B Goyal
S B Goyal
 

More from Karthik Ramasamy

Scaling Apache Pulsar to 10 PB/day
Scaling Apache Pulsar to 10 PB/dayScaling Apache Pulsar to 10 PB/day
Scaling Apache Pulsar to 10 PB/dayKarthik Ramasamy
 
Pulsar summit-keynote-final
Pulsar summit-keynote-finalPulsar summit-keynote-final
Pulsar summit-keynote-finalKarthik Ramasamy
 
Apache Pulsar Seattle - Meetup
Apache Pulsar Seattle - MeetupApache Pulsar Seattle - Meetup
Apache Pulsar Seattle - MeetupKarthik Ramasamy
 
Unifying Messaging, Queueing & Light Weight Compute Using Apache Pulsar
Unifying Messaging, Queueing & Light Weight Compute Using Apache PulsarUnifying Messaging, Queueing & Light Weight Compute Using Apache Pulsar
Unifying Messaging, Queueing & Light Weight Compute Using Apache PulsarKarthik Ramasamy
 
Creating Data Fabric for #IOT with Apache Pulsar
Creating Data Fabric for #IOT with Apache PulsarCreating Data Fabric for #IOT with Apache Pulsar
Creating Data Fabric for #IOT with Apache PulsarKarthik Ramasamy
 
Linked In Stream Processing Meetup - Apache Pulsar
Linked In Stream Processing Meetup - Apache PulsarLinked In Stream Processing Meetup - Apache Pulsar
Linked In Stream Processing Meetup - Apache PulsarKarthik Ramasamy
 
Exactly once in Apache Heron
Exactly once in Apache HeronExactly once in Apache Heron
Exactly once in Apache HeronKarthik Ramasamy
 
Twitter's Real Time Stack - Processing Billions of Events Using Distributed L...
Twitter's Real Time Stack - Processing Billions of Events Using Distributed L...Twitter's Real Time Stack - Processing Billions of Events Using Distributed L...
Twitter's Real Time Stack - Processing Billions of Events Using Distributed L...Karthik Ramasamy
 
Storm@Twitter, SIGMOD 2014 paper
Storm@Twitter, SIGMOD 2014 paperStorm@Twitter, SIGMOD 2014 paper
Storm@Twitter, SIGMOD 2014 paperKarthik Ramasamy
 
Storm@Twitter, SIGMOD 2014
Storm@Twitter, SIGMOD 2014Storm@Twitter, SIGMOD 2014
Storm@Twitter, SIGMOD 2014Karthik Ramasamy
 

More from Karthik Ramasamy (11)

Scaling Apache Pulsar to 10 PB/day
Scaling Apache Pulsar to 10 PB/dayScaling Apache Pulsar to 10 PB/day
Scaling Apache Pulsar to 10 PB/day
 
Apache Pulsar @Splunk
Apache Pulsar @SplunkApache Pulsar @Splunk
Apache Pulsar @Splunk
 
Pulsar summit-keynote-final
Pulsar summit-keynote-finalPulsar summit-keynote-final
Pulsar summit-keynote-final
 
Apache Pulsar Seattle - Meetup
Apache Pulsar Seattle - MeetupApache Pulsar Seattle - Meetup
Apache Pulsar Seattle - Meetup
 
Unifying Messaging, Queueing & Light Weight Compute Using Apache Pulsar
Unifying Messaging, Queueing & Light Weight Compute Using Apache PulsarUnifying Messaging, Queueing & Light Weight Compute Using Apache Pulsar
Unifying Messaging, Queueing & Light Weight Compute Using Apache Pulsar
 
Creating Data Fabric for #IOT with Apache Pulsar
Creating Data Fabric for #IOT with Apache PulsarCreating Data Fabric for #IOT with Apache Pulsar
Creating Data Fabric for #IOT with Apache Pulsar
 
Linked In Stream Processing Meetup - Apache Pulsar
Linked In Stream Processing Meetup - Apache PulsarLinked In Stream Processing Meetup - Apache Pulsar
Linked In Stream Processing Meetup - Apache Pulsar
 
Exactly once in Apache Heron
Exactly once in Apache HeronExactly once in Apache Heron
Exactly once in Apache Heron
 
Twitter's Real Time Stack - Processing Billions of Events Using Distributed L...
Twitter's Real Time Stack - Processing Billions of Events Using Distributed L...Twitter's Real Time Stack - Processing Billions of Events Using Distributed L...
Twitter's Real Time Stack - Processing Billions of Events Using Distributed L...
 
Storm@Twitter, SIGMOD 2014 paper
Storm@Twitter, SIGMOD 2014 paperStorm@Twitter, SIGMOD 2014 paper
Storm@Twitter, SIGMOD 2014 paper
 
Storm@Twitter, SIGMOD 2014
Storm@Twitter, SIGMOD 2014Storm@Twitter, SIGMOD 2014
Storm@Twitter, SIGMOD 2014
 
