Building Modern Data Pipelines Using
Apache Pulsar, Heron and BookKeeper
Karthik	Ramasamy	
Cofounder
Streamlio
2
Information Age
Streaming	Data	is	key	
Ká !
3
Increasingly Connected World
Internet of Things
30	B	connected	devices	by	2020
Health Care
153	Exabytes	(2013)	->	2314	Exabytes	(2020)
Machine Data
40%	of	digital	universe	by	2020
Connected Vehicles
Data	transferred	per	vehicle	per	month	
4	MB	->	5	GB
Digital Assistants (Predictive Analytics)
$2B	(2012)	->	$6.5B	(2019)	[1]	
Siri/Cortana/Google	Now
Augmented/Virtual Reality
$150B	by	2020	[2]	
Oculus/HoloLens/Magic	Leap
Ñ
!+
>
4
Traditional Batch Processing
Challenges	
Introduces	too	much	“decision	latency”	
Responses	are	delivered	“a[er	the	fact”	
Maximum	value	of	the	idenfied	situaon	is	lost	
Decisions	are	made	on	old	and	stale	data	
Data	at	Rest
Store Analyze Act
5
The New Era: Streaming Data/Fast Data
Events	are	analyzed	and	processed	in	real-me	as	they	arrive	
Decisions	are	mely,	contextual	and	based	on	fresh	data	
Decision	latency	is	eliminated	
Data	in	moon
Modern	Data	Pipelines Streaming	Data	Pipelines
6
Streaming Data Pipeline
A	Streaming	Data	Pipeline		
captures	events	for	analysis,		
process	as	they	arrive	and		
produce	con=nuous	results
7
Streaming Data Pipeline
Data	Lake
Product	Features
Adhoc	Features
Counng
Machine	Learning
Extract	Transform	Load	(ETL)
Click	Data
Engagement	Data
Email	sends
User	events
8
Streaming Use Cases
Algorithmic	trading	
Online	fraud	detecon	
Geo	fencing	
Proximity/locaon	tracking	
Intrusion	detecon	systems	
Traffic	management
Real	me	recommendaons	
Churn	detecon	
Internet	of	things	
Social	media/data	analycs	
Gaming	data	feed	
IoT	Edge	Analycs
9
Streaming Stack
REAL TIME
STACK
Collectors
s
Compute
J
Messaging
a
Storage
b
10
State of the World
Aggregation
Systems
Messaging
Systems
Result
Engine
HDFS
Queryable
Engines
11
Towards Simplification & Unification
Data	Ingeson Data	Processing
Results	StorageData	Storage
12
Why Apache Pulsar?
Ordering	
Guaranteed	ordering
Mul=-tenancy	
A	single	cluster	can	support	many	
tenants	and	use	cases
High	throughput	
Can	reach	1.8	M	messages/s	in	a	
single	par==on
Durability	
Data	replicated	and	synced	to	disk
Geo-replica=on	
Out	of	box	support	for	geographically	
distributed	applica=ons
Unified	messaging	model	
Support	both	Topic	&	Queue	
seman=c	in	a	single	model
Delivery	Guarantees	
At	least	once,	at	most	once	and	effec=vely	
once
Low	Latency	
Low	publish	latency	of	5ms	at	99pct
Highly	scalable	
Can	support	millions	of	topics
13
Pulsar Architecture
Stateless	Serving
BROKER	
Clients interact only with brokers
No state is stored in brokers
BOOKIES	
Apache BookKeeper as the storage
Storage is append only
Provides high performance, low latency
Durability	
No data loss. fsync before acknowledgement
14
Pulsar Architecture
Separa=on	of	Storage	and	Serving
SERVING
Brokers can be added independently
Traffic can be shifted quickly across brokers
STORAGE	
Bookies can be added independently
New bookies will ramp up traffic quickly
15
Segment Centric Architecture
16
Pulsar Architecture
Geo	Replica=on
GEO REPLICATION
Asynchronous replication
Integrated in the broker message flow
Simple configuration to add/remove regions
Topic	(T1) Topic	(T1)
Topic	(T1)
Subscripon	
(S1)
Subscripon	
(S1)
Producer		
(P1)
Consumer		
(C1)
Producer		
(P3)
Producer		
(P2)
Consumer		
(C2)
Data	Center	A Data	Center	B
Data	Center	C
17
Pulsar in Production
3+	years	
Serves	2.3	million	topics	
100	billion	messages/day	
Average	latency	<	5	ms	
99%	15	ms	(strong	durability	guarantees)	
Zero	data	loss	
80+	applicaons	
Self	served	provisioning	
Full-mesh	cross-datacenter	replicaon	-	
8+	data	centers
18
Heron Design Goals
Efficiency	
Reduce	resource	consumpon
Support	for	diverse	workloads	
Throughput	vs	latency	sensive
Support	for	mul=ple	seman=cs	
Atmost	once,	Atleast	once,	
Effecvely	once
Na=ve	Mul=-Language	Support	
C++,	Java,	Python
Task	Isola=on	
Ease	of	debug-ability/isolaon/profiling	
Support	for	back	pressure	
Topologies	should	be	self	adjusng
Use	of	containers	
Runs	in	schedulers	-	Kubernetes	&	DCOS	&	
many	more
Mul=-level	APIs	
Procedural,	Funconal	and	Declarave	for	
diverse	applicaons
Diverse	deployment	models	
Run	as	a	service	or	pure	library
19
Heron Data Model
%
%
%
%
%
Spout 1
Spout 2
Bolt 1
Bolt 2
Bolt 3
Bolt 4
Bolt 5
20
Writing Heron Topologies
Procedural - Low Level API
Directly	write	your	spouts	and	bolts
Functional - Mid Level API
Use	of	maps,	flat	maps,	transform,	windows
Declarative - SQL (in the works)
Use	of	declarave	language	-	specify	what	you	
want,	system	will	figure	it	out.
,
%
21
Topology Execution
Topology Master
ZK
Cluster
Stream
Manager
I1 I2 I3 I4
Stream
Manager
I1 I2 I3 I4
Logical Plan,
Physical Plan and
Execution State
Sync Physical Plan
DATA CONTAINER DATA CONTAINER
Metrics
Manager
Metrics
Manager
Metrics
Manager
Health
Manager
MASTER
CONTAINER
22
Heron @Twitter
LARGEST	CLUSTER
100’s	of	TOPOLOGIES
BILLIONS	OF	MESSAGES100’s	OF	TERABYTESREDUCED	INCIDENTS
GOOD	NIGHT	SLEEP
3X - 5X reduction in resource usage
23
Heron Topology Complexity
24
Heron Topology Scale
CONTAINERS - 1 TO 600 INSTANCES - 10 TO 6000
25
Heron Happy Facts :)
!	No	more	pages	during	midnight	for	Heron	team	
"	Very	rare	incidents	for	Heron	customer	teams	
#	Easy	to	debug	during	incident	for	quick	turn	around	
$	Reduced	resource	ulizaon	saving	cost
26
Observations
Computaon	across	batch/streaming	is	similar	
Expressed	as	DAGS	
Run	in	parallel	on	the	cluster	
Intermediate	results	need	not	be	materialized	
Funconal/Declarave	APIs	
Storage	is	the	key	
Messaging/Storage	are	two	faces	of	the	same	coin	
They	serve	the	same	data
27
Storage Requirements
Requirements	for	a	real-=me	storage	placorm
Be	able	to	write	and	read	streams	of	records	with	low	latency,	storage	durability	
Data	storage	should	be	durable,	consistent	and	fault	tolerant	
Enable	clients	to	stream	or	tail	ledgers	to	propagate	data	as	they’re	wriren	
Store	and	provide	access	to	both	historic	and	real-me	data
28
BookKeeper in Pipelines
Durable	Messaging,	Scalable	Compute	and	Stream	Storage
29
BookKeeper in Production
Enterprise	Grade	Stream	Storage
4+	years	at	Twirer	and	Yahoo,	2+	years	at	Salesforce	
Mulple	use	cases	from	messaging	to	storage	
Database	replicaon,	Message	store,	Stream	compung	…	
600+	bookies	in	one	single	cluster	
Data	is	stored	from	days	to	a	year	
Millions	of	log	streams	
1	trillion	records/day,	17	PB/day
30
Companies using the projects
Enterprise	Grade	Stream	Storage
31
Streamlio - Unified Architecture
Interactive
Querying
Storm API
Trident/Apache
Beam
SQL
Application
Builder
Pulsar
API
BK/
HDFS
API
Kubernetes
Metadata
Management
Operational
Monitoring
Chargeback
Security
Authentication
Quota
Management
Rules
Engine
Kafka
API
33
GET	IN	TOUCH
C O N T A C T 	 U S
@kramasamy
955	Alma	Street,	Palo	Alto,	CA
karthik@streaml.io
Modern Data Pipelines

Modern Data Pipelines