In-Memory Streaming, Storage & Analytics
Apache Apex + Apache Geode

Thomas Weise, Ashish Tadose
Apex Features …
•  In-memory Stream Processing
•  Partitioning and Scaling out
•  Windowing Support (temporal)
•  Stateful Fault-tolerance, Operability
•  Processing Guarantees
•  Compute Locality
•  Dynamic updates
Apex Platform Overview
Application Programming Model
§  Stream is a sequence of data tuples
§  Operator takes one or more input streams, performs computations & emits one or more output streams
–  Each Operator is your custom business logic in Java, or a built-in operator from our open source library
–  Operator has many instances that run in parallel and each instance is single-threaded
§  Directed Acyclic Graph (DAG) is made up of operators and streams
–  Iterative processing supported
[Figure: Directed Acyclic Graph (DAG) in which tuples flow between operators and leave on an output stream]
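The programming model above can be illustrated with a small plain-Java sketch: two single-threaded operators are chained by a stream of tuples into a linear DAG. This does not use the Apex API; the `Operator` interface, the `Pipeline` class, and their method names are hypothetical stand-ins chosen for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical stand-in for an operator: it receives one tuple on its input
// and may emit zero or more tuples on its output.
interface Operator<I, O> {
    void process(I tuple, Consumer<O> output);
}

final class Pipeline {
    // Wires two operators into a linear DAG: source -> first -> second -> sink.
    static <A, B, C> List<C> run(List<A> source, Operator<A, B> first, Operator<B, C> second) {
        List<C> sink = new ArrayList<>();
        for (A tuple : source) {
            first.process(tuple, mid -> second.process(mid, sink::add));
        }
        return sink;
    }

    public static void main(String[] args) {
        // Operator 1 splits lines into words.
        Operator<String, String> splitter = (line, out) -> {
            for (String word : line.split("\\s+")) {
                out.accept(word);
            }
        };
        // Operator 2 drops short words and upper-cases the rest.
        Operator<String, String> filter = (word, out) -> {
            if (word.length() > 3) {
                out.accept(word.toUpperCase());
            }
        };
        System.out.println(run(List.of("apex and geode", "in memory"), splitter, filter));
        // prints [APEX, GEODE, MEMORY]
    }
}
```

In a real application the engine, not user code, would move tuples between operator instances and run many instances in parallel.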
Apache Apex-Malhar
Apex Native Hadoop Integration
YARN is the resource manager

HDFS used for storing any persistent state
Apex Fault Tolerance
•  Operator state is checkpointed to a persistent store
–  Automatically performed by engine, no additional work needed by operator
–  In case of failure operators are restarted from checkpoint state
–  Frequency configurable per operator
–  Asynchronous and distributed by default
–  Default store is HDFS
•  Automatic detection and recovery of failed operators
–  Heartbeat mechanism
•  Buffering mechanism to ensure replay of data from recovered point so that there is no loss of data
•  Application master state checkpointed
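The checkpoint-and-replay behaviour listed on this slide can be sketched in plain Java. This is a simplified model under stated assumptions, not the engine's implementation: `SumOperator`, `CheckpointDemo`, and the in-memory fields standing in for the persistent store and the upstream buffer are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical operator whose only state is a running sum.
final class SumOperator {
    long sum;

    void process(long tuple) {
        sum += tuple;
    }
}

final class CheckpointDemo {
    private final SumOperator operator = new SumOperator();
    private long checkpointedSum = 0;                           // stands in for the persistent store
    private final List<Long> replayBuffer = new ArrayList<>();  // tuples since the last checkpoint

    void onTuple(long tuple) {
        replayBuffer.add(tuple);   // the buffer keeps the tuple until a checkpoint covers it
        operator.process(tuple);
    }

    void checkpoint() {
        checkpointedSum = operator.sum;  // save operator state
        replayBuffer.clear();            // buffered tuples are now covered by the checkpoint
    }

    // Simulates a failure: state is reset to the checkpoint, then buffered tuples are replayed.
    void failAndRecover() {
        operator.sum = checkpointedSum;
        for (long tuple : replayBuffer) {
            operator.process(tuple);
        }
    }

    long currentSum() {
        return operator.sum;
    }

    public static void main(String[] args) {
        CheckpointDemo demo = new CheckpointDemo();
        demo.onTuple(1);
        demo.onTuple(2);
        demo.checkpoint();
        demo.onTuple(3);
        demo.failAndRecover();
        System.out.println(demo.currentSum()); // prints 6: no data lost
    }
}
```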
Apex Processing Semantics
At-least-once
• On recovery data will be replayed from a previous checkpoint
–  No messages lost
–  Default, suitable for most applications
• Can be used to ensure data is written once to store
–  Transactions with meta information, Rewinding output, Feedback from external entity, Idempotent operations
At-most-once
• On recovery the latest data is made available to operator
–  Useful where data loss is acceptable and latest data is sufficient
Exactly-once
–  At-least-once + idempotency + transactional mechanisms (operator logic) to achieve end-to-end exactly once behavior
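To make the role of idempotent operations concrete, here is a minimal plain-Java sketch: tuples replayed under at-least-once recovery are written again, but because each write is an upsert keyed by a unique record id, the store ends up as if every record had been written once. `IdempotentSink` and its `HashMap` backing store are hypothetical and only model an external key-value store.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class IdempotentSink {
    private final Map<String, Integer> store = new HashMap<>(); // stands in for an external store
    private int writes = 0;

    // Upsert keyed by record id: repeating the same write leaves the store unchanged.
    void write(String recordId, int value) {
        store.put(recordId, value);
        writes++;
    }

    Map<String, Integer> contents() {
        return store;
    }

    int writeCount() {
        return writes;
    }

    public static void main(String[] args) {
        IdempotentSink sink = new IdempotentSink();
        List<String> ids = List.of("a", "b", "c");
        for (String id : ids) {
            sink.write(id, id.length());
        }
        // At-least-once recovery: "b" and "c" are replayed from the last checkpoint.
        for (String id : ids.subList(1, 3)) {
            sink.write(id, id.length());
        }
        System.out.println(sink.writeCount() + " writes, " + sink.contents().size() + " records");
        // prints "5 writes, 3 records"
    }
}
```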
IMC Benefits for Apex
•  Data flow in-memory, no disk
•  Incremental recovery – buffer server
•  In-memory data available for querying
Streaming meets In Memory Data Grid
Apex + Geode Integration
Completed

•  Operator checkpointing in Geode.
•  Output operator to store tuples in Geode region.

Proposed

•  Geode output operator with Transactional support.
•  Ingest data from Geode to Apex DAG.
•  Distributed Cache Operator.
•  Scan operator for parallel query execution & result retrieval.
Operator Checkpointing in Geode
Apex Operator checkpointing in an IMDG (Geode store)

• Checkpointing is an essential mechanism to ensure Fault Tolerance
• Apex checkpoints operator state to HDFS by default
• Slower HDFS checkpointing hurts application performance
• Checkpointing in Geode ensures that application performance is not impacted
• Geode has better latency for write operations than HDFS.
Implementation: GeodeStorageAgent
https://issues.apache.org/jira/browse/APEXCORE-283
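The idea of keeping checkpoints in an in-memory store can be sketched as follows. This is not the GeodeStorageAgent code: `RegionCheckpointStore` is a hypothetical class, the `ConcurrentHashMap` merely stands in for a Geode region, and keying checkpoints by operator id and window id is an assumption made for the sketch.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class RegionCheckpointStore {
    // Stands in for a Geode region holding serialized operator state.
    private final Map<String, byte[]> region = new ConcurrentHashMap<>();

    private static String key(int operatorId, long windowId) {
        return operatorId + "/" + windowId;
    }

    // Serializes the operator state and stores it under (operatorId, windowId).
    void save(Serializable state, int operatorId, long windowId) {
        try (ByteArrayOutputStream bytes = new ByteArrayOutputStream();
             ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(state);
            out.flush();
            region.put(key(operatorId, windowId), bytes.toByteArray());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Loads and deserializes the state saved for (operatorId, windowId).
    Object load(int operatorId, long windowId) {
        byte[] data = region.get(key(operatorId, windowId));
        if (data == null) {
            throw new IllegalStateException("no checkpoint for " + key(operatorId, windowId));
        }
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    // Removes a checkpoint that is no longer needed for recovery.
    void delete(int operatorId, long windowId) {
        region.remove(key(operatorId, windowId));
    }
}
```

Because the write goes to memory rather than to a distributed file system, saving a checkpoint in this model is a map put; durability would then depend on how the backing region is replicated or persisted.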
Data Streams to Geode Store
Apex Output Operator to write to Geode store

•  Apex Output operator – Egress data from Apex DAG to external store
•  Store incoming tuples in binary / POJO format in Geode region
•  Geode Efficient Query integration – OQL
•  Geode region supports data replication, overflow to disk, persistence & many eviction strategies
Implementation: GeodeStore, GeodePOJOPutOperator, AbstractGeodePutOperator
https://malhar.atlassian.net/projects/MLHR/issues/MLHR-1942
Geode Transactional Writes
Apex Output Operator to write to Geode store with Transactions

• Apex DAG uses a TransactionableStore to guarantee that records are written exactly once, e.g. JdbcTransactionalStore
• Geode provides Transaction support for efficient and safe coordinated operations
• A Geode store using transactions guarantees that records are written exactly once
• A put operator backed by a transactional Geode store can help to achieve exactly-once semantics
Implementation: GeodeWindowStore as TransactionableStore
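One common way a transactional store yields exactly-once writes is to commit each window's records together with the window id, and to skip any replayed window whose id is already committed. The plain-Java sketch below models that pattern only; `WindowedTxStore` and `TxOutputOperator` are hypothetical names, and the `synchronized` method merely stands in for a real store transaction.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory stand-in for a transactional store.
final class WindowedTxStore {
    private final List<String> records = new ArrayList<>();
    private long committedWindowId = -1;

    synchronized long committedWindowId() {
        return committedWindowId;
    }

    synchronized List<String> records() {
        return new ArrayList<>(records);
    }

    // Records and window id change together, modelling a single transaction.
    synchronized void commit(long windowId, List<String> batch) {
        records.addAll(batch);
        committedWindowId = windowId;
    }
}

final class TxOutputOperator {
    private final WindowedTxStore store;
    private final List<String> pending = new ArrayList<>();
    private long currentWindow;

    TxOutputOperator(WindowedTxStore store) {
        this.store = store;
    }

    void beginWindow(long windowId) {
        currentWindow = windowId;
        pending.clear();
    }

    void process(String tuple) {
        pending.add(tuple);
    }

    void endWindow() {
        // A replayed window that was already committed is skipped.
        if (currentWindow > store.committedWindowId()) {
            store.commit(currentWindow, new ArrayList<>(pending));
        }
    }

    public static void main(String[] args) {
        WindowedTxStore store = new WindowedTxStore();
        TxOutputOperator op = new TxOutputOperator(store);
        long[] windows = {1, 2, 2}; // window 2 is replayed after a simulated failure
        for (long w : windows) {
            op.beginWindow(w);
            op.process("tuple-" + w);
            op.endWindow();
        }
        System.out.println(store.records()); // prints [tuple-1, tuple-2]
    }
}
```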
Streaming Geode data in Apex
Apex Input Operator to read from Geode store
• Apex Input operators – Ingest data from external sources into Apex DAG
• Geode provides versatile and reliable event distribution to provide Real Time updates to data
•  Use case – Apex operator to stream async events from Geode in DAG
•  Call back events reduce polling cycles over network
Implementation: GeodeRegionStreamOperator
    receives newly added tuples and emits them in DAG
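The callback-driven ingestion described above might look roughly like this in plain Java: a notification thread enqueues each new entry, and the operator thread drains the queue when asked to emit, so no polling request goes over the network. `CallbackInputOperator`, `onEntryCreated`, and `emitTuples` are hypothetical names for this sketch and are not taken from the proposed GeodeRegionStreamOperator.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.Consumer;

final class CallbackInputOperator<T> {
    // Thread-safe hand-off between the notification thread and the operator thread.
    private final ConcurrentLinkedQueue<T> events = new ConcurrentLinkedQueue<>();

    // Called by the store's notification thread when an entry is created.
    void onEntryCreated(T value) {
        events.add(value);
    }

    // Called by the operator thread; drains queued events and returns how many were emitted.
    int emitTuples(Consumer<T> outputPort) {
        int emitted = 0;
        T next;
        while ((next = events.poll()) != null) {
            outputPort.accept(next);
            emitted++;
        }
        return emitted;
    }

    public static void main(String[] args) {
        CallbackInputOperator<String> op = new CallbackInputOperator<>();
        op.onEntryCreated("order-1");
        op.onEntryCreated("order-2");
        int n = op.emitTuples(System.out::println);
        System.out.println(n + " tuples emitted");
    }
}
```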
Geode Cache Operator
Apex Geode Cache Operator
• Geode provides efficient Events & Notifications
•  Register interest – update local copies
•  Continuous Query
•  Receive notification when Query condition met on server
•  E.g. SELECT * FROM /tradeOrder t WHERE t.price > 100.00
• Use Geode events notification framework to maintain & invalidate cache.
Implementation: GeodeCacheOperator
    maintains consistent cache based on subscribed keyset/query
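A cache kept consistent by query notifications can be modelled in a few lines of plain Java. In this sketch the predicate `price > 100.00` mirrors the example query and is evaluated locally only to keep the code self-contained; with Geode continuous queries the condition would be evaluated on the server. `QueryBackedCache` and its methods are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Predicate;

final class QueryBackedCache {
    private final Map<String, Double> cache = new HashMap<>(); // orderId -> price
    private final Predicate<Double> query;

    QueryBackedCache(Predicate<Double> query) {
        this.query = query;
    }

    // Invoked for every create or update event on the subscribed data.
    void onUpdate(String orderId, double price) {
        if (query.test(price)) {
            cache.put(orderId, price);   // entry satisfies the query: keep the local copy fresh
        } else {
            cache.remove(orderId);       // entry left the result set: invalidate it
        }
    }

    // Invoked when the entry is removed from the store.
    void onDestroy(String orderId) {
        cache.remove(orderId);
    }

    Map<String, Double> snapshot() {
        return new HashMap<>(cache);
    }
}
```

For example, after `onUpdate("o1", 150.0)` followed by `onUpdate("o1", 90.0)`, order `o1` is no longer cached because it stopped matching the condition.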
Geode Scan Operator
Apex Geode Scan Operator
• Function Execution provides Parallel Query Execution
• MapReduce-like execution: the function runs concurrently on members, and results are collected from members & sent to caller.
• Use case: Streaming application depending on large scan result from external store
Implementation: GeodeQueryOperator
    executes data-dependent queries on distributed region
    emits results in DAG
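The MapReduce-like scan can be approximated in plain Java with a thread pool: each partition is filtered concurrently, and the partial results are merged for the caller. This sketch does not call Geode's function execution service; `ParallelScan`, the list-of-lists partition layout, and the thread pool standing in for cluster members are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Predicate;

final class ParallelScan {
    // Filters every partition concurrently, then merges partial results in partition order.
    static <T> List<T> scan(List<List<T>> partitions, Predicate<T> filter) {
        ExecutorService members = Executors.newFixedThreadPool(Math.max(1, partitions.size()));
        try {
            List<Future<List<T>>> partials = new ArrayList<>();
            for (List<T> partition : partitions) {
                partials.add(members.submit(() -> {
                    List<T> hits = new ArrayList<>();
                    for (T row : partition) {
                        if (filter.test(row)) {
                            hits.add(row);
                        }
                    }
                    return hits;
                }));
            }
            List<T> result = new ArrayList<>();
            for (Future<List<T>> partial : partials) {
                try {
                    result.addAll(partial.get());
                } catch (InterruptedException | ExecutionException e) {
                    throw new IllegalStateException("scan failed", e);
                }
            }
            return result;
        } finally {
            members.shutdown();
        }
    }

    public static void main(String[] args) {
        List<List<Double>> prices = List.of(List.of(99.0, 120.5), List.of(101.0, 10.0));
        System.out.println(scan(prices, p -> p > 100.0)); // prints [120.5, 101.0]
    }
}
```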
Join the Apache Geode Community!
•  Check out: http://geode.incubator.apache.org
•  Subscribe: user-subscribe@geode.incubator.apache.org
•  Download: http://geode.incubator.apache.org/releases/
Questions ???
Thank You …

Apex & Geode: In-memory streaming, storage & analytics