Streaming Transformations Using Oracle Data Integration
Michael Rainey | BIWA Summit 2017
Introduction
• Michael Rainey - Technical Advisor
• Spreading the good word about Gluent products with the world
• Oracle Data Integration expertise
• Oracle ACE Director
• mRainey.co

we liberate enterprise data
What is “Streaming”?
• The processing and analysis of structured or unstructured data in real time
• Why streaming?
• When speed (velocity) of data is key
• Streaming data is processed in “time windows”, in memory, across a cluster of servers
• Examples:
• Calculating a retail buying opportunity
• Real-time cost calculations
• IoT data analysis
Streaming data - Apache Kafka
“Publish-subscribe messaging rethought as a distributed commit log”
Image source: kafka.apache.org/
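The “distributed commit log” idea can be illustrated with a broker-free toy in plain Python. Nothing below is Kafka's actual API; the class and method names are made up for illustration. The point is the semantics: producers only ever append to an ordered, immutable log, and each subscriber group tracks its own read offset independently.

```python
class ToyCommitLog:
    """Toy stand-in for a Kafka topic partition: an append-only, ordered log."""

    def __init__(self):
        self.records = []   # the "commit log"
        self.offsets = {}   # consumer group -> next offset to read

    def publish(self, record):
        # Producers only ever append; existing records are never changed.
        self.records.append(record)

    def consume(self, group):
        # Each subscriber group reads from its own offset, independently.
        start = self.offsets.get(group, 0)
        batch = self.records[start:]
        self.offsets[group] = len(self.records)
        return batch

log = ToyCommitLog()
log.publish({"order_id": 1, "amount": 25.00})
log.publish({"order_id": 2, "amount": 99.95})

print(log.consume("analytics"))   # both records
log.publish({"order_id": 3, "amount": 10.50})
print(log.consume("analytics"))   # only the new record
print(log.consume("billing"))     # a new group replays from offset 0
```

Because consumers own their offsets, many independent subscribers can read the same stream at their own pace, which is what makes the log usable as an enterprise data bus.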
Enterprise Data Bus
Stream processing - Apache Spark
• Scalable, fault-tolerant, high-throughput stream processing
• Spark Streaming receives live input data streams from various sources
• A continuous stream of data is known as a discretized stream, or DStream
• Data is divided into mini-batches and processed by the Spark engine
• Operations such as join, filter, map, count, windowed computations, etc. are used to transform data in-flight
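The mini-batch model can be sketched without a cluster: plain Python stands in for the Spark engine, and each list of events represents one batch interval. The event names and window sizes are illustrative, not taken from the deck.

```python
# Simulate a DStream: a sequence of mini-batches, one per time window.
batches = [
    ["click", "buy", "click"],        # window 1
    ["click"],                        # window 2
    ["buy", "buy", "click", "buy"],   # window 3
]

def process_batch(events):
    # filter + count, as Spark Streaming would apply per mini-batch
    buys = [e for e in events if e == "buy"]
    return len(buys)

per_window_buys = [process_batch(b) for b in batches]
print(per_window_buys)  # [1, 0, 3]

# A sliding window spanning 2 batch intervals, as in windowed computations:
windowed = [sum(per_window_buys[max(0, i - 1): i + 1])
            for i in range(len(per_window_buys))]
print(windowed)  # [1, 1, 3]
```

In real Spark Streaming the same per-batch and windowed logic is expressed with DStream operations (`filter`, `count`, `reduceByWindow`, etc.) and distributed across the cluster.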
Why Oracle Data Integration?
• Enterprises have invested heavily in ODI and/or GoldenGate
• Getting started with development languages (Python/pySpark, Java, etc.)
• Centralized metadata management
• Integrate with other data sources using a single interface
• Realized cost savings
• According to Gartner, custom coding brings a 200% increase in maintenance costs (https://www.gartner.com/doc/3432617/does-customcoded-data-integration-stack)
Streaming with Oracle Data Integration
• Real-time data replication
• Streaming integration: OGG -> Kafka
• Streaming integration: Kafka -> Spark Streaming
Relational database transactions to Kafka

Why GoldenGate with Kafka?
• GoldenGate…
• …is non-invasive
• …has checkpoints for recovery
• …moves data quickly
• …is easy to set up
Why Oracle Data Integrator with Spark Streaming?
• Heterogeneous sources and targets
• Built to integrate all data
• Flexibility
• Reusable code templates (Knowledge Modules)
• Reusable Mappings
• ODI can adapt to your data warehouse - and not the other way around
• Flow-based mappings
Getting started with streaming using Oracle Data Integration
Oracle GoldenGate for Big Data - Kafka Handler Setup
• Standard GoldenGate Extract / Pump processes to capture RDBMS data
• Replicat for Java parameter file & process group created and set up
• Kafka Producer properties and Kafka Handler configuration set up
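The capture side described above is classic GoldenGate; a rough GGSCI sketch follows, where process names, trail paths, and locations are illustrative placeholders, not values from the deck:

```
ADD EXTRACT EXT1, TRANLOG, BEGIN NOW
ADD EXTTRAIL ./dirdat/lt, EXTRACT EXT1

ADD EXTRACT PMP1, EXTTRAILSOURCE ./dirdat/lt
ADD RMTTRAIL ./dirdat/rt, EXTRACT PMP1
```

The Extract reads the transaction log, the Pump ships the local trail to the Big Data target host, and the Replicat for Java picks it up from the remote trail.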
GoldenGate for Kafka setup
• Kafka Handler properties
• Set properties for how GoldenGate interacts with Kafka
• Format, transaction vs. operation mode, etc.
• Kafka Producer configuration
http://mrainey.co/ogg-kafka-oow
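The Kafka Handler and producer settings live in the Replicat for Java properties files; a minimal sketch follows, where the topic name, file paths, and format/mode choices are placeholder values (consult the Oracle GoldenGate for Big Data documentation for the full property list):

```properties
# Kafka Handler properties (illustrative values)
gg.handlerlist=kafkahandler
gg.handler.kafkahandler.type=kafka
gg.handler.kafkahandler.KafkaProducerConfigFile=custom_kafka_producer.properties
gg.handler.kafkahandler.TopicName=oracle.transactions
gg.handler.kafkahandler.format=avro_op
gg.handler.kafkahandler.mode=tx

# custom_kafka_producer.properties (illustrative values)
bootstrap.servers=localhost:9092
acks=1
```

`mode` controls whether messages are emitted per transaction or per operation, and `format` picks the serialization (Avro, JSON, delimited text, etc.) of the change records published to the topic.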
Kafka and Oracle Data Integrator setup
Kafka and Oracle Data Integrator
• Create Model using Kafka Logical Schema
• Create Datastore
• Similar to a standard “File” datastore: define the file format and set up columns
• Only CSV is currently supported
• Future formats may include JSON, Avro, etc.
• Add Datastore to mapping
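Because the Kafka datastore currently supports only CSV, each message is a delimited record whose fields map onto the datastore's columns. That mapping can be sketched broker-free in plain Python (the column names here are illustrative, not from the deck):

```python
import csv
import io

# Columns as they might be defined on the ODI Kafka datastore (illustrative)
columns = ["order_id", "customer_id", "amount"]

def parse_message(message):
    """Parse one CSV-formatted Kafka message into a column/value dict."""
    row = next(csv.reader(io.StringIO(message)))
    return dict(zip(columns, row))

print(parse_message("1001,42,25.00"))
# {'order_id': '1001', 'customer_id': '42', 'amount': '25.00'}
```

Using the `csv` module rather than a plain `split(",")` keeps quoted fields containing commas intact, which matters once real data flows through the topic.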
Spark Streaming and Oracle Data Integrator
• Create Spark Data Server, Physical / Logical Schema
• Set Hadoop Data Server
• Add properties, such as checkpointing, asynchronous execution mode, etc.
• Additional properties can be added: http://spark.apache.org/docs/latest/configuration.html
• Spark Server is set up as the Staging location
• Source Datastore from Kafka, Oracle DB, etc.
• Target Datastore is Cassandra, Oracle DB, etc.
• Code generated by the KM is pySpark
• pySpark code can be added to filters, joins, and other components for transformations
• Additional languages (Scala, Java) may be coming soon
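The filter and join components compile down to pySpark operations applied to each mini-batch. The same per-batch logic can be sketched in plain Python, with (key, value) tuples standing in for RDD rows; the data, threshold, and lookup table are all illustrative:

```python
# One mini-batch of Kafka-sourced orders and a static customer lookup
orders = [(42, 25.00), (7, 99.95), (42, 10.50)]   # (customer_id, amount)
customers = {42: "ACME Corp", 7: "Initech"}

# Filter component: keep orders above a threshold (as a pySpark filter() would)
large = [(cid, amt) for cid, amt in orders if amt > 20.00]

# Join component: enrich with the customer dimension (as a pySpark join() would)
joined = [(cid, amt, customers[cid]) for cid, amt in large if cid in customers]
print(joined)  # [(42, 25.0, 'ACME Corp'), (7, 99.95, 'Initech')]
```

In the generated pySpark these would be `filter` and `join` calls on DStreams/RDDs, executed once per batch interval across the cluster rather than over an in-memory list.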
Spark Streaming and Oracle Data Integrator
• Enable the Streaming flag in the Physical design of a mapping.
• To generate Spark code, set the Execute On Hint option to use the Spark data server as the staging location for your mapping.
• Target IKM should not be set; the Spark-generated code will handle integration and load into the target.
Tracking the process
• When executing, the process will run continuously in the ODI Operator.
• If the connection between the ODI Agent and Spark Agent is lost, it will reestablish itself after recovery.
Recap
• Streaming is the “velocity” in data, a.k.a. “Fast Data”
• Oracle Data Integrator and Oracle GoldenGate provide a framework for the development and management of data streaming processes
• Big Data add-ons continue to support new technologies
• Build a streaming architecture using GoldenGate and ODI:
• Metadata management
• Integration of RDBMS data with “schema on read” data
• Build upon the skills in-house
thank you!