Great Ideas… Simple Solutions
Data Ingestion Platform (DiP)
Neeraj Sabharwal @allaboutbdata
About me
Xavient Corporate Overview
• Head of Cloud, Data & Analytics @Xavient
• Spent a couple of years @Hortonworks
• Over a decade in the Cloud & Data domain
• Started career as an Oracle DBA
Disclosure: more memes coming up…
Agenda
• Platform
• Data Access
• Hybrid Cloud
Before we start…
** Near real time is OK, as I am easy-going, but no more waiting hours or days for data
Problem
Diagram labels: UI/API, Platform, Data Access, Cloud
No… near real-time access
Shifting gears: let's get technical
Streaming Blueprint
Data Collection
Messaging Tier
Streaming Engine
Analysis Tier
In-memory Data Store
Data Access
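The tiers named on this slide can be wired together in a few lines of plain Python. This is only a toy sketch: the "user,amount" event format, the deque standing in for the messaging tier, and every function name are assumptions made for illustration, not part of DiP.

```python
from collections import deque

def collect(raw_lines):
    """Data collection tier: normalise raw input lines."""
    for line in raw_lines:
        yield line.strip()

messaging_tier = deque()   # messaging tier (stand-in for a real broker)
in_memory_store = {}       # in-memory data store

def analyze(queue, store):
    """Streaming engine / analysis tier: keep a running total per user."""
    while queue:
        user, amount = queue.popleft().split(",")
        store[user] = store.get(user, 0) + int(amount)

def total_for(user):
    """Data access tier: read the latest aggregate."""
    return in_memory_store.get(user, 0)

# Hypothetical events in "user,amount" form.
for event in collect(["alice,5\n", "bob,3\n", "alice,2\n"]):
    messaging_tier.append(event)
analyze(messaging_tier, in_memory_store)
```

Each tier only talks to its neighbour, which is what lets a real deployment swap in a different broker or engine without touching the rest.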
Messaging Bus
• Open-source message broker
• Unified, high-throughput, low-latency platform for handling real-time data feeds
• Massively scalable pub/sub message queue architected as a distributed transaction log
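The "pub/sub queue architected as a log" idea can be illustrated with a minimal in-process sketch. It assumes a single topic, no partitions and no persistence. The class and method names (TopicLog, publish, poll) are hypothetical and do not belong to any broker's client API.

```python
from collections import defaultdict

class TopicLog:
    """Append-only log: publishers append, each subscriber keeps its own offset."""

    def __init__(self):
        self._records = []
        self._offsets = defaultdict(int)  # subscriber name -> next offset to read

    def publish(self, message):
        self._records.append(message)
        return len(self._records) - 1     # offset of the new record

    def poll(self, subscriber, max_records=10):
        start = self._offsets[subscriber]
        batch = self._records[start:start + max_records]
        self._offsets[subscriber] = start + len(batch)
        return batch

log = TopicLog()
for event in ["click", "view", "purchase"]:
    log.publish(event)

dashboard_batch = log.poll("dashboard")               # reads everything
archiver_first = log.poll("archiver", max_records=2)  # reads at its own pace
archiver_rest = log.poll("archiver")
```

Because records are never removed when read, any number of subscribers can consume the same feed independently, each resuming from its own offset.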
Emotions
Streaming engines
Storm - Distributed real-time computation system for processing large volumes of high-velocity data
Flink - Streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams
Apex - Enterprise-grade unified stream and batch processing engine
Spark Streaming - Brings Apache Spark's language-integrated API to stream processing, letting you write streaming jobs the same way you write batch jobs. It supports Java, Scala and Python
CTM
Platform (DiP)
Features
• Easy-to-use UI
• Multiple streaming engines
• Supports XML, JSON and TSV data formats
• Manual data entry via UI
• Upload files for batch processing
• Hybrid Cloud
• Batch and real-time views of data
• Data visualization and analytics
• YARN features
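As an illustration of the multi-format support listed above, the sketch below normalises an XML, JSON or TSV payload into one dictionary shape using only the Python standard library. The function name parse_record, the flat record layout and the field names are assumptions for the example. DiP's actual parsing code is not shown in these slides.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

def parse_record(payload, fmt, tsv_fields=None):
    """Turn one flat record in the given format into a dict."""
    if fmt == "json":
        return json.loads(payload)
    if fmt == "xml":
        root = ET.fromstring(payload)
        return {child.tag: child.text for child in root}
    if fmt == "tsv":
        # TSV carries no field names, so the caller supplies them.
        row = next(csv.reader(io.StringIO(payload), delimiter="\t"))
        return dict(zip(tsv_fields, row))
    raise ValueError(f"unsupported format: {fmt}")

# Hypothetical sample payloads describing the same record.
json_rec = parse_record('{"id": "1", "name": "sensor-a"}', "json")
xml_rec = parse_record("<record><id>1</id><name>sensor-a</name></record>", "xml")
tsv_rec = parse_record("1\tsensor-a", "tsv", tsv_fields=["id", "name"])
```

Normalising at the ingestion edge means the downstream streaming job can stay format-agnostic.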
Use Cases - Any Data
• Sentiment Analysis
• Log Analysis
• Clickstream Analysis
• Analyze Machine and Sensor Data
• Social Media and Customer Sentiment
UI
https://techblog.xavient.com/
What was in the previous slide? Is that for real?
No more memes… enough now :)
DiP Technology Stack
Source System: Web Client
Messaging System: Apache Kafka
Target System: HDFS, NoSQL, Apache Hive
Reporting System: Apache Phoenix, Apache Zeppelin
Streaming APIs: Apache Apex, Apache Flink, Apache Spark and Apache Storm
Programming Language: Java
IDE: Eclipse
Build tool: Apache Maven
Operating System: CentOS 7
DiP High Level Architecture
DiP using Storm
Apache Storm features
• Multiple processing paradigms - real-time, interactive and batch processing
• Reliable - each unit of data (tuple) is processed at least once or exactly once
• Fast and scalable - parallel computations run across a cluster of machines
• Fault-tolerant - workers are restarted automatically if they die
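The at-least-once guarantee mentioned for Storm can be pictured as "replay anything that was not acknowledged". The sketch below is plain Python, not Storm's API. It simplifies heavily by treating a handler that returns normally as an ack and an exception as a fail, and the names process_at_least_once and flaky_handler are hypothetical.

```python
from collections import deque

def process_at_least_once(tuples, handler, max_attempts=3):
    """Redeliver each tuple until the handler succeeds or attempts run out."""
    pending = deque((t, 1) for t in tuples)
    deliveries = []                       # every delivery attempt, replays included
    while pending:
        item, attempt = pending.popleft()
        deliveries.append(item)
        try:
            handler(item)                 # returning normally counts as an ack
        except Exception:
            if attempt >= max_attempts:
                raise
            pending.append((item, attempt + 1))  # failed: queue for replay
    return deliveries

seen = []

def flaky_handler(item):
    # Hypothetical handler that fails the first time it receives "b".
    if item == "b" and "b-failed" not in seen:
        seen.append("b-failed")
        raise RuntimeError("transient failure")
    seen.append(item)

deliveries = process_at_least_once(["a", "b", "c"], flaky_handler)
```

Here "b" is delivered twice, which is why handlers under at-least-once delivery should be idempotent. Exactly-once behaviour needs extra bookkeeping on top of this.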
DiP using Spark Streaming
Spark Streaming features
• Multiple processing paradigms - batch and interactive
• Ease of use - offers high-level operators in Java, Scala and Python
• Fault tolerance - lost work and operator state can be recovered with no extra code
• Code reusability - the same code can be reused for batch processing, to join streams against historical data, or to run ad-hoc queries on stream state
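The code-reusability point can be shown without Spark itself: one function holds the business logic and is applied both to a whole dataset and to small micro-batches cut from a stream. This is a plain-Python sketch and not the Spark API. The names count_events and micro_batches and the sample events are assumptions for illustration.

```python
from collections import Counter

def count_events(records):
    """Shared logic: count occurrences of each event type."""
    return Counter(r["type"] for r in records)

def micro_batches(stream, size):
    """Cut an iterator into small lists, mimicking micro-batch streaming."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

events = [{"type": t} for t in ["click", "view", "click", "click", "view"]]

# Batch mode: run the logic once over everything.
batch_result = count_events(events)

# Streaming mode: run the same logic per micro-batch and merge the results.
streaming_result = Counter()
for chunk in micro_batches(iter(events), size=2):
    streaming_result.update(count_events(chunk))
```

Both paths produce the same counts, which is the practical benefit of writing the logic once.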
DiP using Apex
Apache Apex features
• Modular - Malhar, a library of operators, comes bundled with Apex for quick development cycles
• Supports both stream and batch processing
• Supports operator exchange at runtime
• Supports fault tolerance and dynamic scaling
DiP using Flink
Apache Flink features
Multiple processing paradigms - distributed, stream and batch processing.
Several APIs for creating applications are supported:
• DataStream API for unbounded streams, embedded in Java and Scala
• DataSet API for static data, embedded in Java, Scala and Python
• Table API with a SQL-like expression language, embedded in Java and Scala
Fault tolerance for distributed computations over data streams
DiP-Druid Architecture (High Level)
Credit: https://imply.io/docs/latest/
https://techblog.xavient.com/kafka-druid-integration-with-ingestion-dip-real-time-data
Data Access
Apache Zeppelin / Custom UI
• Data stored on HDFS as Hive external tables
• Data stored on HBase as a Phoenix view
Custom UI “Co-Dev”
• Integrated with Elasticsearch
• Enterprise security and SSO
• Recommendation model based on user profile, tags and activity
• Chat
• Blog/Droplet features
• Task creation and follow-up
• Notifications
• Smartphone app
DiP @ Hallwaze.com
Get involved
https://github.com/XavientInformationSystems/Data-Ingestion-Platform
Co-Dev: Reach out if you want to customize the platform, choose the right streaming engine for your latency needs and use case, or build a custom UI/reporting layer.
Hybrid Cloud
Hadoop and Cloud
Apache Falcon
Diagram labels: DiP, Hadoop (on-prem), Hadoop (cloud)
Apache Falcon is a data management tool for overseeing data pipelines in Hadoop clusters. It can be used to replicate data from one cluster to another.
Kafka Mirroring
The Kafka mirroring feature creates a replica of an existing cluster, for example to replicate an active datacenter into a passive datacenter. Kafka provides the MirrorMaker tool for mirroring a source cluster into a target cluster.
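Conceptually, mirroring is a consumer on the source cluster feeding a producer on the target cluster. The sketch below models that loop in plain Python. The Cluster and Mirror classes are hypothetical stand-ins that ignore partitions, failures and configuration. They are not Kafka or MirrorMaker APIs.

```python
class Cluster:
    """Toy cluster: each topic is an in-memory list of records."""

    def __init__(self, name):
        self.name = name
        self.topics = {}

    def append(self, topic, record):
        self.topics.setdefault(topic, []).append(record)

    def read(self, topic, offset):
        return self.topics.get(topic, [])[offset:]

class Mirror:
    """Copies new records from source to target, remembering its position."""

    def __init__(self, source, target):
        self.source = source
        self.target = target
        self.positions = {}  # topic -> next source offset to copy

    def run_once(self, topic):
        offset = self.positions.get(topic, 0)
        new_records = self.source.read(topic, offset)
        for record in new_records:
            self.target.append(topic, record)
        self.positions[topic] = offset + len(new_records)
        return len(new_records)

on_prem = Cluster("on-prem")
cloud = Cluster("cloud")
on_prem.append("events", "e1")
on_prem.append("events", "e2")

mirror = Mirror(on_prem, cloud)
first_copied = mirror.run_once("events")   # copies e1 and e2
on_prem.append("events", "e3")
second_copied = mirror.run_once("events")  # copies only e3
third_copied = mirror.run_once("events")   # nothing new to copy
```

Tracking the source offset is what keeps repeated runs from duplicating records that were already mirrored.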
Kafka Mirroring - Hybrid Cloud Environment
Cassandra
Diagram labels: DiP, Cassandra (on-prem), Cassandra (cloud)
• RDBMS migration
• DSE Advanced Replication
• Kafka
WIP
• Integration with Kafka Connect and Kafka Streams
• Data munging and validation
• Machine learning
• Search - Elasticsearch, Solr
Thanks!
@allaboutbdata
nsabharwal@xavient.com

Real time data ingestion and Hybrid Cloud