Real time data ingestion and Hybrid Cloud

Apache Kafka, Spark, Flink, Apex, Druid, Cassandra ...Data Ingestion in real time and building hybrid cloud


  1. Great Ideas….Simple Solutions. Data Ingestion Platform (DiP). Neeraj Sabharwal, @allaboutbdata
  2. About me (Xavient Corporate Overview)
     • Head of Cloud, Data & Analytics @Xavient
     • Spent a couple of years @Hortonworks
     • Over a decade in the Cloud & Data domain
     • Started career as an Oracle DBA
     Disclosure: more memes coming up…
  3. Agenda: Platform, Data Access, Hybrid Cloud
  4. Before we start … (Data Ingestion Platform (DiP))
     ** Near real time is OK, as I am easygoing, but no more waiting hours or days for data
  5. Problem: UI/API, Platform, Data Access, Cloud. No…near real-time access
  6. Great Ideas….Simple Solutions. Shifting gears: let's get technical
  7. Streaming Blueprint: Data Collection, Messaging Tier, Streaming Engine, Analysis Tier, In-memory Data Store, Data Access
     ** Near real time is OK, as I am easygoing, but no more waiting hours or days for data
  8. Messaging Bus
     • Open-source message broker
     • Unified, high-throughput, low-latency platform for handling real-time data feeds
     • Massively scalable pub/sub message queue architected as a distributed transaction log
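The "pub/sub queue architected as a log" idea from slide 8 can be illustrated with a toy in-memory model. This is only a sketch under stated assumptions: `TinyLog`, `publish`, `poll` and the group names are hypothetical and are not part of the Apache Kafka API. The point it tries to show is that messages are appended once and each consumer group tracks its own read offset.

```python
from collections import defaultdict


class TinyLog:
    """Toy append-only log with per-group offsets (hypothetical, not Kafka's API)."""

    def __init__(self):
        self._records = []                 # the ordered, append-only log
        self._offsets = defaultdict(int)   # next offset to read, per consumer group

    def publish(self, message):
        """Append a message and return the offset it was stored at."""
        self._records.append(message)
        return len(self._records) - 1

    def poll(self, group, max_records=10):
        """Return unread messages for a group and advance only that group's offset."""
        start = self._offsets[group]
        batch = self._records[start:start + max_records]
        self._offsets[group] = start + len(batch)
        return batch


log = TinyLog()
for event in ["click:home", "click:cart", "click:checkout"]:
    log.publish(event)

# Two independent consumer groups read the same log at their own pace.
storm_batch = log.poll("storm-topology", max_records=2)
spark_batch = log.poll("spark-job")
storm_rest = log.poll("storm-topology")
```

Because reading never removes records, a second engine can be attached later and still see the full stream, which is one reason a log-style bus suits a platform that offers several streaming engines.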
  9. Emotions
  10. Streaming engines
     • Storm: distributed real-time computation system for processing large volumes of high-velocity data
     • Flink: streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams
     • Apex: enterprise-grade unified stream and batch processing engine
     • Spark Streaming: Apache Spark's language-integrated API for stream processing, letting you write streaming jobs the same way you write batch jobs. It supports Java, Scala and Python
  11. CTM
  12. Great Ideas….Simple Solutions. Platform (DiP)
  13. Features (Data Ingestion Platform)
     • Easy to use UI
     • Multiple streaming engines
     • Supports xml, json and tsv data formats
     • Manual data entry via UI
     • Upload files for batch processing
     • Hybrid Cloud
     • Batch and real-time views of data
     • Data visualization and analytics
     • YARN
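Supporting xml, json and tsv input usually means normalizing each payload into one common record shape before it reaches the streaming engine. The sketch below uses only the Python standard library; the `parse_record` helper, the `event` element and the `user`/`action` field names are assumptions made for illustration, not DiP's actual schema or code.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET


def parse_record(payload, fmt, tsv_fields=None):
    """Turn one xml, json or tsv payload into a flat dict (hypothetical helper)."""
    if fmt == "json":
        return json.loads(payload)
    if fmt == "xml":
        # Assumes a flat document: each child element becomes one field.
        root = ET.fromstring(payload)
        return {child.tag: child.text for child in root}
    if fmt == "tsv":
        if tsv_fields is None:
            raise ValueError("tsv input needs the column names")
        row = next(csv.reader(io.StringIO(payload), delimiter="\t"))
        return dict(zip(tsv_fields, row))
    raise ValueError(f"unsupported format: {fmt}")


from_json = parse_record('{"user": "u1", "action": "login"}', "json")
from_xml = parse_record("<event><user>u1</user><action>login</action></event>", "xml")
from_tsv = parse_record("u1\tlogin", "tsv", tsv_fields=["user", "action"])
```

All three calls yield the same dictionary, so downstream processing does not need to know which format the data arrived in.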
  14. Use Cases – Any Data
     • Sentiment Analysis
     • Log Analysis
     • Click Stream Analysis
     • Analyze Machine and Sensor Data
     • Social Media and Customer Sentiment
  15. UI: https://techblog.xavient.com/
  16. What was in the previous slide? Is that for real? No more memes… enough now :)
  17. DiP Technology Stack
     • Source System: Web Client
     • Messaging System: Apache Kafka
     • Target System: HDFS, NoSql, Apache Hive
     • Reporting System: Apache Phoenix, Apache Zeppelin
     • Streaming APIs: Apache Apex, Apache Flink, Apache Spark and Apache Storm
     • Programming Language: Java
     • IDE: Eclipse
     • Build tool: Apache Maven
     • Operating System: CentOS 7
  18. DiP High Level Architecture
  19. DiP using Storm (Apache Storm features)
     • Multiple processing paradigms: real-time, interactive and batch processing
     • Reliable: each unit of data (tuple) will be processed at least once or exactly once
     • Fast and scalable: parallel calculations are run across a cluster of machines
     • Fault-tolerant: workers are automatically restarted if they die
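The "at least once" guarantee on slide 19 is commonly achieved by replaying any tuple whose processing was not acknowledged. The following is a plain-Python sketch of that idea, not Storm code: `run_at_least_once` and `flaky_bolt` are hypothetical names, and a raised exception stands in for a failed or missing ack.

```python
from collections import Counter, deque


def run_at_least_once(tuples, process, max_attempts=5):
    """Deliver every tuple until it is processed without error (replay on failure)."""
    pending = deque((t, 1) for t in tuples)    # (tuple, attempt number)
    deliveries = Counter()
    while pending:
        item, attempt = pending.popleft()
        deliveries[item] += 1
        try:
            process(item, attempt)             # returning normally acts as the ack
        except Exception:
            if attempt >= max_attempts:
                raise
            pending.append((item, attempt + 1))  # failure: queue the tuple for replay
    return deliveries


seen = []


def flaky_bolt(item, attempt):
    """Records the tuple, then simulates a crash on the first attempt for 'b'."""
    seen.append(item)
    if item == "b" and attempt == 1:
        raise RuntimeError("simulated worker crash")


deliveries = run_at_least_once(["a", "b", "c"], flaky_bolt)
```

Tuple `b` is recorded twice in `seen`, which is the practical consequence of at-least-once delivery: nothing is lost, but downstream writes should be idempotent or deduplicated.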
  20. DiP using Spark Streaming (Spark Streaming features)
     • Multiple processing paradigms: batch and interactive
     • Ease of use: contains high-level operators written in Java, Scala and Python
     • Fault tolerance: lost work and operator state can be recovered with no extra code
     • Code reusability: the same code can be used for batch processing, to join streams against historical data, or to run ad-hoc queries on stream state
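The code-reusability point can be illustrated without Spark itself: if the transformation is written as a function over a collection of records, the same function can run once over a full data set or repeatedly over small micro-batches. This is a plain-Python sketch; `count_actions`, `micro_batches` and the sample records are assumptions for illustration and do not use the Spark API.

```python
from collections import Counter


def count_actions(records):
    """The shared business logic: count records per action."""
    return Counter(r["action"] for r in records)


def micro_batches(records, size):
    """Yield consecutive slices of the input, imitating a micro-batched stream."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


history = [
    {"user": "u1", "action": "login"},
    {"user": "u2", "action": "login"},
    {"user": "u1", "action": "purchase"},
    {"user": "u3", "action": "login"},
    {"user": "u2", "action": "logout"},
]

# Batch mode: one pass over everything.
batch_result = count_actions(history)

# Streaming mode: the same function per micro-batch, with results merged.
stream_result = Counter()
for batch in micro_batches(history, size=2):
    stream_result += count_actions(batch)
```

Both modes arrive at the same counts, which is the property that lets one code base serve batch and streaming views of the data.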
  21. DiP using Apex (Apache Apex features)
     • Modular: Malhar, a library of operators, comes bundled with Apex for quick development cycles
     • Supports both stream and batch processing
     • Supports operator exchange at runtime
     • Supports fault tolerance and dynamic scaling
  22. DiP using Flink (Apache Flink features)
     • Multiple processing paradigms: distributed stream and batch processing
     • Several APIs for creating applications are supported:
       - DataStream API for unbounded streams, embedded in Java and Scala
       - DataSet API for static data, embedded in Java, Scala, and Python
       - Table API with a SQL-like expression language, embedded in Java and Scala
     • Fault tolerance for distributed computations over data streams
  23. DiP-Druid Architecture (High Level)
     Credit: https://imply.io/docs/latest/
     https://techblog.xavient.com/kafka-druid-integration-with-ingestion-dip-real-time-data
  24. Data Access: Apache Zeppelin / Custom UI
     • Data stored on HDFS as Hive external tables
     • Data stored on HBase as Phoenix views
  25. Custom UI "Co-Dev"
     • Integrated with Elasticsearch
     • Enterprise security and SSO
     • Recommendation model based on user profile, tags and activity
     • Chat
     • Blog/Droplet features
     • Task creation and follow-up
     • Notifications
     • Smartphone app
  26. DiP @ Hallwaze.com
  27. Get involved
     https://github.com/XavientInformationSystems/Data-Ingestion-Platform
     Co-Dev: Reach out if you want to customize the platform or choose the right streaming engine based on latency, use case and custom UI/reporting.
  28. Great Ideas….Simple Solutions. Hybrid Cloud
  29. Hadoop and Cloud
  30. Apache Falcon
     Diagram labels: DiP, Hadoop, On-prem, Cloud, Hadoop
     Apache Falcon is a data management tool for overseeing data pipelines in Hadoop clusters. It can be used to replicate data from one cluster to another.
  31. Kafka Mirroring
     The Kafka mirroring feature is used to create a replica of an existing cluster, for example to replicate an active datacenter into a passive datacenter. Kafka provides the MirrorMaker tool for mirroring the source cluster into a target cluster.
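Conceptually, mirroring amounts to consuming from the source cluster and producing the same records to the target while remembering how far the copy has progressed. The sketch below models both clusters as Python lists; the `mirror` function and the variable names are hypothetical and do not reflect how MirrorMaker is implemented or configured.

```python
def mirror(source, target, committed_offset):
    """Copy records not yet mirrored and return the new committed offset."""
    new_records = source[committed_offset:]    # everything past the last copied record
    target.extend(new_records)                 # "produce" to the target cluster
    return committed_offset + len(new_records)


on_prem = ["e1", "e2"]     # stands in for a topic on the active (source) cluster
cloud = []                 # stands in for the same topic on the passive (target) cluster

offset = mirror(on_prem, cloud, committed_offset=0)
on_prem.append("e3")       # new data arrives on the source
offset = mirror(on_prem, cloud, committed_offset=offset)
```

Tracking the committed offset is what keeps a repeated run from copying the same records twice; in a hybrid setup the source would be the on-prem cluster and the target the cloud cluster, as on the slide that follows.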
  32. Kafka Mirroring – Hybrid Cloud Environment
  33. Cassandra
     Diagram labels: DiP, Cassandra, Cassandra, On-prem, Cloud
     • RDBMS migration
     • DSE advanced replication
     • Kafka
  34. WIP
     • Integration with Kafka Connect and Kafka Streams
     • Data munging, validation
     • Machine learning
     • Search: Elastic, Solr
  35. Thanks! @allaboutbdata, nsabharwal@xavient.com