Real time data ingestion and Hybrid Cloud

Apache Kafka, Spark, Flink, Apex, Druid, Cassandra... Data ingestion in real time and building a hybrid cloud


  1. Great Ideas… Simple Solutions. Data Ingestion Platform (DiP). Neeraj Sabharwal, @allaboutbdata
  2. About me (Xavient Corporate Overview)
     • Head of Cloud, Data & Analytics @Xavient
     • Spent a couple of years @Hortonworks
     • Over a decade in the Cloud & Data domain
     • Started career as an Oracle DBA
     • Disclosure: more memes coming up…
  3. Agenda: Platform, Data Access, Hybrid Cloud
  4. Before we start… ** Near real time is OK, as I am easygoing, but no more waiting hours or days for data
  5. Problem. Slide labels: UI/API, Platform, Data Access, Cloud. Caption: No… near real-time access
  6. Shifting gears: let's get technical
  7. Streaming Blueprint: Data Collection, Messaging Tier, Streaming Engine, Analysis Tier, In-memory Data Store, Data Access. ** Near real time is OK, as I am easygoing, but no more waiting hours or days for data
  8. Messaging Bus
     • Open-source message broker
     • Unified, high-throughput, low-latency platform for handling real-time data feeds
     • Massively scalable pub/sub message queue architected as a distributed transaction log
  9. Emotions
  10. Streaming engines
     • Storm: distributed real-time computation system for processing large volumes of high-velocity data
     • Flink: streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams
     • Apex: enterprise-grade unified stream and batch processing engine
     • Spark Streaming: Apache Spark's language-integrated API for stream processing, which lets you write streaming jobs the same way you write batch jobs. It supports Java, Scala and Python
  11. CTM
  12. Platform (DiP)
  13. Data Ingestion Platform features
     • Easy-to-use UI
     • Multiple streaming engines
     • Supports XML, JSON and TSV data formats
     • Manual data entry via UI
     • Upload files for batch processing
     • Hybrid cloud
     • Batch and real-time views of data
     • Data visualization and analytics
     • YARN
  14. Use Cases – Any Data
     • Sentiment Analysis
     • Log Analysis
     • Click Stream Analysis
     • Analyze Machine and Sensor Data
     • Social Media and Customer Sentiment
  15. UI: https://techblog.xavient.com/
  16. What was in the previous slide? Is that for real? No more memes… enough now :)
  17. DiP Technology Stack
     • Source System: Web Client
     • Messaging System: Apache Kafka
     • Target System: HDFS, NoSQL, Apache Hive
     • Reporting System: Apache Phoenix, Apache Zeppelin
     • Streaming APIs: Apache Apex, Apache Flink, Apache Spark and Apache Storm
     • Programming Language: Java
     • IDE: Eclipse
     • Build tool: Apache Maven
     • Operating System: CentOS 7
  18. DiP High Level Architecture
  19. DiP using Storm. Apache Storm features:
     • Multiple processing paradigms: real-time, interactive and batch processing
     • Reliable: each unit of data (tuple) is processed at least once or exactly once
     • Fast and scalable: parallel calculations run across a cluster of machines
     • Fault-tolerant: workers are restarted automatically if they die
  20. DiP using Spark Streaming. Spark Streaming features:
     • Multiple processing paradigms: batch and interactive
     • Ease of use: contains high-level operators written in Java, Scala and Python
     • Fault tolerance: lost work and operator state can be recovered with no extra code
     • Code reusability: the same code can be used for batch processing, to join streams against historical data, or to run ad hoc queries on stream state
  21. DiP using Apex. Apache Apex features:
     • Modular: Malhar, a library of operators, comes bundled with Apex for quick development cycles
     • Supports both stream and batch processing
     • Supports operator exchange at runtime
     • Supports fault tolerance and dynamic scaling
  22. DiP using Flink. Apache Flink features:
     • Multiple processing paradigms: distributed, stream and batch processing
     • Several APIs for creating applications are supported:
       • DataStream API for unbounded streams, embedded in Java and Scala
       • DataSet API for static data, embedded in Java, Scala and Python
       • Table API with a SQL-like expression language, embedded in Java and Scala
     • Fault tolerance for distributed computations over data streams
  23. DiP-Druid Architecture (High Level). Credit: https://imply.io/docs/latest/ and https://techblog.xavient.com/kafka-druid-integration-with-ingestion-dip-real-time-data
  24. Data Access: Apache Zeppelin / Custom UI
     • Data stored on HDFS as Hive external tables
     • Data stored on HBase as Phoenix views
  25. Custom UI "Co-Dev"
     • Integrated with Elasticsearch
     • Enterprise security and SSO
     • Recommendation model based on user profile, tags and activity
     • Chat
     • Blog/Droplet features
     • Task creation and follow-up
     • Notifications
     • Smartphone app
  26. DiP @ Hallwaze.com
  27. Get involved: https://github.com/XavientInformationSystems/Data-Ingestion-Platform. Co-Dev: reach out if you want to customize the platform or choose the right streaming engine based on latency, use case and custom UI/reporting needs.
  28. Hybrid Cloud
  29. Hadoop and Cloud
  30. Apache Falcon. Diagram labels: DiP, Hadoop (on-prem), Hadoop (cloud). Apache Falcon is a data management tool for overseeing data pipelines in Hadoop clusters. It can be used to replicate data from one cluster to another.
  31. Kafka Mirroring. The Kafka mirroring feature is used to create a replica of an existing cluster, for example to replicate an active datacenter into a passive datacenter. Kafka provides the MirrorMaker tool for mirroring a source cluster into a target cluster.
  32. Kafka Mirroring – Hybrid Cloud Environment
  33. Cassandra. Diagram labels: DiP, Cassandra (on-prem), Cassandra (cloud).
     • RDBMS migration
     • DSE Advanced Replication
     • Kafka
  34. WIP
     • Integration with Kafka Connect and Kafka Streams
     • Data munging, validation
     • Machine learning
     • Search: Elasticsearch, Solr
  35. Thanks! @allaboutbdata nsabharwal@xavient.com
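
Slide 8 describes the messaging bus as a pub/sub queue architected as a distributed transaction log. As a rough illustration of that idea only, the Python sketch below models a single in-memory, append-only log with one read offset per consumer group. `TopicLog`, `publish`, `poll` and the group names are hypothetical names chosen for this sketch, not any real broker's API; a real broker also provides partitioning, replication and persistence, none of which is modeled here.

```python
from collections import defaultdict


class TopicLog:
    """Toy single-partition, in-memory stand-in for a log-based topic (hypothetical)."""

    def __init__(self):
        self._records = []                  # append-only; list index == offset
        self._next_offset = defaultdict(int)  # consumer group -> next offset to read

    def publish(self, value):
        """Append a record and return the offset it was written at."""
        self._records.append(value)
        return len(self._records) - 1

    def poll(self, group, max_records=10):
        """Return the next unread records for this group and advance its offset."""
        start = self._next_offset[group]
        batch = self._records[start:start + max_records]
        self._next_offset[group] = start + len(batch)
        return batch


log = TopicLog()
for event in ("click", "view", "purchase"):
    log.publish(event)

# Two independent consumer groups read the same records at their own pace.
analytics_batch = log.poll("analytics")
audit_first = log.poll("audit", max_records=2)
audit_rest = log.poll("audit")
```

Because each group keeps its own offset, both groups see all three records independently. This is the property that separates a log-based bus from a queue that deletes a message once it has been delivered.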
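
Slide 13 lists XML, JSON and TSV as supported input formats. The slides do not show how DiP parses them, so the following is only a sketch of one way to normalize all three into the same list of dictionaries with the Python standard library. The function names, the `id` and `msg` fields, and the `records`/`record` XML layout are assumptions made for this example.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET


def parse_json(payload):
    """Accept either a single JSON object or a JSON array of objects."""
    data = json.loads(payload)
    return data if isinstance(data, list) else [data]


def parse_tsv(payload):
    """Treat the first line as the header row; fields are tab-separated."""
    reader = csv.DictReader(io.StringIO(payload), delimiter="\t")
    return [dict(row) for row in reader]


def parse_xml(payload):
    """Assume a root element whose children are flat records (hypothetical layout)."""
    root = ET.fromstring(payload)
    return [{field.tag: field.text for field in record} for record in root]


PARSERS = {"json": parse_json, "tsv": parse_tsv, "xml": parse_xml}


def to_records(payload, fmt):
    """Dispatch on the declared format and return a list of dict records."""
    if fmt not in PARSERS:
        raise ValueError(f"unsupported format: {fmt}")
    return PARSERS[fmt](payload)


json_in = '{"id": "1", "msg": "hello"}'
tsv_in = "id\tmsg\n1\thello\n"
xml_in = "<records><record><id>1</id><msg>hello</msg></record></records>"

json_records = to_records(json_in, "json")
tsv_records = to_records(tsv_in, "tsv")
xml_records = to_records(xml_in, "xml")
```

Normalizing at the edge like this means the downstream streaming job only ever handles one record shape, whichever engine is selected.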
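
Slide 31 describes Kafka mirroring as replicating an active datacenter into a passive one. Conceptually, a mirroring process consumes from the source cluster and produces to the target cluster while remembering how far it has read. The sketch below simulates that loop in memory. `Cluster`, `Mirror` and `run_once` are hypothetical names for illustration and do not correspond to the MirrorMaker tool's actual interface or configuration.

```python
class Cluster:
    """Toy cluster: a mapping of topic name to an append-only list of records."""

    def __init__(self, name):
        self.name = name
        self.topics = {}

    def produce(self, topic, record):
        self.topics.setdefault(topic, []).append(record)

    def fetch(self, topic, offset):
        """Return every record in the topic at or after the given offset."""
        return self.topics.get(topic, [])[offset:]


class Mirror:
    """Copies new records for selected topics from a source to a target cluster."""

    def __init__(self, source, target, topics):
        self.source = source
        self.target = target
        self.positions = {topic: 0 for topic in topics}  # next source offset per topic

    def run_once(self):
        """Copy everything not yet mirrored; return the number of records copied."""
        copied = 0
        for topic in list(self.positions):
            pending = self.source.fetch(topic, self.positions[topic])
            for record in pending:
                self.target.produce(topic, record)
            self.positions[topic] += len(pending)
            copied += len(pending)
        return copied


on_prem = Cluster("on-prem")   # active side in this sketch
cloud = Cluster("cloud")       # passive replica
mirror = Mirror(on_prem, cloud, ["events"])

on_prem.produce("events", "e1")
on_prem.produce("events", "e2")
first_pass = mirror.run_once()

on_prem.produce("events", "e3")
second_pass = mirror.run_once()
third_pass = mirror.run_once()
```

Tracking a position per topic is what keeps repeated passes from copying the same record twice in this toy model. A real deployment also has to handle failures between the copy and the position update, which this sketch ignores.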