Real time data ingestion and Hybrid Cloud


Apache Kafka, Spark, Flink, Apex, Druid, Cassandra ... Ingesting data in real time and building a hybrid cloud

Published in: Technology

  1. Great Ideas... Simple Solutions. Data Ingestion Platform (DiP). Neeraj Sabharwal, @allaboutbdata
  2. About me (Xavient Corporate Overview)
     • Head of Cloud, Data & Analytics @Xavient
     • Spent a couple of years @Hortonworks
     • Over a decade in the Cloud & Data domain
     • Started career as an Oracle DBA
     Disclosure: more memes coming up...
  3. Agenda: Platform, Data Access, Hybrid Cloud
  4. Before we start... ** Near real time is OK, as I am easygoing, but no more waiting hours or days for data
  5. Problem: UI/API, Platform, Data Access, Cloud. No... near real-time access
  6. Shifting gears: let's get technical
  7. Streaming Blueprint: Data Collection, Messaging Tier, Streaming Engine, Analysis Tier, In-memory Data Store, Data Access
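
The tiers in the blueprint can be pictured as a toy, single-process pipeline. This is only a sketch under stated assumptions: the names `collect`, `bus`, `run_engine`, `store` and `query` are hypothetical, the "user,amount" event format is invented for illustration, and in DiP each tier would be a separate system (for example a message broker and a streaming engine) rather than a Python object.

```python
from collections import deque

bus = deque()   # messaging tier: a queue standing in for a message broker
store = {}      # in-memory data store: running totals per user


def collect(raw_lines):
    """Data collection tier: clean up raw input lines (hypothetical format)."""
    for line in raw_lines:
        yield line.strip()


def run_engine():
    """Streaming engine + analysis tier: drain the bus and aggregate."""
    while bus:
        user, amount = bus.popleft().split(",")
        store[user] = store.get(user, 0) + int(amount)


def query(user):
    """Data access tier: read the latest aggregate for a user."""
    return store.get(user, 0)


# Push three illustrative events through every tier.
for event in collect(["alice,5\n", "bob,3\n", "alice,2\n"]):
    bus.append(event)
run_engine()
```

After the run, `query("alice")` reflects both of Alice's events, which is the "near real time" view the blueprint is after.
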
  8. Messaging Bus
     • Open-source message broker
     • Unified, high-throughput, low-latency platform for handling real-time data feeds
     • Massively scalable pub/sub message queue architected as a distributed transaction log
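
The "pub/sub queue architected as a log" idea can be illustrated with a minimal in-memory sketch. It is not a broker client and uses no real messaging API; `TopicLog`, `publish` and `poll` are hypothetical names, and the single Python list stands in for what would be a partitioned, replicated log in a real system.

```python
from collections import defaultdict


class TopicLog:
    """Append-only log; every consumer group keeps its own read offset."""

    def __init__(self):
        self.records = []                 # the log: records are only appended
        self.offsets = defaultdict(int)   # group name -> next offset to read

    def publish(self, message):
        self.records.append(message)
        return len(self.records) - 1      # offset assigned to the new record

    def poll(self, group, max_records=10):
        start = self.offsets[group]
        batch = self.records[start:start + max_records]
        self.offsets[group] = start + len(batch)   # advance this group only
        return batch


log = TopicLog()
for event in ("click", "view", "purchase"):
    log.publish(event)

analytics = log.poll("analytics")                     # one group reads everything
archiver_first = log.poll("archiver", max_records=2)  # another reads at its own pace
archiver_rest = log.poll("archiver")
```

Because reading does not remove records, each group sees the full stream independently; that is the property that lets several streaming engines consume the same feed.
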
  9. Emotions
  10. Streaming engines
      • Storm: distributed real-time computation system for processing large volumes of high-velocity data
      • Flink: streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams
      • Apex: enterprise-grade unified stream and batch processing engine
      • Spark Streaming: Apache Spark's language-integrated API for stream processing, letting you write streaming jobs the same way you write batch jobs. It supports Java, Scala and Python
  11. CTM
  12. Platform (DiP)
  13. Data Ingestion Platform features
      • Easy-to-use UI
      • Multiple streaming engines
      • Supports XML, JSON and TSV data formats
      • Manual data entry via UI
      • Upload files for batch processing
      • Hybrid cloud
      • Batch and real-time views of data
      • Data visualization and analytics
      • YARN
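
Supporting XML, JSON and TSV input usually comes down to normalizing each payload into one common record shape before it reaches the streaming engine. The sketch below shows one way to do that with the Python standard library; `parse_record`, the `tsv_fields` parameter and the sample `id`/`msg` fields are hypothetical and are not taken from DiP itself.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET


def parse_record(payload, fmt, tsv_fields=None):
    """Turn one XML, JSON or TSV payload into a flat dict (hypothetical helper)."""
    if fmt == "json":
        return json.loads(payload)
    if fmt == "xml":
        root = ET.fromstring(payload)
        # Assumes a flat document: one child element per field.
        return {child.tag: child.text for child in root}
    if fmt == "tsv":
        if tsv_fields is None:
            raise ValueError("tsv_fields is required for TSV input")
        row = next(csv.reader(io.StringIO(payload), delimiter="\t"))
        return dict(zip(tsv_fields, row))
    raise ValueError(f"unsupported format: {fmt}")


from_json = parse_record('{"id": "1", "msg": "hello"}', "json")
from_xml = parse_record("<record><id>1</id><msg>hello</msg></record>", "xml")
from_tsv = parse_record("1\thello", "tsv", tsv_fields=["id", "msg"])
```

All three calls yield the same dictionary, so downstream processing does not need to know which format the data arrived in.
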
  14. Use Cases: Any Data
      • Sentiment analysis
      • Log analysis
      • Click stream analysis
      • Analyze machine and sensor data
      • Social media and customer sentiment
  15. UI
  16. What was in the previous slide? Is that for real? No more memes... enough now :)
  17. DiP Technology Stack
      • Source System: Web Client
      • Messaging System: Apache Kafka
      • Target System: HDFS, NoSQL, Apache Hive
      • Reporting System: Apache Phoenix, Apache Zeppelin
      • Streaming APIs: Apache Apex, Apache Flink, Apache Spark and Apache Storm
      • Programming Language: Java
      • IDE: Eclipse
      • Build tool: Apache Maven
      • Operating System: CentOS 7
  18. DiP High Level Architecture
  19. DiP using Storm: Apache Storm features
      • Multiple processing paradigms: real-time, interactive and batch processing
      • Reliable: each unit of data (tuple) will be processed at least once or exactly once
      • Fast and scalable: parallel calculations are run across a cluster of machines
      • Fault-tolerant: workers are automatically restarted if they die
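
"At least once" has a practical consequence worth seeing: a tuple whose processing fails is replayed, so its side effects can happen more than once. The sketch below simulates that with a plain retry queue. It does not use the Storm API; `process_at_least_once`, `flaky_handler` and the simulated crash are hypothetical.

```python
from collections import deque


def process_at_least_once(tuples, handler, max_attempts=3):
    """Replay any tuple whose handler raised; return tuples that never succeeded."""
    pending = deque((item, 1) for item in tuples)
    failed = []
    while pending:
        item, attempt = pending.popleft()
        try:
            handler(item)                             # success counts as an "ack"
        except Exception:
            if attempt < max_attempts:
                pending.append((item, attempt + 1))   # replay later
            else:
                failed.append(item)
    return failed


seen = []                      # side effects observed by the handler
failures_left = {"b": 1}       # "b" fails once, then succeeds


def flaky_handler(item):
    seen.append(item)          # side effect happens before the simulated crash
    if failures_left.get(item, 0) > 0:
        failures_left[item] -= 1
        raise RuntimeError("simulated worker crash")


failed = process_at_least_once(["a", "b", "c"], flaky_handler)
```

Here `"b"` shows up twice in `seen`, which is why at-least-once pipelines typically need idempotent writes or deduplication downstream.
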
  20. DiP using Spark Streaming: Spark Streaming features
      • Multiple processing paradigms: batch and interactive
      • Ease of use: high-level operators available in Java, Scala and Python
      • Fault tolerance: lost work and operator state can be recovered with no extra code
      • Code reusability: the same code can be used for batch processing, to join streams against historical data, or to run ad-hoc queries on stream state
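
The code-reusability point can be shown without a cluster: write the business logic once, then apply it either to a whole dataset or to small batches cut from a stream. This is a plain-Python analogy, not the Spark API; `count_levels`, `micro_batches` and the sample log lines are hypothetical.

```python
from collections import Counter
from itertools import islice


def count_levels(lines):
    """Business logic written once: count log lines per level."""
    return Counter(line.split(" ", 1)[0] for line in lines)


def micro_batches(stream, size):
    """Cut an iterable into consecutive lists of at most `size` items."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


history = ["INFO start", "ERROR disk", "INFO ok", "ERROR net", "INFO done"]

# Batch mode: one pass over all the data.
batch_result = count_levels(history)

# Streaming mode: the same function applied to each micro-batch.
running = Counter()
for batch in micro_batches(history, size=2):
    running.update(count_levels(batch))
```

Both paths end with identical counts, which is the property that lets one codebase serve the batch and real-time views.
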
  21. DiP using Apex: Apache Apex features
      • Modular: Malhar, a library of operators, comes bundled with Apex for quick development cycles
      • Supports both stream and batch processing
      • Supports operator exchange at runtime
      • Supports fault tolerance and dynamic scaling
  22. DiP using Flink: Apache Flink features
      • Multiple processing paradigms: distributed, stream and batch processing
      • Several APIs for creating applications are supported:
        • DataStream API for unbounded streams, embedded in Java and Scala
        • DataSet API for static data, embedded in Java, Scala and Python
        • Table API with a SQL-like expression language, embedded in Java and Scala
      • Fault tolerance for distributed computations over data streams
  23. DiP-Druid Architecture (High Level). Credit:
  24. Data Access: Apache Zeppelin / Custom UI
      • Data stored on HDFS as Hive external tables
      • Data stored on HBase as a Phoenix view
  25. Custom UI "Co-Dev"
      • Integrated with Elasticsearch
      • Enterprise security and SSO
      • Recommendation model based on user profile, tags and activity
      • Chat
      • Blog/Droplet features
      • Task creation and follow-up
      • Notifications
      • Smartphone app
  26. DiP @
  27. Get involved. Co-Dev: reach out if you want to customize the platform, choose the right streaming engine based on latency and use case, or build a custom UI or reporting.
  28. Hybrid Cloud
  29. Hadoop and Cloud
  30. Apache Falcon [diagram: DiP, Hadoop (on-prem), Hadoop (cloud)]. Apache Falcon is a data management tool for overseeing data pipelines in Hadoop clusters. It can be used to replicate data from one cluster to another.
  31. Kafka Mirroring. The Kafka mirroring feature is used to create a replica of an existing cluster, for example to replicate an active datacenter into a passive datacenter. Kafka provides a mirror maker tool for mirroring the source cluster into a target cluster.
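
Conceptually, mirroring is a consumer on the source cluster feeding a producer on the target cluster, with a remembered position so that each run only copies what is new. The sketch below models both clusters as Python lists; it is not the mirror maker tool or any Kafka client API, and `mirror`, `on_prem` and `cloud` are hypothetical names.

```python
def mirror(source, target, position):
    """Copy records from source[position:] into target; return the new position."""
    new_records = source[position:]
    target.extend(new_records)          # "produce" to the target cluster
    return position + len(new_records)  # remember how far we have mirrored


on_prem = ["e1", "e2"]   # active datacenter (source)
cloud = []               # passive datacenter (target)

position = mirror(on_prem, cloud, 0)          # initial copy
on_prem.append("e3")                          # new data arrives on-prem
position = mirror(on_prem, cloud, position)   # incremental catch-up
```

Tracking the position is what keeps the second run from copying `e1` and `e2` again; in a hybrid cloud setup the same loop would run continuously between the on-prem and cloud clusters.
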
  32. Kafka Mirroring: Hybrid Cloud Environment
  33. Cassandra [diagram: DiP, Cassandra (on-prem), Cassandra (cloud)]
      • RDBMS migration
      • DSE Advanced Replication
      • Kafka
  34. WIP
      • Integration with Kafka Connect and Kafka Streams
      • Data munging, validation
      • Machine learning
      • Search: Elastic, Solr
  35. Thanks! @allaboutbdata