Ingestion in data pipelines with Managed Kafka Clusters in Azure HDInsight

Published in: Technology

  1. Azure HDInsight: cloud Spark and Hadoop service for your enterprise (Spark, Hive, MR, LLAP, Kafka, HBase, Storm). Reliable open-source with a 99.9% availability SLA and monitoring (OMS). Visual Studio, IntelliJ and Eclipse support for developers and data scientists. Enterprise-grade security: Kerberos, Apache Ranger. Install and use big data applications from the Azure Marketplace. *IDC study “The Business Value and TCO Advantage of Apache Hadoop in the Cloud with Microsoft Azure HDInsight”
  2. and many more…
  3. • Managed Kafka clusters with a 99.9% SLA
     • Native integration with Azure Managed Disks, allowing for exponentially lower costs and higher scale
     • Scalable on-demand clusters: Kafka clusters with 16 TB/node and ZooKeeper up and running in 15 minutes
     • Rack awareness for Kafka on the Azure cloud
     • Alerting and predictive cluster maintenance through Azure Operations Management Suite
     • Extensibility via one-click deployment of leading ISVs such as StreamSets
     • Disaster recovery support via MirrorMaker
     • Deploy end-to-end streaming pipelines with Storm, Spark and Storage via automated ARM templates in the same VNet
  4. Modern Data Warehouse: real-time analytics. Stages: Ingest, Store, Prep & Train, Model & Serve, Intelligence. [Diagram components: unstructured data, Azure HDInsight (Kafka), Azure storage, Azure Databricks (Spark), Azure HDInsight (Spark), SQL DW, Azure HDInsight (LLAP), analytic dashboards.]
  5. Kafka is a distributed, horizontally scalable, fault-tolerant pub-sub store. [Diagram labels: Producer 1, IoT Hub, Storm, Spark Streaming; Broker 1, Broker 2, Broker 3; ZK 1, ZK 2, ZK 3; Topic 1 and Topic 2.]
  6. Set up the broker configuration. Publish the message. The consumer reads the messages.
  7. Open-source stream processing on Azure HDInsight. [Diagram labels: IoT Hubs, Azure Gateway Services, Azure VNet boundary, real-time applications, long-term storage, real-time dashboards.]
  8. Siphon on HDInsight Kafka
     • 8 million events per second peak ingress
     • 800 TB (10 GB per sec) ingress per day
     • 1,800 production Kafka brokers; 450 topics
     • 15 sec 99th-percentile latency
     • Key customer scenarios: Ads Monetization (Fast BI); O365 Customer Fabric NRT – Tenant & User insights; Bing NRT Operational Intelligence; Presto (Fast SML) interactive analysis; Delve Analytics
     [Chart: Siphon Data Volume (Ingress and Egress), throughput in GBps, Jan-15 to Dec-16; series: volume published, volume subscribed.]
     [Chart: Siphon Events per second (Ingress and Egress), throughput in millions of events per second, Jan-15 to Dec-16; series: EPS in, EPS out.]
  9. • Getting Started with Kafka for HDInsight
     • Structured Streaming with HDInsight Kafka and Spark
     • Deploy HDInsight Kafka + Storm
     • Stream data from on-premise to HDInsight Kafka in the cloud
     • https://academy.microsoft.com/en-us/professional-program/big-data/
     • https://www.pluralsight.com/courses/spark-kafka-cassandra-applying-lambda-architecture
     • https://azure.microsoft.com/en-us/blog/announcing-apache-kafka-for-azure-hdinsight-general-availability/
     • https://azure.microsoft.com/en-us/blog/announcing-public-preview-of-apache-kafka-on-hdinsight-with-azure-managed-disks/
  10. https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-apache-kafka-high-availability
  11. [Chart: Kafka Cost Estimator, cost in $ vs. throughput in MBps; series: non-managed disks, managed disks.] [Chart: Kafka scale forecast, number of Kafka nodes vs. throughput in MBps; series: Kafka nodes (OS VHDs), Kafka nodes (managed disks).]
  12. https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-apache-kafka-mirroring
  13. https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-apache-kafka-connect-vpn-gateway
  14. https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-apache-kafka-connect-vpn-gateway [Diagram label: Azure VNet boundary.]
  15. Message rate: 10,000 messages/sec. Message size: 150 KB upper bound. Replica count: 3. Retention policy: 12 hours.
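
Combined with the 16 TB/node figure from slide 3, the workload on slide 15 implies a rough storage footprint. The Python sketch below works through that arithmetic. It assumes decimal units and sizes for retained data only, ignoring throughput limits, free-space headroom and index overhead, so the node count is a lower bound and not an HDInsight sizing recommendation. The function names are illustrative and do not come from the deck.

```python
import math

# Workload figures from slide 15.
MESSAGES_PER_SEC = 10_000
MAX_MESSAGE_KB = 150      # upper bound, so results are worst case
REPLICA_COUNT = 3
RETENTION_HOURS = 12

# Per-node disk capacity quoted on slide 3.
DISK_TB_PER_NODE = 16

# Assumption: decimal units (1 GB = 1,000,000 KB, 1 TB = 1,000 GB).
KB_PER_GB = 1_000_000
GB_PER_TB = 1_000


def ingress_gb_per_sec(messages_per_sec, message_kb):
    """Producer ingress before replication, in GB/s."""
    return messages_per_sec * message_kb / KB_PER_GB


def retained_tb(ingress_gbps, replicas, retention_hours):
    """Data held on disk across all replicas for the retention window, in TB."""
    seconds = retention_hours * 3600
    return ingress_gbps * seconds * replicas / GB_PER_TB


def min_nodes_for_storage(total_tb, tb_per_node):
    """Smallest node count whose combined disks can hold total_tb."""
    return math.ceil(total_tb / tb_per_node)


ingress = ingress_gb_per_sec(MESSAGES_PER_SEC, MAX_MESSAGE_KB)
stored = retained_tb(ingress, REPLICA_COUNT, RETENTION_HOURS)
nodes = min_nodes_for_storage(stored, DISK_TB_PER_NODE)

print(f"Ingress before replication: {ingress:.1f} GB/s")
print(f"Retained data incl. replicas: {stored:.1f} TB")
print(f"Minimum nodes at {DISK_TB_PER_NODE} TB/node: {nodes}")
```

Under these assumptions the worst case is 1.5 GB/s of ingress, 194.4 TB retained across three replicas over 12 hours, and at least 13 nodes for storage alone.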

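Slides 5 and 6 describe Kafka as a pub-sub store in which producers publish messages to topics and consumers read them. The Python sketch below is a toy in-memory model of that flow, not the Kafka client API: `ToyTopic` and `ToyConsumer` are hypothetical names, and CRC32 is used here only as a stand-in for a real key partitioner. It illustrates two ideas: a topic is a set of append-only partitions, and each consumer tracks its own read offsets.

```python
import zlib


class ToyTopic:
    """A topic modelled as a fixed number of append-only partition logs."""

    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def publish(self, key, value):
        """Append a record; the same key always lands in the same partition."""
        index = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[index].append((key, value))
        offset = len(self.partitions[index]) - 1
        return index, offset


class ToyConsumer:
    """A consumer that remembers how far it has read in each partition."""

    def __init__(self, topic):
        self.topic = topic
        self.offsets = [0] * len(topic.partitions)

    def poll(self):
        """Return all records not yet seen by this consumer and advance offsets."""
        records = []
        for i, log in enumerate(self.topic.partitions):
            records.extend(log[self.offsets[i]:])
            self.offsets[i] = len(log)
        return records


topic = ToyTopic("telemetry", num_partitions=3)
topic.publish("device-1", "temp=20")
topic.publish("device-2", "temp=31")
topic.publish("device-1", "temp=21")

dashboard = ToyConsumer(topic)
for key, value in dashboard.poll():
    print(key, value)
```

Because reading does not remove records, a second `ToyConsumer` created later would still see all three messages, while a second `poll()` on `dashboard` returns an empty list until something new is published. Records for one key keep their publish order, but order across keys is not guaranteed.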