[Apache Kafka Meetup Japan #7] Kafka on Azure

Apache Kafka Meetup Japan #7 @LINE (2019/07/17)

https://kafka-apache-jp.connpass.com/event/134127/

https://satonaoki.wordpress.com/2019/07/18/kafka-azure/

Published in: Software

  1. 1. Apache Kafka Meetup Japan #7: Kafka on Azure ~ Let's try the managed Kafka services provided by Microsoft Azure ~ SATO Naoki / 佐藤 直生, Azure Technologist / Cloud Solution Architect, Microsoft. Twitter: @satonaoki / https://satonaoki.wordpress.com/
  2. 2. © Microsoft Corporation AI built-in | Most secure | Lowest TCO. Data warehouses, data lakes, operational databases. Industry leader 4 years in a row • #1 TPC-H performance • T-SQL query over any data • 70 percent faster than Aurora • More global reach than any other • No limits and 99.9 percent SLA • Easiest lift and shift with no code changes. The Microsoft offering: SQL Server, Hybrid, Azure Data Services. Security and performance | Flexibility of choice | Reason over any data, anywhere. Social, LOB, Graph, IoT, Image, CRM
  3. 3. © Microsoft Corporation The Azure data landscape: Azure Data Factory, Azure Import/Export service, Azure SDK, Azure CLI, Cognitive Services, Bot service, Azure Search, Azure Data Catalog, Azure ExpressRoute, Azure network security groups, Azure Functions, Visual Studio, Operations Management Suite, Azure Active Directory, Azure key management service, Azure Blob Storage, Azure Data Lake Store, Azure IoT Hub, Azure Event Hubs, Kafka on Azure HDInsight, Azure SQL Data Warehouse, Azure SQL DB, Azure Cosmos DB, Azure Analysis Services, Power BI, Azure Data Lake Analytics, Azure HDInsight, Azure Databricks, Azure HDInsight, Azure Databricks, Azure Stream Analytics, Azure ML, Azure Databricks, ML Server
  4. 4. © Microsoft Corporation The Azure Big Data landscape: Azure Data Factory, Azure Import/Export service, Azure SDK, Azure CLI, Cognitive Services, Bot service, Azure Search, Azure Data Catalog, Azure ExpressRoute, Azure network security groups, Azure Functions, Visual Studio, Operations Management Suite, Azure Active Directory, Azure key management service, Azure Blob Storage, Azure Data Lake Store, Azure IoT Hub, Azure Event Hubs, Kafka on Azure HDInsight, Azure SQL Data Warehouse, Azure SQL DB, Azure Cosmos DB, Azure Analysis Services, Power BI, Azure Data Lake Analytics, Azure HDInsight, Azure Databricks, Azure HDInsight, Azure Databricks, Azure Stream Analytics, Azure ML, Azure Databricks, ML Server
  5. 5. © Microsoft Corporation Solution scenarios Big Data and advanced analytics SQL Modern data warehousing “We want to integrate all our data—including Big Data—with our data warehouse” Advanced analytics “We’re trying to predict when our customers churn” Real-time analytics “We’re trying to get insights from our devices in real-time”
  6. 6. © Microsoft Corporation Real-time analytics Real-time analytics—also called stream analytics—is the practice of processing data as soon as it’s generated in order to enable very quick analysis and insight for timely action SQL Modern data warehousing “We want to integrate all our data—including Big Data—with our data warehouse” Advanced analytics “We’re trying to predict when our customers churn” Real-time analytics “We’re trying to get insights from our devices in real-time”
  7. 7. © Microsoft Corporation Stream analytics scenarios
  8. 8. © Microsoft Corporation Canonical operations Streaming Connect, collect, and store Ingest Process and analyze Analytics Connect, collect, and store Actions A B C
  9. 9. © Microsoft Corporation Big Data streaming pattern with Azure Real-time applications Real-time dashboards Sensors and IoT (unstructured) Event hubs IoT hub Kafka on HDInsight Azure Stream Analytics Storm on HDInsight Azure Databricks (Spark Streaming) Azure ML Studio R Server Azure Databricks (Spark ML) Machine learning Stream ingestion Long-term storage Stream analytics Data Lake Store SQL DB Cosmos DB Azure Blob Storage Business/custom apps (structured) Logs, files, and media (unstructured) Power BI
  10. 10. © Microsoft Corporation Apache Kafka on HDInsight
  11. 11. Apache Kafka on HDInsight: an open-source, scalable stream ingestion platform offered as a managed service on Azure HDInsight • Azure is the only public cloud to offer Apache Kafka as a managed service • Can be provisioned directly from the Azure Portal • Apache Kafka is one of the HDInsight cluster types • Clusters can be scaled within minutes • 99.9 percent SLA • No additional charge for running Kafka clusters • Out-of-box management using Azure Monitor Logs
  12. 12. © Microsoft Corporation Provisioning Apache Kafka on HDInsight. A typical HDInsight Kafka cluster consists of: • Three or more worker nodes (at least three for data high availability) • Two head nodes (for redundancy) • Three ZooKeeper nodes. Kafka is I/O heavy, so Azure Managed Disks are used for high throughput and more storage per node. Apache Kafka on HDInsight clusters with managed disks can be deployed straight from the Azure Portal. Disks or nodes can be configured during HDInsight cluster creation, up to 16 TB per node.
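
The node counts and the 16 TB per-node limit on slide 12 can be turned into a rough sizing check. The following is only a sketch: the class name `KafkaClusterPlan`, its fields, and the replication factor of 3 are illustrative assumptions and are not taken from the slides or from any HDInsight API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KafkaClusterPlan:
    """Hypothetical sizing helper; not part of any Azure SDK."""

    worker_nodes: int = 3             # brokers; the slide recommends at least three
    head_nodes: int = 2               # from the slide: two head nodes for redundancy
    zookeeper_nodes: int = 3          # from the slide: three ZooKeeper nodes
    disk_tb_per_worker: float = 16.0  # from the slide: up to 16 TB per node
    replication_factor: int = 3       # ASSUMPTION: not stated on the slide

    def validate(self) -> None:
        """Reject plans that contradict the limits quoted on the slide."""
        if self.worker_nodes < 3:
            raise ValueError("at least three worker nodes are needed for data high availability")
        if not 0 < self.disk_tb_per_worker <= 16.0:
            raise ValueError("disk capacity per worker must be in (0, 16] TB")
        if not 1 <= self.replication_factor <= self.worker_nodes:
            raise ValueError("replication factor must be between 1 and the number of workers")

    def total_nodes(self) -> int:
        return self.worker_nodes + self.head_nodes + self.zookeeper_nodes

    def raw_tb(self) -> float:
        """Total managed-disk capacity across all worker nodes."""
        return self.worker_nodes * self.disk_tb_per_worker

    def usable_tb(self) -> float:
        """Capacity left for unique data once every partition is replicated."""
        return self.raw_tb() / self.replication_factor


plan = KafkaClusterPlan()
plan.validate()
print(plan.total_nodes(), plan.raw_tb(), plan.usable_tb())
```

With the defaults, three fully provisioned workers hold 48 TB of raw disk, but only about 16 TB of unique data if every partition is kept in three copies.
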
  13. 13. Kafka for Azure HDInsight • Managed Kafka clusters with 99.9% service level SLA • Native integration with Azure Managed Disks. Allows for exponentially lower costs, and higher scale. • Scalable On Demand clusters - Kafka clusters with 16 TB/node and Zookeeper up and running in 15 minutes • Rack awareness for Kafka on the Azure cloud • Alerting and predictive cluster maintenance through Azure Monitor Logs • Extensibility via one click deploy of leading ISVs such as StreamSets • Disaster recovery support via MirrorMaker • Deploy End to End streaming pipelines with Storm, Spark, Storage via automated ARM templates in the same VNET.
  14. 14. Data Ingestion using Kafka on HDInsight. Kafka is a distributed, horizontally-scalable, fault-tolerant pub-sub store. [Diagram: Producer 1, IoT Hub, Storm, Spark Streaming; Brokers 1-3, each hosting Topic 1 and Topic 2; ZooKeeper nodes ZK 1-3]
  15. 15. Kafka: Producers and Consumers • Set up the broker configuration • Publish the message • The consumer reads the messages
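
The publish and read steps on slide 15 can be illustrated without a running cluster. The sketch below is an in-memory model only, not a Kafka client: `InMemoryBroker` and `Consumer` are hypothetical names, and the CRC32-based partitioner stands in for Kafka's own key hashing purely for illustration.

```python
import zlib
from collections import defaultdict


class InMemoryBroker:
    """Toy stand-in for a broker: each topic is a list of append-only partition logs."""

    def __init__(self, partitions_per_topic=3):
        self.partitions_per_topic = partitions_per_topic
        self._logs = defaultdict(lambda: [[] for _ in range(partitions_per_topic)])

    def publish(self, topic, key, value):
        """Append a record and return (partition, offset).

        Records with the same key always land in the same partition,
        so their relative order is preserved.
        """
        partition = zlib.crc32(key.encode("utf-8")) % self.partitions_per_topic
        log = self._logs[topic][partition]
        log.append((key, value))
        return partition, len(log) - 1

    def fetch(self, topic, partition, offset, max_records=100):
        """Return up to max_records records starting at offset; nothing is deleted."""
        return self._logs[topic][partition][offset:offset + max_records]


class Consumer:
    """Tracks its own read position per partition, independently of other consumers."""

    def __init__(self, broker, topic):
        self._broker = broker
        self._topic = topic
        self._offsets = defaultdict(int)

    def poll(self):
        records = []
        for partition in range(self._broker.partitions_per_topic):
            batch = self._broker.fetch(self._topic, partition, self._offsets[partition])
            self._offsets[partition] += len(batch)
            records.extend(batch)
        return records


broker = InMemoryBroker()
broker.publish("telemetry", "car-1", "speed=60")
broker.publish("telemetry", "car-1", "speed=65")
print(Consumer(broker, "telemetry").poll())
```

The design point the model tries to capture is that reading does not remove records: the broker keeps the log, and each consumer only advances its own offsets, so a second consumer can replay the same data from the beginning.
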
  16. 16. © Microsoft Corporation Choosing Apache Kafka on HDInsight: when Apache Kafka can be a good option (When you want… / Description) • A proven ingestion service: Apache Kafka is the de facto leader in the Big Data stream ingestion space. It's used by the who's who of modern internet companies. The "Powered By" page lists companies using Apache Kafka. • A hybrid, multi-cloud solution with choice of deployment models: You can run Apache Kafka in multiple ways: on-premises, as a managed service on Azure, as an IaaS solution on Azure VMs, or even on other public clouds, including AWS and Google Cloud. • An open-source solution: Kafka is an open-source product licensed under Apache License 2.0. It's implemented in Java and Scala. • A highly reliable, fault-tolerant, scalable service: Kafka is reported to scale to handle ingestion rates of 1.1 trillion messages a day at LinkedIn. Kafka is a horizontally scalable service; you can scale Apache Kafka on HDInsight by dynamically adding more nodes to the cluster. • Extensibility, with support for a large number of data sources and sinks: Kafka Connect is a tool for scalably and reliably streaming data between Apache Kafka and other systems. It makes it simple to quickly define connectors that move large collections of data into and out of Kafka. Pre-built connectors to a number of data sources are available. You can extend this list by building custom connectors.
  17. 17. Azure Gateway Services Open source Stream Processing on Azure HDInsight Real-time applications Long term storage Real-time dashboards IoT Hubs Azure VNet Boundary Connected Car Architecture Powered by HDInsight
  18. 18. Siphon on HDInsight Kafka • 8 million events per second peak ingress • 800 TB (10 GB per sec) ingress per day • 1,800 production Kafka brokers; 450 topics • 15 sec 99th percentile latency. Key customer scenarios: Ads Monetization (Fast BI) • O365 Customer Fabric NRT – Tenant & User insights • Bing NRT Operational Intelligence • Presto (Fast SML) interactive analysis • Delve Analytics. [Charts: Siphon Data Volume (Ingress and Egress), throughput in GBps; Siphon Events per second (Ingress and Egress), throughput in millions of events per sec]
  19. 19. © Microsoft Corporation Apache Spark 2.4 and Apache Kafka 2.1 support on Azure HDInsight https://azure.microsoft.com/updates/apache-spark-2-4-and-apache-kafka-2-1-support-on-azure-hdinsight/
  20. 20. © Microsoft Corporation Azure Event Hubs
  21. 21. © Microsoft Corporation Big Data streaming pattern with Azure Real-time applications Real-time dashboards Sensors and IoT (unstructured) Event hubs IoT hub Kafka on HDInsight Azure Stream Analytics Storm on HDInsight Azure Databricks (Spark Streaming) Azure ML Studio R Server Azure Databricks (Spark ML) Machine learning Stream ingestion Long-term storage Stream analytics Data Lake Store SQL DB Cosmos DB Azure Blob Storage Business/custom apps (structured) Logs, files, and media (unstructured) Power BI
  22. 22. © Microsoft Corporation Azure Event Hubs: a highly scalable, fully-managed telemetry ingestion service. Azure Event Hubs: Scale and performance ✓ Input Capacity: 1 MB/s per TU* ✓ Output Capacity: 2 MB/s per TU* ✓ Latency: 50 ms avg, 99% < 100 ms ✓ Events/second: 1,000 ✓ Max message size: 256 KB. *In Azure Event Hubs, capacity is purchased in throughput units (TU). Add TUs to increase capacity. [Diagram: Event publisher, Event Hubs with three partitions, three readers, Event consumer]
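
The per-TU figures on slide 22 lend themselves to a quick capacity estimate. This is a sketch under stated assumptions: the function name `required_throughput_units` is hypothetical, and it assumes that the 1,000 events/second figure is an ingress limit per throughput unit and that the most constrained dimension decides the TU count.

```python
import math

# Per-throughput-unit figures quoted on the slide.
INGRESS_MB_PER_S_PER_TU = 1.0
EGRESS_MB_PER_S_PER_TU = 2.0
INGRESS_EVENTS_PER_S_PER_TU = 1000  # ASSUMPTION: read as a per-TU ingress limit


def required_throughput_units(ingress_mb_per_s, events_per_s, egress_mb_per_s=0.0):
    """Estimate how many TUs cover the given load (hypothetical helper)."""
    if min(ingress_mb_per_s, events_per_s, egress_mb_per_s) < 0:
        raise ValueError("rates must be non-negative")
    needed = max(
        ingress_mb_per_s / INGRESS_MB_PER_S_PER_TU,
        events_per_s / INGRESS_EVENTS_PER_S_PER_TU,
        egress_mb_per_s / EGRESS_MB_PER_S_PER_TU,
    )
    # Capacity is bought in whole units, and at least one is always needed.
    return max(1, math.ceil(needed))


# Many small events: the event rate, not the byte rate, drives the estimate.
print(required_throughput_units(ingress_mb_per_s=0.5, events_per_s=4000))
```

A workload of small messages can therefore need more TUs than its byte rate suggests, which is worth checking before settling on an initial TU count.
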
  23. 23. © Microsoft Corporation Azure Event Hubs capabilities overview • Based on the concept of event producers and consumers • Producers send data to an event hub via AMQP 1.0 or HTTPS • Consumers read event data from an event hub via AMQP 1.0 • SAS tokens identify and authenticate the event publisher • Data can be captured automatically in either Azure Blob Storage or Azure Data Lake Store (in Avro format) • Data is stored for 24 hours by default • 84 GB storage included per throughput unit
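
The 84 GB figure on slide 23 appears consistent with one throughput unit ingesting at its full 1 MB/s for the default 24-hour retention window. The arithmetic below is a worked check under that assumption (with 1 GB counted as 1,024 MB); the slides do not state how the number was derived, and `retained_gb` is a hypothetical helper.

```python
def retained_gb(ingress_mb_per_s=1.0, retention_hours=24, mb_per_gb=1024):
    """Data accumulated at a constant ingress rate over the retention window."""
    seconds = retention_hours * 60 * 60
    return ingress_mb_per_s * seconds / mb_per_gb


# 1 MB/s * 86,400 s = 86,400 MB, i.e. 84.375 GB when 1 GB = 1,024 MB.
print(retained_gb())
```
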
  24. 24. © Microsoft Corporation Partition consumer conceptual architecture HTTP AMQP Kafka
  25. 25. Event Hubs for Kafka Ecosystems
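
For context on slide 25: Event Hubs exposes a Kafka-compatible endpoint, so an existing Kafka client is typically pointed at it through configuration alone. The sketch below builds such a configuration as a plain dictionary using librdkafka-style property names. The helper name `event_hubs_kafka_config`, the namespace `my-namespace`, and the placeholder connection string are illustrative assumptions; the port, security protocol, and `$ConnectionString` username follow the Event Hubs for Kafka documentation linked on the final slide.

```python
def event_hubs_kafka_config(namespace, connection_string):
    """Build Kafka client settings for an Event Hubs namespace (hypothetical helper).

    `namespace` is the short namespace name, not the fully qualified host name.
    """
    if not namespace or "." in namespace:
        raise ValueError("pass the short namespace name, e.g. 'my-namespace'")
    if not connection_string.startswith("Endpoint=sb://"):
        raise ValueError("expected an Event Hubs connection string")
    return {
        # The Kafka endpoint listens on port 9093 of the namespace host.
        "bootstrap.servers": f"{namespace}.servicebus.windows.net:9093",
        "security.protocol": "SASL_SSL",
        "sasl.mechanism": "PLAIN",
        # The literal user name "$ConnectionString" tells the service that
        # the password field carries the connection string.
        "sasl.username": "$ConnectionString",
        "sasl.password": connection_string,
    }


# Placeholder values only; never commit a real connection string.
config = event_hubs_kafka_config(
    "my-namespace",
    "Endpoint=sb://my-namespace.servicebus.windows.net/;SharedAccessKeyName=send;SharedAccessKey=PLACEHOLDER",
)
print(config["bootstrap.servers"])
```

A dictionary of this shape could be passed to a librdkafka-based client; Java clients would express the same credentials through `sasl.jaas.config` instead of the separate user name and password properties.
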
  26. 26. © Microsoft Corporation Choosing Event Hubs: when Azure Event Hubs can be a good option (When you want… / Description) • To automatically scale capacity: Auto-inflate enables you to start small with the minimum required throughput units. It then scales automatically to the maximum limit of throughput units, depending on the increase in traffic. • A serverless solution: Azure Event Hubs is a serverless service. Your ability to fine-tune the performance is limited. • To integrate easily with Azure Stream Analytics: You can configure Azure Event Hubs as a streaming data input to Azure Stream Analytics via the Azure Portal without any coding. • A low-latency ingestion service: Azure Event Hubs latency can be less than 50 ms on average, with latency under 100 ms 99 percent of the time.* • To store ingested data in Azure Blob Storage or Azure Data Lake Store: Azure Event Hubs has built-in integration with these two Azure storage services. * Note that other services might have a similar latency, but there are no publicly available numbers.
  27. 27. Event Hubs in the real world: Halo 5 • 80 million requests per minute within 24 hours of release • All game telemetry and statistics run through Azure Event Hubs, are processed, and are sent back to the console • 1 Dedicated Capacity cluster (3 CUs) • Zero administration by the Halo team
  28. 28. Azure provides everything you need for streaming data – no matter how you do it
  29. 29. © Microsoft Corporation • Azure Free Account: https://azure.microsoft.com/free/ • Azure Marketplace (VM Images, VM Cluster Templates, Container Images, Helm Chart): https://azuremarketplace.microsoft.com/en-us/marketplace/apps?search=Kafka • Third Party Managed Kafka Clusters • Confluent Cloud: https://confluent.jp/confluent-cloud/ • Instaclustr: https://www.instaclustr.com/solutions/microsoft-azure/ • Azure HDInsight: https://docs.microsoft.com/azure/hdinsight/kafka/apache-kafka-introduction • Azure Event Hubs: https://docs.microsoft.com/azure/event-hubs/event-hubs-for-kafka-ecosystem-overview • Kafka Connect • Azure Blob Storage: https://docs.confluent.io/current/connect/kafka-connect-azure-blob-storage/ • Azure SQL Database (SQL Server): https://docs.confluent.io/current/connect/kafka-connect-cdc-mssql/ • Azure IoT Hub: https://docs.microsoft.com/en-us/azure/hdinsight/kafka/apache-kafka-connector-iot-hub Additional Information
  30. 30. © Copyright Microsoft Corporation. All rights reserved.
