HBaseCon 2015: Optimizing HBase for the Cloud in Microsoft Azure HDInsight

Microsoft Azure's Hadoop cloud service, HDInsight, offers Hadoop, Storm, and HBase as fully managed clusters. In this talk, you'll explore the architecture of HBase clusters in Azure, which is optimized for the cloud, and a set of unique challenges and advantages that come with that architecture. We'll also talk about common patterns and use cases utilizing HBase on Azure.

Published in: Software

  1. Optimizing HBase for the cloud in Microsoft Azure HDInsight. Maxim Lukiyanov, Microsoft, Senior Program Manager; Ashit Gosalia, Microsoft, Principal Software Engineering Manager. May 7th 2015, HBaseCon 2015.
  2. About Us. Maxim Lukiyanov: Senior Program Manager, Big Data team, Microsoft; contact email: maxluk@microsoft; @maxiluk. Ashit Gosalia: Principal Software Engineering Manager, Big Data team, Microsoft; contact email: ashitg@microsoft.
  3. Outline: Motivation, Use Cases, Performance Tuning, Demo.
  4. Context
  5. Lifetime of the service. Timeline: June 2014, Preview; Aug 2014, GA; May 2015, today.
  6. Lifetime of the service. Same timeline (June 2014, Preview; Aug 2014, GA; May 2015, today), with usage in compute hours: 4x growth since GA.
  7. Motivation: HBase can be expensive; cloud storage is cheap => lower-cost HBase on cloud storage!
  8. HBase in the cloud. Diagram: HBase region servers (RS) connected over the network to storage. Open questions: Latency? Bandwidth? Consistency?
  9. HBase in the cloud. Same diagram, with the answers: HDD-like latency; 50 Tb+ aggregate bandwidth [1]; strong consistency. [1] Azure Flat Network Architecture
  10. Throughput Optimization = Cost Minimization. Decoupling compute from storage removes the capacity constraint, which allows the cluster size to be minimized to the exact level of throughput the workload requires. Chart: price versus capacity for local VM storage and cloud storage.
  11. Cost Comparison
     • Price of 6-node cluster / month: 6 hs1.8xlarge VMs = $21,000 versus 6 Large VMs = $1,400
     • Price of 100 TB / month: Azure Blob Storage = $2,300
     • Total price of cluster / month: $21,000 versus $3,700, about 6x cheaper than local HDFS
  12. Use Cases
  13. Use Cases: key value store, sensor data store, time series store.
  14. Use case #1: key value store
     • Example: product recommendation engine. Map-reduce populates HBase with reference data; the recommendation service reads the reference data from HBase. 10 TB of data in a 2-node cluster.
     • Cloud optimization: in general, throughput requirements vary greatly by workload. In this extreme example: 40 nodes* -> 2 nodes, $9,000/month -> $700/month = 12x.
     • * All nodes in the use case examples are Azure A3: 4 cores, 7 GB RAM, 1 TB HDD.
  15. Use case #2: sensor data store
     • Example: metric store for an online advertising platform. A Storm cluster computes metrics (link click counts, etc.) over the stream of user activity events and stores the aggregates in HBase. 8 TB of data in a 4-node cluster.
     • Cloud optimization: 32 nodes -> 4 nodes, $7,000/month -> $1,100/month = 6x.
  16. Use case #3: time series store
     • Example: performance metric time series; 30 TB in a 40-node HBase cluster.
     • Cloud optimization, step 1: 120 nodes -> 40 nodes, $27,000/month -> $9,700/month = 2.8x. Row key: metric + timestamp. Region updates: (diagram not preserved).
     • Cloud optimization, step 2: 120 nodes -> 10 nodes, $27,000/month -> $2,800/month = 10x. 30TB -> 400TB. Row key: day + metric + timestamp. Region updates: (diagram not preserved).
  17. Performance Tuning
  18. Architecture diagram (labels only): gateways GW1 and GW2; ZK1/Master1, ZK2/Master2, ZK3/Master3; Region Servers 1 to N; Blob Storage Account, reached over REST; Head Node with YARN and M/R services; Web Front Ends 1 to N hosting a web app; HBase; all within a Virtual Network.
  19. Read Latency (99th percentile, milliseconds, by file system and block transfer size)
     • WASB, 4096 KB: 400
     • WASB, 256 KB: 75
     • WASB, 64 KB: 50 (+66% over HDFS)
     • HDFS: 30
     • Results from 2014: YCSB read test, 32 GB of 1 KB rows (non-cached reads), 3 nodes (A3): 4 cores, 7 GB memory, 1 TB HDD, 1 Gb NIC; 100 RPC handlers.
  20. Write Throughput
     • HFiles -> Azure Block Blobs
     • WAL -> Azure Page Blobs
     • Optimized for random writes
     • Coalesces parallel writes into a streaming write on the server side
     • Enabling parallel writes improves throughput
     • WASB parallel throughput is 15% lower than HDFS
     • Chart values: avg. HDFS 15 MBps; avg. parallel 13 MBps; avg. serial 9 MBps
     • YCSB write test, 4 GB of 100-byte rows, uncompressed, 3 nodes (A3): 4 cores, 7 GB memory, 1 TB HDD, 1 Gb NIC; 100 RPC handlers, 100 sync threads, 100 parallel writers.
  21. Announcement
  22. Announcing HBase on Azure Data Lake
     • Azure Data Lake: a hyper-scale repository for big data workloads
     • HDFS for the cloud
     • Unlimited capacity
     • High throughput, low latency
     • Strong consistency
     • Durable and highly available
     • Sign-up page for Public Preview: http://azure.microsoft.com/en-us/campaigns/data-lake/
  23. Demo
  24. Summary
     • Cost: Azure HBase offers a new low-cost deployment option, up to 10x cheaper for some workloads, through direct integration with cloud storage.
     • Performance: comparable to HDD-based clusters (66% worse storage-backed read latency).
     • Flexibility: easy to shrink or recreate a cluster without data loss.
     • Chart: price versus capacity for local VM storage and cloud storage.
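
The savings factors quoted on slides 11 and 14 to 16 are ratios of monthly prices. The sketch below recomputes them from the dollar figures on the slides. The function name `monthly_savings_factor` and the scenario labels are illustrative choices, not something from the talk.

```python
def monthly_savings_factor(before_usd: float, after_usd: float) -> float:
    """Return how many times cheaper the 'after' deployment is per month."""
    if after_usd <= 0:
        raise ValueError("after_usd must be positive")
    return before_usd / after_usd


# Slide 11: the cloud-storage side pays for compute and storage separately.
cloud_total = 1_400 + 2_300  # 6 Large VMs + 100 TB of Azure Blob Storage
local_total = 21_000         # 6 hs1.8xlarge VMs

# (before, after) monthly prices in USD, as printed on the slides.
scenarios = {
    "cost comparison (slide 11)": (local_total, cloud_total),
    "key value store (slide 14)": (9_000, 700),
    "sensor data store (slide 15)": (7_000, 1_100),
    "time series, step 1 (slide 16)": (27_000, 9_700),
    "time series, step 2 (slide 16)": (27_000, 2_800),
}

for name, (before, after) in scenarios.items():
    print(f"{name}: {monthly_savings_factor(before, after):.1f}x")
```

The computed factors (roughly 5.7x, 12.9x, 6.4x, 2.8x and 9.6x) are close to the rounded 6x, 12x, 6x, 2.8x and 10x shown on the slides.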
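
The performance deltas on slides 19 and 20 are relative changes against the HDFS baseline. This sketch recomputes them from the figures on the slides. The helper `percent_change` is an illustrative name, and the inputs are the rounded values as transcribed.

```python
def percent_change(baseline: float, measured: float) -> float:
    """Relative change of 'measured' against 'baseline', in percent."""
    return (measured - baseline) / baseline * 100.0


# Slide 19: 99th percentile read latency in milliseconds.
hdfs_read_ms = 30
wasb_read_ms = {"4096 KB": 400, "256 KB": 75, "64 KB": 50}
for block_size, latency in wasb_read_ms.items():
    delta = percent_change(hdfs_read_ms, latency)
    print(f"WASB {block_size}: {delta:+.0f}% read latency vs HDFS")

# Slide 20: average write throughput in MBps.
hdfs_write_mbps = 15
wasb_write_mbps = {"parallel": 13, "serial": 9}
for mode, throughput in wasb_write_mbps.items():
    delta = percent_change(hdfs_write_mbps, throughput)
    print(f"WASB {mode}: {delta:+.0f}% write throughput vs HDFS")
```

With these rounded inputs the 64 KB read case comes out at about +67%, in line with the +66% on the slide. Parallel writes come out at about -13% against the slide's 15%, a gap that is presumably down to rounding of the chart averages.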
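
Slide 16 contrasts two row key layouts: metric + timestamp and day + metric + timestamp. HBase stores rows sorted by the bytes of the row key, so the layout decides which key ranges receive new writes. The region-update diagrams did not survive in this transcript, so the reading below is an assumption: with the day first, all rows for a given day form one contiguous key range. The separator byte, the `%Y%m%d` day format, the 8-byte big-endian timestamp and both function names are hypothetical choices for illustration.

```python
import struct
from datetime import datetime, timezone

SEP = b"\x00"  # hypothetical field separator


def metric_first_key(metric: str, ts: int) -> bytes:
    """Row key laid out as metric + timestamp (slide 16, step 1)."""
    # Big-endian packing keeps byte order equal to numeric order.
    return metric.encode("utf-8") + SEP + struct.pack(">Q", ts)


def day_first_key(metric: str, ts: int) -> bytes:
    """Row key laid out as day + metric + timestamp (slide 16, step 2)."""
    day = datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y%m%d")
    return (day.encode("ascii") + SEP + metric.encode("utf-8")
            + SEP + struct.pack(">Q", ts))


day1 = int(datetime(2015, 5, 6, tzinfo=timezone.utc).timestamp())
day2 = int(datetime(2015, 5, 7, tzinfo=timezone.utc).timestamp())
samples = [(m, t) for m in ("cpu", "disk") for t in (day1, day2)]

# Sorting the byte strings mimics the order in which HBase stores the rows.
metric_sorted = sorted(metric_first_key(m, t) for m, t in samples)
day_sorted = sorted(day_first_key(m, t) for m, t in samples)

for key in metric_sorted:
    print("metric-first:", key)
for key in day_sorted:
    print("day-first:   ", key)
```

In the metric-first order each metric's rows span all days, so new data lands in every metric's key range. In the day-first order the newest day's rows sit together at the end of the key space.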