Nitin Verma, Pravin Mittal, and Maxim Lukiyanov (Microsoft)
This session presents our success story of enabling a large internal customer on Microsoft Azure's HBase service, along with the methodology and tools used to meet their high-throughput goals. We will also present how new features in HBase (such as BucketCache and MultiWAL) help our customers in the medium-latency/high-bandwidth cloud-storage scenario.
Optimizing Apache HBase for Cloud Storage in Microsoft Azure HDInsight
1. Optimizing HBase for Cloud Storage in Microsoft Azure HDInsight
Nitin Verma, Pravin Mittal, Maxim Lukiyanov
May 24th 2016, HBaseCon 2016
2. About Us
Nitin Verma
Senior Software Development Engineer – Microsoft, Big Data Platform
Contact: nitinver@microsoft
Pravin Mittal
Principal Software Engineering Manager – Microsoft, Big Data
Contact: pravinm@microsoft
Maxim Lukiyanov
Senior Program Manager – Microsoft, Big Data Platform
Contact: maxluk@microsoft
3. Outline
Overview of HBase Service in HDInsight
Customer Case Study
Performance Debugging
Key Takeaways
4. What is HDInsight HBase Service
On-demand cluster with a few clicks
Out-of-the-box performance
Supports both Linux & Windows
Enterprise SLA of 99.9% availability
Active health monitoring via telemetry
24/7 customer support
Unique Features
Storage is decoupled from compute
Flexibility to scale out and scale in
Write/read unlimited amounts of data irrespective of cluster size
Data is preserved and accessible even when the cluster is down or deleted
7. Azure Data Lake Storage: Built For Cloud
Maxim Lukiyanov, Ashit Gosalia
Secure: Must be highly secure to prevent unauthorized access (especially as all data is in one place).
Native format: Must permit data to be stored in its native format to track lineage and for data provenance.
Low latency: Must have low latency for high-frequency operations.
Multiple analytic frameworks: Must support multiple analytic frameworks (batch, real-time, streaming, machine learning, etc.); no one analytic framework can work for all data and all types of analysis.
Details: Must be able to store data with all details; aggregation may lead to loss of details.
Throughput: Must have high throughput for massively parallel processing via frameworks such as Hadoop and Spark.
Reliable: Must be highly available and reliable (no permanent loss of data).
Scalable: Must be highly scalable; when storing all data indefinitely, data volumes can quickly add up.
All sources: Must be able to ingest data from a variety of sources: LOB/ERP, logs, devices, social networks, etc.
9. Microsoft’s Real Time Analytics Platform
Modern self-service telemetry platform
Near real-time analytics
Product health and user engagement monitoring with custom dashboards
Performs large-scale indexing on HDInsight HBase
10. 4.01 million EVENTS PER SECOND AT PEAK
12.8 petabytes INGESTION PER MONTH
>500 million WEEKLY UNIQUE DEVICES AND MACHINES
450 + 2600 PRODUCTION + INT/SANDBOX SELF-SERVE TENANTS
1,600 STORAGE ACCOUNTS
500,000 AZURE STORAGE TRANSACTIONS / SEC
[Charts: Azure Storage traffic (TB ingress/hr) and Azure Storage transactions (millions/hr) across Table, Blob, and Queue, Feb 21 through Feb 23]
11. Results of Early HBase Evaluation
The customer had a very high throughput requirement for a key-value store
Performance was ~10X lower than their requirement
Bigger concern: throughput didn't scale when going from 15 to 30 nodes
12. Developing a Strategy
Understand the architecture
Run the workload
Collect metrics and profile the relevant components
Identify performance bottlenecks
Isolate/divide the problem (unit test); reproduce at lower scale
Make performance fixes
Fixed? If not, iterate
Iterative process; automation can save time
13. Pipeline of data ingestion
Data Ingestion Client App [PaaS]: 300 VMs, multiple storage accounts and queues
HDI Gateway & Load Balancer (2 VMs)
HBASE CLUSTER: REST servers and region servers; 30 x large worker nodes, 1000+ cores
Cloud Storage: medium latency, high bandwidth
REST request: batch size = 1000, row size = 1400 bytes
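For context, the slide only gives the workload shape: batches of 1000 rows of roughly 1400 bytes each, sent through the REST gateway. Below is a minimal sketch of what such a batched REST write could look like with the HBase REST client; the gateway host, port, table name, and column family are hypothetical placeholders, and the Put API shown (addColumn) is the HBase 1.x form.

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.rest.client.Client;
import org.apache.hadoop.hbase.rest.client.Cluster;
import org.apache.hadoop.hbase.rest.client.RemoteHTable;
import org.apache.hadoop.hbase.util.Bytes;

public class BatchedRestIngest {
    public static void main(String[] args) throws Exception {
        // Point the REST client at the gateway / load balancer that fronts the
        // HBase REST servers (host and port are placeholders).
        Cluster cluster = new Cluster().add("hdi-gateway.example.net", 8080);
        Client client = new Client(cluster);
        RemoteHTable table = new RemoteHTable(client, "events");  // hypothetical table name

        byte[] family = Bytes.toBytes("d");                       // hypothetical column family
        byte[] payload = new byte[1400];                          // ~1400-byte row, as on the slide

        // One batch of 1000 puts, mirroring the workload shape described above.
        List<Put> batch = new ArrayList<>(1000);
        for (int i = 0; i < 1000; i++) {
            Put put = new Put(Bytes.toBytes("rowkey-" + i));
            put.addColumn(family, Bytes.toBytes("v"), payload);
            batch.add(put);
        }

        // A single REST round trip; the REST server fans the rows out to the
        // region servers that own them.
        table.put(batch);
        table.close();
    }
}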
14. Initial Iterations
1. High CPU utilization with REST being top consumer
2. GZ compression was turned ON in REST
3. Heavy GC activity on REST and REGION processes
4. Starvation of REGION process by REST [busy wait for network IO]
Throughput improved by 10-30% after each iteration
15. Initial Iterations (contd.)
5. REST server threads waiting on network IO
Collected tcpdump on all the nodes of the cluster
Insight from tcpdump:
The REST server was fanning out each batch request to all the region servers (1 through 30)
The slowest region server governed the throughput
Used a SALT_BUCKET scheme to improve locality (see the sketch below)
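The deck doesn't show the salting code itself; the following is a minimal sketch of the general row-key salting idea under illustrative assumptions (bucket count, separator, and hash choice are ours): a small, deterministic prefix derived from the key spreads otherwise-sequential keys across regions pre-split on the salt prefix.

import org.apache.hadoop.hbase.util.Bytes;

public class SaltedRowKeys {
    // Number of salt buckets; typically aligned with the number of region servers.
    // 30 here is an assumption for the 30-node cluster on the slide.
    static final int SALT_BUCKETS = 30;

    // Prefix the original key with a stable, hash-derived bucket id so that
    // otherwise-sequential keys spread across regions pre-split on the prefix.
    static byte[] salted(String originalKey) {
        int bucket = (originalKey.hashCode() & Integer.MAX_VALUE) % SALT_BUCKETS;
        // Zero-padded prefix keeps ordering within each bucket, e.g. "07|device-123|..."
        return Bytes.toBytes(String.format("%02d|%s", bucket, originalKey));
    }

    public static void main(String[] args) {
        System.out.println(Bytes.toString(salted("device-123|2016-05-24T10:00")));
    }
}

Grouping the puts of a batch by salt prefix then keeps a single REST request from fanning out to every region server, so the slowest server no longer gates the whole batch.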
16. Improvement
Throughput improved by 2.75X
Measurement window = ~72 hours
Avg. Cluster CPU utilization = ~60%
But no scaling from a 30-node to a 60-node cluster
Time to get back to the architecture
17. Pipeline of data ingestion
Data Ingestion Client App [PaaS]: 300 VMs, multiple storage accounts and queues
HDI Gateway & Load Balancer (2 VMs)
HBASE CLUSTER: REST servers and region servers; 60 x large worker nodes, 1000+ cores
WASB: medium latency, high bandwidth
REST request: batch size = 1000, row size = 1400 bytes
Could the gateway be a bottleneck at such a high ingestion rate?
18. We Had a Gateway Bottleneck
And the guess was right!!
Collected perfmon data on the GW nodes
Core #0 was 100% busy
RSS (Receive Side Scaling) is a trick to balance the DPCs across cores
Performance improved, but not significantly
Both CPU and networking were bottlenecks
Time to scale up the gateway VM size
19. Configuring private gateway
We provisioned a custom gateway on large VMs using NGINX
We confirmed that the gateway issue was indeed fixed
The throughput problem was still not solved and continued to give us new puzzles
20. Pipeline of data ingestion
Data Ingestion Client App [PaaS]: 300 VMs, multiple storage accounts and queues
Custom NGINX gateway and load balancer (2 VMs)
HBASE CLUSTER: REST servers and region servers; 60 x D14 worker nodes, 1040 cores
WASB
Could the customer app be a bottleneck?
21. Pipeline of data ingestion
Data Ingestion Client App [PaaS]: 300 VMs, multiple storage accounts and queues
Custom NGINX gateway and load balancer (2 VMs): DISCONNECTED from HBase, RETURN 200 directly
HBASE CLUSTER: REST servers and region servers; 60 x D14 worker nodes, 1040 cores
WASB
22. New strategy
We divided the data pipeline into two parts and debugged them in isolation:
1) Client -> Gateway [solved]
2) REST -> Region Server -> WASB [unsolved]
For fast turnaround, we decided to use YCSB for debugging #2
We configured YCSB with the characteristics of the customer's workload
We ran YCSB locally inside the HBase cluster
23. YCSB Experiments
We suspected one of the following two:
1) REST
2) Azure Storage
We isolated the problem by replacing Azure Storage with local SSDs
We then compared the performance of REST vs. RPC
Results:
REST was clearly a bottleneck!
24. YCSB Experiments (contd.)
Root cause of bottleneck in REST:
• Profiling the REST Servers uncovered multiple threads that were blocked on
INFO/DEBUG logging.
• Limiting the logging to WARNING/ERROR level dramatically improved the REST
server performance and brought it very close to RPC.
Sample Stack:
Thread 11540: (state = BLOCKED)
- org.apache.log4j.Category.callAppenders(org.apache.log4j.spi.LoggingEvent) @bci=12, line=204 (Compiled frame)
- org.apache.log4j.Category.forcedLog(java.lang.String, org.apache.log4j.Priority, java.lang.Object, java.lang.Throwable) @bci=14, line=391 (Compiled frame)
- org.apache.log4j.Category.log(java.lang.String, org.apache.log4j.Priority, java.lang.Object, java.lang.Throwable) @bci=34, line=856 (Compiled frame)
- org.apache.commons.logging.impl.Log4JLogger.debug(java.lang.Object) @bci=12, line=155 (Compiled frame)
- org.apache.hadoop.hbase.rest.RowResource.update(org.apache.hadoop.hbase.rest.model.CellSetModel, boolean) @bci=580, line=225 (Compiled frame)
- org.apache.hadoop.hbase.rest.RowResource.put(org.apache.hadoop.hbase.rest.model.CellSetModel, javax.ws.rs.core.UriInfo) @bci=60, line=318 (Interpreted frame)
- sun.reflect.GeneratedMethodAccessor27.invoke(java.lang.Object, java.lang.Object[]) @bci=48 (Interpreted frame)
- sun.reflect.DelegatingMethodAccessorImpl.invoke(java.lang.Object, java.lang.Object[]) @bci=6, line=43 (Compiled frame)
- java.lang.reflect.Method.invoke(java.lang.Object, java.lang.Object[]) @bci=57, line=606 (Compiled frame)
- com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(java.lang.reflect.Method, java.lang.Object, java.lang.Object[]) @bci=3, line=60 (Interpreted frame)
- com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$ResponseOutInvoker._dispatch(java.lang.Object, com.sun.jersey.api.core.HttpContext) @bci=16, line=205 (Interpreted frame)
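In practice the fix is a log4j configuration change on the REST servers (raising the org.apache.hadoop.hbase.rest loggers to WARN); the snippet below illustrates the equivalent change programmatically, using the package name from the stack above, while the usual place for this setting is log4j.properties rather than code.

import org.apache.log4j.Level;
import org.apache.log4j.Logger;

public class RestServerLogLevel {
    public static void main(String[] args) {
        // Equivalent of setting "log4j.logger.org.apache.hadoop.hbase.rest=WARN"
        // in log4j.properties: INFO/DEBUG events no longer reach the appenders,
        // which is where the threads in the stack above were blocking.
        Logger.getLogger("org.apache.hadoop.hbase.rest").setLevel(Level.WARN);
    }
}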
25. YCSB Experiments (contd.)
RPC vs. REST after fixing the INFO message logging
We could saturate the SSD performance at 160K requests/sec throughput
This confirmed that the bottleneck in the REST server was solved
26. Back to Customer Workload
After limiting the logging level to WARN, throughput improved further by ~5.5X
This was a ~15X gain from the point where we started
The customer is happy and uses the HDInsight HBase service in production
They are able to meet their throughput goals with enough margin to scale further
27. Tools Utilized
Category | Tools on Windows | Tools on Linux for Java Process
System counters (CPU, memory, IO, process, etc.) | perfmon | mpstat, iostat, vmstat, sar, nload, glances
Networking | tcpdump | tcpdump
CPU profiling | kernrate, F1 sample, xperf | YourKit, jvmtop, jprof
CPU blocking issues | xperf, Concurrency Visualizer, PPA | jstack
Debugging large clusters | powershell, python, expect | bash, awk, python, screen, expect
29. Overcoming Storage Latency
HBase now has MultiWAL and BucketCache features
They are designed to minimize the impact of high storage latency
Parallelism and batching are the keys to hiding write latency (MultiWAL)
MultiWAL gives higher throughput with a lower number of region nodes
We achieve 500K inserts/sec with just 8 small region nodes for an IoT customer
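MultiWAL is a region-server setting rather than application code; the sketch below only illustrates the relevant hbase-site.xml keys (the number of WAL groups shown is an illustrative assumption), expressed through Configuration purely to keep the snippet self-contained.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class MultiWalSettings {
    public static void main(String[] args) {
        // These keys normally live in the region servers' hbase-site.xml;
        // Configuration.set() is used here only to show the key names.
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.wal.provider", "multiwal");           // multiple WAL pipelines per region server
        conf.set("hbase.wal.regiongrouping.numgroups", "4");  // illustrative group count
        System.out.println("WAL provider: " + conf.get("hbase.wal.provider"));
    }
}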
30. Overcoming Storage Latency (contd.)
What about read latency?
Caching and read-ahead are the keys to overcoming read latency
Cache-on-write helps applications that are temporal in nature
HDInsight VMs are backed with SSDs
The BucketCache feature can utilize the SSD as an L2 cache
BucketCache gives our customers a ~20X-30X gain in read performance
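Like MultiWAL, BucketCache is configured on the region servers; the keys below are the standard ones for a file-backed (SSD) L2 cache, with an illustrative path and size, again shown through Configuration only to keep the snippet self-contained.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class BucketCacheSettings {
    public static void main(String[] args) {
        // Region-server settings (normally in hbase-site.xml); the SSD path and
        // cache size are illustrative placeholders.
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.bucketcache.ioengine", "file:/mnt/ssd/bucketcache.data"); // file-backed L2 cache on local SSD
        conf.set("hbase.bucketcache.size", "8192");                               // cache size in MB (illustrative)
        conf.set("hbase.rs.cacheblocksonwrite", "true");                          // cache-on-write for temporal workloads
        System.out.println("BucketCache engine: " + conf.get("hbase.bucketcache.ioengine"));
    }
}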
31. Conclusion
The performance issue was quite complex; bottlenecks were hiding at several layers and components in the pipeline
Deeper engagement with customers helped in optimizing the HDInsight HBase service
The HDI team has been actively productizing the performance fixes
ADLS, MultiWAL, and BucketCache help in minimizing the latency impact
Understand the architecture and overall pipeline of data movement
Monitor the resource utilization of each layer in the pipeline
Profile the components with high resource utilization and identify hotspots
When resource utilization is low, identify blocking issues (if any)
Divide and Conquer – Develop a strategy to isolate the components that could be the culprit. Isolation makes debugging easier.
Iterative Process!!
Reproduced customer scenario with 30 worker nodes
Collected system metrics (CPU, Memory, IO, etc.) on all the worker nodes
Started our analysis with HBase
CPU consumption was very high on nearly all REST servers
We then profiled the REST servers and observed the following
Compression was ON by default (GZ filter) and was consuming ~70% of CPU
Heavy GC activity on REST and REGION servers; we had to tune certain GC-related parameters
REST servers were busy-waiting for network IOs; bumping the REGION server priority solved that issue
Tools like YourKit and JVMTop helped in uncovering efficiency issues
We noticed multiple threads in the REST server waiting on network IO
We performed a deep networking analysis using tcpdump and uncovered a locality issue with the row key
The REST server was fanning out each batch request to almost all the region servers
Overall throughput seemed to be governed by the slowest region server
We used a SALT_BUCKET scheme to improve the locality of batch requests
At this high ingestion rate, we suspected the HDI gateway of being a bottleneck and confirmed it by collecting perfmon data on both gateways
Core #0 was ~100% busy on both gateway nodes
Fixing RSS helped, but we started hitting network throttling
The network utilization on the gateway nodes (A2 instances) surpassed the Azure throttling limit
The custom GW gave us the ability to debug ingestion bottlenecks in the customer app
From custom GW rules, we could directly return success without sending data to the HBase cluster
We identified a few occasions where the client app wasn't sending enough load to HBase
We found scalability bottlenecks in the client app and fixed them with the customer's help
After the fixes, the client application was able to send ~2 Gbps of data (~15X more than before) to the GW nodes
But we couldn't push that much data into the HBase cluster
The next bottleneck was clearly in HBase