HBaseCon 2015: HBase at Scale in an Online and High-Demand Environment
Pinterest runs 38 different HBase clusters in production, doing a lot of different types of work—with some doing up to 5 million operations per second. In this talk, you'll get details about how we do capacity planning, maintenance tasks such as online automated rolling compaction, configuration management, and monitoring.
1.
HBase
Jeremy Carroll & Tian-Ying Chang
Online, at Low Latency
2.
30+ Billion Pins, categorized by people into more than 750 Million Boards
3.
Use Cases
Products running on HBase / Zen
1. SmartFeed: renders the Pinterest landing page for feeds
2. Provides personal suggestions based on a prefix for a user
3. Messages / Notifications
4. Interests
5.
Problem
Validating the new i2 platform
Chart: p99.9 Zen "get nodes" latency in milliseconds
6.
Garbage Collection
Stop-the-world events
Chart: GC pause time in milliseconds
7.
Problem #1: Promotion Failures
Issues seen in production
Heap fragmentation was causing promotion failures
The BlockCache at high QPS was causing the fragmentation
Keeping BloomFilters for hundreds of billions of rows from being evicted; evictions led to latency issues
Solutions
Move the BlockCache off-heap with CombinedCache / BucketCache, leaving heap for Memstore plus Blooms & Indexes
Monitor the % of the LRU heap used for blooms & indexes
Tune the BucketCache (a config sketch follows below)
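As a rough illustration of this tuning (the values here are assumptions, not Pinterest's production settings), the relevant hbase-site.xml keys for an off-heap CombinedCache look something like:

hbase.bucketcache.ioengine = offheap             # serve data blocks from an off-heap BucketCache
hbase.bucketcache.size = 4096                    # off-heap cache size in MB (illustrative)
hbase.bucketcache.combinedcache.enabled = true   # keep index & bloom blocks in the on-heap L1, data blocks in L2
hfile.block.cache.size = 0.2                     # shrink the on-heap LRU share, freeing heap for Memstore
# -XX:MaxDirectMemorySize on the RegionServer JVM must be sized above the bucket cache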
8.
Problem #2: Pause Time
Calculating Deltas
Started noticing that ‘user + sys’ time was very different from ‘real’ time
Random spikes of delta time, concentrated on hourly boundaries
Found resources online, but none of the fixes seemed to work
http://www.slideshare.net/cuonghuutran/gc-andpagescanattacksbylinux
http://yoshinorimatsunobu.blogspot.com/2014/03/why-buffered-writes-are-sometimes.html
Low user, low sys, high real
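A quick way to spot those "low user, low sys, high real" pauses is to diff the times straight out of the GC log. A minimal sketch, assuming the standard HotSpot "[Times: user=... sys=..., real=... secs]" format (the log path and thresholds are illustrative):

grep -o 'user=[0-9.]* sys=[0-9.]*, real=[0-9.]*' gc.log | \
  awk -F'[=, ]+' '{ user=$2; sys=$4; real=$6;
                    if (real > 10 * (user + sys) && real > 0.2)
                      print "suspect pause:", $0 }'   # wall clock far above CPU time => stalled outside the JVM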
9.
Cloud Problems
Noisy neighbors
Chart: /dev/sda average I/O time in milliseconds
10.
Formatting with EXT4
Logging to instance-store volume
Pause time in ms:
  100%     1251.60
  99.99%   1223.52
  99.9%     241.33
  90%       151.45
11.
Formatting with XFS
PerfDisableSharedMem & logging to ephemeral storage
Pause time in ms:
  100%      115.64
  99.99%    107.88
  99.9%      94.01
  90%        66.38
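For reference, a sketch of the ephemeral-disk setup implied here (the device name and mount point are assumptions for illustration):

mkfs.xfs -f /dev/xvdb                                  # format the instance-store volume with XFS
mkdir -p /mnt/ephemeral
mount -o defaults,noatime /dev/xvdb /mnt/ephemeral     # mount it and point local logging at it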
12.
Online Performance
JVM Options
-XX:+PerfDisableSharedMem
-Djava.net.preferIPv4Stack=true
Instance Configuration
Treat instance-store as read only
irqbalance >= 1.0.6
kernel 3.13+ for disk performance
Changes for success on EC2
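These options typically land in hbase-env.sh; a minimal sketch (HBASE_REGIONSERVER_OPTS is the stock hook, everything else about the file is assumed):

# conf/hbase-env.sh
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS \
  -XX:+PerfDisableSharedMem \
  -Djava.net.preferIPv4Stack=true"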
14.
Monitoring Zen
SLA driven from proxies
• Success criteria is a composite of the two clusters, due to backup requests
• A cluster is flagged as wounded when it generates too many backup requests
• Additional query metadata from proxy logs. Aids
in hot key analysis
99.99% Uptime
Diagram: Thrift proxies in front of two replicated Zen clusters, with backup requests sent to the second cluster
15.
Dashboards
Metrics, Metrics, Metrics
• Usually low value, except when you need them
• Dashboards for all H-Stack daemons
(DataNode, NameNode, etc..)
• Capture amazing amounts of data due to per-region metrics
• Per table / per region stats are very useful
Deep dives
16.
HBase Alerts
Maintenance primary driver
• A few clusters cause the majority of the alarms
• Mainly driven by lack of capacity planning
Hygiene / Capacity notifications
• Replacing failed / terminated nodes
• Measuring disk / cpu / network
• Canary analysis on code deployments
Common alerts
• Hot CPU / Disk Space
Analytics for HBase On-Call
17.
HotSpots
Spammers
• A few users request the same row over and over
• Rate limiting / caching
Real time analysis
• TCPDump is very helpful
tcpdump -i eth0 -w - -s 0 tcp port 60020 | strings
• Looking at per-region request stats
Code Issues
• Hard-coded key in product. Ex: Messages launch
Debugging imbalanced requests
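Extending the tcpdump one-liner above, a rough way to surface hot keys is to count the most frequently repeated strings seen on the RegionServer RPC port for a short window (60020 is the default port; the capture length and top-N are assumptions):

timeout 30 tcpdump -i eth0 -w - -s 0 tcp port 60020 | strings | \
  sort | uniq -c | sort -rn | head -20      # most frequently repeated strings ~= hottest keys

Per-region request counters are also available without packet captures, for example via the shell:

echo "status 'detailed'" | hbase shell      # dumps per-region read/write request counts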
19.
Capacity Planning
Managed Splitting
UniformSplitAlgo to pre-split regions
Salted keys for uniform distribution
Namespaced table on a shared cluster
Feature Cost
Some table attributes are more expensive
BlockSize, Bloom Filters, data block encoding (Prefix, FastDiff), Compression (Snappy)
Start small. Split for growth
Diagram: table_feat splitting into more regions as it grows
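Pre-splitting with UniformSplit can be done with the stock RegionSplitter utility; a sketch, where the table name, region count, and column family are illustrative:

hbase org.apache.hadoop.hbase.util.RegionSplitter \
  table_feat UniformSplit -c 64 -f d        # create table_feat pre-split into 64 regions with family 'd'

Salting the row key (for example, a short hash prefix) then keeps writes spread evenly across those pre-split regions.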
20.
Capacity Planning
Distributing Load
Balance is important to eliminate hot spots
Per-table load balancing
Disable auto-splitting and monitor region size
Scaling Metrics
CPU Utilization
HDFS Space
Regions per server (Memstore / Flushing)
All Others (Memory, Bandwidth, etc..)
Balancing regions to servers
Diagram: each table's regions balanced evenly across the region servers
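The per-table balancing and manual-split approach maps onto a couple of stock settings; a sketch of the relevant hbase-site.xml keys (the values are illustrative assumptions):

hbase.master.loadbalance.bytable = true        # balance each table's regions across servers independently
hbase.hregion.max.filesize = 107374182400      # ~100 GB: effectively disables automatic splitting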
21.
Launching Messages
• Started on a NameSpace table on a shared
cluster
• Ramped out to production with a decider (x%)
• Split table to get additional region servers
serving traffic
• Migrated to dedicated cluster
• As the experiment ramped up, capacity was added / removed as the feature was adopted
From development to production
https://www.flickr.com/photos/zachd1_618/13498790545
23.
Availability
Conditions for Failure
Termination notices from the underlying host
The default RF of 3 within a single zone is dangerous
Placement Groups may make this worse
Stability Patterns
Highest numbered instance type in a family
Multi Availability Zone + block placement
“Air Gapped” change management
Strategies for mitigating failure
Diagram: Master cluster in us-east-1a replicating to a Slave cluster in us-east-1e
24.
Disaster Recovery
Recovering Data
The system copies WALs and Snapshots to local HDFS
HDFS keeps <X> hours locally, then rotates to S3
Data can be restored from S3, HDFS, or another cluster, with rate-limited copying
Used frequently to test new configurations and to upgrade clusters side-by-side
hbaserecover -t unique_index -ts 2015-04-08-16-01-15 --source-cluster zennotifications
& bootstrapping new clusters
Diagram: Master (us-east-1a) replicating to a Slave (us-east-1e) and a DR Slave (us-east-1d), which writes to HDFS and S3
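The hbaserecover tool above is internal to Pinterest, but the underlying building blocks are stock HBase snapshots plus ExportSnapshot; a rough sketch where the snapshot name, S3 bucket, and mapper/bandwidth numbers are assumptions:

echo "snapshot 'unique_index', 'unique_index-2015-04-08'" | hbase shell
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
  -snapshot unique_index-2015-04-08 \
  -copy-to s3n://hbase-backups/unique_index \
  -mappers 16 -bandwidth 50                 # -bandwidth caps MB/s (available in newer HBase releases)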
25.
Flow Monitoring
Snapshot backup routine
• ZooKeeper based configuration for each cluster
• Backup metadata is sent to ElasticSearch for
integration with dashboards
• Monitoring and alerting around WAL copy &
snapshot status
Reliable cloud backups
26.
Flow Monitoring
DistCP all HLogs hourly
• Copy from slave cluster to avoid latency impact
• Using DistCP V2 which has throttling
• A spike in the graph means DistCP failed; a huge amount of data gets copied once the issue is fixed
• Using ElasticSearch to store all backup
metadata
Alert when backup pipeline failed
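The hourly WAL copy is plain DistCP V2 with its built-in throttling; a minimal sketch where the paths and the bandwidth cap are assumptions:

hadoop distcp -bandwidth 50 -m 20 \
  hdfs://zen-slave-cluster/hbase/.oldlogs \
  s3n://hbase-backups/zennotifications/hlogs/   # throttled (MB/s per map) copy from the slave cluster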
27.
Rolling Restart
Change management while retaining availability
Maintenance flow per server: Check Health → Get Lock → Get Server Locality → Check Server Status → Region Movement w/Threads → Stop RegionServer + Cooldown → Start RegionServer + Cooldown → Region Movement w/Threads → Verify Locality → Check Server Status → Update Status + Cooldown → Release Lock
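One stock way to script this kind of drain-and-restart is HBase's bundled graceful_stop.sh, which moves regions off a server, restarts it, and reloads the regions afterwards (not necessarily the exact tool used here). A sketch of the per-host loop, where the host list, health gate, and cooldown are simplified assumptions:

for host in $(cat regionservers.txt); do
  ./check_cluster_health.sh || exit 1                       # hypothetical health/lock gate before each node
  bin/graceful_stop.sh --restart --reload --debug "$host"   # drain regions, restart the RegionServer, move regions back
  sleep 300                                                 # cooldown before the next server
done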
28.
Rolling Compaction
Only one region per server is selected
• Avoid blocked region in queue
Controlled concurrency
• Control the space spike
• Reduce increased network and disk traffic
Controlled time to stop
• Stop before daytime traffic ramps up
• Stop if compaction is causing perf issues
Resume the next night
• Filter out regions that have already run compaction
Important for online facing clusters
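A stripped-down sketch of the same idea, compacting one selected region at a time and stopping before the morning ramp; the region list, deadline hour, and sleep are assumptions, and the real tool also tracks which regions were already compacted:

for region in $(cat regions_to_compact.txt); do          # one chosen region per server per pass (selection not shown)
  [ "$(date +%H)" -ge 6 ] && break                       # stop before daytime traffic ramps up
  echo "major_compact '$region'" | hbase shell           # the shell accepts a region name as well as a table name
  sleep 600                                              # limit concurrency, space spikes, and disk/network traffic
done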
29.
Next Challenges
Upgrade to latest stable (1.x) from 0.94.x w/no downtime
Increasing performance
• Lower latency
• Better compaction throughput
Regional Failover
• Cross Datacenter is in production now
• Need cross regional failover
Looking forward