HBase, Meet Ops.
Ops, Meet HBase.
Kevin O'Dell
Jean-Daniel Cryans
Kevin O'Dell
Systems Engineer
Extensive experience supporting customers
Jean-Daniel Cryans
Software Engineer
Builds and runs HBase
Agenda
Leveraging previous knowledge (Kevin)
Getting to production (JD)
Goals
Help audience members understand how to
operate HBase.
Empower audience members when talking to
their own ops organization.
Leveraging previous knowledge
The stack, from top to bottom:
•Distributed Database
•Distributed Filesystem
•Java
•OS
•Network
•Hardware
Machines
•Industry-standard hardware
•No RAID controller (JBOD on the slaves)
•A homogeneous environment is not necessary
•Cores, spindles, and RAM
•Different configurations for different uses
Network
•Leverage the existing infrastructure
•No fancy equipment, no InfiniBand
•Redundancy is key: no single point of failure (SPOF)
•Top-of-rack (TOR) vs. core switches
•1Gb, 10Gb, and 40Gb Ethernet
•Bonding, VIPs, and other such complexities
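Link speed matters most when a slave dies and its data has to be copied again elsewhere. A back-of-the-envelope sketch, assuming 12 TB of data per node and that about 70% of line rate is actually usable (both numbers are illustrative, not measurements). It models a single link, so treat the result as a pessimistic upper bound: HDFS spreads re-replication across many nodes.

```python
def transfer_hours(data_tb: float, link_gbps: float,
                   efficiency: float = 0.7) -> float:
    """Hours to move `data_tb` terabytes over one link of `link_gbps` Gb/s.

    `efficiency` is an assumed fraction of line rate actually achieved.
    """
    bits = data_tb * 1e12 * 8                      # decimal TB -> bits
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 3600


# Hypothetical: a dead slave held 12 TB that must be copied again.
for gbps in (1, 10, 40):
    print(f"{gbps:>2} Gb/s: {transfer_hours(12, gbps):5.1f} hours")
```

The time scales inversely with link speed, which is the main argument for 10Gb once nodes carry many terabytes.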
Network
Operating system
•Production-ready Linux
•Swap vs. swappiness
•Basic filesystems: ext3/ext4
•Cgroups
•Recommended packages (sysstat, mce, iperf)
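Swap vs. swappiness: a common approach is to keep vm.swappiness low rather than rely only on removing swap, so the kernel avoids paging out the RegionServer heap (a swapped-out heap makes garbage collection pauses much longer). A minimal check, assuming a threshold of 10; the threshold is a policy choice, not an HBase requirement.

```python
from pathlib import Path

SWAPPINESS = Path("/proc/sys/vm/swappiness")  # Linux only


def parse_swappiness(text: str) -> int:
    """/proc/sys/vm/swappiness holds a single integer."""
    return int(text.strip())


def swappiness_ok(value: int, max_allowed: int = 10) -> bool:
    # Assumed policy: a low value makes the kernel prefer dropping page cache
    # over swapping out anonymous memory such as the JVM heap.
    return value <= max_allowed


if __name__ == "__main__":
    if SWAPPINESS.exists():
        value = parse_swappiness(SWAPPINESS.read_text())
        print(f"vm.swappiness={value}: {'OK' if swappiness_ok(value) else 'too high'}")
```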
Java
•User space
•Programs run in a contained JVM
•JVM requires tuning
•No memory leaks (usually), but overcommitting memory is easy
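Overcommitting usually happens one daemon at a time: each heap looks reasonable on its own, but their sum on a single slave does not fit in RAM. A minimal budget check; the node layout, the 4 GB OS reserve and the 10% overhead beyond each heap are all assumptions.

```python
def memory_left_gb(heaps_gb: dict[str, float], ram_gb: float,
                   os_reserve_gb: float = 4.0,
                   jvm_overhead: float = 1.1) -> float:
    """RAM left after the OS reserve and every JVM on one node.

    A negative result means the node is overcommitted. `jvm_overhead` is an
    assumed multiplier for memory a JVM uses beyond its -Xmx heap (thread
    stacks, class metadata, direct buffers).
    """
    committed = sum(heap * jvm_overhead for heap in heaps_gb.values())
    return ram_gb - os_reserve_gb - committed


# Hypothetical slave running HBase next to MapReduce.
node = {"DataNode": 1, "RegionServer": 16, "TaskTracker": 1,
        "8 task JVMs x 2 GB": 16}
left = memory_left_gb(node, ram_gb=32)
print(f"{left:.1f} GB left" if left >= 0 else f"overcommitted by {-left:.1f} GB")
```

With these hypothetical numbers the node is overcommitted by about 9.4 GB, even though no single heap looks too large.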
Distributed Filesystem (HDFS)
•Shared nothing
•User-level filesystem
•Not POSIX compliant
•Immutable (write-once) files
•Built-in redundancy
•Linear scalability
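Built-in redundancy has a price: every block is stored several times, so raw disk is not usable capacity. A minimal sizing sketch, assuming the default replication factor of 3 (dfs.replication) and 25% headroom (an assumption). Because nothing is shared, capacity grows linearly with the node count.

```python
def usable_capacity_tb(nodes: int, disks_per_node: int, disk_tb: float,
                       replication: int = 3,
                       headroom: float = 0.25) -> float:
    """Logical data (TB) that fits in HDFS on the given raw disks.

    Every block is stored `replication` times (3 is HDFS's default
    dfs.replication); `headroom` is an assumed fraction kept free for
    temporary data, compactions and growth.
    """
    raw_tb = nodes * disks_per_node * disk_tb
    return raw_tb * (1 - headroom) / replication


# Hypothetical cluster: 20 slaves with twelve 2 TB disks each (480 TB raw).
print(usable_capacity_tb(20, 12, 2))  # 120.0
```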
Distributed Filesystem
Distributed Database (HBase)
•Distributed hash table
•Get, put, delete, scan, and check-and-set (CaS)
•Denormalization is necessary
•Not a parallel database, just distributed
•Write-ahead log / data durability
•Master/slave replication
•ACID compliance at the row level
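A toy, single-process model of the data operations listed above, only to make their semantics concrete. It is not the HBase client API (the real client is Java, e.g. HTable's get, put, delete, getScanner and checkAndPut), and the class and method names here are hypothetical. It does mirror one real property: HBase keeps rows sorted by row key, which is what makes range scans possible.

```python
import bisect


class ToyTable:
    """Sorted map of row key -> value; a teaching model, not HBase."""

    def __init__(self):
        self._keys: list[bytes] = []          # kept sorted, like row keys
        self._rows: dict[bytes, bytes] = {}

    def put(self, row: bytes, value: bytes) -> None:
        if row not in self._rows:
            bisect.insort(self._keys, row)
        self._rows[row] = value

    def get(self, row: bytes):
        return self._rows.get(row)            # None if the row is absent

    def delete(self, row: bytes) -> None:
        if row in self._rows:
            del self._rows[row]
            self._keys.remove(row)

    def scan(self, start: bytes, stop: bytes):
        """Yield (row, value) for start <= row < stop, in row-key order."""
        i = bisect.bisect_left(self._keys, start)
        while i < len(self._keys) and self._keys[i] < stop:
            yield self._keys[i], self._rows[self._keys[i]]
            i += 1

    def check_and_put(self, row: bytes, expected, value: bytes) -> bool:
        """Write `value` only if the row holds `expected` (None = absent)."""
        if self._rows.get(row) != expected:
            return False
        self.put(row, value)
        return True
```

In HBase, check-and-set is atomic per row across concurrent clients; this toy is single-threaded, so atomicity is trivial here.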
Distributed Database
Getting to Production
Things HBase doesn't come with:
•Metrics
•Automation
•Alerting
Metrics
Tony was really excited to try his
new cluster
Metrics
You have no excuses:
•Ganglia
•Cacti
•OpenTSDB
•Hannibal
Metrics - Ganglia
Metrics - Cacti
Metrics - OpenTSDB
Metrics - Hannibal
Metrics
Metrics you want in your dashboards:
•Call queues
•IO wait
•Compaction queues
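IO wait comes from the kernel's cumulative CPU counters, so it has to be computed from two samples. A minimal sketch that reads the aggregate cpu line of Linux's /proc/stat (user, nice, system, idle, iowait, irq, softirq, steal); tools like Ganglia typically collect this already, so the point is only to show what the number means.

```python
def cpu_times(proc_stat: str) -> list[int]:
    """Return the aggregate 'cpu' counters (clock ticks) from /proc/stat text."""
    for line in proc_stat.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            # Keep user, nice, system, idle, iowait, irq, softirq, steal.
            # guest/guest_nice are already included in user/nice.
            return [int(x) for x in fields[1:9]]
    raise ValueError("no aggregate 'cpu' line found")


def iowait_percent(before: str, after: str) -> float:
    """Percentage of CPU time spent in iowait between two /proc/stat samples."""
    deltas = [b - a for a, b in zip(cpu_times(before), cpu_times(after))]
    total = sum(deltas)
    return 100.0 * deltas[4] / total if total else 0.0  # index 4 = iowait


if __name__ == "__main__":
    import time
    from pathlib import Path

    stat = Path("/proc/stat")  # Linux only
    if stat.exists():
        first = stat.read_text()
        time.sleep(1)
        print(f"iowait over the last second: {iowait_percent(first, stat.read_text()):.1f}%")
```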
Metrics - Call Queues
View of all the machines together
Metrics - Call Queues
Ceiling
Metrics - Call Queues
Breaking it down per node
Metrics - Call Queues
What’s up with this one?
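The cluster-wide graph hides a single hot node; breaking it down per node shows it. One way to flag "this one" automatically is to compare every node with the cluster median. A minimal sketch; the 3x factor, the floor and the sample values are assumptions, and the same check applies to IO wait or compaction queues.

```python
from statistics import median


def outlier_nodes(values: dict[str, float], factor: float = 3.0,
                  floor: float = 1.0) -> list[str]:
    """Names of nodes whose metric is above `factor` times the cluster median.

    `floor` keeps a near-zero median (an idle cluster) from flagging every
    node that has any load at all.
    """
    baseline = max(median(values.values()), floor)
    return sorted(node for node, v in values.items() if v > factor * baseline)


# Hypothetical call queue lengths, one sample per RegionServer.
call_queue = {"rs1": 2, "rs2": 3, "rs3": 1, "rs4": 40, "rs5": 2}
print(outlier_nodes(call_queue))  # ['rs4']
```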
Metrics - IO Wait
Same time, breaking it down per node
Metrics - IO Wait
Our machine is somewhere here...
Metrics - IO Wait
Showing the previous machine (it used to be yellow, sorry)
Metrics - Compaction Queues
View of all the machines together, different time
Metrics - Compaction Queues
Nice slope! Load is well distributed
Metrics - Compaction Queues
Oh...
Metrics - Compaction Queues
What is going on here?
Metrics
Want to learn more about metrics?
See:
“Using Metrics to Monitor and Debug Apache
HBase” (5:00pm-5:20pm) with Elliott Clark
23:59:60
Automation
How fast can you:
•Change an OS configuration on 100 machines?
•Kill one process on said machines?
•Reboot all your machines?
•Reboot all your machines one by one, with
some added configuration changes?
•Add 10 new fully configured nodes?
"Automation" - CSSH
Are you blind yet?
Automation - Puppet
Automation - Chef
Automation - Fabric
$ fab
Automation
Common automations:
•Rolling restart
•Adding/removing nodes
•Deploying new configurations
•Finer-grained region rebalancing
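The core of a rolling restart is a loop that restarts one node, waits until it is healthy again, and stops at the first failure so that a bad configuration only takes down a single node. A minimal sketch: `restart` and `is_healthy` are hypothetical callbacks you would implement with your automation tool (a Fabric task, for example). HBase also ships bin/rolling-restart.sh and bin/graceful_stop.sh, which are worth reading before writing your own.

```python
import time
from typing import Callable, Iterable


def rolling_restart(hosts: Iterable[str],
                    restart: Callable[[str], None],
                    is_healthy: Callable[[str], bool],
                    timeout_s: float = 300.0,
                    poll_s: float = 5.0) -> list[str]:
    """Restart hosts one at a time, waiting for each to report healthy.

    Raises RuntimeError at the first host that is not healthy within
    timeout_s, so a bad change stops after one node instead of spreading.
    Returns the hosts restarted, in order.
    """
    done = []
    for host in hosts:
        restart(host)
        deadline = time.monotonic() + timeout_s
        while not is_healthy(host):
            if time.monotonic() > deadline:
                raise RuntimeError(f"{host} not healthy after {timeout_s}s; "
                                   f"stopping after {len(done)} host(s)")
            time.sleep(poll_s)
        done.append(host)
    return done
```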
Alerting
HBase is just like any other system you are
running, so maybe you've heard of...
Alerting - Nagios
Alerting - Zabbix
Alerting
What to alert on:
•The metrics above (call/compaction queues, IO wait)
•Network bandwidth
•Disk space
•Number of regions
•Disk health reported by SMART (smartd)
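Most of these checks end up as small scripts that follow the Nagios plugin convention: print one line and exit with 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). A minimal disk space check; the 80%/90% thresholds are assumptions, chosen to fire well before HDFS or compactions run out of room.

```python
import shutil

# Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def disk_status(used: int, total: int, warn_pct: float = 80.0,
                crit_pct: float = 90.0) -> tuple[int, str]:
    """Map disk usage to a Nagios-style (exit code, one-line message) pair."""
    if total <= 0:
        return UNKNOWN, "DISK UNKNOWN - no capacity reported"
    pct = 100.0 * used / total
    if pct >= crit_pct:
        return CRITICAL, f"DISK CRITICAL - {pct:.0f}% used"
    if pct >= warn_pct:
        return WARNING, f"DISK WARNING - {pct:.0f}% used"
    return OK, f"DISK OK - {pct:.0f}% used"


def check_path(path: str) -> tuple[int, str]:
    """Check the filesystem holding `path`, e.g. one DataNode data directory."""
    usage = shutil.disk_usage(path)
    return disk_status(usage.used, usage.total)
```

A plugin wrapper would print the message returned by check_path() and exit with its code.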
Backup
Backup
No, you’re not the only one.
Now drop that gun.
Backup - Offline
If you can afford to take your cluster offline, possibly for an hour:
1. Shut down HBase
2. distcp the HBase data to another cluster or a separate folder
3. Restart HBase
* To shorten the downtime, you can run a first distcp before shutting down; if you do, run distcp -update -delete in step 2 so only the changes are copied.
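Writing the procedure down as an ordered command list makes it repeatable instead of typed by hand. A minimal sketch that only builds the commands; the NameNode addresses and paths are placeholders, it assumes the scripts run from the HBase install directory, and it assumes the backup directory does not exist before the first copy. distcp's -update copies only files that differ, and -delete removes files from the target that no longer exist in the source.

```python
import shlex


def offline_backup_plan(src: str, dst: str,
                        precopy: bool = True) -> list[list[str]]:
    """Ordered commands for the offline backup: (pre-copy), stop, copy, start.

    `src` is the cluster's HBase root directory (hbase.rootdir) and `dst`
    the backup location, assumed not to exist yet. With precopy=True, a
    first distcp runs while HBase is still serving, and the copy made
    during the downtime uses -update -delete to transfer only the changes.
    """
    plan = []
    if precopy:
        plan.append(["hadoop", "distcp", src, dst])
    plan.append(["bin/stop-hbase.sh"])
    final_copy = ["hadoop", "distcp"]
    if precopy:
        final_copy += ["-update", "-delete"]
    plan.append(final_copy + [src, dst])
    plan.append(["bin/start-hbase.sh"])
    return plan


# Placeholder NameNode addresses and paths.
for cmd in offline_backup_plan("hdfs://prod-nn:8020/hbase",
                               "hdfs://backup-nn:8020/backups/hbase"):
    print(shlex.join(cmd))
```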
Backup - Replication
1. Create another HBase cluster (it can be remote)
2. Alter the column families that need replication
3. Make sure the same tables exist on the slave cluster
* Replication is asynchronous: it isn't done inline with the inserts on the master cluster
* See "Apache HBase Replication" with Chris Trezzo at 5:20pm
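The HBase shell side of these steps can be generated rather than typed. Besides altering the families, the slave cluster has to be registered as a peer with add_peer. A minimal sketch; the peer id, the cluster key and the table/family names are placeholders. It assumes the tables already exist on the slave cluster (step 3) and that replication is enabled in both clusters' configuration (hbase.replication on HBase releases of this era). Each table is disabled around its alter, which is required when online schema changes are not enabled.

```python
def replication_script(peer_id: str, peer_cluster_key: str,
                       families: dict[str, list[str]]) -> str:
    """HBase shell commands that replicate the given families to one peer.

    peer_cluster_key has the form "zk_quorum:zk_port:znode_parent",
    e.g. "zk1,zk2,zk3:2181:/hbase".
    """
    lines = [f"add_peer '{peer_id}', '{peer_cluster_key}'"]
    for table, cfs in families.items():
        lines.append(f"disable '{table}'")
        for cf in cfs:
            # REPLICATION_SCOPE => 1 marks the family for replication.
            lines.append(f"alter '{table}', {{NAME => '{cf}', REPLICATION_SCOPE => '1'}}")
        lines.append(f"enable '{table}'")
    return "\n".join(lines) + "\n"


# Placeholder peer and schema.
print(replication_script("1", "zk1,zk2,zk3:2181:/hbase", {"users": ["info"]}), end="")
```

The output can be fed to hbase shell on the master cluster.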
Backup - Snapshot
•Doesn't require copying data
•Runs in less than 60 seconds
•Minimal impact on performance
* See the slides from "Apache HBase Table Snapshots" with Jonathan Hsieh & pals
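Because snapshots are cheap, they are easy to take on a schedule, which raises the question of when to delete them. A minimal retention sketch: the `<table>-YYYYmmdd-HHMMSS` naming scheme and the number of snapshots kept are assumptions, while `snapshot` and `delete_snapshot` are HBase shell commands.

```python
import re
from datetime import datetime


def snapshot_name(table: str, when: datetime) -> str:
    # Hypothetical naming scheme: <table>-YYYYmmdd-HHMMSS.
    return f"{table}-{when:%Y%m%d-%H%M%S}"


def snapshots_to_delete(names: list[str], table: str, keep: int) -> list[str]:
    """Snapshots of `table` older than the newest `keep` ones.

    The zero-padded timestamp makes lexical order chronological.
    """
    pattern = re.compile(re.escape(table) + r"-\d{8}-\d{6}")
    ours = sorted(n for n in names if pattern.fullmatch(n))
    return ours[:-keep] if keep > 0 else ours


def retention_commands(names: list[str], table: str, keep: int,
                       now: datetime) -> list[str]:
    """HBase shell commands: take a new snapshot, then expire old ones."""
    new = snapshot_name(table, now)
    cmds = [f"snapshot '{table}', '{new}'"]
    expired = snapshots_to_delete(names + [new], table, keep)
    cmds += [f"delete_snapshot '{n}'" for n in expired]
    return cmds
```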
Thank You!

HBaseCon 2013: Apache HBase, Meet Ops. Ops, Meet Apache HBase.