Running HBase with the MapR distribution
                                                       Tomer Shiran
                  Director of Product Management, MapR Technologies

      7/23/2012           ©MapR Technologies                          1
Agenda
•   The HBase volume
•   HBase backups with snapshots
•   Mirroring
•   Tuning memory settings
•   Architecting applications with many objects




MapR
• Complete Hadoop distribution
   • Makes it easy to deploy HBase
   • MapR 1.2 includes HBase 0.90.4 + 15 patches

• Seeing huge growth in HBase adoption
   • Thanks to everyone in this room!

• MapR expands the market for HBase
   • Enterprises require HA, data protection and disaster recovery
   • MapR makes it easier to run HBase in production
       One minute to set up hourly snapshots
       One minute to set up cross-datacenter mirroring
       No need to worry about NameNode

Volumes – easy data management
• MapR makes data
  management easier with
  volumes
• Volumes are directories
  with management policies
   • Replication, snapshots,
     mirroring, data placement
     control, quotas, usage
     tracking, …
• Each user/project
  directory should be a
  volume
   • 100K volumes not a
     problem


The HBase volume
•   All HBase data should be in one volume
     •   HBase WALs are per RegionServer, so can’t create per-table volumes
•   A volume for HBase data is created on installation
     •   Name: hbase.volume
     •   Mount path: /hbase
•   Replication optimized for low latency
     •   Star replication beats chain replication for HBase
•   For bulk load, create the HFiles in the HBase volume (/hbase)
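The latency point about star vs. chain replication can be sketched with a toy model (an illustration, not MapR internals): in a star topology the primary sends to all replicas in parallel, so write latency is bounded by the slowest single hop, while in a chain the hops are sequential and latencies add up.

```python
# Toy latency model for star vs. chain replication.
# Per-hop latencies below are hypothetical, for illustration only.

def star_latency(hop_ms):
    """Primary fans out to all replicas in parallel: bounded by slowest hop."""
    return max(hop_ms)

def chain_latency(hop_ms):
    """Each replica forwards to the next in sequence: latencies add up."""
    return sum(hop_ms)

hops = [2.0, 3.0, 2.5]        # hypothetical per-replica network latencies (ms)
print(star_latency(hops))     # 3.0
print(chain_latency(hops))    # 7.5
```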

# cd /mapr/default/hbase
# ls -la
total 7
drwxrwxrwx 13 root root 12   2012-01-16  11:44  .
drwxrwxrwx  6 root root  7   2012-01-13  16:08  ..
drwxrwxrwx  3 root root  1   2012-01-15  11:30  AdImpressions
-rwxrwxrwx  1 root root  3   2011-12-16  13:03  hbase.version
drwxrwxrwx  5 root root  3   2012-01-12  15:28  .logs
drwxrwxrwx  3 root root  1   2011-12-16  13:03  .META.
drwxrwxrwx  2 root root  0   2012-01-13  14:29  .oldlogs
drwxrwxrwx  3 root root  1   2011-12-16  13:03  -ROOT-
drwxrwxrwx  3 root root  1   2012-01-16  11:44  Users

Reminder: A MapR cluster can be mounted via NFS, so cd and ls just work.
All WALs are in .logs, not in the user table directories (AdImpressions, Users).

HBase backups with snapshots
• Why snapshots?
   •    Consistent – HFiles and HLogs at the same point in time
   •    No downtime – snapshot a live HBase cluster, no performance impact
   •    No data duplication – takes seconds to snapshot petabytes
   •    Short RPOs – snapshot hourly or more frequently

• Access HBase snapshots in /hbase/.snapshot:
       # cd .snapshot
       # pwd
       /mapr/default/hbase/.snapshot
       # ls -la
       total 3
       drwxr-xr-x 5 root root 3 Jan 16 16:02 .
       drwxrwxrwx 7 root root 6 Jan 16 11:46 ..
       drwxrwxrwx 7 root root 6 Jan 16 11:46 2012-01-16.14-02-02
       drwxrwxrwx 7 root root 6 Jan 16 11:46 2012-01-16.15-02-02
       drwxrwxrwx 7 root root 6 Jan 16 11:46 2012-01-16.16-02-02
       # ls -a 2012-01-16.16-02-02
       . .. AdImpressions hbase.version .logs .META. .oldlogs      -ROOT-
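Snapshot directories are named YYYY-MM-DD.HH-MM-SS, so a plain lexicographic sort is also chronological. A restore script can exploit that to find the newest snapshot and copy a table directory out of it over NFS. This is a sketch; the table name and mount path below are just the ones from the listing above.

```python
# Sketch: pick the most recent MapR snapshot by name.
# Zero-padded YYYY-MM-DD.HH-MM-SS names sort chronologically.

def latest_snapshot(names):
    """Return the most recent snapshot directory name."""
    return max(names)

snaps = ["2012-01-16.14-02-02", "2012-01-16.15-02-02", "2012-01-16.16-02-02"]
print(latest_snapshot(snaps))   # 2012-01-16.16-02-02

# A restore is then an ordinary copy over the NFS mount, e.g.:
# shutil.copytree("/mapr/default/hbase/.snapshot/" + latest_snapshot(snaps)
#                 + "/Users", "/tmp/Users.restore")
```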

Manage your schedules




Choose a snapshot schedule for HBase

(Screenshot: snapshot schedule dialog)
• Choose a snapshot schedule for this volume
• Use this GUI dialog, or the CLI or REST API

Mirroring

Mirror to…
• Research cluster
• Failover (DR) cluster
• Remote backup cluster
• Same cluster!
• …

• Fast (and easy): differential (deltas), compressed
• Safe: consistent (snapshot), checksummed
• Flexible: scheduled or on-demand; intranet, WAN or Sneakernet


Mirroring the HBase volume

(Screenshots: MCS dialogs)
• Create a new volume on the destination cluster; choose the Remote Mirroring Volume type
• Choose the source cluster and volume (mapr.hbase)
• Choose a mirroring schedule




Mirroring vs. HBase master/slave replication
• Block level
    • No need to run HBase on sink cluster
    • Only the latest update to a block needs to be sent
         With master/slave every operation is sent


• MapR mirroring is practically stateless
    • Each sink cluster keeps one integer – a serial number
         When asking for the next update, the sink provides the most recently
          seen serial number
    • Master cluster does not keep any state
         No resources consumed on the master cluster
    • No ZooKeeper involved
    • Master/slave replication is challenging when it gets out of sync

• One system for mirroring both HBase and file/directories


Warden
• Warden runs on each server
   • /etc/init.d/mapr-warden start
• Warden starts/manages services on the node
• Warden decides how much memory to give each
  service based on settings in warden.conf

 # cat /opt/mapr/conf/warden.conf
 …
 service.command.hbregion.heapsize.percent=25
 service.command.hbregion.heapsize.max=4000
 service.command.hbregion.heapsize.min=1000
 service.command.mfs.heapsize.percent=20
 service.command.mfs.heapsize.min=512
 …
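The percent/min/max keys above suggest a simple rule: take the service's percentage of physical RAM, then clamp it to the [min, max] range. A minimal sketch of that rule follows; the clamping behavior is an assumption drawn from the key names, not from MapR documentation.

```python
# Sketch (assumption): Warden-style heap sizing from percent/min/max.
# Values in MB, matching the units implied by warden.conf above.

def heapsize_mb(total_ram_mb, percent, min_mb, max_mb):
    """percent of total RAM, clamped to [min_mb, max_mb]."""
    raw = total_ram_mb * percent / 100.0
    return int(min(max(raw, min_mb), max_mb))

# RegionServer on a 32 GB node: 25% of 32768 MB = 8192 MB, capped at 4000 MB
print(heapsize_mb(32768, 25, 1000, 4000))   # 4000
# Same settings on a 2 GB node: 25% = 512 MB, raised to the 1000 MB floor
print(heapsize_mb(2048, 25, 1000, 4000))    # 1000
```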



Tuning memory settings
• The defaults are suitable in most cases

• Guidelines:
   • Don’t exceed 100-200 regions per server
   • Don’t give RegionServer more than 16GB RAM
       Garbage collection might kill you
   • Give spare memory to FileServer
       Written in C/C++ (unlike HDFS DataNode)
       Advanced caching and prefetching
   • Don’t enable TaskTracker unless you need it
       Or Warden will reserve memory for tasks
       If TaskTracker not enabled and mfs.heapsize.max not in
        warden.conf, Warden assigns spare memory to FileServer



Architecting applications with many objects
• MapR supports up to 1 trillion files (small files OK)
    • Fully distributed metadata
          No NameNode or block reports
    • Extremely fast random I/O (10-1000x compared to HDFS)
    • With HDFS Federation and the upcoming HA NameNode you would need 20K
      NameNodes and an HA NetApp :-)

• Keep smaller objects in HBase and larger objects (> 100KB) in MapR
  storage services

 Metadata (IDs, attributes, etc.)        → HBase
 Content (messages, attachments, etc.)   → MapR storage services
Three ways to access the files
• NFS
   • Mount the cluster over NFS
   • NFS HA ensures availability – MapR assigns and manages virtual IPs
   • No client library, works with any language
   $ mount -o … mycluster:/mapr /mapr
   $ python
   >>> with open(r'/mapr/mycluster/images/asdfghjkl', 'w') as f:
   ...     f.write(…)

• Java – Hadoop FileSystem API
   FileSystem fs = FileSystem.get(new Configuration());
   FSDataOutputStream out = fs.create(…);
   out.write(…);
   out.close();

• C/C++ – native libhdfs library (MapR 1.2+)
   • Same API (header file) as libhdfs, but no Java involved
   hdfsFS fs = hdfsConnect(...);
   hdfsFile f = hdfsOpenFile(fs, ...);
   hdfsWrite(fs, f, ...);
Questions?


