M7 Technical Overview
M. C. Srivas
CTO/Founder, MapR
MapR: Lights Out Data Center Ready
Reliable Compute
• Automated stateful failover
• Automated re-replication
• Self-healing from HW and SW failures
• Load balancing
• Rolling upgrades
• No lost jobs or data
• Five nines (99.999%) of uptime
Dependable Storage
• Business continuity with snapshots and mirrors
• Recover to a point in time
• End-to-end checksumming
• Strong consistency
• Built-in compression
• Mirror between two sites by RTO policy
MapR does MapReduce (fast)
TeraSort record: 1 TB in 54 seconds on 1003 nodes
MinuteSort record: 1.5 TB in 59 seconds on 2103 nodes
[Chart figures from the "faster" slide: 1.65, 300]
5©MapR Technologies - Confidential
Dynamo DB
ZopeDB
Shoal
CloudKit
Vertex DB
FlockD
B
NoSQL
HBase Table Architecture
 Tables are divided into key ranges (regions)
 Regions are served by nodes (RegionServers)
 Columns are divided into access groups (column families)
[Diagram: a table with rows R1-R4 and column families CF1-CF5]
HBase Architecture is Better
 Strong consistency model
– when a write returns, all readers will see the same value
– "eventually consistent" is often "eventually inconsistent"
 Scans work
– a scan does not broadcast to the whole cluster (see the sketch after this list)
– ring-based NoSQL databases (e.g., Cassandra, Riak) suffer on scans
 Scales automatically
– splits when regions become too large
– uses HDFS to spread data and manage space
 Integrated with Hadoop
– MapReduce on HBase is straightforward
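To make the scan point concrete, here is a minimal range-scan sketch in Java, assuming the HBase 0.94-era client API this deck targets; the table and column-family names are hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable");          // hypothetical table
    // Scan a contiguous key range; only the regions covering
    // [row000, row999) are contacted -- no cluster-wide broadcast.
    Scan scan = new Scan(Bytes.toBytes("row000"), Bytes.toBytes("row999"));
    scan.addFamily(Bytes.toBytes("cf1"));                // hypothetical family
    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result r : scanner) {
        System.out.println(Bytes.toString(r.getRow()));
      }
    } finally {
      scanner.close();
      table.close();
    }
  }
}

Because rows are stored sorted by key and regions are key ranges, the client touches only the RegionServers hosting that range; a ring-hashed store has no such locality to exploit on a scan.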
M7: An integrated system for unstructured and structured data
MapR M7 Tables
 Binary compatible with Apache HBase
– no recompilation needed to access M7 tables; just set CLASSPATH
– including the HBase CLI
 M7 tables are accessed via pathname (see the sketch after this list)
– openTable("hello") … uses HBase
– openTable("/hello") … uses M7
– openTable("/user/srivas/hello") … uses M7
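The openTable() calls above are shorthand; with the standard Java client the path simply goes in the table-name string. A minimal sketch, assuming the same 0.94-era HBase client API (paths and names hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class OpenTableDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // Unqualified name: routed to Apache HBase.
    HTable hbaseTable = new HTable(conf, "hello");
    // Path-style name: routed to an M7 table in the cluster namespace.
    HTable m7Table = new HTable(conf, "/user/srivas/hello");
    Put p = new Put(Bytes.toBytes("row1"));
    p.add(Bytes.toBytes("cf1"), Bytes.toBytes("col1"), Bytes.toBytes("val1"));
    m7Table.put(p);       // same API, different backing store
    hbaseTable.close();
    m7Table.close();
  }
}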
Binary Compatible
 HBase applications work "as is" with M7
– no need to recompile, just set CLASSPATH
 Can run M7 and HBase side-by-side on the same cluster
– e.g., during a migration
– can access both an M7 table and an HBase table in the same program
 Use the standard Apache HBase CopyTable tool to copy a table from HBase to M7 or vice versa, e.g.:
% hbase org.apache.hadoop.hbase.mapreduce.CopyTable \
    --new.name=/user/srivas/mytable oldtable
Features
 Unlimited number of tables
– HBase clusters typically run 10-20 tables (max ~100)
 No compactions
 Instant-On
– zero recovery time
 8x insert/update performance
 10x random scan performance
 10x faster with flash (special flash support)
M7: Remove Layers, Simplify
[Diagram: MapR M7 stack]
M7 Tables in a MapR Cluster
 M7 tables are integrated into the storage layer
– always available on every node
– no separate process to start/stop/monitor
– zero administration
– no tuning parameters … it just works
 M7 tables work 'as expected'
– first copy is local to the writing client
– snapshots and mirrors
– quotas, replication factor, data placement
Unified Namespace for Files and Tables
$ pwd
/mapr/default/user/dave
$ ls
file1 file2 table1 table2
$ hbase shell
hbase(main):003:0> create '/user/dave/table3', 'cf1', 'cf2', 'cf3'
0 row(s) in 0.1570 seconds
$ ls
file1 file2 table1 table2 table3
$ hadoop fs -ls /user/dave
Found 5 items
-rw-r--r-- 3 mapr mapr 16 2012-09-28 08:34 /user/dave/file1
-rw-r--r-- 3 mapr mapr 22 2012-09-28 08:34 /user/dave/file2
trwxr-xr-x 3 mapr mapr 2 2012-09-28 08:32 /user/dave/table1
trwxr-xr-x 3 mapr mapr 2 2012-09-28 08:33 /user/dave/table2
trwxr-xr-x 3 mapr mapr 2 2012-09-28 08:38 /user/dave/table3
M7 – An Integrated System
Tables for End Users
 Users can create and manage their own tables
– unlimited number of tables
– first copy is local
 Tables can be created in any directory
– tables count towards volume and user quotas
 No admin intervention needed
– changes take effect on the fly, with no server stop/restart
 Automatic data protection and disaster recovery
– users can recover from snapshots/mirrors on their own
M7 Combines the Best of LSM Trees and B-Trees
 LSM trees reduce insert cost by deferring and batching index changes
– compact too rarely, and read performance suffers
– compact too often, and write performance suffers
 B-trees are great for reads
– but expensive to update in real time
Can we combine both ideas?
Writes cannot be done better than W = 2.5x:
write to the log + write the data somewhere + update the metadata (worked out below)
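A rough accounting of that bound, reading the slide's three terms literally (the 0.5x amortization is an assumption, not spelled out on the slide): W = 1x to append the record to the write-ahead log, plus 1x to write the data itself to its long-term home, plus about 0.5x amortized for the metadata/index update, giving 2.5x. So a 1 KB put ultimately costs roughly 2.5 KB of raw writes, before any compaction overhead is added on top.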
M7 from MapR
 Twisting B-trees
– leaves are variable-size (8K - 8M or larger)
– the tree can stay unbalanced for long periods of time
• more inserts will balance it eventually
• automatically throttles updates to interior b-tree nodes
– M7 inserts "close to" where the data is supposed to go
 Reads
– use the b-tree structure to get "close" very fast
• very high branching factor with key-prefix compression
– use a separate lower-level index to find the exact location
• updated "in place": bloom filters for gets, range maps for scans
 Overhead
– a 1K record read transfers about 32K from disk in logN seeks
M7: Comparative Analysis with Apache HBase, LevelDB, and a B-Tree
Apache HBase HFile Structure
 Key-value pairs are laid out in increasing key order
 64K blocks are compressed
 An index into the compressed blocks is created as a b-tree
 Each cell is an individual key + value; a row repeats the key for each column
HBase Region Operation
 Typical region size is a few GB, sometimes even 10G or 20G
 The RegionServer holds data in memory until full, then writes a new HFile
– the logical view of the database is constructed by layering these files, with the latest on top
[Diagram: the key range represented by this region, with HFiles layered newest (top) to oldest (bottom)]
HBase Read Amplification
 When a get/scan comes in, all the files have to be examined
– schema-less, so which file holds the column?
– done in memory; does not change what's on disk
• bloom filters do not help with scans
With 7 files, a 1K-record get() takes about 30 seeks, 7 block decompressions, and a total data transfer of about 130K from HDFS.
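A plausible breakdown of those figures (an assumption, consistent with the slide's totals rather than stated on it): each of the 7 HFiles costs roughly 4 seeks to walk its block index down to the right 64K block, giving about 28-30 seeks in all, and reading one compressed block of roughly 16-20K from each file adds up to on the order of 130K transferred to return a single 1K record.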
HBase Write Amplification
 To reduce the read amplification, HBase merges the HFiles periodically
– a process called compaction
– runs automatically when there are too many files
– usually turned off due to the I/O storms it causes
– and instead kicked off manually on weekends
Compaction reads all the files and merges them into a single HFile.
HBase Compaction Analysis
 Assume 10G per region, writing 10% per day and growing 10% per week
– 1G of writes per day
– after 7 days: 7 files of 1G and 1 file of 10G
 Compaction (tallied after this list)
– total reads: 17G (= 7 x 1G + 1 x 10G)
– total writes: 25G (= 7G WAL + 7G flush + 11G written to the new HFile)
 500 regions
– read 8.5T, write 12.5T  a major outage on the node
– with fewer HFiles, it only gets worse
 Best practice: serve < 500G per node (50 regions)
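Put differently, using the numbers above: compaction performs 17G of reads plus 25G of writes, i.e. 42G of I/O to merge 7G of newly ingested data, about a 6x I/O amplification per byte actually written. Across 500 regions that becomes 8.5T read plus 12.5T written, roughly 21T of combined I/O concentrated on a single node, hence the "major outage".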
LevelDB
 Tiered, with logarithmic increase per level
– L1: 2 x 1M files
– L2: 10 x 1M
– L3: 100 x 1M
– L4: 1,000 x 1M, etc.
 Compaction overhead
– avoids I/O storms (I/O is done in smaller increments of ~10M)
– but uses significantly more bandwidth than HBase
 Read overhead is still high
– 10-15 seeks, perhaps more if the lowest level is very large
– 40K - 60K read from disk to retrieve a 1K record
B-Tree Analysis
 Reads find data directly; proven to be the fastest
– interior nodes hold only keys
– very large branching factor
– values live only at the leaves
– thus caches work well
– R = logN seeks if there is no caching
– a 1K record read transfers about logN blocks from disk
 Writes are slow on inserts
– data must be inserted into the correct place right away
– otherwise reads would not find it
– requires the b-tree to be continuously rebalanced
– causes extreme random I/O in the insert path
– W = 2.5x + logN seeks if there is no caching
Let’s look at some performance numbers for proof.
M7 vs. CDH: 50-50 Mix (Reads) [chart]
M7 vs. CDH: 50-50 Mix (Read Latency) [chart]
M7 vs. CDH: 50-50 Mix (Updates) [chart]
M7 vs. CDH: 50-50 Mix (Update Latency) [chart]
MapR M7 Accelerates HBase Applications

HDD configuration:
Benchmark            | MapR 3.0.1 (M7) | CDH 4.3.0 (HBase) | MapR Increase
50% read, 50% update | 8000            | 1695              | 5.5x
95% read, 5% update  | 3716            | 602               | 6x
Reads                | 5520            | 764               | 7.2x
Scans (50 rows)      | 1080            | 156               | 6.9x
CPU: 2 x Intel Xeon E5645 2.40GHz (12 cores); RAM: 48GB; Disk: 12 x 3TB (7200 RPM)
Record size: 1KB; Data size: 2TB; OS: CentOS Release 6.2 (Final)

SSD configuration:
Benchmark            | MapR 3.0.1 (M7) | CDH 4.3.0 (HBase) | MapR Increase
50% read, 50% update | 21328           | 2547              | 8.4x
95% read, 5% update  | 13455           | 2660              | 5x
Reads                | 18206           | 1605              | 11.3x
Scans (50 rows)      | 1298            | 116               | 11.2x
CPU: 2 x Intel Xeon E5620 2.40GHz (8 cores); RAM: 24GB; Disk: 1 x 1.2TB Fusion-io ioDrive2
Record size: 1KB; Data size: 600GB; OS: CentOS Release 6.3 (Final)

MapR speedup with HDDs: 5x-7x; with SSDs: 5x-11.3x
M7: Fileservers Serve Regions
 A region lives entirely inside a container
– does not coordinate through ZooKeeper
 Containers support distributed transactions
– with replication built in
 The only coordination in the system is for splits
– between the region map and the data container
– the same problem was already solved for files and their chunks
Server Reboot
 Full container reports are tiny
– the CLDB needs 2G of DRAM for a 1000-node cluster
 Volumes come online very fast
– each volume is independent of the others
– a volume is up as soon as its min-repl number of containers is ready
– does not wait for the whole cluster
(e.g., HDFS waits for 99.9% of blocks to report)
 A 1000-node cluster restarts in < 5 minutes
M7 Provides Instant Recovery
 0-40 microWALs per region
– idle WALs go to zero quickly, so most are empty
– the region is up before all microWALs are recovered
– remaining microWALs are recovered in the background, in parallel
– when a key is accessed, its microWAL is recovered inline
– 1000-10000x faster recovery
 Why doesn't HBase do this?
– M7 leverages unique MapR-FS capabilities and is not impacted by HDFS limitations
– no limit on the number of files on disk
– no limit on the number of open files
– the I/O path translates random writes into sequential writes on disk
MapR M7: Providing an enterprise-quality Apache HBase API
