Hadoop/HBase POC v1 Review

 A framework for Hadoop/HBase POC
POC
• Proof of Concept, usually in competition with
  another product.
• Given use case:
  – Performance: critical path (speed); most
    benchmarks measure read performance, shard for write
    performance
  – Cost: H/W + administrative cost
  – Look at HBase+Hadoop vs. MongoDB
HBase
• Transactional store; 70k messages/sec at
  1.5 kB/message. >1 Gb Ethernet speeds.
• What is HBase?; sources
Cloudera HBase Training Materials
• Exercises:
  http://async.pbworks.com/w/file/55596671/HBase_exercise_instructions.pdf
• Training Slides:
  http://async.pbworks.com/w/file/54915308/Cloudera_hbase_training.pdf
• Training VM: 2 GB, hosted somewhere else.
System Design on Working Components
HDFS vs. HBase
• Replication and a distributed FS. Think NFS, not just replicas.
  Metadata lives at a central NameNode, a single point of failure;
  secondary NN as hot backup. Failure and recovery protocol
  testing is not part of the POC.
• Blocks: larger is better. Blocks are replicated, not cells.
• HDFS is write-once; it was modified to support appending to a file for HBase.
• MapR is HDFS compatible:
   –   fast adoption w/HBase; snapshots
   –   cross-data-center mirroring, consistent mirroring
   –   star replication vs. chain replication
   –   FileServer vs. TaskTracker, Warden vs. NN; no single point of failure
RegionServer + DataNode on the same machine
HBase Memory (book)
HBase Disks (book)
• No RAID on slaves; RAID on the master is OK. Use IOPS.
HBase Networking (book)
Transactional Write Perf.
• Factor network, multiple clients, and any disk
  seeks out of the test program.
• Create test packets in memory only.
• Write perf is a function of instance
  memory and packet size.
HBase Write Path
Run on Amazon AWS first
• INSTANCES:
  – SMALL INSTANCE: 1.7GB
  – LARGE INSTANCE: 7.5GB
  – HIMEM XLARGE: 17GB, 34GB, 68GB
  – SSD DRIVES!!
Write performance: 300k messages of 1500-byte synthetic data
[Chart: packets/sec (0-3500) vs. AWS instance memory: 1.7, 7.5, 17, 34, 68 GB]
Dell Notes:
•   MapR says 16 GB / Cloudera says 24 GB;
•   plot heap size instead.
•   Dell: is this slowing down performance?
•   Take out a DIMM?
•   Reproduce results first?
HBase write perf, 1M byte/s
• http://www.slideshare.net/jdcryans/performance-bof12,
  100k-40k/second with 10-byte packets
Write test code
• No network, no disk accesses. Run on the local
  node (minimal sketch below).
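For reference, a minimal sketch of what such a local write test might look like against the HBase Java client. The table name perftest, column family CF, and column payload are placeholder names, not from the POC; the POC's actual test program is not shown in the deck.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.Connection;
  import org.apache.hadoop.hbase.client.ConnectionFactory;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.client.Table;
  import org.apache.hadoop.hbase.util.Bytes;
  import java.util.Random;

  public class WritePerfTest {
      public static void main(String[] args) throws Exception {
          int packetSize = 1500;     // bytes per synthetic packet
          int numPackets = 300_000;  // rows to write

          // Synthetic packet built once, in memory -- no disk reads in the test path.
          byte[] payload = new byte[packetSize];
          new Random(42).nextBytes(payload);

          Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml from the classpath
          try (Connection conn = ConnectionFactory.createConnection(conf);
               Table table = conn.getTable(TableName.valueOf("perftest"))) {
              byte[] cf = Bytes.toBytes("CF");
              byte[] col = Bytes.toBytes("payload");

              long start = System.currentTimeMillis();
              for (int i = 0; i < numPackets; i++) {
                  Put put = new Put(Bytes.toBytes(String.format("row%09d", i)));
                  put.addColumn(cf, col, payload);
                  table.put(put);                // one RPC per put in this naive loop
              }
              long elapsed = System.currentTimeMillis() - start;
              System.out.printf("%d puts in %d ms = %.0f p/s%n",
                      numPackets, elapsed, numPackets * 1000.0 / elapsed);
          }
      }
  }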
HBase AWS Packet Size, 16-1500 bytes
• http://async.pbworks.com/w/file/55320973/AWSHBasePerf16_1500bytepacket.xlsx
HBase Write Perf, 1500-byte packets
• Single thread, single node. Should be >>
  with more threads or an async client.
• 16 byte: 11235 p/s
• 40 byte: 8064 p/s
• 80 byte: 5263 p/s
• 1500 byte: 3311 p/s
• 8 GB heap, big regions (optimizations in file
  names), etc. 12-20 optimizations tried; 4 make a
  difference (region-size sketch below).
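The "big regions" tuning can be applied per table by raising the region split threshold. A sketch with the HBase 2.x client API (the client of the POC's era would use HTableDescriptor.setMaxFileSize instead); table and family names are placeholders.

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.Admin;
  import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
  import org.apache.hadoop.hbase.client.Connection;
  import org.apache.hadoop.hbase.client.ConnectionFactory;
  import org.apache.hadoop.hbase.client.TableDescriptor;
  import org.apache.hadoop.hbase.client.TableDescriptorBuilder;

  public class CreateBigRegionTable {
      public static void main(String[] args) throws Exception {
          try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
               Admin admin = conn.getAdmin()) {
              // Per-table override of hbase.hregion.max.filesize: writes stay in
              // fewer, larger regions, so fewer splits happen during the benchmark.
              TableDescriptor td = TableDescriptorBuilder
                      .newBuilder(TableName.valueOf("perftest"))
                      .setColumnFamily(ColumnFamilyDescriptorBuilder.of("CF"))
                      .setMaxFileSize(10L * 1024 * 1024 * 1024)   // 10 GB regions
                      .build();
              admin.createTable(td);
          }
      }
  }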
AWS Reduce #RPC
• Batch mode: 1000 inserts = 1000 RPCs; reduced to
  1 RPC with batching: 3610 p/s (5.4 MB/s, passes
  error check, m2.2xlarge instance); see the
  batched-put sketch below. Note: mongo
[Chart: batched write rate over the run (Series1)]
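A minimal sketch of the batching idea with the Table API (the deck does not show its batch code; table and column names are placeholders): buffering Puts client-side collapses many per-row RPCs into one client call, which the client then groups by region server.

  import java.util.ArrayList;
  import java.util.List;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.client.Table;
  import org.apache.hadoop.hbase.util.Bytes;

  public class BatchedInsert {
      // Instead of one RPC per insert, collect the Puts and send them together.
      static void insertBatch(Table table, byte[] payload, int batchSize) throws Exception {
          List<Put> batch = new ArrayList<>(batchSize);
          for (int i = 0; i < batchSize; i++) {
              Put put = new Put(Bytes.toBytes(String.format("row%09d", i)));
              put.addColumn(Bytes.toBytes("CF"), Bytes.toBytes("payload"), payload);
              batch.add(put);
          }
          table.put(batch);   // single client call; Puts are grouped per region server
      }
  }

A BufferedMutator (or the write buffer on older HTable clients) achieves the same effect transparently.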
Dell H/W perf. (default config) worse: 2262 p/s vs. 3311 p/s (AWS)
http://async.pbworks.com/w/file/55225682/graphdell1500bytepacket8gb.txt
[Chart: Dell, WAL off: 2262 -> 2688 p/s (+18.5%)]
[Chart: Dell, WAL disabled, big heap, big regions (needs more time): 2262 -> >3557 p/s, a 57% increase]
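The WAL-off runs presumably skip the write-ahead log per mutation. A sketch with the current client API (older clients used put.setWriteToWAL(false)); table and column names are placeholders.

  import org.apache.hadoop.hbase.client.Durability;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.client.Table;
  import org.apache.hadoop.hbase.util.Bytes;

  public class NoWalWrite {
      // Write one row without the write-ahead log. Faster, but the edit is lost if
      // the region server dies before its MemStore is flushed -- acceptable for a
      // benchmark, risky for production data.
      static void putWithoutWal(Table table, byte[] rowKey, byte[] payload) throws Exception {
          Put put = new Put(rowKey);
          put.addColumn(Bytes.toBytes("CF"), Bytes.toBytes("payload"), payload);
          put.setDurability(Durability.SKIP_WAL);
          table.put(put);
      }
  }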
AWS SSD (3267 p/s) vs. EBS (4042 p/s), no compaction. Red = m2large. Maybe AWS is using SSDs?
[Chart: write rate over the run, Series1 vs. Series2]
AWS (3500-4k packets/sec) vs. DELL
• AWS: 3-4k p/s with the default configuration, no optimization.
• Dell (3557 p/s) slower than AWS (3610 p/s on an optimized
  m2.2xlarge, 4240 p/s on m2large).
• Faster h/w instances in AWS make a difference.
  Lesson (4210 p/s): controlling regions and
  compactions has an impact on performance; fast IO.
  Spend time on this later.
• User error w/the Dell h/w somewhere. Can't be that slow!
• Could run a benchmark on m2.2xlarge over a 24h period
  to see variability in perf. Not worth the time investment.
Dell Tuning
• Ext3 vs. ext4: 5% diff in benchmarks; no diff in p/s
  performance.
• RAID levels? JBOD not available.
• Maybe the m2.2xlarge high-perf AWS drives
  are SSDs? Seems odd given the pricing structure.
• noatime, 256k block sizes.
• Goal: 4k p/s?
Bulk Load (worth the time investment?)
• Quick error check.
• Take an existing table, export it, bulk load it.
  Command line; very rough.
• Should redo with a Java program. WAL off is an
  approximation.
Write Clients for NoSQL
• HBase, Mongo, and Cassandra have threads
  behind them; you need a threaded or async client
  to get full performance (sketch after this list).
• Needs more time; higher priority than dist
  mode, and needed in dist mode.
• Lock timeout behavior; insert 1 row.
• Need a threaded or async client. Most get the
  threaded design wrong?
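A sketch of what a threaded write client might look like: one Table handle per worker over a shared Connection. The thread count, row counts, and all names are placeholders, not the POC's actual tool.

  import java.util.concurrent.ExecutorService;
  import java.util.concurrent.Executors;
  import java.util.concurrent.TimeUnit;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.Connection;
  import org.apache.hadoop.hbase.client.ConnectionFactory;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.client.Table;
  import org.apache.hadoop.hbase.util.Bytes;

  public class ThreadedWriter {
      public static void main(String[] args) throws Exception {
          int threads = 8;
          int rowsPerThread = 100_000;
          byte[] payload = new byte[1500];

          try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create())) {
              ExecutorService pool = Executors.newFixedThreadPool(threads);
              for (int t = 0; t < threads; t++) {
                  final int id = t;
                  pool.submit(() -> {
                      // Each worker gets its own lightweight Table handle; the shared
                      // Connection multiplexes the underlying RPCs.
                      try (Table table = conn.getTable(TableName.valueOf("perftest"))) {
                          for (int i = 0; i < rowsPerThread; i++) {
                              Put put = new Put(Bytes.toBytes(String.format("t%02d-%09d", id, i)));
                              put.addColumn(Bytes.toBytes("CF"), Bytes.toBytes("payload"), payload);
                              table.put(put);
                          }
                      } catch (Exception e) {
                          e.printStackTrace();
                      }
                  });
              }
              pool.shutdown();
              pool.awaitTermination(1, TimeUnit.HOURS);
          }
      }
  }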
Write Load Tool (multiple clients)
• 300k rows, single thread, single client: 14430
  ms, 2079 p/s; about right.
• 300k rows, 3 threads: 22804 ms
• M/R, 30 mappers: 24289 ms
• M/R is better when you need to do combining or
  processing of the input data. The M/R vs. threads
  comparison is about right. Threads should
  increase performance... OK, writing my own.
Application Level Perf
• Not transactional.
• Simulate a reporting store: writes concurrent
  with web-page reads.
• Compare with SQL Server and MongoDB, which have
  column indexes.
• You may not need column indexes if the schema is
  designed correctly. ESN is not the key; will need
  consecutive keys to split into balanced regions.
Web GUI
• Demo: web page & writes into the DB. Test MS SQL
  Server packets/sec using the same setup.
• Do a LIKE '%asdf%' with no matching data to see if there
  is a timeout.
Read Performance
• Index search through the web page with the writer running is fast: 50-100 ms,
  <10-20 ms if in cache.
• Don't do full table scans, e.g. count 'tablename' in the hbase shell
    – equivalent to COUNT(*) FROM table
• Pig/Hive are faster on top of HBase because they store metadata.
• Full table scan:
    – 10 rows: 18 ms
    – 100 rows: 11-166 ms
    – 1000 rows: 638 ms
    – 10k rows: 4.3 s
    – 100k rows: 38 s (not printed)
• Use filters for search: exact match, regex, substring, more.
Read Path/SCAN/Filters
SingleColumnValueFilter
• Search for a specific
  value: constant, regex, prefix. Did not try the
  others.
• Same queries as before, searching for specific
  values, testing 100k-1M rows.
• Without filters, use an iterator to hold the result set and
  iterate through each result, testing each result
  value (like DB drivers). A filter reduces the result-set
  size from all rows to only the rows that meet the
  condition.
Column Value
• Filter filter = new SingleColumnValueFilter(Bytes.toBytes("CF"),
  Bytes.toBytes("Key5"), CompareOp.EQUAL, Bytes.toBytes("bob"));
• Filter f = new SingleColumnValueFilter(Bytes.toBytes("CF"),
  Bytes.toBytes("COLUMN"), CompareOp.EQUAL,
  new RegexStringComparator("z*"));
• 565 ms for 200k rows with a 115-row result set returned
  (printed); small result sets are faster.
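Putting the pieces together, a sketch of a full scan with the equality filter above; the filter runs server-side, so only matching rows come back over the wire. Table and column names are the same placeholder names used above (CompareOp is the era-appropriate enum; newer clients use CompareOperator).

  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.ResultScanner;
  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.client.Table;
  import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
  import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
  import org.apache.hadoop.hbase.util.Bytes;

  public class FilteredScan {
      // Return only the rows whose CF:Key5 column equals "bob".
      static void scanForBob(Table table) throws Exception {
          SingleColumnValueFilter filter = new SingleColumnValueFilter(
                  Bytes.toBytes("CF"), Bytes.toBytes("Key5"),
                  CompareOp.EQUAL, Bytes.toBytes("bob"));
          filter.setFilterIfMissing(true);   // skip rows that lack the column entirely

          Scan scan = new Scan();
          scan.setFilter(filter);
          try (ResultScanner scanner = table.getScanner(scan)) {
              for (Result r : scanner) {
                  System.out.println(Bytes.toString(r.getRow()));
              }
          }
      }
  }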
Column Value Searches
• 100k-row table
  – Returning 0.1% of results (10): 5 s
  – Returning 1% of results (100): 11.29 s
• 1M-row table
  – 1% of results (10k): 212 s
  – 0.1% of results (1k): 204.057 s
Compose row key w/values, or index tables
• Add a second table where the row keys are
  composed partially of the values (sketch below).
• Secondary-table consistency: don't need it for a
  reporting system? Consistent on inserts or bulk
  import.
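A sketch of the index-table idea: every insert also writes a row to a second table whose key starts with the indexed value, so value lookups become prefix scans instead of filtered full scans. The two puts are not atomic, which is the consistency question raised above. All names are placeholders.

  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.client.Table;
  import org.apache.hadoop.hbase.util.Bytes;

  public class IndexedWrite {
      static void putWithIndex(Table mainTable, Table indexTable,
                               byte[] rowKey, String key5Value, byte[] payload) throws Exception {
          // Main-table row, keyed as before.
          Put main = new Put(rowKey);
          main.addColumn(Bytes.toBytes("CF"), Bytes.toBytes("Key5"), Bytes.toBytes(key5Value));
          main.addColumn(Bytes.toBytes("CF"), Bytes.toBytes("payload"), payload);
          mainTable.put(main);

          // Index row key = value + delimiter + original row key; the cell points back.
          byte[] indexKey = Bytes.toBytes(key5Value + "|" + Bytes.toString(rowKey));
          Put index = new Put(indexKey);
          index.addColumn(Bytes.toBytes("CF"), Bytes.toBytes("ref"), rowKey);
          indexTable.put(index);
      }
  }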
Build Environment
• Ready for CI (Jenkins).
• Ubuntu-specific process for changing
  code: make all, make deb, make apt, then
  install using apt-get install hadoop* hbase*.
• Need to start over with yum for CentOS.
• Demo.
• Also ready for the command line w/o GUI:
  hbase org.apache.hadoop.hbase.PerfEval xx xx
Distributed mode
• Set up the build environment.
• Distributed-mode setup. ZooKeeper error
  message:
• Disable IPv6? Debugging.
Docs:
• Bigtop / updated version of CDH
• Installation:
• Build docs: Ubuntu/deb; big change to RPMs;
  takes time to document and debug. Can do
  both; takes time.
• Distributed mode:
• NXServer/NXClient:
• Screen:
