Hadoop/HBase POC v1 Review
A framework for Hadoop/HBase POC
POC
• Proof of Concept, usually in competition with another product.
• Given use case:
  – Performance: critical path (speed); most benchmarks measure read performance, shard for write performance
  – Cost: H/W + administrative cost
  – Look at HBase+Hadoop vs. MongoDB
HBase
• Transactional store; 70k messages/sec at 1.5kb/message. >1GB Ethernet speeds
• What is HBase? Sources
Cloudera HBase Training Materials
• Exercises: http://async.pbworks.com/w/file/55596671/HBase_exercise_instructions.pdf
• Training Slides: http://async.pbworks.com/w/file/54915308/Cloudera_hbase_training.pdf
• Training VM: 2GB, put somewhere else.
HDFS vs. HBase
• Replication and distributed FS. Think NFS, not just replicas. Metadata lives at a central NameNode, which is a single point of failure. The Secondary NN checkpoints NameNode metadata; it is not a hot standby. Failure and recovery protocol testing is not part of the POC
• Blocks: larger is better. Blocks are replicated, not cells.
• HDFS is write once; it was modified to support appending to a file for HBase.
• MapR is HDFS compatible:
  – Fast adoption w/HBase; snapshots
  – Cross data center mirroring, consistent mirroring
  – Star replication vs. chain replication
  – FileServer vs. TaskTracker, Warden vs. NN. No single point of failure
AWS (3500-4k packets/sec) vs. Dell
• AWS: 3-4k p/s in default configuration w/o optimization.
• Dell (3557 p/s) slower than AWS (3610 p/s optimized m2.2xlarge, 4240 p/s m2large)
• Faster h/w instances in AWS make a difference. Lesson (4210 p/s): controlling the regions and compactions has an impact on performance, as does fast IO. Spend time later on this.
• User error w/Dell h/w somewhere. Can't be that slow!
• Could run a benchmark on m2.2xlarge over a 24h period to see variability in perf. Not worth the time investment
Dell Tuning
• Ext3/4: 5% diff in benchmarks. No diff in p/s performance.
• RAID levels? JBOD not avail.
• Maybe m2.2xlarge is high perf because the AWS drives are SSD? Seems odd given the pricing structure.
• Noatime, 256k block sizes
• Goal: 4k p/s?
Bulk Load (worth time investment?)
• Quick error check
• Take existing table, export it, bulk load. Command line; very rough.
• Should redo w/Java program. WAL off is an approximation
Write Clients for NoSQL
• HBase, Mongo, Cassandra have threads behind them; need a threaded or async client to get full performance.
• Needs more time; higher priority than distributed mode, and also needed in distributed mode
• Lock timeout behavior; insert 1 row
• Most get threaded design wrong?
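A minimal sketch of the threaded write client idea, in plain Java with no HBase dependency. The `RowSink` interface, the class name, and the row key format are hypothetical stand-ins; in a real test each worker would wrap its own connection to the store. It only shows the shape: a fixed pool of workers, each writing an interleaved slice of the row range, with throughput measured across all of them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Supplier;

final class ThreadedWriteClient {
    /** Hypothetical stand-in for a per-thread store connection (not an HBase API). */
    interface RowSink {
        void put(String rowKey, byte[] payload) throws Exception;
    }

    /** Writes totalRows rows using `threads` workers and returns rows per second. */
    static double run(int threads, long totalRows, int payloadBytes,
                      Supplier<RowSink> sinkFactory) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        long start = System.nanoTime();
        try {
            List<Future<Void>> pending = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                final int worker = t;
                Callable<Void> task = () -> {
                    // One sink per worker, so workers do not share a connection.
                    RowSink sink = sinkFactory.get();
                    byte[] payload = new byte[payloadBytes];
                    // Worker w writes rows w, w + threads, w + 2*threads, ...
                    for (long row = worker; row < totalRows; row += threads) {
                        sink.put(String.format("row-%010d", row), payload);
                    }
                    return null;
                };
                pending.add(pool.submit(task));
            }
            for (Future<Void> f : pending) {
                f.get(); // wait for the worker and surface any failure
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("write worker failed", e);
        } finally {
            pool.shutdown();
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        return totalRows / seconds;
    }
}
```

Running the same row count with 1 thread and then with several gives a direct comparison; if the threaded run is not faster, the usual suspects are a shared connection or a shared lock inside the sink.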
Write Load Tool (multiple clients)
• 300k rows, single thread, single client: 14430 ms, 2079p/s; about right…
• 300k rows, 3 threads: 22804 ms
• M/R, 30 mappers: 24289 ms
• M/R is better when combining or processing of input data is needed. M/R & threads comparison about right. Threads should increase performance… ok, writing my own…
Application Level Perf
• Not transactional…
• Simulate reporting store; writes concurrent w/web page reads.
• Compare w/SQL Server and MongoDB, which have column indexes.
• You may not need column indexes if the row key is designed correctly. ESN is not the key; will need consecutive keys to split into balanced regions.
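The point about consecutive keys and balanced regions can be sketched as follows. This is a plain-Java illustration under stated assumptions: keys are consecutive integers stored as zero-padded strings, and the class and method names are hypothetical. It computes evenly spaced split keys for a key range; such keys could then be passed to table creation as pre-split boundaries.

```java
import java.util.ArrayList;
import java.util.List;

final class RegionSplitSketch {
    /**
     * Returns regions-1 split keys that divide [firstKey, lastKey] into
     * roughly equal ranges. Keys are zero-padded to `width` digits so that
     * their string (byte) order matches their numeric order.
     */
    static List<String> splitKeys(long firstKey, long lastKey, int regions, int width) {
        if (regions < 2 || lastKey <= firstKey) {
            throw new IllegalArgumentException("need at least 2 regions and a non-empty range");
        }
        List<String> splits = new ArrayList<>();
        long span = lastKey - firstKey + 1;
        for (int i = 1; i < regions; i++) {
            long boundary = firstKey + (span * i) / regions;
            splits.add(String.format("%0" + width + "d", boundary));
        }
        return splits;
    }
}
```

For example, `splitKeys(0, 999, 4, 4)` yields the boundaries `0250`, `0500`, `0750`. A non-consecutive identifier such as an ESN would not spread evenly across boundaries computed this way.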
Web GUI
• Demo: webpage & writes into DB. Test MS SQL Server packets/sec using the same setup.
• Run a LIKE '%asdf%' query that matches no data to see if there is a timeout
Read Performance
• Index search through webpage w/writer running is fast: 50-100 ms, <10-20 ms if in cache
• Don't do all-table scans, e.g. count 'table name' in the HBase shell (like SELECT COUNT(*) FROM table)
• PIG/HIVE are faster on top of HBase b/c they store metadata
• All-table scan:
  – 10 rows: 18 ms
  – 100 rows: 11-166 ms
  – 1000 rows: 638 ms
  – 10k rows: 4.3 s
  – 100k rows: 38 s (not printed)
• Use filters for search: exact match, regex, substring, more
SingleColumnValueFilter
• Search for a specific value: constant, regex, prefix. Did not try others
• Same queries as before; search for specific values, testing 100k-1M rows.
• W/O filters, use an iterator to hold the result set, iterate through each result, and test each result value, like DB drivers. A filter reduces the result set from all rows to only the rows which meet the condition
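The difference between iterating the full result set and filtering before rows are returned can be modeled in plain Java. This is not the HBase API; `FilterPushdownSketch`, `ScanResult`, and the in-memory list standing in for a table are all assumptions made for illustration. The two methods return the same matches and differ only in how many rows reach the client.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

final class FilterPushdownSketch {
    /** Matching values plus the number of rows handed to the client. */
    static final class ScanResult {
        final List<String> matches;
        final int rowsTransferred;

        ScanResult(List<String> matches, int rowsTransferred) {
            this.matches = matches;
            this.rowsTransferred = rowsTransferred;
        }
    }

    /** No filter: every row is returned and the client tests each value itself. */
    static ScanResult clientSide(List<String> table, Predicate<String> condition) {
        List<String> matches = new ArrayList<>();
        int transferred = 0;
        for (String value : table) {
            transferred++; // the row reaches the client before it is tested
            if (condition.test(value)) {
                matches.add(value);
            }
        }
        return new ScanResult(matches, transferred);
    }

    /** With a filter: the condition runs where the data lives; only matches are returned. */
    static ScanResult serverSide(List<String> table, Predicate<String> condition) {
        List<String> matches = new ArrayList<>();
        for (String value : table) {
            if (condition.test(value)) {
                matches.add(value);
            }
        }
        return new ScanResult(matches, matches.size());
    }
}
```

Note that in both cases every row is still examined; the filter saves transfer and client work, not the scan itself.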
Column Value
• Filter filter = new SingleColumnValueFilter(Bytes.toBytes("CF"), Bytes.toBytes("Key5"), CompareOp.EQUAL, Bytes.toBytes("bob"));
• Filter f = new SingleColumnValueFilter(Bytes.toBytes("CF"), Bytes.toBytes("COLUMN"), CompareOp.EQUAL, new RegexStringComparator("z*"));
• 565 ms for 200k rows, 115-row result set returned (printed); small result sets are faster.
Column Value Searches
• 100k row table
  – Returning .1% of results (10): 5 s
  – Returning 1% of results (100): 11.29 s
• 1M row table
  – 1% results (10k): 212 s
  – .1% results (1k): 204.057 s
Compose row key w/values or index tables
• Add a second table where the row keys are composed partially of the values
• Secondary table consistency: not needed for a reporting system? Consistent on inserts or bulk import.
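A rough sketch of the index-table idea, using a sorted in-memory map in place of the second table (HBase keeps rows sorted by key, which is what makes the range lookup work). The class name, the NUL separator, and the string keys are assumptions for illustration; a real table would use byte arrays and would need a separator that cannot occur in the values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

final class ValueIndexSketch {
    // Assumed separator between value and primary key; must not occur in values.
    private static final char SEP = '\u0000';

    // Sorted map stands in for the secondary table: index key -> primary row key.
    private final TreeMap<String, String> indexTable = new TreeMap<>();

    /** Index row key composed of the column value followed by the primary row key. */
    static String indexKey(String value, String primaryKey) {
        return value + SEP + primaryKey;
    }

    /** Called alongside the insert into the main table. */
    void put(String primaryKey, String value) {
        indexTable.put(indexKey(value, primaryKey), primaryKey);
    }

    /** Range lookup over [value+SEP, value+(SEP+1)): all primary keys holding this value. */
    List<String> lookup(String value) {
        String start = value + SEP;
        String stop = value + (char) (SEP + 1);
        return new ArrayList<>(indexTable.subMap(start, true, stop, false).values());
    }
}
```

A lookup then touches only the rows whose key starts with the value, instead of scanning the whole table with a column value filter. The cost is the second write per insert, which is where the consistency question comes from.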
Build Environment
• Ready for CI (Jenkins)
• Ubuntu-specific process for changing code: make all, make deb, make apt, then install using apt-get install hadoop* hbase*.
• Need to start over with yum for CentOS.
• Demo
• Also ready for command line w/o GUI: hbase org.apache.hadoop.hbase.PerfEval xx xx
Docs:
• Bigtop/updated version of CDH
• Installation:
• Build Docs: Ubuntu/deb; big change to rpms. Can do both, but it takes time to document and debug.
• Distributed Mode:
• NXServer/NXClient:
• Screen: