Adding Value to HBase with IBM InfoSphere BigInsights and BigSQL

This is the extended deck I used for my presentation at the Information On Demand 2013 conference for Session Number 1687 - Adding Value to HBase with IBM InfoSphere BigInsights and BigSQL.

This presentation covers accessing HBase using Big SQL. It starts by going over general HBase concepts, then delves into how Big SQL adds an SQL layer on top of HBase (via the HBase storage handler), secondary index support, querying, and more.



  1. 1. Adding Value to HBase with IBM InfoSphere BigInsights and BigSQL
     Session Number 1687
     Piotr Pruski @ppruski
  2. 2. Please note
     IBM’s statements regarding its plans, directions, and intent are subject to change or withdrawal without notice at IBM’s sole discretion. Information regarding potential future products is intended to outline our general product direction and it should not be relied on in making a purchasing decision. The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material, code or functionality. Information about potential future products may not be incorporated into any contract. The development, release, and timing of any future features or functionality described for our products remains at our sole discretion.
     Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon many factors, including considerations such as the amount of multiprogramming in the user’s job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve results similar to those stated here.
  3. 3. Agenda
      Introduction to HBase
      Big SQL HBase Storage Handler
       – Column mapping
       – Data encoding
       – Data load
      Secondary Indexes
      Querying
      Recommendations and limitations
      Logs and Troubleshooting
      Highlights and HBase use cases
  4. 4. HBase Basics
      Client/server database
       – Master and a set of region servers
      Key-value store
       – Key and value are byte arrays
       – Efficient access using row key
      Different from relational databases
       – No types: all data is stored as bytes
       – No schema: rows can have different sets of columns
  5. 5. HBase Data Model
      Table
       – Contains column families
      Column family
       – Logical and physical grouping of columns
      Column
       – Exists only when inserted
       – Can have multiple versions
       – Each row can have a different set of columns
       – Each column identified by its key
      Row key
       – Implicit primary key
       – Used for storing ordered rows
       – Efficient queries using row key
     Example table HBTABLE:
       Row key 11111: cf_data: {‘cq_name’: ‘name1’, ‘cq_val’: 1111}; cf_info: {‘cq_desc’: ‘desc11111’}
       Row key 22222: cf_data: {‘cq_name’: ‘name2’, ‘cq_val’: 2013 @ ts = 2013, ‘cq_val’: 2012 @ ts = 2012}
     On disk, each column family is stored as key-value pairs in its own HFiles:
       HFile (cf_data): 11111 cf_data cq_name name1 @ ts1; 11111 cf_data cq_val 1111 @ ts1; 22222 cf_data cq_name name2 @ ts1; 22222 cf_data cq_val 2013 @ ts1; 22222 cf_data cq_val 2012 @ ts2
       HFile (cf_info): 11111 cf_info cq_desc desc11111 @ ts1
  6. 6. More on the HBase Data Model
      There is no schema for an HBase table in the RDBMS sense
       – Except that one has to declare the column families
        • Since they determine the physical on-disk organization
       – Thus every row can have a different set of columns
      HBase is described as a key-value store
       – Key = Row + Column Family + Column Qualifier + Timestamp; each key maps to a Value
      Each key-value pair is versioned
       – The version can be a timestamp or an integer
       – Updating a column just adds a new version
      All data are byte arrays, including the table name, column family names, and column names (also called column qualifiers)
  7. 7. HBase Cluster Architecture
      The client finds region server addresses in ZooKeeper (the ZooKeeper quorum of peers)
      The client reads and writes rows by accessing the region server directly
      The HBase master assigns regions and handles load balancing
      ZooKeeper is used for coordination and monitoring
      Each region server hosts a set of regions (with coprocessors); region data is stored in HFiles on HDFS / GPFS
  8. 8. BigInsights - Big SQL
      Big SQL brings robust SQL support to the Hadoop ecosystem
      Driving design goals
       – Existing queries should run with no or few modifications
       – Existing JDBC and ODBC compliant tools should continue to function
        • Data warehouse augmentation is a very common use case for Hadoop
      While highly scalable, MapReduce is notoriously difficult to use
      SQL support opens the data to a much wider audience
      Making data in BigInsights accessible to SQL-capable tools
       – Cognos BI
       – Microstrategy
       – Tableau
       – …
  9. 9. Big Data for a Query-able Archive
     (Diagram: Cognos BI Server and Cognos Insight ("Report & Act", "Explore & Analyze"), plus InfoSphere Optim, issuing SQL against both BigInsights (Hadoop) and InfoSphere Warehouse/Netezza, with bi-directional query support between the two.)
     • Cognos BI can issue SQL queries against data managed by Apache Hive in BigInsights
     • The IBM Big Data platform supports bi-directional queries between BigInsights and the EDW
     • Key benefits:
       • Existing SQL-based applications can leverage the Big Data platform
       • EDW optimized from a size and performance perspective
       • Provides cost-effective and flexible big data storage and analysis
  10. 10. Big SQL HBase Storage Handler
      Maps SQL to HBase data: column mapping
      Handles serialization/deserialization of data (SerDe)
      Efficiently handles SQL queries by pushing down predicates
     (Diagram: a JDBC application sends an SQL query to Big SQL and receives query results; the HBase storage handler, comprising the SerDe, a compile-time query optimizer (processes hints) and a runtime query analyzer (HBase scan limits, filters, index usage), sits between Big SQL and HBase; delimited input files and warehouse data reside on the DFS.)
  11. 11. Column Mapping
      Mapping the HBase row key/columns to SQL columns
       – Supports one-to-one and one-to-many mappings
      One-to-one mapping
       – A single HBase entity is mapped to a single SQL column
     Example: row key 11111 maps to SQL column id; cf_data:cq_name (‘name1’) maps to name; cf_data:cq_val (1111) maps to value; cf_info:cq_desc (‘desc11111’) maps to desc
  12. 12. Create Table: One-to-One Mapping
     CREATE HBASE TABLE HBTABLE (
       id INT,
       name VARCHAR(10),
       value INT,
       desc VARCHAR(20)
     )
     COLUMN MAPPING (
       key mapped by (id),                 -- the row key mapping is required
       cf_data:cq_name mapped by (name),   -- HBase column identified by family:qualifier
       cf_data:cq_val mapped by (value),
       cf_info:cq_desc mapped by (desc)
     );
  13. 13. One-to-Many Column Mapping
      A single HBase entity mapped to multiple SQL columns
      Composite key
       – The HBase row key mapped to multiple SQL columns
      Dense column
       – One HBase column mapped to multiple SQL columns
     Example: row key 11111_ac11 maps to (userid, acc_no); cf_data:cq_names ‘fname1_lname1’ maps to (first_name, last_name); cf_data:cq_acct ‘11111#11#0.25’ maps to (balance, min_bal, interest)
  14. 14. Create Table: One-to-Many Mapping
     CREATE HBASE TABLE DENSE_TABLE (
       userid INT,
       acc_no VARCHAR(10),
       first_name VARCHAR(10),
       last_name VARCHAR(10),
       balance double,
       min_bal double,
       interest double
     )
     COLUMN MAPPING (
       key mapped by (userid, acc_no),                          -- composite key
       cf_data:cq_names mapped by (first_name, last_name),      -- dense column
       cf_data:cq_acct mapped by (balance, min_bal, interest)   -- dense column
     );
     Each HBase entity is mapped by a list of SQL columns.
  15. 15. Why use One-to-Many Mapping?
      HBase is very verbose
       – Stores a lot of information for each value: <row> <columnfamily> <columnqualifier> <timestamp> <value>
       – Primarily intended for sparse data
      Save storage space
       – Sample table with 9 columns and 1.5 million rows:
        • One-to-one mapping: 522 MB
        • One-to-many mapping: 276 MB
      Improve query response time
       – Query results also return the entire key for each value
       – select * query on the sample table:
        • One-to-one mapping: 1m 31s
        • One-to-many mapping: 1m 2s
  16. 16. Data Encoding
      HBase stores all data as arrays of bytes
       – The application decides how to encode/decode the bytes
      Big SQL uses the Hive SerDe interface for serialization/deserialization
      Supports two types of data encoding: string and binary
      Encoding can be specified at the HBase row key/column level
     Example: the row key 11111_ac11 and cf_data:cq_names (‘fname1_lname1’) use string encoding, while cf_data:cq_acct (0x000001 …) uses binary encoding
  17. 17. String Encoding
      Default encoding
      The value is converted to a string and stored as UTF-8 bytes
      A separator identifies the parts in a one-to-many mapping
       – A different separator can be specified for each column and for the row key; the default separator for string encoding is the null byte (\u0000)
     CREATE HBASE TABLE DENSE_TABLE_STR (
       userid INT,
       acc_no VARCHAR(10),
       first_name VARCHAR(10),
       last_name VARCHAR(10),
       balance double,
       min_bal double,
       interest double
     )
     COLUMN MAPPING (
       key mapped by (userid, acc_no) separator '_',
       cf_data:cq_names mapped by (first_name, last_name) separator '_',
       cf_data:cq_acct mapped by (balance, min_bal, interest) separator '#'
     );
  18. 18. String Encoding: Pros and Cons
      Readable format and easier to port across applications
      Useful for mapping existing data (e.g., an existing HBase table exposed as an external Big SQL table; row key 11111_ac11, cq_names ‘fname1_lname1’, cq_acct ‘10000#10#0.25’)
      Numeric data is not collated correctly
       – HBase stores data as bytes and orders row keys lexicographically
       – So ‘2’ > ‘10’ and ‘9’ > ‘10’ (a sketch follows this slide)
      Slow
       – Parsing strings is expensive
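To make the collation pitfall concrete, here is a minimal sketch (table name and data values are illustrative, not from the deck) of a string-encoded key misordering numeric data:

     -- Hypothetical table: the INT key is string-encoded by default,
     -- so HBase orders the keys as the byte strings '10' < '2' < '9'
     CREATE HBASE TABLE STR_KEYED (
       id INT,
       val VARCHAR(10)
     )
     COLUMN MAPPING (
       key mapped by (id),        -- string encoding (the default)
       cf:q mapped by (val)
     );
     insert into STR_KEYED (id, val) values (2, 'a');
     insert into STR_KEYED (id, val) values (9, 'b');
     insert into STR_KEYED (id, val) values (10, 'c');
     -- A pushed-down range predicate such as
     --   select id from STR_KEYED where id > 2;
     -- now runs against lexicographic key order ('10' sorts before '2'),
     -- whereas declaring the key with ENCODING BINARY (slide 20) preserves
     -- numeric order.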
  19. 19. External Tables
      Useful for mapping tables that already exist in HBase
       – Data in external tables is not pre-validated
      Can create multiple views of the same table, e.g. using a subset of the data from dense_table:
     create external hbase table externalhbase_table (
       user INT, acc string, balance double, min_bal double, interest double
     )
     column mapping (
       key mapped by (user, acc),
       cf_data:cq_acct mapped by (balance, min_bal, interest) separator '#'
     )
     hbase table name 'dense_table';
      HBase tables created using the Hive HBase storage handler cannot be read by Big SQL directly
       – Create external tables for this
      Things to note:
       – Dropping an external table only drops the metadata
       – Cannot create a secondary index on external tables
  20. 20. Binary Encoding
      Data is encoded using a sortable binary representation
      Separators are handled internally
       – Escaped to avoid the issue of the separator existing within the data
      If no encoding is specified, string is used as the default
     CREATE HBASE TABLE MIXED_ENCODING (
       C1 INT, C2 INT, C3 INT, C4 VARCHAR(10), C5 DECIMAL(5,2), C6 SMALLINT
     )
     COLUMN MAPPING (
       KEY MAPPED BY (C1, C2, C3) ENCODING BINARY,
       CF1:COL1 MAPPED BY (C4, C5) SEPARATOR '|',
       CF2:COL1 MAPPED BY (C6) ENCODING BINARY
     );
     Example row: key = 0x000000000000000100000000000000020000000000000003; cf1:col1 = foo|97.31; cf2:col1 = 0x0000DEAF
  21. 21. Binary Encoding: Pros and Cons
      Faster
      Numeric types are collated correctly, including negative numbers
      Limited portability
     CREATE HBASE TABLE WEATHER (temp INT, date TIMESTAMP, humidity DOUBLE)
     COLUMN MAPPING (key mapped by (temp, date), cf:cq mapped by (humidity))
     default encoding binary;
     Input rows:
       100,2012-06-10 17:00:00:000,40.25
       -17,2012-12-12 17:00:00:000,30.25
       95,2012-06-05 17:00:00:000,50.25
     Stored row keys sort as -17, 95, 100:
       \x01\x7F\xFF\xFF\xEF\x012012-12-12 17:00:00:000\x00
       \x01\x80\x00\x00_\x012012-06-05 17:00:00:000\x00
       \x01\x80\x00\x00d\x012012-06-10 17:00:00:000\x00
     cf:cq values:
       \x01\xC0>@\x00\x00\x00\x00\x00
       \x01\xC0I \x00\x00\x00\x00\x00
       \x01\xC0D \x00\x00\x00\x00\x00
  22. 22. Load Data
      Load HBase
       – Loads data from delimited files; the file can be on the DFS or local to the Big SQL server
       – A column list can be specified; if not, the column ordering in the table definition is used
     load hbase data inpath 'file:///input.dat'
       delimited fields terminated by '|'
       into table hbtable (name, value, desc, id);
      Load FROM
       – Loads data from a (JDBC) source outside of a BigInsights cluster
      The insert command is also available
     insert into hbtable (name, value, desc, id)
       values ('name5', 5555, 'desc55555', 55555);
  23. 23. Load Data: Upsert
      HBase ensures uniqueness of the row key, so a load silently overwrites rows that share a key:
     Input:
       11111, name1, 1111, desc11111
       11111, name9, 9999, desc99999
       22222, name2, 2222, desc22222
     After load, key 11111 holds only the latest values (name9, 9999, desc99999 @ ts1); name1, 1111, desc11111 @ ts0 becomes an older version.
      Upsert can be confusing: no errors, but fewer rows!
       – Delimited file: 10 rows
       – Load: 10 rows affected
       – select count(*) from hbtable: 7 rows
      Combine multiple columns to make the row key unique, e.g. key mapped by (id, name):
       11111\x00name1, 1111, desc11111 @ts0
       11111\x00name9, 9999, desc99999 @ts1
       22222\x00name2, 2222, desc22222 @ts1
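A sketch of the full DDL for the composite-key fix above (the deck shows only the key clause; the table name is hypothetical and the layout mirrors HBTABLE from slide 12):

     CREATE HBASE TABLE HBTABLE_COMPOSITE (
       id INT,
       name VARCHAR(10),
       value INT,
       desc VARCHAR(20)
     )
     COLUMN MAPPING (
       key mapped by (id, name),           -- rows sharing id alone no longer collide
       cf_data:cq_val mapped by (value),
       cf_info:cq_desc mapped by (desc)
     );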
  24. 24. Force Key Unique
      Use the force key unique option when creating a table
     CREATE HBASE TABLE HBTABLE_FORCE_KEY_UNIQUE (
       id INT,
       name VARCHAR(10),
       value INT,
       desc VARCHAR(20)
     )
     COLUMN MAPPING (
       key mapped by (id) force key unique,
       cf_data:cq_name mapped by (name),
       cf_data:cq_val mapped by (value),
       cf_info:cq_desc mapped by (desc)
     );
      Load adds a UUID to the row key, e.g.:
       11111\x00b71c95d8-ffdd-4d49-9015-2fdd6f7dcdf4, name1, 1111, desc11111
       11111\x00ea780078-9893-4bf7-95d8-cb9ca4b2427f, name9, 9999, desc99999
       22222\x00a90885b0-418b-49ac-a6f6-aa73273b57ca, name2, 2222, desc22222
      Prevents data loss, but is inefficient:
       – Stores more data
       – Slower queries
  25. 25. Load Data: Error Handling
      Option to continue and log error rows
       – LOG ERROR ROWS IN FILE 'filename'
      Common errors
       – The separator exists within the data (string encoding)
       – Invalid numeric types
      Always count the number of rows after loading
       – Load always reports the total number of rows that it handled
     Example with key mapped by (id, name) separator '-' and id defined as integer; load reports 4 rows affected:
     Input:
       11111, name1, 1111, desc11111
       11111, name9, 9999, desc99999
       22222, name-2, 2222, desc22222
       3333a, name3, 3333, desc33333
     HBase table (2 rows):
       11111-name1, 1111, desc11111
       11111-name9, 9999, desc99999
     Error file (2 rows):
       22222, name-2, 2222, desc22222    (separator '-' appears in the data)
       3333a, name3, 3333, desc33333     ('3333a' is not a valid integer)
     A combined example follows this slide.
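Putting the options above together, a hedged sketch (file names are illustrative; the deck gives the LOG ERROR ROWS IN FILE option but not its exact position in the statement, so the clause placement is an assumption):

     -- Load, diverting rejected rows to an error file instead of losing them silently
     load hbase data inpath 'file:///input.dat'
       delimited fields terminated by '|'
       into table hbtable (id, name, value, desc)
       log error rows in file 'file:///load_errors.dat';
     -- Load reports the rows it handled, not the rows stored, so always
     -- compare the table's row count against the input file's line count
     select count(*) from hbtable;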
  26. 26. Options to Speed up Load
      Disable the WAL
       – Data loss can happen if a region server crashes
     LOAD HBASE DATA INPATH 'tpch/ORDERS.TXT'
       DELIMITED FIELDS TERMINATED BY '|'
       INTO TABLE ORDERS DISABLE WAL;
      Increase the write buffer
       – set hbase.client.write.buffer=8388608;
  27. 27. Secondary Index Support
      Self-maintaining secondary indexes
       – Stored in an HBase table
       – Populated using a MapReduce index builder
       – Kept up to date using a synchronous coprocessor
     (Diagram: a client's create index statement launches the MRIndexBuilder, which populates the index table from the data regions; as clients write input data, the index coprocessor synchronously maintains the index regions; at query time, the HBase storage handler's query analyzer decides whether to use the index, scans the index regions, and issues batched get requests against the data regions.)
  28. 28. Index Creation and Usage
     create hbase table dt (id int, c1 string, c2 string, c3 string, c4 string, c5 string)
     column mapping (
       key mapped by (id),
       f:a mapped by (c1, c2, c3),
       f:b mapped by (c4, c5)
     );
     create index ixc3 on table dt (c3) as 'hbase';
      Automatic index usage, e.g. for a query with predicate c3 = c32:
       – Without the index, a full table scan of dt is needed
       – With the index: a range scan on the index table dt_ixc3 (start row = c32, stop row = c32++) finds the matching base-table row key(s), e.g. index key c32_bt2 → base row bt2
       – Batched get requests retrieve the matched row(s) from the base table
  29. 29. Index Pros and Cons
      Fast key-based lookups for queries that return limited data
      Not beneficial if there are too many matches
       – No statistics are available for the compiler to make the decision
       – Use the useindex hint to make explicit choices
      An index adds latency to data load
       – When loading a big data set, drop the index and recreate it afterwards (see the sketch after this slide)
       – The LOAD FROM option bypasses index maintenance: it uses HBase bulk load, which writes to HFiles directly
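A sketch of the drop-and-recreate pattern described above, reusing the dt/ixc3 example from slide 28; the DROP INDEX syntax shown is an assumption mirrored from CREATE INDEX, and the file name is illustrative:

     -- Drop the index so the load does not pay the synchronous
     -- coprocessor maintenance cost on every row
     drop index ixc3 on table dt;    -- assumed syntax
     load hbase data inpath 'file:///big_input.dat'
       delimited fields terminated by '|' into table dt;
     -- Rebuild the index once, via the MapReduce index builder
     create index ixc3 on table dt (c3) as 'hbase';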
  30. 30. Column Family Options
      Compression
       – compression(gz)
      Bloom filters
       – NONE, ROW, ROWCOL
      In-memory columns
       – in memory, no in memory
     create hbase table colopt_table (key string, c1 string)
     column mapping (key mapped by (key), cf1:c1 mapped by (c1))
     column family options (cf1 compression(gz) bloom filter(row) in memory);
  31. 31. Query Handling
      Projection pushdown
      Predicate pushdown
       – Point scan
       – Range scan
       – Automatic index usage
       – Filters
      Query hints
  32. 32. Sample Data
      TPCH orders table with 1.5 million rows
     drop table if exists orders;
     CREATE HBASE TABLE ORDERS (
       O_ORDERKEY BIGINT,
       O_CUSTKEY INTEGER,
       O_ORDERSTATUS VARCHAR(1),
       O_TOTALPRICE FLOAT,
       O_ORDERDATE TIMESTAMP,
       O_ORDERPRIORITY VARCHAR(15),
       O_CLERK VARCHAR(15),
       O_SHIPPRIORITY INTEGER,
       O_COMMENT VARCHAR(79)
     )
     column mapping (
       key mapped by (O_ORDERKEY, O_CUSTKEY),
       cf:d mapped by (O_ORDERSTATUS, O_TOTALPRICE, O_ORDERPRIORITY, O_CLERK, O_SHIPPRIORITY, O_COMMENT),
       cf:od mapped by (O_ORDERDATE)
     )
     default encoding binary;
     LOAD HBASE DATA INPATH 'tpch/ORDERS.TXT' DELIMITED FIELDS TERMINATED BY '|' INTO TABLE ORDERS;
  33. 33. Projection Pushdown
      Get only the columns required by the query
      Limits the data retrieved to the client
     select * from orders go -m discard
       1500000 rows in results (first row: 0.21s; total: 1m1.77s)
       Log: HBase scan details:{ …, families={cf=[d, od]}, …}
     select o_totalprice from orders go -m discard
       1500000 rows in results (first row: 0.19s; total: 21.27s)
       Log: HBase scan details:{ …, families={cf=[d]}, …}
     select o_orderdate from orders go -m discard
       1500000 rows in results (first row: 0.36s; total: 36.24s)
       Log: HBase scan details:{ …, families={cf=[od]}, …}
       Note: the response time is higher for this query even though it retrieves less data than the o_totalprice query, because the timestamp type is more expensive to process.
      Projection happens at the HBase column level
       – For composite keys and dense columns, the entire value is retrieved to the client
       – It is efficient to pack columns that are queried together
  34. 34. Predicate Pushdown: Point Scan
      With the full row key, Big SQL can combine the predicates on the row key parts into a single-row scan
     set force local on;
     select o_orderkey, o_totalprice from orders where o_custkey=1 and o_orderkey=454791;
     +--------------+
     | o_totalprice |
     +--------------+
     | 208660.75000 |
     +--------------+
     1 row in results (first row: 0.14s; total: 0.14s)
     Log: Found a row scan by combining all composite key parts.
     The predicates o_custkey=1 and o_orderkey=454791 become start row = 1#454791, stop row = 1#454791.
  35. 35. Predicate Pushdown: Partial Row Scan
      Predicate(s) on the leading part(s) of the row key
     select o_orderkey, o_totalprice from orders where o_custkey=1;
     +------------+--------------+
     | o_orderkey | o_totalprice |
     +------------+--------------+
     |     454791 |  74602.81250 |
     |     579908 |  54048.26172 |
     |    3868359 | 123076.84375 |
     |    4273923 |  95911.00781 |
     |    4808192 |  65478.05078 |
     |    5133509 | 174645.93750 |
     +------------+--------------+
     6 rows in results (first row: 0.13s; total: 0.13s)
     Log: Found a row scan that uses the first 1 part(s) of composite key.
     The predicate o_custkey=1 becomes start row = 1, stop row = 1++.
  36. 36. Predicate Pushdown: Range Scan
      With range predicates
     select o_orderkey, o_totalprice from orders where o_custkey < 3;
     Log: Found a row scan that uses the first 1 part(s) of composite key.
     Log: HBase scan details:{ …, stopRow=\x01\x80\x00\x00\x03, startRow=, … }
     The predicate o_custkey<3 becomes start row = (empty), stop row = 3#.
  37. 37. Predicate Pushdown: Full Table Scan
      An example of a case where predicates are not pushed down: predicates on non-leading parts of the row key
     set force local on;
     select o_orderkey, o_totalprice from orders where o_orderkey=454791;
     +------------+--------------+
     | o_orderkey | o_totalprice |
     +------------+--------------+
     |     454791 |  74602.81250 |
     +------------+--------------+
     1 row in results (first row: 32.13s; total: 32.13s)
     Log: HBase scan details:{ …, stopRow=, startRow=, … }
  38. 38. Automatic Index Usage
     select * from orders where o_clerk='Clerk#000000999' go -m discard
       1472 rows in results (first row: 1.63s; total: 30.32s)
     create index ix_clerk on table orders (o_clerk) as 'hbase';
       0 rows affected (total: 3m57.82s)
     select * from orders where o_clerk='Clerk#000000999' go -m discard
       1472 rows in results (first row: 3.60s; total: 3.65s)
       Log: Index query successful
      The index is used automatically
      For a composite index, rules similar to the composite row key apply
       – Parts will be combined where possible
       – With a partial value for a composite index, a range scan is done on the index table
      Multiple indexes on a table
       – The index to be used is chosen randomly
       – Specify the useindex hint to make use of a specific index (a sketch follows this slide)
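The deck names the useindex hint but not its full syntax; assuming it follows the same /*+ … +*/ hint style as accessmode and rowcachesize (slides 42-43), steering a lookup to a specific index might look like this sketch:

     -- Assumed hint syntax: pick ix_clerk explicitly instead of a random choice
     select * from orders /*+ useindex='ix_clerk' +*/
     where o_clerk='Clerk#000000999';
     -- Per slide 43, useindex='false' disables index usage for a query entirely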
  39. 39. Pushing down Filters into HBase
      Filters do not avoid a full table scan
       – Some filters can skip certain sections, e.g. PrefixFilter
      Limits the rows returned to the client
      Limits the data returned to the client
       – Key-only filters
     Example: a row scan plus a column filter, since there is a predicate on the leading part of a dense column:
     select o_orderkey from orders where o_custkey>100000 and o_orderstatus='P' go -m discard
       12819 rows in results (first row: 1.12s; total: 6.80s)
     Log: Found a row scan that uses the first 1 part(s) of composite key.
     Log: HBase filter list created using AND.
     Log: HBase scan details:{…, filter=FilterList AND (1/1): [SingleColumnValueFilter (cf, d, EQUAL, \x01P\x00)], stopRow=, startRow=\x01\x80\x01\x86\xA1, …}
  40. 40. Key-Only Tables
      Big SQL allows creation of tables without specifying any HBase column
     create hbase table KEY_ONLY_TABLE (k1 string, k2 string, k3 string)
     column mapping (key mapped by (k1, k2, k3));
     select * from KEY_ONLY_TABLE;
     Log: Only row key or parts of row key requested. Applying filters.
     Log: HBase scan details:{… families={}, filter=FilterList AND (2/2): [FirstKeyOnlyFilter, KeyOnlyFilter], …}
  41. 41. Predicate Precedence
      When a query contains multiple predicates, the following precedence applies:
       – Row scan
       – Index
       – Filters
        • Row filters
        • Column filters
      Filters will be applied along with row scans
      Filters cannot be combined with index lookups
      Multiple predicates: use of row and column filters. In the example below, the OR condition prevents usage of a row scan; a row filter (PrefixFilter) is used along with a column filter:
     select o_orderkey, o_custkey, o_orderdate from orders
     where o_orderdate=cast('1996-12-09' as timestamp) or o_custkey=2;
     Log: HBase filter list created using OR.
     Log: HBase scan details:{…, filter=FilterList OR (2/2): [SingleColumnValueFilter (cf, od, EQUAL, \x011996-12-09 00:00:00.000\x00), PrefixFilter \x01\x80\x00\x00\x02], cacheBlocks=false, stopRow=, startRow=, …}
  42. 42. Accessmode Hint
      Runs the query locally in the Big SQL server
       – Useful to avoid MapReduce overhead
      Very important for HBase point queries
       – This is not currently detected by the compiler
       – Specify the accessmode='local' hint when getting a limited set of data from HBase
      Specify at the query level:
     select o_orderkey from orders /*+ accessmode='local' +*/
     where o_custkey=1 and o_orderkey=454791;
      Or at the session level:
       – set force local on
       – set force commands override query-level hints
  43. 43. HBase Hints
      rowcachesize (default=2000)
       – Used as the scan cache setting
       – Also used to determine the number of get requests to batch in index lookups
      colbatchsize (default=100)
      useindex ('false' to avoid index usage)
     select o_orderkey from orders /*+ rowcachesize=10000 +*/ where o_custkey>5000 go -m discard
       1450136 rows in results (first row: 22.67s; total: 27.46s)
     Log: HBase scan details:{…, caching=10000, …}
      rowcachesize can also be set using the set command:
       – set hbase.client.scanner.caching=10000;
  44. 44. Recommendations
      Row key design is the most important factor
       – Try to combine the predicates that are most commonly used into row key columns
       – Do not make the row key too long
      Use short names for HBase column families and column qualifiers
       – f:q instead of mycolumnfamily:mycolumnqualifier
      Check if key-only tables can be used
      Pack columns that are queried together into dense columns
       – Use the column that is used as a query predicate as the prefix
       – Create indexes for columns that do not have repeating values and are queried often
      Separate columns that are rarely or never queried into a different column family
      Set hbase.client.scanner.caching to an optimum value
      Ensure even data distribution
     A sketch applying these recommendations follows this slide.
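A hedged sketch (table, family, and column names are illustrative, not from the deck) pulling the recommendations together in one DDL:

     create hbase table sales (
       region VARCHAR(10),
       sale_date TIMESTAMP,
       amount DOUBLE,
       qty INT,
       notes VARCHAR(100)
     )
     column mapping (
       -- the most commonly used predicates lead the (short) row key
       key mapped by (region, sale_date),
       -- short family:qualifier names; columns queried together share a dense
       -- column, with the likely predicate column (amount) as the prefix
       f:q mapped by (amount, qty),
       -- rarely queried column isolated in its own column family
       r:n mapped by (notes)
     )
     default encoding binary;   -- numeric key parts collate correctly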
  45. 45. Limitations
      No diagnostic info about HBase pushdown
       – How the HBase storage handler pushes down a query is decided only at runtime
       – Predicate handling details are logged at INFO level
       – Many examples of the log messages are covered in the previous slides
      No auto-detection of local vs MR mode
       – Currently depends on user-specified hints
      Statistics not available
       – Big SQL does not have a framework to collect statistics
       – Query optimizations can be improved with the availability of useful statistics
      Map type not supported
       – Big SQL does not support the map data type
       – The Hive HBase handler supports the map data type and many-to-one mapping
        • Mapping an entire HBase column family to a map data type
  46. 46. Logs and Troubleshooting
      Big SQL logs
       – Look for the rewritten query
       – More information is in the Big SQL logs if the query is run in local mode
      MapReduce logs
       – Predicate handling information is in the map task log when run in MR mode
      HBase web GUI
       – http://<<hostname>>:60010/master-status
  47. 47. Big SQL HBase Handler Highlights
      Support for composite keys/dense columns
      Pushdown for efficient execution of queries
      Support for secondary indexes
      Binary encoding (collated correctly)
      Key-only tables
      Support for hints to make query optimization decisions
  48. 48. Scenarios that can leverage HBase features
      Point queries
       – Queries that return a single row of results
       – The row can be determined using the row key or a secondary index
        • Not all queries using a secondary index are point queries
      Queries with projections
       – If a query requires only a few columns
       – Projection happens at the HBase column level
      Data maintenance using upserts
       – Loading different values for columns using the same row key
  49. 49. Acknowledgements and Disclaimers
     Availability. References in this presentation to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates.
     The workshops, sessions and materials have been prepared by IBM or the session speakers and reflect their own views. They are provided for informational purposes only, and are neither intended to, nor shall have the effect of being, legal or other guidance or advice to any participant. While efforts were made to verify the completeness and accuracy of the information contained in this presentation, it is provided AS-IS without warranty of any kind, express or implied. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, this presentation or any other materials. Nothing contained in this presentation is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use of IBM software.
     All customer examples described are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual environmental costs and performance characteristics may vary by customer. Nothing contained in these materials is intended to, nor shall have the effect of, stating or implying that any activities undertaken by you will result in any specific sales, revenue growth or other results.
     © Copyright IBM Corporation 2013. All rights reserved.
     U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
     IBM, the IBM logo, ibm.com, and InfoSphere BigInsights are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at “Copyright and trademark information” at www.ibm.com/legal/copytrade.shtml
     Other company, product, or service names may be trademarks or service marks of others.
  50. 50. Thank You
     Adding Value to HBase with IBM InfoSphere BigInsights and BigSQL
     Piotr Pruski @ppruski
     Acknowledgements
      Full credit to Deepa Remesh
