Big Data: Big SQL and HBase

© 2016 IBM Corporation
Big SQL and HBase:
A Technical Introduction
Created by C. M. Saracco

© 2016 IBM Corporation2
Executive summary
 Big SQL
 Industry-standard SQL interface for data managed by IBM’s Hadoop platform
 Table creators chose desired storage mechanism
 Big SQL queries require no special syntax due to underlying storage
 Why Big SQL and HBase?
 Easy on-ramp to Hadoop for SQL professionals
 Support familiar SQL tools / applications (via JDBC and ODBC drivers)
 HBase very popular in Hadoop – low latency for certain queries, good for sparse data . . .
 What operations are supported with HBase?
 Create tables / views. DDL extensions for exploiting HBase (e.g., column mappings,
FORCE UNIQUE KEY, ADD SALT, . . . )
 Load data into tables (from local files, remote files, RDBMSs)
 Query data (wide range of relational operators, including join, union, wide range of sub-
queries, wide range of built-in functions . . . . )
 Update and delete data
 Create secondary indexes
 Collect statistics and inspect data access plan
 . . . .

Agenda
 HBase overview
 Big SQL with HBase
 Creating tables and views
 Populating tables with data
 Querying data
 Design considerations
 Reducing storage and I/O
 Generating unique row keys
 Specifying dense / composite columns
 Encoding data
 Advanced topics

What is HBase?
 A columnar NoSQL data store for Hadoop
 Data stored as a sparse, distributed, persistent multidimensional sorted map
 An industry leading implementation of Google’s BigTable Design
 An open source Apache Top Level Project
 HBase powers some of the leading sites on the Web

HBase basics
 Clustered Environment
 Master and Region Servers
 Automatic split ( sharding )
 Key-value store
 Key and value are byte arrays
 Efficient access using row key
 Rich set of Java APIs and extensible frameworks
 Supports a wide variety of filters
 Allows application logic to run in region server via coprocessors
 Different from relational databases
 No types: all data is stored as bytes
 No schema: Rows can have different set of columns

HBase data model: LOGICAL VIEW
 Table
 Contains column-families
 Column family
 Logical and physical grouping of
columns. Basic storage unit.
 Column
 Exists only when inserted
 Can have multiple versions
 Each row can have different set of
columns
 Each column identified by it’s key
 Row key
 Implicit primary key
 Used for storing ordered rows
 Efficient queries using row key
HBTABLE
Row key Value
11111 cf_data:
{‘cq_name’: ‘name1’,
‘cq_val’: 1111}
cf_info:
{‘cq_desc’: ‘desc11111’}
22222 cf_data:
{‘cq_name’: ‘name2’,
‘cq_val’: 2013 @ ts = 2013,
‘cq_val’: 2012 @ ts = 2012
}
HFileHFileHFile
11111 cf_data cq_name name1 @ ts1
11111 cf_data cq_val 1111 @ ts1
22222 cf_data cq_name name2 @ ts1
HFile
11111 cf_info cq_desc desc11111 @ ts1

Consider an example …
Client ID Last Name First
Name
Account
Number
Type of
Account
Timestamp
01234 Smith John abcd1234 Checking 20120118
01235 Johnson Michael wxyz1234 Checking 20120118
01235 Johnson Michael aabb1234 Checking 20111123
Given this sample RDBMS table

HBase logical view (“records”) looks like . . .
Row key Value (CF, Column, Version, Cell)
01234 info: {‘lastName’: ‘Smith’,
‘firstName’: ‘John’}
acct: {‘checking’: ‘abcd1234’}
01235 info: {‘lastName’: ‘Johnson’,
‘firstName’: ‘Michael’}
acct: {‘checking’: ‘wxyz1234’@ts=2012,
‘checking’: ‘aabb1234’@ts=2011}

While the physical view (“cell”) looks like . . .
Row Key Column Key Timestamp Cell Value
01234 info:fname 1330843130 John
01234 info:lname 1330843130 Smith
01235 info:fname 1330843345 Michael
01235 info:lname 1330843345 Johnson
info Column Family
acct Column Family
Row Key Column Key Timestamp Cell Value
01234 acct:checking 1330843130 abcd1234
01235 acct:checking 1330843345 wxyz1234
01235 acct:checking 1330843239 aabb1234
Key/Value Row Column Family Column Qualifier Timestamp Value
Key

HBase cluster architecture
HDFS
Region Server
…
Master
…
Client
ZooKeeper Peer
ZooKeeper Quorum
ZooKeeper Peer
…
Hbase master
assigns
regions and
load balancing
Client finds
region server
addresses in
ZooKeeper
Client reads and
writes row by
accessing the
region server
ZooKeeper is
used for
coordination /
monitoring
Region
Region Server
Region … Region Region …
HFile
HFile
HFile
HFile
HFile
HFile HFile HFile
HFile HFile
Coprocessor Coprocessor …Coprocessor Coprocessor
Region Server
Coprocessor CoprocessorCoprocessor …CoprocessorCoprocessor …CoprocessorCoprocessor
… …
WAL WAL

How are tables stored in HBase?
Rows
Table–LogicalView
A
.
.
K
.
.
T
.
.
Z
Keys:[A-D]
Keys:[E-H]
Keys:[S-V]
Keys:[N-R]
Keys:[I-M]
Keys:[W-Z]
Region Server 1 Region Server 2 Region Server 3
Region
Region
Region
Region
Region
Region
Auto-Sharded Regions
The data is “sharded”
Each shard contains all the data in a key-range

Agenda
 HBase overview
 Querying data
 Encoding data
 Advanced topics

Creating a Big SQL table in HBase
 Standard CREATE TABLE DDL with extensions
CREATE HBASE TABLE BIGSQLLAB.REVIEWS (
REVIEWID varchar(10) primary key not null,
PRODUCT varchar(30),
RATING int,
REVIEWERNAME varchar(30),
REVIEWERLOC varchar(30),
COMMENT varchar(100),
TIP varchar(100)
)
COLUMN MAPPING
(
key mapped by (REVIEWID),
summary:product mapped by (PRODUCT),
summary:rating mapped by (RATING),
reviewer:name mapped by (REVIEWERNAME),
reviewer:location mapped by (REVIEWERLOC),
details:comment mapped by (COMMENT),
details:tip mapped by (TIP)
);

Results from previous CREATE TABLE . . .
 Data stored in subdirectory of HBase
 /hbase/data/default/table
 Additional subdirectories for meta data, column families
 Table’s data are files within column family subdirectories
 Views over Big SQL HBase tables supported – no special syntax
create view bigsqllab.testview as
select reviewid, product, reviewername, reviewerloc
from bigsqllab.reviews
where rating >= 3;

CREATE HBASE TABLE . . . AS SELECT . . . .
 Create a Big SQL HBase table based on contents of other table(s)
-- source tables aren’t in HBase; they’re just external Big SQL tables over DFS files
CREATE hbase TABLE IF NOT EXISTS bigsqllab.sls_product_flat
( product_key INT NOT NULL
, product_line_code INT NOT NULL
, product_type_key INT NOT NULL
, product_type_code INT NOT NULL
, product_line_en VARCHAR(90)
, product_line_de VARCHAR(90)
)
column mapping
(
key mapped by (product_key),
data:c2 mapped by (product_line_code),
data:c3 mapped by (product_type_key),
data:c4 mapped by (product_type_code),
data:c5 mapped by (product_line_en),
data:c6 mapped by (product_line_de)
)
as select product_key, d.product_line_code, product_type_key,
product_type_code, product_line_en, product_line_de
from extern.sls_product_dim d, extern.sls_product_line_lookup l
where d.product_line_code = l.product_line_code;

Populating tables via LOAD
 Typically best runtime performance
 Load data from remote file system
LOAD HADOOP using file url
'sftp://myID:myPassword@myhost:22/sampleDir/bigsql/samples/data/GOSALESDW.SLS_PRODUCT_
DIM.txt' with SOURCE PROPERTIES ('field.delimiter'='t') INTO TABLE
bigsqllab.sls_product_dim;
 Loads data from RDBMS (DB2, Netezza, Teradata, Oracle, MS-SQL,
Informix) via JDBC connection
LOAD HADOOP USING JDBC CONNECTION URL 'jdbc:teradata://myhost/database=GOSALES'
WITH PARAMETERS (user ='myuser',password='mypassword')
FROM TABLE 'GOSALES'.'COUNTRY' COLUMNS (SALESCOUNTRYCODE, COUNTRY, CURRENCYNAME)
WHERE 'SALESCOUNTRYCODE > 9' SPLIT COLUMN 'SALESCOUNTRYCODE‘
INTO TABLE country_hbase
WITH TARGET TABLE PROPERTIES
('hbase.timestamp' = 1366221230799, 'hbase.load.method'='bulkload')
WITH LOAD PROPERTIES ('num.map.tasks' = 10, 'num.reduce.tasks'=5);

SQL capability highlights
 Same SELECT support as Big SQL tables
with Hive, DFS!
 Query operations
 Projections, restrictions
 UNION, INTERSECT, EXCEPT
 Wide range of built-in functions (e.g. OLAP)
 Full support for subqueries
 In SELECT, FROM, WHERE and
HAVING clauses
 Correlated and uncorrelated
 Equality, non-equality subqueries
 EXISTS, NOT EXISTS, IN, ANY,
SOME, etc.
 All standard join operations
 UPDATE and DELETE operations
 Stored procedures, UDFs
 Queries operate on current (latest) HBase
row values
SELECT
s_name,
count(*) AS numwait
FROM
supplier,
lineitem l1,
orders,
nation
WHERE
s_suppkey = l1.l_suppkey
AND o_orderkey = l1.l_orderkey
AND o_orderstatus = 'F'
AND l1.l_receiptdate > l1.l_commitdate
AND EXISTS (
SELECT
*
FROM
lineitem l2
WHERE
l2.l_orderkey = l1.l_orderkey
AND l2.l_suppkey <> l1.l_suppkey
)
AND NOT EXISTS (
SELECT
*
FROM
lineitem l3
WHERE
l3.l_orderkey = l1.l_orderkey
AND l3.l_suppkey <> l1.l_suppkey
AND l3.l_receiptdate >
l3.l_commitdate
)
AND s_nationkey = n_nationkey
AND n_name = ':1'
GROUP BY s_name
ORDER BY numwait desc, s_name;

Agenda
 HBase overview
 Querying data
 Encoding data
 Advanced topics

Storage and I/O considerations
 HBase is verbose – full key info stored with each cell value
 “Wide” tables with many column families / columns
 Tables with long column family and column names
 How to achieve user-friendly yet efficient design?
 Declare intuitive Big SQL column names
 Declare few HBase column families. Consider query patterns, exclusivity of data
 Map Big SQL columns to HBase col family:col with short names
 Consider use of dense (composite) HBase columns (discussed later)
CREATE hbase TABLE IF NOT EXISTS bigsqllab.sls_product1
( product_key INT NOT NULL
, product_line_code INT NOT NULL, product_type_key INT NOT NULL
. . . )
column mapping (
key mapped by (product_key),
data:c2 mapped by (product_line_code),
data:c3 mapped by (product_type_key),
. . . );

Forcing unique row keys
 HBase requires unique row keys for each row
 PUT (insert) with duplicate row key value creates new version of data – no error
as with relational PK!
 Modeling challenge for some RDBMS designs
 Big SQL's FORCE KEY UNIQUE allows for duplicate row keys
 Big SQL transparently appends a "uniqifier" to the row key (36 bytes)
CREATE HBASE TABLE DUP_ROW
(
C1 INT NOT NULL,
C2 INT NOT NULL
)
COLUMN MAPPING
(
KEY MAPPED BY (C1) FORCE KEY UNIQUE
CF:C1 MAPPED BY (C2)
);
DUP_ROW
0x0000010AF-F2E4-A…
Key Value
C1 0x00000001
Row Key Col Fam: CF
1> INSERT INTO DUP_ROW VALUES (1, 1), (1, 2);
0x0000010DE-E134-0…
Key Value
C1 0x00000002

Specifying dense columns
 Map N SQL columns to 1 HBase column family:column
 Disk space savings, potential query runtime improvements
CREATE HBASE TABLE BIGSQLLAB.SLS_SALES_FACT_DENSE (
ORDER_DAY_KEY int, ORGANIZATION_KEY int, EMPLOYEE_KEY int,
RETAILER_KEY int, RETAILER_SITE_KEY int, PRODUCT_KEY int,
PROMOTION_KEY int, ORDER_METHOD_KEY int, SALES_ORDER_KEY int,
SHIP_DAY_KEY int, CLOSE_DAY_KEY int, QUANTITY int,
UNIT_COST1 decimal(19,2), UNIT_PRICE decimal(19,2),
UNIT_SALE_PRICE decimal(19,2), GROSS_MARGIN double,
SALE_TOTAL decimal(19,2), GROSS_PROFIT decimal(19,2) )
COLUMN MAPPING (
key mapped by
(ORDER_DAY_KEY, ORGANIZATION_KEY, EMPLOYEE_KEY, RETAILER_KEY, RETAILER_SITE_KEY,
PRODUCT_KEY, PROMOTION_KEY, ORDER_METHOD_KEY),
cf_data:cq_OTHER_KEYS mapped by
(SALES_ORDER_KEY, SHIP_DAY_KEY, CLOSE_DAY_KEY),
cf_data:cq_QUANTITY mapped by (QUANTITY),
cf_data:cq_MONEY mapped by
(UNIT_COST, UNIT_PRICE, UNIT_SALE_PRICE, GROSS_MARGIN, SALE_TOTAL, GROSS_PROFIT) );

Encoding data
 3 options with Big SQL
 Binary: value stored in platform neutral binary encoding
 String: value converted to string and stored as UTF-8 bytes
 Custom: user provided Hive SerDe class to encode/decode column(s)
 Specified at table creation
 BINARY
 Based on Hive’s BinarySortable SerDe
 Default (and recommended) encoding
 STRING
 Format dictated by Java’s toString() functions
 Easy to read but expensive to convert
 Generally discouraged, particularly for numeric data
• Query predicates of numeric data encoded as Strings won’t be pushed down

Agenda
 HBase overview
 Querying data
 Encoding data
 Advanced topics

Secondary indexes
create hbase table dt(id int,c1 string,c2 string,c3 string,c4 string,c5 string)
column mapping (key mapped by (id), f:a mapped by (c1,c2,c3), f:b mapped by (c4,c5));
create index ixc3 on dt (c3);
 Defined on SQL column(s) not mapped to HBase row key
 Optimizer can leverage secondary indexes automatically
 Secondary indexes maintained automatically
bt1 , c11_c21_c31, c41_c51
bt2 , c12_c22_c32, c42_c52
bt3 , c13_c23_c33, c43_c53
…
key c1 c2 c3 c4 c5
c31_bt1
c32_bt2
c33_bt3
…
key
Data table (dt) Index table (dt_ixc3)
Use
Index ?
Query
c3=c32
create index ixc3 (c3)
YesNo
Full table scan
Index table range scan
start row = c32
stop row = c32++
Data table get
row = bt2

Salting
 Adds prefix to row key
 Helps minimize hot spots in HBase regions
 Common occurrence with sequential row key values (writes often hit same
Region Server, inhibiting parallelism)
 Logical example (prefix routes salted rows to different regions):
 CREATE HBASE TABLE . . . ADD SALT
CREATE HBASE TABLE mysalt1 ( rowkey INT, c0 INT)
COLUMN MAPPING ( KEY MAPPED BY (rowkey), f:q MAPPED BY (c0) ) ADD SALT;

Get started with Big SQL: External resources
 Hadoop Dev: links to videos, white paper, lab, . . . .
https://developer.ibm.com/hadoop/

Big Data: Big SQL and HBase

More Related Content

What's hot

Viewers also liked

Similar to Big Data: Big SQL and HBase

More from Cynthia Saracco

Recently uploaded

Big Data: Big SQL and HBase