Analyzing HBase Data with
Apache Hive
Swarnim Kulkarni, Cerner Corporation
Nick Dimiduk, Hortonworks
Brock Noland, StreamSets
May 7th, 2015
Who are we?
● Nick Dimiduk
o Apache HBase Committer and PMC member
o Co-author of HBase in Action
● Brock Noland
o Apache Hive Committer and PMC member
● Swarnim Kulkarni
o Lead Architect at Cerner Corporation
o Contributor to Apache Hive
Agenda
● Apache Hive Basics
● Hive + HBase - Architecture
● Hive + HBase - Features and Improvements
● Future Work
● Q & A
Apache Hive
● De facto standard for ad-hoc analysis of data in
Hadoop
● SQL-like language called HiveQL for querying of data
● Scalable
o SQL queries translate to M/R jobs
● Extensible
o Plugin custom mappers/reducers
o Custom UDFs/UDAFs
o Custom FileFormats/SerDes
Apache Hive
Hive/HBase Integration
● Brings the best of both worlds together
● Applies Hive's familiar analytical tooling to
online data stored in HBase
● No need for analysts to write M/R jobs to
analyze the data in HBase
● Uses StorageHandler to access data stored
and managed by HBase
Hive/HBase Integration
Improvements and New features
Query HBase Snapshots (HIVE-6584)
● Queries over HBase snapshots on HDFS
instead of online Region Servers
● Specify hive.hbase.snapshot.name instead
of hbase.table.name to query the snapshot
● Under the hood:
o Map tasks embed mini-RS, open snapshot regions
o Snapshot restored to a unique directory under /tmp
o Location override: hive.hbase.snapshot.restoredir
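The input-selection logic described above can be sketched in plain Java. This is a hedged illustration with hypothetical names, not Hive's actual classes: when hive.hbase.snapshot.name is set the job reads restored snapshot files from a unique directory under the restore root, otherwise it scans the live table named by hbase.table.name.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical sketch of the snapshot-vs-table selection described above;
// Hive's real implementation lives inside the HBase storage handler.
public class SnapshotInputSelector {
    public static String chooseInput(Map<String, String> conf) {
        String snapshot = conf.get("hive.hbase.snapshot.name");
        if (snapshot != null) {
            // snapshot is restored into a unique directory under the root,
            // which defaults to /tmp and can be overridden via
            // hive.hbase.snapshot.restoredir
            String root = conf.getOrDefault("hive.hbase.snapshot.restoredir", "/tmp");
            return root + "/" + snapshot + "-" + UUID.randomUUID();
        }
        // no snapshot configured: scan the online table via Region Servers
        return "table:" + conf.get("hbase.table.name");
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("hbase.table.name", "store_sales");
        System.out.println(chooseInput(conf));   // scans the live table
        conf.put("hive.hbase.snapshot.name", "store_sales_snap0");
        System.out.println(chooseInput(conf));   // reads restored snapshot files
    }
}
```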
Query HBase Snapshots (HIVE-6584)
Query without snapshots
hive> CREATE EXTERNAL TABLE store_sales(...) STORED BY
'org.apache.hadoop.hive.hbase.HBaseStorageHandler' ...;
hive> SELECT * FROM store_sales WHERE ss_item_sk > 60010
and ss_ticket_number < 60030;
Query HBase Snapshots (HIVE-6584)
Query with snapshots
hbase(main)> snapshot 'store_sales', 'store_sales_snap0'
hive> SET hive.hbase.snapshot.name=store_sales_snap0;
hive> SELECT * FROM store_sales WHERE ss_item_sk > 60010
and ss_ticket_number < 60030;
HFile support for bulk HBase uploads (HIVE-6473)
● Create HFiles with HBaseStorageHandler
● Set the following properties:
o set hive.hbase.generatehfiles=true;
o set hfile.family.path=/tmp/columnfamily_name;
● hfile.family.path can also be set as a table
property
HFile support for bulk HBase uploads (HIVE-6473)
hive> CREATE EXTERNAL TABLE store_sales(...) STORED BY
'org.apache.hadoop.hive.hbase.HBaseStorageHandler' ...;
hive> SET hive.hbase.generatehfiles=true;
hive> SET hfile.family.path=/tmp/new_store_sales_records/cf;
hive> INSERT OVERWRITE TABLE store_sales SELECT DISTINCT key,
value FROM some_table CLUSTER BY key;
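The CLUSTER BY key in the INSERT above is not incidental: HFiles must be written with row keys in ascending unsigned-byte order, the same ordering HBase itself uses. A minimal plain-Java sketch of that comparison (illustrative, not Hive or HBase code):

```java
import java.util.Arrays;

// Sketch of why the bulk-load INSERT clusters by key: HFile cells must be
// emitted in ascending unsigned lexicographic row-key order.
public class HFileKeyOrder {
    // Unsigned byte-wise comparison, mirroring HBase row-key ordering.
    public static int compareKeys(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = (a[i] & 0xff) - (b[i] & 0xff);
            if (cmp != 0) return cmp;
        }
        return a.length - b.length;   // shorter prefix sorts first
    }

    public static void main(String[] args) {
        byte[][] keys = { "row-10".getBytes(), "row-02".getBytes(), "row-1".getBytes() };
        Arrays.sort(keys, HFileKeyOrder::compareKeys);
        for (byte[] k : keys) System.out.println(new String(k));
        // prints row-02, row-1, row-10
    }
}
```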
Query HBase composite keys (HIVE-2599)
● Support simple and complex
implementations
● Delimiters for delimited composite keys
provided as a part of the DDL
● For complex implementations, custom
implementation of HBaseCompositeKey or
HBaseKeyFactory
Query HBase composite keys (HIVE-2599)
hive> CREATE EXTERNAL TABLE hbase_table_1(key
struct<a:string,b:string,c:string>, value string)
ROW FORMAT DELIMITED
COLLECTION ITEMS TERMINATED BY '~'
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" =
":key,test-family:test-qual")
TBLPROPERTIES ("hbase.table.name" = "SIMPLE_TABLE");
hive> SELECT key.a, key.b, key.c FROM hbase_table_1;
Query HBase composite keys (HIVE-2599)
public class MyCompositeKey extends HBaseCompositeKey {
  /** This is a required constructor */
  public MyCompositeKey(LazySimpleStructObjectInspector oi, Properties tbl, Configuration conf) {
    …
  }
  @Override
  public Object getField(int n) {
    // override this to return the field at index "n" in the key
  }
}
# Provide this class in the DDL
CREATE EXTERNAL TABLE MyTable(......) TBLPROPERTIES(.., "hbase.composite.key.class" = "MyCompositeKey");
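The getField(n) contract above can be sketched for a delimited composite key in self-contained Java. This is a hypothetical illustration outside Hive (no Hive classes); it only shows what a delimited implementation typically returns for each index:

```java
// Plain-Java sketch of getField(n) for a delimited composite key such as
// "a~b~c" with delimiter '~' (illustrative, not Hive's HBaseCompositeKey).
public class DelimitedCompositeKey {
    private final String[] fields;

    public DelimitedCompositeKey(String rawKey, String delimiter) {
        // split the raw HBase row key on the DDL-declared delimiter
        this.fields = rawKey.split(java.util.regex.Pattern.quote(delimiter), -1);
    }

    // Mirrors the getField(int n) contract: the key component at index n.
    public Object getField(int n) {
        return fields[n];
    }

    public static void main(String[] args) {
        DelimitedCompositeKey key = new DelimitedCompositeKey("a~b~c", "~");
        System.out.println(key.getField(0));   // a
        System.out.println(key.getField(2));   // c
    }
}
```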
Query HBase composite keys (HIVE-2599)
public interface HBaseKeyFactory extends HiveStoragePredicateHandler {
  /** Initialize factory with properties */
  void init(HBaseSerDeParameters hbaseParam, Properties properties) throws SerDeException;
  /** Create custom object inspector for the HBase key */
  ObjectInspector createKeyObjectInspector(TypeInfo type) throws SerDeException;
  /** Create custom object for the HBase key */
  LazyObjectBase createKey(ObjectInspector inspector) throws SerDeException;
  /** Serialize a Hive object into the internal format of the custom key */
  byte[] serializeKey(Object object, StructField field) throws IOException;
}
# Provide the implementation in the DDL
CREATE EXTERNAL TABLE MyTable(......) TBLPROPERTIES(.., "hbase.composite.key.factory" = "MyCompositeKeyFactory");
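The serialization direction of the factory interface (serializeKey) can also be sketched for the delimited case. Again a hedged, self-contained illustration with hypothetical names, not Hive's API: struct fields are joined back into the raw row-key bytes that HBase stores.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

// Illustrative counterpart of serializeKey() for a delimited composite key:
// Hive struct fields -> raw HBase row-key bytes (hypothetical helper).
public class CompositeKeySerializer {
    private final String delimiter;

    public CompositeKeySerializer(String delimiter) {
        this.delimiter = delimiter;
    }

    // Join the struct's string fields with the DDL-declared delimiter.
    public byte[] serialize(List<String> fields) {
        return String.join(delimiter, fields).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        CompositeKeySerializer ser = new CompositeKeySerializer("~");
        byte[] key = ser.serialize(List.of("a", "b", "c"));
        System.out.println(new String(key, StandardCharsets.UTF_8));   // a~b~c
    }
}
```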
Query HBase timestamps (HIVE-2828)
● First-class support to query HBase
timestamps
● Use the special :timestamp mapping to pull
back the cell timestamps
● Specified as part of
hbase.columns.mapping
Query HBase timestamps (HIVE-2828)
hive> CREATE TABLE hbase_table (key string, value
string, time timestamp)
STORED BY
'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" =
":key,cf:string,:timestamp");
hive> SELECT key, value, cast(time as timestamp)
FROM hbase_table WHERE key > 100 AND key < 400 AND
time < 200000000000;
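Conceptually, the :timestamp mapping exposes each cell's HBase timestamp as a queryable column, so the WHERE clause on `time` above is a filter on those timestamps. A plain-Java sketch of that filtering (illustrative only, not Hive internals):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch of the predicate `time < cutoff` from the HiveQL example above,
// applied to per-row HBase cell timestamps (hypothetical data shape).
public class TimestampFilter {
    // rowKey -> cell timestamp in millis, as HBase stores per cell
    public static Map<String, Long> before(Map<String, Long> cells, long cutoff) {
        return cells.entrySet().stream()
                .filter(e -> e.getValue() < cutoff)
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue,
                        (a, b) -> a, LinkedHashMap::new));
    }

    public static void main(String[] args) {
        Map<String, Long> cells = new LinkedHashMap<>();
        cells.put("row1", 100_000_000_000L);
        cells.put("row2", 300_000_000_000L);
        System.out.println(before(cells, 200_000_000_000L).keySet());   // [row1]
    }
}
```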
Additional Improvements
● Support to query Avro structs stored in
HBase (HIVE-6147) - no serialization
capability yet (HIVE-8020)
● Support for pulling HBase columns with
wildcards (HIVE-3725)
● Multiple bug fixes and performance
enhancements
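The wildcard-column support mentioned above (HIVE-3725) amounts to matching many HBase qualifiers in a family against a pattern instead of naming each one. A hedged plain-Java sketch of that idea, with an illustrative pattern (not the exact mapping syntax):

```java
import java.util.List;
import java.util.stream.Collectors;

// Illustrative sketch of wildcard column pulling: select every qualifier
// in a family whose name matches a pattern (e.g. a "tag_" prefix).
public class WildcardColumns {
    public static List<String> match(List<String> qualifiers, String regex) {
        return qualifiers.stream()
                .filter(q -> q.matches(regex))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> quals = List.of("tag_one", "tag_two", "other");
        System.out.println(match(quals, "tag_.*"));   // [tag_one, tag_two]
    }
}
```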
Coming to a Hive Release Near You!
● Query HBase Snapshots, 0.14.0
● HFile support for bulk HBase uploads, 0.14.0
● Query HBase composite keys, 0.13.0
● Query HBase timestamps, 1.1.0
● Support for pulling HBase columns with
wildcards, 0.12.0
Future Work
● Tighter integration with Phoenix
● Stronger support for salted HBase keys (HIVE-7128)
● Support for HBase DataType API (HIVE-6150)
● Improved HBase bulk load facility (HIVE-4765)

HBaseCon 2015: Analyzing HBase Data with Apache Hive
