HBaseCon 2015: Analyzing HBase Data with Apache Hive

  1. Analyzing HBase Data with Apache Hive. Swarnim Kulkarni, Cerner Corporation; Nick Dimiduk, Hortonworks; Brock Noland, StreamSets. May 7th, 2015
  2. Who are we? ● Nick Dimiduk o Apache HBase Committer and PMC member o Co-author of HBase in Action ● Brock Noland o Apache Hive Committer and PMC member ● Swarnim Kulkarni o Lead Architect at Cerner Corporation o Contributor to Apache Hive
  3. Agenda ● Apache Hive Basics ● Hive + HBase - Architecture ● Hive + HBase - Features and Improvements ● Future Work ● Q & A
  4. Apache Hive ● De facto standard for ad-hoc analysis of data in Hadoop ● SQL-like language called HiveQL for querying data ● Scalable o SQL queries translate to M/R jobs ● Extensible o Plug in custom mappers/reducers o Custom UDFs/UDAFs o Custom FileFormats/SerDes
  5. Apache Hive
  6. Hive/HBase Integration ● Brings the best of both worlds together ● Extends the familiar analytical tooling of Hive to online data stored in HBase ● No need for analysts to write M/R jobs to analyze the data in HBase ● Uses a StorageHandler to access data stored and managed by HBase
  7. Hive/HBase Integration
  8. Improvements and New features
  9. Query HBase Snapshots (HIVE-6584) ● Queries over HBase snapshots on HDFS instead of online Region Servers ● Specify hive.hbase.snapshot.name instead of hbase.table.name to query the snapshot ● Under the hood: o Map tasks embed mini-RS, open snapshot regions o Snapshot restored to a unique directory under /tmp o Location override: hive.hbase.snapshot.restoredir
  10. Query HBase Snapshots (HIVE-6584): query without snapshots
      hive> CREATE EXTERNAL TABLE store_sales(...)
            STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' ...;
      hive> SELECT * FROM store_sales
            WHERE ss_item_sk > 60010 and ss_ticket_number < 60030;
  11. Query HBase Snapshots (HIVE-6584): query with snapshots
      hbase(main)> snapshot 'store_sales', 'store_sales_snap0'
      hive> SET hive.hbase.snapshot.name=store_sales_snap0;
      hive> SELECT * FROM store_sales
            WHERE ss_item_sk > 60010 and ss_ticket_number < 60030;
  12. HFile support for bulk HBase uploads (HIVE-6473) ● Create HFiles with HBaseStorageHandler ● Set the following properties: o set hive.hbase.generatehfiles=true; o set hfile.family.path=/tmp/columnfamily_name; ● hfile.family.path can also be set as a table property
  13. HFile support for bulk HBase uploads (HIVE-6473)
      hive> CREATE EXTERNAL TABLE store_sales(...)
            STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' ...;
      hive> SET hive.hbase.generatehfiles=true;
      hive> SET hfile.family.path=/tmp/new_store_sales_records/cf;
      hive> INSERT OVERWRITE TABLE store_sales
            SELECT DISTINCT key, value FROM some_table CLUSTER BY key;
  14. Query HBase composite keys (HIVE-2599) ● Supports both simple (delimited) and complex key implementations ● Delimiters for delimited composite keys are provided as part of the DDL ● For complex keys, provide a custom implementation of HBaseCompositeKey or HBaseKeyFactory
  15. Query HBase composite keys (HIVE-2599)
      hive> CREATE EXTERNAL TABLE hbase_table_1(key struct<a:string,b:string,c:string>, value string)
            ROW FORMAT DELIMITED COLLECTION ITEMS TERMINATED BY '~'
            STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
            WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,test-family:test-qual")
            TBLPROPERTIES ("hbase.table.name" = "SIMPLE_TABLE");
      hive> select key.a,key.b,key.c from hbase_table_1;
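To make the delimited case concrete, the following is a minimal sketch of how a row key such as `cust42~2015-05-07~order9` could be split on the `'~'` delimiter into the three fields `a`, `b`, and `c` of the struct declared above. The class `DelimitedKeySketch`, its `splitKey` method, and the sample key are hypothetical and written only for illustration; this is not Hive's actual SerDe code, which works on byte arrays and lazy objects.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of Hive): splits a delimited row key into fields. */
final class DelimitedKeySketch {
    private DelimitedKeySketch() {}

    /** Splits rowKey on every occurrence of delimiter, keeping empty fields. */
    static List<String> splitKey(String rowKey, char delimiter) {
        List<String> fields = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < rowKey.length(); i++) {
            if (rowKey.charAt(i) == delimiter) {
                fields.add(rowKey.substring(start, i));
                start = i + 1;
            }
        }
        // The last field has no trailing delimiter.
        fields.add(rowKey.substring(start));
        return fields;
    }

    public static void main(String[] args) {
        // Sample key is an assumption; fields map to key.a, key.b, key.c.
        List<String> parts = splitKey("cust42~2015-05-07~order9", '~');
        System.out.println(parts);
    }
}
```

With a key laid out this way, `select key.a, key.b, key.c` would correspond to the first, second, and third element of the split result.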
  16. Query HBase composite keys (HIVE-2599)
      public class MyCompositeKey extends HBaseCompositeKey {
        /** This is a required constructor **/
        public MyCompositeKey(LazySimpleStructObjectInspector oi, Properties tbl, Configuration conf) { … }

        @Override
        public Object getField(int n) {
          // override this to return the field at index "n" in the key
        }
      }
      # Provide this class in the DDL
      CREATE EXTERNAL TABLE MyTable(......) TBLPROPERTIES(..,hbase.composite.key.class=MyCompositeKey);
  17. Query HBase composite keys (HIVE-2599)
      public interface HBaseKeyFactory extends HiveStoragePredicateHandler {
        /** Initialize factory with properties */
        void init(HBaseSerDeParameters hbaseParam, Properties properties) throws SerDeException;

        /** Create custom object inspector for hbase key */
        ObjectInspector createKeyObjectInspector(TypeInfo type) throws SerDeException;

        /** Create custom object for hbase key */
        LazyObjectBase createKey(ObjectInspector inspector) throws SerDeException;

        /** Serialize hive object in internal format of custom key */
        byte[] serializeKey(Object object, StructField field) throws IOException;
      }
      # Provide the implementation in the DDL
      CREATE EXTERNAL TABLE MyTable(......) TBLPROPERTIES(..,hbase.composite.key.factory=MyCompositeKeyFactory);
  18. Query HBase timestamps (HIVE-2828) ● First-class support for querying HBase timestamps ● Use the special :timestamp entry to expose the cell timestamp as a column ● Specified as part of the hbase.columns.mapping property
  19. Query HBase timestamps (HIVE-2828)
      hive> CREATE TABLE hbase_table (key string, value string, time timestamp)
            STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
            WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:string,:timestamp");
      hive> SELECT key, value, cast(time as timestamp) FROM hbase_table
            WHERE key > 100 AND key < 400 AND time < 200000000000;
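The mapping string in the DDL above pairs each Hive column, in order, with either the row key (`:key`), the cell timestamp (`:timestamp`), or a `family:qualifier` column. The sketch below illustrates that reading of the syntax. `ColumnsMappingSketch`, `Entry`, and `Kind` are hypothetical names introduced for this example; the code is not Hive's parser and deliberately ignores other features of the real mapping syntax, such as storage-type suffixes.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration (not Hive's parser) of the hbase.columns.mapping syntax. */
final class ColumnsMappingSketch {
    enum Kind { ROW_KEY, TIMESTAMP, COLUMN }

    /** One mapping entry; family and qualifier are null for :key and :timestamp. */
    static final class Entry {
        final Kind kind;
        final String family;
        final String qualifier;

        Entry(Kind kind, String family, String qualifier) {
            this.kind = kind;
            this.family = family;
            this.qualifier = qualifier;
        }
    }

    /** Parses a comma-separated mapping; entry i describes the i-th Hive column. */
    static List<Entry> parse(String mapping) {
        List<Entry> entries = new ArrayList<>();
        for (String raw : mapping.split(",")) {
            String spec = raw.trim();
            if (spec.equals(":key")) {
                entries.add(new Entry(Kind.ROW_KEY, null, null));
            } else if (spec.equals(":timestamp")) {
                entries.add(new Entry(Kind.TIMESTAMP, null, null));
            } else {
                int colon = spec.indexOf(':');
                if (colon <= 0) {
                    throw new IllegalArgumentException("expected family:qualifier, got '" + spec + "'");
                }
                entries.add(new Entry(Kind.COLUMN, spec.substring(0, colon), spec.substring(colon + 1)));
            }
        }
        return entries;
    }

    public static void main(String[] args) {
        // Same mapping string as in the DDL above.
        for (Entry e : parse(":key,cf:string,:timestamp")) {
            System.out.println(e.kind + " " + e.family + " " + e.qualifier);
        }
    }
}
```

Read this way, the three Hive columns `key`, `value`, and `time` line up with the row key, the `cf:string` cell, and the cell timestamp respectively.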
  20. Additional Improvements ● Support for querying Avro structs stored in HBase (HIVE-6147); no serialization capability yet (HIVE-8020) ● Support for pulling HBase columns with wildcards (HIVE-3725) ● Multiple bug fixes and performance enhancements
  21. Coming to a Hive Release Near You! ● Query HBase Snapshots, 0.14.0 ● HFile support for bulk HBase uploads, 0.14.0 ● Query HBase composite keys, 0.13.0 ● Query HBase timestamps, 1.1.0 ● Support for pulling HBase columns with wildcards, 0.12.0
  22. Future Work ● Tighter integration with Phoenix ● Stronger support for salted HBase keys (HIVE-7128) ● Support for HBase DataType API (HIVE-6150) ● Improved HBase bulk load facility (HIVE-4765)