HAWQ HCatalog Integration
Shivram Mani (HAWQ-UD)
What is HCatalog?
● Table and storage management layer for Hadoop
● Relational view of data in the Hadoop distributed file system
● Abstracts data format/location from the user
● Built on top of the Hive metastore and incorporates Hive's DDL
HAWQ->(PXF)->HIVE (before)
CREATE EXTERNAL TABLE zoo_ext (
    id double, animal string, age int
)
LOCATION ('pxf://pivotal:50070//zoo?PROFILE=hive')
FORMAT 'CUSTOM' (formatter='pxfwritable_import');
SELECT * FROM zoo_ext;
[Diagram: HAWQ and Hive connected through three PXF instances; the external table definition lives in HAWQ's on-disk heap catalog (pg_exttable, pg_class, ...); steps numbered 1 to 3.]
Motivation
Problem:
● The user has to know the Hive table schema
● Entering schema/location/format information is error-prone
● A (static) external table is not aware of (dynamic) Hive table metadata changes
Solution:
● Native integration of HCatalog into HAWQ
○ The HAWQ catalog manages HAWQ tables
○ HCatalog manages external (Hive, Pig, HBase) tables
HAWQ->(HCat)->(PXF)->HIVE (NOW)
Hive table zoo:
    id double
    animal string
    age int
With an external table:
CREATE EXTERNAL TABLE zoo_ext (
    id double, animal string, age int
)
LOCATION ('pxf://pivotal:50070//zoo?PROFILE=hive')
FORMAT 'CUSTOM' (formatter='pxfwritable_import');
SELECT * FROM zoo_ext;
Directly through HCatalog:
SELECT * FROM hcatalog.default.zoo;
*** hcatalog.[HIVE_DB_NAME].[HIVE_TABLE_NAME]
[Diagram: HAWQ and Hive connected through HCAT and three PXF instances; HAWQ holds an on-disk heap catalog (pg_exttable, pg_class, ...) and an in-memory catalog (pg_exttable, pg_class, ...); steps numbered 1 to 3.]
HAWQ->(HCat)->(PXF)->HIVE (NOW)
1. Retrieve metadata from HCatalog
a. HAWQ->(REST)->proxy->(Thrift)->HCatalog
2. Parse metadata into in-memory-only catalog tables
a. Extended caql and syscache
3. Query with PXF (unchanged)
SELECT * FROM hcatalog.default.zoo;
*** hcatalog.[HIVE_DB_NAME].[HIVE_TABLE_NAME]
*** The grammar was also modified to recognize hcatalog queries
[Diagram: HAWQ and Hive connected through HCAT and three PXF instances; HAWQ holds an on-disk heap catalog (pg_exttable, pg_class, ...) and an in-memory catalog (pg_exttable, pg_class, ...); steps numbered 1 to 3.]
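The hcatalog.[HIVE_DB_NAME].[HIVE_TABLE_NAME] naming rule can be illustrated with a small Python sketch. This is not HAWQ's parser (HAWQ changes its SQL grammar for this); the names HiveTableRef and parse_hcatalog_name are hypothetical, and quoted identifiers are deliberately not handled.

```python
from dataclasses import dataclass
from typing import Optional

# Reserved first name part, as shown on the slide.
HCATALOG_PREFIX = "hcatalog"


@dataclass(frozen=True)
class HiveTableRef:
    """Hypothetical holder for the Hive database and table behind a name."""
    database: str
    table: str


def parse_hcatalog_name(qualified_name: str) -> Optional[HiveTableRef]:
    """Return a HiveTableRef for hcatalog.<db>.<table>, else None.

    Names that do not start with the hcatalog prefix (for example a
    regular HAWQ table or external table) are left to the normal catalog.
    """
    parts = qualified_name.split(".")
    if len(parts) != 3 or parts[0].lower() != HCATALOG_PREFIX:
        return None
    _, database, table = parts
    if not database or not table:
        raise ValueError(f"incomplete HCatalog name: {qualified_name!r}")
    return HiveTableRef(database=database, table=table)
```

For example, parse_hcatalog_name("hcatalog.default.zoo") yields the Hive database "default" and table "zoo", while parse_hcatalog_name("zoo_ext") returns None, so the external table is still resolved through the regular HAWQ catalog.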
Under the hood
Per session:
● Check whether enough OIDs are left for HCatalog (only when accessing HCatalog)
Per transaction:
● Get the metadata of the tables of interest on initial access
● Create in-memory catalog tables from the metadata
○ pg_namespace, pg_class, pg_exttable, pg_type, pg_attribute, gp_distribution_policy
● A table referenced multiple times uses the in-memory metadata, minimizing external calls to HCatalog
● Table metadata is dropped at the end of the transaction
● In-memory catalog table constraints are enforced (e.g. unique oid, namespace-relation name)
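The per-transaction behaviour above (fetch on first access, reuse within the transaction, drop at the end) can be sketched in Python. This is a simplified model, not HAWQ's implementation, which works on real catalog tables in C; TransactionCatalog, fake_hcatalog and the column list are hypothetical stand-ins for the HCatalog call.

```python
from typing import Callable, Dict, List, Tuple

Column = Tuple[str, str]  # (column name, Hive type name)
Fetcher = Callable[[str, str], List[Column]]


class TransactionCatalog:
    """In-memory table metadata that lives for one transaction only."""

    def __init__(self, fetch_metadata: Fetcher) -> None:
        self._fetch_metadata = fetch_metadata
        self._tables: Dict[Tuple[str, str], List[Column]] = {}
        self.external_calls = 0  # number of calls made to the fetcher

    def lookup(self, database: str, table: str) -> List[Column]:
        """Fetch metadata on first access; reuse it for later references."""
        key = (database, table)
        if key not in self._tables:
            self.external_calls += 1
            self._tables[key] = self._fetch_metadata(database, table)
        return self._tables[key]

    def end_transaction(self) -> None:
        """Drop all in-memory metadata, as happens at the end of a transaction."""
        self._tables.clear()


def fake_hcatalog(database: str, table: str) -> List[Column]:
    """Stand-in for the external metadata call; returns the zoo schema."""
    return [("id", "double"), ("animal", "string"), ("age", "int")]
```

Two lookups of ("default", "zoo") inside one transaction trigger a single external call; after end_transaction() the next lookup fetches again, which is how a later transaction would pick up a changed Hive schema.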
DEMO
In Hive:
describe demo.hive_types;
In HAWQ:
SELECT * FROM hcatalog.demo.hive_types;
SELECT s1, f FROM hcatalog.demo.hive_types
WHERE vc1 = 'abcde' AND dc1 > 1.0;
