
hive HBase Metastore - Improving Hive with a Big Data Metadata Storage


  1. Hive HBase Metastore - Improving Hive with a Big Data Metadata Storage. Daniel Dai, Vaibhav Gumashta, Hortonworks. Hadoop Summit, San Jose, June 2016
  2. © Hortonworks Inc. 2011 – 2016. All Rights Reserved. Agenda: Motivation, System Design, Caching Strategy, Transaction Management, Deployment, Experimental Results, Future Work
  3. What is the Hive Metastore?  Stores metadata about the data – Database – Table – Partition – Privilege – Role – Permanent UDF – Statistics – Locks – Transactions – etc.  Two modes – Thrift Server – Embedded  Backend – RDBMS: Derby, MSSQL, MySQL, Oracle, Postgres
  4. Low Latency in Hive  Hadoop is not just for large jobs – Most jobs are small jobs – Users want to run both small and large jobs in one system  What's trending in Hive – Low latency – Stinger (Tez + ORC + Vectorization) • Brings queries down to 5-10s – LLAP • Sub-second queries (TPC-DS query 27)
  5. New Bottleneck - the Metastore  Planning time is non-negligible  Within planning, a significant amount of time is spent on metadata fetching
  6. Besides Latency  Significantly more scale – More metadata – millions of partitions – New large-scale metadata – split information, ORC row group statistics – More calls – orders of magnitude more calls, coming from tasks  Reduce complexity – Object-relational mapping is an impedance mismatch – DataNucleus – DBCP, BoneCP, or HikariCP?
  7. ER Diagram for the ObjectStore Database  [ER diagram: the full RDBMS schema, roughly 40 interlinked tables including DBS, TBLS, PARTITIONS, SDS, SERDES, CDS, COLUMNS_V2, PARTITION_KEYS, PARTITION_KEY_VALS, the *_PARAMS and *_PRIVS tables, ROLES, ROLE_MAP, TAB_COL_STATS, PART_COL_STATS, the SKEWED_* tables, IDXS, FUNCS, FUNC_RU, MASTER_KEYS, DELEGATION_TOKENS, SEQUENCE_TABLE, and VERSION]
  8. How About Improving the ObjectStore?  Already happening! – Using direct SQL instead of O-R mapping  But – Maintenance nightmare! – Must handle syntax differences between databases  Re-engineering effort may not pay off  Ultimate barrier: scalability

String queryText = "select \"PARTITIONS\".\"PART_ID\", \"SDS\".\"SD_ID\", \"SDS\".\"CD_ID\","
    + " \"SERDES\".\"SERDE_ID\", \"PARTITIONS\".\"CREATE_TIME\","
    + " \"PARTITIONS\".\"LAST_ACCESS_TIME\", \"SDS\".\"INPUT_FORMAT\", \"SDS\".\"IS_COMPRESSED\","
    + " \"SDS\".\"IS_STOREDASSUBDIRECTORIES\", \"SDS\".\"LOCATION\", \"SDS\".\"NUM_BUCKETS\","
    + " \"SDS\".\"OUTPUT_FORMAT\", \"SERDES\".\"NAME\", \"SERDES\".\"SLIB\" "
    + "from \"PARTITIONS\""
    + " left outer join \"SDS\" on \"PARTITIONS\".\"SD_ID\" = \"SDS\".\"SD_ID\" "
    + " left outer join \"SERDES\" on \"SDS\".\"SERDE_ID\" = \"SERDES\".\"SERDE_ID\" "
    + "where \"PART_ID\" in (" + partIds + ") order by \"PART_NAME\" asc";
  9. Agenda: Motivation, System Design, Caching Strategy, Transaction Management, Deployment, Experimental Results, Future Work
  10. System Architecture  HiveMetaStore Thrift Client -> HiveMetaStore Thrift Server -> ObjectStore / HBaseStore -> RDBMS / HBase (via Omid) • Two implementations of the RawStore interface: HBaseStore and ObjectStore • Both backends will live together for a while • HBaseStore: most traffic goes through the transaction layer (Omid); some traffic bypasses it – volatile data – high possibility of conflict
  11. RDBMS Schema  Read/Write path: • Thrift Client creates Thrift objects for RPC (based on specs in metastore/if/hive_metastore.thrift) • Thrift Server extracts values from the Thrift objects and creates corresponding ORM model objects • ORM opens a transaction on the RDBMS and writes/reads values to/from various tables, using appropriate foreign-key references • An RDBMS fastpath is enabled by skipping the ORM and writing direct SQL; however, this complicates the testing matrix, as SQL semantics vary slightly between RDBMS products  Example: adding a new partition: add_partition(Partition new_part)
  12. RDBMS Schema (continued)  The same read/write path as on slide 11, shown with the Thrift Partition struct and the tables touched when adding a partition:

struct Partition {
  1: list<string> values
  2: string dbName,
  3: string tableName,
  4: i32 createTime,
  5: i32 lastAccessTime,
  6: StorageDescriptor sd,
  7: map<string, string> parameters,
  8: optional PrincipalPrivilegeSet privileges
}

Tables touched: TBLS, TBL_PRIVS, TBL_COL_PRIVS, PART_PRIVS, SDS, CDS, SORT_ORDER, SERDES, TYPE_FIELDS, PARTITIONS, PARTITION_KEY_VALS, PARTITION_PARAMS, BUCKETING_COLS, SORT_COLS, SD_PARAMS, SKEWED_COL_NAMES, SKEWED_VALUES, TABLE_PARAMS
  13. HBase Schema

HBMS_DBS – key: bytes(dbName) – cf_catalog {"c"} – "c": Database proto
HBMS_SDS – key: bytes(md5(SD proto)) – cf_catalog {"c", "ref"} – "c": StorageDescriptor proto; "ref": reference count
HBMS_TBLS – key: bytes(dbName, tblName) – cf_catalog {"c"}, cf_stats {"s" -> c1, … cn} – "c": Table proto; "s": stats per column in the table
HBMS_PARTITIONS – key: bytes(dbName, tblName, partVal1, ..., partValn) – cf_catalog {"c"}, cf_stats {"s" -> c1, … cn} – "c": Partition proto; "s": stats per column in the partition
HBMS_AGGR_STATS – key: bytes(md5(dbName, tblName, partVal1, ..., partValn, colName)) – cf_catalog {"s", "b"} – "b": AggrStatsBloomFilter proto; "s": AggrStats proto
HBMS_FUNCS – key: bytes(dbName, funcName) – cf_catalog {"c"} – "c": Function proto
HBMS_FILE_METADATA – key: bytes(fileId) – cf_catalog {"c"}, cf_stats {"s"} – "c": metadata footer proto; "s": PPD stats
  14. HBase Schema (continued)

HBMS_GLOBAL_PRIVS – key: bytes("gp") – cf_catalog {"c"} – "c": serialized PrincipalPrivilegeSet proto
HBMS_ROLES – key: bytes(roleName) – cf_catalog {"roles"} – "roles": serialized Role proto
HBMS_USER_TO_ROLE – key: bytes(userName) – cf_catalog {"c"} – "c": serialized RoleList proto
HBMS_SECURITY – key: bytes(delTokenId) – cf_catalog {"dt", "mk"} – "dt": delegation token; "mk": master keys
HBMS_SEQUENCES – key: bytes(sequence) – cf_catalog {"c"} – "c": sequences
  15. De-normalization • Goal: optimized for querying • May be slower for DDL • Example: drop_role(String roleName)

HBMS_USER_TO_ROLE:
bytes("User 1") -> Proto(Role 1, Role 2, Role 3, Role 5)
bytes("User 2") -> Proto(Role 1, Role 2)
bytes("User 3") -> Proto(Role 4, Role 5)
bytes("User 4") -> Proto(Role 2, Role 3)

• Need to scan & deserialize everything in order to drop a role
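The cost trade-off above can be sketched in a few lines of Python. This is an illustrative model (not Hive code): HBMS_USER_TO_ROLE is modeled as a plain dict, and dropping a role has to visit and rewrite every row, while looking up a single user's roles stays a single-row get.

```python
# Hypothetical sketch of the de-normalized HBMS_USER_TO_ROLE layout.
# Each row stores the full role list per user, so removing one role
# means scanning, deserializing, and rewriting every row that has it.

def drop_role(user_to_role, role):
    """Remove `role` from every user's role list (full-table scan)."""
    rows_rewritten = 0
    for user, roles in user_to_role.items():   # scan every row
        if role in roles:                      # deserialize + check
            user_to_role[user] = [r for r in roles if r != role]
            rows_rewritten += 1
    return rows_rewritten

user_to_role = {
    "User 1": ["Role 1", "Role 2", "Role 3", "Role 5"],
    "User 2": ["Role 1", "Role 2"],
    "User 3": ["Role 4", "Role 5"],
    "User 4": ["Role 2", "Role 3"],
}
# DDL pays the scan cost...
rewritten = drop_role(user_to_role, "Role 2")
# ...so that the common query stays a single-row lookup:
roles_of_user_3 = user_to_role["User 3"]
```

The query path (`user_to_role["User 3"]`) is one HBase get; the DDL path touches every row, which is exactly the "may be slower for DDL" trade-off the slide describes.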
  16. Partition Keys  Range scan for most queries – where date = '201601' and state = 'CA' – where date >= '201602' and date < '201604'  Server-side filter for the rest – where state = 'CA' (not a key prefix) – where date like '2016%' (regex) – where date > '201601' and state > 'OR' (cannot be a single range scan) – scan all keys, but don't deserialize values

Example partitions (date, state): (201601, CA), (201601, WA), (201602, CA), (201603, CA), (201605, CA)
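Why only key-prefix predicates become range scans can be shown with a small sketch. This is an assumption-laden toy (a null-terminated string encoding, not Hive's BinarySortableSerDe, and not the actual HBaseStore code): an equality predicate on a leading prefix of the key columns maps directly to a (startRow, stopRow) pair, while a predicate on a non-prefix column like `state` alone gives no usable bounds and forces a full scan with a server-side filter.

```python
# Illustrative sketch: deriving HBase-style scan bounds from an equality
# predicate on a leading prefix of the partition-key columns.

def scan_bounds(prefix_values):
    """startRow (inclusive) / stopRow (exclusive) for a key-prefix match."""
    start = b"".join(v.encode() + b"\x00" for v in prefix_values)
    # stopRow: bump the final terminator byte to get an exclusive bound
    stop = start[:-1] + bytes([start[-1] + 1])
    return start, stop

# WHERE date = '201601' AND state = 'CA' -> a tight range scan
start, stop = scan_bounds(["201601", "CA"])

# WHERE date = '201601' -> a wider range scan covering all states
d_start, d_stop = scan_bounds(["201601"])

key_ca = b"201601\x00CA\x00"   # encoded (201601, CA)
key_wa = b"201601\x00WA\x00"   # encoded (201601, WA)
```

A predicate only on `state` has no leading-prefix bytes to bound, which is why the slide lists it under server-side filtering.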
  17. Typed Partition Keys  Binary sorted – HBase range scan: Scan(byte[] startRow, byte[] stopRow) – where key1 >= 'A5' and key2 >= 8 • startRow: 41 35 00 00 00 00 08  Using BinarySortableSerDe – Supports all Hive data types – Handles null (String, Integer)

Key -> Bytes:
'A10', 3  -> 41 31 30 00 00 00 00 03
'A10', 10 -> 41 31 30 00 00 00 00 0A
'A5', 4   -> 41 35 00 00 00 00 04
'A5', 15  -> 41 35 00 00 00 00 0F
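The property the table illustrates can be reproduced with a toy encoder. This is a simplification of BinarySortableSerDe (the real SerDe also handles nulls, negative numbers via sign-bit flips, and all Hive types; this sketch assumes ASCII strings and non-negative ints): null-terminate the string and append the integer big-endian, so that byte order equals tuple order.

```python
# Toy binary-sortable encoding for a (string, int) partition key,
# matching the slide's examples: 'A10',3 -> 41 31 30 00 00 00 00 03 etc.

def encode_key(s, n):
    return s.encode() + b"\x00" + n.to_bytes(4, "big")

keys = [("A10", 3), ("A10", 10), ("A5", 4), ("A5", 15)]
encoded = [encode_key(s, n) for s, n in keys]

# Byte-wise sorted order equals the logical (string, int) order,
# which is what lets HBase answer predicates with Scan(startRow, stopRow).
assert encoded == sorted(encoded)
```

Note that 'A10' sorts before 'A5' both as strings and as bytes (0x31 < 0x35), so no padding is needed; the null terminator keeps short strings from colliding with longer ones.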
  18. Storage Descriptor De-duplication  (HBase schema table repeated from slide 13; the relevant row is HBMS_SDS – key: bytes(md5(SD proto)) – cf_catalog {"c", "ref"} – "c": StorageDescriptor proto; "ref": reference count)
  19. Storage Descriptor De-duplication (continued)  In the Thrift API, every Partition embeds a full StorageDescriptor:

struct Partition {
  1: list<string> values
  2: string dbName,
  3: string tableName,
  4: i32 createTime,
  5: i32 lastAccessTime,
  6: StorageDescriptor sd,
  7: map<string, string> parameters,
  8: optional PrincipalPrivilegeSet privileges
}

struct StorageDescriptor {
  1: list<FieldSchema> cols,
  2: string location,
  3: string inputFormat,
  4: string outputFormat,
  5: bool compressed,
  6: i32 numBuckets,
  7: SerDeInfo serdeInfo,
  8: list<string> bucketCols,
  9: list<Order> sortCols,
  10: map<string, string> parameters,
  11: optional SkewedInfo skewedInfo,
  12: optional bool storedAsSubDirectories
}
  20. Storage Descriptor De-duplication (continued)  In the HBase layout, the Partition proto stores only a hash of its StorageDescriptor (sd_hash), which keys the shared row in HBMS_SDS:

message Partition {
  optional int64 create_time = 1;
  optional int64 last_access_time = 2;
  optional string location = 3;
  optional Parameters sd_parameters = 4;
  required bytes sd_hash = 5;
  optional Parameters parameters = 6;
}

message StorageDescriptor {
  message Order { … }
  message SerDeInfo { … }
  message SkewedInfo { … }
  repeated FieldSchema cols = 1;
  optional string input_format = 2;
  optional string output_format = 3;
  optional bool is_compressed = 4;
  optional sint32 num_buckets = 5;
  optional SerDeInfo serde_info = 6;
  repeated string bucket_cols = 7;
  repeated Order sort_cols = 8;
  optional SkewedInfo skewed_info = 9;
  optional bool stored_as_sub_directories = 10;
}
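The de-duplication scheme is content addressing with reference counting, which a short sketch makes concrete. This is not Hive's implementation; plain bytes stand in for the serialized StorageDescriptor proto, and the dict stands in for the HBMS_SDS table.

```python
import hashlib

# Sketch of the HBMS_SDS idea: storage descriptors are stored once,
# keyed by the md5 of their serialized form, with a reference count.
# Tables and partitions store only the hash (sd_hash).

class SdStore:
    def __init__(self):
        self.rows = {}  # sd_hash -> [sd_bytes, refcount]

    def put(self, sd_bytes):
        h = hashlib.md5(sd_bytes).digest()
        row = self.rows.setdefault(h, [sd_bytes, 0])
        row[1] += 1                      # bump the "ref" column
        return h                         # caller stores this as sd_hash

    def unref(self, h):
        row = self.rows[h]
        row[1] -= 1
        if row[1] == 0:                  # last reference dropped
            del self.rows[h]

store = SdStore()
sd = b"serialized-storage-descriptor-bytes"  # hypothetical SD payload
h1 = store.put(sd)   # the table's SD
h2 = store.put(sd)   # an identical partition SD: de-duplicated to one row
```

Since most partitions of a table share an identical SD (same formats, serde, bucketing), a table with millions of partitions can collapse to a handful of HBMS_SDS rows.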
  21. HBase Schema  Read/Write path: • Thrift Client creates Thrift objects for RPC (based on specs in metastore/if/hive_metastore.thrift) • Thrift Server passes the Thrift objects to the HBase client open in the Thrift server • The HBase client extracts fields from the Thrift objects and converts them to the corresponding protobuf objects (metastore/src/protobuf/org/apache/hadoop/hive/metastore/hbase/hbase_metastore_proto.proto) • Writes/reads the protobuf payloads to/from the HBase tables  Example: adding a new partition: add_partition(Partition new_part)
  22. HBase Schema (continued)  The same path, shown with the Thrift struct, the protobuf message it is converted to, and the HBase tables written (HBMS_PARTITIONS and HBMS_SDS):

struct Partition {
  1: list<string> values
  2: string dbName,
  3: string tableName,
  4: i32 createTime,
  5: i32 lastAccessTime,
  6: StorageDescriptor sd,
  7: map<string, string> parameters,
  8: optional PrincipalPrivilegeSet privileges
}

message Partition {
  optional int64 create_time = 1;
  optional int64 last_access_time = 2;
  optional string location = 3;
  optional Parameters sd_parameters = 4;
  required bytes sd_hash = 5;
  optional Parameters parameters = 6;
}
  23. Agenda: Motivation, System Design, Caching Strategy, Transaction Management, Deployment, Experimental Results, Future Work
  24. Caching  Aggregate stats: • Location - on HBase • Used at compile time  File footers: • Location - on HBase • Used at runtime - accessed from tasks  Tables, partitions, storage descriptors: • Location - on Metastore server(s) • Used at compile time
  25. Caching: Aggregate Stats  get_aggr_stats_for(dbName, tblName, partNames, colNames) • Gets aggregated stats for columns across partitions – an expensive call • Used in CBO, stats annotation, the stats optimizer  HBMS_AGGR_STATS • RowKey: md5(dbName, tblName, partVal1, ..., partValn, colName) • Columns: AggrStats proto and AggrStatsBloomFilter proto  Lookup: • A new entry is added for each key not found in the cache; the AggrStats is calculated on the client side and the cached entry is saved as a serialized AggrStats proto • An AggrStatsBloomFilter is created over the partitions contained in the AggrStats  Invalidation: • TTL expiry: nodes are evicted from the cache • Alter partition, drop partition, analyze, etc.: adds an invalidation request to a queue • An invalidator thread picks up invalidation requests and executes a filter on HBase to remove expired entries • It uses the bloom filter to find all AggrStats protos containing the candidate partition and removes them from the cache
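The bloom-filter invalidation can be sketched end to end. This is a toy model (arbitrary filter size and hash count; dicts stand in for HBase rows and protos): each cached aggregate carries a Bloom filter over its member partitions, so the invalidator can find every aggregate that *might* include an altered partition without deserializing the stats themselves.

```python
import hashlib

# Sketch of HBMS_AGGR_STATS invalidation via per-entry Bloom filters.

class Bloom:
    def __init__(self, bits=256, hashes=3):
        self.bits, self.hashes, self.v = bits, hashes, 0

    def _idx(self, item):
        for i in range(self.hashes):
            d = hashlib.md5(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(d[:4], "big") % self.bits

    def add(self, item):
        for i in self._idx(item):
            self.v |= 1 << i

    def might_contain(self, item):
        return all((self.v >> i) & 1 for i in self._idx(item))

cache = {}  # row key -> (aggr stats, bloom over member partitions)

def put(key, stats, partitions):
    b = Bloom()
    for p in partitions:
        b.add(p)
    cache[key] = (stats, b)

def invalidate(partition):
    """Drop every aggregate whose bloom filter may include `partition`."""
    for k in [k for k, (_, b) in cache.items() if b.might_contain(partition)]:
        del cache[k]

put("t1/col_a", {"numRows": 500}, ["date=201601", "date=201602"])
put("t1/col_b", {"numRows": 900}, ["date=201603"])
invalidate("date=201601")        # e.g. after ALTER on partition date=201601
assert "t1/col_a" not in cache   # no false negatives: always evicted
```

Bloom filters never miss a true member, so a stale aggregate is always evicted; a false positive merely evicts an unrelated entry, which costs a cache miss but never serves stale stats.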
  26. Caching: File Footers • ORC footer cache • Tasks write file footers to a cache table on HBase (HBMS_FILE_METADATA; RowKey: fileId) • Read from the AM for split generation (avoids reading lots of HDFS files for split generation) • Since fileId is unique, overwrites are not a problem; stale entries are removed by a cleaner thread • Skips the transaction layer – high overhead – transaction conflicts – row mutation is already atomic
  27. Agenda: Motivation, System Design, Caching Strategy, Transaction Management, Deployment, Experimental Results, Future Work
  28. The HBase Metastore Needs Transactions  Atomicity is required – Creating a table/partition also creates a storage descriptor – Altering a table also alters its partitions – Dropping a table also drops table column privileges  HBase doesn't support transactions – No cross-row transactions  HBaseConnection – Supports different transaction managers in theory – VanillaHBaseConnection: no transactions
  29. Omid  Transaction layer on top of HBase  Initially developed by Yahoo!  Apache Incubator project – First release this Monday  Snapshot isolation – Natural, as HBase is a versioned database – No locking, no deadlocks, no blocking for either reads or writes – When two concurrent transactions write to the same data, the later one aborts  Low overhead
  30. Omid Components  TSO Server (Timestamp Oracle) – Generates transids – Tracks transaction status  TSO Client – Talks to the TSO – Caches transaction metadata – Most reads don't need to talk to the TSO  Compactor – Runs as an HBase coprocessor – Removes stale cell versions
  31. Omid Operations  Open a transaction – Get a transid from the TSO  Read a cell – Read all versions of the cell from HBase – Return the latest version committed before the transaction started  Write a cell – Write the value to HBase, versioned with the transid  Commit – Generate a commitid from the TSO – The TSO figures out whether there is a conflict using its transaction metadata
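The four operations above fit in a minimal snapshot-isolation sketch. This is in the spirit of Omid, not Omid's actual code (it glosses over shadow cells, the commit table, and client-side caching, and it applies the write after commit for brevity): a central "TSO" hands out timestamps and detects write-write conflicts at commit time, while reads pick the latest version committed before the transaction's start timestamp.

```python
# Minimal snapshot-isolation sketch modeled on the Omid operations.

class TSO:
    def __init__(self):
        self.ts = 0
        self.last_commit = {}   # cell -> last commit timestamp

    def begin(self):
        self.ts += 1
        return self.ts          # transaction's start timestamp (transid)

    def commit(self, start_ts, write_set):
        # Conflict: someone committed to one of our cells after we began.
        if any(self.last_commit.get(c, 0) > start_ts for c in write_set):
            return None         # abort
        self.ts += 1
        for c in write_set:
            self.last_commit[c] = self.ts
        return self.ts          # commitid

tso = TSO()
store = {}                      # (cell, commit_ts) -> value, multi-versioned

def read(cell, start_ts):
    """Latest version committed at or before the transaction's start."""
    versions = [(ts, v) for (c, ts), v in store.items()
                if c == cell and ts <= start_ts]
    return max(versions)[1] if versions else None

t1 = tso.begin()
t2 = tso.begin()                            # concurrent with t1
c1 = tso.commit(t1, {"tbl/part1"})          # t1 commits first
store[("tbl/part1", c1)] = "sd_hash_v1"
c2 = tso.commit(t2, {"tbl/part1"})          # t2 wrote the same cell -> aborts
```

This shows the "later one aborts" rule from the previous slide: no locks are taken, and the only coordination point is the commit call to the TSO.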
  32. Omid Data Structures  Memory management in the TSO – Never runs OOM; aborts old transactions instead  lastCommit – maps rows to their last commit timestamp (e.g. row1 -> T20, row2 -> T25, row5 -> T22) – used to detect transaction conflicts at commit time – the largest chunk of memory  committed – maps transids to commitids (e.g. T10 -> T20, T4 -> T25, T11 -> T30) – used to construct snapshots at read time – partially replicated to clients  aborted – the set of aborted transactions (e.g. T2, …)
  33. Transaction Conflicts  Two concurrent DDLs writing to the same data – Handled with proper retry logic  Task-node writes - the ORC footer cache – High chance of write conflicts – Row mutation is atomic in HBase – Cross-row atomicity is not required – So it bypasses the transaction layer

public void putFileMetadata(List<Long> fileIds, List<ByteBuffer> metadata, FileMetadataExprType type)
  34. Agenda: Motivation, System Design, Caching Strategy, Transaction Management, Deployment, Experimental Results, Future Work
  35. Deployment  Server-side components in HBase – Server-side filter – Omid compactor – Copy the related Hive jars into HBase: hive-common.jar, hive-metastore.jar, hive-serde-.*.jar  New config in hive-site.xml – hive.metastore.rawstore.impl: org.apache.hadoop.hive.metastore.hbase.HBaseStore  [Diagram: the Hive MetaStore talking to HBase, which hosts the server-side filter and Omid compactor, and to the TSO]
  36. Deploying Omid  Create the Omid tables in HBase – omid.sh create-hbase-commit-table – omid.sh create-hbase-timestamp-table  Start the Omid TSO – omid.sh tso  Related config in hive-site.xml – hive.metastore.hbase.connection.class=org.apache.hadoop.hive.metastore.hbase.OmidHBaseConnection – tso.host=localhost – tso.port=54758 – omid.client.connectionType=DIRECT
  37. Instantiating the HBase Metastore  Instantiate the HBase tables from scratch – hive --service hbaseschematool --install  hbaseimport: import an existing Hive Metastore – One-way import from ObjectStore to HBaseStore – hive --service hbaseimport
  38. Agenda: Motivation, System Design, Caching Strategy, Transaction Management, Deployment, Experimental Results, Future Work
  39. TPC-DS Queries  [Chart: query plan time for TPC-DS queries 7, 15, 27, 29, 39, 46, 56, 68, 70, 76, comparing HBaseStore, HBaseStore+Omid, and ObjectStore; y-axis 0-6000]  1824 partitions: the sweet spot for ObjectStore  Average speedup across all TPC-DS queries – 2.19 (without Omid) – 2.12 (with Omid)
  40. Agenda: Motivation, System Design, Caching Strategy, Transaction Management, Deployment, Experimental Results, Future Work
  41. Current Status  The hbase-metastore branch was merged to master last September  Turned off by default  Feature parity: almost – Minor holes: event notification/version/constraints – Deprecate? listTableNamesByFilter/listPartitionNamesByFilter – Tools enhancement – ACID is not supported  Runs most e2e queries  Fixing unit tests – TestMiniTezCliDriver: all pass – TestCliDriver: HIVE-14097 pending review – Not production quality yet
  42. Future Work - ACID  Transaction metadata is stored in the Metastore – Locks – Txns – Compactions  The data structures are harder to de-normalize  New work: a transaction server – Keeps the lock and transaction trees in memory
  43. Future Work - HA via HBase Coprocessor  Two new server components – Omid TSO server – Transaction server  All servers need HA – A management headache  Automatic HA through HBase coprocessors  [Diagram: TSO servers running as coprocessors inside the region servers]
  44. Future Work - Other  Stats aggregation – Coprocessor  Improving the ObjectCache – Rudimentary implementation currently – LRU  Omid consumes high CPU – 300% CPU at all times, by design – High throughput, avoids context switches – Might be an issue for small systems
  45. Thank You
