The document discusses using Hive, HBase, Phoenix, and Calcite to build a single data store for both analytics and transaction processing. It describes some recent improvements to Hive like LLAP (Live Long and Process) that aim to achieve sub-second query response times, as well as using HBase as the Hive metastore to improve performance.
How to use Hadoop for operational and transactional purposes by RODRIGO MERI... (Big Data Spain)
Hadoop is an open source framework designed to rapidly ingest, store, and analyze large data sets. Hadoop is well suited for batch processing where immediate interactive analytics are not required. But today, Hadoop does not support operational and transactional workloads, which consist of a constant flow of transactions requiring low-latency response times for read/write access.
This presentation describes how to efficiently load data into Hive. I cover partitioning, predicate pushdown, ORC file optimization and different loading schemes.
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud (Gluent)
Hive was the first popular SQL layer built on Hadoop and has long been known as a heavyweight SQL engine suitable mainly for long-running batch jobs. This has greatly changed since Hive was announced to the world over 8 years ago. Hortonworks and the open source community have evolved Apache Hive into a fast, dynamic SQL on Hadoop engine capable of running highly concurrent query workloads over large datasets with sub-second response time.
The latest Hortonworks and Azure HDInsight platform versions fully support Hive with LLAP execution engine for production use. In this webinar, we will go through the architecture of Hive + LLAP engine and explain how it differs from previous Hive versions. We will then dive deeper and show how features like query vectorization and LLAP columnar caching bring further automatic performance improvements.
Finally, we will show how Gluent brings these new performance benefits to traditional enterprise database platforms via transparent data virtualization, allowing even your largest databases to benefit without changing any application code. Join this webinar to learn about significant improvements in modern Hive architecture and how Gluent and Hive LLAP on Hortonworks or Azure HDInsight platforms can accelerate cloud migrations and greatly improve hybrid query performance!
A TPC Benchmark of Hive LLAP and Comparison with Presto (Yu Liu)
A TPC-H/DS benchmark of Hive LLAP (Low Latency Analytical Processing) and Presto, comparing the two popular big data query engines.
The results show significant advantages for Hive LLAP in performance and durability.
This talk will give an overview of two exciting releases for Apache HBase and Phoenix. HBase 2.0 is the next stable major release for Apache HBase, scheduled for early 2018, and the next evolution from the Apache HBase community after 1.0. HBase 2.0 contains a large number of features that have been in development for a long time, including rewritten region assignment, performance improvements (RPC, rewritten write pipeline, etc.), async clients and WAL, a C++ client, off-heap memstore and other buffers, shading of dependencies, and many other fixes and stability improvements. We will go into technical detail on some of the most important improvements in the release, as well as the implications for users in terms of APIs and upgrade paths. Phoenix 5.0 is the next big release, notable for its integration with HBase 2.0 and many performance improvements in support of secondary indexes. It has features such as encoded columns, Kafka and Hive integration, and many other performance improvements. Speakers: Ankit Singhal, Senior Software Engineer, Hortonworks Inc. and Rajeshbabu Chintaguntla, Staff Software Engineer, Hortonworks
We will talk about two real-world challenging SQL-on-Hadoop use cases: #1 Highly Parallel Workload Over Massive Data, #2 Sub-second SQL for Online Reporting. The challenge is to meet very strict performance requirements over hundreds of billions of rows. We will introduce how we solved these challenges using Hive on Tez, Hive LLAP and Phoenix, with real-life performance numbers!
Apache Hive is a data warehousing system for large volumes of data stored in Hadoop. However, the data is useless unless you can use it to add value to your company. Hive provides a SQL-based query language that dramatically simplifies the process of querying your large data sets. That is especially important while your data scientists are developing and refining their queries to improve their understanding of the data. In many companies, such as Facebook, Hive accounts for a large percentage of the total MapReduce queries that are run on the system. Although Hive makes writing large data queries easier for the user, there are many performance traps for the unwary. Many of them are artifacts of the way Hive has evolved over the years and the requirement that the default behavior must be safe for all users. This talk will present examples of how Hive users have made mistakes that made their queries run much longer than necessary. It will also present guidelines for how to get better performance for your queries and how to look at the query plan to understand what Hive is doing.
Keynote from Apache Big Data EU. This introduces training we are doing at Hortonworks to help our employees understand and work well as part of the Apache Software Foundation.
Big Data Infrastructure: Introduction to Hadoop with MapReduce, Pig, and Hive (ODSC)
The main objective of this workshop is to give the audience hands-on experience with several Hadoop technologies and jump-start their Hadoop journey. In this workshop, you will load data and submit queries using Hadoop! Before jumping into the technology, the founders of DataKitchen review Hadoop and some of its technologies (MapReduce, Hive, Pig, Impala and Spark), look at performance, and present a rubric for choosing which technology to use when.
Keynote slides from Big Data Spain, Nov 2016. Offers some thoughts on how the Hadoop ecosystem is growing and changing to support the enterprise, including Hive, Spark, NiFi, security and governance, streaming, and the cloud.
Data in Motion, Data at Rest: Hortonworks' Modern Architecture (Mats Johansson)
Presentation at Data Innovation Summit 2016 in Stockholm
How to build a modern data architecture supporting data in motion and data at rest with Hortonworks Data Flow and Data Platform.
Real-Time Log Analysis with Apache Mesos, Kafka and Cassandra (Joe Stein)
Slides for the solution we developed using Mesos, Docker, Kafka, Spark, Cassandra and Solr (DataStax Enterprise Edition), all developed in Go, for doing real-time log analysis at scale. Many organizations either need or want log analysis in real time, where you can see within a second what is happening across your entire infrastructure. Today, with the hardware available and the software systems we have in place, you can develop, build and run these solutions as a service.
LLAP (Live Long and Process) is the newest query acceleration engine for Hive 2.0, which entered GA in 2017. LLAP brings a new set of trade-offs and optimizations that allow for efficient and secure multi-user BI systems in the cloud. In this talk, we discuss the specifics of building a modern BI engine within those boundaries, designed to be fast and cost-effective on the public cloud. The LLAP cache focuses on speeding up common BI query patterns in the cloud while avoiding most of the operational overhead of maintaining a caching layer: the cache is automatically coherent, uses intelligent eviction, and supports file formats from text to ORC. We also explore combining the cache with a transactional storage layer that supports online UPDATEs and DELETEs without full data reloads. LLAP by itself, as a relational data layer, extends the same caching and security advantages to any other data processing framework. We give an overview of the structure of such a hybrid system, where both Hive and Spark use LLAP to provide SQL query acceleration in the cloud, with new, improved concurrent query support and production-ready tools and UI.
Speaker: Sergey Shelukhin, Member of Technical Staff, Hortonworks
We discuss the current state of LLAP (Live Long and Process), the engine in Hive 2.0 for concurrent, sub-second execution of analytical queries. LLAP is a hybrid execution model that enables performance improvements within and across queries, such as caching of columnar data with cache coherence and intelligent eviction for disaggregated storage models (like S3, Isilon, Azure), JIT-friendly operator pipelines, asynchronous I/O, data pre-fetching and multi-threaded processing. LLAP features robust tolerance of machine and service failures, achieved by building on top of time-tested fault-tolerant subsystems, as well as a concurrency-directed design that achieves high utilization with low latency via resource sharing, reducing overheads for multiple queries and enabling the system to preempt tasks of lower priority without failing any query in flight. The talk also covers the novel deployment model required for hybrid execution: the elasticity demands of the system are served by a long-lived YARN service interacting with on-demand elastic containers, serving as a tightly integrated DAG-based framework for query execution. We discuss the current state of the project, performance numbers, deployment and usage strategy, as well as future work, including how LLAP fits into a unified secure DataFrame access layer.
Stinger.Next by Alan Gates of Hortonworks (Data Con LA)
Over the last 13 months the Apache Hive community, which included 145 developers and 44 companies working together through the Stinger initiative, delivered 390,000 lines of code and 1600 resolved JIRA tickets. This is only the beginning. The Hive community has already started the next phase of extending the Speed, Scale, and SQL compliance in Hive. As Hadoop 2.0 with YARN evolves to enable a dizzying array of powerful engines that allow us to interact with ever growing data in new ways, well known tools such as SQL need to scale with it. This session will provide a technical illustration of the challenges facing SQL on Hadoop today and what the road ahead looks like as the user community drives more innovation. Stinger.next is the next multi-phase initiative to evolve Hive as the de facto SQL engine for Hadoop, designed to deliver Speed, Scale and better SQL.
HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ... (HBaseCon)
Phoenix has evolved to become a full-fledged relational database layer over HBase data. We'll discuss the fundamental principles of how Phoenix pushes computation to the server and why this yields performance that enables direct support of low-latency applications, along with some major new features. Next, we'll outline our approach for transaction support in Phoenix, a work in progress, and discuss the pros and cons of the various approaches. Lastly, we'll examine the current means of integrating Phoenix with the rest of the Hadoop ecosystem.
Real-time Analytics with Apache Kafka and Apache Spark (Rahul Jain)
A presentation-cum-workshop on real-time analytics with Apache Kafka and Apache Spark. Apache Kafka is a distributed publish-subscribe messaging system, while Spark Streaming brings Spark's language-integrated API to stream processing, allowing streaming applications to be written quickly and easily. It supports both Java and Scala. In this workshop we explore Apache Kafka, ZooKeeper and Spark with a web click-streaming example using Spark Streaming. A clickstream is the recording of the parts of the screen a computer user clicks on while web browsing.
Hive 3 New Horizons, DataWorks Summit Melbourne February 2019 (alanfgates)
Hive 3 new SQL features, including LLAP, workload management, SQL over Kafka and JDBC data sources, integration with Spark via the Hive Warehouse Connector, ACID 2, and constraints and default values.
Apache Hive is a rapidly evolving project which continues to enjoy great adoption in the big data ecosystem. As Hive continues to grow its support for analytics, reporting, and interactive query, the community is hard at work in improving it along with many different dimensions and use cases. This talk will provide an overview of the latest and greatest features and optimizations which have landed in the project over the last year. Materialized views, the extension of ACID semantics to non-ORC data, and workload management are some noteworthy new features.
We will discuss optimizations which provide major performance gains, including significantly improved performance for ACID tables. The talk will also provide a glimpse of what is expected to come in the near future.
Speaker: Alan Gates, Co-Founder, Hortonworks
Speaker: Varun Sharma (Pinterest)
Over the past year, HBase has become an integral component of Pinterest's storage stack. HBase has enabled us to quickly launch and iterate on new products and create amazing pinner experiences. This talk briefly describes some of these applications, the underlying schema, and how our HBase setup stays highly available and performant despite billions of requests every week. It will also include some performance tips for running on SSDs. Finally, we will talk about a homegrown serving technology we built from a mashup of HBase components that has gained wide adoption across Pinterest.
Hive & HBase for Transaction Processing Hadoop Summit EU Apr 2015
1. Hive & HBase For Transaction Processing
Alan Gates
@alanfgates
2. Agenda
• Our goal
– Combine Apache Hive, HBase, Phoenix, and Calcite to build a single data store that can be used for analytics and transaction processing
• But before we get to that we need to consider
– Some things happening in Hive
– Some things happening in Phoenix
4. A Brief History of Hive
• Initial goal was to make it easy to execute MapReduce using a familiar language: SQL
– Most queries took minutes or hours
– Primarily used for batch ETL jobs
• Since 0.11 much has been done to support interactive and ad hoc queries
– Many new features focused on improving performance: ORC and Parquet, Tez and Spark, vectorization
– As of Hive 0.14 (November 2014), TPC-DS query 3 (star-join, group, order, limit) using ORC, Tez, and vectorization finishes in 9s at 200GB scale and 32s at 30TB scale
– Still have ~2-5 second minimum for all queries
• Ongoing performance work with the goal of reaching sub-second response time
– Continued investment in vectorization
– LLAP
– Using Apache HBase for the metastore
LLAP = Live Long And Process
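To make the 0.14-era numbers concrete, here is a minimal sketch of running the TPC-DS query 3 shape over JDBC against HiveServer2. The host, database name, and credentials are placeholder assumptions; the SET keys are standard Hive settings for Tez and vectorization, and the hive-jdbc driver is assumed to be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class StarJoinExample {
    public static void main(String[] args) throws Exception {
        // Unsecured placeholder HiveServer2 endpoint.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hs2-host:10000/tpcds");
             Statement stmt = conn.createStatement()) {
            // The performance features the slide names:
            stmt.execute("SET hive.execution.engine=tez");
            stmt.execute("SET hive.vectorized.execution.enabled=true");
            // TPC-DS query 3: star join + group + order + limit over ORC tables.
            ResultSet rs = stmt.executeQuery(
                "SELECT dt.d_year, item.i_brand_id, item.i_brand, "
              + "       SUM(ss_ext_sales_price) AS sum_agg "
              + "FROM store_sales "
              + "JOIN date_dim dt ON dt.d_date_sk = store_sales.ss_sold_date_sk "
              + "JOIN item ON store_sales.ss_item_sk = item.i_item_sk "
              + "WHERE item.i_manufact_id = 128 AND dt.d_moy = 11 "
              + "GROUP BY dt.d_year, item.i_brand, item.i_brand_id "
              + "ORDER BY dt.d_year, sum_agg DESC, i_brand_id "
              + "LIMIT 100");
            while (rs.next()) {
                System.out.printf("%d %d %s %.2f%n",
                    rs.getInt(1), rs.getInt(2), rs.getString(3), rs.getDouble(4));
            }
        }
    }
}
```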
5. LLAP: Why?
• It is hard to be fast and flexible in Tez
– When SQL session starts, Tez AM spun up (first-query cost)
– For subsequent queries Tez containers can be
– pre-allocated: fast but not flexible
– allocated and released for each query: flexible but start-up cost for every query
• No caching of data between queries
– Even if data is in the OS cache, much of the IO cost is deserialization/vector marshaling, which is not shared
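The pre-allocation trade-off above maps onto real session settings; a minimal sketch follows, assuming an unsecured HiveServer2 at a placeholder host, with an illustrative container count.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class PrewarmSession {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hs2-host:10000/default");
             Statement stmt = conn.createStatement()) {
            // Pre-allocated: containers are spun up and held for the session.
            // Fast for every query after the first, but capacity is fixed.
            stmt.execute("SET hive.prewarm.enabled=true");
            stmt.execute("SET hive.prewarm.numcontainers=10");
            // With prewarm disabled, containers are requested per query:
            // flexible sizing, but each query pays YARN allocation latency.
        }
    }
}
```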
6. LLAP: What
• LLAP is a node-resident daemon process
– Low latency by reducing setup cost
– Multi-threaded engine that runs smaller tasks for a query, including reads, filters and some joins
– Regular Tez tasks used for larger shuffles and other operators
• LLAP has an in-memory columnar data cache
– High-throughput IO using an async IO elevator with a dedicated thread and core per disk
– Low latency by providing data from the in-memory (off-heap) cache instead of going to HDFS
– Stores data in columnar format for vectorization irrespective of underlying file type
– Security enforced across queries and users
• Uses YARN for resource management
[Diagram: a node running the LLAP daemon process, with a query fragment executing as a task against the LLAP in-memory columnar cache, backed by HDFS]
7. LLAP: What
[Diagram: the LLAP process runs on multiple nodes, accelerating Tez tasks; a Hive query fans out to LLAP daemons on each node, each running read tasks against its in-memory columnar cache over HDFS]
8. LLAP: Is and Is Not
• It is not MPP
– Data not shuffled between LLAP nodes (except in limited cases)
• It is not a replacement for Tez or Spark
– Configured engine still used to launch tasks for post-shuffle operations (e.g. hash joins, distributed aggregations, etc.)
• It is not required; users can still use Hive without installing LLAP daemons
• It is a map server, or a set of standing map tasks
• It is currently under development on the llap branch
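The "not required" point later became visible as session-level knobs once LLAP shipped with Hive 2.x. A hedged sketch follows; the configuration keys are from Hive 2.x (after this talk), and the host is a placeholder.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class LlapModes {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hs2-host:10000/default");
             Statement stmt = conn.createStatement()) {
            // Read through the daemons' in-memory columnar cache.
            stmt.execute("SET hive.llap.io.enabled=true");
            // none = plain Tez containers (no LLAP daemons needed),
            // map  = scan-side "map" work runs in the daemons,
            // all  = push everything that fits into LLAP.
            stmt.execute("SET hive.llap.execution.mode=map");
        }
    }
}
```

The "map" mode matches the slide's framing of LLAP as a map server: scans, filters, and some joins run in the daemons, while post-shuffle work stays in ordinary Tez tasks.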
11. HBase Metastore: Why?
> 700 metastore queries to plan TPC-DS query 27!!!
12. HBase Metastore: Why?
• Object Relational Modeling is an impedance mismatch
• The need to work across different DBs limits tuning opportunities
• No caching of catalog objects or stats in HiveServer2 or the Hive metastore
• Hadoop nodes cannot contact the RDBMS directly due to scale issues
• Solution: use HBase
– Can store objects directly, no need to normalize
– Already scales, performs, etc.
– Can store additional data not stored today due to RDBMS capacity limitations
– Can access the metadata from the cluster (e.g. LLAP, Tez AM)
13. But...
• HBase does not have transactions – the metastore needs them
– Tephra, Omid 2 (Yahoo), and others working on this
• HBase is hard to administer and install
– Yes, we will need to improve this
– We will also need an embedded option for test/POC setups to keep HBase from becoming a barrier to adoption
• Basically any work we need to do to HBase for this is good, since it benefits all HBase users
14. HBase Metastore: How
• HBaseStore, a new implementation of RawStore that stores data in HBase
• Not the default; users are still free to use an RDBMS
• Less than 10 tables in HBase
– DBS, TBLS, PARTITIONS, ... – basically one for each object type
– Common partition data factored out to significantly reduce size
• Layout highly optimized for SELECT and DML queries; longer operations moved into DDL (e.g. grant)
• Extensive caching
– Of catalog objects for the length of a query
– Of aggregated stats across queries and users
• Ongoing work in the hbase-metastore branch
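A minimal sketch of the plug-point involved: the metastore's storage layer is selected by the standard hive.metastore.rawstore.impl setting, which the branch points at the HBase-backed implementation instead of the ORM-based default. The HBaseStore class name below is the one used on the hbase-metastore branch; treat the snippet as illustrative.

```java
import org.apache.hadoop.hive.conf.HiveConf;

public class MetastoreBackend {
    public static void main(String[] args) {
        HiveConf conf = new HiveConf();
        // Default is ObjectStore, the ORM layer over an RDBMS; the branch
        // swaps in an HBase-backed RawStore implementation.
        conf.set("hive.metastore.rawstore.impl",
                 "org.apache.hadoop.hive.metastore.hbase.HBaseStore");
        System.out.println(conf.get("hive.metastore.rawstore.impl"));
    }
}
```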
16. Apache Phoenix: Putting SQL Back in NoSQL
• SQL layer on top of HBase
• Originally oriented toward transaction processing
• Moving to add more analytics-type operators
– Adding multiple join implementations
– Requests for OLAP functions (PHOENIX-154)
• Working on adding transactions (PHOENIX-1674)
• Moving to Apache Calcite for optimization (PHOENIX-1488)
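For flavor, a small sketch of Phoenix's SQL-over-HBase model through plain JDBC, with Phoenix's UPSERT standing in for INSERT/UPDATE. The ZooKeeper host and the table are assumptions for illustration.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class PhoenixExample {
    public static void main(String[] args) throws Exception {
        // Phoenix connects through the HBase cluster's ZooKeeper quorum.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:phoenix:zk-host:2181")) {
            conn.createStatement().execute(
                "CREATE TABLE IF NOT EXISTS orders ("
              + "  order_id BIGINT NOT NULL PRIMARY KEY,"
              + "  customer VARCHAR, amount DECIMAL(10,2))");
            try (PreparedStatement ps = conn.prepareStatement(
                    "UPSERT INTO orders VALUES (?, ?, ?)")) {
                ps.setLong(1, 1L);
                ps.setString(2, "acme");
                ps.setBigDecimal(3, new java.math.BigDecimal("99.50"));
                ps.executeUpdate();
            }
            conn.commit();  // Phoenix batches mutations until commit
            ResultSet rs = conn.createStatement().executeQuery(
                "SELECT customer, SUM(amount) FROM orders GROUP BY customer");
            while (rs.next()) {
                System.out.println(rs.getString(1) + " " + rs.getBigDecimal(2));
            }
        }
    }
}
```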
18. What If?
• We could share one O/JDBC driver?
• We could share one SQL dialect?
• Phoenix could leverage extensive analytics functionality in Hive without re-inventing it?
• Users could access their transactional and analytics data in single SQL operations?
19. How?
• Insight #1: LLAP is a storage plus operations server for Hive; we can swap it out for other implementations
• Insight #2: Tez and Spark can do post-shuffle operations (hash join, etc.) with LLAP or HBase
• Insight #3: Calcite (used by both Hive and Phoenix) is built specifically to integrate disparate data storage systems
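As a rough illustration of Insight #3: Calcite federates storage engines by letting each one register a Schema whose Tables it knows how to scan, and then plans queries across all registered schemas. The toy in-memory table below is purely illustrative; a real Hive/Phoenix integration would hand back HBase- or LLAP-backed Table implementations instead.

```java
import java.util.Collections;
import java.util.Map;
import org.apache.calcite.DataContext;
import org.apache.calcite.linq4j.Enumerable;
import org.apache.calcite.linq4j.Linq4j;
import org.apache.calcite.rel.type.RelDataType;
import org.apache.calcite.rel.type.RelDataTypeFactory;
import org.apache.calcite.schema.ScannableTable;
import org.apache.calcite.schema.Table;
import org.apache.calcite.schema.impl.AbstractSchema;
import org.apache.calcite.schema.impl.AbstractTable;
import org.apache.calcite.sql.type.SqlTypeName;

public class ToyStoreSchema extends AbstractSchema {
  // One table named EVENTS; a real adapter would enumerate the engine's catalog.
  @Override protected Map<String, Table> getTableMap() {
    return Collections.singletonMap("EVENTS", (Table) new EventsTable());
  }

  static class EventsTable extends AbstractTable implements ScannableTable {
    @Override public RelDataType getRowType(RelDataTypeFactory typeFactory) {
      return typeFactory.builder()
          .add("ID", SqlTypeName.BIGINT)
          .add("NAME", SqlTypeName.VARCHAR)
          .build();
    }
    // Full-scan entry point; this toy version pushes nothing down.
    @Override public Enumerable<Object[]> scan(DataContext root) {
      return Linq4j.asEnumerable(
          new Object[][] { {1L, "login"}, {2L, "click"} });
    }
  }
}
```

Registered on a Calcite connection via rootSchema.add("toy", new ToyStoreSchema()), the table becomes queryable as toy.EVENTS and can be joined against tables from any other registered schema, which is the integration mechanism the insight relies on.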
20. Vision
• User picks storage location for a table in create table (LLAP or HBase)
• Transactions more efficient in HBase tables but work in both
• Analytics more efficient in LLAP tables but work in both
• Queries that require shuffle use Tez or Spark for post-shuffle operators
[Diagram: queries enter a JDBC server; Calcite is used for planning and Phoenix for execution, with work routed to HBase and LLAP nodes over HDFS]
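A purely hypothetical sketch of what this vision could look like to a user: one driver, one dialect, storage chosen per table. Neither the URL scheme nor the STORED BY shorthand below is real syntax from this talk's timeframe; they exist only to illustrate the idea.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class UnifiedStoreVision {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:unified://server:9999/default");  // hypothetical driver
             Statement stmt = conn.createStatement()) {
            // Transactional table: lives in HBase, efficient for OLTP.
            stmt.execute("CREATE TABLE orders (id BIGINT, customer VARCHAR(64)) "
                + "STORED BY 'hbase'");  // hypothetical shorthand
            // Analytics table: lives in LLAP/ORC, efficient for scans.
            stmt.execute("CREATE TABLE sales_history (id BIGINT, amount DOUBLE) "
                + "STORED BY 'llap'");   // hypothetical shorthand
            // A single query joining both would use Tez or Spark for the
            // post-shuffle operators, per the slide above.
        }
    }
}
```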
21. Hurdles
• Need to integrate types/data representation
• Need to integrate transaction management
• Work to do in Calcite to optimize transactional queries well