Hive & HBase For
Transaction Processing
Alan Gates
@alanfgates
Agenda
• Our goal
– Combine Apache Hive, HBase, Phoenix, and Calcite to build a single data store
that can be used for analytics and transaction processing
• But before we get to that we need to consider
– Some things happening in Hive
– Some things happening in Phoenix
A Brief History of Hive
• Initial goal was to make it easy to execute MapReduce using a familiar
language: SQL
– Most queries took minutes or hours
– Primarily used for batch ETL jobs
• Since 0.11 much has been done to support interactive and ad hoc queries
– Many new features focused on improving performance: ORC and Parquet, Tez and
Spark, vectorization
– As of Hive 0.14 (November 2014) TPC-DS query 3 (star-join, group, order, limit) using
ORC, Tez, and vectorization finishes in 9s for 200GB scale and 32s for 30TB scale.
– Still have ~2-5 second minimum for all queries
• Ongoing performance work with goal of reaching sub-second response time
– Continued investment in vectorization
– LLAP
– Using Apache HBase for metastore
LLAP = Live Long And Process
LLAP: Why?
• It is hard to be fast and flexible in Tez
– When a SQL session starts, a Tez AM is spun up (first query cost)
– For subsequent queries, Tez containers can be either
– pre-allocated: fast but not flexible
– allocated and released for each query: flexible, but with a start-up cost for every query
• No caching of data between queries
– Even if data is in OS cache much of IO cost is deserialization/vector marshaling
which is not shared
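The trade-off above can be made concrete with a toy latency model. This is only a sketch: the timings are made-up illustrative values, not measurements from Hive or Tez, and the names `Costs` and `session_latencies` are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Costs:
    # All values are illustrative assumptions, not Hive/Tez measurements.
    am_startup_s: float = 5.0         # Tez AM spin-up, paid once per session
    container_startup_s: float = 2.0  # cost of allocating containers
    work_s: float = 1.0               # the actual query work


def session_latencies(n_queries: int, preallocated: bool,
                      costs: Costs = Costs()) -> list[float]:
    """Return the latency of each query in one SQL session."""
    latencies = []
    for i in range(n_queries):
        latency = costs.work_s
        if i == 0:
            # The first query pays for the AM and the initial containers.
            latency += costs.am_startup_s + costs.container_startup_s
        elif not preallocated:
            # Containers were released, so every later query re-allocates them.
            latency += costs.container_startup_s
        latencies.append(latency)
    return latencies


if __name__ == "__main__":
    print("pre-allocated:", session_latencies(3, preallocated=True))
    print("per-query:    ", session_latencies(3, preallocated=False))
```

In this model pre-allocation makes later queries cheap but holds containers for the whole session, which is the flexibility cost the slide refers to.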
LLAP: What
• LLAP is a node resident daemon process
– Low latency by reducing setup cost
– Multi-threaded engine that runs smaller tasks for query
including reads, filter and some joins
– Use regular Tez tasks for larger shuffle and other operators
• LLAP has an in-memory columnar data cache
– High throughput IO using Async IO Elevator with dedicated
thread and core per disk
– Low latency by providing data from in-memory (off heap)
cache instead of going to HDFS
– Store data in columnar format for vectorization irrespective
of underlying file type
– Security enforced across queries and users
• Uses YARN for resource management
[Figure: a node with HDFS and an LLAP process. Inside the LLAP process: a query fragment (the LLAP process running a task for a query) and the LLAP in-memory columnar cache.]
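The idea of serving already-decoded column data from memory can be sketched as a small cache keyed by file, stripe, and column. This is a minimal illustration, not LLAP's implementation: the class name, the key layout, and the LRU eviction policy are all assumptions made for the example.

```python
from collections import OrderedDict


class ColumnChunkCache:
    """Toy cache of decoded column chunks (hypothetical, LRU eviction assumed)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._chunks = OrderedDict()  # (file, stripe, column) -> decoded values
        self.hits = 0
        self.misses = 0

    def get(self, key, load):
        """Return the chunk for key, calling load() only on a cache miss."""
        if key in self._chunks:
            self._chunks.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._chunks[key]
        self.misses += 1
        chunk = load()  # stands in for the HDFS read plus deserialization
        self._chunks[key] = chunk
        if len(self._chunks) > self.capacity:
            self._chunks.popitem(last=False)  # evict the least recently used
        return chunk


def decode_price_column():
    # Placeholder for reading and decoding one column chunk from a file.
    return [9.99, 14.50, 3.25]


if __name__ == "__main__":
    cache = ColumnChunkCache(capacity=2)
    key = ("sales.orc", 0, "price")
    cache.get(key, decode_price_column)  # first query: miss, decodes the data
    cache.get(key, decode_price_column)  # second query: hit, no decoding
    print(cache.hits, cache.misses)
```

The second lookup skips the load function entirely, which is where the deserialization cost mentioned on the previous slide is avoided.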
LLAP: What
[Figure: a node with HDFS and an LLAP process. Inside the LLAP process: a query fragment (the LLAP process running a read task for a query) and the LLAP in-memory columnar cache.]
LLAP process runs on multiple nodes, accelerating Tez tasks
[Figure: a Hive query on one node, with an LLAP process on each of four other nodes.]
LLAP: Is and Is Not
• It is not MPP
– Data not shuffled between LLAP nodes (except in limited cases)
• It is not a replacement for Tez or Spark
– Configured engine still used to launch tasks for post-shuffle operations (e.g. hash
joins, distributed aggregations, etc.)
• It is not required; users can still use Hive without installing LLAP
daemons
• It is a Map server, or a set of standing map tasks
• It is currently under development on the llap branch
HBase Metastore: Why?
[Figure: schema of the RDBMS-backed Hive metastore. Tables and columns shown:]
BUCKETING_COLS: SD_ID BIGINT(20), BUCKET_COL_NAME VARCHAR(256), INTEGER_IDX INT(11)
CDS: CD_ID BIGINT(20)
COLUMNS_V2: CD_ID BIGINT(20), COMMENT VARCHAR(256), COLUMN_NAME VARCHAR(128), TYPE_NAME VARCHAR(4000), INTEGER_IDX INT(11)
DATABASE_PARAMS: DB_ID BIGINT(20), PARAM_KEY VARCHAR(180), PARAM_VALUE VARCHAR(4000)
DBS: DB_ID BIGINT(20), DESC VARCHAR(4000), DB_LOCATION_URI VARCHAR(4000), NAME VARCHAR(128), OWNER_NAME VARCHAR(128), OWNER_TYPE VARCHAR(10)
DB_PRIVS: DB_GRANT_ID BIGINT(20), CREATE_TIME INT(11), DB_ID BIGINT(20), GRANT_OPTION SMALLINT(6), GRANTOR VARCHAR(128), GRANTOR_TYPE VARCHAR(128), PRINCIPAL_NAME VARCHAR(128), PRINCIPAL_TYPE VARCHAR(128), DB_PRIV VARCHAR(128)
GLOBAL_PRIVS: USER_GRANT_ID BIGINT(20), CREATE_TIME INT(11), GRANT_OPTION SMALLINT(6), GRANTOR VARCHAR(128), GRANTOR_TYPE VARCHAR(128), PRINCIPAL_NAME VARCHAR(128), PRINCIPAL_TYPE VARCHAR(128), USER_PRIV VARCHAR(128)
IDXS: INDEX_ID BIGINT(20), CREATE_TIME INT(11), DEFERRED_REBUILD BIT(1), INDEX_HANDLER_CLASS VARCHAR(4000), INDEX_NAME VARCHAR(128), INDEX_TBL_ID BIGINT(20), LAST_ACCESS_TIME INT(11), ORIG_TBL_ID BIGINT(20), SD_ID BIGINT(20)
INDEX_PARAMS: INDEX_ID BIGINT(20), PARAM_KEY VARCHAR(256), PARAM_VALUE VARCHAR(4000)
NUCLEUS_TABLES: CLASS_NAME VARCHAR(128), TABLE_NAME VARCHAR(128), TYPE VARCHAR(4), OWNER VARCHAR(2), VERSION VARCHAR(20), INTERFACE_NAME VARCHAR(255)
PARTITIONS: PART_ID BIGINT(20), CREATE_TIME INT(11), LAST_ACCESS_TIME INT(11), PART_NAME VARCHAR(767), SD_ID BIGINT(20), TBL_ID BIGINT(20), LINK_TARGET_ID BIGINT(20)
PARTITION_EVENTS: PART_NAME_ID BIGINT(20), DB_NAME VARCHAR(128), EVENT_TIME BIGINT(20), EVENT_TYPE INT(11), PARTITION_NAME VARCHAR(767), TBL_NAME VARCHAR(128)
PARTITION_KEYS: TBL_ID BIGINT(20), PKEY_COMMENT VARCHAR(4000), PKEY_NAME VARCHAR(128), PKEY_TYPE VARCHAR(767), INTEGER_IDX INT(11)
PARTITION_KEY_VALS: PART_ID BIGINT(20), PART_KEY_VAL VARCHAR(256), INTEGER_IDX INT(11)
PARTITION_PARAMS: PART_ID BIGINT(20), PARAM_KEY VARCHAR(256), PARAM_VALUE VARCHAR(4000)
PART_COL_PRIVS: PART_COLUMN_GRANT_ID BIGINT(20), COLUMN_NAME VARCHAR(128), CREATE_TIME INT(11), GRANT_OPTION SMALLINT(6), GRANTOR VARCHAR(128), GRANTOR_TYPE VARCHAR(128), PART_ID BIGINT(20), PRINCIPAL_NAME VARCHAR(128), PRINCIPAL_TYPE VARCHAR(128), PART_COL_PRIV VARCHAR(128)
PART_PRIVS: PART_GRANT_ID BIGINT(20), CREATE_TIME INT(11), GRANT_OPTION SMALLINT(6), GRANTOR VARCHAR(128), GRANTOR_TYPE VARCHAR(128), PART_ID BIGINT(20), PRINCIPAL_NAME VARCHAR(128), PRINCIPAL_TYPE VARCHAR(128), PART_PRIV VARCHAR(128)
ROLES: ROLE_ID BIGINT(20), CREATE_TIME INT(11), OWNER_NAME VARCHAR(128), ROLE_NAME VARCHAR(128)
ROLE_MAP: ROLE_GRANT_ID BIGINT(20), ADD_TIME INT(11), GRANT_OPTION SMALLINT(6), GRANTOR VARCHAR(128), GRANTOR_TYPE VARCHAR(128), PRINCIPAL_NAME VARCHAR(128), PRINCIPAL_TYPE VARCHAR(128), ROLE_ID BIGINT(20)
SDS: SD_ID BIGINT(20), CD_ID BIGINT(20), INPUT_FORMAT VARCHAR(4000), IS_COMPRESSED BIT(1), IS_STOREDASSUBDIRECTORIES BIT(1), LOCATION VARCHAR(4000), NUM_BUCKETS INT(11), OUTPUT_FORMAT VARCHAR(4000), SERDE_ID BIGINT(20)
SD_PARAMS: SD_ID BIGINT(20), PARAM_KEY VARCHAR(256), PARAM_VALUE VARCHAR(4000)
SEQUENCE_TABLE: SEQUENCE_NAME VARCHAR(255), NEXT_VAL BIGINT(20)
SERDES: SERDE_ID BIGINT(20), NAME VARCHAR(128), SLIB VARCHAR(4000)
SERDE_PARAMS: SERDE_ID BIGINT(20), PARAM_KEY VARCHAR(256), PARAM_VALUE VARCHAR(4000)
SKEWED_COL_NAMES: SD_ID BIGINT(20), SKEWED_COL_NAME VARCHAR(256), INTEGER_IDX INT(11)
SKEWED_COL_VALUE_LOC_MAP: SD_ID BIGINT(20), STRING_LIST_ID_KID BIGINT(20), LOCATION VARCHAR(4000)
SKEWED_STRING_LIST: STRING_LIST_ID BIGINT(20)
SKEWED_STRING_LIST_VALUES: STRING_LIST_ID BIGINT(20), STRING_LIST_VALUE VARCHAR(256), INTEGER_IDX INT(11)
SKEWED_VALUES: SD_ID_OID BIGINT(20), STRING_LIST_ID_EID BIGINT(20), INTEGER_IDX INT(11)
SORT_COLS: SD_ID BIGINT(20), COLUMN_NAME VARCHAR(128), ORDER INT(11), INTEGER_IDX INT(11)
TABLE_PARAMS: TBL_ID BIGINT(20), PARAM_KEY VARCHAR(256), PARAM_VALUE VARCHAR(4000)
TBLS: TBL_ID BIGINT(20), CREATE_TIME INT(11), DB_ID BIGINT(20), LAST_ACCESS_TIME INT(11), OWNER VARCHAR(767), RETENTION INT(11), SD_ID BIGINT(20), TBL_NAME VARCHAR(128), TBL_TYPE VARCHAR(128), VIEW_EXPANDED_TEXT MEDIUMTEXT, VIEW_ORIGINAL_TEXT MEDIUMTEXT, LINK_TARGET_ID BIGINT(20)
TBL_COL_PRIVS: TBL_COLUMN_GRANT_ID BIGINT(20), COLUMN_NAME VARCHAR(128), CREATE_TIME INT(11), GRANT_OPTION SMALLINT(6), GRANTOR VARCHAR(128), GRANTOR_TYPE VARCHAR(128), PRINCIPAL_NAME VARCHAR(128), PRINCIPAL_TYPE VARCHAR(128), TBL_COL_PRIV VARCHAR(128), TBL_ID BIGINT(20)
TBL_PRIVS: TBL_GRANT_ID BIGINT(20), CREATE_TIME INT(11), GRANT_OPTION SMALLINT(6), GRANTOR VARCHAR(128), GRANTOR_TYPE VARCHAR(128), PRINCIPAL_NAME VARCHAR(128), PRINCIPAL_TYPE VARCHAR(128), TBL_PRIV VARCHAR(128), TBL_ID BIGINT(20)
TAB_COL_STATS: CS_ID BIGINT(20), DB_NAME VARCHAR(128), TABLE_NAME VARCHAR(128), COLUMN_NAME VARCHAR(128), COLUMN_TYPE VARCHAR(128), TBL_ID BIGINT(20), LONG_LOW_VALUE BIGINT(20), LONG_HIGH_VALUE BIGINT(20), DOUBLE_HIGH_VALUE DOUBLE(53,4), DOUBLE_LOW_VALUE DOUBLE(53,4), BIG_DECIMAL_LOW_VALUE VARCHAR(4000), BIG_DECIMAL_HIGH_VALUE VARCHAR(4000), NUM_NULLS BIGINT(20), NUM_DISTINCTS BIGINT(20), AVG_COL_LEN DOUBLE(53,4), MAX_COL_LEN BIGINT(20), NUM_TRUES BIGINT(20), NUM_FALSES BIGINT(20), LAST_ANALYZED BIGINT(20)
PART_COL_STATS: CS_ID BIGINT(20), DB_NAME VARCHAR(128), TABLE_NAME VARCHAR(128), PARTITION_NAME VARCHAR(767), COLUMN_NAME VARCHAR(128), COLUMN_TYPE VARCHAR(128), PART_ID BIGINT(20), LONG_LOW_VALUE BIGINT(20), LONG_HIGH_VALUE BIGINT(20), DOUBLE_HIGH_VALUE DOUBLE(53,4), DOUBLE_LOW_VALUE DOUBLE(53,4), BIG_DECIMAL_LOW_VALUE VARCHAR(4000), BIG_DECIMAL_HIGH_VALUE VARCHAR(4000), NUM_NULLS BIGINT(20), NUM_DISTINCTS BIGINT(20), AVG_COL_LEN DOUBLE(53,4), MAX_COL_LEN BIGINT(20), NUM_TRUES BIGINT(20), NUM_FALSES BIGINT(20), LAST_ANALYZED BIGINT(20)
TYPES: TYPES_ID BIGINT(20), TYPE_NAME VARCHAR(128), TYPE1 VARCHAR(767), TYPE2 VARCHAR(767)
TYPE_FIELDS: TYPE_NAME BIGINT(20), COMMENT VARCHAR(256), FIELD_NAME VARCHAR(128), FIELD_TYPE VARCHAR(767), INTEGER_IDX INT(11)
MASTER_KEYS: KEY_ID INT, MASTER_KEY VARCHAR(767)
DELEGATION_TOKENS: TOKEN_IDENT VARCHAR(767), TOKEN VARCHAR(767)
VERSION: VER_ID BIGINT, SCHEMA_VERSION VARCHAR(127), VERSION_COMMENT VARCHAR(255)
FUNCS: FUNC_ID BIGINT(20), CLASS_NAME VARCHAR(4000), CREATE_TIME INT(11), DB_ID BIGINT(20), FUNC_NAME VARCHAR(128), FUNC_TYPE INT(11), OWNER_NAME VARCHAR(128), OWNER_TYPE VARCHAR(10)
FUNC_RU: FUNC_ID BIGINT(20), RESOURCE_TYPE INT(11), RESOURCE_URI VARCHAR(4000), INTEGER_IDX INT(11)
> 700 metastore queries to plan
TPC-DS query 27!!!
• Object Relational Modeling is an impedance mismatch
• The need to work across different DBs limits tuning opportunities
• No caching of catalog objects or stats in HiveServer2 or Hive metastore
• Hadoop nodes cannot contact RDBMS directly due to scale issues
• Solution: use HBase
– Can store object directly, no need to normalize
– Already scales, performs, etc.
– Can store additional data not stored today due to RDBMS capacity limitations
– Can access the metadata from the cluster (e.g. LLAP, Tez AM)
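The "store object directly, no need to normalize" point can be illustrated by counting round trips. The sketch below uses a dict-backed stand-in for both stores; the relation and column names follow the RDBMS schema shown earlier, while `CountingStore`, the `db.table` row key, and the JSON encoding are assumptions made for the example, not the metastore's actual layout.

```python
import json


class CountingStore:
    """Dict-backed stand-in for a database; counts read round trips."""

    def __init__(self):
        self.data = {}
        self.round_trips = 0

    def put(self, key, value):
        self.data[key] = value

    def get(self, key):
        self.round_trips += 1
        return self.data[key]


def load_table_normalized(store, tbl_id):
    """Assemble one table object from five normalized relations."""
    tbl = store.get(("TBLS", tbl_id))
    sd = store.get(("SDS", tbl["SD_ID"]))
    cols = store.get(("COLUMNS_V2", sd["CD_ID"]))
    serde = store.get(("SERDES", sd["SERDE_ID"]))
    params = store.get(("TABLE_PARAMS", tbl_id))
    return {
        "name": tbl["TBL_NAME"],
        "location": sd["LOCATION"],
        "columns": cols,
        "serde": serde["SLIB"],
        "params": params,
    }


def load_table_denormalized(store, db_name, tbl_name):
    """Fetch the whole serialized table object with a single keyed read."""
    return json.loads(store.get(f"{db_name}.{tbl_name}"))


def build_demo_stores():
    """Load the same (made-up) table into a normalized and a key-value store."""
    table = {
        "name": "web_sales",
        "location": "/warehouse/web_sales",
        "columns": [["ws_item_sk", "bigint"], ["ws_quantity", "int"]],
        "serde": "OrcSerde",
        "params": {"transactional": "false"},
    }
    rdbms = CountingStore()
    rdbms.put(("TBLS", 1), {"TBL_NAME": table["name"], "SD_ID": 10})
    rdbms.put(("SDS", 10), {"CD_ID": 20, "SERDE_ID": 30,
                            "LOCATION": table["location"]})
    rdbms.put(("COLUMNS_V2", 20), table["columns"])
    rdbms.put(("SERDES", 30), {"SLIB": table["serde"]})
    rdbms.put(("TABLE_PARAMS", 1), table["params"])
    kv = CountingStore()
    kv.put("tpcds.web_sales", json.dumps(table))
    return rdbms, kv
```

Both loaders return the same object, but the normalized path needs one read per relation. Multiplied across every table, partition, and statistic a planner touches, that is how the query count grows.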
But...
• HBase does not have transactions –
metastore needs them
– Tephra, Omid 2 (Yahoo), others working on this
• HBase is hard to administer and install
– Yes, we will need to improve this
– We will also need embedded option for test/POC
setups to keep HBase from becoming barrier to
adoption
• Basically any work we need to do to HBase
for this is good since it benefits all HBase
users
HBase Metastore: How
• HBaseStore, a new implementation of RawStore that stores data in
HBase
• Not default, users still free to use RDBMS
• Fewer than 10 tables in HBase
– DBS, TBLS, PARTITIONS, ... – basically one for each object type
– Common partition data factored out to significantly reduce size
• Layout highly optimized for SELECT and DML queries, longer
operations moved into DDL (e.g. grant)
• Extensive caching
– Of data catalog objects for length of a query
– Of aggregated stats across queries and users
• Ongoing work in hbase-metastore branch
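One way to picture "common partition data factored out" is to store each distinct storage descriptor once and have partitions refer to it by a content hash. The following is a minimal sketch under that assumption. `PartitionStore`, its field names, and the use of SHA-256 over JSON are hypothetical choices for illustration, not a description of HBaseStore.

```python
import hashlib
import json


class PartitionStore:
    """Toy store that shares identical storage descriptors between partitions."""

    def __init__(self):
        self.descriptors = {}  # content hash -> shared descriptor
        self.partitions = {}   # (db, table, partition name) -> small record

    def add_partition(self, db, table, part_name, location, descriptor):
        # Hash a canonical encoding so equal descriptors map to the same key.
        blob = json.dumps(descriptor, sort_keys=True).encode("utf-8")
        sd_hash = hashlib.sha256(blob).hexdigest()
        self.descriptors.setdefault(sd_hash, descriptor)
        self.partitions[(db, table, part_name)] = {
            "location": location,  # the only per-partition field in this sketch
            "sd_hash": sd_hash,
        }

    def get_partition(self, db, table, part_name):
        record = self.partitions[(db, table, part_name)]
        merged = dict(self.descriptors[record["sd_hash"]])
        merged["location"] = record["location"]
        return merged


if __name__ == "__main__":
    store = PartitionStore()
    shared = {"input_format": "orc", "serde": "OrcSerde", "num_buckets": 4}
    for day in ("2015-04-01", "2015-04-02", "2015-04-03"):
        store.add_partition("tpcds", "web_sales", f"ds={day}",
                            f"/warehouse/web_sales/ds={day}", shared)
    print(len(store.partitions), "partitions,",
          len(store.descriptors), "stored descriptor")
```

With thousands of partitions that differ only in location, the stored size shrinks roughly in proportion to the share of fields that are common.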
Apache Phoenix: Putting SQL Back in NoSQL
• SQL layer on top of HBase
• Originally oriented toward transaction processing
• Moving to add more analytics type operators
– Adding multiple join implementations
– Requests for OLAP functions (PHOENIX-154)
• Working on adding transactions (PHOENIX-1674)
• Moving to Apache Calcite for optimization (PHOENIX-1488)
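To give a feel for what a SQL layer over a key-value store has to do, the sketch below maps a relational row to a row key plus column cells. It is deliberately simplified: the zero-byte separator, the string encoding of values, the default family name, and the function name `row_to_cells` are assumptions for illustration and should not be read as Phoenix's actual on-disk format.

```python
def row_to_cells(row, pk_columns, family="0"):
    """Map a SQL-style row to (row_key, cells) for a key-value store.

    The primary key columns are concatenated into the row key; every other
    column becomes one cell named "family:qualifier".
    """
    missing = [c for c in pk_columns if c not in row]
    if missing:
        raise KeyError(f"missing primary key columns: {missing}")
    # Join the key parts with a zero byte so adjacent values stay separable.
    row_key = b"\x00".join(str(row[c]).encode("utf-8") for c in pk_columns)
    cells = {
        f"{family}:{col}": str(val).encode("utf-8")
        for col, val in row.items()
        if col not in pk_columns
    }
    return row_key, cells


if __name__ == "__main__":
    key, cells = row_to_cells(
        {"host": "web01", "ts": 1700, "cpu": 0.5},
        pk_columns=["host", "ts"],
    )
    print(key, cells)
```

Because rows sort by key, a point lookup or a range scan on a primary key prefix touches only adjacent rows, which suits transaction-style access.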
What If?
• We could share one O/JDBC driver?
• We could share one SQL dialect?
• Phoenix could leverage the extensive analytics
functionality in Hive without re-inventing it?
• Users could access their transactional and
analytics data in single SQL operations?
How?
• Insight #1: LLAP is a storage plus operations
server for Hive; we can swap it out for other
implementations
• Insight #2: Tez and Spark can do post-shuffle
operations (hash join, etc.) with LLAP or HBase
• Insight #3: Calcite (used by both Hive and
Phoenix) is built specifically to integrate
disparate data storage systems
Vision
• User picks storage location for table in create
table (LLAP or HBase)
• Transactions more efficient in HBase tables but
work in both
• Analytics more efficient in LLAP tables but work
in both
• Queries that require shuffle use Tez or Spark for
post shuffle operators
[Figure: queries arrive at a JDBC server, where Calcite is used for planning and Phoenix is used for execution; nodes run HBase and LLAP on top of HDFS.]
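The vision above can be sketched as a toy planner: each table records the storage engine its creator picked, scans go to that engine, and a shuffle adds a Tez or Spark stage. Everything here (`Table`, `plan`, the step strings) is hypothetical and only mirrors the bullets; it does not describe an existing Hive, Phoenix, or Calcite API.

```python
from dataclasses import dataclass

STORAGE_ENGINES = ("LLAP", "HBASE")


@dataclass(frozen=True)
class Table:
    name: str
    storage: str  # chosen by the user when the table is created

    def __post_init__(self):
        if self.storage not in STORAGE_ENGINES:
            raise ValueError(f"unknown storage engine: {self.storage}")


def plan(tables, needs_shuffle):
    """Return the execution steps for a query over the given tables."""
    # Each scan runs on whichever engine stores the table.
    steps = [f"scan {t.name} on {t.storage}" for t in tables]
    if needs_shuffle:
        # Joins and aggregations that need a shuffle run after the scans.
        steps.append("post-shuffle operators on Tez or Spark")
    return steps


if __name__ == "__main__":
    orders = Table("orders", "HBASE")                # transactional table
    sales_history = Table("sales_history", "LLAP")   # analytics table
    for step in plan([orders, sales_history], needs_shuffle=True):
        print(step)
```

A single query can therefore mix both kinds of table, which is the "single SQL operations" goal from the earlier slide.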
Hurdles
• Need to integrate types/data representation
• Need to integrate transaction management
• Work to do in Calcite to optimize transactional queries well

Hive & HBase for Transaction Processing, Hadoop Summit EU, Apr 2015

LLAP: Sub-Second Analytical Queries in Hive
 
Apache Hive 2.0; SQL, Speed, Scale
Apache Hive 2.0; SQL, Speed, ScaleApache Hive 2.0; SQL, Speed, Scale
Apache Hive 2.0; SQL, Speed, Scale
 
Real time fraud detection at 1+M scale on hadoop stack
Real time fraud detection at 1+M scale on hadoop stackReal time fraud detection at 1+M scale on hadoop stack
Real time fraud detection at 1+M scale on hadoop stack
 
LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
 
Large-scale Web Apps @ Pinterest
Large-scale Web Apps @ PinterestLarge-scale Web Apps @ Pinterest
Large-scale Web Apps @ Pinterest
 

Recently uploaded

Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, BetterWebinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
XfilesPro
 
Atelier - Innover avec l’IA Générative et les graphes de connaissances
Atelier - Innover avec l’IA Générative et les graphes de connaissancesAtelier - Innover avec l’IA Générative et les graphes de connaissances
Atelier - Innover avec l’IA Générative et les graphes de connaissances
Neo4j
 
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus
 
Graspan: A Big Data System for Big Code Analysis
Graspan: A Big Data System for Big Code AnalysisGraspan: A Big Data System for Big Code Analysis
Graspan: A Big Data System for Big Code Analysis
Aftab Hussain
 
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
Shahin Sheidaei
 
May Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdfMay Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdf
Adele Miller
 
Top 7 Unique WhatsApp API Benefits | Saudi Arabia
Top 7 Unique WhatsApp API Benefits | Saudi ArabiaTop 7 Unique WhatsApp API Benefits | Saudi Arabia
Top 7 Unique WhatsApp API Benefits | Saudi Arabia
Yara Milbes
 
BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024
Ortus Solutions, Corp
 
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdf
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdfAutomated software refactoring with OpenRewrite and Generative AI.pptx.pdf
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdf
timtebeek1
 
openEuler Case Study - The Journey to Supply Chain Security
openEuler Case Study - The Journey to Supply Chain SecurityopenEuler Case Study - The Journey to Supply Chain Security
openEuler Case Study - The Journey to Supply Chain Security
Shane Coughlan
 
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data AnalysisProviding Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
Globus
 
Large Language Models and the End of Programming
Large Language Models and the End of ProgrammingLarge Language Models and the End of Programming
Large Language Models and the End of Programming
Matt Welsh
 
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Globus
 
Text-Summarization-of-Breaking-News-Using-Fine-tuning-BART-Model.pptx
Text-Summarization-of-Breaking-News-Using-Fine-tuning-BART-Model.pptxText-Summarization-of-Breaking-News-Using-Fine-tuning-BART-Model.pptx
Text-Summarization-of-Breaking-News-Using-Fine-tuning-BART-Model.pptx
ShamsuddeenMuhammadA
 
How to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good PracticesHow to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good Practices
Globus
 
GOING AOT WITH GRAALVM FOR SPRING BOOT (SPRING IO)
GOING AOT WITH GRAALVM FOR  SPRING BOOT (SPRING IO)GOING AOT WITH GRAALVM FOR  SPRING BOOT (SPRING IO)
GOING AOT WITH GRAALVM FOR SPRING BOOT (SPRING IO)
Alina Yurenko
 
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoamOpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
takuyayamamoto1800
 
A Study of Variable-Role-based Feature Enrichment in Neural Models of Code
A Study of Variable-Role-based Feature Enrichment in Neural Models of CodeA Study of Variable-Role-based Feature Enrichment in Neural Models of Code
A Study of Variable-Role-based Feature Enrichment in Neural Models of Code
Aftab Hussain
 
Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...
Globus
 
Lecture 1 Introduction to games development
Lecture 1 Introduction to games developmentLecture 1 Introduction to games development
Lecture 1 Introduction to games development
abdulrafaychaudhry
 

Recently uploaded (20)

Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, BetterWebinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
 
Atelier - Innover avec l’IA Générative et les graphes de connaissances
Atelier - Innover avec l’IA Générative et les graphes de connaissancesAtelier - Innover avec l’IA Générative et les graphes de connaissances
Atelier - Innover avec l’IA Générative et les graphes de connaissances
 
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024
 
Graspan: A Big Data System for Big Code Analysis
Graspan: A Big Data System for Big Code AnalysisGraspan: A Big Data System for Big Code Analysis
Graspan: A Big Data System for Big Code Analysis
 
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
 
May Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdfMay Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdf
 
Top 7 Unique WhatsApp API Benefits | Saudi Arabia
Top 7 Unique WhatsApp API Benefits | Saudi ArabiaTop 7 Unique WhatsApp API Benefits | Saudi Arabia
Top 7 Unique WhatsApp API Benefits | Saudi Arabia
 
BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024
 
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdf
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdfAutomated software refactoring with OpenRewrite and Generative AI.pptx.pdf
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdf
 
openEuler Case Study - The Journey to Supply Chain Security
openEuler Case Study - The Journey to Supply Chain SecurityopenEuler Case Study - The Journey to Supply Chain Security
openEuler Case Study - The Journey to Supply Chain Security
 
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data AnalysisProviding Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
 
Large Language Models and the End of Programming
Large Language Models and the End of ProgrammingLarge Language Models and the End of Programming
Large Language Models and the End of Programming
 
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
 
Text-Summarization-of-Breaking-News-Using-Fine-tuning-BART-Model.pptx
Text-Summarization-of-Breaking-News-Using-Fine-tuning-BART-Model.pptxText-Summarization-of-Breaking-News-Using-Fine-tuning-BART-Model.pptx
Text-Summarization-of-Breaking-News-Using-Fine-tuning-BART-Model.pptx
 
How to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good PracticesHow to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good Practices
 
GOING AOT WITH GRAALVM FOR SPRING BOOT (SPRING IO)
GOING AOT WITH GRAALVM FOR  SPRING BOOT (SPRING IO)GOING AOT WITH GRAALVM FOR  SPRING BOOT (SPRING IO)
GOING AOT WITH GRAALVM FOR SPRING BOOT (SPRING IO)
 
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoamOpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
 
A Study of Variable-Role-based Feature Enrichment in Neural Models of Code
A Study of Variable-Role-based Feature Enrichment in Neural Models of CodeA Study of Variable-Role-based Feature Enrichment in Neural Models of Code
A Study of Variable-Role-based Feature Enrichment in Neural Models of Code
 
Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...
 
Lecture 1 Introduction to games development
Lecture 1 Introduction to games developmentLecture 1 Introduction to games development
Lecture 1 Introduction to games development
 

Hive & HBase for Transaction Processing Hadoop Summit EU Apr 2015

  • 1. Hive & HBase For Transaction Processing, Alan Gates (@alanfgates)
  • 2. Agenda • Our goal – Combine Apache Hive, HBase, Phoenix, and Calcite to build a single data store that can be used for analytics and transaction processing • But before we get to that we need to consider – Some things happening in Hive – Some things happening in Phoenix
  • 3. Agenda • Our goal – Combine Apache Hive, HBase, Phoenix, and Calcite to build a single data store that can be used for analytics and transaction processing • But before we get to that we need to consider – Some things happening in Hive – Some things happening in Phoenix
  • 4. A Brief History of Hive • Initial goal was to make it easy to execute MapReduce using a familiar language: SQL – Most queries took minutes or hours – Primarily used for batch ETL jobs • Since 0.11 much has been done to support interactive and ad hoc queries – Many new features focused on improving performance: ORC and Parquet, Tez and Spark, vectorization – As of Hive 0.14 (November 2014) TPC-DS query 3 (star-join, group, order, limit) using ORC, Tez, and vectorization finishes in 9s for 200GB scale and 32s for 30TB scale – Still have ~2-5 second minimum for all queries • Ongoing performance work with goal of reaching sub-second response time – Continued investment in vectorization – LLAP (Live Long And Process) – Using Apache HBase for metastore
  • 5. LLAP: Why? • It is hard to be fast and flexible in Tez – When a SQL session starts, a Tez AM is spun up (first-query cost) – For subsequent queries Tez containers can be either pre-allocated (fast but not flexible) or allocated and released for each query (flexible, but with a start-up cost for every query) • No caching of data between queries – Even if data is in the OS cache, much of the IO cost is deserialization/vector marshaling, which is not shared
  • 6. LLAP: What • LLAP is a node resident daemon process – Low latency by reducing setup cost – Multi-threaded engine that runs smaller tasks for query including reads, filter and some joins – Use regular Tez tasks for larger shuffle and other operators • LLAP has in-memory columnar data cache – High throughput IO using Async IO Elevator with dedicated thread and core per disk – Low latency by providing data from in-memory (off heap) cache instead of going to HDFS – Store data in columnar format for vectorization irrespective of underlying file type – Security enforced across queries and users • Uses YARN for resource management [Diagram: a node whose LLAP process runs a query fragment against the LLAP in-memory columnar cache, backed by HDFS; caption: LLAP process running a task for a query]
  • 7. LLAP: What [Diagram: a node whose LLAP process runs a query fragment against the LLAP in-memory columnar cache, backed by HDFS; caption: LLAP process running read task for a query] [Diagram: a Hive query spread across several nodes, each running LLAP; caption: LLAP process runs on multiple nodes, accelerating Tez tasks]
  • 8. LLAP: Is and Is Not • It is not MPP – Data not shuffled between LLAP nodes (except in limited cases) • It is not a replacement for Tez or Spark – Configured engine still used to launch tasks for post-shuffle operations (e.g. hash joins, distributed aggregations, etc.) • It is not required; users can still use Hive without installing LLAP daemons • It is a Map server, or a set of standing map tasks • It is currently under development on the llap branch
  • 9. HBase Metastore: Why?
  • 10. HBase Metastore: Why? [Figure: entity-relationship diagram of the RDBMS-backed Hive metastore schema, showing the columns and indexes of each table. Tables shown: BUCKETING_COLS, CDS, COLUMNS_V2, DATABASE_PARAMS, DBS, DB_PRIVS, GLOBAL_PRIVS, IDXS, INDEX_PARAMS, NUCLEUS_TABLES, PARTITIONS, PARTITION_EVENTS, PARTITION_KEYS, PARTITION_KEY_VALS, PARTITION_PARAMS, PART_COL_PRIVS, PART_PRIVS, ROLES, ROLE_MAP, SDS, SD_PARAMS, SEQUENCE_TABLE, SERDES, SERDE_PARAMS, SKEWED_COL_NAMES, SKEWED_COL_VALUE_LOC_MAP, SKEWED_STRING_LIST, SKEWED_STRING_LIST_VALUES, SKEWED_VALUES, SORT_COLS, TABLE_PARAMS, TBLS, TBL_COL_PRIVS, TBL_PRIVS, TAB_COL_STATS, PART_COL_STATS, TYPES, TYPE_FIELDS, MASTER_KEYS, DELEGATION_TOKENS, VERSION, FUNCS, FUNC_RU]
  • 11. HBase Metastore: Why? More than 700 metastore queries to plan TPC-DS query 27!
  • 12. HBase Metastore: Why? • Object Relational Modeling is an impedance mismatch • The need to work across different DBs limits tuning opportunities • No caching of catalog objects or stats in HiveServer2 or Hive metastore • Hadoop nodes cannot contact RDBMS directly due to scale issues • Solution: use HBase – Can store object directly, no need to normalize – Already scales, performs, etc. – Can store additional data not stored today due to RDBMS capacity limitations – Can access the metadata from the cluster (e.g. LLAP, Tez AM)
  • 13. But... • HBase does not have transactions, but the metastore needs them – Tephra, Omid 2 (Yahoo), others working on this • HBase is hard to administer and install – Yes, we will need to improve this – We will also need an embedded option for test/POC setups to keep HBase from becoming a barrier to adoption • Basically any work we need to do to HBase for this is good since it benefits all HBase users
  • 14. HBase Metastore: How • HBaseStore, a new implementation of RawStore that stores data in HBase • Not default, users still free to use RDBMS • Less than 10 tables in HBase – DBS, TBLS, PARTITIONS, ... (basically one for each object type) – Common partition data factored out to significantly reduce size • Layout highly optimized for SELECT and DML queries, longer operations moved into DDL (e.g. grant) • Extensive caching – Of data catalog objects for length of a query – Of aggregated stats across queries and users • Ongoing work in hbase-metastore branch
  • 15. Agenda • Our goal – Combine Apache Hive, HBase, Phoenix, and Calcite to build a single data store that can be used for analytics and transaction processing • But before we get to that we need to consider – Some things happening in Hive – Some things happening in Phoenix
  • 16. Apache Phoenix: Putting SQL Back in NoSQL • SQL layer on top of HBase • Originally oriented toward transaction processing • Moving to add more analytics type operators – Adding multiple join implementations – Requests for OLAP functions (PHOENIX-154) • Working on adding transactions (PHOENIX-1674) • Moving to Apache Calcite for optimization (PHOENIX-1488)
  • 17. Agenda • Our goal – Combine Apache Hive, HBase, Phoenix, and Calcite to build a single data store that can be used for analytics and transaction processing • But before we get to that we need to consider – Some things happening in Hive – Some things happening in Phoenix
  • 18. What If? • We could share one O/JDBC driver? • We could share one SQL dialect? • Phoenix could leverage extensive analytics functionality in Hive without re-inventing it? • Users could access their transactional and analytics data in single SQL operations?
  • 19. How? • Insight #1: LLAP is a storage plus operations server for Hive; we can swap it out for other implementations • Insight #2: Tez and Spark can do post-shuffle operations (hash join, etc.) with LLAP or HBase • Insight #3: Calcite (used by both Hive and Phoenix) is built specifically to integrate disparate data storage systems
  • 20. Vision • User picks storage location for table in create table (LLAP or HBase) • Transactions more efficient in HBase tables but work in both • Analytics more efficient in LLAP tables but work in both • Queries that require shuffle use Tez or Spark for post-shuffle operators [Diagram: a JDBC server sends queries to nodes running HBase and LLAP on top of HDFS; Calcite used for planning, Phoenix used for execution]
  • 21. Hurdles • Need to integrate types/data representation • Need to integrate transaction management • Work to do in Calcite to optimize transactional queries well
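
The "HBase Metastore: Why?" slides argue that a normalized relational layout forces many small lookups to assemble one catalog object, whereas a key-value store "can store object directly, no need to normalize". The following is a minimal Python sketch of that argument only. It assumes a hypothetical four-table subset of the schema (TBLS, DBS, SDS, COLUMNS_V2), uses in-memory dicts as stand-ins for both the RDBMS and HBase, and invents a `db.table` row key for illustration; it is not the HBaseStore implementation and calls no Hive or HBase API.

```python
import json

# Hypothetical, simplified stand-in for the relational layout: each dict plays
# the role of one table, keyed by its ID column.
TBLS = {1: {"TBL_NAME": "sales", "DB_ID": 10, "SD_ID": 100}}
DBS = {10: {"NAME": "default"}}
SDS = {100: {"CD_ID": 1000, "LOCATION": "hdfs:///warehouse/sales"}}
COLUMNS_V2 = {1000: [("id", "bigint"), ("amount", "double")]}


class CountingStore:
    """Wraps a dict of 'tables' and counts every lookup issued against it."""

    def __init__(self, tables):
        self.tables = tables
        self.lookups = 0

    def get(self, table, key):
        self.lookups += 1
        return self.tables[table][key]


def load_table_normalized(store, tbl_id):
    """Assemble one table object by following foreign keys, one lookup per hop."""
    tbl = store.get("TBLS", tbl_id)
    db = store.get("DBS", tbl["DB_ID"])
    sd = store.get("SDS", tbl["SD_ID"])
    cols = store.get("COLUMNS_V2", sd["CD_ID"])
    return {
        "db": db["NAME"],
        "name": tbl["TBL_NAME"],
        "location": sd["LOCATION"],
        "columns": [list(c) for c in cols],
    }


def load_table_denormalized(store, db_name, tbl_name):
    """Fetch the whole serialized object with a single keyed get."""
    # The "db.table" row key is an assumption made for this sketch.
    return json.loads(store.get("TBLS", f"{db_name}.{tbl_name}"))


relational = CountingStore(
    {"TBLS": TBLS, "DBS": DBS, "SDS": SDS, "COLUMNS_V2": COLUMNS_V2}
)
obj = load_table_normalized(relational, 1)

# Store the assembled object under one row key, as a key-value store could.
kv = CountingStore({"TBLS": {"default.sales": json.dumps(obj)}})
same_obj = load_table_denormalized(kv, "default", "sales")

print(relational.lookups, kv.lookups)
```

In this toy setup the normalized path issues four lookups for one table and the denormalized path issues one, and both return the same object. A real query planner touches many tables, partitions, and statistics, which is how the count can grow into the hundreds, as the slide on TPC-DS query 27 reports.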