HAWQ meets Hive: Querying Unmanaged Data

© 2017 Pivotal Software, Inc. All rights reserved.
Querying Unmanaged Data
HAWQ meets Hive
Shivram Mani
Oleksandr Diachenko

Agenda
● Overview of Apache HAWQ (incubating)
● HAWQ Architecture
● HAWQ Extension Framework
● HAWQ Hive Integration
● HAWQ HCatalog Integration

Apache HAWQ’s Lineage
Timeline (1986 to 2015):
● Postgres developed at UC Berkeley
● Postgres adds support for SQL
● Open Source PostgreSQL
● PostgreSQL 7.0 released
● PostgreSQL 8.0 released
● Greenplum based on PostgreSQL
● Hadoop 1.0 released
● HAWQ project launched
● Hadoop 2.0 released
● HAWQ goes open-source (Apache)

HAWQ Overview
Advanced Analytics MPP Database for Enterprises
● Multi-level Fault Tolerance
● Granular Authorization
● Resource Mgmt (+ YARN)
● Multi-tenancy + Security
● ANSI SQL Standard
● OLAP Extensions
● JDBC ODBC Connectivity
● Online Expansion
● Hadoop / HDFS
● Operations
● Cost Based Optimizer (ORCA)
● Dynamic Pipelining
● ACID + Transactional
● MPP Architecture
● Data Federation
● Language Extensions
● Extensibility
● HDFS Native File Formats
● Compression + Partitioning
● Core
● Connectivity
● Ambari Management
● Machine Learning
- Enable Data Science
- Large Scale Analytics
- Query All Data Types & sources
- Manage Multiple Workloads
- Security controls
- Well Integrated
- Leverage Existing SQL Skills & BI Tools
- High-performance

HAWQ Components
HAWQ Master (1)
● Metadata
● Transaction Mgr.
● Query Parser
● Query Optimizer
● Resource Mgr.
● NN cache
● Query Dispatch
● Fault Tolerant Svc
HAWQ Segment (1..N)
● Postmaster
● Local directory (Temp Data / Logs)
● Virtual Segments (Query Executors)
● libhdfs3
● Datanode / YARN NM
HAWQ Standby Master (1)

Query Execution (Native)
[Diagram: HAWQ Master (Metadata, Transaction Mgr., Resource Mgr., NN Cache, Postmaster) with the NameNode and YARN RM; Server 1, Server 2 ... Server N each run a HAWQ Segment (Postmaster), an HDFS Datanode and a local directory; the nodes are linked by the Interconnect.]

Query Execution - Plan
[Diagram: HAWQ Master (Metadata, Transaction Mgr., NN Cache, Resource Mgr., Query Dispatch, Postmaster) with the NameNode and YARN RM; three HAWQ Segments, each with a Postmaster and an HDFS Datanode.]

Query Execution - Resource
[Diagram: HAWQ Master (Metadata, Transaction Mgr., NN Cache, Resource Mgr., Query Dispatch, Postmaster) with the NameNode and YARN RM; three HAWQ Segments, each with a Postmaster and an HDFS Datanode; five virtual segments (VS).]
Request: "I need 5 containers, each with 1 CPU core and 1 GB RAM"
Response: Server 1: 2 containers; Server 2: 1 container; Server N: 2 containers
VS = Virtual Segment (container for Query Executors)
# of QEs in a v-seg = # of slices in a query
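The exchange on this slide (a request for five containers of 1 CPU core and 1 GB RAM, answered with 2 + 1 + 2 containers across three servers) can be sketched as a small allocation routine. This is an illustrative sketch only, not HAWQ's resource manager: the function names, the round-robin policy and the free-capacity figures are assumptions made for the example.

```python
from collections import OrderedDict


def allocate_virtual_segments(free_slots, requested):
    """Spread `requested` containers over hosts, one per host per round.

    free_slots maps a host name to the number of (1 core, 1 GB) containers
    that host can still accept. Hypothetical policy, for illustration only.
    """
    remaining = dict(free_slots)
    allocation = OrderedDict((host, 0) for host in free_slots)
    granted = 0
    while granted < requested:
        progressed = False
        for host in free_slots:
            if granted == requested:
                break
            if remaining[host] > 0:
                remaining[host] -= 1
                allocation[host] += 1
                granted += 1
                progressed = True
        if not progressed:
            raise RuntimeError("cluster cannot satisfy the request")
    return dict(allocation)


def total_query_executors(num_virtual_segments, num_slices):
    """One query executor (QE) per plan slice in every virtual segment."""
    return num_virtual_segments * num_slices


# Assumed free capacity: Server 2 has room for only one container.
plan = allocate_virtual_segments(
    {"Server 1": 2, "Server 2": 1, "Server N": 2}, requested=5
)
print(plan)
print(total_query_executors(num_virtual_segments=5, num_slices=3))
```

With the assumed capacities, the sketch reproduces the 2 / 1 / 2 split shown on the slide. The second function only restates the rule that the number of QEs in a virtual segment equals the number of slices in the query.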

Query Execution - Prepare
[Diagram: HAWQ Master (Metadata, Transaction Mgr., NN Cache, Resource Mgr., Query Dispatch, Postmaster) with the NameNode and YARN RM; HAWQ Segments (Postmaster, HDFS Datanode) on Server 1, Server 2 ... Server N, each with a local directory; five virtual segments (VS).]

Query Execution - Execute
[Diagram: HAWQ Master (Metadata, Transaction Mgr., NN Cache, Resource Mgr., Query Dispatch, Postmaster) with the NameNode and YARN RM; HAWQ Segments (Postmaster, HDFS Datanode) on Server 1, Server 2 ... Server N, each with a local directory; five virtual segments (VS).]

Query Execution - Result
[Diagram: HAWQ Master (Metadata, Transaction Mgr., NN Cache, Resource Mgr., Query Dispatch, Postmaster) with the NameNode and YARN RM; HAWQ Segments (Postmaster, HDFS Datanode) on Server 1, Server 2 ... Server N, each with a local directory; five virtual segments (VS).]

Reasons why HAWQ is high-performance
● Highly efficient MPP (massively parallel processing) heritage and architecture
● Dynamic pipelining, no intermediate writes to disk
● Advanced cost-based optimizer
● Scalable and fast Interconnect
● Native (C++) HDFS access/scan speed
● HDFS metadata cache
● Optimal data locality matching methods

TPC-DS benchmark
[Chart: TPC-DS Queries with 5-Users, run time in seconds*]
• HAWQ ~1.3x faster
• Competing MPP Hadoop engine failed to complete 47% of the queries (unmodified)
[Grid: TPC-DS queries 1 to 99, marking each test query that failed in the other engine as Unsupported SQL, Long running killed, or Memory Limit Exceeded]
* Queries that did not complete are omitted from results on both platforms

Managed vs Unmanaged data
Managed data
Unmanaged data
Metadata Metadata
???

HAWQ eXtension Framework (aka PXF)
● Uniform tabular view to heterogeneous data sources
● Exploits parallelism for data access
● Pluggable framework for custom connectors (profiles)
● Built-in connectors for various data sources/formats

PXF Architecture
[Diagram: PXF as a Tomcat (Webapp); labels: External Tables, REST API, Java API, Java/Thrift]
➔ Independent JVM
➔ Runs alongside namenode and datanodes
● JDBC
● Solr
● Redis
● Cassandra
● GemfireXD

Query Execution (External Data)
[Diagram: HAWQ Master (Postmaster) and NameNode; three HAWQ Segments, each with a Postmaster and an HDFS Datanode.]

Query Planning - Distribution
[Diagram: HAWQ Master (Postmaster) and NameNode with PXF; three HAWQ Segments, each with a Postmaster and an HDFS Datanode.]
● Get Partition Metadata: {P1, P2, P3, P4, P5}
● Planner / Partition Mapper: {P1, P4} {P5} {P2, P3}
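The mapping step on this slide, where five partitions are split into {P1, P4}, {P5} and {P2, P3} across three segments, can be sketched as follows. This is a minimal illustration under assumptions: the slide does not state the mapping policy, so the sketch assumes a locality-first, least-loaded rule, and the function name, host names and replica placement are all hypothetical.

```python
def map_fragments(fragment_hosts, segments):
    """Assign each fragment to one segment host.

    fragment_hosts maps a fragment name to the hosts that store it.
    A fragment goes to the least-loaded segment that holds it locally,
    or to the least-loaded segment overall if none is local.
    (Assumed policy, for illustration only.)
    """
    assignment = {segment: [] for segment in segments}
    for fragment, hosts in fragment_hosts.items():
        local = [host for host in hosts if host in assignment]
        candidates = local or list(segments)
        target = min(candidates, key=lambda host: len(assignment[host]))
        assignment[target].append(fragment)
    return assignment


# Hypothetical replica placement chosen to reproduce the slide's split.
placement = {
    "P1": ["seg1"],
    "P2": ["seg3"],
    "P3": ["seg3"],
    "P4": ["seg1"],
    "P5": ["seg2"],
}
mapping = map_fragments(placement, ["seg1", "seg2", "seg3"])
print(mapping)
```

Each segment then reads only its own fragments, which is what lets the external read run in parallel.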

Query Execution - Read
[Diagram: HAWQ Master (Postmaster); NameNode with PXF; three HAWQ Segments, each with a Postmaster, an HDFS Datanode and PXF; five virtual segments (VS) and partitions P1, P2, P3, P4, P5.]

Query Execution - Result
[Diagram: HAWQ Master (Postmaster); NameNode with PXF; HAWQ Segments (Postmaster, HDFS Datanode) on Server 1, Server 2 ... Server N, each with a local directory; five virtual segments (VS); label: Global Aggregate.]

HAWQ-Hive Data Integration
HiveRC
➢ Works for RCFile format
Hive
➢ Works for heterogeneous tables
➢ Supports all formats
➢ Unoptimized
HiveText
➢ Works fast for text data
➢ Lazy data resolution
➢ Only text datatypes are supported
HiveORC
➢ Optimized for ORC data
➢ Leverages predicate pushdown
➢ Column projection
HiveVectorizedORC
➢ Uses ORC Batch API
➢ Sends 1024-row batches to HAWQ
➢ Enables vectorized execution
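The difference between sending one tuple at a time and sending 1024-row batches can be illustrated with a short generator. This is a sketch of the batching idea only; it does not use the ORC Batch API or any PXF code, and the function name is hypothetical. Only the batch size of 1024 comes from the slide.

```python
from itertools import islice


def batches(rows, batch_size=1024):
    """Yield lists of up to `batch_size` rows from any iterable of rows."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# 2500 rows leave as three transfers instead of 2500 single-row transfers.
sizes = [len(batch) for batch in batches(range(2500))]
print(sizes)
```

Fewer, larger transfers amortize per-call overhead and give the receiver a whole batch to work on, which is the precondition for vectorized execution.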

HAWQ-Hive ORC Optimizations
[Diagram: HAWQ Master (Query Dispatch), HAWQ Segment (Postmaster), PXF and the ORC API over a table with columns col1; col2; col3; col4]
Passed to PXF:
● column attributes: col1, col2
● predicate: RPNF {filter(s)}
● aggregate functions
Example queries:
SELECT col1, col2 FROM tab1
WHERE col3 = 'abc';
SELECT COUNT(*) FROM tab1
WHERE col3 = 'abc';
Passed to the ORC API: {col1, col2; col3 = 'abc'}
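The two example queries on this slide combine column projection (col1, col2) with a filter (col3 = 'abc'). The sketch below shows those semantics with a filter written in postfix (reverse Polish) order. It is an illustration under assumptions: the token format is made up for this example and is not PXF's filter encoding, and a real ORC reader applies the projection and filter while reading rather than on rows already in memory.

```python
import operator

# Operators supported by this toy evaluator.
OPERATORS = {
    "=": operator.eq,
    "<": operator.lt,
    ">": operator.gt,
    "AND": lambda left, right: left and right,
    "OR": lambda left, right: left or right,
}


def eval_postfix(tokens, row):
    """Evaluate a postfix filter against one row (a dict of column values)."""
    stack = []
    for kind, value in tokens:
        if kind == "col":
            stack.append(row[value])
        elif kind == "const":
            stack.append(value)
        elif kind == "op":
            right = stack.pop()
            left = stack.pop()
            stack.append(OPERATORS[value](left, right))
        else:
            raise ValueError(f"unknown token kind: {kind}")
    if len(stack) != 1:
        raise ValueError("malformed filter")
    return stack[0]


def scan(rows, columns, predicate):
    """Return only the requested columns of the rows that pass the filter."""
    return [
        tuple(row[column] for column in columns)
        for row in rows
        if eval_postfix(predicate, row)
    ]


tab1 = [
    {"col1": 1, "col2": "x", "col3": "abc", "col4": 0.5},
    {"col1": 2, "col2": "y", "col3": "def", "col4": 1.5},
    {"col1": 3, "col2": "z", "col3": "abc", "col4": 2.5},
]
col3_is_abc = [("col", "col3"), ("const", "abc"), ("op", "=")]

selected = scan(tab1, ["col1", "col2"], col3_is_abc)  # SELECT col1, col2 ...
row_count = len(scan(tab1, [], col3_is_abc))          # SELECT COUNT(*) ...
print(selected, row_count)
```

The COUNT(*) case needs no column values at all, which is why passing the aggregate down along with the filter can save work.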

Optimizations
Statistics
● Exposing statistics about unmanaged tables
● Optimized query plan
Column projection
● Passing requested columns
● Disk I/O is optimized if the data format allows it
Predicate pushdown
● Passing down predicates from the WHERE clause through the PXF framework
● Partitions/stripes/files elimination
Batches vs tuples
● HiveText
● HiveVectorizedORC
● Lazy data resolution
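Stripe elimination, listed under predicate pushdown above, relies on per-stripe minimum and maximum values of a column: a stripe whose range cannot contain the searched value is skipped without being read. The sketch below shows that check for an equality predicate. The class, function and sample statistics are hypothetical and are not taken from the ORC or PXF APIs.

```python
from dataclasses import dataclass


@dataclass
class Stripe:
    """Hypothetical summary of one stripe: name plus min/max of one column."""
    name: str
    min_value: str
    max_value: str


def stripes_to_read(stripes, wanted):
    """Keep the stripes whose [min, max] range can contain `wanted`."""
    return [
        stripe.name
        for stripe in stripes
        if stripe.min_value <= wanted <= stripe.max_value
    ]


# Made-up statistics for column col3 in three stripes.
stats = [
    Stripe("stripe-0", "aaa", "abb"),
    Stripe("stripe-1", "abc", "bzz"),
    Stripe("stripe-2", "caa", "zzz"),
]
needed = stripes_to_read(stats, "abc")  # WHERE col3 = 'abc'
print(needed)
```

The same range test applies one level up to whole files or partitions when their value ranges are known.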

HAWQ-Hive Catalog Integration
Was:
CREATE EXTERNAL TABLE items (column1 int, column2 text)
LOCATION ('pxf://namenode:51200/customer_db?PROFILE=Hive')
FORMAT 'custom' (formatter='pxfwritable_import');
SELECT * FROM items;
● Need to create an external HAWQ table
● Users need to know the HAWQ-Hive data type mapping
● Need to keep both tables' metadata in sync manually
Wanted:
SELECT * FROM items;
● No need to create an external HAWQ table
● Users don't need to know about the HAWQ-Hive data type mapping, etc.
● Metadata is always up to date

Challenges with Catalog Unification
Hive Catalog

Challenges with Catalog Unification
HAWQ Catalog

Where to store HCatalog data in HAWQ
● Requires few HAWQ changes
● Getting all catalog utilities for free
● Catalog is polluted with external data
● HCatalog objects are visible to concurrent sessions
● Session-level isolation
● Cheap cleanup process
● HAWQ Catalog service needs to be changed to be able to work with disk/memory
● Catalog utilities need to be modified to work with HCatalog objects

Object namespaces
[Diagram: object ID axis with marks 0, 2^32 and 10*2^20]
● HAWQ objects: global counter, persistent
● HCatalog objects: one counter per session (Session 1, Session 2 ... Session N), in-memory
● Session states are isolated
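One way to read this slide is that HAWQ objects draw IDs from a single persistent global counter, while HCatalog objects draw IDs from per-session, in-memory counters in a separate part of the ID space. The sketch below illustrates that reading. It rests on a loudly stated assumption: that the 10*2^20 mark denotes a range reserved at the top of the 2^32 ID space for HCatalog objects. The class and method names are hypothetical and do not come from the HAWQ source.

```python
OID_SPACE = 2 ** 32
EXTERNAL_RANGE = 10 * 2 ** 20
# Assumption: the reserved range sits at the top of the ID space.
FIRST_EXTERNAL_OID = OID_SPACE - EXTERNAL_RANGE


class GlobalOidCounter:
    """Shared counter for HAWQ objects (persistent in a real system)."""

    def __init__(self, start=1):
        self._next = start

    def next_oid(self):
        if self._next >= FIRST_EXTERNAL_OID:
            raise OverflowError("HAWQ object ID range exhausted")
        oid = self._next
        self._next += 1
        return oid


class Session:
    """Holds HCatalog objects in memory, with a counter private to the session."""

    def __init__(self, global_counter):
        self._global = global_counter
        self._next_external = FIRST_EXTERNAL_OID
        self.external_objects = {}

    def create_hawq_object(self):
        return self._global.next_oid()

    def register_hcatalog_object(self, name):
        if self._next_external >= OID_SPACE:
            raise OverflowError("HCatalog object ID range exhausted")
        oid = self._next_external
        self._next_external += 1
        self.external_objects[oid] = name
        return oid

    def close(self):
        # Cleanup is cheap: nothing was written to the persistent catalog.
        self.external_objects.clear()


shared = GlobalOidCounter()
session_1, session_2 = Session(shared), Session(shared)
oid_a = session_1.register_hcatalog_object("default.weblogs")
oid_b = session_2.register_hcatalog_object("default.items")
hawq_1 = session_1.create_hawq_object()
hawq_2 = session_2.create_hawq_object()
print(oid_a == oid_b, hawq_1 != hawq_2)
```

Two sessions can hand out the same HCatalog object ID without conflict because neither sees the other's in-memory objects, while IDs for HAWQ objects stay unique across sessions.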

HAWQ-HCatalog Integration
SELECT * FROM hcatalog.default.weblogs
WHERE ts BETWEEN '2015-09-01' AND '2015-09-30';
SELECT COUNT(*) FROM hcatalog.default.weblogs
WHERE ts BETWEEN '2015-09-01' AND '2015-09-30';
[Diagram labels: HIVE with HCAT, holding table Weblogs (id double, ts timestamp, ...); three PXF instances; HAWQ with the HAWQ Catalog service, an In Memory Catalog and a Disk Heap Catalog, where the Weblogs definition (id double, ts timestamp, ...) also appears.]

Apache Hive & HAWQ (via HDB)
The Most Comprehensive SQL on Hadoop
Avoid data duplication:
All processing engines point to the same copy of data
Right Tool for the Job:
Choose the right SQL engine based on your application's needs.
⬢ Apache Hive
● Holds very detailed information
● Integrates all data sources
● Low-Mid Query Latency
● Scales to 100s of petabytes
● Large Community
⬢ Apache HAWQ
● MPP engine at the core
● Easy transition from traditional DB/Warehouse
● Ad-hoc Analytics, BI & Visualization
● Low Query Latency
● Scales from 100s of TB to low PBs
● Machine Learning (MADlib)
Run HAWQ & Hive side by side!

Additional Resources
● Documentation: Wiki/Docs
● Code: Github (Apache), github.com/apache/incubator-hawq
● Join Discussion/Ask Questions: Apache DLs (dev@hawq.incubator.apache.org, user@hawq.incubator.apache.org)
● Links: HAWQ Homepage, Getting Started, HAWQ Wiki, PXF Wiki, Sandbox

[Diagram: HAWQ Resource Manager and its connection to YARN through libyarn, with numbered steps 1 to 15. Component labels: SQL statement; HAWQ Query Dispatcher; HAWQ Resource Manager; Query Resource Request Queuing Facility; Query Quota Calculator; Queue Quota Calculator; HAWQ resource queue manager; Resource pool; Indexed Resource Quota Table; Global RM container Lifecycle Manager; LIBYARN Resource Broker; libyarn; YARN Resource Manager; YARN Node Manager; HAWQ Segment; segments. Data labels: Cluster report; Container report; Queue report; Accepted YARN container quota; To be returned YARN containers' quota; Allocated query resource.]
● Resource broker uses libyarn (a C/C++ version library) to communicate with YARN through protobuf.
● Register HAWQ as an unmanaged application exclusively consuming a YARN queue
● Periodically fetch the YARN cluster report, container report and queue report to recognize the YARN cluster
● Acquire YARN containers with host preference information
● Return YARN containers
● Unregister HAWQ in YARN
● Add activated YARN containers' quota
● Return YARN containers' quota
● Increase the HAWQ segment resource quota when new global resource manager containers are allocated
● Decrease the HAWQ segment resource quota when some global resource manager containers are to be kicked
● Acquire calculated resource quota or return unused query resource
● Acquire/Return query resource
● Active YARN containers with resource holding processes started
● Drive the resource broker to acquire global resource manager containers. The quota of a global resource manager container can be (1GB, 1 core), (2GB, 1 core), etc.
● Allocate virtual segments with a fixed resource quota assigned and dispatch workload to segments. The resource quota can be as small as 128MB or 256MB and as large as GBs.

Resource Management
HAWQ Resource Manager
• Responsibility
– Responsible for acquiring & returning CPU/Mem resources from/to YARN
– Responsible for resource allocation among HAWQ users and queries
• Master resource manager process
– Resource negotiation with YARN and resource allocation
– Manages and maintains the resources in the resource pool
– Handles resource allocation/return RPC requests from the QD (query dispatcher)
– Fault tolerance service runs in the same process
• Segment resource manager process
– One HAWQ RM on each Segment
– Negotiation with the Master resource manager (for resource enforcement)
– Fault tolerance service: Heartbeat sender

SQL on Hadoop benchmark

PXF Data Flow

PXF Data Model

Putting it all together
Summary of HAWQ user experience (via HDB)
● External Data (PXF): Parallelized access to external data sources (read/write).
● Install and Configure: Ambari to deploy and manage HAWQ, just like any other Hadoop service.
● Manage Resources: YARN-integrated for dynamic resource allocation across hierarchical groups.
● Write Queries (ORCA): Advanced optimizer and dynamic pipelining for high-performance response.
● Enable Data Science: In-database machine learning algorithms for predictive analytics.
● Extend Data Processing: Procedural language extensions for custom application logic.
