Apache HAWQ Architecture

Alexey Grishchenko, Technical Evangelist at Pivotal
Who I am
Enterprise Architect @ Pivotal
• 7 years in data processing
• 5 years of experience with MPP
• 4 years with Hadoop
• Using HAWQ since the first internal Beta
• Responsible for designing most of the EMEA HAWQ and Greenplum implementations
• Spark contributor
• http://0x0fff.com
Agenda
• What is HAWQ
• Why you need it
• HAWQ Components
• HAWQ Design
• Query execution example
• Competitive solutions
What is HAWQ
• Analytical SQL-on-Hadoop engine
• HAdoop With Queries
• Lineage: Postgres → Greenplum → HAWQ
– 2005: Greenplum forked from Postgres 8.0.2
– 2009: Greenplum rebased on Postgres 8.2.15
– 2011: HAWQ forked from GPDB 4.2.0.0
– 2013: HAWQ 1.0.0.0 released
– 2015: HAWQ 2.0.0.0 goes Open Source
HAWQ is …
• 1’500’000 C and C++ lines of code
– 200’000 of them in headers only
• 180’000 Python LOC
• 60’000 Java LOC
• 23’000 Makefile LOC
• 7’000 Shell scripts LOC
• More than 50 enterprise customers
– More than 10 of them in EMEA
Apache HAWQ
• Apache HAWQ (incubating) from 09’2015
– http://hawq.incubator.apache.org
– https://github.com/apache/incubator-hawq
• What’s in Open Source
– Sources of HAWQ 2.0 alpha
– HAWQ 2.0 beta is planned for 2015’Q4
– HAWQ 2.0 GA is planned for 2016’Q1
• Community is still young – come and join!
Why do we need it?
• SQL interface for BI solutions over Hadoop data, compliant with ANSI SQL-92, -99, -2003
– Example: a 5000-line query with a number of window functions generated by Cognos
• Universal tool for ad hoc analytics on top of Hadoop data
– Example: parse a URL to extract protocol, host name, port, GET parameters
• Good performance
– How many times does the data hit the HDD during a single Hive query?
HAWQ Cluster
• Base Hadoop layout: master servers (Server 1–4) host the SNameNode, NameNode, and ZK JM services, plus the YARN Resource Manager and YARN App Timeline server; worker servers (Server 5 … Server N) each run an HDFS Datanode and a YARN Node Manager, all connected by the interconnect
• HAWQ runs on the same machines: the HAWQ Master and HAWQ Standby are placed on the master servers, and a HAWQ Segment is placed on every Datanode server
Master Servers
• The HAWQ Master and HAWQ Standby live on the cluster’s master servers, next to the other Hadoop master services
• HAWQ Master components: Query Parser, Query Optimizer, Global Resource Manager, Distributed Transactions Manager, Query Dispatch, Metadata Catalog
• The HAWQ Standby Master runs the same components and is kept in sync with the master through WAL replication
Segments
• A HAWQ Segment runs on every Datanode server, alongside the HDFS Datanode and the YARN Node Manager
• HAWQ Segment components: Query Executor, libhdfs3, PXF
• The local filesystem is used for the temporary data directory and logs
Metadata
• HAWQ metadata structure is similar to the Postgres catalog structure
• Statistics
– Number of rows and pages in the table
– Most common values for each field
– Histogram of values distribution for each field
– Number of unique values in the field
– Number of null values in the field
– Average width of the field in bytes
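Since the catalog is Postgres-derived, here is a hedged sketch of how these statistics are collected and inspected (standard Postgres commands and views, assumed to behave the same way in HAWQ; the sells table name is made up):

    -- Gather optimizer statistics for a table
    ANALYZE sells;

    -- Row and page counts
    SELECT relname, reltuples, relpages
    FROM pg_class
    WHERE relname = 'sells';

    -- Per-column statistics: MCVs, histogram, null fraction, distinct count, average width
    SELECT attname, most_common_vals, histogram_bounds, null_frac, n_distinct, avg_width
    FROM pg_stats
    WHERE tablename = 'sells';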
Statistics
• No statistics: how many rows would the join of two tables produce?
→ From 0 to infinity
• Row count: how many rows would the join of two 1000-row tables produce?
→ From 0 to 1’000’000
• Histograms and MCV: how many rows would the join of two 1000-row tables produce, given known field cardinality, values distribution histogram, number of nulls, and most common values?
→ ~ From 500 to 1’500
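For intuition, the standard equi-join estimate used by Postgres-family planners (stated here as a general formula, not taken from the HAWQ source) is roughly

    rows(A join B) ≈ rows(A) × rows(B) / max(ndistinct(A.key), ndistinct(B.key))

so for two 1000-row tables joined on a key with about 1000 distinct values, the estimate is 1000 × 1000 / 1000 = 1000 rows, consistent with the ~500–1’500 range above.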
Metadata
• Table structure information
– Distribution fields – e.g. the example table below is distributed by hash(ID)
– Number of hash buckets
– Partitioning (hash, list, range)

ID  Name        Num  Price
1   Apple       10   50
2   Pear        20   80
3   Banana      40   40
4   Orange      25   50
5   Kiwi        5    120
6   Watermelon  20   30
7   Melon       40   100
8   Pineapple   35   90
• General metadata
– Users and groups
– Access privileges
• Stored procedures
– PL/pgSQL, PL/Java, PL/Python, PL/Perl, PL/R
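A hedged DDL sketch of how the table-level metadata above is declared (Greenplum/HAWQ-style syntax; the fruits table and the bucketnum option name are assumptions to check against the docs):

    CREATE TABLE fruits (
        id    INT,
        name  TEXT,
        num   INT,
        price INT
    )
    WITH (bucketnum = 6)           -- number of hash buckets (assumed HAWQ option name)
    DISTRIBUTED BY (id)            -- distribution field: rows are placed by hash(id)
    PARTITION BY RANGE (price)     -- range partitioning; list partitioning uses PARTITION BY LIST
    (
        START (0) END (200) EVERY (50)
    );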
Query Optimizer
• HAWQ uses cost-based query optimizers
• You have two options
– Planner – evolved from the Postgres query optimizer
– ORCA (Pivotal Query Optimizer) – developed specifically for HAWQ
• Optimizer hints work just like in Postgres
– Enable/disable specific operations
– Change the cost estimations for basic actions
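A hedged sketch of such hints as session-level settings (the enable_* and cost GUCs are standard Postgres; the optimizer switch for ORCA is assumed to be available as in Greenplum):

    -- Switch between the two optimizers (GPDB/HAWQ-style GUC; assumed)
    SET optimizer = on;            -- use ORCA
    SET optimizer = off;           -- fall back to the Planner

    -- Enable/disable specific operations (standard Postgres planner GUCs)
    SET enable_nestloop = off;
    SET enable_hashjoin = on;

    -- Change the cost estimations for basic actions
    SET seq_page_cost = 1.0;
    SET random_page_cost = 4.0;
    SET cpu_tuple_cost = 0.01;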
Storage Formats
Which storage format is the most optimal?
→ It depends on what you mean by “optimal”
– Minimal CPU usage for reading and writing the data
– Minimal disk space usage
– Minimal time to retrieve a record by key
– Minimal time to retrieve a subset of columns
– etc.
Storage Formats
• Row-based storage format
– Similar to Postgres heap storage
• No TOAST
• No ctid, xmin, xmax, cmin, cmax
– Compression
• No compression
• Quicklz
• Zlib levels 1 – 9
Storage Formats
• Apache Parquet
– Mixed row-columnar table store: the data is split into “row groups” stored in columnar format
– Compression
• No compression
• Snappy
• Gzip levels 1 – 9
– The size of a “row group” and the page size can be set for each table separately
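A hedged DDL sketch of per-table storage options (the WITH clause follows the Greenplum/HAWQ append-only convention; the exact option names, especially rowgroupsize and pagesize, are assumptions to verify against the HAWQ docs):

    -- Row-based append-only table compressed with zlib level 5
    CREATE TABLE sales_row (id INT, amount FLOAT8, sold_at TIMESTAMP)
    WITH (appendonly = true, orientation = row,
          compresstype = zlib, compresslevel = 5)
    DISTRIBUTED BY (id);

    -- Parquet table with snappy compression and custom row group / page sizes
    CREATE TABLE sales_parquet (id INT, amount FLOAT8, sold_at TIMESTAMP)
    WITH (appendonly = true, orientation = parquet,
          compresstype = snappy,
          rowgroupsize = 8388608, pagesize = 1048576)
    DISTRIBUTED BY (id);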
Resource Management
• Two main options
– Static resource split – HAWQ and YARN do not know about each other
– YARN – HAWQ asks the YARN Resource Manager for query execution resources
• Flexible cluster utilization
– A query might run on a subset of nodes if it is small
– A query might have many executors on each cluster node to make it run faster
– You can control the parallelism of each query
Resource Management
• A Resource Queue can be set up with
– Maximum number of parallel queries
– CPU usage priority
– Memory usage limits
– CPU cores usage limit
– MIN/MAX number of executors across the system
– MIN/MAX number of executors on each node
• Queues can be assigned to a user or a group
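A hedged sketch of resource queue DDL (Greenplum-derived syntax that early HAWQ releases inherited; HAWQ 2.0 queue attributes differ, so treat the option names as assumptions):

    -- Limit concurrency, CPU priority, and memory for a class of users
    CREATE RESOURCE QUEUE reporting_queue
    WITH (ACTIVE_STATEMENTS = 10,     -- maximum number of parallel queries
          PRIORITY = HIGH,            -- CPU usage priority
          MEMORY_LIMIT = '8GB');      -- memory usage limit

    -- Attach a user (or a group role) to the queue
    ALTER ROLE bi_user RESOURCE QUEUE reporting_queue;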
External Data
• PXF
– Framework for external data access
– Easy to extend, many public plugins available
– Official plugins: CSV, SequenceFile, Avro, Hive, HBase
– Open Source plugins: JSON, Accumulo, Cassandra, JDBC, Redis, Pipe
• HCatalog
– HAWQ can query HCatalog tables the same way as HAWQ native tables
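A hedged sketch of a PXF external table (host, port, path, and profile are placeholders; the LOCATION URI format depends on the PXF version):

    CREATE EXTERNAL TABLE ext_clicks (
        user_id BIGINT,
        url     TEXT,
        ts      TIMESTAMP
    )
    LOCATION ('pxf://namenode:51200/data/clicks?PROFILE=HdfsTextSimple')
    FORMAT 'TEXT' (DELIMITER E'\t');

    -- The external table is queried like any native HAWQ table, including joins
    SELECT count(*) FROM ext_clicks WHERE url LIKE 'https://%';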
Query Example
• Actors: the HAWQ Master (Query Parser, Query Optimizer, Query Dispatch, Resource Manager, Metadata, Transaction Manager), the NameNode, the YARN Resource Manager, and segment servers 1 … N, each running a Postmaster, HAWQ Segment Query Executors, an HDFS Datanode, and a local directory for temporary data
• A query passes through six phases: Plan, Resource, Prepare, Execute, Result, Cleanup
• Plan – the master parses and optimizes the query; the example plan consists of Scan Bars b, Scan Sells s, Filter b.city = 'San Francisco', HashJoin b.name = s.bar, Project s.beer, s.price, Motion Redist(b.name) and Motion Gather operators
• Resource – the master’s Resource Manager asks the YARN RM for containers: “I need 5 containers, each with 1 CPU core and 256 MB RAM”; YARN answers with an allocation such as “Server 1: 2 containers, Server 2: 1 container, Server N: 2 containers”
• Prepare – the segment postmasters start Query Executor (QE) processes in the granted containers and the master dispatches the plan to them
• Execute – the QEs run the plan in parallel, reading data through the HDFS Datanodes and exchanging rows over the interconnect via the Motion operators
• Result – the results are gathered back on the master and returned to the client
• Cleanup – the master asks YARN to free the query resources (“Free query resources: Server 1: 2 containers, Server 2: 1 container, Server N: 2 containers” – “OK”) and the QE processes are released
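For reference, the plan above corresponds to a query of roughly this shape (the Bars/Sells schema is implied by the plan operators rather than given in the deck):

    SELECT s.beer, s.price
    FROM bars b
    JOIN sells s ON b.name = s.bar
    WHERE b.city = 'San Francisco';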
Query Performance
• Data does not hit the disk unless this cannot be avoided
• Data is not buffered on the segments unless this cannot be avoided
• Data is transferred between the nodes over UDP
• HAWQ has a good cost-based query optimizer
• The C/C++ implementation is more efficient than the Java implementations of competing solutions
• Query parallelism can be easily tuned
Competitive Solutions
• Engines compared: Hive, SparkSQL, Impala, HAWQ
• Comparison criteria: Query Optimizer, ANSI SQL, Built-in Languages, Disk IO, Parallelism, Distributions, Stability, Community
(the per-engine ratings are shown as a matrix on the original slides)
Roadmap
• AWS and S3 integration
• Mesos integration
• Better Ambari integration
• Native support for the Cloudera, MapR and IBM Hadoop distributions
• Make it the best SQL-on-Hadoop engine ever!
Summary
• Modern SQL-on-Hadoop engine
• For structured data processing and analysis
• Combines the best techniques of competing solutions
• Just released to open source
• Community is very young
Join our community and contribute!
Questions
Apache HAWQ
http://hawq.incubator.apache.org
dev@hawq.incubator.apache.org
user@hawq.incubator.apache.org
Reach me on http://0x0fff.com
1 of 135

Recommended

Writing Continuous Applications with Structured Streaming in PySpark by
Writing Continuous Applications with Structured Streaming in PySparkWriting Continuous Applications with Structured Streaming in PySpark
Writing Continuous Applications with Structured Streaming in PySparkDatabricks
2.2K views60 slides
Application architectures with Hadoop – Big Data TechCon 2014 by
Application architectures with Hadoop – Big Data TechCon 2014Application architectures with Hadoop – Big Data TechCon 2014
Application architectures with Hadoop – Big Data TechCon 2014hadooparchbook
6K views69 slides
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data... by
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...DataWorks Summit
1.1K views27 slides
Is hadoop for you by
Is hadoop for youIs hadoop for you
Is hadoop for youGwen (Chen) Shapira
1.5K views44 slides
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ... by
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...DataWorks Summit/Hadoop Summit
6.6K views54 slides
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard Of by
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard OfApache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard OfCharles Givre
8.1K views56 slides

More Related Content

What's hot

Breakthrough OLAP performance with Cassandra and Spark by
Breakthrough OLAP performance with Cassandra and SparkBreakthrough OLAP performance with Cassandra and Spark
Breakthrough OLAP performance with Cassandra and SparkEvan Chan
8.8K views72 slides
Incredible Impala by
Incredible Impala Incredible Impala
Incredible Impala Gwen (Chen) Shapira
4.7K views58 slides
LLAP: Sub-Second Analytical Queries in Hive by
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveDataWorks Summit/Hadoop Summit
1.6K views32 slides
Interactive SQL-on-Hadoop and JethroData by
Interactive SQL-on-Hadoop and JethroDataInteractive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroDataOfir Manor
4K views40 slides
Spark and Spark Streaming by
Spark and Spark StreamingSpark and Spark Streaming
Spark and Spark Streaming宇 傅
451 views41 slides
Koalas: Making an Easy Transition from Pandas to Apache Spark by
Koalas: Making an Easy Transition from Pandas to Apache SparkKoalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache SparkDatabricks
436 views26 slides

What's hot(20)

Breakthrough OLAP performance with Cassandra and Spark by Evan Chan
Breakthrough OLAP performance with Cassandra and SparkBreakthrough OLAP performance with Cassandra and Spark
Breakthrough OLAP performance with Cassandra and Spark
Evan Chan8.8K views
Interactive SQL-on-Hadoop and JethroData by Ofir Manor
Interactive SQL-on-Hadoop and JethroDataInteractive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroData
Ofir Manor4K views
Spark and Spark Streaming by 宇 傅
Spark and Spark StreamingSpark and Spark Streaming
Spark and Spark Streaming
宇 傅451 views
Koalas: Making an Easy Transition from Pandas to Apache Spark by Databricks
Koalas: Making an Easy Transition from Pandas to Apache SparkKoalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache Spark
Databricks436 views
A Developer’s View into Spark's Memory Model with Wenchen Fan by Databricks
A Developer’s View into Spark's Memory Model with Wenchen FanA Developer’s View into Spark's Memory Model with Wenchen Fan
A Developer’s View into Spark's Memory Model with Wenchen Fan
Databricks1.5K views
Transformation Processing Smackdown; Spark vs Hive vs Pig by Lester Martin
Transformation Processing Smackdown; Spark vs Hive vs PigTransformation Processing Smackdown; Spark vs Hive vs Pig
Transformation Processing Smackdown; Spark vs Hive vs Pig
Lester Martin10.9K views
FiloDB - Breakthrough OLAP Performance with Cassandra and Spark by Evan Chan
FiloDB - Breakthrough OLAP Performance with Cassandra and SparkFiloDB - Breakthrough OLAP Performance with Cassandra and Spark
FiloDB - Breakthrough OLAP Performance with Cassandra and Spark
Evan Chan5.8K views
Spark Meetup at Uber by Databricks
Spark Meetup at UberSpark Meetup at Uber
Spark Meetup at Uber
Databricks15.8K views
Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable... by Sumeet Singh
Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable...Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable...
Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable...
Sumeet Singh1.1K views
Architectural Patterns for Streaming Applications by hadooparchbook
Architectural Patterns for Streaming ApplicationsArchitectural Patterns for Streaming Applications
Architectural Patterns for Streaming Applications
hadooparchbook2.8K views
[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi by Vinoth Chandar
[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi
[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi
Vinoth Chandar408 views
The Pushdown of Everything by Stephan Kessler and Santiago Mola by Spark Summit
The Pushdown of Everything by Stephan Kessler and Santiago MolaThe Pushdown of Everything by Stephan Kessler and Santiago Mola
The Pushdown of Everything by Stephan Kessler and Santiago Mola
Spark Summit3.8K views
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co... by DataStax
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
DataStax1.5K views

Similar to Apache HAWQ Architecture

Deploying your Data Warehouse on AWS by
Deploying your Data Warehouse on AWSDeploying your Data Warehouse on AWS
Deploying your Data Warehouse on AWSAmazon Web Services
5K views64 slides
aip-workshop1-dev-tutorial by
aip-workshop1-dev-tutorialaip-workshop1-dev-tutorial
aip-workshop1-dev-tutorialMatthew Vaughn
401 views41 slides
Architecting a Next Generation Data Platform by
Architecting a Next Generation Data PlatformArchitecting a Next Generation Data Platform
Architecting a Next Generation Data Platformhadooparchbook
5.6K views176 slides
Hadoop application architectures - using Customer 360 as an example by
Hadoop application architectures - using Customer 360 as an exampleHadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an examplehadooparchbook
6.1K views162 slides
Cost-based query optimization in Apache Hive 0.14 by
Cost-based query optimization in Apache Hive 0.14Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14Julian Hyde
4K views36 slides
Accelerating SQL queries in NoSQL Databases using Apache Drill and Secondary ... by
Accelerating SQL queries in NoSQL Databases using Apache Drill and Secondary ...Accelerating SQL queries in NoSQL Databases using Apache Drill and Secondary ...
Accelerating SQL queries in NoSQL Databases using Apache Drill and Secondary ...Aman Sinha
444 views22 slides

Similar to Apache HAWQ Architecture(20)

Architecting a Next Generation Data Platform by hadooparchbook
Architecting a Next Generation Data PlatformArchitecting a Next Generation Data Platform
Architecting a Next Generation Data Platform
hadooparchbook5.6K views
Hadoop application architectures - using Customer 360 as an example by hadooparchbook
Hadoop application architectures - using Customer 360 as an exampleHadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an example
hadooparchbook6.1K views
Cost-based query optimization in Apache Hive 0.14 by Julian Hyde
Cost-based query optimization in Apache Hive 0.14Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14
Julian Hyde4K views
Accelerating SQL queries in NoSQL Databases using Apache Drill and Secondary ... by Aman Sinha
Accelerating SQL queries in NoSQL Databases using Apache Drill and Secondary ...Accelerating SQL queries in NoSQL Databases using Apache Drill and Secondary ...
Accelerating SQL queries in NoSQL Databases using Apache Drill and Secondary ...
Aman Sinha444 views
Architecting a next generation data platform by hadooparchbook
Architecting a next generation data platformArchitecting a next generation data platform
Architecting a next generation data platform
hadooparchbook2.5K views
Architecting a next-generation data platform by hadooparchbook
Architecting a next-generation data platformArchitecting a next-generation data platform
Architecting a next-generation data platform
hadooparchbook1.9K views
SQL on everything, in memory by Julian Hyde
SQL on everything, in memorySQL on everything, in memory
SQL on everything, in memory
Julian Hyde7.4K views
Genji: Framework for building resilient near-realtime data pipelines by Swami Sundaramurthy
Genji: Framework for building resilient near-realtime data pipelinesGenji: Framework for building resilient near-realtime data pipelines
Genji: Framework for building resilient near-realtime data pipelines
The Polyglot Data Scientist - Exploring R, Python, and SQL Server by Sarah Dutkiewicz
The Polyglot Data Scientist - Exploring R, Python, and SQL ServerThe Polyglot Data Scientist - Exploring R, Python, and SQL Server
The Polyglot Data Scientist - Exploring R, Python, and SQL Server
Sarah Dutkiewicz730 views
APEX 5 IR Guts and Performance by Karen Cannell
APEX 5 IR Guts and PerformanceAPEX 5 IR Guts and Performance
APEX 5 IR Guts and Performance
Karen Cannell302 views
APEX 5 IR: Guts & Performance by Karen Cannell
APEX 5 IR:  Guts & PerformanceAPEX 5 IR:  Guts & Performance
APEX 5 IR: Guts & Performance
Karen Cannell580 views
Prometheus lightning talk (Devops Dublin March 2015) by Brian Brazil
Prometheus lightning talk (Devops Dublin March 2015)Prometheus lightning talk (Devops Dublin March 2015)
Prometheus lightning talk (Devops Dublin March 2015)
Brian Brazil465 views
Creating PostgreSQL-as-a-Service at Scale by Sean Chittenden
Creating PostgreSQL-as-a-Service at ScaleCreating PostgreSQL-as-a-Service at Scale
Creating PostgreSQL-as-a-Service at Scale
Sean Chittenden1.2K views
APEX 5 Interactive Reports: Guts and PErformance by Karen Cannell
APEX 5 Interactive Reports: Guts and PErformanceAPEX 5 Interactive Reports: Guts and PErformance
APEX 5 Interactive Reports: Guts and PErformance
Karen Cannell2.2K views
OpenTSDB for monitoring @ Criteo by Nathaniel Braun
OpenTSDB for monitoring @ CriteoOpenTSDB for monitoring @ Criteo
OpenTSDB for monitoring @ Criteo
Nathaniel Braun1.2K views
Scaling ingest pipelines with high performance computing principles - Rajiv K... by SignalFx
Scaling ingest pipelines with high performance computing principles - Rajiv K...Scaling ingest pipelines with high performance computing principles - Rajiv K...
Scaling ingest pipelines with high performance computing principles - Rajiv K...
SignalFx1.8K views
Using Apache Calcite for Enabling SQL and JDBC Access to Apache Geode and Oth... by Christian Tzolov
Using Apache Calcite for Enabling SQL and JDBC Access to Apache Geode and Oth...Using Apache Calcite for Enabling SQL and JDBC Access to Apache Geode and Oth...
Using Apache Calcite for Enabling SQL and JDBC Access to Apache Geode and Oth...
Christian Tzolov3K views

More from Alexey Grishchenko

MPP vs Hadoop by
MPP vs HadoopMPP vs Hadoop
MPP vs HadoopAlexey Grishchenko
91.6K views53 slides
Greenplum Architecture by
Greenplum ArchitectureGreenplum Architecture
Greenplum ArchitectureAlexey Grishchenko
6.6K views79 slides
Apache Spark Architecture by
Apache Spark ArchitectureApache Spark Architecture
Apache Spark ArchitectureAlexey Grishchenko
76K views114 slides
Modern Data Architecture by
Modern Data ArchitectureModern Data Architecture
Modern Data ArchitectureAlexey Grishchenko
31.5K views100 slides
Архитектура Apache HAWQ Highload++ 2015 by
Архитектура Apache HAWQ Highload++ 2015Архитектура Apache HAWQ Highload++ 2015
Архитектура Apache HAWQ Highload++ 2015Alexey Grishchenko
747 views143 slides
Pivotal hawq internals by
Pivotal hawq internalsPivotal hawq internals
Pivotal hawq internalsAlexey Grishchenko
6.4K views46 slides

More from Alexey Grishchenko(6)

Recently uploaded

Best Home Security Systems.pptx by
Best Home Security Systems.pptxBest Home Security Systems.pptx
Best Home Security Systems.pptxmogalang
9 views16 slides
OECD-Persol Holdings Workshop on Advancing Employee Well-being in Business an... by
OECD-Persol Holdings Workshop on Advancing Employee Well-being in Business an...OECD-Persol Holdings Workshop on Advancing Employee Well-being in Business an...
OECD-Persol Holdings Workshop on Advancing Employee Well-being in Business an...StatsCommunications
7 views26 slides
[DSC Europe 23] Danijela Horak - The Innovator’s Dilemma: to Build or Not to ... by
[DSC Europe 23] Danijela Horak - The Innovator’s Dilemma: to Build or Not to ...[DSC Europe 23] Danijela Horak - The Innovator’s Dilemma: to Build or Not to ...
[DSC Europe 23] Danijela Horak - The Innovator’s Dilemma: to Build or Not to ...DataScienceConferenc1
5 views19 slides
DGST Methodology Presentation.pdf by
DGST Methodology Presentation.pdfDGST Methodology Presentation.pdf
DGST Methodology Presentation.pdfmaddierlegum
7 views9 slides
[DSC Europe 23] Spela Poklukar & Tea Brasanac - Retrieval Augmented Generation by
[DSC Europe 23] Spela Poklukar & Tea Brasanac - Retrieval Augmented Generation[DSC Europe 23] Spela Poklukar & Tea Brasanac - Retrieval Augmented Generation
[DSC Europe 23] Spela Poklukar & Tea Brasanac - Retrieval Augmented GenerationDataScienceConferenc1
19 views29 slides
[DSC Europe 23] Stefan Mrsic_Goran Savic - Evolving Technology Excellence.pptx by
[DSC Europe 23] Stefan Mrsic_Goran Savic - Evolving Technology Excellence.pptx[DSC Europe 23] Stefan Mrsic_Goran Savic - Evolving Technology Excellence.pptx
[DSC Europe 23] Stefan Mrsic_Goran Savic - Evolving Technology Excellence.pptxDataScienceConferenc1
11 views16 slides

Recently uploaded(20)

Best Home Security Systems.pptx by mogalang
Best Home Security Systems.pptxBest Home Security Systems.pptx
Best Home Security Systems.pptx
mogalang9 views
OECD-Persol Holdings Workshop on Advancing Employee Well-being in Business an... by StatsCommunications
OECD-Persol Holdings Workshop on Advancing Employee Well-being in Business an...OECD-Persol Holdings Workshop on Advancing Employee Well-being in Business an...
OECD-Persol Holdings Workshop on Advancing Employee Well-being in Business an...
[DSC Europe 23] Danijela Horak - The Innovator’s Dilemma: to Build or Not to ... by DataScienceConferenc1
[DSC Europe 23] Danijela Horak - The Innovator’s Dilemma: to Build or Not to ...[DSC Europe 23] Danijela Horak - The Innovator’s Dilemma: to Build or Not to ...
[DSC Europe 23] Danijela Horak - The Innovator’s Dilemma: to Build or Not to ...
DGST Methodology Presentation.pdf by maddierlegum
DGST Methodology Presentation.pdfDGST Methodology Presentation.pdf
DGST Methodology Presentation.pdf
maddierlegum7 views
[DSC Europe 23] Spela Poklukar & Tea Brasanac - Retrieval Augmented Generation by DataScienceConferenc1
[DSC Europe 23] Spela Poklukar & Tea Brasanac - Retrieval Augmented Generation[DSC Europe 23] Spela Poklukar & Tea Brasanac - Retrieval Augmented Generation
[DSC Europe 23] Spela Poklukar & Tea Brasanac - Retrieval Augmented Generation
[DSC Europe 23] Stefan Mrsic_Goran Savic - Evolving Technology Excellence.pptx by DataScienceConferenc1
[DSC Europe 23] Stefan Mrsic_Goran Savic - Evolving Technology Excellence.pptx[DSC Europe 23] Stefan Mrsic_Goran Savic - Evolving Technology Excellence.pptx
[DSC Europe 23] Stefan Mrsic_Goran Savic - Evolving Technology Excellence.pptx
[DSC Europe 23] Ivan Dundovic - How To Treat Your Data As A Product.pptx by DataScienceConferenc1
[DSC Europe 23] Ivan Dundovic - How To Treat Your Data As A Product.pptx[DSC Europe 23] Ivan Dundovic - How To Treat Your Data As A Product.pptx
[DSC Europe 23] Ivan Dundovic - How To Treat Your Data As A Product.pptx
[DSC Europe 23] Matteo Molteni - Implementing a Robust CI Workflow with dbt f... by DataScienceConferenc1
[DSC Europe 23] Matteo Molteni - Implementing a Robust CI Workflow with dbt f...[DSC Europe 23] Matteo Molteni - Implementing a Robust CI Workflow with dbt f...
[DSC Europe 23] Matteo Molteni - Implementing a Robust CI Workflow with dbt f...
LIVE OAK MEMORIAL PARK.pptx by ms2332always
LIVE OAK MEMORIAL PARK.pptxLIVE OAK MEMORIAL PARK.pptx
LIVE OAK MEMORIAL PARK.pptx
ms2332always7 views
OPPOTUS - Malaysians on Malaysia 3Q2023.pdf by Oppotus
OPPOTUS - Malaysians on Malaysia 3Q2023.pdfOPPOTUS - Malaysians on Malaysia 3Q2023.pdf
OPPOTUS - Malaysians on Malaysia 3Q2023.pdf
Oppotus30 views
Data Journeys Hard Talk workshop final.pptx by info828217
Data Journeys Hard Talk workshop final.pptxData Journeys Hard Talk workshop final.pptx
Data Journeys Hard Talk workshop final.pptx
info82821711 views
Dr. Ousmane Badiane-2023 ReSAKSS Conference by AKADEMIYA2063
Dr. Ousmane Badiane-2023 ReSAKSS ConferenceDr. Ousmane Badiane-2023 ReSAKSS Conference
Dr. Ousmane Badiane-2023 ReSAKSS Conference
AKADEMIYA20635 views
[DSC Europe 23][AI:CSI] Dragan Pleskonjic - AI Impact on Cybersecurity and P... by DataScienceConferenc1
[DSC Europe 23][AI:CSI]  Dragan Pleskonjic - AI Impact on Cybersecurity and P...[DSC Europe 23][AI:CSI]  Dragan Pleskonjic - AI Impact on Cybersecurity and P...
[DSC Europe 23][AI:CSI] Dragan Pleskonjic - AI Impact on Cybersecurity and P...
CRIJ4385_Death Penalty_F23.pptx by yvettemm100
CRIJ4385_Death Penalty_F23.pptxCRIJ4385_Death Penalty_F23.pptx
CRIJ4385_Death Penalty_F23.pptx
yvettemm1007 views

Apache HAWQ Architecture

  • 2. Who I am Enterprise Architect @ Pivotal • 7 years in data processing • 5 years of experience with MPP • 4 years with Hadoop • Using HAWQ since the first internal Beta • Responsible for designing most of the EMEA HAWQ and Greenplum implementations • Spark contributor • http://0x0fff.com
  • 4. Agenda • What is HAWQ • Why you need it
  • 5. Agenda • What is HAWQ • Why you need it • HAWQ Components
  • 6. Agenda • What is HAWQ • Why you need it • HAWQ Components • HAWQ Design
  • 7. Agenda • What is HAWQ • Why you need it • HAWQ Components • HAWQ Design • Query execution example
  • 8. Agenda • What is HAWQ • Why you need it • HAWQ Components • HAWQ Design • Query execution example • Competitive solutions
  • 9. What is • Analytical SQL-on-Hadoop engine
  • 10. What is • Analytical SQL-on-Hadoop engine • HAdoop With Queries
  • 11. What is • Analytical SQL-on-Hadoop engine • HAdoop With Queries Postgres Greenplum HAWQ 2005 Fork Postgres 8.0.2
  • 12. What is • Analytical SQL-on-Hadoop engine • HAdoop With Queries Postgres HAWQ 2005 Fork Postgres 8.0.2 2009 Rebase Postgres 8.2.15 Greenplum
  • 13. What is • Analytical SQL-on-Hadoop engine • HAdoop With Queries Postgres HAWQ 2005 Fork Postgres 8.0.2 2009 Rebase Postgres 8.2.15 2011 Fork GPDB 4.2.0.0 Greenplum
  • 14. What is • Analytical SQL-on-Hadoop engine • HAdoop With Queries Postgres HAWQ 2005 Fork Postgres 8.0.2 2009 Rebase Postgres 8.2.15 2011 Fork GPDB 4.2.0.0 2013 HAWQ 1.0.0.0 Greenplum
  • 15. What is • Analytical SQL-on-Hadoop engine • HAdoop With Queries Postgres HAWQ 2005 Fork Postgres 8.0.2 2009 Rebase Postgres 8.2.15 2011 Fork GPDB 4.2.0.0 2013 HAWQ 1.0.0.0 HAWQ 2.0.0.0 Open Source 2015 Greenplum
  • 16. HAWQ is … • 1’500’000 C and C++ lines of code
  • 17. HAWQ is … • 1’500’000 C and C++ lines of code – 200’000 of them in headers only
  • 18. HAWQ is … • 1’500’000 C and C++ lines of code – 200’000 of them in headers only • 180’000 Python LOC
  • 19. HAWQ is … • 1’500’000 C and C++ lines of code – 200’000 of them in headers only • 180’000 Python LOC • 60’000 Java LOC
  • 20. HAWQ is … • 1’500’000 C and C++ lines of code – 200’000 of them in headers only • 180’000 Python LOC • 60’000 Java LOC • 23’000 Makefile LOC
  • 21. HAWQ is … • 1’500’000 C and C++ lines of code – 200’000 of them in headers only • 180’000 Python LOC • 60’000 Java LOC • 23’000 Makefile LOC • 7’000 Shell scripts LOC
  • 22. HAWQ is … • 1’500’000 C and C++ lines of code – 200’000 of them in headers only • 180’000 Python LOC • 60’000 Java LOC • 23’000 Makefile LOC • 7’000 Shell scripts LOC • More than 50 enterprise customers
  • 23. HAWQ is … • 1’500’000 C and C++ lines of code – 200’000 of them in headers only • 180’000 Python LOC • 60’000 Java LOC • 23’000 Makefile LOC • 7’000 Shell scripts LOC • More than 50 enterprise customers – More than 10 of them in EMEA
  • 24. Apache HAWQ • Apache HAWQ (incubating) from 09’2015 – http://hawq.incubator.apache.org – https://github.com/apache/incubator-hawq • What’s in Open Source – Sources of HAWQ 2.0 alpha – HAWQ 2.0 beta is planned for 2015’Q4 – HAWQ 2.0 GA is planned for 2016’Q1 • Community is yet young – come and join!
  • 25. Why do we need it?
  • 26. Why do we need it? • SQL-interface for BI solutions to the Hadoop data complaint with ANSI SQL-92, -99, -2003
  • 27. Why do we need it? • SQL-interface for BI solutions to the Hadoop data complaint with ANSI SQL-92, -99, -2003 – Example - 5000-line query with a number of window function generated by Cognos
  • 28. Why do we need it? • SQL-interface for BI solutions to the Hadoop data complaint with ANSI SQL-92, -99, -2003 – Example - 5000-line query with a number of window function generated by Cognos • Universal tool for ad hoc analytics on top of Hadoop data
  • 29. Why do we need it? • SQL-interface for BI solutions to the Hadoop data complaint with ANSI SQL-92, -99, -2003 – Example - 5000-line query with a number of window function generated by Cognos • Universal tool for ad hoc analytics on top of Hadoop data – Example - parse URL to extract protocol, host name, port, GET parameters
  • 30. Why do we need it? • SQL-interface for BI solutions to the Hadoop data complaint with ANSI SQL-92, -99, -2003 – Example - 5000-line query with a number of window function generated by Cognos • Universal tool for ad hoc analytics on top of Hadoop data – Example - parse URL to extract protocol, host name, port, GET parameters • Good performance
  • 31. Why do we need it? • SQL-interface for BI solutions to the Hadoop data complaint with ANSI SQL-92, -99, -2003 – Example - 5000-line query with a number of window function generated by Cognos • Universal tool for ad hoc analytics on top of Hadoop data – Example - parse URL to extract protocol, host name, port, GET parameters • Good performance – How many times the data would hit the HDD during a single Hive query?
  • 32. HAWQ Cluster Server 1 SNameNode Server 4 ZK JM NameNode Server 3 ZK JM Server 2 ZK JM Server 6 Datanode Server N Datanode Server 5 Datanode interconnect …
  • 33. HAWQ Cluster Server 1 SNameNode Server 4 ZK JM NameNode Server 3 ZK JM Server 2 ZK JM Server 6 Datanode Server N Datanode Server 5 Datanode YARN NM YARN NM YARN NM YARN RM YARN App Timeline interconnect …
  • 34. HAWQ Cluster HAWQ Master Server 1 SNameNode Server 4 ZK JM NameNode Server 3 ZK JM HAWQ Standby Server 2 ZK JM HAWQ Segment Server 6 Datanode HAWQ Segment Server N Datanode HAWQ Segment Server 5 Datanode YARN NM YARN NM YARN NM YARN RM YARN App Timeline interconnect …
  • 35. Master Servers Server 1 SNameNode Server 4 ZK JM NameNode Server 3 ZK JM Server 2 ZK JM HAWQ Segment Server 6 Datanode HAWQ Segment Server N Datanode HAWQ Segment Server 5 Datanode YARN NM YARN NM YARN NM YARN RM YARN App Timeline interconnect … HAWQ Master HAWQ Standby
  • 36. Master Servers HAWQ Master Query Parser Query Optimizer Global Resource Manager Distributed Transactions Manager Query Dispatch Metadata Catalog HAWQ Standby Master Query Parser Query Optimizer Global Resource Manager Distributed Transactions Manager Query Dispatch Metadata Catalog WAL repl.
  • 38. Segments: each worker host runs a HAWQ Segment (Query Executor, libhdfs3, PXF) colocated with the HDFS Datanode and the YARN Node Manager; the local filesystem holds the temporary data directory and the logs.
  • 45. Metadata
    • HAWQ metadata structure is similar to the Postgres catalog structure
    • Statistics (see the sketch below)
      – Number of rows and pages in the table
      – Most common values for each field
      – Histogram of the value distribution for each field
      – Number of unique values in the field
      – Number of null values in the field
      – Average width of the field in bytes
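    Since the catalog follows the Postgres model, the statistics listed above can be collected and inspected the usual Postgres way; a sketch, with a hypothetical sales table:

        -- collect statistics for a table
        ANALYZE sales;

        -- per-column statistics: distinct values, nulls, width, MCVs, histogram
        SELECT attname, n_distinct, null_frac, avg_width,
               most_common_vals, histogram_bounds
        FROM pg_stats
        WHERE tablename = 'sales';

        -- row and page counts are kept in pg_class
        SELECT relname, reltuples, relpages
        FROM pg_class
        WHERE relname = 'sales';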
  • 51. Statistics
    • No statistics: how many rows would the join of two tables produce? → Anywhere from 0 to infinity
    • Row count only: how many rows would the join of two 1000-row tables produce? → From 0 to 1’000’000
    • Histograms and MCVs: how many rows would the join of two 1000-row tables produce, given the field cardinality, a value-distribution histogram, the number of nulls and the most common values? → Roughly 500 to 1’500 (a back-of-the-envelope estimate is sketched below)
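    The textbook equi-join estimate (not necessarily the exact formula HAWQ's optimizers use) shows why the range narrows so much once column cardinalities are known:

        |R ⋈ S| ≈ (|R| · |S|) / max(ndv(R.key), ndv(S.key))

    With |R| = |S| = 1’000 and, say, 700–2’000 distinct join keys, the estimate falls between about 500 and 1’400 rows, consistent with the answer on the slide.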
  • 55. Metadata
    • Table structure information (DDL sketch below)
      – Distribution fields
      – Number of hash buckets
      – Partitioning (hash, list, range)
    • Example table, distributed into hash buckets by hash(ID):
        ID  Name        Num  Price
        1   Apple       10   50
        2   Pear        20   80
        3   Banana      40   40
        4   Orange      25   50
        5   Kiwi        5    120
        6   Watermelon  20   30
        7   Melon       40   100
        8   Pineapple   35   90
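    A minimal DDL sketch for a table like the one above, in HAWQ/Greenplum style; the bucketnum option and the chosen partition boundaries are assumptions, not taken from the slides.

        CREATE TABLE sales (
            id    int,
            name  text,
            num   int,
            price numeric
        )
        WITH (appendonly = true, bucketnum = 16)   -- bucketnum: number of hash buckets (assumption)
        DISTRIBUTED BY (id)                        -- distribution field: hash(id)
        PARTITION BY RANGE (price)
        (
            START (0) END (200) EVERY (50)         -- illustrative range partitions
        );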
  • 58. Metadata (continued)
    • General metadata
      – Users and groups
      – Access privileges
    • Stored procedures (see the sketch below)
      – PL/pgSQL, PL/Java, PL/Python, PL/Perl, PL/R
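    Users, privileges and stored procedures follow the Postgres model; a small sketch, reusing the hypothetical sales table and an invented analysts role.

        -- users/groups and privileges
        CREATE ROLE analysts;                         -- a group
        CREATE ROLE alice LOGIN IN ROLE analysts;     -- a user in that group
        GRANT SELECT ON sales TO analysts;

        -- a trivial stored procedure in PL/pgSQL (PL/Java, PL/Python, PL/Perl, PL/R work analogously)
        CREATE FUNCTION total_price(p_num int, p_price numeric) RETURNS numeric AS $$
        BEGIN
            RETURN p_num * p_price;
        END;
        $$ LANGUAGE plpgsql;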
  • 61. Query Optimizer
    • HAWQ uses cost-based query optimizers
    • You have two options
      – Planner – evolved from the Postgres query optimizer
      – ORCA (Pivotal Query Optimizer) – developed specifically for HAWQ
    • Optimizer hints work just like in Postgres (see the sketch below)
      – Enable/disable specific operations
      – Change the cost estimates for basic actions
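    Hints are expressed as Postgres-style session settings; a sketch (the optimizer GUC for switching to ORCA follows Greenplum/HAWQ convention and is an assumption here, the rest are standard Postgres GUCs).

        SET optimizer = on;             -- use ORCA; 'off' falls back to the Planner (assumption)
        SET enable_nestloop = off;      -- disable a specific plan operation
        SET random_page_cost = 2.0;     -- change a basic cost estimate
        EXPLAIN SELECT count(*) FROM sales;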
  • 67. Storage Formats
    Which storage format is the most optimal? → It depends on what you mean by "optimal":
      – Minimal CPU usage for reading and writing the data
      – Minimal disk space usage
      – Minimal time to retrieve a record by key
      – Minimal time to retrieve a subset of columns
      – etc.
  • 69. Storage Formats
    • Row-based storage format
      – Similar to Postgres heap storage
        • No TOAST
        • No ctid, xmin, xmax, cmin, cmax
      – Compression: none, QuickLZ, or zlib levels 1–9
  • 72. Storage Formats
    • Apache Parquet
      – Hybrid row-columnar table store: the data is split into "row groups" stored in a columnar format
      – Compression: none, Snappy, or gzip levels 1–9
      – The row-group size and page size can be set for each table separately (DDL sketch below)
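    A sketch of the two storage formats in DDL; the rowgroupsize/pagesize option names and byte values follow HAWQ conventions but should be treated as assumptions.

        -- row-oriented append-only table with zlib compression
        CREATE TABLE events_row (id int, payload text)
        WITH (appendonly = true, compresstype = zlib, compresslevel = 5)
        DISTRIBUTED BY (id);

        -- Parquet table with Snappy compression and per-table row-group/page sizes
        CREATE TABLE events_parquet (id int, payload text)
        WITH (appendonly = true, orientation = parquet,
              compresstype = snappy, rowgroupsize = 8388608, pagesize = 1048576)
        DISTRIBUTED BY (id);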
  • 77. Resource Management
    • Two main options
      – Static resource split – HAWQ and YARN do not know about each other
      – YARN – HAWQ asks the YARN Resource Manager for query execution resources
    • Flexible cluster utilization
      – A small query might run on a subset of the nodes
      – A query might have many executors on each cluster node to make it run faster
      – You can control the parallelism of each query
  • 84. Resource Management
    • A Resource Queue can be set with
      – Maximum number of parallel queries
      – CPU usage priority
      – Memory usage limits
      – CPU core usage limits
      – MIN/MAX number of executors across the system
      – MIN/MAX number of executors on each node
    • Can be set up per user or per group (DDL sketch below)
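    A sketch in the spirit of HAWQ 2.0 resource queues; the exact attribute names and value formats differ between versions, so treat them as assumptions rather than exact syntax.

        CREATE RESOURCE QUEUE reporting WITH (
            PARENT = 'pg_root',
            ACTIVE_STATEMENTS = 10,          -- max parallel queries (assumption on attribute name)
            MEMORY_LIMIT_CLUSTER = 30%,      -- share of cluster memory (assumption)
            CORE_LIMIT_CLUSTER = 30%         -- share of cluster CPU cores (assumption)
        );

        -- attach a user (or group role) to the queue
        ALTER ROLE alice RESOURCE QUEUE reporting;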
  • 86. External Data
    • PXF
      – Framework for external data access
      – Easy to extend; many public plugins available
      – Official plugins: CSV, SequenceFile, Avro, Hive, HBase
      – Open-source plugins: JSON, Accumulo, Cassandra, JDBC, Redis, Pipe
    • HCatalog
      – HAWQ can query tables registered in HCatalog the same way as native HAWQ tables (see the sketch below)
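    A sketch of both access paths; the PXF host, port, profile name and the hcatalog.<database>.<table> naming are assumptions to be adjusted for a real cluster.

        -- external table over a CSV file in HDFS via PXF
        CREATE EXTERNAL TABLE ext_clicks (id int, url text)
        LOCATION ('pxf://namenode:51200/data/clicks?PROFILE=HdfsTextSimple')
        FORMAT 'TEXT' (DELIMITER ',');

        SELECT count(*) FROM ext_clicks;

        -- querying a Hive table registered in HCatalog
        SELECT count(*) FROM hcatalog.default.clicks_hive;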
  • 87–114. Query Example (diagram walkthrough). The HAWQ Master (Postmaster, Query Parser, Query Optimizer, Query Dispatch, Metadata, Transaction Manager, Resource Manager) works with the HDFS NameNode, the YARN Resource Manager and the segment hosts (Servers 1, 2, … N), each running a HAWQ Segment Postmaster, an HDFS Datanode and a local directory. A query passes through six phases:
    1. Plan – the master parses and optimizes the query into a plan: Scan Bars b; Scan Sells s; Filter b.city = 'San Francisco'; HashJoin b.name = s.bar; Project s.beer, s.price; Motion Redist(b.name); Motion Gather.
    2. Resource – the master's Resource Manager asks the YARN RM: "I need 5 containers, each with 1 CPU core and 256 MB RAM"; YARN answers: "Server 1: 2 containers, Server 2: 1 container, Server N: 2 containers".
    3. Prepare – query executors (QE) are started on the segments according to the allocated containers.
    4. Execute – the plan is dispatched to the QEs and executed; intermediate data flows between them through the Motion operators.
    5. Result – the results are gathered back to the master and returned to the client.
    6. Cleanup – the master asks YARN to free the query resources ("Server 1: 2 containers, Server 2: 1 container, Server N: 2 containers"), YARN answers "OK", and the QEs are shut down.
    The SQL behind this plan is sketched below.
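    Reading the plan operators back into SQL, the query in this walkthrough has roughly this shape:

        SELECT s.beer, s.price
        FROM Bars b
        JOIN Sells s ON b.name = s.bar
        WHERE b.city = 'San Francisco';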
  • 120. Query Performance
    • Data does not hit the disk unless this cannot be avoided
    • Data is not buffered on the segments unless this cannot be avoided
    • Data is transferred between the nodes over UDP
    • HAWQ has a good cost-based query optimizer
    • The C/C++ implementation is more efficient than the Java implementations of competing solutions
    • Query parallelism can be easily tuned (sketch below)
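    Parallelism tuning is exposed through session settings; the GUC names below follow HAWQ 2.0 conventions and are assumptions, included only to show the shape of the knob.

        SET hawq_rm_nvseg_perquery_limit = 64;          -- max virtual segments per query, cluster-wide (assumption)
        SET hawq_rm_nvseg_perquery_perseg_limit = 4;    -- max virtual segments per query on each host (assumption)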
  • 128. Competitive Solutions (comparison table): Hive, SparkSQL, Impala and HAWQ are compared on Query Optimizer, ANSI SQL, Built-in Languages, Disk IO, Parallelism, Distributions, Stability and Community; the per-cell ratings are shown graphically on the slides.
  • 133. Roadmap
    • AWS and S3 integration
    • Mesos integration
    • Better Ambari integration
    • Native support for the Cloudera, MapR and IBM Hadoop distributions
    • Make it the best SQL-on-Hadoop engine ever!
  • 134. Summary
    • A modern SQL-on-Hadoop engine
    • For structured data processing and analysis
    • Combines the best techniques of competing solutions
    • Just released as open source
    • The community is very young – join it and contribute!