Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
HAWQ
Architecture
Alexey Grishchenko
Who I am
Enterprise Architect @ Pivotal
• 7 years in data processing
• 5 years of experience with MPP
• 4 years with Hadoo...
Agenda
• What is HAWQ
Agenda
• What is HAWQ
• Why you need it
Agenda
• What is HAWQ
• Why you need it
• HAWQ Components
Agenda
• What is HAWQ
• Why you need it
• HAWQ Components
• HAWQ Design
Agenda
• What is HAWQ
• Why you need it
• HAWQ Components
• HAWQ Design
• Query execution example
Agenda
• What is HAWQ
• Why you need it
• HAWQ Components
• HAWQ Design
• Query execution example
• Competitive solutions
What is
• Analytical SQL-on-Hadoop engine
What is
• Analytical SQL-on-Hadoop engine
• HAdoop With Queries
What is
• Analytical SQL-on-Hadoop engine
• HAdoop With Queries
Postgres Greenplum HAWQ
2005
Fork
Postgres 8.0.2
What is
• Analytical SQL-on-Hadoop engine
• HAdoop With Queries
Postgres HAWQ
2005
Fork
Postgres 8.0.2
2009
Rebase
Postgre...
What is
• Analytical SQL-on-Hadoop engine
• HAdoop With Queries
Postgres HAWQ
2005
Fork
Postgres 8.0.2
2009
Rebase
Postgre...
What is
• Analytical SQL-on-Hadoop engine
• HAdoop With Queries
Postgres HAWQ
2005
Fork
Postgres 8.0.2
2009
Rebase
Postgre...
What is
• Analytical SQL-on-Hadoop engine
• HAdoop With Queries
Postgres HAWQ
2005
Fork
Postgres 8.0.2
2009
Rebase
Postgre...
HAWQ is …
• 1’500’000 C and C++ lines of code
HAWQ is …
• 1’500’000 C and C++ lines of code
– 200’000 of them in headers only
HAWQ is …
• 1’500’000 C and C++ lines of code
– 200’000 of them in headers only
• 180’000 Python LOC
HAWQ is …
• 1’500’000 C and C++ lines of code
– 200’000 of them in headers only
• 180’000 Python LOC
• 60’000 Java LOC
HAWQ is …
• 1’500’000 C and C++ lines of code
– 200’000 of them in headers only
• 180’000 Python LOC
• 60’000 Java LOC
• 2...
HAWQ is …
• 1’500’000 C and C++ lines of code
– 200’000 of them in headers only
• 180’000 Python LOC
• 60’000 Java LOC
• 2...
HAWQ is …
• 1’500’000 C and C++ lines of code
– 200’000 of them in headers only
• 180’000 Python LOC
• 60’000 Java LOC
• 2...
HAWQ is …
• 1’500’000 C and C++ lines of code
– 200’000 of them in headers only
• 180’000 Python LOC
• 60’000 Java LOC
• 2...
Apache HAWQ
• Apache HAWQ (incubating) from 09’2015
– http://hawq.incubator.apache.org
– https://github.com/apache/incubat...
Why do we need it?
Why do we need it?
• SQL-interface for BI solutions to the Hadoop
data complaint with ANSI SQL-92, -99, -2003
Why do we need it?
• SQL-interface for BI solutions to the Hadoop
data complaint with ANSI SQL-92, -99, -2003
– Example - ...
Why do we need it?
• SQL-interface for BI solutions to the Hadoop
data complaint with ANSI SQL-92, -99, -2003
– Example - ...
Why do we need it?
• SQL-interface for BI solutions to the Hadoop
data complaint with ANSI SQL-92, -99, -2003
– Example - ...
Why do we need it?
• SQL-interface for BI solutions to the Hadoop
data complaint with ANSI SQL-92, -99, -2003
– Example - ...
Why do we need it?
• SQL-interface for BI solutions to the Hadoop
data complaint with ANSI SQL-92, -99, -2003
– Example - ...
HAWQ Cluster
Server 1
SNameNode
Server 4
ZK JM
NameNode
Server 3
ZK JM
Server 2
ZK JM
Server 6
Datanode
Server N
Datanode
...
HAWQ Cluster
Server 1
SNameNode
Server 4
ZK JM
NameNode
Server 3
ZK JM
Server 2
ZK JM
Server 6
Datanode
Server N
Datanode
...
HAWQ Cluster
HAWQ Master
Server 1
SNameNode
Server 4
ZK JM
NameNode
Server 3
ZK JM
HAWQ
Standby
Server 2
ZK JM
HAWQ Segmen...
Master Servers
Server 1
SNameNode
Server 4
ZK JM
NameNode
Server 3
ZK JM
Server 2
ZK JM
HAWQ Segment
Server 6
Datanode
HAW...
Master Servers
HAWQ Master
Query Parser
Query
Optimizer
Global
Resource
Manager
Distributed
Transactions
Manager
Query Dis...
HAWQ Master
HAWQ
Standby
Segments
Server 1
SNameNode
Server 4
ZK JM
NameNode
Server 3
ZK JM
Server 2
ZK JM
Server 6
Datano...
Segments
HAWQ Segment
Query Executor
libhdfs3
PXF
HDFS Datanode
Local Filesystem
Temporary Data
Directory
Logs
YARN Node M...
Metadata
• HAWQ metadata structure is similar to
Postgres catalog structure
Metadata
• HAWQ metadata structure is similar to
Postgres catalog structure
• Statistics
– Number of rows and pages in the...
Metadata
• HAWQ metadata structure is similar to
Postgres catalog structure
• Statistics
– Number of rows and pages in the...
Metadata
• HAWQ metadata structure is similar to
Postgres catalog structure
• Statistics
– Number of rows and pages in the...
Metadata
• HAWQ metadata structure is similar to
Postgres catalog structure
• Statistics
– Number of rows and pages in the...
Metadata
• HAWQ metadata structure is similar to
Postgres catalog structure
• Statistics
– Number of rows and pages in the...
Metadata
• HAWQ metadata structure is similar to
Postgres catalog structure
• Statistics
– Number of rows and pages in the...
Statistics
No Statistics
How many rows would produce the join of two
tables?
Statistics
No Statistics
How many rows would produce the join of two
tables?
 From 0 to infinity
Statistics
No Statistics
Row Count
How many rows would produce the join of two
tables?
 From 0 to infinity
How many rows ...
Statistics
No Statistics
Row Count
How many rows would produce the join of two
tables?
 From 0 to infinity
How many rows ...
Statistics
No Statistics
Row Count
Histograms and MCV
How many rows would produce the join of two
tables?
 From 0 to infi...
Statistics
No Statistics
Row Count
Histograms and MCV
How many rows would produce the join of two
tables?
 From 0 to infi...
Metadata
• Table structure information
ID Name Num Price
1 Яблоко 10 50
2 Груша 20 80
3 Банан 40 40
4 Апельсин 25 50
5 Кив...
Metadata
• Table structure information
– Distribution fields
ID Name Num Price
1 Яблоко 10 50
2 Груша 20 80
3 Банан 40 40
...
Metadata
• Table structure information
– Distribution fields
– Number of hash buckets
ID Name Num Price
1 Яблоко 10 50
2 Г...
Metadata
• Table structure information
– Distribution fields
– Number of hash buckets
– Partitioning (hash, list, range)
I...
Metadata
• Table structure information
– Distribution fields
– Number of hash buckets
– Partitioning (hash, list, range)
•...
Metadata
• Table structure information
– Distribution fields
– Number of hash buckets
– Partitioning (hash, list, range)
•...
Metadata
• Table structure information
– Distribution fields
– Number of hash buckets
– Partitioning (hash, list, range)
•...
Query Optimizer
• HAWQ uses cost-based query optimizers
Query Optimizer
• HAWQ uses cost-based query optimizers
• You have two options
– Planner – evolved from the Postgres query...
Query Optimizer
• HAWQ uses cost-based query optimizers
• You have two options
– Planner – evolved from the Postgres query...
Storage Formats
Which storage format is the most optimal?
Storage Formats
Which storage format is the most optimal?
 It depends on what you mean by “optimal”
Storage Formats
Which storage format is the most optimal?
 It depends on what you mean by “optimal”
– Minimal CPU usage f...
Storage Formats
Which storage format is the most optimal?
 It depends on what you mean by “optimal”
– Minimal CPU usage f...
Storage Formats
Which storage format is the most optimal?
 It depends on what you mean by “optimal”
– Minimal CPU usage f...
Storage Formats
Which storage format is the most optimal?
 It depends on what you mean by “optimal”
– Minimal CPU usage f...
Storage Formats
• Row-based storage format
– Similar to Postgres heap storage
• No toast
• No ctid, xmin, xmax, cmin, cmax
Storage Formats
• Row-based storage format
– Similar to Postgres heap storage
• No toast
• No ctid, xmin, xmax, cmin, cmax...
Storage Formats
• Apache Parquet
– Mixed row-columnar table store, the data is split
into “row groups” stored in columnar ...
Storage Formats
• Apache Parquet
– Mixed row-columnar table store, the data is split
into “row groups” stored in columnar ...
Storage Formats
• Apache Parquet
– Mixed row-columnar table store, the data is split
into “row groups” stored in columnar ...
Resource Management
• Two main options
– Static resource split – HAWQ and YARN does not
know about each other
Resource Management
• Two main options
– Static resource split – HAWQ and YARN does not
know about each other
– YARN – HAW...
Resource Management
• Two main options
– Static resource split – HAWQ and YARN does not
know about each other
– YARN – HAW...
Resource Management
• Two main options
– Static resource split – HAWQ and YARN does not
know about each other
– YARN – HAW...
Resource Management
• Two main options
– Static resource split – HAWQ and YARN does not
know about each other
– YARN – HAW...
Resource Management
• Resource Queue can be set with
– Maximum number of parallel queries
Resource Management
• Resource Queue can be set with
– Maximum number of parallel queries
– CPU usage priority
Resource Management
• Resource Queue can be set with
– Maximum number of parallel queries
– CPU usage priority
– Memory us...
Resource Management
• Resource Queue can be set with
– Maximum number of parallel queries
– CPU usage priority
– Memory us...
Resource Management
• Resource Queue can be set with
– Maximum number of parallel queries
– CPU usage priority
– Memory us...
Resource Management
• Resource Queue can be set with
– Maximum number of parallel queries
– CPU usage priority
– Memory us...
Resource Management
• Resource Queue can be set with
– Maximum number of parallel queries
– CPU usage priority
– Memory us...
External Data
• PXF
– Framework for external data access
– Easy to extend, many public plugins available
– Official plugin...
External Data
• PXF
– Framework for external data access
– Easy to extend, many public plugins available
– Official plugin...
Query Example
HAWQ Master
Metadata
Transaction Mgr.
Query Parser Query Optimizer
Query Dispatch
Resource Mgr.
NameNode
Ser...
Query Example
HAWQ Master
Metadata
Transaction Mgr.
Query Parser Query Optimizer
Query Dispatch
Resource Mgr.
NameNode
Ser...
Query Example
HAWQ Master
Metadata
Transaction Mgr.
Query Parser Query Optimizer
Query Dispatch
Resource Mgr.
NameNode
Ser...
Query Example
HAWQ Master
Metadata
Transaction Mgr.
Query Parser Query Optimizer
Query Dispatch
Resource Mgr.
NameNode
Ser...
Query Example
HAWQ Master
Metadata
Transaction Mgr.
Query Parser Query Optimizer
Query Dispatch
Resource Mgr.
NameNode
Ser...
Query Example
HAWQ Master
Metadata
Transaction Mgr.
Query Parser Query Optimizer
Query Dispatch
Resource Mgr.
NameNode
Ser...
Plan
Query Example
HAWQ Master
Metadata
Transaction Mgr.
Query Parser Query Optimizer
Query Dispatch
Resource Mgr.
NameNod...
Plan
Query Example
HAWQ Master
Metadata
Transaction Mgr.
Query Parser Query Optimizer
Query Dispatch
Resource Mgr.
NameNod...
Plan
Query Example
HAWQ Master
Metadata
Transaction Mgr.
Query Parser Query Optimizer
Query Dispatch
Resource Mgr.
NameNod...
Plan
Query Example
HAWQ Master
Metadata
Transaction Mgr.
Query Parser Query Optimizer
Query Dispatch
Resource Mgr.
NameNod...
Plan
Query Example
HAWQ Master
Metadata
Transaction Mgr.
Query Parser Query Optimizer
Query Dispatch
Resource Mgr.
NameNod...
Plan
Query Example
HAWQ Master
Metadata
Transaction Mgr.
Query Parser Query Optimizer
Query Dispatch
Resource Mgr.
NameNod...
ResourcePlan
Query Example
HAWQ Master
Metadata
Transaction Mgr.
Query Parser Query Optimizer
Query Dispatch
Resource Mgr....
ResourcePlan
Query Example
HAWQ Master
Metadata
Transaction Mgr.
Query Parser Query Optimizer
Query Dispatch
Resource Mgr....
ResourcePlan
Query Example
HAWQ Master
Metadata
Transaction Mgr.
Query Parser Query Optimizer
Query Dispatch
Resource Mgr....
ResourcePlan
Query Example
HAWQ Master
Metadata
Transaction Mgr.
Query Parser Query Optimizer
Query Dispatch
Resource Mgr....
ResourcePlan
Query Example
HAWQ Master
Metadata
Transaction Mgr.
Query Parser Query Optimizer
Query Dispatch
Resource Mgr....
ResourcePlan
Query Example
HAWQ Master
Metadata
Transaction Mgr.
Query Parser Query Optimizer
Query Dispatch
Resource Mgr....
ResourcePlan
Query Example
HAWQ Master
Metadata
Transaction Mgr.
Query Parser Query Optimizer
Query Dispatch
Resource Mgr....
ResourcePlan
Query Example
HAWQ Master
Metadata
Transaction Mgr.
Query Parser Query Optimizer
Query Dispatch
Resource Mgr....
ResourcePlan
Query Example
HAWQ Master
Metadata
Transaction Mgr.
Query Parser Query Optimizer
Query Dispatch
Resource Mgr....
ResourcePlan
Query Example
HAWQ Master
Metadata
Transaction Mgr.
Query Parser Query Optimizer
Query Dispatch
Resource Mgr....
ResourcePlan
Query Example
HAWQ Master
Metadata
Transaction Mgr.
Query Parser Query Optimizer
Query Dispatch
Resource Mgr....
ResourcePlan
Query Example
HAWQ Master
Metadata
Transaction Mgr.
Query Parser Query Optimizer
Query Dispatch
Resource Mgr....
ResourcePlan
Query Example
HAWQ Master
Metadata
Transaction Mgr.
Query Parser Query Optimizer
Query Dispatch
Resource Mgr....
ResourcePlan
Query Example
HAWQ Master
Metadata
Transaction Mgr.
Query Parser Query Optimizer
Query Dispatch
Resource Mgr....
ResourcePlan
Query Example
HAWQ Master
Metadata
Transaction Mgr.
Query Parser Query Optimizer
Query Dispatch
Resource Mgr....
ResourcePlan
Query Example
HAWQ Master
Metadata
Transaction Mgr.
Query Parser Query Optimizer
Query Dispatch
Resource Mgr....
Query Performance
• Data does not hit the disk unless this cannot be
avoided
Query Performance
• Data does not hit the disk unless this cannot be
avoided
• Data is not buffered on the segments unless...
Query Performance
• Data does not hit the disk unless this cannot be
avoided
• Data is not buffered on the segments unless...
Query Performance
• Data does not hit the disk unless this cannot be
avoided
• Data is not buffered on the segments unless...
Query Performance
• Data does not hit the disk unless this cannot be
avoided
• Data is not buffered on the segments unless...
Query Performance
• Data does not hit the disk unless this cannot be
avoided
• Data is not buffered on the segments unless...
Competitive Solutions
Hive SparkSQL Impala HAWQ
Query Optimizer
Competitive Solutions
Hive SparkSQL Impala HAWQ
Query Optimizer
ANSI SQL
Competitive Solutions
Hive SparkSQL Impala HAWQ
Query Optimizer
ANSI SQL
Built-in Languages
Competitive Solutions
Hive SparkSQL Impala HAWQ
Query Optimizer
ANSI SQL
Built-in Languages
Disk IO
Competitive Solutions
Hive SparkSQL Impala HAWQ
Query Optimizer
ANSI SQL
Built-in Languages
Disk IO
Parallelism
Competitive Solutions
Hive SparkSQL Impala HAWQ
Query Optimizer
ANSI SQL
Built-in Languages
Disk IO
Parallelism
Distributi...
Competitive Solutions
Hive SparkSQL Impala HAWQ
Query Optimizer
ANSI SQL
Built-in Languages
Disk IO
Parallelism
Distributi...
Competitive Solutions
Hive SparkSQL Impala HAWQ
Query Optimizer
ANSI SQL
Built-in Languages
Disk IO
Parallelism
Distributi...
Roadmap
• AWS and S3 integration
Roadmap
• AWS and S3 integration
• Mesos integration
Roadmap
• AWS and S3 integration
• Mesos integration
• Better Ambari integration
Roadmap
• AWS and S3 integration
• Mesos integration
• Better Ambari integration
• Cloudera, MapR and IBM Hadoop distribut...
Roadmap
• AWS and S3 integration
• Mesos integration
• Better Ambari integration
• Cloudera, MapR and IBM Hadoop distribut...
Summary
• Modern SQL-on-Hadoop engine
• For structured data processing and analysis
• Combines the best techniques of comp...
Questions
Apache HAWQ
http://hawq.incubator.apache.org
dev@hawq.incubator.apache.org
user@hawq.incubator.apache.org
Reach ...
Upcoming SlideShare
Loading in …5
×

Apache HAWQ Architecture

79,788 views

Published on

I gave this talk on the Highload++ conference 2015 in Moscow. Slides have been translated into English. They cover the Apache HAWQ components, its architecture, query processing logic, and also competitive information

Published in: Data & Analytics
  • Hello! Get Your Professional Job-Winning Resume Here - Check our website! https://vk.cc/818RFv
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here
  • GOOD SHARE!!
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here

Apache HAWQ Architecture

  1. 1. HAWQ Architecture Alexey Grishchenko
  2. 2. Who I am Enterprise Architect @ Pivotal • 7 years in data processing • 5 years of experience with MPP • 4 years with Hadoop • Using HAWQ since the first internal Beta • Responsible for designing most of the EMEA HAWQ and Greenplum implementations • Spark contributor • http://0x0fff.com
  3. 3. Agenda • What is HAWQ
  4. 4. Agenda • What is HAWQ • Why you need it
  5. 5. Agenda • What is HAWQ • Why you need it • HAWQ Components
  6. 6. Agenda • What is HAWQ • Why you need it • HAWQ Components • HAWQ Design
  7. 7. Agenda • What is HAWQ • Why you need it • HAWQ Components • HAWQ Design • Query execution example
  8. 8. Agenda • What is HAWQ • Why you need it • HAWQ Components • HAWQ Design • Query execution example • Competitive solutions
  9. 9. What is • Analytical SQL-on-Hadoop engine
  10. 10. What is • Analytical SQL-on-Hadoop engine • HAdoop With Queries
  11. 11. What is • Analytical SQL-on-Hadoop engine • HAdoop With Queries Postgres Greenplum HAWQ 2005 Fork Postgres 8.0.2
  12. 12. What is • Analytical SQL-on-Hadoop engine • HAdoop With Queries Postgres HAWQ 2005 Fork Postgres 8.0.2 2009 Rebase Postgres 8.2.15 Greenplum
  13. 13. What is • Analytical SQL-on-Hadoop engine • HAdoop With Queries Postgres HAWQ 2005 Fork Postgres 8.0.2 2009 Rebase Postgres 8.2.15 2011 Fork GPDB 4.2.0.0 Greenplum
  14. 14. What is • Analytical SQL-on-Hadoop engine • HAdoop With Queries Postgres HAWQ 2005 Fork Postgres 8.0.2 2009 Rebase Postgres 8.2.15 2011 Fork GPDB 4.2.0.0 2013 HAWQ 1.0.0.0 Greenplum
  15. 15. What is • Analytical SQL-on-Hadoop engine • HAdoop With Queries Postgres HAWQ 2005 Fork Postgres 8.0.2 2009 Rebase Postgres 8.2.15 2011 Fork GPDB 4.2.0.0 2013 HAWQ 1.0.0.0 HAWQ 2.0.0.0 Open Source 2015 Greenplum
  16. 16. HAWQ is … • 1’500’000 C and C++ lines of code
  17. 17. HAWQ is … • 1’500’000 C and C++ lines of code – 200’000 of them in headers only
  18. 18. HAWQ is … • 1’500’000 C and C++ lines of code – 200’000 of them in headers only • 180’000 Python LOC
  19. 19. HAWQ is … • 1’500’000 C and C++ lines of code – 200’000 of them in headers only • 180’000 Python LOC • 60’000 Java LOC
  20. 20. HAWQ is … • 1’500’000 C and C++ lines of code – 200’000 of them in headers only • 180’000 Python LOC • 60’000 Java LOC • 23’000 Makefile LOC
  21. 21. HAWQ is … • 1’500’000 C and C++ lines of code – 200’000 of them in headers only • 180’000 Python LOC • 60’000 Java LOC • 23’000 Makefile LOC • 7’000 Shell scripts LOC
  22. 22. HAWQ is … • 1’500’000 C and C++ lines of code – 200’000 of them in headers only • 180’000 Python LOC • 60’000 Java LOC • 23’000 Makefile LOC • 7’000 Shell scripts LOC • More than 50 enterprise customers
  23. 23. HAWQ is … • 1’500’000 C and C++ lines of code – 200’000 of them in headers only • 180’000 Python LOC • 60’000 Java LOC • 23’000 Makefile LOC • 7’000 Shell scripts LOC • More than 50 enterprise customers – More than 10 of them in EMEA
  24. 24. Apache HAWQ • Apache HAWQ (incubating) from 09’2015 – http://hawq.incubator.apache.org – https://github.com/apache/incubator-hawq • What’s in Open Source – Sources of HAWQ 2.0 alpha – HAWQ 2.0 beta is planned for 2015’Q4 – HAWQ 2.0 GA is planned for 2016’Q1 • Community is yet young – come and join!
  25. 25. Why do we need it?
  26. 26. Why do we need it? • SQL-interface for BI solutions to the Hadoop data complaint with ANSI SQL-92, -99, -2003
  27. 27. Why do we need it? • SQL-interface for BI solutions to the Hadoop data complaint with ANSI SQL-92, -99, -2003 – Example - 5000-line query with a number of window function generated by Cognos
  28. 28. Why do we need it? • SQL-interface for BI solutions to the Hadoop data complaint with ANSI SQL-92, -99, -2003 – Example - 5000-line query with a number of window function generated by Cognos • Universal tool for ad hoc analytics on top of Hadoop data
  29. 29. Why do we need it? • SQL-interface for BI solutions to the Hadoop data complaint with ANSI SQL-92, -99, -2003 – Example - 5000-line query with a number of window function generated by Cognos • Universal tool for ad hoc analytics on top of Hadoop data – Example - parse URL to extract protocol, host name, port, GET parameters
  30. 30. Why do we need it? • SQL-interface for BI solutions to the Hadoop data complaint with ANSI SQL-92, -99, -2003 – Example - 5000-line query with a number of window function generated by Cognos • Universal tool for ad hoc analytics on top of Hadoop data – Example - parse URL to extract protocol, host name, port, GET parameters • Good performance
  31. 31. Why do we need it? • SQL-interface for BI solutions to the Hadoop data complaint with ANSI SQL-92, -99, -2003 – Example - 5000-line query with a number of window function generated by Cognos • Universal tool for ad hoc analytics on top of Hadoop data – Example - parse URL to extract protocol, host name, port, GET parameters • Good performance – How many times the data would hit the HDD during a single Hive query?
  32. 32. HAWQ Cluster Server 1 SNameNode Server 4 ZK JM NameNode Server 3 ZK JM Server 2 ZK JM Server 6 Datanode Server N Datanode Server 5 Datanode interconnect …
  33. 33. HAWQ Cluster Server 1 SNameNode Server 4 ZK JM NameNode Server 3 ZK JM Server 2 ZK JM Server 6 Datanode Server N Datanode Server 5 Datanode YARN NM YARN NM YARN NM YARN RM YARN App Timeline interconnect …
  34. 34. HAWQ Cluster HAWQ Master Server 1 SNameNode Server 4 ZK JM NameNode Server 3 ZK JM HAWQ Standby Server 2 ZK JM HAWQ Segment Server 6 Datanode HAWQ Segment Server N Datanode HAWQ Segment Server 5 Datanode YARN NM YARN NM YARN NM YARN RM YARN App Timeline interconnect …
  35. 35. Master Servers Server 1 SNameNode Server 4 ZK JM NameNode Server 3 ZK JM Server 2 ZK JM HAWQ Segment Server 6 Datanode HAWQ Segment Server N Datanode HAWQ Segment Server 5 Datanode YARN NM YARN NM YARN NM YARN RM YARN App Timeline interconnect … HAWQ Master HAWQ Standby
  36. 36. Master Servers HAWQ Master Query Parser Query Optimizer Global Resource Manager Distributed Transactions Manager Query Dispatch Metadata Catalog HAWQ Standby Master Query Parser Query Optimizer Global Resource Manager Distributed Transactions Manager Query Dispatch Metadata Catalog WAL repl.
  37. 37. HAWQ Master HAWQ Standby Segments Server 1 SNameNode Server 4 ZK JM NameNode Server 3 ZK JM Server 2 ZK JM Server 6 Datanode Server N Datanode Server 5 Datanode YARN NM YARN NM YARN NM YARN RM YARN App Timeline interconnect HAWQ Segment HAWQ SegmentHAWQ Segment …
  38. 38. Segments HAWQ Segment Query Executor libhdfs3 PXF HDFS Datanode Local Filesystem Temporary Data Directory Logs YARN Node Manager
  39. 39. Metadata • HAWQ metadata structure is similar to Postgres catalog structure
  40. 40. Metadata • HAWQ metadata structure is similar to Postgres catalog structure • Statistics – Number of rows and pages in the table
  41. 41. Metadata • HAWQ metadata structure is similar to Postgres catalog structure • Statistics – Number of rows and pages in the table – Most common values for each field
  42. 42. Metadata • HAWQ metadata structure is similar to Postgres catalog structure • Statistics – Number of rows and pages in the table – Most common values for each field – Histogram of values distribution for each field
  43. 43. Metadata • HAWQ metadata structure is similar to Postgres catalog structure • Statistics – Number of rows and pages in the table – Most common values for each field – Histogram of values distribution for each field – Number of unique values in the field
  44. 44. Metadata • HAWQ metadata structure is similar to Postgres catalog structure • Statistics – Number of rows and pages in the table – Most common values for each field – Histogram of values distribution for each field – Number of unique values in the field – Number of null values in the field
  45. 45. Metadata • HAWQ metadata structure is similar to Postgres catalog structure • Statistics – Number of rows and pages in the table – Most common values for each field – Histogram of values distribution for each field – Number of unique values in the field – Number of null values in the field – Average width of the field in bytes
  46. 46. Statistics No Statistics How many rows would produce the join of two tables?
  47. 47. Statistics No Statistics How many rows would produce the join of two tables?  From 0 to infinity
  48. 48. Statistics No Statistics Row Count How many rows would produce the join of two tables?  From 0 to infinity How many rows would produce the join of two 1000- row tables?
  49. 49. Statistics No Statistics Row Count How many rows would produce the join of two tables?  From 0 to infinity How many rows would produce the join of two 1000- row tables?  From 0 to 1’000’000
  50. 50. Statistics No Statistics Row Count Histograms and MCV How many rows would produce the join of two tables?  From 0 to infinity How many rows would produce the join of two 1000- row tables?  From 0 to 1’000’000 How many rows would produce the join of two 1000- row tables, with known field cardinality, values distribution diagram, number of nulls, most common values?
  51. 51. Statistics No Statistics Row Count Histograms and MCV How many rows would produce the join of two tables?  From 0 to infinity How many rows would produce the join of two 1000- row tables?  From 0 to 1’000’000 How many rows would produce the join of two 1000- row tables, with known field cardinality, values distribution diagram, number of nulls, most common values?  ~ From 500 to 1’500
  52. 52. Metadata • Table structure information ID Name Num Price 1 Яблоко 10 50 2 Груша 20 80 3 Банан 40 40 4 Апельсин 25 50 5 Киви 5 120 6 Арбуз 20 30 7 Дыня 40 100 8 Ананас 35 90
  53. 53. Metadata • Table structure information – Distribution fields ID Name Num Price 1 Яблоко 10 50 2 Груша 20 80 3 Банан 40 40 4 Апельсин 25 50 5 Киви 5 120 6 Арбуз 20 30 7 Дыня 40 100 8 Ананас 35 90 hash(ID)
  54. 54. Metadata • Table structure information – Distribution fields – Number of hash buckets ID Name Num Price 1 Яблоко 10 50 2 Груша 20 80 3 Банан 40 40 4 Апельсин 25 50 5 Киви 5 120 6 Арбуз 20 30 7 Дыня 40 100 8 Ананас 35 90 hash(ID) ID Name Num Price 1 Яблоко 10 50 2 Груша 20 80 3 Банан 40 40 4 Апельсин 25 50 5 Киви 5 120 6 Арбуз 20 30 7 Дыня 40 100 8 Ананас 35 90
  55. 55. Metadata • Table structure information – Distribution fields – Number of hash buckets – Partitioning (hash, list, range) ID Name Num Price 1 Яблоко 10 50 2 Груша 20 80 3 Банан 40 40 4 Апельсин 25 50 5 Киви 5 120 6 Арбуз 20 30 7 Дыня 40 100 8 Ананас 35 90 hash(ID) ID Name Num Price 1 Яблоко 10 50 2 Груша 20 80 3 Банан 40 40 4 Апельсин 25 50 5 Киви 5 120 6 Арбуз 20 30 7 Дыня 40 100 8 Ананас 35 90
  56. 56. Metadata • Table structure information – Distribution fields – Number of hash buckets – Partitioning (hash, list, range) • General metadata – Users and groups
  57. 57. Metadata • Table structure information – Distribution fields – Number of hash buckets – Partitioning (hash, list, range) • General metadata – Users and groups – Access privileges
  58. 58. Metadata • Table structure information – Distribution fields – Number of hash buckets – Partitioning (hash, list, range) • General metadata – Users and groups – Access privileges • Stored procedures – PL/pgSQL, PL/Java, PL/Python, PL/Perl, PL/R
  59. 59. Query Optimizer • HAWQ uses cost-based query optimizers
  60. 60. Query Optimizer • HAWQ uses cost-based query optimizers • You have two options – Planner – evolved from the Postgres query optimizer – ORCA (Pivotal Query Optimizer) – developed specifically for HAWQ
  61. 61. Query Optimizer • HAWQ uses cost-based query optimizers • You have two options – Planner – evolved from the Postgres query optimizer – ORCA (Pivotal Query Optimizer) – developed specifically for HAWQ • Optimizer hints work just like in Postgres – Enable/disable specific operation – Change the cost estimations for basic actions
  62. 62. Storage Formats Which storage format is the most optimal?
  63. 63. Storage Formats Which storage format is the most optimal?  It depends on what you mean by “optimal”
  64. 64. Storage Formats Which storage format is the most optimal?  It depends on what you mean by “optimal” – Minimal CPU usage for reading and writing the data
  65. 65. Storage Formats Which storage format is the most optimal?  It depends on what you mean by “optimal” – Minimal CPU usage for reading and writing the data – Minimal disk space usage
  66. 66. Storage Formats Which storage format is the most optimal?  It depends on what you mean by “optimal” – Minimal CPU usage for reading and writing the data – Minimal disk space usage – Minimal time to retrieve record by key
  67. 67. Storage Formats Which storage format is the most optimal?  It depends on what you mean by “optimal” – Minimal CPU usage for reading and writing the data – Minimal disk space usage – Minimal time to retrieve record by key – Minimal time to retrieve subset of columns – etc.
  68. 68. Storage Formats • Row-based storage format – Similar to Postgres heap storage • No toast • No ctid, xmin, xmax, cmin, cmax
  69. 69. Storage Formats • Row-based storage format – Similar to Postgres heap storage • No toast • No ctid, xmin, xmax, cmin, cmax – Compression • No compression • Quicklz • Zlib levels 1 - 9
  70. 70. Storage Formats • Apache Parquet – Mixed row-columnar table store, the data is split into “row groups” stored in columnar format
  71. 71. Storage Formats • Apache Parquet – Mixed row-columnar table store, the data is split into “row groups” stored in columnar format – Compression • No compression • Snappy • Gzip levels 1 – 9
  72. 72. Storage Formats • Apache Parquet – Mixed row-columnar table store, the data is split into “row groups” stored in columnar format – Compression • No compression • Snappy • Gzip levels 1 – 9 – The size of “row group” and page size can be set for each table separately
  73. 73. Resource Management • Two main options – Static resource split – HAWQ and YARN does not know about each other
  74. 74. Resource Management • Two main options – Static resource split – HAWQ and YARN does not know about each other – YARN – HAWQ asks YARN Resource Manager for query execution resources
  75. 75. Resource Management • Two main options – Static resource split – HAWQ and YARN does not know about each other – YARN – HAWQ asks YARN Resource Manager for query execution resources • Flexible cluster utilization – Query might run on a subset of nodes if it is small
  76. 76. Resource Management • Two main options – Static resource split – HAWQ and YARN does not know about each other – YARN – HAWQ asks YARN Resource Manager for query execution resources • Flexible cluster utilization – Query might run on a subset of nodes if it is small – Query might have many executors on each cluster node to make it run faster
  77. 77. Resource Management • Two main options – Static resource split – HAWQ and YARN does not know about each other – YARN – HAWQ asks YARN Resource Manager for query execution resources • Flexible cluster utilization – Query might run on a subset of nodes if it is small – Query might have many executors on each cluster node to make it run faster – You can control the parallelism of each query
  78. 78. Resource Management • Resource Queue can be set with – Maximum number of parallel queries
  79. 79. Resource Management • Resource Queue can be set with – Maximum number of parallel queries – CPU usage priority
  80. 80. Resource Management • Resource Queue can be set with – Maximum number of parallel queries – CPU usage priority – Memory usage limits
  81. 81. Resource Management • Resource Queue can be set with – Maximum number of parallel queries – CPU usage priority – Memory usage limits – CPU cores usage limit
  82. 82. Resource Management • Resource Queue can be set with – Maximum number of parallel queries – CPU usage priority – Memory usage limits – CPU cores usage limit – MIN/MAX number of executors across the system
  83. 83. Resource Management • Resource Queue can be set with – Maximum number of parallel queries – CPU usage priority – Memory usage limits – CPU cores usage limit – MIN/MAX number of executors across the system – MIN/MAX number of executors on each node
  84. 84. Resource Management • Resource Queue can be set with – Maximum number of parallel queries – CPU usage priority – Memory usage limits – CPU cores usage limit – MIN/MAX number of executors across the system – MIN/MAX number of executors on each node • Can be set up for user or group
  85. 85. External Data • PXF – Framework for external data access – Easy to extend, many public plugins available – Official plugins: CSV, SequenceFile, Avro, Hive, HBase – Open Source plugins: JSON, Accumulo, Cassandra, JDBC, Redis, Pipe
  86. 86. External Data • PXF – Framework for external data access – Easy to extend, many public plugins available – Official plugins: CSV, SequenceFile, Avro, Hive, HBase – Open Source plugins: JSON, Accumulo, Cassandra, JDBC, Redis, Pipe • HCatalog – HAWQ can query tables from HCatalog the same way as HAWQ native tables
  87. 87. Query Example HAWQ Master Metadata Transaction Mgr. Query Parser Query Optimizer Query Dispatch Resource Mgr. NameNode Server 1 Local directory HAWQ Segment Postmaster HDFS Datanode Server 2 Local directory HAWQ Segment Postmaster HDFS Datanode Server N Local directory HAWQ Segment Postmaster HDFS Datanode YARN RMPostmaster Resource Prepare Execute Result CleanupPlan
  88. 88. Query Example HAWQ Master Metadata Transaction Mgr. Query Parser Query Optimizer Query Dispatch Resource Mgr. NameNode Server 1 Local directory HAWQ Segment Postmaster HDFS Datanode Server 2 Local directory HAWQ Segment Postmaster HDFS Datanode Server N Local directory HAWQ Segment Postmaster HDFS Datanode YARN RMPostmaster Resource Prepare Execute Result CleanupPlan QE
  89. 89. Query Example HAWQ Master Metadata Transaction Mgr. Query Parser Query Optimizer Query Dispatch Resource Mgr. NameNode Server 1 Local directory HAWQ Segment Postmaster HDFS Datanode Server 2 Local directory HAWQ Segment Postmaster HDFS Datanode Server N Local directory HAWQ Segment Postmaster HDFS Datanode YARN RMPostmaster Resource Prepare Execute Result CleanupPlan QE
  90. 90. Query Example HAWQ Master Metadata Transaction Mgr. Query Parser Query Optimizer Query Dispatch Resource Mgr. NameNode Server 1 Local directory HAWQ Segment Postmaster HDFS Datanode Server 2 Local directory HAWQ Segment Postmaster HDFS Datanode Server N Local directory HAWQ Segment Postmaster HDFS Datanode YARN RMPostmaster Resource Prepare Execute Result CleanupPlan QE
  91. 91. Query Example HAWQ Master Metadata Transaction Mgr. Query Parser Query Optimizer Query Dispatch Resource Mgr. NameNode Server 1 Local directory HAWQ Segment Postmaster HDFS Datanode Server 2 Local directory HAWQ Segment Postmaster HDFS Datanode Server N Local directory HAWQ Segment Postmaster HDFS Datanode YARN RMPostmaster Resource Prepare Execute Result CleanupPlan QE
  92. 92. Query Example HAWQ Master Metadata Transaction Mgr. Query Parser Query Optimizer Query Dispatch Resource Mgr. NameNode Server 1 Local directory HAWQ Segment Postmaster HDFS Datanode Server 2 Local directory HAWQ Segment Postmaster HDFS Datanode Server N Local directory HAWQ Segment Postmaster HDFS Datanode YARN RMPostmaster Resource Prepare Execute Result CleanupPlan QE ScanBars b HashJoinb.name =s.bar ScanSells s Filterb.city ='SanFrancisco' Projects.beer, s.price MotionGather MotionRedist(b.name)
  93. 93. Plan Query Example HAWQ Master Metadata Transaction Mgr. Query Parser Query Optimizer Query Dispatch Resource Mgr. NameNode Server 1 Local directory HAWQ Segment Postmaster HDFS Datanode Server 2 Local directory HAWQ Segment Postmaster HDFS Datanode Server N Local directory HAWQ Segment Postmaster HDFS Datanode YARN RMPostmaster Prepare Execute Result Cleanup QE Resource
  94. 94. Plan Query Example HAWQ Master Metadata Transaction Mgr. Query Parser Query Optimizer Query Dispatch Resource Mgr. NameNode Server 1 Local directory HAWQ Segment Postmaster HDFS Datanode Server 2 Local directory HAWQ Segment Postmaster HDFS Datanode Server N Local directory HAWQ Segment Postmaster HDFS Datanode YARN RMPostmaster Prepare Execute Result Cleanup QE Resource I need 5 containers Each with 1 CPU core and 256 MB RAM
  95. 95. Plan Query Example HAWQ Master Metadata Transaction Mgr. Query Parser Query Optimizer Query Dispatch Resource Mgr. NameNode Server 1 Local directory HAWQ Segment Postmaster HDFS Datanode Server 2 Local directory HAWQ Segment Postmaster HDFS Datanode Server N Local directory HAWQ Segment Postmaster HDFS Datanode YARN RMPostmaster Prepare Execute Result Cleanup QE Resource I need 5 containers Each with 1 CPU core and 256 MB RAM Server 1: 2 containers Server 2: 1 container Server N: 2 containers
  96. 96. Plan Query Example HAWQ Master Metadata Transaction Mgr. Query Parser Query Optimizer Query Dispatch Resource Mgr. NameNode Server 1 Local directory HAWQ Segment Postmaster HDFS Datanode Server 2 Local directory HAWQ Segment Postmaster HDFS Datanode Server N Local directory HAWQ Segment Postmaster HDFS Datanode YARN RMPostmaster Prepare Execute Result Cleanup QE Resource I need 5 containers Each with 1 CPU core and 256 MB RAM Server 1: 2 containers Server 2: 1 container Server N: 2 containers
  97. 97. Plan Query Example HAWQ Master Metadata Transaction Mgr. Query Parser Query Optimizer Query Dispatch Resource Mgr. NameNode Server 1 Local directory HAWQ Segment Postmaster HDFS Datanode Server 2 Local directory HAWQ Segment Postmaster HDFS Datanode Server N Local directory HAWQ Segment Postmaster HDFS Datanode YARN RMPostmaster Prepare Execute Result Cleanup QE Resource I need 5 containers Each with 1 CPU core and 256 MB RAM Server 1: 2 containers Server 2: 1 container Server N: 2 containers
  98. 98. Plan Query Example HAWQ Master Metadata Transaction Mgr. Query Parser Query Optimizer Query Dispatch Resource Mgr. NameNode Server 1 Local directory HAWQ Segment Postmaster HDFS Datanode Server 2 Local directory HAWQ Segment Postmaster HDFS Datanode Server N Local directory HAWQ Segment Postmaster HDFS Datanode YARN RMPostmaster Prepare Execute Result Cleanup QE Resource I need 5 containers Each with 1 CPU core and 256 MB RAM Server 1: 2 containers Server 2: 1 container Server N: 2 containers QE QE QE QE QE
  99. 99. ResourcePlan Query Example HAWQ Master Metadata Transaction Mgr. Query Parser Query Optimizer Query Dispatch Resource Mgr. NameNode Server 1 Local directory HAWQ Segment Postmaster HDFS Datanode Server 2 Local directory HAWQ Segment Postmaster HDFS Datanode Server N Local directory HAWQ Segment Postmaster HDFS Datanode YARN RMPostmaster Execute Result Cleanup QE QE QE QE QE QE Prepare
  100. 100. ResourcePlan Query Example HAWQ Master Metadata Transaction Mgr. Query Parser Query Optimizer Query Dispatch Resource Mgr. NameNode Server 1 Local directory HAWQ Segment Postmaster HDFS Datanode Server 2 Local directory HAWQ Segment Postmaster HDFS Datanode Server N Local directory HAWQ Segment Postmaster HDFS Datanode YARN RMPostmaster Execute Result Cleanup QE QE QE QE QE QE Prepare ScanBars b HashJoinb.name =s.bar ScanSells s Filterb.city ='SanFrancisco' Projects.beer, s.price MotionGather MotionRedist(b.name)
  101. 101. ResourcePlan Query Example HAWQ Master Metadata Transaction Mgr. Query Parser Query Optimizer Query Dispatch Resource Mgr. NameNode Server 1 Local directory HAWQ Segment Postmaster HDFS Datanode Server 2 Local directory HAWQ Segment Postmaster HDFS Datanode Server N Local directory HAWQ Segment Postmaster HDFS Datanode YARN RMPostmaster Execute Result Cleanup QE QE QE QE QE QE Prepare ScanBars b HashJoinb.name =s.bar ScanSells s Filterb.city ='SanFrancisco' Projects.beer, s.price MotionGather MotionRedist(b.name)
  102. 102. ResourcePlan Query Example HAWQ Master Metadata Transaction Mgr. Query Parser Query Optimizer Query Dispatch Resource Mgr. NameNode Server 1 Local directory HAWQ Segment Postmaster HDFS Datanode Server 2 Local directory HAWQ Segment Postmaster HDFS Datanode Server N Local directory HAWQ Segment Postmaster HDFS Datanode YARN RMPostmaster Result Cleanup QE QE QE QE QE QE Prepare Execute ScanBars b HashJoinb.name =s.bar ScanSells s Filterb.city ='SanFrancisco' Projects.beer, s.price MotionGather MotionRedist(b.name)
  103. 103. ResourcePlan Query Example HAWQ Master Metadata Transaction Mgr. Query Parser Query Optimizer Query Dispatch Resource Mgr. NameNode Server 1 Local directory HAWQ Segment Postmaster HDFS Datanode Server 2 Local directory HAWQ Segment Postmaster HDFS Datanode Server N Local directory HAWQ Segment Postmaster HDFS Datanode YARN RMPostmaster Result Cleanup QE QE QE QE QE QE Prepare Execute ScanBars b HashJoinb.name =s.bar ScanSells s Filterb.city ='SanFrancisco' Projects.beer, s.price MotionGather MotionRedist(b.name)
  104. 104. ResourcePlan Query Example HAWQ Master Metadata Transaction Mgr. Query Parser Query Optimizer Query Dispatch Resource Mgr. NameNode Server 1 Local directory HAWQ Segment Postmaster HDFS Datanode Server 2 Local directory HAWQ Segment Postmaster HDFS Datanode Server N Local directory HAWQ Segment Postmaster HDFS Datanode YARN RMPostmaster Result Cleanup QE QE QE QE QE QE Prepare Execute ScanBars b HashJoinb.name =s.bar ScanSells s Filterb.city ='SanFrancisco' Projects.beer, s.price MotionGather MotionRedist(b.name)
  105. 105. ResourcePlan Query Example HAWQ Master Metadata Transaction Mgr. Query Parser Query Optimizer Query Dispatch Resource Mgr. NameNode Server 1 Local directory HAWQ Segment Postmaster HDFS Datanode Server 2 Local directory HAWQ Segment Postmaster HDFS Datanode Server N Local directory HAWQ Segment Postmaster HDFS Datanode YARN RMPostmaster Cleanup QE QE QE QE QE QE Prepare Execute Result
  106. 106. ResourcePlan Query Example HAWQ Master Metadata Transaction Mgr. Query Parser Query Optimizer Query Dispatch Resource Mgr. NameNode Server 1 Local directory HAWQ Segment Postmaster HDFS Datanode Server 2 Local directory HAWQ Segment Postmaster HDFS Datanode Server N Local directory HAWQ Segment Postmaster HDFS Datanode YARN RMPostmaster Cleanup QE QE QE QE QE QE Prepare Execute Result
  107. 107. ResourcePlan Query Example HAWQ Master Metadata Transaction Mgr. Query Parser Query Optimizer Query Dispatch Resource Mgr. NameNode Server 1 Local directory HAWQ Segment Postmaster HDFS Datanode Server 2 Local directory HAWQ Segment Postmaster HDFS Datanode Server N Local directory HAWQ Segment Postmaster HDFS Datanode YARN RMPostmaster Cleanup QE QE QE QE QE QE Prepare Execute Result
  108. 108. ResourcePlan Query Example HAWQ Master Metadata Transaction Mgr. Query Parser Query Optimizer Query Dispatch Resource Mgr. NameNode Server 1 Local directory HAWQ Segment Postmaster HDFS Datanode Server 2 Local directory HAWQ Segment Postmaster HDFS Datanode Server N Local directory HAWQ Segment Postmaster HDFS Datanode YARN RMPostmaster QE QE QE QE QE QE Prepare Execute Result Cleanup
  109. 109. ResourcePlan Query Example HAWQ Master Metadata Transaction Mgr. Query Parser Query Optimizer Query Dispatch Resource Mgr. NameNode Server 1 Local directory HAWQ Segment Postmaster HDFS Datanode Server 2 Local directory HAWQ Segment Postmaster HDFS Datanode Server N Local directory HAWQ Segment Postmaster HDFS Datanode YARN RMPostmaster QE QE QE QE QE QE Prepare Execute Result Cleanup Free query resources Server 1: 2 containers Server 2: 1 container Server N: 2 containers
  110. 110. ResourcePlan Query Example HAWQ Master Metadata Transaction Mgr. Query Parser Query Optimizer Query Dispatch Resource Mgr. NameNode Server 1 Local directory HAWQ Segment Postmaster HDFS Datanode Server 2 Local directory HAWQ Segment Postmaster HDFS Datanode Server N Local directory HAWQ Segment Postmaster HDFS Datanode YARN RMPostmaster QE QE QE QE QE QE Prepare Execute Result Cleanup Free query resources Server 1: 2 containers Server 2: 1 container Server N: 2 containers OK
  111. 111. ResourcePlan Query Example HAWQ Master Metadata Transaction Mgr. Query Parser Query Optimizer Query Dispatch Resource Mgr. NameNode Server 1 Local directory HAWQ Segment Postmaster HDFS Datanode Server 2 Local directory HAWQ Segment Postmaster HDFS Datanode Server N Local directory HAWQ Segment Postmaster HDFS Datanode YARN RMPostmaster QE QE QE QE QE QE Prepare Execute Result Cleanup Free query resources Server 1: 2 containers Server 2: 1 container Server N: 2 containers OK
  112. 112. ResourcePlan Query Example HAWQ Master Metadata Transaction Mgr. Query Parser Query Optimizer Query Dispatch Resource Mgr. NameNode Server 1 Local directory HAWQ Segment Postmaster HDFS Datanode Server 2 Local directory HAWQ Segment Postmaster HDFS Datanode Server N Local directory HAWQ Segment Postmaster HDFS Datanode YARN RMPostmaster QE QE QE QE QE QE Prepare Execute Result Cleanup Free query resources Server 1: 2 containers Server 2: 1 container Server N: 2 containers OK
  113. 113. ResourcePlan Query Example HAWQ Master Metadata Transaction Mgr. Query Parser Query Optimizer Query Dispatch Resource Mgr. NameNode Server 1 Local directory HAWQ Segment Postmaster HDFS Datanode Server 2 Local directory HAWQ Segment Postmaster HDFS Datanode Server N Local directory HAWQ Segment Postmaster HDFS Datanode YARN RMPostmaster QE QE QE QE QE QE Prepare Execute Result Cleanup Free query resources Server 1: 2 containers Server 2: 1 container Server N: 2 containers OK
  114. 114. ResourcePlan Query Example HAWQ Master Metadata Transaction Mgr. Query Parser Query Optimizer Query Dispatch Resource Mgr. NameNode Server 1 Local directory HAWQ Segment Postmaster HDFS Datanode Server 2 Local directory HAWQ Segment Postmaster HDFS Datanode Server N Local directory HAWQ Segment Postmaster HDFS Datanode YARN RMPostmaster Prepare Execute Result Cleanup
  115. 115. Query Performance • Data does not hit the disk unless this cannot be avoided
  116. 116. Query Performance • Data does not hit the disk unless this cannot be avoided • Data is not buffered on the segments unless this cannot be avoided
  117. 117. Query Performance • Data does not hit the disk unless this cannot be avoided • Data is not buffered on the segments unless this cannot be avoided • Data is transferred between the nodes by UDP
  118. 118. Query Performance • Data does not hit the disk unless this cannot be avoided • Data is not buffered on the segments unless this cannot be avoided • Data is transferred between the nodes by UDP • HAWQ has a good cost-based query optimizer
  119. 119. Query Performance • Data does not hit the disk unless this cannot be avoided • Data is not buffered on the segments unless this cannot be avoided • Data is transferred between the nodes by UDP • HAWQ has a good cost-based query optimizer • C/C++ implementation is more efficient than Java implementation of competitive solutions
  120. 120. Query Performance • Data does not hit the disk unless this cannot be avoided • Data is not buffered on the segments unless this cannot be avoided • Data is transferred between the nodes by UDP • HAWQ has a good cost-based query optimizer • C/C++ implementation is more efficient than Java implementation of competitive solutions • Query parallelism can be easily tuned
  121. 121. Competitive Solutions Hive SparkSQL Impala HAWQ Query Optimizer
  122. 122. Competitive Solutions Hive SparkSQL Impala HAWQ Query Optimizer ANSI SQL
  123. 123. Competitive Solutions Hive SparkSQL Impala HAWQ Query Optimizer ANSI SQL Built-in Languages
  124. 124. Competitive Solutions Hive SparkSQL Impala HAWQ Query Optimizer ANSI SQL Built-in Languages Disk IO
  125. 125. Competitive Solutions Hive SparkSQL Impala HAWQ Query Optimizer ANSI SQL Built-in Languages Disk IO Parallelism
  126. 126. Competitive Solutions Hive SparkSQL Impala HAWQ Query Optimizer ANSI SQL Built-in Languages Disk IO Parallelism Distributions
  127. 127. Competitive Solutions Hive SparkSQL Impala HAWQ Query Optimizer ANSI SQL Built-in Languages Disk IO Parallelism Distributions Stability
  128. 128. Competitive Solutions Hive SparkSQL Impala HAWQ Query Optimizer ANSI SQL Built-in Languages Disk IO Parallelism Distributions Stability Community
  129. 129. Roadmap • AWS and S3 integration
  130. 130. Roadmap • AWS and S3 integration • Mesos integration
  131. 131. Roadmap • AWS and S3 integration • Mesos integration • Better Ambari integration
  132. 132. Roadmap • AWS and S3 integration • Mesos integration • Better Ambari integration • Cloudera, MapR and IBM Hadoop distributions native support
  133. 133. Roadmap • AWS and S3 integration • Mesos integration • Better Ambari integration • Cloudera, MapR and IBM Hadoop distributions native support • Make the SQL-on-Hadoop engine ever!
  134. 134. Summary • Modern SQL-on-Hadoop engine • For structured data processing and analysis • Combines the best techniques of competitive solutions • Just released to the open source • Community is very young Join our community and contribute!
  135. 135. Questions Apache HAWQ http://hawq.incubator.apache.org dev@hawq.incubator.apache.org user@hawq.incubator.apache.org Reach me on http://0x0fff.com

×