Pivotal hawq internals

Alexey Grishchenko, Technical Evangelist at Pivotal
Pivotal HAWQ 
A.Grishchenko 
HadoopKitchen @ Mail.ru 
27 Sep 2014 
Pivotal Confidential––Internal Use Only 1
SQL-on-Hadoop Solutions 
Timeline: 2008 Hive
Hive
• Developed by Facebook
– Hive is used for data analysis in their data warehouse
– DWH size is ~300 PB at the moment, and ~600 TB of data is loaded daily. Data is compressed using ORCFile, with a compression ratio of ~8x
• HiveQL language is not compatible with ANSI SQL-92
• Has many limitations on subqueries
• Cost-based optimizer (Optiq) is only in technical preview now
SQL-on-Hadoop Solutions 
Timeline: 2008 Hive, 10.2012 Impala
Impala
• Developed by Cloudera
– Open-source solution
– Cloudera sells this solution to enterprise shops
– Was in beta until May 2013
• Supports HiveQL, moving toward complete ANSI SQL-92 support
• Written in C++, does not use MapReduce for running queries
• Requires a lot of memory; joins of big tables usually cause OOM errors
SQL-on-Hadoop Solutions 
Timeline: 2008 Hive, 10.2012 Impala, 02.2013 Stinger
Stinger
• Hortonworks initiative
– Consists of a number of steps to make Hive run 100x faster
• Tez – Hive queries are translated to Tez jobs, which are similar to MapReduce jobs but may have arbitrary topology
• Optiq – cost-based query optimizer for Hive (technical preview at the moment)
• ORCFile – columnar storage format with adaptive compression and inline indexes
• HIVE-5317 – ACID and UPDATE/DELETE support (release expected around 11.2014)
SQL-on-Hadoop Solutions 
Timeline: 2008 Hive, 10.2012 Impala, 02.2013 Stinger, 02.2013 HAWQ
HAWQ
• Pivotal product
– Greenplum MPP DBMS, ported to store data in HDFS
– Written in C, query optimizer is rewritten for this solution (ORCA)
• Supports ANSI SQL-92 and analytic extensions from SQL-2003
• Supports complex queries with correlated subqueries, window functions and different joins
• Data is put on disk only if the process does not have enough memory
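To make "correlated subqueries" concrete, here is a minimal sketch of such a query. It runs on an in-memory SQLite database purely as a stand-in engine, not on HAWQ; the `sells` table and its rows are hypothetical sample data.

```python
import sqlite3

# SQLite (in-memory) is only a stand-in SQL engine here; this is not HAWQ.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sells (bar TEXT, beer TEXT, price REAL);
    INSERT INTO sells VALUES
        ('Joe''s', 'Bud', 2.5),
        ('Joe''s', 'Miller', 3.0),
        ('Sue''s', 'Bud', 2.0),
        ('Sue''s', 'Miller', 4.0);
""")

# Correlated subquery: the inner query refers to s.beer from the outer row,
# so it is logically re-evaluated for every row of the outer query.
CHEAPEST_PER_BEER = """
    SELECT s.beer, s.bar, s.price
    FROM sells AS s
    WHERE s.price = (SELECT MIN(s2.price)
                     FROM sells AS s2
                     WHERE s2.beer = s.beer)
    ORDER BY s.beer
"""
cheapest = conn.execute(CHEAPEST_PER_BEER).fetchall()
print(cheapest)
```

The query returns, for each beer, the bar that sells it at the lowest price. A cost-based optimizer would typically try to rewrite this kind of query into a join instead of re-running the inner query per row.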
SQL-on-Hadoop Solutions 
Timeline: 2008 Hive, 10.2012 Impala, 02.2013 Stinger, 02.2013 HAWQ, 2014 Vertica, BigSQL
• HP Vertica
– Supports only the MapR distribution, as it requires updatable storage
– Supports ANSI SQL-92, SQL-2003
– Supports UPDATE/DELETE
– Officially announced as available in July 2014, no implementations yet
• IBM BigSQL v3
– IBM DB2 ported to store data in HDFS
– Federated queries, good query optimizer, etc.
• Both solutions are similar to Pivotal HAWQ in general idea
Pivotal HAWQ Components 
Diagram: cluster layout
Server 1: Master
Server 2: Standby Master
Server 3: Segment 1, Segment 2, …, Segment K
Server 4: Segment K+1, Segment K+2, …, Segment 2*K
…
Server M: …, Segment N
Pivotal HAWQ Components 
Diagram: cluster layout together with Hadoop services
Server 1: HAWQ Master
Server 2: HAWQ Standby Master
Server 3: NameNode
Server 4: SNameNode
ZK and QJM: three instances each
Servers 5, 6, …, M: Datanode, HAWQ Segment
Pivotal HAWQ Components 
Diagram: master components
HAWQ Master: Query Parser, Query Optimizer, Query Executor, Transaction Manager, Metadata Catalog, Process Manager
HAWQ Standby Master: Query Parser, Query Optimizer, Query Executor, Transaction Manager, Metadata Catalog, Process Manager
HAWQ Master → HAWQ Standby Master: WAL replication
Pivotal HAWQ Components 
• Metadata is stored only on the master servers
• Metadata is stored in a modified Postgres instance and replicated to the standby master with WAL
• Metadata contains
– Table information – schema, names, files
– Statistics – number of unique values, value ranges, sample values, etc.
– Information about users, groups, priorities, etc.
• Master server shutdown causes a switch to the standby, with the loss of running sessions
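The statistics listed above (number of unique values, value ranges, sample values) can be illustrated with a small sketch. The function name `column_stats` and the dictionary layout are assumptions made for this example; they do not describe how HAWQ actually stores statistics in its catalog.

```python
import random


def column_stats(values, sample_size=3, seed=0):
    """Collect simple per-column statistics of the kind a query optimizer uses.

    Hypothetical helper: the field names are illustrative only.
    """
    non_null = [v for v in values if v is not None]
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    return {
        "row_count": len(values),
        "null_count": len(values) - len(non_null),
        "distinct_count": len(set(non_null)),
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
        "sample": rng.sample(non_null, min(sample_size, len(non_null))),
    }


stats = column_stats([3.0, 2.5, None, 4.0, 2.5])
print(stats["distinct_count"], stats["min"], stats["max"])
```

An optimizer can use such numbers to estimate how many rows a filter or a join will produce before it chooses a plan.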
Pivotal HAWQ Components 
Diagram: segment components
HAWQ Segment: Query Executor, libhdfs3, PXF
HDFS Datanode: Segment Data Directory
Local Filesystem (xfs): Spill Data Directory
Pivotal HAWQ Components 
• Both masters and segments are modified Postgres instances (to be precise, modified Greenplum instances)
• When you open a connection to the master server, the postmaster forks a process that serves your session
• When query execution starts, your session connects to the segment instances, and they also fork processes to execute the query
• The query execution plan is split into independent blocks (slices); each slice is executed as a separate OS process on the segment server, and data moves between slices over UDP
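The idea of splitting a plan into slices can be sketched as cutting the plan tree at every Motion (data movement) node. The `PlanNode` class, the `split_into_slices` function and the example plan are hypothetical names and data chosen for illustration; this is a simplified model, not HAWQ's planner code.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PlanNode:
    name: str
    children: List["PlanNode"] = field(default_factory=list)

    @property
    def is_motion(self) -> bool:
        return self.name.startswith("Motion")


def split_into_slices(root: PlanNode) -> List[List[str]]:
    """Cut the plan below every Motion node; each resulting piece is one slice.

    The Motion node itself stays in the receiving (upper) slice.
    """
    slices: List[List[str]] = []

    def walk(node: PlanNode, current: List[str]) -> None:
        current.append(node.name)
        for child in node.children:
            if node.is_motion:
                # A Motion is a slice boundary: its input starts a new slice.
                new_slice: List[str] = []
                slices.append(new_slice)
                walk(child, new_slice)
            else:
                walk(child, current)

    top: List[str] = []
    slices.append(top)
    walk(root, top)
    return slices


# Hypothetical plan for a join of two tables.
plan = PlanNode("Motion Gather", [
    PlanNode("Project", [
        PlanNode("HashJoin", [
            PlanNode("Scan Sells"),
            PlanNode("Motion Redist", [
                PlanNode("Filter", [PlanNode("Scan Bars")]),
            ]),
        ]),
    ]),
])

slices = split_into_slices(plan)
for number, operators in enumerate(slices, start=1):
    print(number, operators)
```

With two Motion nodes the plan falls into three slices. In this model each slice would map to one OS process per participating server, and the Motion nodes mark where rows cross the network.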
Pivotal HAWQ Components 
• Tables can be stored as:
– Row-oriented (quicklz, zlib compression)
– Column-oriented (quicklz, zlib, rle compression)
– Parquet tables
• Each segment has a separate directory on HDFS where it stores its data shard
• With columnar storage, each column is stored as a separate file
• Parquet stores the table by columns without loading the NameNode with many files and block location requests
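A rough back-of-the-envelope model shows why the file count matters for the NameNode. The function below is a sketch under loudly stated assumptions: one data file per segment for row-oriented and Parquet tables, and one file per column per segment for column-oriented tables. Real file counts depend on details the slides do not cover.

```python
def data_file_count(segments: int, columns: int, layout: str) -> int:
    """Estimate the number of HDFS files for one table (simplified model).

    Assumptions (not taken from the slides): row-oriented and Parquet tables
    use one file per segment; column-oriented tables use one file per column
    per segment.
    """
    if segments <= 0 or columns <= 0:
        raise ValueError("segments and columns must be positive")
    if layout == "column":
        return segments * columns
    if layout in ("row", "parquet"):
        return segments
    raise ValueError(f"unknown layout: {layout!r}")


# Hypothetical cluster: 96 segments, a table with 50 columns.
for layout in ("row", "column", "parquet"):
    print(layout, data_file_count(segments=96, columns=50, layout=layout))
```

Under these assumptions the column-oriented table needs 4800 files while the Parquet table needs 96, which is the kind of gap that reduces NameNode metadata and block location requests.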
Query Execution in Pivotal HAWQ 
Diagram: query execution
HAWQ Master: Parser, Query Optimizer, Metadata, Transaction Manager, Process Manager, Query Executor
NameNode
HAWQ Segment (two shown), each with: Backend; HDFS Datanode with Segment Directory; Local Spill Directory
Query Execution in Pivotal HAWQ 
Diagram: query execution, with the query plan
HAWQ Master: Parser, Query Optimizer, Metadata, Transaction Manager, Process Manager, Query Executor
NameNode
HAWQ Segment (two shown), each with: Backend; HDFS Datanode with Segment Directory; Local Spill Directory
Query plan:
Motion Gather
  Project s.beer, s.price
    HashJoin b.name = s.bar
      Scan Sells (s)
      Motion Redist(b.name)
        Filter b.city = 'San Francisco'
          Scan Bars (b)
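The plan on this slide can be simulated in a few lines to show what the two Motion nodes do: Motion Redist sends each filtered Bars row to the segment chosen by a hash of b.name, and Motion Gather collects the per-segment join results. Everything concrete here is an assumption for illustration: the sample rows, the two-segment cluster, the use of crc32 as the hash, and Sells already being distributed by bar.

```python
import zlib

NUM_SEGMENTS = 2  # assumption: a two-segment cluster


def segment_for(key: str) -> int:
    """Pick the target segment by hashing the key (crc32 is a stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_SEGMENTS


# Hypothetical sample data: Bars rows as initially stored on each segment.
bars_by_segment = [
    [("Joe's", "San Francisco"), ("Moe's", "Boston")],
    [("Sue's", "San Francisco")],
]
sells_rows = [("Joe's", "Bud", 2.5), ("Sue's", "Miller", 4.0), ("Moe's", "Bud", 3.0)]

# Assumption: Sells is already distributed by bar, so only Bars has to move.
sells_by_segment = [[] for _ in range(NUM_SEGMENTS)]
for row in sells_rows:
    sells_by_segment[segment_for(row[0])].append(row)

# Slice 1: Scan Bars -> Filter b.city = 'San Francisco' -> Motion Redist(b.name)
redistributed = [[] for _ in range(NUM_SEGMENTS)]
for local_bars in bars_by_segment:
    for name, city in local_bars:
        if city == "San Francisco":
            redistributed[segment_for(name)].append((name, city))

# Slice 2: Scan Sells -> HashJoin b.name = s.bar -> Project s.beer, s.price
per_segment_results = []
for seg in range(NUM_SEGMENTS):
    bar_names = {name for name, _ in redistributed[seg]}  # hash table of bar names
    per_segment_results.append(
        [(beer, price) for bar, beer, price in sells_by_segment[seg] if bar in bar_names]
    )

# Slice 3: Motion Gather collects the per-segment results in one place.
result = sorted(row for rows in per_segment_results for row in rows)
print(result)
```

Because both tables end up partitioned by the same hash of the bar name, every segment can join its local rows without seeing the rest of the data.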
Query Execution in Pivotal HAWQ 
Diagram: query execution, with a Query Executor (QE) on each segment
HAWQ Master: Parser, Query Optimizer, Metadata, Transaction Manager, Process Manager, Query Executor
NameNode
HAWQ Segment (two shown), each with: Backend and QE; HDFS Datanode with Segment Directory; Local Spill Directory
Query plan: Motion Gather → Project s.beer, s.price → HashJoin b.name = s.bar → (Scan Sells s; Motion Redist(b.name) → Filter b.city = 'San Francisco' → Scan Bars b)
Query Execution in Pivotal HAWQ 
Diagram: query execution, with slices S1, S2, S3 on each segment
HAWQ Master: Parser, Query Optimizer, Metadata, Transaction Manager, Process Manager, Query Executor
NameNode
HAWQ Segment (two shown), each with: Backend and QE with S1, S2, S3; HDFS Datanode with Segment Directory; Local Spill Directory
Query plan: Motion Gather → Project s.beer, s.price → HashJoin b.name = s.bar → (Scan Sells s; Motion Redist(b.name) → Filter b.city = 'San Francisco' → Scan Bars b)
PXF Framework 
• Lets you read different data formats from HDFS
– Text files, both compressed and uncompressed
– Sequence files
– Avro files
• Able to read data from external data sources
– HBase
– Cassandra
– Redis
• Extensible API
PXF Framework
Diagram: PXF components
NameNode
HAWQ Master: PXF Fragmenter, Process Manager
HAWQ Segment (two shown), each with: Query Executor, PXF Accessor, PXF Fragmenter; HDFS Datanode with Segment Directory; Local Spill Directory
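The roles named in the diagram can be sketched as follows: a fragmenter splits a data source into independently readable fragments, and an accessor reads the records of one fragment on a segment. The classes below are hypothetical Python stand-ins that only borrow the role names; they are not the PXF API, and the round-robin assignment of fragments to segments is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List


@dataclass(frozen=True)
class Fragment:
    source: str  # e.g. a file path (hypothetical)
    start: int   # index of the first record in the fragment
    length: int  # number of records in the fragment


class LineFragmenter:
    """Splits a data source into fragments that can be read independently."""

    def __init__(self, lines_per_fragment: int) -> None:
        self.lines_per_fragment = lines_per_fragment

    def fragments(self, source: str, total_lines: int) -> List[Fragment]:
        return [
            Fragment(source, start, min(self.lines_per_fragment, total_lines - start))
            for start in range(0, total_lines, self.lines_per_fragment)
        ]


class LineAccessor:
    """Reads the records that belong to one fragment."""

    def __init__(self, data: List[str]) -> None:
        self.data = data

    def read(self, fragment: Fragment) -> Iterator[str]:
        yield from self.data[fragment.start:fragment.start + fragment.length]


NUM_SEGMENTS = 2  # assumption: two segments
data = ["r0", "r1", "r2", "r3", "r4"]  # stand-in for an external file

frags = LineFragmenter(lines_per_fragment=2).fragments("demo.txt", len(data))

# Assumption: fragments are handed out to segments round-robin.
assignment: Dict[int, List[Fragment]] = {
    seg: frags[seg::NUM_SEGMENTS] for seg in range(NUM_SEGMENTS)
}

accessor = LineAccessor(data)
rows_per_segment = {
    seg: [row for frag in seg_frags for row in accessor.read(frag)]
    for seg, seg_frags in assignment.items()
}
print(rows_per_segment)
```

Splitting the source first and reading the pieces on the segments lets external data be scanned in parallel, the same way native tables are.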
Further Steps 
• Master server scaling – pool of master servers
• New native data storage formats and new native compression algorithms
• YARN as resource manager for HAWQ
• Dynamic segment allocation / decommission
Questions? 
BUILT FOR THE SPEED OF BUSINESS
Introduction to Git Source Control
John Valentino7 views
Ports-and-Adapters Architecture for Embedded HMI by Burkhard Stubert
Ports-and-Adapters Architecture for Embedded HMIPorts-and-Adapters Architecture for Embedded HMI
Ports-and-Adapters Architecture for Embedded HMI
Burkhard Stubert29 views
AI and Ml presentation .pptx by FayazAli87
AI and Ml presentation .pptxAI and Ml presentation .pptx
AI and Ml presentation .pptx
FayazAli8714 views
Airline Booking Software by SharmiMehta
Airline Booking SoftwareAirline Booking Software
Airline Booking Software
SharmiMehta9 views
Top-5-production-devconMunich-2023-v2.pptx by Tier1 app
Top-5-production-devconMunich-2023-v2.pptxTop-5-production-devconMunich-2023-v2.pptx
Top-5-production-devconMunich-2023-v2.pptx
Tier1 app6 views
20231129 - Platform @ localhost 2023 - Application-driven infrastructure with... by sparkfabrik
20231129 - Platform @ localhost 2023 - Application-driven infrastructure with...20231129 - Platform @ localhost 2023 - Application-driven infrastructure with...
20231129 - Platform @ localhost 2023 - Application-driven infrastructure with...
sparkfabrik8 views
360 graden fabriek by info33492
360 graden fabriek360 graden fabriek
360 graden fabriek
info33492162 views
Navigating container technology for enhanced security by Niklas Saari by Metosin Oy
Navigating container technology for enhanced security by Niklas SaariNavigating container technology for enhanced security by Niklas Saari
Navigating container technology for enhanced security by Niklas Saari
Metosin Oy14 views
Unlocking the Power of AI in Product Management - A Comprehensive Guide for P... by NimaTorabi2
Unlocking the Power of AI in Product Management - A Comprehensive Guide for P...Unlocking the Power of AI in Product Management - A Comprehensive Guide for P...
Unlocking the Power of AI in Product Management - A Comprehensive Guide for P...
NimaTorabi216 views
Generic or specific? Making sensible software design decisions by Bert Jan Schrijver
Generic or specific? Making sensible software design decisionsGeneric or specific? Making sensible software design decisions
Generic or specific? Making sensible software design decisions
tecnologia18.docx by nosi6702
tecnologia18.docxtecnologia18.docx
tecnologia18.docx
nosi67025 views

Pivotal hawq internals

  • 1. Pivotal HAWQ A.Grishchenko HadoopKitchen @ Mail.ru 27 Sep 2014 Pivotal Confidential––Internal Use Only 1
  • 2. SQL-on-Hadoop Solutions: Hive (2008)
      • Developed by Facebook
          • Hive is used for data analysis in their data warehouse
          • The DWH holds ~300 PB at the moment, and ~600 TB of data is loaded daily. Data is stored as ORCFiles with a compression ratio of ~8x
      • HiveQL is not compatible with ANSI SQL-92
      • Has many limitations on subqueries
      • The cost-based optimizer (Optiq) is only in technical preview for now
  • 3. SQL-on-Hadoop Solutions: Impala (10.2012)
      Timeline: Hive (2008), Impala (10.2012)
      • Developed by Cloudera
          • Open-source solution
          • Cloudera sells this solution to enterprise shops
          • Was in beta until May 2013
      • Supports HiveQL and is moving toward complete ANSI SQL-92 support
      • Written in C++; does not use MapReduce to run queries
      • Requires a lot of memory; joins of big tables usually cause OOM errors
  • 4. SQL-on-Hadoop Solutions: Stinger (02.2013)
      Timeline: Hive (2008), Impala (10.2012), Stinger (02.2013)
      • Hortonworks initiative
          • Consists of a number of steps intended to make Hive run 100x faster
      • Tez: translates Hive queries into Tez jobs, which are similar to MapReduce jobs but may have an arbitrary topology
      • Optiq: cost-based query optimizer for Hive (in technical preview at the moment)
      • ORCFile: columnar storage format with adaptive compression and inline indexes
      • Hive-5317: ACID and UPDATE/DELETE support (release expected around 11.2014)
  • 5. SQL-on-Hadoop Solutions: HAWQ (02.2013)
      Timeline: Hive (2008), Impala (10.2012), Stinger (02.2013), HAWQ (02.2013)
      • Pivotal product
          • Greenplum MPP DBMS, ported to store data in HDFS
          • Written in C; the query optimizer was rewritten for this solution (ORCA)
      • Supports ANSI SQL-92 and the analytic extensions from SQL-2003
      • Supports complex queries with correlated subqueries, window functions and different join types
      • Data is written to disk only if the process does not have enough memory
  • 6. SQL-on-Hadoop Solutions: Vertica, BigSQL (2014)
      Timeline: Hive (2008), Impala (10.2012), Stinger (02.2013), HAWQ (02.2013), Vertica and BigSQL (2014)
      • HP Vertica
          • Supports only the MapR distribution, because it requires updatable storage
          • Supports ANSI SQL-92, SQL-2003
          • Supports UPDATE/DELETE
          • Officially announced as available in July 2014; no implementations yet
      • IBM BigSQL v3
          • IBM DB2 ported to store data in HDFS
          • Federated queries, good query optimizer, etc.
      • Both solutions are similar to Pivotal HAWQ in their general idea
  • 7. Pivotal HAWQ Components
      [Diagram: cluster layout. Labels: Master (Server 1), Standby Master (Server 2), Segments 1 to K (Server 3), Segments K+1 to 2*K (Server 4), and so on up to Segment N (Server M).]
  • 8. Pivotal HAWQ Components
      [Diagram: HAWQ co-located with Hadoop. Labels: Server 1 HAWQ Master; Server 2 HAWQ Standby Master; Server 3 NameNode; Server 4 SNameNode; three ZK/QJM pairs; Servers 5, 6, ... M each running a Datanode and a HAWQ Segment.]
  • 9. Pivotal HAWQ Components
      [Diagram: the HAWQ Master and the HAWQ Standby Master each contain a Query Parser, Query Optimizer, Query Executor, Transaction Manager, Metadata Catalog and Process Manager; the two are linked by WAL replication.]
  • 10. Pivotal HAWQ Components
      • Metadata is stored only on the master servers
      • Metadata is kept in a modified Postgres instance and replicated to the standby master via WAL
      • Metadata contains:
          • Table information: schema, names, files
          • Statistics: number of unique values, value ranges, sample values, etc.
          • Information about users, groups, priorities, etc.
      • A master server shutdown causes a switch to the standby master, and running sessions are lost
  • 11. Pivotal HAWQ Components
      [Diagram: a HAWQ Segment. Labels: Query Executor, libhdfs3, PXF, HDFS Datanode, Segment Data Directory, Local Filesystem (xfs), Spill Data Directory.]
  • 12. Pivotal HAWQ Components
      • Both masters and segments are modified Postgres instances (more precisely, modified Greenplum instances)
      • When you open a connection to the master server, the postmaster forks a process that handles your session
      • When query execution starts, the master connects to the segment instances, and they also fork processes to execute the query
      • The query execution plan is split into independent blocks (slices); each slice runs as a separate OS process on the segment server, and the slices exchange data over UDP
  • 13. Pivotal HAWQ Components
      • Tables can be stored as:
          • Row-oriented (quicklz, zlib compression)
          • Column-oriented (quicklz, zlib, rle compression)
          • Parquet tables
      • Each segment has a separate directory on HDFS where it stores its data shard
      • In column-oriented storage, each column is stored as a separate file
      • Parquet stores the table by columns without loading the NameNode with many files and block location requests
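To make the NameNode argument on slide 13 concrete, here is a rough count of HDFS files per table. It is only a sketch: it assumes one file per column per segment for column-oriented storage and one file per segment for row-oriented and Parquet storage. The function name and the cluster sizes are hypothetical, and real tables may keep more files per segment.

```python
def hdfs_files_per_table(segments, columns, storage):
    """Rough number of HDFS files one table occupies (illustrative assumption)."""
    if storage == "column":
        # Slide 13: each column is a separate file, in every segment's directory.
        return segments * columns
    if storage in ("row", "parquet"):
        # Assumption: a single data file per segment holds all columns.
        return segments
    raise ValueError(f"unknown storage type: {storage}")


if __name__ == "__main__":
    # Hypothetical cluster: 96 segments, a table with 200 columns.
    for storage in ("row", "column", "parquet"):
        print(storage, hdfs_files_per_table(96, 200, storage))
```

Under these assumptions the column-oriented table needs 200 times as many files as the Parquet one, and that multiplier is the NameNode load the slide refers to.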
  • 14. Query Execution in Pivotal HAWQ
      [Diagram: the HAWQ Master (Parser, Query Optimizer, Metadata, Transaction Manager, Process Manager, Query Executor), the NameNode, and two HAWQ Segments, each with a Backend, an HDFS Datanode, a Segment Directory and a Local Spill Directory.]
  • 17. Query Execution in Pivotal HAWQ
      [Diagram: the HAWQ Master, NameNode and two HAWQ Segments as on slide 14, with the query plan overlaid:]
      Motion Gather
        Project s.beer, s.price
          HashJoin b.name = s.bar
            Motion Redist(b.name)
              Filter b.city = 'San Francisco'
                Scan Bars (b)
            Scan Sells (s)
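The plan on slide 17 can be read as three steps: filter Bars and redistribute the surviving rows by b.name, hash-join them with the local Sells rows on each segment, then gather the results on the master. The sketch below simulates those steps in plain Python. It is a toy model, not HAWQ code: the sample rows, the two-segment cluster and the hash function are invented, and it assumes Sells is already distributed by bar, which would explain why only Bars passes through a Motion.

```python
from collections import defaultdict

NUM_SEGMENTS = 2  # hypothetical cluster size


def segment_for(key):
    """Toy distribution hash (an assumption, not HAWQ's real hash function)."""
    return sum(key.encode("utf-8")) % NUM_SEGMENTS


def distribute_sells(sells):
    """Place each Sells row on the segment that owns its bar name."""
    segments = [[] for _ in range(NUM_SEGMENTS)]
    for row in sells:
        segments[segment_for(row[0])].append(row)
    return segments


def run_query(bars_by_segment, sells_by_segment):
    # Scan Bars, Filter on city, then Motion Redist(b.name):
    # every segment sends each surviving bar name to the segment that owns it.
    received = defaultdict(set)
    for local_bars in bars_by_segment:
        for name, city in local_bars:
            if city == "San Francisco":
                received[segment_for(name)].add(name)

    # Scan Sells, HashJoin b.name = s.bar, Project s.beer, s.price:
    # each segment joins its local Sells rows against the names it received.
    # A set is enough as the build side because bar names are unique here.
    per_segment = []
    for seg, local_sells in enumerate(sells_by_segment):
        build_side = received[seg]
        per_segment.append(
            [(beer, price) for bar, beer, price in local_sells if bar in build_side]
        )

    # Motion Gather: the master collects the partial results.
    return [row for rows in per_segment for row in rows]


# Invented sample data: Bars(name, city) and Sells(bar, beer, price).
BARS = [("Joe's", "San Francisco"), ("Sue's", "Boston"), ("Tap Room", "San Francisco")]
SELLS = [("Joe's", "Bud", 3.0), ("Sue's", "Bud", 2.5),
         ("Tap Room", "Stout", 5.0), ("Joe's", "Stout", 4.5)]

if __name__ == "__main__":
    bars_by_segment = [BARS[0::2], BARS[1::2]]  # Bars rows start out spread arbitrarily
    print(sorted(run_query(bars_by_segment, distribute_sells(SELLS))))
```

Because both tables end up partitioned by the same key before the join, each segment can join independently and no Sells row has to cross the network.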
  • 21. Query Execution in Pivotal HAWQ
      [Diagram: same layout and query plan as slide 17; the Backend of each HAWQ Segment now shows a QE process.]
  • 23. Query Execution in Pivotal HAWQ
      [Diagram: same layout and query plan as slide 17; the QE in each HAWQ Segment Backend now shows the labels S1, S2 and S3.]
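Slide 12 says the plan is split into slices, and slide 23 shows three labels, S1 to S3, per segment. One way to get three slices from the plan on slide 17 is to cut it below every Motion node, which is how Greenplum-style planners are usually described. The sketch below does that on a toy plan tree; the `Node` class and the function name are hypothetical, and mapping the result onto the labels S1, S2 and S3 is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    op: str
    children: list = field(default_factory=list)


def split_into_slices(root):
    """Cut the plan below every Motion node; return operator names per slice,
    with the deepest (sending) slices first and the top slice last."""
    slices = []

    def walk(node, current):
        current.append(node.op)
        for child in node.children:
            if node.op.startswith("Motion"):
                # The subtree feeding a Motion runs as its own slice.
                child_slice = []
                walk(child, child_slice)
                slices.append(child_slice)
            else:
                walk(child, current)

    top = []
    walk(root, top)
    slices.append(top)
    return slices


# The plan from slide 17, written as a tree.
PLAN = Node("Motion Gather", [
    Node("Project s.beer, s.price", [
        Node("HashJoin b.name = s.bar", [
            Node("Motion Redist(b.name)", [
                Node("Filter b.city = 'San Francisco'", [Node("Scan Bars")]),
            ]),
            Node("Scan Sells"),
        ]),
    ]),
])

if __name__ == "__main__":
    for number, ops in enumerate(split_into_slices(PLAN), start=1):
        print(f"slice {number}: {ops}")
```

In this sketch each Motion node stays in the receiving slice, so the first slice scans and filters Bars, the second joins and projects, and the third only gathers.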
  • 35. PXF Framework
      • Lets you read data in different formats from HDFS
          • Text files, both compressed and uncompressed
          • Sequence files
          • Avro files
      • Can read data from external data sources
          • HBase
          • Cassandra
          • Redis
      • Extensible API
  • 36. PXF Framework
      [Diagram: the NameNode, the HAWQ Master (PXF Fragmenter, Process Manager), and two HAWQ Segments, each with a Query Executor, PXF Accessor, PXF Fragmenter, HDFS Datanode, Segment Directory and Local Spill Directory.]
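The diagram on slide 36 names two PXF roles, a Fragmenter and an Accessor, without showing how they interact. The sketch below is one plausible reading, written as a Python analogy: a fragmenter splits a data source into fragments, the fragments are dealt out to segments, and each segment's accessor reads the records of its own fragments. Every class and function name here is hypothetical, the round-robin assignment is an assumption, and real PXF plugins are Java classes with different signatures.

```python
class LineFragmenter:
    """Hypothetical fragmenter: splits an in-memory 'file' into line ranges."""

    def __init__(self, lines, lines_per_fragment):
        self.lines = lines
        self.size = lines_per_fragment

    def fragments(self):
        total = len(self.lines)
        return [range(start, min(start + self.size, total))
                for start in range(0, total, self.size)]


class LineAccessor:
    """Hypothetical accessor: yields the records of a single fragment."""

    def __init__(self, lines):
        self.lines = lines

    def read(self, fragment):
        for index in fragment:
            yield self.lines[index]


def assign_round_robin(fragments, num_segments):
    """Deal fragments out to segments in turn (an assumed policy)."""
    assignment = [[] for _ in range(num_segments)]
    for position, fragment in enumerate(fragments):
        assignment[position % num_segments].append(fragment)
    return assignment


def scan(lines, lines_per_fragment, num_segments):
    """Return, per segment, the rows that segment would read."""
    fragments = LineFragmenter(lines, lines_per_fragment).fragments()
    accessor = LineAccessor(lines)
    return [
        [row for fragment in owned for row in accessor.read(fragment)]
        for owned in assign_round_robin(fragments, num_segments)
    ]


if __name__ == "__main__":
    rows = [f"row-{i}" for i in range(7)]
    for segment, segment_rows in enumerate(scan(rows, 3, 2)):
        print(f"segment {segment}: {segment_rows}")
```

The point of the split is that the segments read disjoint fragments in parallel, so no single process has to stream the whole external source.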
  • 44. Further Steps
      • Master server scaling: a pool of master servers
      • New native data storage formats and new native compression algorithms
      • YARN as the resource manager for HAWQ
      • Dynamic segment allocation and decommissioning
  • 46. BUILT FOR THE SPEED OF BUSINESS