12. What’s the best way to answer questions that span these two worlds?
Can we layer a SQL interface atop Hadoop?
Can we combine the strengths of parallel databases with those of Hadoop?
16. Cloudera Impala : Architecture
Clients: Impala Shell, JDBC/ODBC Client, SQL Tools
Data Nodes: each runs an Impala Daemon that handles Query Planning, Query Coordination, and Query Execution
Shared services: State Store, Metadata Catalog, HDFS Name Node
Unified Metadata Store
17. Life Cycle of an Impala Query
Clients: Impala Shell, JDBC/ODBC Client, SQL Tools
Cluster: Impala Daemons, one alongside each Data Node
Shared services: State Store, Metadata Catalog, HDFS Name Node
Steps shown: Parse Query, Plan and Optimize, Coordinate Execution
19. Polybase + PDW : Architecture
Clients: ADO.NET, JDBC/ODBC Client, OLEDB
PDW Cluster (base diagram: SQL Server PDW architecture)
Control Node: PDW Engine Service, DMS Controller, Loader Manager, SQL Server
Compute Nodes: SQL Server, Data Move Service, HDFS Bridge
Hadoop Cluster
Name Node, Job Tracker
Data Nodes: each runs a Task Tracker
20. Register the Hadoop Cluster with PDW
CREATE HADOOP_CLUSTER GSL_CLUSTER WITH
(namenode='hadoop-head', namenode_port=9000,
jobtracker='hadoop-head', jobtracker_port=9010);
21. Map HDFS File to External Tables in PDW
CREATE EXTERNAL TABLE hdfsCustomer
( c_custkey     bigint not null,
  c_name        varchar(25) not null,
  c_address     varchar(40) not null,
  c_nationkey   integer not null,
  c_phone       char(15) not null,
  c_acctbal     decimal(15,2) not null,
  c_mktsegment  char(10) not null,
  c_comment     varchar(117) not null)
WITH (LOCATION='/tpch1gb/customer.tbl',
FORMAT_OPTIONS (EXTERNAL_CLUSTER = GSL_CLUSTER,
EXTERNAL_FILEFORMAT = TEXT_FORMAT));
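Once the HDFS file is mapped, the external table can be referenced in ordinary SELECT statements. The queries below are a minimal sketch of what that might look like, not taken from the slides. The first uses only the hdfsCustomer columns defined above. The second assumes a hypothetical PDW-resident table, pdwOrders, with columns o_custkey and o_totalprice, to illustrate a question that spans both systems.

```sql
-- Aggregate over the HDFS-backed external table defined above.
SELECT c_mktsegment,
       COUNT(*)       AS num_customers,
       AVG(c_acctbal) AS avg_balance
FROM hdfsCustomer
WHERE c_acctbal > 0
GROUP BY c_mktsegment;

-- Join HDFS-resident customers with data stored inside PDW.
-- ASSUMPTION: pdwOrders (o_custkey, o_totalprice) is a hypothetical
-- regular PDW table; it does not appear in the original slides.
SELECT c.c_name,
       SUM(o.o_totalprice) AS total_spent
FROM hdfsCustomer AS c
JOIN pdwOrders AS o
  ON o.o_custkey = c.c_custkey
GROUP BY c.c_name;
```

In both cases the client writes plain SQL; where each part of the work runs (in PDW or on the Hadoop cluster) is decided when the query is planned.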
22. Life Cycle of a Split Query
Clients: ADO.NET, JDBC/ODBC Client, OLEDB
PDW Cluster
Control Node: Engine Service, DMS Controller, Loader Manager, SQL Server
Compute Nodes: SQL Server, Data Move Service, HDFS Bridge
Hadoop Cluster
Name Node, Job Tracker
Data Nodes: each runs a Task Tracker
Step shown: Plan