…
[Diagram: a SQL App (ODBC) sends a SQL request to a cluster of three nodes. Each node runs a Query Planner, Query Coordinator, and Query Executor next to an HDFS DN and HBase. Shared services: Hive Metastore, HDFS NN, Statestore & Catalog.]
[Diagram: SQL App (ODBC); three nodes, each with Query Planner, Query Coordinator, Query Executor, HDFS DN, and HBase; Hive Metastore, HDFS NN, Statestore & Catalog.]
Planner turns request into collections of plan fragments
Coordinator initiates execution on remote nodes
Intermediate results are streamed between nodes
Operation permitted, query results are streamed back to client
[Diagram: SQL App (ODBC); three nodes, each with Query Planner, Query Coordinator, Query Executor, HDFS DN, and HBase; Hive Metastore, HDFS NN, Statestore & Catalog. Arrow label: query results.]
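// Generic version: loops over every slot and switches on its type, for each row.
void MaterializeTuple(char* tuple) {
  for (int i = 0; i < num_slots_; ++i) {
    char* slot = tuple + offsets_[i];
    switch (types_[i]) {
      case BOOLEAN:
        *slot = ParseBoolean();
        break;
      case INT:
        *slot = ParseInt();
        break;
      case FLOAT: …
      case STRING: …
      // etc.
    }
  }
}

// Specialized version for a known tuple layout (INT, BOOLEAN, INT):
// the loop is unrolled and the type switch is gone.
void MaterializeTuple(char* tuple) {
  // i = 0
  *(tuple + 0) = ParseInt();
  // i = 1
  *(tuple + 4) = ParseBoolean();
  // i = 2
  *(tuple + 5) = ParseInt();
}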
Hot code path, called per row
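The contrast between the two versions above can be made concrete with a small, compilable sketch. Everything here is illustrative and not taken from the slides: `SlotType`, `SlotDesc`, `Row`, and the two function names are hypothetical, and pre-decoded integers stand in for the `ParseInt()` / `ParseBoolean()` calls. The specialized function is written by hand to stand in for generated code; it assumes the same fixed layout as the slide (INT at offset 0, BOOLEAN at offset 4, INT at offset 5).

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical slot types and layout descriptor (not from the slides).
enum class SlotType { BOOLEAN, INT };

struct SlotDesc {
  SlotType type;
  int offset;  // byte offset of the slot inside the tuple buffer
};

// Hypothetical input row: each field is already decoded to a 32-bit value,
// standing in for the ParseInt()/ParseBoolean() calls on the slide.
struct Row {
  std::vector<std::int32_t> fields;
};

// Generic path: walks the descriptor and branches on the slot type for
// every slot of every row.
void MaterializeTupleGeneric(const std::vector<SlotDesc>& slots,
                             const Row& row, char* tuple) {
  assert(slots.size() == row.fields.size());
  for (std::size_t i = 0; i < slots.size(); ++i) {
    char* slot = tuple + slots[i].offset;
    switch (slots[i].type) {
      case SlotType::BOOLEAN: {
        std::uint8_t v = row.fields[i] != 0 ? 1 : 0;
        std::memcpy(slot, &v, sizeof(v));
        break;
      }
      case SlotType::INT: {
        std::int32_t v = row.fields[i];
        std::memcpy(slot, &v, sizeof(v));  // memcpy tolerates unaligned offsets
        break;
      }
    }
  }
}

// Specialized path for the fixed layout (INT @0, BOOLEAN @4, INT @5):
// no loop, no switch, offsets baked in as constants.
void MaterializeTupleSpecialized(const Row& row, char* tuple) {
  assert(row.fields.size() == 3);
  std::int32_t a = row.fields[0];
  std::uint8_t b = row.fields[1] != 0 ? 1 : 0;
  std::int32_t c = row.fields[2];
  std::memcpy(tuple + 0, &a, sizeof(a));
  std::memcpy(tuple + 4, &b, sizeof(b));
  std::memcpy(tuple + 5, &c, sizeof(c));
}
```

Both functions fill a 9-byte tuple buffer identically for that layout; the specialized one simply has nothing left to decide at run time, which is what matters on a path executed once per row.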
[Diagram: an Impala Daemon hosts three Query Fragments on top of a shared IO Manager; the IO Manager sits above five disks, with Threads 0 to 4.]
Container format for all popular serialization formats: Avro, Thrift, Protocol Buffers
From Twitter’s “Dremel Made Simple” blog
The most efficient IO is the one that never happens at all
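One way to read the line above is column projection in a columnar file: a scan only touches the columns a query names. The sketch below is a rough illustration under that assumption, not a description of any real file format's API. `ColumnChunk`, both function names, the column names, and all byte sizes are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical column chunk: a column name plus its encoded size on disk.
struct ColumnChunk {
  std::string name;
  std::uint64_t bytes_on_disk;
};

// Row-oriented layout: values of all columns are interleaved, so a scan
// reads every byte no matter which columns the query needs.
std::uint64_t BytesReadRowLayout(const std::vector<ColumnChunk>& columns) {
  std::uint64_t total = 0;
  for (const ColumnChunk& c : columns) total += c.bytes_on_disk;
  return total;
}

// Columnar layout: each column is stored contiguously, so a scan reads
// only the chunks of the projected columns and skips the rest.
std::uint64_t BytesReadColumnarLayout(const std::vector<ColumnChunk>& columns,
                                      const std::vector<std::string>& projected) {
  std::uint64_t total = 0;
  for (const std::string& name : projected) {
    bool found = false;
    for (const ColumnChunk& c : columns) {
      if (c.name == name) {
        total += c.bytes_on_disk;
        found = true;
        break;
      }
    }
    assert(found && "projected column must exist in the table");
  }
  return total;
}
```

With made-up sizes of 400, 900, 100, and 8600 bytes for columns `name`, `question`, `has_question`, and `bio`, a query that needs only `question` and `has_question` reads 1000 bytes in the columnar case against 10000 in the row-oriented case.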
• OVER PARTITION, RANK, LEAD, LAG, NTILE, ..
• VARCHAR, CHAR
• ROLLUP, CUBE, GROUPING SETS
• Set operations: MINUS, INTERSECT
SELECT question FROM audience WHERE has_question = true;

Impala: A Modern, Open-Source SQL Engine for Hadoop
