A high-level overview.
By Ahad Rana
Director of Engineering @ Factual
What is Impala?
● Real-time SQL Query Engine from Cloudera.
● Built on top of Hadoop.
● Open Source (Apache License).
Hadoop Ecosystem Keeps Growing/Maturing
● High Availability NameNodes
● Quorum based NameNode Journaling.
● ZooKeeper integration.
● YARN support for Linux Containers / Multi-Tenancy.
● HDFS optimizations - direct reads, mmap reads.
● Kerberos based security throughout, ACLs (in CDH5)
● HBase improvements.
● New file formats: Parquet (based on Google Dremel).
MapReduce: Not the Solution for Everything
● Batch oriented, high latency.
● Joins are complex.
● DSLs are complex.
● Only really accessible to developers.
Needed: Better Accessibility to “Big Data”
● SQL is still the lingua franca of the Business Analytics community.
● Real-time / faster query response times can have a dramatic effect on
● The previous generation of enabling technologies (like Hive) is
batch oriented and too slow.
The Solution: MPP Query Engines on Hadoop
● That is, MPP SQL Query Engines on Hadoop.
● Proprietary vendors have been in this game for a while and have been
charging a lot for custom hardware and software.
● New engines, like Impala, are game changers:
○ Built from the ground up to integrate with Hadoop - Your data is
already in Hadoop - co-locate the queries next to your data!
○ Read data from existing Hadoop File Formats: Text, LZO-P,
Sequence File, AVRO etc.
○ No time consuming ETL cycle or indexing.
○ Open Source, non-proprietary, and they run on the same commodity
hardware.
○ Bypass MapReduce framework, tuned for speed.
○ All support a similar subset of SQL-92 as Hive.
Impala - One of Many in a Crowded Field
● Many parallel efforts to address this need:
○ Presto (Facebook).
○ Shark (AMPLab).
○ Hive on Tez (Hortonworks).
○ Impala (Cloudera).
○ Redshift (Amazon/proprietary).
Impala vs. Other MPP Engines
● 21 Nodes (2 CPU - 12 Core, 12 disks, 384GB RAM)
● TPC-DS derived benchmark scaled to 15TB.
Impala Technology Highlights
● Designed with performance in mind - backend is C++.
● Operates on native types in memory wherever possible.
● Uses JIT compiler (LLVM) to generate optimized expression evaluation
code (minimal branching, no v-table calls) on the fly.
● Asynchronously streams results between various stages of the
query execution pipeline.
● Keeps all intermediate results in memory.
● Parallelizes scans and locates them near data (DataNodes), and
directly reads HDFS blocks from disk.
● Supports Broadcast Joins (Fact table is Scanned in a partitioned
manner, Smaller Dimension table(s) are broadcast to all nodes)
● Also supports Partitioned Joins (both Fact/Dimension tables are
partitioned on join columns / expressions).
● Supports a variety of file formats (Text, Text compressed, Seq File,
RCFile, AVRO, HFile, Parquet) and scans against HBase.
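The two join strategies above can be illustrated with a small single-process simulation. This is only a sketch of the general idea, not Impala's implementation: the function names (`hash_join`, `broadcast_join`, `partitioned_join`), the sample tables, and the use of Python lists to stand in for nodes are all assumptions made for illustration.

```python
from collections import defaultdict

def hash_join(fact_rows, dim_rows, key):
    """Local hash join: build a hash table on dim_rows, probe it with fact_rows."""
    table = defaultdict(list)
    for d in dim_rows:
        table[d[key]].append(d)
    return [{**f, **d} for f in fact_rows for d in table.get(f[key], [])]

def broadcast_join(fact_partitions, dim_rows, key):
    """Each simulated node keeps its fact partition and gets a full copy of the dimension table."""
    out = []
    for part in fact_partitions:  # one iteration per simulated node
        out.extend(hash_join(part, dim_rows, key))
    return out

def partitioned_join(fact_rows, dim_rows, key, num_nodes):
    """Both inputs are hash-partitioned on the join key, so matching rows meet on the same node."""
    fact_parts = [[] for _ in range(num_nodes)]
    dim_parts = [[] for _ in range(num_nodes)]
    for f in fact_rows:
        fact_parts[hash(f[key]) % num_nodes].append(f)
    for d in dim_rows:
        dim_parts[hash(d[key]) % num_nodes].append(d)
    out = []
    for fp, dp in zip(fact_parts, dim_parts):
        out.extend(hash_join(fp, dp, key))
    return out

# Hypothetical sample data: a fact table (sales) and a small dimension table (stores).
sales = [
    {"store_id": 1, "amount": 10},
    {"store_id": 2, "amount": 5},
    {"store_id": 1, "amount": 7},
    {"store_id": 3, "amount": 2},  # no matching store, dropped by the inner join
]
stores = [{"store_id": 1, "city": "LA"}, {"store_id": 2, "city": "SF"}]

broadcast_result = broadcast_join([sales[:2], sales[2:]], stores, "store_id")
partitioned_result = partitioned_join(sales, stores, "store_id", num_nodes=3)
```

Both strategies produce the same rows. The difference is what moves over the network: the broadcast variant copies the whole dimension table to every node, while the partitioned variant reshuffles both tables by join key.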
Impala Deployment Environment
● Three binaries
○ impalad - one on each Hadoop DataNode, handles client requests,
runs query execution pipelines.
○ statestored - single instance, name service (like ZooKeeper),
used to coordinate catalog changes, impalad availability etc.
○ catalogd - single instance, talks to Hive metastore, caches
table metadata, distributes metadata to impalads, broadcasts
metadata changes via statestored.
Impala - Life of a Query
1. Query is routed to arbitrary impalad in cluster.
2. Java code running in impalad creates parse tree from SQL statement.
3. Planner evaluates parse tree and creates query Plan Tree.
4. Data Sources and Sinks (Tables) are validated against the metadata
catalog (via catalogd).
5. Planner uses table metadata to make optimal placement decisions,
establish a sharding strategy and memory requirements per query stage,
and construct Plan Fragments.
6. catalogd caches HDFS block locations for every HDFS file, and Planner
attaches block locations to Query Plan metadata.
7. Planner converts Plan Fragments into Thrift objects, and passes them
on to the Query Coordinator (C++).
8. Coordinator runs root Fragment locally and distributes remaining
Fragments to other backends.
9. Coordinator monitors Fragment execution until completion or cancellation.
10. Fragments stream data to other Fragments or to the primary Fragment
(User or Table).
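Steps 5 and 6 describe the planner using cached HDFS block locations to place work near the data. A minimal sketch of that idea follows, assuming a simple greedy policy (send each block to the least-loaded host that holds a replica). The function name `assign_scan_ranges`, the block IDs, and the host names are hypothetical, and Impala's real scheduler is more involved than this.

```python
from collections import defaultdict

def assign_scan_ranges(block_locations):
    """Greedy placement: each block goes to the least-loaded host holding a replica."""
    load = defaultdict(int)   # number of blocks assigned to each host so far
    assignment = {}
    for block, hosts in sorted(block_locations.items()):
        # Ties are broken by host name to keep the result deterministic.
        chosen = min(hosts, key=lambda h: (load[h], h))
        assignment[block] = chosen
        load[chosen] += 1
    return assignment

# Hypothetical block -> replica hosts map, as a metadata cache might provide it.
blocks = {
    "blk_1": ["node-a", "node-b"],
    "blk_2": ["node-a", "node-c"],
    "blk_3": ["node-b", "node-c"],
    "blk_4": ["node-a", "node-b"],
}
plan = assign_scan_ranges(blocks)
```

Every block ends up on a host that stores one of its replicas, so each scan can read from local disk rather than over the network.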
● Fact to Fact table joins are expensive!
● Scalar data types (INT, BIGINT) better than generic (String).
● Coerce types in metadata during load.
● Push Filter Predicates downstream as much as possible (Avoid upstream
● Use Impala’s Partitioning support to pre-partition tables.
● Compression means less network / disk IO.
● Always use compression.
● Choose codec wisely - make the appropriate size vs. CPU cost
trade-off.
● Snappy - Low CPU requirements, Low compression ratio.
● GZip - High CPU, High compression ratio.
● LZO - Somewhere in the middle.
● Snappy is generally a good default choice.
● Use a Columnar format (like Parquet) if possible.
● Columnar formats give you:
○ Less IO for Impala - Only read relevant column data.
○ Better compression, more efficient encoding.
● Parquet is the new preferred file format for Impala.
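The "less IO" point about columnar formats can be shown with a toy comparison of the two layouts. This sketch assumes a hypothetical three-column table (`id`, `category`, `price`) packed with Python's `struct` module. It is not how Parquet or Impala lay out bytes; it only counts how many bytes a single-column aggregate has to touch.

```python
import struct

# Hypothetical table: (id: int64, category: int32, price: float64), 1000 rows.
rows = [(i, i % 3, i * 1.5) for i in range(1000)]

ROW = struct.Struct("<qid")  # 8 + 4 + 8 = 20 bytes per row, no padding

# Row-oriented layout: all columns of a row are stored together.
row_buf = b"".join(ROW.pack(*r) for r in rows)

# Column-oriented layout: each column is stored contiguously on its own.
columns = {
    "id": struct.pack(f"<{len(rows)}q", *(r[0] for r in rows)),
    "category": struct.pack(f"<{len(rows)}i", *(r[1] for r in rows)),
    "price": struct.pack(f"<{len(rows)}d", *(r[2] for r in rows)),
}

def sum_price_row_layout(buf):
    """Must walk every full row to reach the price field."""
    total = 0.0
    for _id, _category, price in ROW.iter_unpack(buf):
        total += price
    return total, len(buf)           # (result, bytes read)

def sum_price_column_layout(cols):
    """Reads only the price column; the other columns are never touched."""
    price_buf = cols["price"]
    values = struct.unpack(f"<{len(price_buf) // 8}d", price_buf)
    return sum(values), len(price_buf)

row_total, row_bytes = sum_price_row_layout(row_buf)
col_total, col_bytes = sum_price_column_layout(columns)
```

In this example the row layout reads 20,000 bytes and the column layout reads 8,000 bytes to compute the same sum. The gap widens as a table gains columns that a query does not reference.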
Parquet Format Detail
● Row Group - group of rows in columnar format
○ A Row Group’s worth of data is buffered in RAM when writing.
○ Impala prefers large row groups (1GB).
○ Potentially multiple row groups per split (when reading).
● Column Chunk / Page(s)
○ Each column’s data is stored in a separate chunk.
○ Each chunk is further divided into Page(s) worth of data.
○ Each page is ~1MB in size.
○ Compression is applied at Page level.
○ Smallest level of granularity for read is the Page.
● Good for writing data in bulk, not good for small writes (INSERT
VALUES … ).
● Supports nested columns (see Google Dremel Paper).
● MapReduce InputFormat can materialize data back as Thrift objects.
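The row group / column chunk / page hierarchy can be sketched as nested Python structures. This is a toy model under stated assumptions: pages hold a fixed number of values rather than ~1MB of bytes, values are serialized with JSON rather than Parquet's binary encodings, and `zlib` stands in for whichever codec is configured. The helper names `write_row_groups` and `read_column` are hypothetical.

```python
import json
import zlib

def write_row_groups(rows, columns, rows_per_group, values_per_page):
    """Return a list of row groups; each maps column name -> list of compressed pages."""
    groups = []
    for start in range(0, len(rows), rows_per_group):
        group_rows = rows[start:start + rows_per_group]  # buffered in memory while writing
        group = {}
        for idx, name in enumerate(columns):
            values = [r[idx] for r in group_rows]        # one column chunk
            pages = []
            for p in range(0, len(values), values_per_page):
                page = json.dumps(values[p:p + values_per_page]).encode()
                pages.append(zlib.compress(page))        # compression applied per page
            group[name] = pages
        groups.append(group)
    return groups

def read_column(groups, name):
    """Decompress only the pages of the requested column."""
    out = []
    pages_read = 0
    for group in groups:
        for page in group[name]:
            out.extend(json.loads(zlib.decompress(page)))
            pages_read += 1
    return out, pages_read

# Hypothetical two-column table with 10 rows.
table = [(i, f"user{i % 5}") for i in range(10)]
groups = write_row_groups(table, ["id", "name"], rows_per_group=4, values_per_page=2)
ids, pages_read = read_column(groups, "id")
```

With these settings the 10 rows fall into three row groups, and reading `id` decompresses five pages without touching any `name` page.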
● Impala currently only uses HBase RPCs to scan HBase tables.
● Map HBase binary key/columns to types using Hive/Impala metadata.
● Best to only use HBase key column in predicates, otherwise you will
have to scan entire table.
● Best to do large fact table scans in Impala first and then do a late
stage join to a limited number of HBase rows.
● HBase now has snapshotting support, so perhaps bulk scans are
possible in the future?
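The advice to filter on the HBase key column follows from rows being stored sorted by key. The sketch below uses a toy in-memory table, not the HBase client API; the class `SortedKeyTable`, its methods, and the sample data are assumptions for illustration. It compares how many rows a key-range predicate touches with how many a predicate on a non-key column touches.

```python
import bisect

class SortedKeyTable:
    """Toy stand-in for an HBase table: rows are kept sorted by row key."""

    def __init__(self, rows):
        self.rows = sorted(rows, key=lambda r: r[0])
        self.keys = [r[0] for r in self.rows]

    def scan_key_range(self, start, stop):
        """Key predicate: binary-search the bounds and touch only rows in [start, stop)."""
        lo = bisect.bisect_left(self.keys, start)
        hi = bisect.bisect_left(self.keys, stop)
        return self.rows[lo:hi], hi - lo          # (matches, rows touched)

    def scan_by_value(self, predicate):
        """Non-key predicate: every row has to be examined."""
        hits = [r for r in self.rows if predicate(r[1])]
        return hits, len(self.rows)               # (matches, rows touched)

# Hypothetical table of 1000 users keyed by a zero-padded user id.
table = SortedKeyTable([(f"user{i:04d}", {"visits": i % 7}) for i in range(1000)])

by_key, touched_by_key = table.scan_key_range("user0100", "user0110")
by_value, touched_by_value = table.scan_by_value(lambda cols: cols["visits"] == 6)
```

The key-range scan touches 10 rows, while the value predicate examines all 1000 rows to find its matches.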
Impala Codebase Commentary
● Good use of best practices throughout codebase
○ Robust Comments
○ Modern C++ standards (STL, Boost, aggressive smart pointer
utilization to avoid leaks)
○ Good instrumentation framework throughout codebase.
○ Smart memory management.
○ Thrift RPC and Thrift data structures for exchange of data /
messages between services and layers.
○ Smart use of Java for front-end query processing/parsing, DDL,
DML integration with Hive etc, and then transfer of Query Plan
from Java layer to C++ via Thrift objects.
○ Very efficient use of memory in pipeline
■ Tuple (collection of typed column data, allocated using
■ Awareness of various distinct tuples flowing through various
stages of pipeline means they can generate super efficient
just-in-time code to read/write these tuples.
● High, potentially unconstrained demand on Hadoop nodes
○ Containerization (in YARN) helps, but still, nodes can consume a
lot of RAM and CPU.
● Allocate enough memory to support your needs, avoid swap!
● Potentially large impact on shared network IO when running
inefficient queries that broadcast lots of data upstream.
● No quota management or resource tracking by user.
P.S. Factual is always looking for good engineers!