4. Why do we care about internals?
► SQL is declarative, no need for internals...
► At the same time, even small problems in
engine operation require a good understanding of
its working principles to fix...
► It is hardly possible to optimize without
understanding algorithms under the hood.
► It is hard to make decisions about engine
suitability to future needs without knowing
technical limitations.
6. How to understand an engine?
What does it do?
Main principle of operation
Main building block
Operation sequence
Operation environment
Efficiency
Design decisions
Materials
Main problems and fixes
7. What it does
Impala is a relational engine. It executes SQL
queries.
Data is append-only. There are no “Update” or
“Delete” statements.
8. Principle of operation
Main differentiators are:
Distribution of a query among nodes (MPP)
LLVM and code generation. Impala is a compiler.
Reliance on HDFS
Use of external metadata – the Hive metastore.
Parallel query capability (per node, per cluster).
9. Sequence of operation
Query parsing – translate SQL into an AST (abstract
syntax tree).
Match objects to metadata.
Query planning – create a physical execution plan.
In the MPP case – divide the plan into plan fragments
for the nodes.
Distribute the plan fragments to the nodes.
Execute the plan fragments.
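The sequence above can be sketched as a toy pipeline. This is only an illustration of the stages: the function names, the dictionary used as an "AST", the fake metastore, and the round-robin split of blocks into one fragment per node are all assumptions made for the example and do not mirror Impala's real classes.

```python
# Toy illustration of: parse -> match metadata -> plan -> split into fragments.
# Every name here is hypothetical; this is not Impala's frontend.

def parse(sql):
    # Handles only "SELECT <col> FROM <table>" and returns a tiny dict "AST".
    tokens = sql.strip().rstrip(";").split()
    if len(tokens) != 4 or tokens[0].upper() != "SELECT" or tokens[2].upper() != "FROM":
        raise ValueError("unsupported query: " + sql)
    return {"select": tokens[1], "from": tokens[3]}

def analyze(ast, metastore):
    # Match the referenced table and column against external metadata.
    table = metastore.get(ast["from"])
    if table is None:
        raise KeyError("unknown table: " + ast["from"])
    if ast["select"] not in table["columns"]:
        raise KeyError("unknown column: " + ast["select"])
    return {"column": ast["select"], "blocks": table["blocks"]}

def plan_fragments(analyzed, nodes):
    # One scan fragment per node; blocks are dealt out round-robin.
    per_node = {node: [] for node in nodes}
    for i, block in enumerate(analyzed["blocks"]):
        per_node[nodes[i % len(nodes)]].append(block)
    return [{"node": node, "scan": blocks, "column": analyzed["column"]}
            for node, blocks in per_node.items()]

metastore = {"t": {"columns": ["a", "b"], "blocks": ["blk0", "blk1", "blk2"]}}
fragments = plan_fragments(analyze(parse("SELECT a FROM t"), metastore),
                           ["node1", "node2"])
```

In a real engine the last two steps, distribution and execution, would ship each fragment to its node; here the list of fragments simply stands in for that.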
10. Main building blocks
Front End (Fe). This is Java code which implements
much of the logic that is not performance-critical:
- database objects:
fe/src/main/java/com/cloudera/impala/analysis/
- execution plan parts:
fe/src/main/java/com/cloudera/impala/planner/
11. BackEnd (Be)
The backend is written in C++ and is used mostly for
performance-critical parts. Specifically:
- Execution of the plan fragments on nodes
- Services implementation
ImpalaD service
StateStore
Catalog Service
12. Services - ImpalaD
This is the “main” service of Impala, which runs on
each node. Logically it consists of the following
sub-services that are of interest to us.
ImpalaService – the service used to execute queries.
The console and JDBC/ODBC clients connect here.
ImpalaInternalService – the service used to
coordinate work within the Impala cluster.
Example of usage: coordinating the job of
running query fragments on the planned Impala
nodes.
What is interesting for us? Each node can serve in two roles.
13. Dual role of ImpalaD service
Query coordinator
Fragment executor
17. Services - StateStore
In many clusters we have to solve the “cluster
synchronization” problem one way or another.
In Impala it is solved by StateStore, a
publish/subscribe service similar to
ZooKeeper. Why is ZooKeeper not used?
StateStore speaks with its clients in terms of topics. Clients
can subscribe to different topics. So, to find the
“endpoints”, look in the sources for usages of
“StatestoreSubscriber”.
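The topic-based publish/subscribe idea can be sketched in a few lines. This is an in-process toy, not the StateStore protocol: the class name, the callback-based delivery, and the topic strings are assumptions chosen for illustration.

```python
from collections import defaultdict

class ToyStateStore:
    """Hypothetical in-process stand-in for a topic-based pub/sub service."""

    def __init__(self):
        self._subscribers = defaultdict(list)   # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, update):
        # Deliver the update only to subscribers of this topic.
        for callback in self._subscribers[topic]:
            callback(update)

store = ToyStateStore()
membership_seen = []
catalog_seen = []
store.subscribe("membership", membership_seen.append)
store.subscribe("catalog", catalog_seen.append)
store.publish("membership", {"node": "node3", "event": "attached"})
```

Only the membership subscriber receives the update; the catalog subscriber is untouched, which is the property that lets each component listen just to the topics it cares about.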
18. StateStore – main topics
IMPALA_MEMBERSHIP_TOPIC – updates about
attached and detached nodes.
IMPALA_CATALOG_TOPIC – updates about
metadata changes.
IMPALA_REQUEST_QUEUE_TOPIC – updates
to the queue of waiting queries.
19. Admission control
There is a module called AdmissionController.
Via the impala-request-queue topic it knows about
the queries currently running and their basic
statistics, such as memory and CPU consumption.
Based on this information it can decide to:
- run the query
- queue the query
- reject the query
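The three-way decision can be sketched as a small function. The inputs (a memory estimate, a memory limit, a queue length limit) and the order of the checks are assumptions for illustration; the real AdmissionController uses its own configuration and statistics.

```python
# Hypothetical thresholds and parameter names; not Impala's actual logic.
def admit(query_mem, running_mem, mem_limit, queued, max_queued):
    """Return 'run', 'queue' or 'reject' for an incoming query."""
    if running_mem + query_mem <= mem_limit:
        return "run"        # enough memory is free right now
    if queued < max_queued:
        return "queue"      # wait until running queries finish
    return "reject"         # the queue is full as well

decisions = [
    admit(query_mem=4, running_mem=2, mem_limit=10, queued=0, max_queued=2),
    admit(query_mem=4, running_mem=8, mem_limit=10, queued=1, max_queued=2),
    admit(query_mem=4, running_mem=8, mem_limit=10, queued=2, max_queued=2),
]
```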
20. Catalog Service
It caches metadata from the Hive metastore in
Java code:
/fe/src/main/java/com/cloudera/impala/catalog/
This is important because Hive's native partition pruning
is slow, especially with a large number of
partitions.
It uses C++ code (be/src/catalog/)
to relay changes (deltas) to other nodes via
StateStore.
21. Difference from Hive
Catalog Service stores metadata in memory and operates on
it there, leaving the MetaStore for persistence
only.
Technically this means that disconnecting from the
MetaStore is not that complicated.
22. ImpalaInternalService - details
This is the place where the real heavy lifting takes
place.
Before diving in, here is what we want to understand:
Threading model
File System interface
Predicate pushdown
Resource management
23. Threading model
DiskIoMgr schedules the access of all readers to all
disks. It should include predicates.
It can give optimal concurrency. This resembles
the Intel TBB / Java ExecutorService
approach: give me small tasks and I will
schedule them.
The rest of the operations, like joins and Group By, look
single-threaded in the current version.
IMHO, sort-based joins and group by are better for
concurrency.
24. File System interface
Impala works via LibHDFS, so HDFS (not a generic
DFS) is hard-coded.
Impala requires, and checks, that short-circuit reads are
enabled.
During the planning phase, the names of the block files to
be scanned are determined.
25. Main “database” algorithms
It is interesting to see how the main operations are
implemented and what options we have:
Group By,
Order By (Sort),
Join
26. Join
Join is probably the most powerful and
performance-critical part of any analytical RDBMS.
Impala implements BroadCastJoin and
GraceHashJoin (be/src/exec/partitioned-hash-join-node.h).
Both are kinds of hash join.
The basic idea of GraceHashJoin is to partition the data
and load the corresponding partitions of
the tables into memory for the join.
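The partition-then-join idea can be sketched as follows. This is a minimal in-memory illustration under stated assumptions: rows are Python dicts, the function and parameter names are made up, and the partitions live in lists where a real engine would write the ones that do not fit to disk.

```python
from collections import defaultdict

def grace_hash_join(left, right, key_left, key_right, num_partitions=4):
    """Join two lists of dicts on equal keys, one partition pair at a time."""
    # Phase 1: partition both inputs with the same hash function, so rows
    # with equal keys always land in the partition with the same number.
    left_parts = defaultdict(list)
    right_parts = defaultdict(list)
    for row in left:
        left_parts[hash(row[key_left]) % num_partitions].append(row)
    for row in right:
        right_parts[hash(row[key_right]) % num_partitions].append(row)

    # Phase 2: for each partition pair, build a hash table on one side and
    # probe it with the other. Only one pair is needed in memory at a time.
    result = []
    for p in range(num_partitions):
        table = defaultdict(list)
        for row in left_parts.get(p, []):
            table[row[key_left]].append(row)
        for row in right_parts.get(p, []):
            for match in table.get(row[key_right], []):
                result.append({**match, **row})
    return result

users = [{"uid": 1, "name": "ann"}, {"uid": 2, "name": "bob"}]
orders = [{"user": 1, "item": "x"}, {"user": 1, "item": "y"},
          {"user": 3, "item": "z"}]
joined = grace_hash_join(users, orders, "uid", "user")
```

Because both sides use the same partitioning function, a partition of one table only ever needs to meet the matching partition of the other, which is what bounds the memory needed per step.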
27. Disk / Memory
[Figure: tables split into partitions Part 1 to Part 5, divided between disk and memory; the partitions in memory go through the in-memory hash join.]
28. BroadCast join
Just send the small table to all nodes and join it with the big
one.
It is very similar to a map-side join in Hive.
The choice of join algorithm can be hinted.
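A broadcast join can be sketched like this. The cluster is simulated by a dict that maps a node name to its local slice of the big table; that layout and all names are assumptions for the example.

```python
def broadcast_join(small, big_by_node, key_small, key_big):
    """Each node gets a full copy of `small` and joins its local big-table rows."""
    results = {}
    for node, local_rows in big_by_node.items():
        # Every node builds its own hash table from the broadcast copy.
        table = {}
        for row in small:
            table.setdefault(row[key_small], []).append(row)
        # The big table never moves: each node probes with its local rows only.
        results[node] = [{**s, **b}
                         for b in local_rows
                         for s in table.get(b[key_big], [])]
    return results

dims = [{"cid": 1, "country": "IL"}, {"cid": 2, "country": "US"}]
facts = {
    "node1": [{"c": 1, "amount": 10}],
    "node2": [{"c": 2, "amount": 5}, {"c": 9, "amount": 1}],
}
out = broadcast_join(dims, facts, "cid", "c")
```

The trade-off is network and memory cost proportional to the small table times the number of nodes, in exchange for not shuffling the big table at all.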
29. Group by
There are two main approaches: using a dictionary
or sorting.
Aggregation can run into memory problems
when there are too many groups.
Impala uses a partitioned hash aggregation, which can
spill to disk using BufferedBlockMgr.
It is somewhat analogous to the join implementation.
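The two approaches to grouping can be compared side by side. This is a generic sketch of dictionary-based versus sort-based aggregation, not Impala's implementation; the sample rows and function names are made up.

```python
from itertools import groupby
from operator import itemgetter

rows = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]

def hash_group_sum(rows):
    # Dictionary approach: one entry per group, so memory grows with the
    # number of distinct groups.
    totals = {}
    for key, value in rows:
        totals[key] = totals.get(key, 0) + value
    return totals

def sort_group_sum(rows):
    # Sorting approach: once sorted, each group is contiguous and can be
    # aggregated one group at a time.
    ordered = sorted(rows, key=itemgetter(0))
    return {key: sum(value for _, value in group)
            for key, group in groupby(ordered, key=itemgetter(0))}

hash_result = hash_group_sum(rows)
sort_result = sort_group_sum(rows)
```

Both give the same totals; they differ in where the cost goes (a hash table that must fit somewhere versus a sort that can be done externally).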
30. User defined functions
Impala supports two kinds of UDF / UDAF:
- Native, written in C/C++
- Hive UDFs, written in Java.
31. Caching
Impala does not cache data by itself.
It delegates this to the new HDFS caching capability.
In a nutshell, HDFS is able to keep a given
directory in memory.
Zero-copy access via mmap is implemented.
Why is it better than the buffer cache?
Less task switching
No CRC check
32. Spill to Disk
In order to be reliable, especially in the face of data
skew, some sort of spilling of data to disk is
needed.
Impala approaches this problem by introducing
BufferedBlockMgr.
It implements a mechanism somewhat similar to
virtual memory: pin and unpin blocks, persist them.
It can use many disks to distribute the load.
It is used in all places where memory may be
insufficient.
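The pin/unpin/persist mechanism can be sketched as a toy buffer manager. It is not the BufferedBlockMgr API: the class, its methods, the pickle-based persistence and the round-robin choice of spill directory are all assumptions made to show the idea.

```python
import os
import pickle
import tempfile

class ToyBlockMgr:
    """Hypothetical pin/unpin buffer manager that spills unpinned blocks."""

    def __init__(self, max_in_memory, spill_dirs):
        self.max_in_memory = max_in_memory
        self.spill_dirs = spill_dirs   # several directories, e.g. one per disk
        self.memory = {}               # block_id -> data held in memory
        self.pinned = set()            # blocks that must stay in memory
        self.on_disk = {}              # block_id -> path of persisted copy
        self.next_dir = 0

    def _evict_one(self):
        # Persist some unpinned block to free one memory slot.
        for block_id in list(self.memory):
            if block_id not in self.pinned:
                directory = self.spill_dirs[self.next_dir % len(self.spill_dirs)]
                self.next_dir += 1
                path = os.path.join(directory, "block_%s.bin" % block_id)
                with open(path, "wb") as f:
                    pickle.dump(self.memory.pop(block_id), f)
                self.on_disk[block_id] = path
                return
        raise MemoryError("all in-memory blocks are pinned")

    def pin(self, block_id, data=None):
        # Bring the block into memory (new or reloaded from disk) and pin it.
        if block_id not in self.memory:
            if len(self.memory) >= self.max_in_memory:
                self._evict_one()
            if block_id in self.on_disk:
                path = self.on_disk.pop(block_id)
                with open(path, "rb") as f:
                    data = pickle.load(f)
                os.remove(path)
            self.memory[block_id] = data
        self.pinned.add(block_id)
        return self.memory[block_id]

    def unpin(self, block_id):
        # The block stays in memory for now but may be spilled later.
        self.pinned.discard(block_id)

with tempfile.TemporaryDirectory() as d1, tempfile.TemporaryDirectory() as d2:
    mgr = ToyBlockMgr(max_in_memory=1, spill_dirs=[d1, d2])
    mgr.pin("a", [1, 2, 3])
    mgr.unpin("a")
    mgr.pin("b", [4, 5])            # "a" is spilled to make room
    spilled = "a" in mgr.on_disk
    mgr.unpin("b")
    reloaded = mgr.pin("a")         # "b" is spilled, "a" is read back
```

The caller decides what must stay resident (pinned) and the manager decides what to write out, which is the main difference from leaving everything to the OS.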
33. Why not Virtual Memory?
Some databases offload all buffer management to
the OS virtual memory. The most popular example:
MongoDB.
Impala creates a BufferedBlockMgr per
PlanFragment.
This gives control over how much memory is consumed by
a single query on a given node.
We can summarize the answer as: better resource
management.
35. Memory Management
Impala BE has its own MemPool class for memory
allocation.
It is used across the board by runtime primitives
and plan nodes.
36. Why its own runtime?
Why has Impala implemented its own runtime (memory
management, virtual memory)?
IMHO the existing runtimes (both POSIX and the C++
runtime) are not multi-tenant. It is hard to track
and limit resource usage by different requests in
the same process.
To solve this problem Impala has its own runtime
with tracking and limiting capabilities.
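Tracking and limiting memory per request inside one process can be sketched with a small hierarchy of counters. The class below is a hypothetical illustration of the idea (a per-query limit nested under a process-wide limit), not a copy of Impala's runtime classes.

```python
class ToyMemTracker:
    """Hypothetical memory accounting with a limit and an optional parent."""

    def __init__(self, limit, parent=None):
        self.limit = limit
        self.parent = parent
        self.used = 0

    def try_consume(self, nbytes):
        # Charge this tracker and all of its ancestors, or nobody at all.
        chain = []
        tracker = self
        while tracker is not None:
            if tracker.used + nbytes > tracker.limit:
                return False
            chain.append(tracker)
            tracker = tracker.parent
        for t in chain:
            t.used += nbytes
        return True

    def release(self, nbytes):
        tracker = self
        while tracker is not None:
            tracker.used -= nbytes
            tracker = tracker.parent

process = ToyMemTracker(limit=100)
query_a = ToyMemTracker(limit=60, parent=process)
query_b = ToyMemTracker(limit=60, parent=process)
ok_a = query_a.try_consume(50)      # within both limits
ok_b = query_b.try_consume(60)      # fits query_b, but exceeds the process limit
too_big = query_a.try_consume(20)   # exceeds query_a's own limit
```

A refused allocation is a signal the engine can act on, for example by spilling or failing only the offending query, which a plain malloc in a shared process does not offer.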
37. YARN integration
When Impala runs as part of the Hadoop stack,
resource sharing is an important question...
The two main options are:
- Just divide resources between Impala and YARN
using cgroups.
- Use YARN for resource management.
38. YARN Impala impedance
YARN is built to schedule batch processing.
Impala is aimed at sub-second queries.
Running an application master per query does not
sound “low latency”.
Requesting resources “as execution goes” does not
suit pipelined execution of query fragments.
40. LLAMA
Low Latency Application Master
Or
Long Living Application Master
It enables low-latency requests by living longer:
for the whole application lifetime.
41. How LLAMA works
1. There is a single LLAMA daemon that brokers
resources between Impala and YARN.
2. Impala asks for all resources at once: “gang
scheduling”.
3. LLAMA caches resources before returning them to
YARN.
42. Important point
Impala is capable of:
- Running real-time queries in a YARN environment
- Asking for more resources (especially memory)
when needed.
Main drawbacks:
Impala implements its own resource management among concurrent
queries, thus partially duplicating YARN functionality.
Possible deadlocks between two YARN applications.
44. What is the source of the similarity
For all their differences, they solve a similar problem:
How to survive in Africa...
Oh, sorry,
How to run and coordinate a number of tasks in the
cluster.
46. ImpalaToGo
While being a perfect product, Impala is chained to
the Hadoop stack:
- HDFS
- Management
47. Why is it a problem?
HDFS is perfect for storing vast amounts of data.
HDFS is built from large, inexpensive SATA drives.
For interactive analytics we want fast storage.
We cannot afford flash drives for the whole of our big
data.
48. What is the solution
We can create another Hadoop cluster on flash
storage.
Minus: another NameNode to manage, and replication
will waste space.
If the replication factor is one, any problems must
be repaired manually.
49. Cache Layer in place of DFS
HDFS/Hadoop cluster
ImpalaToGo cluster
Data caching (LRU)
Auto load
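The "data caching (LRU)" and "auto load" labels can be illustrated with a small LRU cache that loads from a slower store on a miss. The class, the loader callback and the paths are hypothetical; nothing here reflects ImpalaToGo's actual code.

```python
from collections import OrderedDict

class ToyLruCache:
    """Hypothetical LRU cache in front of a slow remote store."""

    def __init__(self, capacity, load_from_remote):
        self.capacity = capacity
        self.load_from_remote = load_from_remote  # called on a miss ("auto load")
        self.entries = OrderedDict()

    def get(self, path):
        if path in self.entries:
            self.entries.move_to_end(path)        # mark as most recently used
            return self.entries[path]
        data = self.load_from_remote(path)
        self.entries[path] = data
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)      # evict the least recently used
        return data

remote_reads = []

def fake_remote(path):
    # Stand-in for a read from the backing HDFS cluster.
    remote_reads.append(path)
    return "contents of " + path

cache = ToyLruCache(capacity=2, load_from_remote=fake_remote)
for p in ["/a", "/b", "/a", "/c", "/b"]:
    cache.get(p)
```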
50. Elasticity
With a cache layer in place of a distributed file
system, it is much easier to resize the cluster.
ImpalaToGo uses consistent hashing for its data
placement, to minimize the impact of a resize.
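Why consistent hashing limits the impact of a resize can be seen in a small sketch. The ring below, with MD5 as the hash and 100 virtual nodes per node, is a generic textbook construction chosen for illustration; ImpalaToGo's actual placement scheme may differ in every detail.

```python
import bisect
import hashlib

class ToyHashRing:
    """Hypothetical consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=100):
        self.vnodes = vnodes
        self.ring = []                  # sorted list of (hash, node)
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

    def add(self, node):
        # Each node owns many points on the ring to even out the load.
        for i in range(self.vnodes):
            bisect.insort(self.ring, (self._hash("%s#%d" % (node, i)), node))

    def node_for(self, key):
        # First ring position clockwise from the key's hash, wrapping around.
        idx = bisect.bisect(self.ring, (self._hash(key), ""))
        return self.ring[idx % len(self.ring)][1]

files = ["/data/part-%04d" % i for i in range(1000)]
ring = ToyHashRing(["node1", "node2", "node3"])
before = {f: ring.node_for(f) for f in files}
ring.add("node4")
after = {f: ring.node_for(f) for f in files}
moved = [f for f in files if before[f] != after[f]]
```

After the fourth node joins, only a fraction of the files change owner, and every file that moves goes to the new node; with a plain `hash(key) % node_count` scheme most files would be reassigned.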
51. Who we are?
A group of like-minded developers working on
making Impala even greater.