2. What is ImpalaToGo
ImpalaToGo is a fork of Cloudera Impala with a
proxy layer between Impala and a generic FS/DFS
(not strictly HDFS). This proxy acts as a cache layer.
3. Why a caching layer?
We want ImpalaToGo to work efficiently with
remote storages, especially cloud object
storage, like S3.
Why not another storage system, such as GlusterFS?
We believe that a caching layer is inherently
simpler and more elastic than a storage layer.
4. Storage volume vs speed
We believe that it is optimal to:
- store big data volumes on slow & cheap
HDD, which are part of object storage.
- store the hot data set on local SSD drives.
Impala needs about 100 MB/sec of bandwidth per
CPU. SSDs are ideal for this purpose.
5. Optimize local drive space
On AWS, ephemeral disk space is a
scarce resource.
A cache can store all data without replication,
thus making full use of the local disk space.
A storage layer would require redundancy, thus
wasting space.
6. Cache layer design
During the design of a distributed cache system
we have to answer the following design
questions:
● How are files distributed among the nodes?
● How are files stored on individual drives?
● How to ensure data locality?
7. File distribution
We use a consistent hash over full file names
to map files to cluster nodes.
Benefits:
● No metadata storage required
● Efficient resize
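The idea can be illustrated with a minimal sketch of a consistent hash ring. This is not the ImpalaToGo implementation: the class name, the node names, the use of MD5, and the virtual-node count are all assumptions made for illustration. It shows why no metadata is needed (the owner is computed from the file name alone) and why a resize is cheap (adding a node only moves the files that the new node takes over).

```python
import bisect
import hashlib


class ConsistentHashRing:
    """Hypothetical hash ring that maps full file names to cluster nodes."""

    def __init__(self, nodes, vnodes=100):
        self.vnodes = vnodes   # virtual points per node (assumed value)
        self._hashes = []      # sorted ring positions
        self._owner = {}       # ring position -> node name
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(key):
        # Any stable hash works; MD5 is used here only for its even spread.
        return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

    def add_node(self, node):
        # Place several virtual points per node to even out the load.
        for i in range(self.vnodes):
            position = self._hash(f"{node}#{i}")
            self._owner[position] = node
            bisect.insort(self._hashes, position)

    def remove_node(self, node):
        self._hashes = [h for h in self._hashes if self._owner[h] != node]
        self._owner = {h: n for h, n in self._owner.items() if n != node}

    def node_for(self, file_name):
        """Returns the node owning file_name: the next point clockwise."""
        if not self._hashes:
            raise ValueError("the ring has no nodes")
        position = self._hash(file_name)
        index = bisect.bisect_right(self._hashes, position) % len(self._hashes)
        return self._owner[self._hashes[index]]


ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
owner = ring.node_for("s3://someBucket/SomeDir/SomeFile")
```

Every node that builds the ring from the same node list computes the same owner for a given file name, so no lookup table has to be stored or synchronized.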
8. File storage on nodes
We store files in a single directory, under the same
structure as in DFS.
For instance, a file in S3://someBucket/SomeDir/SomeFile
will be stored in
/var/cache/impalaToGo/someBucket/SomeDir/SomeFile
Since files are distributed by consistent hash, each file will
be stored on exactly one node.
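The path mapping above can be sketched as a small function. The function name and the `CACHE_ROOT` constant are hypothetical; only the example paths come from the slide, and the real code may handle edge cases differently.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Cache directory taken from the example above; treated here as configurable.
CACHE_ROOT = PurePosixPath("/var/cache/impalaToGo")


def local_cache_path(remote_uri, cache_root=CACHE_ROOT):
    """Maps a remote URI to the local path that mirrors its DFS structure."""
    parsed = urlparse(remote_uri)
    if not parsed.netloc:
        raise ValueError(f"no bucket or host in URI: {remote_uri!r}")
    # netloc keeps the bucket name's original case; the path keeps the rest.
    relative = parsed.path.lstrip("/")
    return cache_root / parsed.netloc / relative


print(local_cache_path("S3://someBucket/SomeDir/SomeFile"))
```

Because the mapping is a pure function of the remote name, the local location of any cached file can be predicted without consulting an index.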
9. File storage - assessment
Files are easy to find in the local path,
since we can predict the cache structure.
DevOps is left to choose how multiple drives
are organized together.
We store whole files, not blocks, so a single huge
file cannot be processed by several nodes.
10. Cache eviction
Currently, we have implemented a simple LRU
algorithm.
If nothing can be evicted, the cache is bypassed.
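A minimal sketch of LRU eviction with a bypass path follows. It is not the ImpalaToGo code. The class and method names are hypothetical, and so is the reason why nothing can be evicted: the sketch assumes that files currently being read are pinned, and that a file larger than the whole cache is never admitted.

```python
from collections import OrderedDict


class LruFileCache:
    """Hypothetical size-bounded file cache with LRU eviction."""

    def __init__(self, capacity_bytes):
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self._entries = OrderedDict()  # path -> size, least recent first
        self._pinned = set()           # files in use (assumed mechanism)

    def __contains__(self, path):
        return path in self._entries

    def touch(self, path):
        """Marks a cached file as most recently used; True on a hit."""
        if path in self._entries:
            self._entries.move_to_end(path)
            return True
        return False

    def pin(self, path):
        self._pinned.add(path)

    def unpin(self, path):
        self._pinned.discard(path)

    def admit(self, path, size):
        """Caches a file; returns False when the cache must be bypassed."""
        if self.touch(path):
            return True
        if size > self.capacity_bytes:
            return False
        # Choose victims oldest-first, skipping pinned files, before evicting.
        needed = self.used_bytes + size - self.capacity_bytes
        victims, freed = [], 0
        for candidate, candidate_size in self._entries.items():
            if freed >= needed:
                break
            if candidate in self._pinned:
                continue
            victims.append(candidate)
            freed += candidate_size
        if freed < needed:
            return False  # nothing more can be evicted: bypass the cache
        for victim in victims:
            self.used_bytes -= self._entries.pop(victim)
        self._entries[path] = size
        self.used_bytes += size
        return True
```

When `admit` returns False, the caller would read the file directly from remote storage and leave the cache contents untouched.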
11. Roadmap - Pre-fetch
Configurable pre-fetch capability. We want users
to be able to specify rules for which data
should be pre-fetched into the cache when it is
written to the remote storage.
12. Roadmap - Tachyon
We are working on Tachyon integration as one
of the possible caching layers for ImpalaToGo.