An Efficient Cache Handling Technique in Database
Abhishek Shah, George Sam, Nikhil Lakade
University of Southern California
In various commercial database
systems, queries with complex
structures often take a long time to execute. The
efficiency of query processing can be greatly
improved if the results of previous queries are
stored in a cache.
The cached results can then be used to answer
later queries. Furthermore, the cost of
processing large and complex queries is high in
commercial databases because of the size of the
databases, so we need a way to optimize
processing by automatically caching
intermediate results. Creating such an automatic
system to cache the results would help in saving
Existing cache systems do store
intermediate results, but they do not know how
to use the cache memory efficiently to store the
results. A further problem arises if the database
is regularly updated, because the cached results
then become obsolete. It is therefore necessary to
decide when to discard a cache entry and how
frequently to check the database for updates.
2.1 The Problem
and decision support usually ranges from minutes
to hours, but depends mostly on the extent of the
database, the type of query and the processing
capabilities of the servers. Usually, large scale
servers require minimal time to process the
decision support queries. This advantage,
however, shrinks considerably if multiple queries
are executed simultaneously, or if the structure of
the query is complex. Complex queries are built up
of many subqueries, and the result set of the final
query depends on the result sets obtained by
executing the subqueries.
Traditional databases treat every
query independently, which increases processing
time. Moreover, if a particular query is used
frequently, it has to be executed in full every
time, which makes the processing redundant.
2.2 Challenges
a. One main concern is knowing and deciding
which cache entries are to be deleted, and
whether to delete the entire cache or just part
of the cache. This becomes crucial if part of
the result is needed for further queries.
b. Many commercial database systems are
frequently updated. Our challenge here is to
update the results of the intermediate queries
as and when CRUD [1] operations take place
on the database.
Seeing how commercial databases and
OLAP struggle to fetch query results efficiently,
having an automated system closely coupled with
the query optimizer to cache results and manage
them will provide an efficient way to retrieve
data. For example, consider a website which has
to fetch data from a large database every time a
page is accessed by a user.
The load time of the page increases
significantly because of large data and relatively
slow query processing time. It is also useful if the
system can automatically decide when to discard
or keep a particular cache entry, and handle
frequent database updates.
4.1 Exchequer
Exchequer is an intermediate query
caching system developed for the purpose of
storing relevant subquery results. The authors of
the paper on this system differentiate it
from normal operating system caching
techniques on the following aspects:
a. In traditional cache systems, the size and cost of
computation are not considered; the recency
of use of a data object is sufficient.
b. In a query cache system, the results of previously
cached queries can be reused to answer further queries.
c. In traditional cache systems, the pages are
independent of each other, and so can be easily
deleted without affecting other pages.
The Exchequer system takes into
account the dynamic nature of the cache, for which
the traditional materialized view or index
selection solutions do not work efficiently. In the
materialized view scenario, there are techniques
to decide which entities to materialize, taking
previously materialized views into
consideration. These techniques make static
decisions and therefore do not fit a cache whose
contents change over time; Exchequer instead
exploits the dynamic nature of the cache system.
Another system uses multi-query caching and takes
into account the cost of materializing the selected
views, but makes a static decision on what to
materialize.
The Exchequer system also uses an AND-OR
DAG representation of queries and the cached
results. The use of a DAG makes it extensible to
new operations and efficiently encodes
alternative ways of evaluating queries. The
Exchequer DAG representation also takes into
account sort orders and the presence of indexes.
In the Exchequer architecture, there is a
tight coupling between the cache manager and the
optimizer. The query execution plan refers
to cached relations, which the execution
engine retrieves, and the new intermediate results
produced by the query are sometimes cached. An
incremental greedy algorithm decides
which results should be cached. The algorithm
first checks whether any of the nodes of the chosen
query plan should be cached.
The incremental decision is made by
updating the representative set with the new query
and applying a selection algorithm to the nodes
selected when the previous query was considered,
together with the nodes of the chosen query plan. The
output of this algorithm is a set of nodes that are
marked for caching and the best plan for the
current query. When the query is
executed, the marked nodes in its best plan
are added to the cache, where they replace
unmarked nodes. The unmarked nodes are
chosen for replacement using LCS/LRU, i.e. the
largest results are evicted first and, amongst the
remaining results, the least recently used is
evicted. Because Exchequer optimizes the query
before fresh caching decisions are made, the
chosen plan for each query is optimal for that
query, given the current cache contents.
4.2 Composite Information Server Platform &
Query Processing Engine:
Composite Information Server
is an intermediate server platform that
works over REST, HTTP and SOAP protocols in
a client-server web application environment. It is
provided by Oracle. It receives client requests
and authenticates them either through LDAP or
Composite Domain Authentication and then
passes it to the Query Processing Engine. The
query processing engine then executes this
request over some data source and retrieves a
data result. It then combines this data into a
single SQL or XML result set and returns to the
client. The Query Processing Engine provides
various optimization methods so that the SQL
query is efficient. It basically translates all
requests into a distribution plan. It then analyzes
queries and creates an optimized execution plan
to determine the intermediate steps and data
source requests. The Query Processing Engine
also employs a caching technique for the queries.
This sequence of queries is then executed
against the relevant data.
The engine minimizes overheads and
creates efficient join methods that guard
against a suboptimal query. The techniques that
the query engine provides include:
a. SQL Pushdown:
The Query Processing Engine offloads most of
the query processing. It pushes down the select
query operations like string searches,
comparisons, sorting etc. into the underlying
relational data sources.
b. Parallel Processing:
The Query Engine processes requests in a parallel
and asynchronous way on separate threads, thus
reducing the wait time and data source response
c. Caching:
The Composite Information Server can be
configured to cache the results of queries, web
service calls and procedures. It does this on a
per view/query basis. It stores these intermediate
results in either a relational database or a
file-based cache. The engine always checks whether
the result of a query is already present in the cache
and, if so, uses the cached data. Caching is most
useful for data that is frequently requested and
rarely changes. When the
data is constantly changing, the query engine
does not perform well, because it cannot handle
frequent changes to the data.
4.3 Multidimensional Query Cache using Chunks
To improve query response time in
OLAP, caching of queries has been proposed,
which consists mainly of two approaches: table
level caching and query level caching. A
previously proposed work uses chunks [3] to reuse
the results of previous queries to answer future
queries. To achieve better performance, the chunked
cache is combined with a chunked file
organization.
Chunked file organization basically
redefines the organization of relational tables. This
new organization reduces the cost of a
chunk cache miss. One concern with this
methodology is selecting the required chunks. Smaller
chunks result in more efficient query optimization,
but efficiency degrades when the total number of
chunks in the system increases. This raises a further
question, namely which replacement
policy to use for chunked caches.
4.4 Usability Based Caching
Another similar piece of work
on this topic is usability-based caching [4]
of query results in OLAP systems. It proposes
a new cache management scheme
for OLAP systems that is based on the usability
of query results in rewriting and processing
related future queries. It considers not only
the queries that are currently being
executed but also predicts future queries,
using a probability model based on the present
and past queries executed on the system.
5.1 Architecture
The architecture of our proposed model of
optimizer is shown in figure 1.
Fig1. Architecture of System
The optimizer and cache manager
work in close coupling with our intermediate
query cache system. The optimizer uses the
chunked cache to efficiently cache incoming
database queries.
The query execution plan and the cache
management plan are designed inside the
Optimizer and Cache Manager. This block is
responsible for changing the current cache. The
query execution plan is created using the cached
chunks, which are obtained from the
Execution Engine as and when required.
5.2 Use of Chunks
Systems such as OLAP are becoming
increasingly dynamic and
important for business data analysis. Usually the
data sets in such systems are multidimensional
in nature. Traditional relational systems are
designed in such a way that they cannot provide
the required performance for these data sets.
Hence such systems are built using a three-tier
architecture. The first tier provides an easy-to-use
graphical tool that allows users to build
queries. The second tier provides a
multidimensional view of the data stored in the
final tier, which can be an RDBMS. Queries
in systems like OLAP are very interactive
and demand quick response times even if they are
complex.
At times OLAP queries are repetitive in
nature and follow a predictable pattern. An
OLAP session can be characterized using
different kinds of locality.
1. Temporal: The same data might be accessed
repeatedly by the same user or a different user.
2. Hierarchical: This kind of locality is specific to
the OLAP domain and is a consequence of the
presence of hierarchies on the dimensions. Data
members which are related by parent/child or
sibling relationships will be accessed over and
over again. For example, if a user is looking at
data for the United States, the next query is likely to
be about Canada or Mexico.
We use a dynamic caching
scheme in which the cache contents vary
over time, since new items may be inserted
and old items may be removed from the cache. A
dynamic approach is particularly beneficial
at the middle tier, since it adapts to the query
profile. We use chunks for dynamic caching
and demonstrate its feasibility under realistic
query workloads without much overhead. We use
multidimensional arrays to represent data. Instead
of storing a large array in simple rows or columns,
we break it down into chunks and store it in
a chunked format. The distinct values of each
dimension are divided into ranges, and chunks
are created based on this division. Figure 2
shows how a multidimensional space can be
broken up into chunks.
Fig 2 Chunks
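The division of each dimension into ranges can be illustrated with a small sketch. It assumes that every dimension is described by a sorted list of split points and that chunks are numbered in row-major order; the names range_index, chunk_number and splits are hypothetical and are not taken from our system.

```python
from bisect import bisect_right
from typing import Sequence


def range_index(value: float, split_points: Sequence[float]) -> int:
    """Index of the range holding `value`; n sorted split points give n + 1 ranges.

    A value equal to a split point falls into the range above it.
    """
    return bisect_right(split_points, value)


def chunk_number(point: Sequence[float], splits: Sequence[Sequence[float]]) -> int:
    """Map a point to one chunk number by combining per-dimension range indexes."""
    number = 0
    for value, split_points in zip(point, splits):
        ranges_in_dimension = len(split_points) + 1
        number = number * ranges_in_dimension + range_index(value, split_points)
    return number


# Hypothetical 2-D space: each dimension is cut at 10 and 20, giving 3 x 3 chunks.
splits = [[10, 20], [10, 20]]
corner_chunk = chunk_number((5, 5), splits)
other_chunk = chunk_number((25, 15), splits)
```

With this numbering, every cell of the multidimensional space belongs to exactly one chunk, so a query region can be translated into a list of chunk numbers.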
5.3 Caching with the Chunks
In chunk-based caching, the
query results to be stored in the cache are broken
up into chunks and the chunks are cached. When
a user issues a new query, the chunks
required to answer that query are determined.
Depending on the contents of the cache, this list of
chunks is divided into two parts. One part is
answered from the cache. The other, consisting of
the missing chunks, has to be computed from the
backend. The cost is reduced because only the
missing chunks are computed from the backend.
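A minimal sketch of this lookup is given below. It assumes that a chunk is identified by an integer and holds a list of result rows, and that the backend is reachable through a callable; answer_query, compute_from_backend and the sample data are hypothetical. Replacement is ignored here, so newly computed chunks are simply added to the cache.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Chunk = List[tuple]  # assumption: a chunk is a list of result rows


def answer_query(
    needed: Iterable[int],
    cache: Dict[int, Chunk],
    compute_from_backend: Callable[[int], Chunk],
) -> Tuple[List[tuple], List[int]]:
    """Answer a query from cached chunks, computing only the missing ones."""
    rows: List[tuple] = []
    missing: List[int] = []
    for chunk_id in needed:
        if chunk_id not in cache:
            # Cache miss: only this chunk is requested from the backend.
            missing.append(chunk_id)
            cache[chunk_id] = compute_from_backend(chunk_id)
        rows.extend(cache[chunk_id])
    return rows, missing


# Hypothetical usage: chunks 0 and 1 are cached, chunk 2 is not.
backend_calls: List[int] = []


def fake_backend(chunk_id: int) -> Chunk:
    backend_calls.append(chunk_id)
    return [(f"row{chunk_id}",)]


cache: Dict[int, Chunk] = {0: [("a",)], 1: [("b",)]}
rows, missing = answer_query([0, 1, 2], cache, fake_backend)
```

In this example the backend is contacted once, for chunk 2 only, which is the saving described above.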
Caching chunks improves the
granularity of caching. This leads to better
utilization of the cache in two ways.
1. Frequently accessed chunks of a query get
cached. The chunks which are not frequently
accessed are replaced eventually.
2. Previous queries can be used much more
effectively. For example, Figure 3 shows a chunk
based cache, in which each query represents a
portion of the multidimensional space. Say we have
three queries Q1, Q2 and Q3, issued in that
order. Q3 is not
contained in Q1, Q2 or their union. Thus,
methods based on query containment will not be
able to use Q1 and/or Q2 to answer Q3. With
chunk based caching, Q3 can use the chunks it
has in common with Q1 and Q2. Only the
remaining chunks, which are shown in
Figure 3, have to be computed. The chunked file
organization in the relational backend enables
these remaining chunks to be computed in time
proportional to their size rather than in time
proportional to the size of Q3.
Fig 3 Reuse of Cached Chunks
5.4 Replacement Scheme using Chunks
Replacement schemes are a very important
part of this system, as future queries depend largely
on them. Old chunks have to be removed and
new chunks added to the cache for caching to be
efficient. There are different replacement
strategies, such as LRU, but they are not efficient enough.
Schemes which make use of a profit metric
consisting of the size and execution cost of a query are
considered in [6]. We do something similar for the
replacement scheme: we combine the TIME scheme with
the notion of benefit. Let Benefit(C) denote the benefit of a
chunk C. We associate one more quantity, called Weight(C),
with each chunk C in the cache. The replacement algorithm
is as follows:
Algorithm: TimeBenefit
Input: chunk N to be inserted in the cache
while (space not available for N):
    Let C be the chunk corresponding to the current
    time position.
    if (Weight(C) ≤ 0):
        Evict C from the cache
    else:
        Weight(C) = Weight(C) - Benefit(N)
    Advance time position
Insert N into cache
Weight(N) = Benefit(N)
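The TimeBenefit pseudocode can be sketched in Python as follows. This is an illustration under stated assumptions rather than our implementation: chunks carry an explicit size, the cache is a circular list scanned by a time position, and an incoming chunk must have a positive benefit so that the scan terminates. The class and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    name: str
    size: int
    benefit: float
    weight: float = 0.0


class TimeBenefitCache:
    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.used = 0
        self.slots: List[Chunk] = []  # circular list of cached chunks
        self.position = 0             # current time position

    def insert(self, new: Chunk) -> List[str]:
        """Insert `new`, evicting chunks whose weight has dropped to zero or below."""
        if new.size > self.capacity:
            raise ValueError("chunk is larger than the cache")
        if new.benefit <= 0:
            raise ValueError("assumption: benefit must be positive to terminate")
        evicted: List[str] = []
        while self.capacity - self.used < new.size:
            self.position %= len(self.slots)
            current = self.slots[self.position]
            if current.weight <= 0:
                # Evict; the next chunk slides into this slot, which advances the scan.
                self.slots.pop(self.position)
                self.used -= current.size
                evicted.append(current.name)
            else:
                current.weight -= new.benefit
                self.position += 1
        new.weight = new.benefit
        self.slots.append(new)
        self.used += new.size
        return evicted


# Hypothetical usage: a cache of 10 units holding two chunks of 5 units each.
tb_cache = TimeBenefitCache(capacity=10)
tb_cache.insert(Chunk("A", size=5, benefit=3))
tb_cache.insert(Chunk("B", size=5, benefit=2))
evicted_names = tb_cache.insert(Chunk("C", size=5, benefit=2))
```

In the example, chunk B loses its weight first and is evicted, while the more beneficial chunk A survives the scan.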
5.2 DAG Representation
DAG stands for Directed Acyclic Graph. In our
implementation of a cache query optimizer
system, it is important to find an efficient
way to represent a query, so that we can
find an optimized query plan to execute. Query
execution is optimized if we have an
efficient query plan. To evaluate queries
structurally, we use directed
acyclic graphs. The DAG is a compact way to
represent the set of queries and operations. Using
the DAG, an efficient query plan can be
generated, and this query DAG can be further
used to create a query caching algorithm.
An efficient algorithm using the DAG structure
for queries is the Volcano Algorithm [5]. This
algorithm represents queries and sets of
queries in the form of a DAG. In such a DAG, there are
two sets of nodes:
AND nodes and OR nodes. AND nodes
represent the operations performed on result
sets and queries, such as
select and join.
OR nodes represent queries and result
sets. In our implementation the OR nodes
represent the subqueries which will get cached.
The OR nodes are called equivalence nodes in the
Volcano Algorithm. The equivalence nodes do
not represent any operation; they only
describe the data in the system.
How single queries are handled:
Single queries are directly
representable as a DAG. For a single query, a query
tree is first created using the relations in the
query and the operations. Once the query tree is
created, it is sequentially expanded to generate
further equivalence nodes over the operational
nodes, as shown in the following diagrams. The
squares denote the equivalence nodes and the
circles denote the operational nodes.
Let us assume the query is of the form
A⋈B⋈C. It is represented as a DAG in Fig 4.a,
Fig 4.b and Fig 4.c. The relations A, B, C and the
intermediate relations are represented as
equivalence nodes. The join operators are
represented as circle nodes. The representation
is built in the following steps. Fig 4.a shows
the query tree for the query. The additional
equivalence node for the intermediate result AB is
created in Fig 4.b.
Now the DAG is expanded to
accommodate all possible combinations of the
join operators. In Fig 4.c we take all possible
initial joins, that is AB, AC and BC.
These are stored as intermediate result sets,
which are later joined with the remaining relation
to create the final result set.
Fig4.a- Initial Query Tree
Fig 4.b- Intermediate DAG
Fig 4.c- DAG of Single Query
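The expansion of A⋈B⋈C can be sketched as follows. The sketch assumes that an equivalence node is identified by the set of relations it covers and that a join operation node is a pair of child equivalence nodes; commutative variants of the same join are listed only once. The function name expand_join_dag is hypothetical, and this is a simplification of what a Volcano-style optimizer does.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Tuple

Equiv = FrozenSet[str]         # equivalence (OR) node: the relations it covers
JoinOp = Tuple[Equiv, Equiv]   # operation (AND) node: its two child equivalence nodes


def expand_join_dag(relations: List[str]) -> Dict[Equiv, List[JoinOp]]:
    """List every equivalence node with all join operations that can produce it."""
    dag: Dict[Equiv, List[JoinOp]] = {}
    for size in range(1, len(relations) + 1):
        for subset in combinations(sorted(relations), size):
            node = frozenset(subset)
            first, rest = subset[0], subset[1:]
            ops: List[JoinOp] = []
            # Keep `first` on the left so that each split is generated once.
            for k in range(len(rest)):
                for extra in combinations(rest, k):
                    left = frozenset((first,) + extra)
                    ops.append((left, node - left))
            dag[node] = ops  # base relations end up with no operations
    return dag


join_dag = expand_join_dag(["A", "B", "C"])
root_ops = join_dag[frozenset("ABC")]
```

For three relations this yields seven equivalence nodes, and the root can be produced by three joins, one on top of each of the intermediate results AB, AC and BC.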
How query sets are handled:
Query sets are handled a little differently
in the Volcano Algorithm. In this version of the
Volcano Algorithm, the intermediate equivalence
nodes represent the result sets. Each query set is a
set of multiple queries. The deletion of queries in
this case is done by reference counting
mechanisms. The queries are added into the DAG
one at a time. At each time a query is inserted, a
new equivalence node and operational node is
Sometimes, the expressions may match
existing subexpressions in the DAG. Query
subexpressions may also be equivalent to each other.
The Volcano optimizer algorithm handles these
subexpression anomalies. An example is the
problem arising from the associativity of the join
operators in multi-relation queries. The Volcano
Algorithm applies associativity and then
unifies the equivalent nodes by replacing them with a
single equivalence node.
5.3 Query Optimization over DAG:
Now that the DAG is created, the
algorithm performs certain functions on the nodes to
evaluate the cost of each node based on its type. The
costs of equivalence nodes and operational nodes are
evaluated separately. The optimizer also takes into account
the cost of reading the input when pipelines are not used.
The cost at each node is a function of its children and the
subtrees below it.
For an operational node o the cost function is defined as
$$\mathrm{cost}(o) = \text{cost of executing}(o) + \sum_{e_i \in \mathrm{children}(o)} \mathrm{cost}(e_i)$$
The children of o are all equivalence nodes. The cost of
each equivalence node e is
$$\mathrm{cost}(e) = \begin{cases} \min\{\mathrm{cost}(o_i) \mid o_i \in \mathrm{children}(e)\} & \text{if } e \text{ has children} \\ 0 & \text{if there are no children} \end{cases}$$
We now have to take into account the case where
some subset M of nodes is materialized and we may
need to reuse these materialized nodes. We introduce a new
function reusecost(e_i), which gives the cost of reusing an
equivalence node e_i from the materialized set M.
The modified cost function is then
$$\mathrm{cost}(o) = \text{cost of executing}(o) + \sum_{e_i \in \mathrm{children}(o)} CC(e_i)$$
where
$$CC(e_i) = \begin{cases} \mathrm{cost}(e_i) & \text{if } e_i \notin M \\ \min(\mathrm{cost}(e_i), \mathrm{reusecost}(e_i)) & \text{if } e_i \in M \end{cases}$$
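These cost functions translate into a short recursive sketch. The DAG encoding used here is an assumption made for illustration: each equivalence node maps to a list of operation nodes, and each operation node is a pair of its own execution cost and its child equivalence nodes. The sample costs are invented, and memoization is left out for brevity.

```python
from typing import Dict, List, Set, Tuple

Op = Tuple[float, List[str]]   # (cost of executing the operation, child equivalence nodes)
Dag = Dict[str, List[Op]]      # equivalence node -> operations that can produce it


def equiv_cost(e: str, dag: Dag, reuse_cost: Dict[str, float], materialized: Set[str]) -> float:
    """cost(e): cheapest operation producing e, or 0 if e has no children."""
    ops = dag[e]
    if not ops:
        return 0.0
    return min(op_cost(op, dag, reuse_cost, materialized) for op in ops)


def op_cost(op: Op, dag: Dag, reuse_cost: Dict[str, float], materialized: Set[str]) -> float:
    """cost(o): execution cost plus CC(e_i) for every child e_i."""
    exec_cost, children = op
    total = exec_cost
    for child in children:
        child_cost = equiv_cost(child, dag, reuse_cost, materialized)
        if child in materialized:
            # CC(e_i) = min(cost(e_i), reusecost(e_i)) for materialized nodes
            child_cost = min(child_cost, reuse_cost[child])
        total += child_cost
    return total


# Hypothetical DAG for A join B join C with two alternative plans at the root.
example_dag: Dag = {
    "A": [], "B": [], "C": [],
    "AB": [(10.0, ["A", "B"])],
    "BC": [(40.0, ["B", "C"])],
    "ABC": [(5.0, ["AB", "C"]), (3.0, ["A", "BC"])],
}
plain_cost = equiv_cost("ABC", example_dag, {}, set())
cached_cost = equiv_cost("ABC", example_dag, {"BC": 2.0}, {"BC"})
```

Without materialization the plan through AB is cheapest; once BC is materialized with a low reuse cost, the plan through BC becomes the better choice.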
5.4 Algorithm to Handle Cache Delete and Insertion
The proposed algorithm finds out whether any of the nodes
in the DAG and the chunk system are worth caching. We
need to find the benefit of adding or deleting these
nodes. We create a benefit function for the DAG that is
similar to the Benefit(N) function in the chunk system.
The benefit function also takes into account the number of
times a query was used. The proposed optimizer
needs to know the nodes selected when the previous query
was considered and all the nodes of the query plan.
Suppose S is a set of nodes selected to be cached and
R is the representative set of queries. Then
$$\mathrm{cost}(R, S) = \sum_{q \in R} \mathrm{cost}(q, S) \cdot \mathrm{weight}(q)$$
The benefit function is then
$$\mathrm{benefit}(R, x, S) = \mathrm{cost}(R, S) - \left(\mathrm{cost}(R, \{x\} \cup S) + \mathrm{cost}(x, S)\right)$$
This gives the benefit we get by adding node
x to the set of cached nodes. In cases where x is already
computed we assume cost(x, S) to be 0.
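The two formulas can be checked with a small numerical sketch. It assumes that the cost of a query or node given a cached set is supplied as a function; workload_cost, benefit, toy_cost and all the numbers below are hypothetical and serve only to illustrate the arithmetic.

```python
from typing import Callable, Dict, FrozenSet

CostFn = Callable[[str, FrozenSet[str]], float]  # cost(node or query, cached set)


def workload_cost(weights: Dict[str, float], cached: FrozenSet[str], cost: CostFn) -> float:
    """cost(R, S): weighted sum of the query costs given the cached set S."""
    return sum(cost(q, cached) * w for q, w in weights.items())


def benefit(
    weights: Dict[str, float],
    x: str,
    cached: FrozenSet[str],
    cost: CostFn,
    already_computed: bool = False,
) -> float:
    """benefit(R, x, S) = cost(R, S) - (cost(R, {x} U S) + cost(x, S))."""
    compute_x = 0.0 if already_computed else cost(x, cached)
    with_x = workload_cost(weights, cached | {x}, cost)
    return workload_cost(weights, cached, cost) - (with_x + compute_x)


def toy_cost(node: str, cached: FrozenSet[str]) -> float:
    """Invented costs: caching node 'x' makes q1 cheaper and leaves q2 unchanged."""
    if node == "q1":
        return 20.0 if "x" in cached else 100.0
    if node == "q2":
        return 60.0
    return 0.0 if node in cached else 30.0


query_weights = {"q1": 2.0, "q2": 1.0}
gain = benefit(query_weights, "x", frozenset(), toy_cost)
gain_if_computed = benefit(query_weights, "x", frozenset(), toy_cost, already_computed=True)
```

Here the workload costs 260 without x and 100 with x cached, so the benefit is 130 when x still has to be computed at a cost of 30, and 160 when x is already available.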
We can now create a modified algorithm that will
handle cache deletions and insertions.
Algorithm: TimeBenefit
Input: chunk N to be inserted in the cache
       Set X of the expanded DAG with nodes cached
       Node x with benefit(R, x, X)
while (space not available for N):
    Let C be the chunk corresponding to the current
    time position.
    if (Weight(C) ≤ 0):
        Evict C from the cache
        Delete x and its equivalence nodes from
    else:
        Weight(C) = Weight(C) - Benefit(N)
    Advance time position
Insert N into cache
Weight(N) = Benefit(N)
Algorithm: Modify_Cache
Input: Expanded DAG for R, the representative set of queries,
       and the set of candidate equivalence nodes for caching
       Chunk N to be inserted
Output: Set of nodes to be cached
X = φ
Y = set of candidate equivalence nodes for caching
while (Y ≠ φ)
    Among nodes y ∈ Y such that size({y} ∪ X) < CacheSize,
        pick the node x with the highest benefit(R, x, X)/size(x)
        /* i.e., highest benefit per unit space */
    if (benefit(R, x, X) < 0)
        break; /* No further benefits to be had, stop */
    Y = Y - {x}; X = X ∪ {x}
return X
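The greedy selection loop of Modify_Cache can be sketched as below. The sketch assumes that X starts empty, that node sizes are known, and that the benefit of a node given the nodes chosen so far is supplied as a callable; it also stops when no remaining candidate fits into the cache, a case the pseudocode leaves implicit. All names and numbers are hypothetical.

```python
from typing import Callable, Dict, Set

BenefitFn = Callable[[str, Set[str]], float]  # benefit(R, x, X) with R fixed


def modify_cache(
    candidates: Set[str],
    size: Dict[str, int],
    cache_size: int,
    benefit: BenefitFn,
) -> Set[str]:
    """Greedily pick the nodes with the highest benefit per unit of space."""
    chosen: Set[str] = set()
    remaining = set(candidates)
    while remaining:
        used = sum(size[n] for n in chosen)
        # Only nodes that still fit next to the chosen ones are considered.
        fitting = [n for n in sorted(remaining) if used + size[n] < cache_size]
        if not fitting:
            break
        best = max(fitting, key=lambda n: benefit(n, chosen) / size[n])
        if benefit(best, chosen) < 0:
            break  # no further benefit to be had
        remaining.discard(best)
        chosen.add(best)
    return chosen


# Hypothetical candidates with fixed benefits that ignore the chosen set.
node_size = {"AB": 4, "BC": 6, "AC": 5}
node_benefit = {"AB": 40.0, "BC": 30.0, "AC": -5.0}


def static_benefit(node: str, chosen: Set[str]) -> float:
    return node_benefit[node]


small_cache = modify_cache(set(node_size), node_size, 10, static_benefit)
larger_cache = modify_cache(set(node_size), node_size, 11, static_benefit)
```

With a cache size of 10 only AB is selected, because BC no longer fits and AC has a negative benefit; with a cache size of 11 both AB and BC are selected.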
The Modify_Cache algorithm now handles the
deletion of nodes and chunks from the optimizer and also
creates optimum cache mechanism for the queries.
5.5 Handling Frequent Updates
Our system also handles frequent updates
made to the database and keeps the cache coherent with
it. As new data keeps getting added to the database,
the cache needs to be modified. We also need to discard or
modify the affected chunks in the chunk system.
To do this we create a submodule which acts
as a proxy between the database and the client. Whenever
the client proposes changes to the database,
the data first enters the proxy. The proxy
then decides which relations, and which attributes in each
relation, need to be modified in the database. The proxy
module is also connected to the Optimizer and Cache
Manager. This ensures that the chunks are mapped to
the respective relation attributes in the proxy. Whenever a
new update enters the proxy module, it creates a map
pointer to the affected chunks in the chunk system.
The proxy module thus finds out which chunks in the
chunk system will need to be modified. Once this mapping
is created, it sends the update to the database to be
materialized. Further analysis of the proxy system is left
to future work.
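Since the proxy is left to future work, the following is only a rough sketch of the mapping idea under our own assumptions: chunks register the relation attributes they depend on, an incoming update marks the mapped chunks as stale, and the update is then forwarded. The class UpdateProxy and its methods are hypothetical, and forwarding to the database is represented by a plain list.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Attribute = Tuple[str, str]  # (relation name, attribute name)


class UpdateProxy:
    def __init__(self) -> None:
        self.chunks_by_attribute: Dict[Attribute, Set[int]] = defaultdict(set)
        self.stale: Set[int] = set()                       # chunks to discard or refresh
        self.forwarded: List[Tuple[str, Tuple[str, ...]]] = []  # stands in for the database

    def register_chunk(self, chunk_id: int, attributes: Iterable[Attribute]) -> None:
        """Record which relation attributes a cached chunk depends on."""
        for attribute in attributes:
            self.chunks_by_attribute[attribute].add(chunk_id)

    def apply_update(self, relation: str, changed: Iterable[str]) -> Set[int]:
        """Find the affected chunks, mark them stale, then forward the update."""
        changed = tuple(changed)
        affected: Set[int] = set()
        for attribute in changed:
            affected |= self.chunks_by_attribute.get((relation, attribute), set())
        self.stale |= affected
        self.forwarded.append((relation, changed))
        return affected


proxy = UpdateProxy()
proxy.register_chunk(1, [("Orders", "quantity")])
proxy.register_chunk(2, [("Part", "name")])
affected_chunks = proxy.apply_update("Orders", ["quantity"])
```

Only chunk 1 is marked stale in this example, so the chunk built on the Part relation stays usable after the update.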
6.1 Evaluating the use of DAG
The DAG based approach is also followed by the
Exchequer system. In this method the query is turned
into a set of nodes to be evaluated as a
DAG. This DAG is then expanded to analyze the nodes, and
the operations on the nodes are performed one by one.
We then obtain the nodes of the DAG to be cached and the
nodes which need to be deleted from the DAG.
The query structure taken for evaluation is of the following form:
WHERE join-list AND select-list
GROUP BY groupby-list;
The schema has a central Orders fact table and the dimension
tables Part, Supplier, Customer and Time. The size of these tables is
the same as used in [1]. The join-list enforces
equality between attributes of the Orders fact table and
primary keys of the dimension tables. The select-list is
generated by selecting 0 to 2 attributes from the join-list.
The groupby-list is
generated by picking at random a subset of all the keys.
These queries are decided so that a fair comparison can be
The metric used to measure the goodness of the algorithm
is the total response time of the set of queries. The report is
generated for a sequence of 100 queries after 50 queries
have already warmed up the cache. This total
response time is given by the estimated cost, which is
calculated using the cost functions described in sections
5.3 and 5.4.
To evaluate, the size of the representative set is initially set to 10.
We then test different cache sizes, of
around 5%, 10% and 20% of the database.
We compare our algorithm with the Exchequer system and
with a system which has no cache; it has the LRU method
for cache management. The LRU policy is found widely in
DBMS systems; it picks the least recently used chunk
and DAG node to be replaced. LRU is
unaware of the workload.
Analyzing the Estimated Response Time:
In this part, initially the cache size is kept at a minimum.
Then the algorithms are run on the queries. The
Exchequer’s Algorithm and our Modify_Cache perform
better than the NoCache system. Now the size of the cache
is increased to accommodate 5% of the database. The
algorithms show an improvement in their performance.
This can be because the increase in cache size makes it
easier for the algorithms to find out the chunks which will
be able to answer the queries on their own. In low sized
caches, this poses as a problem as it becomes costlier and
longer to find out chunks to look for answers and then
using the replacement methods for the DAG and the
For low cache sizes the Modify_Cache algorithm performs
better than the Exchequer algorithm, with a higher rate of
improvement. However, as the cache size is increased to
30% the estimated cost increases for the Exchequer system.
After a given cache size, the system investment in caching
the extra results obtained in this increased size does not
help in the Benefit factor. With the cache being 30% of the
database size, the Modify_Cache algorithm still returns a
better estimated time than the Exchequer’s algorithm.
6.2 Evaluating for Chunks
We considered various performance measures to evaluate
the effectiveness of the schemes we have employed.
1. Using our system we executed 100 queries to
calculate the average execution time.
2. Cost Saving Ratio (CSR): This performance measure is the
percentage of the total cost of the queries saved due to hits
in the cache. The cost of executing a query at the backend
is used to compute the saving in cost due to a cache hit. Consider
a query stream consisting of a mix of n queries q1, q2, …, qn, where
ci: the cost of executing the query qi at the backend
hi: number of references to qi that are satisfied from the cache
ri: number of references to the query qi
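The ratio itself can be computed from these three quantities. The sketch below assumes the commonly used definition CSR = Σ ci·hi / Σ ci·ri; the function name and the sample numbers are hypothetical and are not measurements from our evaluation.

```python
from typing import Iterable, Tuple

QueryStats = Tuple[float, int, int]  # (c_i backend cost, h_i cache hits, r_i references)


def cost_saving_ratio(stats: Iterable[QueryStats]) -> float:
    """Assumed definition: cost saved by cache hits over the total cost of all references."""
    stats = list(stats)
    saved = sum(cost * hits for cost, hits, _ in stats)
    total = sum(cost * references for cost, _, references in stats)
    if total == 0:
        raise ValueError("no references recorded")
    return saved / total


# Invented example: a cheap query hit 3 times out of 4, a costly one hit once out of 2.
example_csr = cost_saving_ratio([(10.0, 3, 4), (50.0, 1, 2)])
```

Because each hit is weighted by the backend cost of its query, a hit on an expensive query raises the ratio more than a hit on a cheap one.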
Comparing the CSR of a query based cache and a chunk
based cache, we found that the query based
system gives a value of 0.42 because of redundant
storage in the cache, whereas the chunk based system gives a CSR
of 0.98, showing that its cache storage was not redundant.
This concept of using queries for caching can be further
optimized. Our Modify_Cache algorithm currently
uses only the Benefit and Weight functions to evaluate the utility
of cache chunks. It can be improved by
accommodating other aspects of the chunks, and by
taking into account the different operations that can be
performed on OLAP databases. Further work can be
done on improving the run time of our algorithm. Work also
needs to be done on making better use of the DAGs and improving the
run time of the DAG expansions. The time and
additional space required for the DAGs to store query
cache information play a crucial role and need to be
taken into account in further work. More work also needs
to be done to find out how frequent updates of the
databases can be handled. Our algorithm, at this point,
cannot efficiently handle high frequency cache
updates. Advanced methods need to be implemented
to identify high frequency cache updates and hence
maintain the efficiency and consistency of the cache.
Thus it can be seen that the use of chunks and DAGs in
implementing a cache management system proves useful.
The performance of the query engine improves with the
use of the Modify_Cache algorithm. The estimated time to
run the queries also decreases considerably with the use of
query caching and the use of chunks to store cached results. The
use of chunks in caching helps in the utilization of large datastores
and reduces the running time of OLAP queries on the
same datastores.
