The in-memory virtual column approach can be applied to other application areas, such as data mart/data warehousing star/snowflake schema queries where frequent joins with dimension tables are common, so that the join between the fact table and the dimension tables can be eliminated. In fact, the approach can be applied to any application as long as there is a one-to-one mapping between an ID and its value. For example, the following snowflake schema query [11] from Wikipedia can be reduced to one without joins by defining the in-memory virtual columns Year, Country, Brand, and Product_Category in the Fact_Sales table:
SELECT B.Brand, G.Country, SUM(F.Units_Sold)
FROM Fact_Sales F
INNER JOIN Dim_Date D ON F.Date_Id = D.Id
INNER JOIN Dim_Store S ON F.Store_Id = S.Id
INNER JOIN Dim_Geography G ON S.Geography_Id = G.Id
INNER JOIN Dim_Product P ON F.Product_Id = P.Id
INNER JOIN Dim_Brand B ON P.Brand_Id = B.Id
INNER JOIN Dim_Product_Category C ON P.Product_Category_Id = C.Id
WHERE D.Year = 1997 AND C.Product_Category = 'tv'
GROUP BY B.Brand, G.Country;
This query is translated as follows using the virtual columns:
SELECT F.Brand, F.Country, SUM(F.Units_Sold)
FROM Fact_Sales F
WHERE F.Year = 1997 AND F.Product_Category = 'tv'
GROUP BY F.Brand, F.Country;
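
As a minimal sketch of how one such virtual column could be declared (illustrative only: the function GetYear and its body are assumptions, modeled on the GetVal() pattern shown in Section IV):

CREATE OR REPLACE FUNCTION GetYear (i_date_id NUMBER)
RETURN NUMBER DETERMINISTIC IS
  r_year NUMBER;
BEGIN
  -- One-to-one lookup: each Date_Id maps to exactly one Year.
  SELECT D.Year INTO r_year FROM Dim_Date D WHERE D.Id = i_date_id;
  RETURN r_year;
END;
/

ALTER TABLE Fact_Sales ADD
  Year GENERATED ALWAYS AS (GetYear(Date_Id)) VIRTUAL INMEMORY;

The other three virtual columns (Country, Brand, Product_Category) would be defined analogously, each backed by a deterministic lookup function.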
Our approach is not specific to Oracle; it could be applied to any application where similar configurations are used. In Section 2, we discuss related work, and in Section 3, we describe our RDF in-memory processing. The in-memory virtual column processing is described in Section 4. Section 5 describes SPARQL to SQL translation. Section 6 discusses memory footprint, and Section 7 describes our experimental study. We conclude in Section 8.
II. RELATED WORK
RDF in-memory processing utilizes two different in-memory structures: one uses pointers (memory addresses) embedded in the data structure so that traversal is done without any joins, as in [14], and the other uses IDs in a relational table structure so that traversal is done via joins, as in [8]. Some systems [13] avoid memory addresses so that the memory structure can be reloaded from disk. Both approaches have pros and cons. While the first method, which mimics the graph structure, works well for processing path queries, it is not very efficient for set-oriented operations such as aggregates. The second approach is cumbersome in handling path queries, as it requires joins, and the intermediate join results can sometimes be large, slowing down query performance. HDT files [19] use adjacency lists to alleviate some of these problems, but they are read-only.
Much research [3,5,6,7,16] has been published on efficiently processing self-joins utilizing indexes, column stores, and other auxiliary structures such as materialized views. Typical RDF data has a small number of distinct predicates compared to subjects or objects, and many RDF queries have constants on predicates. Hence, the data is sometimes partitioned on the predicate so that only relevant data is accessed [5].
Whatever underlying data structure is adopted, it usually maintains a separate dictionary for the strings that represent URIs and literals. Therefore, a join is required to get the values to present to users or to process aggregates, filters, or order-by queries. Our paper focuses on removing this join to accelerate query processing. Systems using sequence numbers or plain numbers as IDs have a smaller memory footprint and faster load times, but integrating new data from other sources is difficult because the dictionary table must be consulted to generate or look up an ID for a resource. Oracle uses hash IDs, so a unique ID can be obtained by applying a function to the resource value. This approach makes data integration more efficient because unique IDs for resources can be generated quickly without consulting the dictionary table. However, the 8-byte ID entails a bigger footprint and more processing during load. It also burdens join processing, as the bigger IDs produce bigger intermediate results. Eliminating the joins that fetch resource values helps overall query processing.
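
For illustration (a hypothetical sketch; the actual hash function Oracle applies to RDF resources is internal and not specified here), a hash-based ID can be computed directly from the resource value, so a loader never needs a dictionary round trip:

-- Any process can derive the same ID independently of the dictionary table.
SELECT STANDARD_HASH('<ns:s1>', 'SHA256') AS resource_id FROM DUAL;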
III. RDF IN-MEMORY PROCESSING
In-memory processing is increasingly used as memory costs drop and performance improvements across different workloads are desired without much tuning. RDF in-memory processing utilizes the Oracle Database In-Memory Column Store (IMC) [10, 18]. Frequently accessed columns from the triples table and the value table are loaded into memory. RDF queries often perform hash joins, and the hash joins require a full scan of the triples and value tables; the in-memory column store accelerates these table scans. In addition, the in-memory column store employs compression and uses a 4-byte dictionary code instead of values. It also does smart scans using an in-memory storage index, where the min and max values of each column in the in-memory segment unit, called an IMCU (in-memory compression unit), are stored. In addition, it uses Bloom filter [12] joins and SIMD (Single Instruction Multiple Data) vector processing for queries with filters; SIMD filters a number of rows in a single instruction.
If insufficient memory is available to load all the requested data into memory, Oracle IMC will partially load the data. While it would be ideal if all the data fit in memory, partial in-memory population also delivers some performance improvement [4].
In Oracle Database 18c, enabling and disabling RDF in-memory population are controlled by the following PL/SQL APIs:

EXEC SEM_APIS.ENABLE_INMEMORY(TRUE);
EXEC SEM_APIS.DISABLE_INMEMORY;
The argument TRUE means that the call waits until the data is populated in memory. When on-disk data is changed by an insert, delete, or update, background processes automatically modify the in-memory data by creating a new IMCU.
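
Population progress can be observed with the standard Oracle In-Memory views; for example (a usage sketch, with an illustrative choice of columns):

SELECT segment_name, populate_status, bytes_not_populated
FROM V$IM_SEGMENTS;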
IV. ELIMINATION OF VALUE JOIN USING RDF IN-MEMORY VIRTUAL COLUMN
RDF query execution spends significant time joining with the value table to get column values. Materializing values can avoid these joins. However, materializing values violates the normalization principle, and string value materialization, in particular, becomes prohibitively expensive due to space requirements. Therefore, instead of materializing on disk, we do it in memory. By populating the column values in memory as virtual columns [17], we can retrieve values without joining with the value table. We add virtual columns to the triples table, and the values for these virtual columns are materialized in memory. We need values for the subject ID (SID), predicate ID (PID), object ID (OID), and graph ID (GID). For example, the value for the subject, SVAL, is obtained by the function GetVal(SID). These values are organized in columnar format and compressed. Table 3 shows the structure at a conceptual level for the triples table (Table 1) and the value table (Table 2). A 4-byte dictionary code is actually stored in memory, and a separate symbol table is maintained in memory to map the dictionary code to its value. The virtual columns are stored in the in-memory segment called the IMEU (in-memory expression unit).

There are many duplicates in SVAL, PVAL, OVAL, and GVAL, and these duplicates are compressed away. All queries work on the triples table only. Note that this kind of materialization is possible only if there is a one-to-one mapping between the ID and its value.
Here is one of the virtual column functions; it extracts the value from the value table given an ID:
FUNCTION GetVal (i_id NUMBER)
RETURN VARCHAR2 DETERMINISTIC IS
  r_val VARCHAR2(4000);
BEGIN
  -- Fetch the lexical value for an ID from the value table,
  -- forcing the primary-key index on VALUE$ via the hint.
  EXECUTE IMMEDIATE
    'SELECT /*+ index(m C_PK_VID) */ VAL
     FROM VALUE$ m
     WHERE ID = :1' INTO r_val USING i_id;
  RETURN r_val;
END;
Here is how the virtual column SVAL is defined on the triples table (denoted triples_tab here) using the function GetVal():

EXECUTE IMMEDIATE
  'ALTER TABLE triples_tab ADD
   SVAL GENERATED ALWAYS AS (GetVal(SID))
   VIRTUAL INMEMORY';
Once the virtual columns are defined, the virtual column name and its virtual column function can be used interchangeably in a query to retrieve the value from memory. In other words, if a query contains GetVal(SID), the subject value is fetched directly from memory instead of executing the virtual column function; either SVAL or GetVal(SID) can be used to get the value.
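
For instance (an illustrative query, not from the paper; triples_tab again denotes the triples table), the following two forms are equivalent, and both read the subject value from the IMEU instead of executing GetVal() row by row:

SELECT SVAL FROM triples_tab WHERE PID = 402;
SELECT GetVal(SID) FROM triples_tab WHERE PID = 402;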
In general, any application that utilizes in-memory virtual columns can identify the columns that are essential for fast query performance and materialize only those columns in memory. The columns to materialize can be determined from the query workload.
Table 1: Quads/Triples Table

  GID  SID  PID  OID
  101  201  402  611
  101  302  403  723
  101  302  402  612

Table 2: Value Table (VALUE$)

  ID   VAL
  101  <ns:g1>
  201  <ns:s1>
  302  <ns:s2>
  402  <ns:p1>
  403  <ns:p2>
  611  "100"^^xsd:decimal
  612  "200"^^xsd:decimal
  723  "2000-01-02T01:00:01"^^xsd:dateTime

Table 3: Quads/Triples Table in Memory

  GID  SID  PID  OID  GVAL     SVAL     PVAL     OVAL
  101  201  402  611  <ns:g1>  <ns:s1>  <ns:p1>  "100"^^xsd:decimal
  101  302  403  723  <ns:g1>  <ns:s2>  <ns:p2>  "2000-01-02T01:00:01"^^xsd:dateTime
  101  302  402  612  <ns:g1>  <ns:s2>  <ns:p1>  "200"^^xsd:decimal
V. SPARQL TO SQL TRANSLATION
As the underlying triples table and value table are stored in the relational database, all SPARQL queries are translated into equivalent SQL queries against the triples table and the value table. Typically, an RDF query is processed first via self-joins using IDs, followed by joins with the value table. The in-memory virtual column employs late materialization, hence the 4-byte dictionary code is used for interim processing until the full value is needed. All value table joins are replaced with fetches of virtual columns from the triples table. The SPARQL-to-SQL query translation routines maintain a few HashMaps to map the SPARQL query variables to virtual columns and to triple patterns in the SPARQL query. Because the same variable can appear in more than one triple pattern, we keep in the HashMap the variable along with its position in the triple pattern so that the correct value is fetched. For example, in the following triple pattern:
{ ?s <p1> ?o. ?t <p2> ?s }
The value of the variable ?s in the first triple is fetched from SVAL, while in the second triple it is fetched from OVAL.
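
A sketch of the resulting SQL (illustrative only; triples_tab and the bind names stand in for generated names, and the actual translator output may differ):

SELECT T2.SVAL AS t, T1.SVAL AS s, T1.OVAL AS o
FROM triples_tab T1, triples_tab T2
WHERE T1.PID = :p1_id    -- ID of <p1>
  AND T2.PID = :p2_id    -- ID of <p2>
  AND T2.OID = T1.SID;   -- ?s is the subject of the first pattern and the object of the second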
VI. DEALING WITH MEMORY REQUIREMENT
The size of the typical user applications' RDF data we have observed is a few hundred million triples, which should fit in memory easily. The 242-million-triple table (242,297,052 triples) for the LUBM data used in our experiments requires 8.99 GB (8,991,866,880 bytes) of memory, including the in-memory virtual columns; its size on disk is 5.55 GB (5,557,166,080 bytes). The actual memory requirement depends on data characteristics such as the extent of value repetition in the triples. Because the in-memory columnar representation compresses better than the on-disk row format, the data size in memory can be smaller than the on-disk size in some cases. With the memory sizes available these days, fitting billions of triples in memory is not a problem, especially on server machines.
In-memory data is fetched from memory, while out-of-memory data is fetched from disk; for out-of-memory data, the virtual columns are automatically assembled using the data on disk. If a large amount of data resides on disk, query performance may deteriorate. However, in Oracle Database, the RDF data is partitioned into separate datasets based on user-defined groupings [3], and in-memory population is controlled at the partition/subpartition level so that only the relevant datasets are populated in memory. If a query suffers significant performance degradation due to on-disk virtual column fetches, it can fall back to the in-memory real columns using the option DISABLE_IM_VIRTUAL_COL, so that the query is processed without the virtual columns.
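
As an illustrative sketch only (it assumes the option is passed through the SEM_MATCH options argument; the exact invocation is not shown in this paper):

SELECT s, o
FROM TABLE(SEM_MATCH(
  'SELECT ?s ?o WHERE { ?s <p1> ?o }',
  SEM_Models('lubm'),
  null, null, null, null,
  ' DISABLE_IM_VIRTUAL_COL=T '));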
VII. EXPERIMENTS
A. Hardware Setup
The RDF in-memory virtual column evaluation is conducted on a virtual machine with 256 GB of memory, 2 TB of disk space, and 32 CPUs, running the Oracle Linux Server 6 operating system. The database used is Oracle Database 18c Enterprise Edition, 64-bit Production. The timing values are an average of three runs per query, and the timing resolution is 10 ms, the Linux default.
B. RDF In-Memory Virtual Columns Performance
This experiment checks the performance of the RDF in-memory virtual columns (IMVC) against the RDF non-in-memory (non-IM) configuration. The LUBM1K benchmark is used; the LUBM1K data set contains 242,297,052 rows in total, including entailment, and the LUBM benchmark queries are used for evaluation. Because the server is shared, the maximum SGA we can have is 140 GB, and INMEMORY_SIZE is set to 60 GB. The numbers in parentheses in Figures 1-4 represent the number of rows in the result set. Figures 1 and 2 show the execution time of the sequential runs on a logarithmic scale for both configurations. In the warm run, some non-IM queries with a small result set run faster, as the IMVC still does a full scan in memory; however, some non-IM queries require tuning.
The timing values for Q3 and Q10 are 0.00 for both configurations, and Q1 under IMVC shows 0.00, because the timing values were measured only to hundredths of a second. Figures 3 and 4 show the execution time on a logarithmic scale for the parallel runs with degree 32; the timing values for queries Q1, Q3, and Q10 under IMVC show 0.00 for the warm run. The performance improvement of the in-memory virtual columns over non-IM reaches a 43x gain (cold) and a 50x gain (warm) for Q8 in the sequential run, and a 20x gain (Q8) and a 144x gain (Q12) in the parallel run.
The parallel run with degree 32 requires a lot of memory, as more inter-process communication is needed. Because in-memory virtual columns require more memory than the non-IM configuration, for some queries we ran out of memory and some data was written to disk, causing performance degradation.
Figure 1: Sequential execution time (in sec, log scale) for LUBM benchmark queries (cold run)
Figure 2: Sequential execution time (in sec, log scale) for LUBM benchmark queries (warm run)

Figure 3: Parallel execution time (in sec, log scale) for LUBM benchmark queries (cold run)

Figure 4: Parallel execution time (in sec, log scale) for LUBM benchmark queries (warm run)
As more values are fetched and the number of variables increases, a bigger performance gain is achieved. However, IMVC does not control the self-joins of the triples table; therefore, if a non-IM query produces a better execution plan for the self-joins using indexes, it can outperform IMVC, as can be seen for Q2, Q9, and Q13 above. In general, in-memory query processing provides consistently good performance without tuning and does not show erratic behavior across different workloads.
Figure 5: Execution time (in sec, log scale) for fetching all values
We also fetched all values from the triples table to check the impact as more values are fetched. Figure 5 shows the execution time for fetching all values: a 41x improvement for the sequential run and a 436x gain (986.05 vs. 2.26 sec) for the parallel run against non-IM.
VIII. CONCLUSION AND FUTURE WORK
Efficient materialization of RDF data in memory significantly improves query performance. In-memory materialization using virtual columns does not increase persistent storage requirements, and its columnar format is also good for compression. We have shown that this approach yields a significant performance enhancement. Though we have applied the scheme to RDF data, it can be applied to any area where a one-to-one mapping is maintained between an ID and its value. In sum, by materializing one-to-one join operations in memory, we have achieved up to two orders of magnitude of performance improvement.
While this paper provides a viable solution to value table joins in RDF query processing and reduces the possibility of generating poor execution plans by reducing the overall number of joins in a query, it does not propose a solution to speed up or reduce the number of self-joins on the triples table. It would be interesting to develop a new scheme that handles self-joins along the same lines by eliminating actual joins.
REFERENCES
[1] RDF 1.1 Concepts and Abstract Syntax. https://www.w3.org/TR/2014/REC-rdf11-concepts-20140225/, Feb. 2014.
[2] SPARQL 1.1 Query Language. https://www.w3.org/TR/sparql11-query/, Mar. 2013.
[3] E. I. Chong, S. Das, G. Eadon, and J. Srinivasan. An Efficient SQL-based RDF Querying Scheme. In Proc. of VLDB Conference, 1216-1227, 2005.
[4] E. I. Chong. Balancing Act to Improve RDF Query Performance in Oracle Database. Invited Talk, LDBC 8th TUC Meeting, Jun. 2016.
[5] D. J. Abadi, A. Marcus, S. Madden, and K. J. Hollenbach. Scalable Semantic Web Data Management Using Vertical Partitioning. In Proc. of VLDB Conference, 411-422, 2007.
[6] C. Weiss, P. Karras, and A. Bernstein. Hexastore: Sextuple Indexing for Semantic Web Data Management. In Proc. of VLDB Conference, 1008-1019, 2008.
[7] T. Neumann and G. Weikum. RDF3X: a RISC-style Engine for RDF. In Proc. of VLDB Conference, 647-659, 2008.
[8] O. Erling. Virtuoso, a Hybrid RDBMS/Graph Column Store. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 35(1), 3-8, 2012.
[9] Lehigh University Benchmark. http://swat.cse.lehigh.edu/projects/lubm/, Jul. 2005.
[10] Oracle Database In-Memory Guide. http://docs.oracle.com/database/122/INMEM/title.htm, Jan. 2017.
[11] Snowflake Schema. https://en.wikipedia.org/wiki/Snowflake_schema
[12] B. H. Bloom. Space/Time Trade-offs in Hash Coding with Allowable Errors. CACM, 13(7), 422-426, 1970.
[13] M. Janik and K. Kochut. BRAHMS: A WorkBench RDF Store and High Performance Memory System for Semantic Association Discovery. In Proc. of ISWC Conference, 2005.
[14] R. Binna, W. Gassler, E. Zangerle, D. Pacher, and G. Specht. SpiderStore: Exploiting Main Memory for Efficient RDF Graph Representation and Fast Querying. In Workshop on Semantic Data Management, 2010.
[15] B. Motik, Y. Nenov, R. Piro, I. Horrocks, and D. Olteanu. Parallel Materialisation of Datalog Programs in Centralised, Main-Memory RDF Systems. In Proc. of AAAI Conference, 129-137, 2014.
[16] T. Neumann and G. Weikum. Scalable Join Processing on Very Large RDF Graphs. In Proc. of SIGMOD Conference, 627-640, 2009.
[17] A. Mishra, et al. Accelerating Analytics with Dynamic In-Memory Expressions. In Proc. of VLDB Conference, 1437-1448, 2016.
[18] T. Lahiri, et al. Oracle Database In-Memory: A Dual Format In-Memory Database. In Proc. of ICDE Conference, 1253-1258, 2015.
[19] J. D. Fernández, M. A. Martínez-Prieto, and C. Gutierrez. Compact Representation of Large RDF Data Sets for Publishing and Exchange. In Proc. of ISWC Conference, 193-208, 2010.