Academic paper from ICDE 2019: Improving RDF Query Performance using In-Memory Virtual Columns in Oracle Database


Improving RDF Query Performance using In-Memory Virtual Columns in Oracle Database

Eugene Inseok Chong, Matthew Perry, Souripriya Das
New England Development Center, Oracle, Nashua, NH, USA

Abstract— Many RDF knowledge graph stores use IDs to represent triples to save storage and to ease maintenance, and Oracle is no exception. While this design keeps the on-disk footprint small, it incurs overhead in query processing, because joins with the value table are required to return results or to process aggregate, filter, or order-by queries. The overhead becomes especially problematic as the result size or the number of projected variables grows; depending on the query, the value table join can take up most of the query processing time. In this paper, we propose using in-memory virtual columns to avoid value table joins. The approach does not increase the on-disk footprint, yet it eliminates the value table joins: values are materialized only in memory, exploiting the compression and vector processing that come with Oracle Database In-Memory technology. The value table in RDF is typically small compared to the triples table, so its in-memory footprint is manageable, especially with columnar compression. The same mechanism can be applied to any application where there is a one-to-one mapping between an ID and its value, such as data warehouses or data marts. The mechanism has been implemented in Oracle Database 18c. Experimental results on the LUBM1000 benchmark show up to two orders of magnitude query performance improvement.

Keywords—Virtual Columns

I. INTRODUCTION

Resource Description Framework (RDF) [1] and its query language, SPARQL [2], have drawn a lot of attention due to their capabilities for representing and querying knowledge graphs, which have many applications such as linked data, intelligent search, and logical reasoning.
While it is straightforward to formulate a query in SPARQL, query processing is challenging, as it frequently requires a large number of self-joins. Many researchers have investigated how to process RDF queries efficiently [4,5,6,7]. Because URIs are large, self-joins are often executed over small IDs to reduce the size of intermediate join results and improve query performance. The RDF triples table is therefore typically normalized: the triples table contains IDs, and the value table (also known as the dictionary table or symbol table) is kept separately. Many RDF stores [8,13,15,19] adopt this approach. In Oracle, the underlying RDF tables are normalized into a triples table, which stores subject, predicate, object, and named-graph IDs, and a value table, which stores all relevant information for those IDs, such as value type, literal type, and string values [3]. Typically, RDF queries are processed using only IDs to compute the self-join results, and the IDs are then joined with the value table to produce the final results. However, when the query needs to process aggregate, filter, or order-by expressions, or when it returns a large number of results, joining the triples table with the value table can incur a significant performance hit, because the join is performed for every variable that is returned or used in an expression: the more variables are returned, the more joins are performed. For some queries, the value table join can consume more than 90% of the query processing time. We observed this behavior in experiments with customer data by measuring execution time before and after the value table join. These joins could be removed by fully de-normalizing the values, but that incurs large persistent storage requirements as well as the anomalies associated with data redundancy, such as integrity and consistency problems.
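The normalized layout and the per-variable value lookups described above can be sketched in a few lines of Python. The sample data follows the paper's example tables; the `get_val` helper is an illustrative stand-in for a value table join, not Oracle's implementation.

```python
# Illustrative sketch (not Oracle's schema): a normalized RDF store keeps
# triples as ID tuples and resource values in a separate value table, so
# every projected variable costs a lookup/join against the value table.

value_table = {
    101: "<ns:g1>", 201: "<ns:s1>", 302: "<ns:s2>",
    402: "<ns:p1>", 403: "<ns:p2>",
    611: '"100"^^xsd:decimal', 612: '"200"^^xsd:decimal',
}

# Quads as (GID, SID, PID, OID) ID tuples -- compact on disk.
quads = [
    (101, 201, 402, 611),
    (101, 302, 402, 612),
]

def get_val(rid):
    """Stand-in for the paper's GetVal(ID): a value-table lookup per ID."""
    return value_table[rid]

# Query over IDs first (cheap self-joins), materialize values last --
# one lookup per projected variable per result row.
id_results = [(s, o) for (g, s, p, o) in quads if p == 402]
results = [(get_val(s), get_val(o)) for (s, o) in id_results]
```

The final list comprehension is the step the paper's in-memory virtual columns are designed to make cheap: as the result set or the number of projected variables grows, so does the number of these lookups.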
Full de-normalization also inflates intermediate join results, because URI and literal values would have to be carried through all the join operations. To eliminate the join between the triples table and the value table, we instead propose in-memory materialization of values using in-memory virtual columns, which speeds up queries without increasing disk storage requirements. The values corresponding to IDs in the triples table are materialized in memory. All of these values come from the value table, and there are many duplicates, because the same IDs appear in many places in the triples table. The values are materialized in columnar format, and duplicate values are compressed away. Given a triples/quads table (GID, SID, PID, OID), where GID is a graph ID, SID a subject ID, PID a predicate ID, and OID an object ID, we materialize in memory the virtual columns defined by the functions GetVal(GID), GetVal(SID), GetVal(PID), and GetVal(OID), where GetVal(ID) looks up the value table and returns the corresponding RDF resource value. Our early prototype [4] showed promising results, so we implemented the approach in Oracle Database 18c. Our experiments on the LUBM [9] benchmark show up to two orders of magnitude query performance improvement.

2019 IEEE 35th International Conference on Data Engineering (ICDE). 2375-026X/19/$31.00 ©2019 IEEE. DOI 10.1109/ICDE.2019.00197

The RDF in-memory virtual column approach can also be applied to other areas, such as data mart and data warehouse star/snowflake-schema queries, where frequent joins with dimension tables are common: the join between the fact table and a dimension table can be eliminated. In fact, the approach applies to any application as long as there is a one-to-one mapping between an ID and its value. For example, the following snowflake-schema query [11] from Wikipedia can be reduced to one without joins by defining in-memory virtual columns Year, Country, Brand, and Product_Category on the Fact_Sales table:

SELECT B.Brand, G.Country, SUM(F.Units_Sold)
FROM Fact_Sales F
INNER JOIN Dim_Date D ON F.Date_Id = D.Id
INNER JOIN Dim_Store S ON F.Store_Id = S.Id
INNER JOIN Dim_Geography G ON S.Geography_Id = G.Id
INNER JOIN Dim_Product P ON F.Product_Id = P.Id
INNER JOIN Dim_Brand B ON P.Brand_Id = B.Id
INNER JOIN Dim_Product_Category C ON P.Product_Category_Id = C.Id
WHERE D.Year = 1997 AND C.Product_Category = 'tv'
GROUP BY B.Brand, G.Country;

Using the virtual columns, this query is translated as follows:

SELECT F.Brand, F.Country, SUM(F.Units_Sold)
FROM Fact_Sales F
WHERE F.Year = 1997 AND F.Product_Category = 'tv'
GROUP BY F.Brand, F.Country;

Our approach is not specific to Oracle; it could be applied to any system with a similar configuration. In Section II, we discuss related work, and in Section III, we describe our RDF in-memory processing. In-memory virtual column processing is described in Section IV. Section V describes SPARQL-to-SQL translation, Section VI discusses the memory footprint, and Section VII describes our experimental study. We conclude in Section VIII.

II. RELATED WORK

RDF in-memory processing uses two kinds of in-memory structures: one embeds pointers (memory addresses) in the data structure so that traversal requires no joins, as in [14]; the other uses IDs in a relational table structure so that traversal is done via joins, as in [8]. Some systems [13] avoid memory addresses so that the in-memory structure can be reloaded from disk. Both approaches have pros and cons. The first, which mimics the graph structure, works well for path queries but is not very efficient for set-oriented operations such as aggregates. The second is cumbersome for path queries, as it requires joins, and the intermediate join results can become large, slowing down the query. HDT files [19] use adjacency lists to alleviate some of these problems, but they are read-only. Much research [3,5,6,7,16] has been published on efficiently processing self-joins using indexes, column stores, and auxiliary structures such as materialized views. Typical RDF data has a small number of distinct predicates compared to subjects or objects, and many RDF queries have constant predicates; hence, the data is sometimes partitioned on the predicate so that only relevant data is accessed [5]. Whatever the underlying data structure, a separate dictionary is usually maintained for the strings representing URIs and literals, so a join is required to obtain the values to present to users or to process aggregate, filter, or order-by queries. This paper focuses on removing that join to accelerate query processing. Systems using sequence numbers or plain numbers as IDs have a smaller memory footprint and faster load time, but integrating new data from other sources is difficult, because the dictionary table must be consulted to generate or look up an ID for each resource.
Oracle uses hash IDs, so a unique ID can be obtained by applying a function to the resource value. This makes data integration more efficient, because unique IDs for resources can be generated quickly without consulting the dictionary table. However, the 8-byte ID entails a bigger footprint and more processing during load, and it burdens join processing, as bigger IDs produce bigger intermediate results. Eliminating the joins that fetch resource values helps overall query processing.

III. RDF IN-MEMORY PROCESSING

In-memory processing is increasingly popular as memory prices drop and as users expect performance improvements across different workloads without much tuning. RDF in-memory processing uses the Oracle Database In-Memory Column Store (IMC) [10,18]. Frequently accessed columns of the triples table and the value table are loaded into memory. RDF queries often perform hash joins, which require full scans of the triples and value tables; the in-memory column store accelerates these scans. In addition, the column store compresses data and stores 4-byte dictionary codes instead of values. It also performs smart scans using an in-memory storage index that records the minimum and maximum value of each column in each in-memory segment unit, called an IMCU (in-memory compression unit). It further uses Bloom filter [12] joins and SIMD (Single Instruction, Multiple Data) vector processing for queries with filters; a single SIMD instruction filters many rows at once. If there is insufficient memory to load all requested data, Oracle IMC loads the data partially. While it would be ideal for all data to fit in memory, partial in-memory population also delivers some performance improvement [4].
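As a rough illustration of the storage-index smart scan described above, the following Python sketch prunes whole in-memory units by their per-column min/max. The class and attribute names are invented for the example and do not reflect Oracle's internal structures.

```python
# Toy model (names invented) of an in-memory storage index: each unit
# keeps the min/max of a column, so a range filter can skip whole units
# without scanning their rows -- the essence of a "smart scan".

class Unit:
    def __init__(self, rows):
        self.rows = rows                       # list of (id, price) tuples
        prices = [p for _, p in rows]
        self.min_price = min(prices)
        self.max_price = max(prices)

    def might_match(self, lo, hi):
        # Prune the unit when [lo, hi] cannot overlap [min, max].
        return not (hi < self.min_price or lo > self.max_price)

units = [Unit([(1, 10), (2, 20)]), Unit([(3, 500), (4, 900)])]

# Filter price BETWEEN 400 AND 1000: only the second unit is scanned.
scanned = [u for u in units if u.might_match(400, 1000)]
```

Only units whose min/max range intersects the filter are scanned row by row; the rest are skipped entirely.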
In Oracle Database 18c, RDF in-memory population is enabled and disabled through the following PL/SQL APIs:

EXEC SEM_APIS.ENABLE_INMEMORY(TRUE);
EXEC SEM_APIS.DISABLE_INMEMORY;

The argument TRUE means that the call waits until the data is populated in memory. When on-disk data changes due to inserts, deletes, or updates, background processes automatically update the in-memory data by creating a new IMCU.
IV. ELIMINATION OF THE VALUE JOIN USING RDF IN-MEMORY VIRTUAL COLUMNS

RDF query execution spends significant time joining with the value table to obtain column values. Materializing the values can avoid these joins; however, it violates the normalization principle, and materializing string values in particular becomes prohibitively expensive in space. Therefore, instead of materializing on disk, we do it in memory. By populating the column values in memory as virtual columns [17], we can retrieve values without joining with the value table. We add virtual columns to the triples table, and the values for these virtual columns are materialized in memory. We need values for the subject ID (SID), predicate ID (PID), object ID (OID), and graph ID (GID). For example, the subject value, SVAL, is obtained by the function GetVal(SID). These values are organized in columnar format and compressed. Table 3 shows, at a conceptual level, the in-memory structure for the triples table (Table 1) and the value table (Table 2). In memory, a 4-byte dictionary code is actually stored, and a separate symbol table maps each dictionary code to its value. The virtual columns are stored in an in-memory segment called an IMEU (in-memory expression unit). SVAL, PVAL, OVAL, and GVAL contain many duplicates, and these duplicates are compressed away. All queries operate on the triples table only. Note that this kind of materialization is possible only if there is a one-to-one mapping between an ID and its value.

Table 1: Quads/Triples Table

GID  SID  PID  OID
101  201  402  611
101  302  403  723
101  302  402  612

Table 2: Value Table (VALUE$)

ID   VAL
101  <ns:g1>
201  <ns:s1>
302  <ns:s2>
402  <ns:p1>
403  <ns:p2>
611  "100"^^xsd:decimal
612  "200"^^xsd:decimal
723  "2000-01-02T01:00:01"^^xsd:dateTime

Table 3: Quads/Triples Table in Memory

GID  SID  PID  OID  GVAL     SVAL     PVAL     OVAL
101  201  402  611  <ns:g1>  <ns:s1>  <ns:p1>  "100"^^xsd:decimal
101  302  403  723  <ns:g1>  <ns:s2>  <ns:p2>  "2000-01-02T01:00:01"^^xsd:dateTime
101  302  402  612  <ns:g1>  <ns:s2>  <ns:p1>  "200"^^xsd:decimal

Here is one of the virtual column functions. It extracts the value for a given ID from the value table:

FUNCTION GetVal (i_id NUMBER) RETURN VARCHAR2 DETERMINISTIC IS
  r_val VARCHAR2(4000);
BEGIN
  EXECUTE IMMEDIATE
    'SELECT /*+ index(m C_PK_VID) */ VAL FROM VALUE$ m WHERE ID = :1'
    INTO r_val USING i_id;
  RETURN r_val;
END;

Here is how the virtual column SVAL is defined using GetVal():

EXECUTE IMMEDIATE 'ALTER TABLE VALUE$ ADD SVAL
  GENERATED ALWAYS AS (GetVal(SID)) VIRTUAL INMEMORY';

Once the virtual columns are defined, the virtual column name and its virtual column function can be used interchangeably in queries to retrieve the value from memory. In other words, if a query contains GetVal(SID), the subject value is fetched directly from memory instead of executing the virtual column function; either SVAL or GetVal(SID) retrieves the value. In general, any application using in-memory virtual columns can identify the columns essential for fast query performance and materialize only those in memory; the columns to materialize can be determined from the query workload.
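The dictionary encoding just described can be mimicked in a short Python sketch: each distinct virtual-column value is kept once in a symbol table while the column itself holds small integer codes, so the duplicates in SVAL, PVAL, OVAL, and GVAL are compressed away. The function and variable names are illustrative only.

```python
# Toy model of a dictionary-encoded in-memory column: distinct values
# are stored once (symbol table), and the column stores small codes --
# standing in for the 4-byte dictionary codes the paper describes.

def encode_column(values):
    symbols = {}            # value -> code (the symbol table)
    codes = []              # one code per row (the columnar data)
    for v in values:
        code = symbols.setdefault(v, len(symbols))
        codes.append(code)
    return codes, symbols

# SVAL from Table 3: <ns:s2> appears twice but is stored only once.
sval = ["<ns:s1>", "<ns:s2>", "<ns:s2>"]
codes, symbols = encode_column(sval)

decode = {c: v for v, c in symbols.items()}
assert [decode[c] for c in codes] == sval     # encoding is lossless
assert len(symbols) == 2                      # duplicates stored once
```

The more repetitive the column, the smaller the symbol table relative to the row count, which is why the virtual columns' in-memory footprint stays manageable.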
V. SPARQL TO SQL TRANSLATION

Because the underlying triples table and value table are stored in the relational database, all SPARQL queries are translated into equivalent SQL queries against them. Typically, an RDF query is processed first via self-joins over IDs, followed by joins with the value table. The in-memory virtual column approach employs late materialization: the 4-byte dictionary code is used for interim processing until the full value is needed, and all value table joins are replaced by fetches of virtual columns from the triples table. The SPARQL-to-SQL translation routines maintain a few HashMaps that map the SPARQL query variables to virtual columns and to triple patterns in the SPARQL query. Because the same variable can appear in more than one triple pattern, the HashMap stores each variable along with its position in the triple pattern, so that the correct value is fetched. For example, in the following triple patterns:

{ ?s <p1> ?o . ?t <p2> ?s }

the value of variable ?s is fetched from SVAL in the first triple but from OVAL in the second.

VI. DEALING WITH THE MEMORY REQUIREMENT

The RDF datasets of typical user applications that we have observed contain a few hundred million triples, which fits in memory easily. The 242-million-row triples table (242,297,052 triples) for the LUBM data used in our experiments requires 8.99 GB (8,991,866,880 bytes) of memory including the in-memory virtual columns; its size on disk is 5.55 GB (5,557,166,080 bytes). The actual memory requirement depends on data characteristics such as the extent of value repetition in the triples. Because the in-memory columnar representation compresses better than the on-disk row format, the in-memory size can in some cases be smaller than the on-disk size.
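The position-aware variable mapping described above can be sketched as follows; the column names follow Table 3, while the dictionary and function names are invented for illustration (the paper's actual routines maintain HashMaps inside the SPARQL-to-SQL translator).

```python
# Sketch: bind each SPARQL variable to the virtual column holding its
# value, keyed by the variable's position within each triple pattern.
POSITION_TO_COLUMN = {"subject": "SVAL", "predicate": "PVAL", "object": "OVAL"}

def bind_columns(triple_patterns):
    """For {?s <p1> ?o . ?t <p2> ?s}, ?s reads from SVAL in pattern 0
    but from OVAL in pattern 1 -- the position decides the column."""
    bindings = []  # (variable, pattern index, column) triples
    for i, pattern in enumerate(triple_patterns):
        for position, term in zip(("subject", "predicate", "object"), pattern):
            if term.startswith("?"):
                bindings.append((term, i, POSITION_TO_COLUMN[position]))
    return bindings

patterns = [("?s", "<p1>", "?o"), ("?t", "<p2>", "?s")]
bindings = bind_columns(patterns)
```

Keeping the pattern index alongside the variable is what lets the translator fetch ?s from SVAL in one pattern and OVAL in another without ambiguity.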
With the memory sizes available these days, fitting billions of triples in memory is not a problem, especially on server machines. In-memory data is fetched from memory, while out-of-memory data is fetched from disk; for the latter, the virtual columns are assembled automatically from the on-disk data. If a large amount of data resides on disk, query performance may suffer. However, in Oracle Database, RDF data is partitioned into separate datasets based on user-defined groupings [3], and in-memory population is controlled at the partition/subpartition level, so that only the relevant datasets are populated in memory. If a query suffers significant performance degradation due to on-disk virtual column fetches, it can fall back to in-memory real columns only, using the option DISABLE_IM_VIRTUAL_COL, so that the query is processed without the virtual columns.

VII. EXPERIMENTS

A. Hardware Setup

The RDF in-memory virtual column experiments were conducted on a virtual machine with 32 CPUs, 256 GB of memory, and 2 TB of disk space, running Oracle Linux Server 6. The database is Oracle Database 18c Enterprise Edition, 64-bit Production. Reported timings are the average of three runs per query, with a timing resolution of 10 ms (the Linux default).

B. RDF In-Memory Virtual Column Performance

This experiment compares the RDF in-memory virtual column (IMVC) configuration against the RDF non-in-memory (non-IM) configuration on the LUBM1K benchmark. The LUBM1K dataset contains 242,297,052 rows in total, including entailment, and the LUBM benchmark queries are used for evaluation. Because the server is shared, the maximum SGA available is 140 GB, and INMEMORY_SIZE is set to 60 GB. The numbers in parentheses in Figures 1-4 are the number of rows in each result set.
Figures 1 and 2 show the execution times of the sequential runs, in logarithmic scale, for both configurations. In the warm run, some non-IM queries with small result sets run faster, as IMVC still performs a full in-memory scan; however, some non-IM queries require tuning. The timings for Q3 and Q10 are 0.00 in both configurations, and Q1 shows 0.00 for IMVC, because timings were measured only to hundredths of a second. Figures 3 and 4 show the execution times, in logarithmic scale, of the parallel runs with degree 32. Queries Q1, Q3, and Q10 under IMVC show 0.00 in the warm run. The in-memory virtual columns outperform non-IM by 43x (cold) and 50x (warm) on Q8 in the sequential runs, and by 20x (Q8) and 144x (Q12) in the parallel runs. The parallel run with degree 32 requires substantial memory, as more inter-process communication is needed. Because in-memory virtual columns require more memory than the non-IM configuration, some queries ran out of memory and spilled data to disk, degrading performance.

Figure 1: Sequential execution time (in sec, log scale) for LUBM benchmark queries (cold run)
Figure 2: Sequential execution time (in sec, log scale) for LUBM benchmark queries (warm run)

Figure 3: Parallel execution time (in sec, log scale) for LUBM benchmark queries (cold run)

Figure 4: Parallel execution time (in sec, log scale) for LUBM benchmark queries (warm run)

As more values are fetched and the number of variables increases, the performance gain grows. However, IMVC does not control the self-joins of the triples table; if a non-IM query produces a better self-join execution plan using indexes, it can outperform IMVC, as seen for Q2, Q9, and Q13 above. In general, in-memory query processing provides consistently good performance without tuning and does not show erratic behavior across workloads.

Figure 5: Execution time (in sec, log scale) for fetching all values
We also fetched all values from the triples table to measure the impact as more values are fetched. Figure 5 shows the execution times: a 41x improvement for the sequential run and a 436x gain (986.05 vs. 2.26 sec) for the parallel run against non-IM.

VIII. CONCLUSION AND FUTURE WORK

Efficient materialization of RDF data in memory significantly improves query performance. In-memory materialization using virtual columns does not increase persistent storage requirements, and its columnar format compresses well. We have shown that this approach delivers a significant performance improvement. Though we applied the scheme to RDF data, it can be applied to any area that maintains a one-to-one mapping between an ID and its value. In sum, by materializing one-to-one join operations in memory, we achieved up to two orders of magnitude performance improvement. While this paper provides a viable solution to value table joins in RDF query processing, and reduces the chance of poor execution plans by reducing the overall number of joins in a query, it does not propose a way to speed up or reduce the number of self-joins on the triples table. It would be interesting to develop a scheme along the same lines that eliminates those joins as well.

REFERENCES
[1] RDF 1.1 Concepts and Abstract Syntax, Feb. 2014.
[2] SPARQL 1.1 Query Language, Mar. 2013.
[3] E. I. Chong, S. Das, G. Eadon, and J. Srinivasan. An Efficient SQL-based RDF Querying Scheme. In Proc. of VLDB Conference, 1216-1227, 2005.
[4] E. I. Chong. Balancing Act to Improve RDF Query Performance in Oracle Database. Invited talk, LDBC 8th TUC meeting, Jun. 2016.
[5] D. J. Abadi, A. Marcus, S. Madden, and K. J. Hollenbach. Scalable Semantic Web Data Management Using Vertical Partitioning. In Proc. of VLDB Conference, 411-422, 2007.
[6] C. Weiss, P. Karras, and A. Bernstein. Hexastore: Sextuple Indexing for Semantic Web Data Management. In Proc. of VLDB Conference, 1008-1019, 2008.
[7] T. Neumann and G. Weikum. RDF-3X: a RISC-style Engine for RDF. In Proc. of VLDB Conference, 647-659, 2008.
[8] O. Erling. Virtuoso, a Hybrid RDBMS/Graph Column Store. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 35(1), 3-8, 2012.
[9] Lehigh University Benchmark, Jul. 2005.
[10] Oracle Database In-Memory Guide, Jan. 2017.
[11] Snowflake schema, Wikipedia.
[12] B. H. Bloom. Space/Time Trade-Offs in Hash Coding with Allowable Errors. CACM, 13(7), 422-426, 1970.
[13] M. Janik and K. Kochut. BRAHMS: A WorkBench RDF Store and High Performance Memory System for Semantic Association Discovery. In Proc. of ISWC Conference, 2005.
[14] R. Binna, W. Gassler, E. Zangerle, D. Pacher, and G. Specht. SpiderStore: Exploiting Main Memory for Efficient RDF Graph Representation and Fast Querying. Workshop on Semantic Data Management, 2010.
[15] B. Motik, Y. Nenov, R. Piro, I. Horrocks, and D. Olteanu. Parallel Materialisation of Datalog Programs in Centralised, Main-Memory RDF Systems. In Proc. of AAAI Conference, 129-137, 2014.
[16] T. Neumann and G. Weikum. Scalable Join Processing on Very Large RDF Graphs. In Proc. of SIGMOD Conference, 627-640, 2009.
[17] A. Mishra et al. Accelerating Analytics with Dynamic In-Memory Expressions. In Proc. of VLDB Conference, 1437-1448, 2016.
[18] T. Lahiri et al. Oracle Database In-Memory: A Dual Format In-Memory Database. In Proc. of ICDE Conference, 1253-1258, 2015.
[19] J. D. Fernández, M. A. Martínez-Prieto, and C. Gutierrez. Compact Representation of Large RDF Data Sets for Publishing and Exchange. In Proc. of ISWC Conference, 193-208, 2010.