The in-memory virtual column approach can be applied to other application areas, such as data mart/data warehousing star/snowflake schema queries where frequent joins with dimension tables are common, so that the join between the fact table and the dimension tables can be eliminated. In fact, the approach can be applied to any application as long as there is a one-to-one mapping between an ID and its value. For example, the following snowflake schema query [11] from Wikipedia can be reduced to one without joins by defining the in-memory virtual columns Year, Country, Brand, and Product_Category in the Fact_Sales table:
SELECT B.Brand, G.Country, SUM(F.Units_Sold)
FROM Fact_Sales F
INNER JOIN Dim_Date D ON F.Date_Id = D.Id
INNER JOIN Dim_Store S ON F.Store_Id = S.Id
INNER JOIN Dim_Geography G ON S.Geography_Id = G.Id
INNER JOIN Dim_Product P ON F.Product_Id = P.Id
INNER JOIN Dim_Brand B ON P.Brand_Id = B.Id
INNER JOIN Dim_Product_Category C ON P.Product_Category_Id = C.Id
WHERE D.Year = 1997 AND C.Product_Category = 'tv'
GROUP BY B.Brand, G.Country;
This query is translated as follows using the virtual columns:
SELECT F.Brand, F.Country, SUM(F.Units_Sold)
FROM Fact_Sales F
WHERE F.Year = 1997 AND F.Product_Category = 'tv'
GROUP BY F.Brand, F.Country;
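
As a minimal sketch of how one such virtual column could be declared (illustrative only: the function GetYear and its body are assumptions, modeled on the GetVal() pattern shown in Section IV):

CREATE OR REPLACE FUNCTION GetYear (i_date_id NUMBER)
RETURN NUMBER DETERMINISTIC IS
  r_year NUMBER;
BEGIN
  -- One-to-one lookup: each Date_Id maps to exactly one Year.
  SELECT D.Year INTO r_year FROM Dim_Date D WHERE D.Id = i_date_id;
  RETURN r_year;
END;
/

ALTER TABLE Fact_Sales ADD
  Year GENERATED ALWAYS AS (GetYear(Date_Id)) VIRTUAL INMEMORY;

The other three virtual columns (Country, Brand, Product_Category) would be defined analogously, each backed by a deterministic lookup function.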
Our approach is not specific to Oracle; it could be applied to any application where similar configurations are used. In Section 2, we discuss related work, and in Section 3, we describe our RDF in-memory processing. The in-memory virtual column processing is described in Section 4. Section 5 describes SPARQL to SQL translation. Section 6 discusses memory footprint, and Section 7 describes our experimental study. We conclude in Section 8.
II. RELATED WORK
RDF in-memory processing utilizes two different in-memory structures: one uses pointers (memory addresses) embedded in the data structure so that traversal is done without any joins, as in [14], and the other uses IDs in a relational table structure so that traversal is done via joins, as in [8]. Some systems [13] avoid memory addresses so that the memory structure can be reloaded from disk. Both approaches have pros and cons. While the first method, which mimics the graph structure, works well for processing path queries, it is not very efficient for set-oriented operations such as aggregates. The second approach is cumbersome in handling path queries, as it requires joins, and the intermediate join results can sometimes be large, slowing down query performance. HDT files [19] use adjacency lists to alleviate some of these problems, but they are read-only.
Much research [3,5,6,7,16] has been published on efficiently processing self-joins utilizing indexes, column stores, and other auxiliary structures such as materialized views. Typical RDF data has a small number of distinct predicates compared to subjects or objects, and many RDF queries have constants on predicates. Hence, the data is sometimes partitioned on the predicate so that only relevant data is accessed [5].
Whatever underlying data structure is adopted, it usually maintains a separate dictionary for the strings that represent URIs and literals. Therefore, a join is required to get the values to present to users or to process aggregates, filters, or order-by queries. Our paper focuses on removing this join to accelerate query processing. Systems using sequence numbers or plain numbers as IDs have a smaller memory footprint and faster load times, but integrating new data from other sources is difficult because the dictionary table must be consulted to generate or look up an ID for a resource. Oracle uses hash IDs, so a unique ID can be obtained by applying a function to the resource value. This approach makes data integration more efficient because unique IDs for resources can be generated quickly without consulting the dictionary table. However, the 8-byte ID entails a bigger footprint and more processing during load. It also burdens join processing, as the bigger IDs produce bigger intermediate results. Eliminating the joins that fetch resource values helps overall query processing.
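
For illustration (a hypothetical sketch; the actual hash function Oracle applies to RDF resources is internal and not specified here), a hash-based ID can be computed directly from the resource value, so a loader never needs a dictionary round trip:

-- Any process can derive the same ID independently of the dictionary table.
SELECT STANDARD_HASH('<ns:s1>', 'SHA256') AS resource_id FROM DUAL;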
III. RDF IN-MEMORY PROCESSING
In-memory processing is increasingly used as memory costs drop and performance improvements across different workloads are desired without much tuning. RDF in-memory processing utilizes the Oracle Database In-Memory Column Store (IMC) [10, 18]. Frequently accessed columns from the triples table and the value table are loaded into memory. RDF queries often perform hash joins, and the hash joins require a full scan of the triples and value tables; the in-memory column store accelerates these table scans. In addition, the in-memory column store employs compression and uses a 4-byte dictionary code instead of values. It also does smart scans using an in-memory storage index, where the min and max values of each column in the in-memory segment unit, called an IMCU (in-memory compression unit), are stored. In addition, it uses Bloom filter [12] joins and SIMD (Single Instruction Multiple Data) vector processing for queries with filters; SIMD filters a number of rows in a single instruction.
If insufficient memory is available to load all the requested data into memory, Oracle IMC will partially load the data. While it would be ideal if all the data fit in memory, partial in-memory population also delivers some performance improvement [4].
In Oracle Database 18c, enabling and disabling RDF in-memory population are controlled by the following PL/SQL APIs:

EXEC SEM_APIS.ENABLE_INMEMORY(TRUE);
EXEC SEM_APIS.DISABLE_INMEMORY;
The argument TRUE means that the call waits until the data is populated in memory. When on-disk data is changed by an insert, delete, or update, background processes automatically modify the in-memory data by creating a new IMCU.
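
Population progress can be observed with the standard Oracle In-Memory views; for example (a usage sketch, with an illustrative choice of columns):

SELECT segment_name, populate_status, bytes_not_populated
FROM V$IM_SEGMENTS;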
IV. ELIMINATION OF VALUE JOIN USING RDF IN-MEMORY VIRTUAL COLUMN
RDF query execution spends significant time joining with the value table to get column values. Materializing values can avoid these joins. However, materializing values violates the normalization principle, and string value materialization, in particular, becomes prohibitively expensive due to space requirements. Therefore, instead of materializing on disk, we do it in memory. By populating the column values in memory as virtual columns [17], we can retrieve values without joining with the value table. We add virtual columns to the triples table, and the values for these virtual columns are materialized in memory. We need values for the subject ID (SID), predicate ID (PID), object ID (OID), and graph ID (GID). For example, the value for the subject, SVAL, is obtained by the function GetVal(SID). These values are organized in columnar format and compressed. Table 3 shows the structure at a conceptual level for the triples table (Table 1) and the value table (Table 2). A 4-byte dictionary code is actually stored in memory, and a separate symbol table is maintained in memory to map the dictionary code to its value. The virtual columns are stored in the in-memory segment called the IMEU (in-memory expression unit).

There are many duplicates in SVAL, PVAL, OVAL, and GVAL, and these duplicates are compressed away. All queries work on the triples table only. Note that this kind of materialization is possible only if there is a one-to-one mapping between the ID and its value.
Here is one of the virtual column functions; it extracts the value from the value table given an ID:
FUNCTION GetVal (i_id NUMBER)
RETURN VARCHAR2 DETERMINISTIC IS
  r_val VARCHAR2(4000);
BEGIN
  -- Fetch the lexical value for an ID from the value table,
  -- forcing the primary-key index on VALUE$ via the hint.
  EXECUTE IMMEDIATE
    'SELECT /*+ index(m C_PK_VID) */ VAL
     FROM VALUE$ m
     WHERE ID = :1' INTO r_val USING i_id;
  RETURN r_val;
END;
Here is how the virtual column SVAL is defined on the triples table (denoted triples_tab here) using the function GetVal():

EXECUTE IMMEDIATE
  'ALTER TABLE triples_tab ADD
   SVAL GENERATED ALWAYS AS (GetVal(SID))
   VIRTUAL INMEMORY';
Once the virtual columns are defined, the virtual column name and its virtual column function can be used interchangeably in a query to retrieve the value from memory. In other words, if a query contains GetVal(SID), the subject value is fetched directly from memory instead of executing the virtual column function; either SVAL or GetVal(SID) can be used to get the value.
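
For instance (an illustrative query, not from the paper; triples_tab again denotes the triples table), the following two forms are equivalent, and both read the subject value from the IMEU instead of executing GetVal() row by row:

SELECT SVAL FROM triples_tab WHERE PID = 402;
SELECT GetVal(SID) FROM triples_tab WHERE PID = 402;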
In general, any application that utilizes in-memory virtual columns can identify the columns that are essential for fast query performance and materialize only those columns in memory. The columns to materialize can be determined from the query workload.
Table 1: Quads/Triples Table

  GID  SID  PID  OID
  101  201  402  611
  101  302  403  723
  101  302  402  612

Table 2: Value Table (VALUE$)

  ID   VAL
  101  <ns:g1>
  201  <ns:s1>
  302  <ns:s2>
  402  <ns:p1>
  403  <ns:p2>
  611  "100"^^xsd:decimal
  612  "200"^^xsd:decimal
  723  "2000-01-02T01:00:01"^^xsd:dateTime

Table 3: Quads/Triples Table in Memory

  GID  SID  PID  OID  GVAL     SVAL     PVAL     OVAL
  101  201  402  611  <ns:g1>  <ns:s1>  <ns:p1>  "100"^^xsd:decimal
  101  302  403  723  <ns:g1>  <ns:s2>  <ns:p2>  "2000-01-02T01:00:01"^^xsd:dateTime
  101  302  402  612  <ns:g1>  <ns:s2>  <ns:p1>  "200"^^xsd:decimal
V. SPARQL TO SQL TRANSLATION
As the underlying triples table and value table are stored in the relational database, all SPARQL queries are translated into equivalent SQL queries against the triples table and the value table. Typically, an RDF query is processed first via self-joins using IDs, followed by joins with the value table. The in-memory virtual column employs late materialization, hence the 4-byte dictionary code is used for interim processing until the full value is needed. All value table joins are replaced with fetches of virtual columns from the triples table. The SPARQL-to-SQL query translation routines maintain a few HashMaps to map the SPARQL query variables to virtual columns and to triple patterns in the SPARQL query. Because the same variable can appear in more than one triple pattern, we keep in the HashMap the variable along with its position in the triple pattern so that the correct value is fetched. For example, in the following triple pattern:
{ ?s <p1> ?o. ?t <p2> ?s }
The value of the variable ?s in the first triple is fetched from SVAL, while in the second triple it is fetched from OVAL.
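
A sketch of the resulting SQL (illustrative only; triples_tab and the bind names stand in for generated names, and the actual translator output may differ):

SELECT T2.SVAL AS t, T1.SVAL AS s, T1.OVAL AS o
FROM triples_tab T1, triples_tab T2
WHERE T1.PID = :p1_id    -- ID of <p1>
  AND T2.PID = :p2_id    -- ID of <p2>
  AND T2.OID = T1.SID;   -- ?s is the subject of the first pattern and the object of the second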
VI. DEALING WITH MEMORY REQUIREMENT
The size of the typical user applications' RDF data we have observed is a few hundred million triples, which should fit in memory easily. The 242-million-triple table (242,297,052 triples) for the LUBM data used in our experiments requires 8.99 GB (8,991,866,880 bytes) of memory, including the in-memory virtual columns; its size on disk is 5.55 GB (5,557,166,080 bytes). The actual memory requirement depends on data characteristics such as the extent of value repetition in the triples. Because the in-memory columnar representation compresses better than the on-disk row format, the data size in memory can be smaller than the on-disk size in some cases. With the memory sizes available these days, fitting billions of triples in memory is not a problem, especially on server machines.
In-memory data is fetched from memory, while out-of-memory data is fetched from disk; for out-of-memory data, the virtual columns are automatically assembled using the data on disk. If a large amount of data resides on disk, query performance may deteriorate. However, in Oracle Database, the RDF data is partitioned into separate datasets based on user-defined groupings [3], and in-memory population is controlled at the partition/subpartition level so that only the relevant datasets are populated in memory. If a query suffers significant performance degradation due to on-disk virtual column fetches, it can fall back to the in-memory real columns using the option DISABLE_IM_VIRTUAL_COL, so that the query is processed without the virtual columns.
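
As an illustrative sketch only (it assumes the option is passed through the SEM_MATCH options argument; the exact invocation is not shown in this paper):

SELECT s, o
FROM TABLE(SEM_MATCH(
  'SELECT ?s ?o WHERE { ?s <p1> ?o }',
  SEM_Models('lubm'),
  null, null, null, null,
  ' DISABLE_IM_VIRTUAL_COL=T '));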
VII. EXPERIMENTS
A. Hardware Setup
The RDF in-memory virtual column evaluation is conducted on a virtual machine with 256 GB of memory, 2 TB of disk space, and 32 CPUs, running the Oracle Linux Server 6 operating system. The database used is Oracle Database 18c Enterprise Edition, 64-bit Production. The timing values are an average of three runs per query, and the timing resolution is 10 ms, the Linux default.
B. RDF In-Memory Virtual Columns Performance
This experiment checks the performance of the RDF in-memory virtual columns (IMVC) against the RDF non-in-memory (non-IM) configuration. The LUBM1K benchmark is used; the LUBM1K data set contains 242,297,052 rows in total, including entailment, and the LUBM benchmark queries are used for evaluation. Because the server is shared, the maximum SGA we can have is 140 GB, and INMEMORY_SIZE is set to 60 GB. The numbers in parentheses in Figures 1-4 represent the number of rows in the result set. Figures 1 and 2 show the execution time of the sequential runs on a logarithmic scale for both configurations. In the warm run, some non-IM queries with a small result set run faster, as the IMVC still does a full scan in memory; however, some non-IM queries require tuning.
The timing values for Q3 and Q10 are 0.00 for both configurations, and Q1 under IMVC shows 0.00, because the timing values were measured only to hundredths of a second. Figures 3 and 4 show the execution time on a logarithmic scale for the parallel runs with degree 32; the timing values for queries Q1, Q3, and Q10 under IMVC show 0.00 for the warm run. The performance improvement of the in-memory virtual columns over non-IM reaches a 43x gain (cold) and a 50x gain (warm) for Q8 in the sequential run, and a 20x gain (Q8) and a 144x gain (Q12) in the parallel run.
The parallel run with degree 32 requires a lot of memory, as more inter-process communication is needed. Because in-memory virtual columns require more memory than the non-IM configuration, for some queries we ran out of memory and some data was written to disk, causing performance degradation.
Figure 1: Sequential execution time (in sec, log scale) for LUBM benchmark queries (cold run)
Figure 2: Sequential execution time (in sec, log scale) for LUBM benchmark queries (warm run)

Figure 3: Parallel execution time (in sec, log scale) for LUBM benchmark queries (cold run)

Figure 4: Parallel execution time (in sec, log scale) for LUBM benchmark queries (warm run)
As more values are fetched and the number of variables increases, a bigger performance gain is achieved. However, IMVC does not control the self-joins of the triples table; therefore, if a non-IM query produces a better execution plan for the self-joins using indexes, it can outperform IMVC, as can be seen for Q2, Q9, and Q13 above. In general, in-memory query processing provides consistently good performance without tuning and does not show erratic behavior across different workloads.
Figure 5: Execution time (in sec, log scale) for fetching all values
We also fetched all values from the triples table to check the impact as more values are fetched. Figure 5 shows the execution time for fetching all values: a 41x improvement for the sequential run and a 436x gain (986.05 vs. 2.26 sec) for the parallel run against non-IM.
VIII. CONCLUSION AND FUTURE WORK
Efficient materialization of RDF data in memory significantly improves query performance. In-memory materialization using virtual columns does not increase persistent storage requirements, and its columnar format is also good for compression. We have shown that this approach yields a significant performance enhancement. Though we have applied the scheme to RDF data, it can be applied to any area where a one-to-one mapping is maintained between an ID and its value. In sum, by materializing one-to-one join operations in memory, we have achieved up to two orders of magnitude of performance improvement.
While this paper provides a viable solution to value table joins in RDF query processing and reduces the possibility of generating poor execution plans by reducing the overall number of joins in a query, it does not propose a solution to speed up or reduce the number of self-joins on the triples table. It would be interesting to develop a new scheme that handles self-joins along the same lines by eliminating actual joins.
REFERENCES
[1] RDF 1.1 Concepts and Abstract Syntax. https://www.w3.org/TR/2014/REC-rdf11-concepts-20140225/, Feb. 2014.
[2] SPARQL 1.1 Query Language. https://www.w3.org/TR/sparql11-query/, Mar. 2013.
[3] E. I. Chong, S. Das, G. Eadon, and J. Srinivasan. An Efficient SQL-based RDF Querying Scheme. In Proc. of VLDB Conference, 1216-1227, 2005.
[4] E. I. Chong. Balancing Act to Improve RDF Query Performance in Oracle Database. Invited Talk, LDBC 8th TUC Meeting, Jun. 2016.
[5] D. J. Abadi, A. Marcus, S. Madden, and K. J. Hollenbach. Scalable Semantic Web Data Management Using Vertical Partitioning. In Proc. of VLDB Conference, 411-422, 2007.
[6] C. Weiss, P. Karras, and A. Bernstein. Hexastore: Sextuple Indexing for Semantic Web Data Management. In Proc. of VLDB Conference, 1008-1019, 2008.
[7] T. Neumann and G. Weikum. RDF3X: a RISC-style Engine for RDF. In Proc. of VLDB Conference, 647-659, 2008.
[8] O. Erling. Virtuoso, a Hybrid RDBMS/Graph Column Store. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 35(1), 3-8, 2012.
[9] Lehigh University Benchmark. http://swat.cse.lehigh.edu/projects/lubm/, Jul. 2005.
[10] Oracle Database In-Memory Guide. http://docs.oracle.com/database/122/INMEM/title.htm, Jan. 2017.
[11] Snowflake Schema. https://en.wikipedia.org/wiki/Snowflake_schema
[12] B. H. Bloom. Space/Time Trade-offs in Hash Coding with Allowable Errors. CACM, 13(7), 422-426, 1970.
[13] M. Janik and K. Kochut. BRAHMS: A WorkBench RDF Store and High Performance Memory System for Semantic Association Discovery. In Proc. of ISWC Conference, 2005.
[14] R. Binna, W. Gassler, E. Zangerle, D. Pacher, and G. Specht. SpiderStore: Exploiting Main Memory for Efficient RDF Graph Representation and Fast Querying. In Workshop on Semantic Data Management, 2010.
[15] B. Motik, Y. Nenov, R. Piro, I. Horrocks, and D. Olteanu. Parallel Materialisation of Datalog Programs in Centralised, Main-Memory RDF Systems. In Proc. of AAAI Conference, 129-137, 2014.
[16] T. Neumann and G. Weikum. Scalable Join Processing on Very Large RDF Graphs. In Proc. of SIGMOD Conference, 627-640, 2009.
[17] A. Mishra, et al. Accelerating Analytics with Dynamic In-Memory Expressions. In Proc. of VLDB Conference, 1437-1448, 2016.
[18] T. Lahiri, et al. Oracle Database In-Memory: A Dual Format In-Memory Database. In Proc. of ICDE Conference, 1253-1258, 2015.
[19] J. D. Fernández, M. A. Martínez-Prieto, and C. Gutierrez. Compact Representation of Large RDF Data Sets for Publishing and Exchange. In Proc. of ISWC Conference, 193-208, 2010.