HEURISTICS-BASED QUERY OPTIMISATION SOLUTION IMPLEMENTATION IN RSP
ENGINES: THE CQELS AND C-SPARQL
Submitted in fulfilment of the requirements for the degree of Master of Science
Supervisor:
Co-supervisor
The Insight Centre for Data Analytics, National University of Ireland, Galway
September, 2016
Abstract
This thesis addresses the importance of building the query optimisation process
executed in RDF stream processing (RSP) engines around an efficient heuristics engine. The
Resource Description Framework (RDF) has become a widely adopted standard for processing
data streams and communicating real-time data items collected from medical institutions,
industrial plants, financial entities, and telecommunication service providers. For instance,
DBpedia and Yago help reinforce structured querying in Wikipedia searches by retrieving
metadata and encoding it in RDF. Likewise, biological information, such as experiments and
their results, is stored as RDF data collections to enable effective communication between
chemists and biologists.
The data streaming framework has been shaped by Tim Berners-Lee's invention of the
Semantic Web, which streams linked data from source documents and applications and thus
serves users with precise web pages. However, the query optimisation performed in both of
these query languages is still somewhat deficient with regard to the time expended before
search results are delivered. The execution of flawed queries is another worrying factor in the
query optimisation function of RSP engines. All of these elements, namely lengthy run times,
expensive computational operations such as joins, and the execution of inaccurate queries,
degrade RDF stream processing.
Heuristics help identify early error signs in user queries and resolve them through built-in
configurations and algorithms. The novel heuristics optimisation model can serve as a
benchmark for querying Semantic Web metadata in domains such as military logistics, data
warehousing, engineering analysis, and health care. The main contributions of this research
work include: (i) deploying a reference implementation on the existing CQELS and C-SPARQL
execution frameworks; (ii) extending the two RSP engines (CQELS and C-SPARQL) so that
processing and resource space can be shared among multiple concurrent queries; and
(iii) evaluating the performance of the extended RSP engines against the originally released
CQELS and C-SPARQL engines. The results of the evaluation show a remarkable improvement
in performance and demonstrate the practicality of the approach.
Table of Contents
Table of Contents........................................................................................................................................4
Chapter 1: Introduction...............................................................................................................................9
1.1 Motivation.........................................................................................................................................9
1.2 Problem Statement and Hypotheses...............................................................................................10
1.3 The Outcome of the Thesis..............................................................................................................14
1.3.1 Adaptive execution framework.................................................................................................14
1.3.2 The linked data stream adaptive processing model...................................................................14
1.3.3 Algorithms and data structures for triple-based windowing operator incremental evaluation ....15
1.3.4 The techniques for optimization for multiway joins.................................................................16
1.4 The Outline of This Thesis................................................................................................................16
Chapter 2: The General Background..........................................................................................................17
2.1 Introduction.....................................................................................................................................17
2.2 Comparative and Survey Evaluations...............................................................................................24
2.3 Query Optimization.........................................................................................................................27
2.4 RDF Stream Processing and Semantic Web.....................................................................................29
Chapter 3: Background to RSP Engines......................................................................................................32
3.1 C-SPARQL.........................................................................................................................................32
3.2 CQELS...............................................................................................................................................32
3.2.1 Introduction..............................................................................................................................34
3.2.2 Proposed heuristics approach...................................................................................................37
3.2.3 Results simulation.....................................................................................................................43
3.2.4 The performance comparison graph between new improved model and the previous version
of CQELS and C-SPARQL.....................................................................................................................46
Chapter 4: State of The Art in LSDP or the Linked Stream Data Processing...............................................53
4.1 Query Semantics and Data Models..................................................................................................53
4.2 Data Model......................................................................................................................................53
4.3 Query Semantics..............................................................................................................................55
4.4 Query Languages.............................................................................................................................55
Chapter 5: The Optimization Solutions for the CQELS...............................................................................59
5.1 The Adaptive Optimizer...................................................................................................................65
5.2 The Dynamic Executor.....................................................................................................................67
Chapter 6: Exploration of the RDF Engine – Continuous C-SPARQL............................................................69
Chapter 7: Adaptive Query Optimiser in RDF Engines...............................................................................74
7.1 Adaptive Query Optimiser...............................................................................................................74
7.2 Multiway Joins Adaptive Cost-based Optimisation...........................................................................74
7.3 Shared Window Joins Optimisation.................................................................................................76
7.4 Multiple Join Operator.....................................................................................................................76
7.5 Features of Adaptive Query Optimization.......................................................................................78
7.6 Adaptive Plans Concepts..................................................................................................................79
Chapter 8: Conclusion and Future Work....................................................................................................81
8.1 Conclusion.......................................................................................................................................81
8.2 Future Work.....................................................................................................................................84
References.................................................................................................................................................87
List of Figures
Figure 1: Semantic Web processing...........................................................................................................29
Figure 2: Query flow through a DBMS.......................................................................................................37
Figure 3: Binary tree..................................................................................................................................38
Figure 4: Magic tree...................................................................................................................................39
Figure 5: Cost versus time graph...............................................................................................................45
Figure 6: Performance versus complexity..................................................................................................46
Figure 7: Graphical performance comparison...........................................................................................48
Figure 8: An architecture of the C-SPARQL engine....................................................................................72
List of Tables
Table 1: Algorithm 1..................................................................................................................................40
Table 2: Algorithm 2..................................................................................................................................42
Table 3: Query 1........................................................................................................................................44
Table 4: The Performance Comparison by Features..................................................................................47
Table 5: Performance Comparison by the Mechanism of Execution.........................................................47
Summary
This work explores the implementation of query optimisation solutions in two RSP engines,
namely CQELS and C-SPARQL. The framework presents one of the continuous query languages
compatible with SPARQL, introduced over both Linked Data and Linked Stream Data. In
practice, the framework is very flexible, enabling performance gains of several orders of
magnitude over related systems. An efficient hybrid physical data organisation, built on novel
data structures and supporting algorithms, helps to deal with high-update-throughput RDF
streams and large RDF datasets. Additionally, the framework provides for various adaptive
optimisation algorithms. This thesis also presents extensive experimental evaluations that
demonstrate the performance advantages of the CQELS and C-SPARQL processing engines and
framework. These assessments cover a comprehensive set of parameters that play a significant
role in dictating the performance of continuous queries over both Linked Data and Linked
Stream Data.
Chapter 1: Introduction
As the primary purpose of this research study is to explore the importance of building the
query optimisation process executed in RDF stream processing engines around an efficient
heuristics engine, this introduction starts with the motivation. Afterward, it discusses the
problem statement and hypotheses. Next, the chapter touches on the thesis outcomes, and
lastly the thesis outline.
1.1 Motivation
It is crucial to note that the world is currently witnessing a paradigm shift (Abdulla and
Matzke 2006, p.29). Real-time and time-dependent data continue to become ubiquitous
(MacLennan and Tang 2009, p.61). Until a few years ago, little was known about sensor
devices such as compasses, cameras, mobile phones, GPS receivers, and accelerometers
(Mueller 2009). Weather-observation stations measuring humidity, temperature, and similar
conditions now continuously produce large quantities of information in the form of data
streams (Cheung, Hong, and Fong 2006, p.55). Furthermore, patient-monitoring systems that
track blood pressure, heart rate, and so on, and location-tracking systems such as RFID and
GPS, play a vital role in this process. Building management systems that record environmental
conditions and energy consumption, and cars that monitor both the driver and the engine
(Abdulla and Matzke 2006), register an equally tremendous increase in the production of such
information (Cole and Conley 2009, p.53). In addition, web services, including Facebook,
Twitter, and blogs, deliver streams of real-time, typically unstructured data on various topics.
1.2 Problem Statement and Hypotheses
In practice, the motivation behind this thesis leads to larger research problems that arise
when building an efficient Linked Stream Data query processing engine. One of the major
problems is how to design a new declarative query language. According to research (e.g.
Abdulla and Matzke 2006, p.145; Buchanan and Shortliffe 1984, p.99), this problem arises
because neither SPARQL nor state-of-the-art continuous query languages can be used to query
Linked Stream Data. A query language requires sound semantics and a formal data model for
its continuous query operators (MacLennan and Tang 2009, p.187). The data model must be
able to represent both Linked Data and Linked Stream Data in a unified view. In this case, the
new data model must be an extension of the Resource Description Framework model, to allow
a transparent integration of conventional RDF databases (Zhang and Kollios 2007, p.85).
Continuous query processing also requires a temporal aspect of the data that no prior RDF
model has covered (MacLennan and Tang 2009, p.22). Alongside the data model, there must
be a definition of graph-based query operators with continuous semantics that specify the
meaning of declarative query patterns (Buchanan and Shortliffe 1984, p.163). To reduce
learning effort, the query patterns should resemble SPARQL, which requires aligning the query
operators with the semantics of SPARQL. Additionally, this alignment must be compatible with
window operations as defined in traditional continuous query languages, for example, CQL.
Given the disadvantages of using unmodified triple stores and data stream management
systems (DSMSs) for Linked Stream Data, RDF-based stream data raises new issues for the
physical organisation of both Linked Data and Linked Stream Data (MacLennan and Tang 2009,
p.149). The standard storage model is a triple table of identifiers representing literals and URIs
(Abdulla and Matzke 2006, p.109), combined with dictionary-style mapping tables that
translate those identifiers back into their lexical forms (Cole and Conley 2009, p.203). Linked
Stream Data necessitates a high write throughput, whereas this storage design targets heavily
read-intensive contexts (Zhang and Kollios 2007, p.148). DSMSs remedy the write-intensive
requirement through in-memory storage; however, the accompanying Linked Data may be too
large to host in main memory (Cole and Conley 2009, p.209). Furthermore, RDF-based data
elements such as temporal RDF triples and plain RDF triples are very small, so they constitute
an enormous number of individual data points relative to the quantity of information they
encode.
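The dictionary-encoded triple-table scheme described above can be sketched as follows. This is a minimal illustration of the idea, not the storage layer of either engine, and all names (`Dictionary`, the example URIs) are hypothetical.

```python
class Dictionary:
    """Maps RDF terms (URIs and literals) to small integer identifiers.

    The triple table then stores only fixed-size integers, while this
    dictionary translates identifiers back into their lexical forms.
    """

    def __init__(self):
        self.term_to_id = {}   # lexical form -> integer id
        self.id_to_term = []   # integer id -> lexical form

    def encode(self, term):
        """Return the id for a term, minting a new id on first sight."""
        if term not in self.term_to_id:
            self.term_to_id[term] = len(self.id_to_term)
            self.id_to_term.append(term)
        return self.term_to_id[term]

    def decode(self, ident):
        """Translate an integer id back into its lexical form."""
        return self.id_to_term[ident]


d = Dictionary()
triple = ("<http://ex.org/sensor1>", "<http://ex.org/hasReading>", '"21.5"')
encoded = tuple(d.encode(t) for t in triple)   # small fixed-size integers
decoded = tuple(d.decode(i) for i in encoded)  # round-trips to the original
```

The payoff is that joins and window operators can compare small integers instead of full strings, which is the small-footprint benefit mentioned later for the execution framework.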
In practice, the row-based data structures used in relational DSMSs are not efficient enough,
because the tuple headers can dominate the total storage size (Cole and Conley 2009, p.211).
Row-based structures designed for short, wide tables can also significantly lengthen the
stream processing path. There is therefore a need for a new physical organisation approach for
processing both Linked Data and Linked Stream Data (Buchanan and Shortliffe 1984, p.92).
RDF-based continuous query operators typically operate on one or a few very large tables
(MacLennan and Tang 2009), so indexes for random access to data items play a vital role. Most
modern RDF stores provide a massive indexing strategy to overcome this handicap (Cole and
Conley 2009, p.173): because the indexes cover all access patterns, the tables themselves can
always be bypassed. Notably, however, a comprehensive indexing scheme has a very high
maintenance cost, making it impractical for stream processing. Some stream data indexing
solutions might appear helpful, but their designs make them applicable only to relational
streams (Abdulla and Matzke 2006). Investigating hybrid indexing strategies applicable to both
stream data processing and triple storage therefore forms an interesting problem (Cole and
Conley 2009, p.239). A further issue associated with the physical representation of RDF-based
stream data is how to efficiently evaluate the unbounded nature of streams against the
window operators.
It is worth noting that there have been several attempts in DSMSs to support sliding-window
queries (Cole and Conley 2009, p.243). One such effort re-evaluates each window
independently of all other windows; this process is referred to as re-evaluation computation
(Abdulla and Matzke 2006, p.199) and is used in both Borealis and Aurora. Another method,
incremental evaluation computation, processes only the changes, that is, the expired and
inserted tuples of the windows, in the query pipeline (MacLennan and Tang 2009, p.272); this
approach is used in Nile and STREAM. In these systems, the incremental evaluation methods
employed, namely negative tuples and direct timestamps, have shortcomings (Cole and Conley
2009, p.287). The negative-tuple method doubles the number of tuples flowing through the
query pipeline, while the direct-timestamp method requires extra timestamps. With the new
data structures introduced in this thesis, the associated algorithms for computing windowing
operators must always address these unusual characteristics of the data.
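The difference between re-evaluation and incremental evaluation can be sketched with a sliding time-window count: instead of recomputing the whole window on each arrival, only the inserted tuple and the expired tuples are touched. This is an illustrative model only, with hypothetical names, not code from any of the systems mentioned.

```python
from collections import deque


class TimeWindowCount:
    """Incrementally maintains a count over a sliding time window.

    Rather than re-evaluating the whole window on every arrival
    (re-evaluation computation), only the inserted tuple and the
    expired tuples are processed (incremental evaluation).
    """

    def __init__(self, range_seconds):
        self.range = range_seconds
        self.window = deque()  # (timestamp, triple), in arrival order
        self.count = 0

    def insert(self, timestamp, triple):
        self.window.append((timestamp, triple))
        self.count += 1
        # Expire tuples that fell outside (timestamp - range, timestamp].
        while self.window and self.window[0][0] <= timestamp - self.range:
            self.window.popleft()
            self.count -= 1
        return self.count


w = TimeWindowCount(range_seconds=10)
counts = [w.insert(1, "t1"), w.insert(5, "t2"), w.insert(12, "t3")]
# counts == [1, 2, 2]: at time 12, the tuple from time 1 has expired
```

Each insertion does work proportional to the number of expirations it triggers, not to the window size, which is the property the incremental approaches above aim for.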
A Resource Description Framework triple store has exceptionally long and narrow tables to
which standard optimisations do not apply (Cole and Conley 2009, p.368). It is therefore quite
challenging for traditional DSMSs to provide statistics that are relevant to a query optimiser,
and this challenge carries over to processing Linked Data and Linked Stream Data. Maintaining
statistics over highly dynamic datasets in a stream processing setting is even more challenging
(Cole and Conley 2009, p.394). Most importantly, adaptive query optimisation for this type of
continuous query processing becomes harder to achieve because of the unpredictability of
RDF data and the dynamic distributions of the stream data (MacLennan and Tang 2009, p.400).
Moreover, SPARQL-like queries often share query patterns, which imposes multi-query
optimisation requirements (Cole and Conley 2009, p.386). Although several multi-query
optimisation efforts exist, approaches proposed for relational streams may fail to work on
RDF-based streams (Abdulla and Matzke 2006, p.397), largely because the nature of RDF
streams differs from the relational case (Zhang and Kollios 2007, p.391). In effect, enabling
multi-query optimisation for Linked Data Streams is very challenging.
1.3 The Outcome of the Thesis
In light of the issues stated above, the outcomes of this thesis include:
1.3.1 Adaptive execution framework
This framework enables adaptivity in the RSP engines CQELS and C-SPARQL (Abdulla and
Matzke 2006, p.402). The framework allows full control of the execution process, with the
flexibility to add new algorithms and data structures to the query engine component
(MacLennan and Tang 2009, p.433). It uses encoding mechanisms so that the operators can be
implemented with a small footprint and a lighter workload, operating only on small, fixed-size
integers (Buchanan and Shortliffe 1984, p.266). A caching solution for Linked Data subqueries
helps to improve the performance and scalability of query processing over collections of
Linked Data (Zhang and Kollios 2007). In practice, the framework addresses the scalability
problem of integrating large static datasets by means of the proposed caching mechanism.
1.3.2 The linked data stream adaptive processing model
This thesis proposes an adaptive processing model comprising a formal definition of the query
semantics, the data model, and the execution model (Cole and Conley 2009, p.437). The data
model covers the temporal aspects of both Linked Data sets and Linked Stream Data, which
had yet to be addressed (Zhang and Kollios 2007, p.434). The query semantics are formalised
through both mathematical and operational meanings. The mathematical meaning shows how
a declarative query fragment maps to mathematical expressions (Cole and Conley 2009,
p.441), and abstract syntaxes accompany all the query fragments to define a declarative query
language extending SPARQL (Buchanan and Shortliffe 1984, p.280; Zhang and Kollios 2007,
p.404). The operational meaning, in turn, defines how the operators in those expressions are
executed in physical execution plans (MacLennan and Tang 2009, p.432). The operational
semantics thus provide a performance model for the continuous execution of the equivalent
execution plans of a query expressed in the CQELS and C-SPARQL languages (Cole and Conley
2009, p.470). This operational feature facilitates the adaptivity of execution engines based on
these processing models (Zhang and Kollios 2007, p.355), because the execution engine can
dynamically switch from the current execution plan to an equivalent one in order to adapt to
run-time variations (MacLennan and Tang 2009, p.446). In short, CQELS is both the only
language accompanied by sound operational and mathematical semantics and one of the first
query languages for Linked Stream Data.
1.3.3 Algorithms and data structures for triple-based windowing operator incremental
evaluation
This thesis introduces novel operator-aware data structures, together with efficient
incremental evaluation algorithms, to deal with the unusual properties of both query patterns
and RDF streams (Cole and Conley 2009, p.422). These data structures are designed to handle
the intermediate mappings and small data items contained in the processing state. They
include low-maintenance-cost indexes that support high-throughput probing operations,
which are useful in various operator implementations (Abdulla and Matzke 2006). In this
context, several algorithms are proposed to enable incremental evaluation of basic operators,
including duplicate elimination, join, and aggregation (MacLennan and Tang 2009, p.453). In
short, these algorithms aim to overcome the typical issues involved in the incremental
evaluation of windowing operators.
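One way to realise such a low-maintenance, probe-friendly index is a hash table keyed on the shared join variable, so that insertion, probing, and expiration stay cheap. The sketch below is a hypothetical illustration of that idea, not the thesis's actual data structure; the class name and the example mappings are invented.

```python
from collections import defaultdict


class ProbeIndex:
    """Hash index over intermediate mappings, keyed on one join variable.

    Insertion and probing are constant-time on average, which keeps
    maintenance cheap while supporting high-throughput probing.
    """

    def __init__(self, key_var):
        self.key_var = key_var
        self.buckets = defaultdict(list)

    def insert(self, mapping):
        self.buckets[mapping[self.key_var]].append(mapping)

    def probe(self, mapping):
        """Return every stored mapping compatible on the join variable."""
        return list(self.buckets.get(mapping[self.key_var], []))

    def expire(self, is_expired):
        """Drop mappings whose window membership has lapsed."""
        for key in list(self.buckets):
            kept = [m for m in self.buckets[key] if not is_expired(m)]
            if kept:
                self.buckets[key] = kept
            else:
                del self.buckets[key]


idx = ProbeIndex("?s")
idx.insert({"?s": "ex:s1", "?v": 21.5, "ts": 1})
idx.insert({"?s": "ex:s1", "?v": 22.0, "ts": 5})
matches = idx.probe({"?s": "ex:s1", "?loc": "ex:room4"})  # both stored mappings
idx.expire(lambda m: m["ts"] <= 1)  # the mapping from ts=1 leaves the window
```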
1.3.4 The techniques for optimization for multiway joins
This thesis explores the use of adaptive optimisation techniques to improve the performance
of multiway joins (Abdulla and Matzke 2006, p.456), which are among the most expensive
query operators in the query pipeline (Cole and Conley 2009, p.472). An adaptive cost model is
used to design two adaptive algorithms for the dynamic optimisation of multiway join queries.
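A simple form of such adaptive optimisation is to re-order a multiway join greedily by the currently observed input cardinalities, re-running the ordering whenever the statistics drift. The sketch below illustrates that idea only; the thesis's actual algorithms and cost model are more elaborate, and all names here are hypothetical.

```python
def greedy_join_order(cardinalities):
    """Order the inputs of a multiway join smallest-first.

    `cardinalities` maps each join input (e.g. a window or a static
    dataset) to its currently estimated size. Starting from the smallest
    input tends to keep intermediate results small; re-invoking this
    whenever the observed statistics change makes the ordering adaptive.
    """
    return sorted(cardinalities, key=cardinalities.get)


# Observed sizes at two points in time: the plan flips as the stream drifts.
plan_t0 = greedy_join_order({"W1": 1000, "W2": 10, "static": 200})
plan_t1 = greedy_join_order({"W1": 50, "W2": 900, "static": 200})
# plan_t0 == ["W2", "static", "W1"]; plan_t1 == ["W1", "static", "W2"]
```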
1.4 The Outline of This Thesis
The remainder of this thesis is organised as follows. Chapter 2 explores the general
background on Linked Data processing and stream processing. Chapter 3 presents the
background to the RSP engines (CQELS and C-SPARQL). Chapter 4 touches on the state of the
art in Linked Stream Data Processing (LSDP). Chapter 5 explores the optimisation solutions for
CQELS. Chapter 6 explores the RDF engine Continuous C-SPARQL. Chapter 7 evaluates the RSP
engine framework. Finally, Chapter 8 concludes the thesis and points to future work.
Chapter 2: The General Background
This chapter explores the background techniques and concepts for Linked Data processing and
stream processing. It provides the fundamentals of stream processing as they apply to Linked
Stream Data. In short, the chapter discusses the representation of continuous semantics; the
basic models and techniques; the operators and optimisation methods; and the handling of
issues such as memory overflow and time management (MacLennan and Tang 2009). In
addition, the chapter defines the RDF data model, the semantics of SPARQL queries, and the
relevant notation, and gives an overview of how RDF is stored and queried using SPARQL.
2.1 Introduction
The term 'heuristic' derives from the Greek for 'discover' or 'find' (Calhoun and Riemer 2001).
Heuristics are commonly applied across industries to observe, learn, and spot malware, errors,
and other problems on the basis of experience. For example, a well-modelled heuristic
technique is enlisted in antimalware programs to learn about and spot computer threats such
as Trojan horses, viruses, and worms. The learning and observation aspect of a heuristics
framework operates by scanning computer documents and capturing the signatures that
differentiate them (Chen 2009). After reading the unique signatures of computer files, such as
tiny macros, find commands, or even subroutines, the heuristics engine uses its memory and
experience to identify previously seen threats.
According to CIKM 2006 Workshops (2006), heuristics comprise a suite of rules geared towards
increasing the probability of identifying and ironing out problems in a given structure. In
computer science, a heuristic is an algorithm engineered to present viable solutions to glitches
arising in a given scenario. The heuristics discipline generally examines how information is
studied, captured, and discovered. In artificial intelligence, computer science, and
mathematical optimisation, heuristic engines work to solve problems quickly and efficiently
when conventional methods are struggling, are not fast enough, or fail to compute accurate
solutions (Cheung et al. 2006, p.49). When the heuristic path is chosen upon the failure of
conventional methods, it acts as a shortcut that speeds up the process. As Cohen (1985) says,
heuristics can either work in isolation, generating solutions by themselves, or in combination
with optimisation algorithms, all geared towards increasing the RSP engine's effectiveness
(Gedik 2006). More advanced heuristics thoroughly inspect and trace the instructions in a
program's code before passing them to the processing unit for execution, which helps the
heuristics engine assess and learn the behaviour of that program while it runs in a virtual
setting.
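The signature-matching step of the antimalware analogy above can be sketched as follows; the signature database and labels are invented purely for illustration.

```python
# Hypothetical signature database: byte patterns a heuristics engine has
# previously learned, each mapped to a human-readable label.
KNOWN_SIGNATURES = {
    b"AutoOpen-macro-v1": "macro dropper",
    b"exec-rm-rf": "destructive shell command",
}


def scan(document: bytes):
    """Return the label of every known signature found in the document."""
    return [label for sig, label in KNOWN_SIGNATURES.items() if sig in document]


hits = scan(b"...header...AutoOpen-macro-v1...payload...")
# hits == ["macro dropper"]
```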
The current querying strategies enlisted in CQELS and C-SPAQRL waste a lot of
valuable time while performing incorrect and inept queries that may be keyed in by an end user
who is not quite familiar with the intricate querying descriptions, say Gore (1964). In as much as
the database servers within the CQELS and C-SPARQL systems may recognize these inefficient
queries, the end computer users and internet browsers are not aware of these incorrectly stated
queries, and, hence, may continue ringing them. As this happens, the entire performance and
speed of the language engines is incrementally impaired thus having less and less of total number
of data retrieval executed per unit time. In a bid to look for solution of the system downgrade, the
users opt to refer the issue to the DBA to help them code the efficient queries. Similarly, this
DBA consultation also results in time wastage as well. This is where the incorporation of a
19
heuristics engine comes into play. By assimilating a heuristics function into the querying of the CQELS and C-SPARQL languages, a substantial amount of time and querying effort will be saved (Cheung et al. 2006, p. 57). The heuristics function will serve as a query optimiser that skims through the input user query, inspecting it thoroughly to highlight and remove any detected errors. According to McIlroy (1998), unlike the DBA, who recognises the lapses in the queries yet does nothing about them, the heuristics will automatically generate an equivalent but highly optimised query. By spotting and rectifying the inaccuracies in the queries input by end users, the heuristics function eliminates both the time-consuming execution of inaccurate queries and the time spent consulting the DBA for viable solutions. In this way, system productivity and throughput will stay on an upward curve. The frequency of disk accesses will also drop, as the heuristics reduce the number of tuples and columns scanned, improving both data stream processing and querying accuracy.
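The rewriting role described above can be illustrated with a minimal sketch. The class below is not taken from the CQELS or C-SPARQL code base; it is a hypothetical Java fragment showing two cheap heuristic rewrites on a list of triple-pattern strings: removing duplicate patterns, and ordering patterns with fewer variables (a crude selectivity proxy) first.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

/**
 * Illustrative heuristic pre-optimiser. It does not parse full C-SPARQL;
 * it only demonstrates two inexpensive, equivalence-preserving rewrites
 * over triple patterns written as plain strings.
 */
public class HeuristicRewriter {

    // Remove exact duplicate triple patterns while preserving order.
    public static List<String> deduplicate(List<String> patterns) {
        return new ArrayList<>(new LinkedHashSet<>(patterns));
    }

    // Count variables (tokens starting with '?') as a crude selectivity proxy.
    static long variableCount(String pattern) {
        long n = 0;
        for (String token : pattern.trim().split("\\s+")) {
            if (token.startsWith("?")) n++;
        }
        return n;
    }

    // Heuristic: evaluate patterns with fewer variables first, since they
    // are usually more selective and shrink intermediate results early.
    public static List<String> reorderBySelectivity(List<String> patterns) {
        List<String> out = new ArrayList<>(patterns);
        out.sort((a, b) -> Long.compare(variableCount(a), variableCount(b)));
        return out;
    }
}
```

In a real engine these rules would operate on a parsed query tree rather than raw strings, but the principle of producing an equivalent yet cheaper query is the same.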
An effort to integrate these kinds of information sources would enable a broad range of near-real-time applications in areas such as green information technology, smart cities, and e-health (Cole and Conley 2009, p.19). However, harvesting such data remains a labour-intensive and difficult task owing to the heterogeneous nature of the vast streams; in essence, the process requires a great deal of hand-crafting. The remedy for this scenario is the application of the Resource Description Framework (RDF) data model (Schreiber 1977, p.38). In practice, this data model lets one express knowledge in a generic way, and it does not require adherence to any particular schema (MacLennan and Tang 2009, p.67). Efforts are under way to lift stream data to a semantic level by the Semantic Sensor Network Incubator Group of the W3C (Maringer 2005). The primary goal of this process is to make stream data available under the principles of Linked Data, a concept referred to as Linked Stream Data (Schreiber 1977, p.103). Ordinarily, Linked Data facilitates data integration among heterogeneous collections (Buchanan and Shortliffe 1984); Linked Stream Data pursues the same goals for data streams (Schreiber 1977, p.89) and, furthermore, helps bridge the gap between static data sources and streams.
Besides a unified model of data representation, a processing engine is also required that can support continuous queries over both Linked Data and Linked Stream Data (Cole and Conley 2009, p.107). In classical Linked Data processing, data are assumed to be stored in a centralised repository and to change infrequently before additional processing (MacLennan and Tang 2009, p.102). According to research (e.g. Zhang and Kollios 2007, p.51), updates to a dataset are typically limited to a small fraction of it; such updates happen infrequently, and in some cases the whole database is simply replaced by a new version.
Traditional relational databases follow a ‘one-time’, ‘pull’ model (Schreiber 1977, p.139): a query is executed after the data are read from disk, and the output is a set of results valid for that single point in time (Cole and Conley 2009, p.137). Linked Stream Data, by contrast, produces new items continuously. The data are only valid within a given time window and are consistently pushed to the query processor (Buchanan and Shortliffe 1984, p.99). Queries are registered once and then evaluated continuously against a changing dataset; in short, queries are continuous (MacLennan and Tang 2009, p.139). The arrival of new data triggers an update of the continuous query results (Abdulla and Matzke 2006, p.97). This continuity of queries and the temporal aspect of Linked Stream Data are not both considered by current Linked Data query processing engines (Cole and Conley 2009, p.148). Data stream management systems (DSMSs) appear to be better candidates for processing continuous queries (Zhang and Kollios 2007, p.167). A DSMS could serve as a sub-component that deals with the stream data; the problem is that no traditional DSMS supports the Resource Description Framework, which necessitates a data transformation step (Schreiber 1977, p.108). In most cases, however, the overhead of such data transformation is too costly in the low-latency context of stream data processing (Sims and Yocom 2008, p.109). Furthermore, delegating processing to a sub-system such as a DSMS means losing full control over query execution (Cole and Conley 2009, p.145), and optimisation can then only be done locally in each subsystem (Schreiber 1977, p.143). In this case, each subsystem is optimised only for its own query patterns, data model, and data distribution, since it is used as a black box.
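The window-based validity of stream items described above can be made concrete with a small sketch. This is an illustrative Java fragment, not code from any DSMS: timestamps are plain longs, items are plain strings, and a new arrival expires everything older than the window range.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Illustrative time-based sliding window: an item is "valid" only while
 * it falls inside the most recent interval of the given range.
 */
public class TimeWindow {
    static final class Item {
        final long timestamp;
        final String value;
        Item(long t, String v) { timestamp = t; value = v; }
    }

    private final long range;                       // window width
    private final Deque<Item> window = new ArrayDeque<>();

    public TimeWindow(long range) { this.range = range; }

    // Push a new item at time `now`; expire items older than (now - range).
    public void push(long now, String value) {
        window.addLast(new Item(now, value));
        while (!window.isEmpty() && window.peekFirst().timestamp <= now - range) {
            window.removeFirst();
        }
    }

    // Number of currently valid items, i.e. the current window contents.
    public int size() { return window.size(); }
}
```

A continuous query would be re-evaluated over exactly this set of valid items each time the window changes, which is the "push" behaviour contrasted with the one-time "pull" model above.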
According to research (e.g. Buchanan and Shortliffe 1984, p.152), the difficulty of predicting the structure of Resource Description Framework graphs poses challenges for traditional Data stream management systems, which cannot effectively scale to large quantities of RDF data (Schreiber 1977, p.154). This difficulty in making predictions also applies to RDF-based data streams (Sims and Yocom 2008, p.151), making them tough for DSMS optimisers to handle. These DSMS optimisation problems have been solved only in ad-hoc and restricted scenarios (Cole and Conley 2009, p.162), and a good number of open problems and challenges remain (MacLennan and Tang 2009, p.173). In addition, most of the optimisation algorithms are heuristic, and they are proven to work only for certain kinds of data and queries.
These facts played a significant role in motivating me to develop a heuristics-based optimisation solution for two RSP engines (C-SPARQL and CQELS), implemented in Java and starting from a naïve approach (Sims and Yocom 2008, p.182). My approach aims to build engines with high processing performance for Linked Stream Data by combining algorithms, re-engineered efficient data structures, and techniques from both traditional Data stream management systems and Linked Data processing. According to several studies (such as Abbass and Newton 2002, p.135; Sims and Yocom 2008, p.127), it is not good practice to store Resource Description Framework data elements in relational tables; rather, careful design of the indexing schema and physical storage is vital for the performance of triple stores (Schreiber 1977, p.94). This approach therefore aims to design a native data structure that treats both RDF and RDF stream data elements as first-class citizens (Cole and Conley 2009, p.142). Most importantly, because the data change continuously during the lifetime of a query, the processing must be adaptive.
Such adaptivity requires the introduction of an adaptive execution framework known as Continuous Query Evaluation over Linked Streams, or CQELS (Cole and Conley 2009, p.177). This framework is designed to apply adaptive processing techniques to meet the performance requirements of stream processing (Buchanan and Shortliffe 1984, p.103; Zhang and Kollios 2007, p.171). Moreover, it allows full control of the continuous execution process, where both optimisation and scheduling can take place at runtime (Schreiber 1977, p.67). In the process, I had to create a new continuous query language as one of the first works in Linked Stream Data processing (Cole and Conley 2009, p.191). The evaluation of Linked Stream Data processing engines, together with the first survey of them conducted during this thesis, provides insight into how to build an efficient Linked Stream Data engine.
In this paper, we advance the integration of a heuristics engine into the query optimisation of CQELS as well as C-SPARQL to augment the query execution and data streaming processes. The query optimisation operations will be carried out by Java code serving as the optimiser in both RSP engines, which will speed up operations and the query optimisation function in general. As MacLennan and Tang (2009) claim, the code will allow end users to express their queries unambiguously and reduce the input of imprecise queries. This will also help cut the incremental costs of computation with regard to the projection, selection, and join functionalities, as well as other cost factors such as processor and communication time. Now that the data and ontology constituents of Web 3.0 have stabilised through the assimilation of golden standards such as OWL and RDF, the optimisation and implementation of heuristics-based querying is next on the to-do list.
The assimilation and implementation of the heuristics utility is outlined in this thesis in the following format. Section 1 discusses how heuristics can be employed in query optimisation to minimise the pertinent costs. In the proposed heuristic algorithm, a query is scanned and executed using the magic trees in the storage files, which demonstrates significant progress over previous optimisation approaches. The cost-based algorithm shows that the system’s enhancement continues to improve as the query becomes more interlaced and dense, that is, as the user performs more intricate searches. Section 2 discusses how heuristics can be enlisted in the Java code to significantly reduce erroneous query executions by automatically recognising and amending inefficiencies in CQELS and C-SPARQL queries. The detection and rectification of flaws within the queries will save the large amounts of time and effort expended by the RSP engines in retrieving information, thus enhancing the overall throughput and productivity of the engines. Section 3 demonstrates the capacity of heuristics to execute queries without involving join operations. The exclusion of join operations from query optimisation helps shrink operational costs in addition to making the RDF data volume less bulky. The empirical results confirm that the proposed heuristics model outperforms conventional querying techniques, for example Jena, by 79% with regard to the reduction of pointless intermediate results and query processing time.
2.2 Comparative and Survey Evaluations
Essentially, the first experiments and survey are helpful in giving comparisons of, and insight into, data stream processing techniques and Linked Stream Data processing engines (Abdulla and Matzke 2006, p.487; Zhang and Kollios 2007, p.378). Additionally, the first cross-system evaluation of Linked Stream Data processing engines is presented.
A scenario that integrates human-centric streaming data from the digital and physical worlds, similar to Live Social Semantics, is a direct inspiration (MacLennan and Tang 2009, p.474). Data from the physical world are captured and streamed through tracking systems and sensors such as wireless triangulation, RFID, and GPS, and can be integrated with virtual streams such as city traffic data, Twitter feeds, and airport information to deliver up-to-date views or location-based services for any particular situation (Cole and Conley 2009, p.479). The conference scenario mainly focuses on the problem of integrating data streams from a tracking system with a static dataset (Abdulla and Matzke 2006). The tracking system, similar to various real deployments in Live Social Semantics, gathers the relationship between physical spaces and the real-world identifiers of the conference attendees. Moreover, non-stream datasets, for example online information about the attendees such as their profiles, social networks, and publication records, are used to correct the tracking data (Cole and Conley 2009, p.482). There are several benefits to correlating the two sources of information (MacLennan and Tang 2009, p.453). Most importantly, conference rooms could be assigned to talks automatically according to the number of people likely to be interested in attending, based on their profiles and the topic of the talk (Cole and Conley 2009, p.491). In addition, conference attendees could be notified about co-authors present at the venue (Abdulla and Matzke 2006, p.423; Buchanan and Shortliffe 1984, p.403; Zhang and Kollios 2007, p.348). A service could likewise easily suggest which talks to attend based on citation records, profiles, and the distance between talk locations.
In practice, the social stream data of interest to a user is spread among various social application platforms such as Twitter, Facebook, Foursquare, and so on (MacLennan and Tang 2009, p.496). Social network analysis and aggregation platforms such as Bottlenose require the integration of heterogeneous streams from various feeds and social networks (Abdulla and Matzke 2006, p.437; Buchanan and Shortliffe 1984, p.428). Such platforms could easily use Linked Stream Data processing engines to deal with these data integration issues (Cole and Conley 2009, p.504). In the same context, this scenario focuses on the different social stream aggregation sources that social network users create (MacLennan and Tang 2009, p.511). Social networks provide rich sources of interesting stream data, including photo uploads and sequences of social discussions (Cole and Conley 2009, p.521), and are considered an excellent test bed for Resource Description Framework engines. Furthermore, RDF can exhibit its merits in representing graph data (MacLennan and Tang 2009, p.527). Real-life data, and social network data in particular, typically exhibit skewed distributions and correlations, and the efficient handling of correlations is recognised as a very difficult problem for database engines (Abdulla and Matzke 2006, p.484; Buchanan and Shortliffe 1984, p.503; Zhang and Kollios 2007, p.509). On the other hand, it also opens up many opportunities for query optimisation (MacLennan and Tang 2009, p.539). In the context of the scenario, it becomes possible to build a data simulator that exploits the different skewed data distributions and correlations available in a social network (Abdulla and Matzke 2006, p.437). The data simulator is consequently useful for generating realistic test cases with which to evaluate Linked Stream Data processing engines.
It is important to note that various parts of this thesis have earlier been published as workshop, conference, and journal articles (MacLennan and Tang 2009, p.544). The first attempt at building a heuristics-based query optimisation solution for RSP engines was introduced in several studies (such as Abbass and Newton 2002), alongside surveys of stream processing engines and Data stream management systems (MacLennan and Tang 2009, p.587). The RSP engines themselves, CQELS and C-SPARQL, are described in studies such as Abdulla and Matzke (2006, p.463), Buchanan and Shortliffe (1984, p.401), and Zhang and Kollios (2007, p.409).
2.3 Query Optimisation
Maringer (2005) describes query optimisation as a querying function interspersed throughout a multitude of information systems and database frameworks. All query languages, whether relational (SQL) or RDF-based (C-SPARQL and CQELS), enlist query optimisation functionalities to establish the shrewdest and most adept channel for executing a query keyed in by a user. Such functionalities encompass query optimisers, such as PostgreSQL’s or the Java code (Java Runtime Environment), that analyse and carefully assess SQL, C-SPARQL, or CQELS queries to determine the most effectual mechanism for query execution. Database systems are queried almost every minute of the day, and query optimisation is therefore just as frequent (Cheung et al. 2006, p. 64). Anyone browsing the internet, whether performing simple or complex searches, engages the query optimisation of a Database Management System (DBMS) when requesting a piece of information from the respective databases. For example, whether you are searching for a Social Security number, the financial statements of a company, or a country’s demographics, or even trying to compute the average pay of all the civil workers in the Department of Agriculture in your regional state, you are querying the distinctive databases.
If, for instance, you are interested in investing in Ernst and Young LLP (a multinational audit firm), you will obviously want to find out how it is performing in the market and how its overall productivity compares against other industry benchmarks. To locate such information, you will log in to the company’s database system and request its financial statements, ratios, and key market and performance indicators. A query soliciting the financial ratios of Ernst and Young LLP might look like this: “find the consolidated balance sheet of Ernst and Young.” Before the balance sheet appears on your screen, a number of procedures occur, centred on a query plan. After the query is submitted, the parser within the database parses it and hands it over to the query optimiser, which then generates several query plans according to their resource costs (Moustakas 1990). The most efficient plan, in terms of cost and time consumption, is chosen, after which the database server accesses the pertinent data and produces the desired results.
The prime focus of the query optimisation function of databases is expeditious and prompt query execution, so that the desired results are delivered in a flash (Mueller 2009, p. 34). Time consumption tops the list of criteria for determining the best query plan for a given query. Any marginal time variance among alternative query plans will prompt the query optimiser to select the option that is fastest and consumes the least time. However, the optimisation function is still lacking with regard to time efficiency, as most querying processes involve redundant execution of intermediate results within the join operations. These join operations, together with other accompanying costs such as the projection and selection functionalities and processor time, inflate the communication time of the data results in addition to increasing the computational costs. As the selected query plan runs, it uses various algorithms to manipulate and combine tables of data from the database structure so as to produce the requested knowledge material (Nirmal 1990, p.388). These manipulations and combinations of data tables are called join operations, and in the retrieval of real-time streaming data such as financial statistics they slow down the data streaming process. Additionally, the processing of the intermediate results needed for the join operations helps make the RDF data volume bulky, thus impeding operations and the engine’s overall speed (Cheung et al. 2006, p.69). All these issues call for programmers to construct the query optimisation function of the RSP engines around a heuristics solution and to implement this solution to improve RDF stream processing.

Figure 1: Semantic Web processing
2.4 RDF Stream Processing and the Semantic Web
The recent deployment of the semantic web in divergent industry sectors, such as logistic planning in military fields, engineering analysis, health care, and the life sciences, has proved its worth in data search automation and information technology upscaling. According to Zhang and Kollios (2007), the semantic web contributes to an instinctive and spontaneous web application that retrieves precise information from linked data sources. The application works by collecting, filtering, and sampling data items captured from different sensor plants and stored as ontologies in RDF formats (see Figure 1).
Proposed by Tim Berners-Lee in 2001, the Semantic Web (Web 3.0) has, so far, showcased some data processing differences between its database management and that of earlier web generations such as the World Wide Web (Web 1.0). While Web 1.0 operates by dislodging the physical storage and networking layers, Web 3.0 upgrades this tedious and seemingly slower process by dismissing the document and application layers. Although the search engines on the World Wide Web index a majority of the content stored on the Web, they still lack the instinctive capacity to select the articles and web pages that an end user really desires. Rather than connecting documents and data structures like Web 1.0, Web 3.0 capitalises on its metadata base and ever-evolving compilation of knowledge to connect facts and meaning. This is what enables the Semantic Web to build on the intuitiveness and self-description that help context-understanding programs find the exact pages a user is looking for. As Sims and Yocom (2008, p.411) convey, Web 3.0 has gained its technological leverage over Web 1.0 through its cutting-edge means of data storage, querying, and information display. The data storage means incorporated in this new technique involves matching data sources to ontologies that are stored in a structured form in the Resource Description Framework (RDF). Unlike the natural text formats that Web 1.0 utilises in data storage and retrieval, the Semantic Web models the data items sourced from diverse sensor plants in a comprehensive descriptive language, making query processes and information display easy and friendly for all Internet users.
As Abbass and Newton (2002) illustrate in their journal article, RDF comprises a descriptive structuring of data used for information exchange on the net. As the semantic metadata reads information from sensor plants, it filters and stores this information in a format that is easily readable by both the machine and the computer user. Engineered by the World Wide Web Consortium (W3C), RDF integrates the use of query languages and descriptive statements and conjunctions (e.g. has, is) to provide relevant information about web resources that a user may search for. For example, a fact about the current U.S. president (a web resource) can be stated as “The U.S. has a current president in office.” As seen from this statement, there is an entity-relationship data model in the form of a subject-predicate-object expression. This model is the strategy employed by RDF when representing and searching for information. Thus, RDF is a language that exhibits web data by means of minimally constraining, meaningful, and constructive expressions. To incrementally expand RDF’s efficiency, we have to further advance the aspect of heuristics in the querying of RDF data stream processing engines.
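The subject-predicate-object model just described can be made concrete with a small sketch. The class below is purely illustrative (real RDF stores use careful indexing rather than a linear scan), and the resource names in the usage are invented; a null field in the pattern plays the role of an unbound variable.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Minimal in-memory triple store illustrating the subject-predicate-object
 * data model of RDF and wildcard pattern matching over it.
 */
public class TripleStore {
    public static final class Triple {
        public final String subject, predicate, object;
        public Triple(String s, String p, String o) {
            subject = s; predicate = p; object = o;
        }
    }

    private final List<Triple> triples = new ArrayList<>();

    public void add(String s, String p, String o) {
        triples.add(new Triple(s, p, o));
    }

    // Match a subject-predicate-object pattern; null acts like an
    // unbound variable and matches any value in that position.
    public List<Triple> match(String s, String p, String o) {
        List<Triple> out = new ArrayList<>();
        for (Triple t : triples) {
            if ((s == null || t.subject.equals(s))
                    && (p == null || t.predicate.equals(p))
                    && (o == null || t.object.equals(o))) {
                out.add(t);
            }
        }
        return out;
    }
}
```

For instance, after adding the triple `(:USA, :hasPresident, :PresidentX)`, the pattern `match(":USA", null, null)` returns every statement about the subject `:USA`, mirroring how a triple pattern with variables is answered.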
Chapter 3: Background to RSP Engines
3.1 C-SPARQL
Barbieri et al. (2010, p. 20) define C-SPARQL as an advanced language, an extension of the SPARQL query language, that observes windows and recent triples of RDF data streams while simultaneously allowing the streams to flow. The continuous streaming of queries by Continuous SPARQL (C-SPARQL) facilitates the interoperability of RDF formats and implements crucial applications that allow researchers to access the ever-evolving information of web resources. Wei (2011, p. 101) refers to C-SPARQL as an orthogonal extension of the conventional SPARQL grammar, making SPARQL a congruent component of C-SPARQL. C-SPARQL builds on SPARQL through its capability of combining static RDF with real-time streaming data for purposes of stream reasoning. Much as SPARQL has cemented its viability in querying RDF repositories, Barbieri et al. observe that it still falls short of producing continuous, flowing data streams (Abbass and Newton 2002, p. 21). Stream-based data emitters, encompassing stock quotations, click streams, and feeds, emit real-time continuous information. However, SPARQL is limited in its capacity to store entire streams; therefore, Data Stream Management Systems (DSMSs) register consecutive queries in static form. The invention of C-SPARQL is thus based on its capacity to merge static data with streaming data, a procedure that mobilises logical reasoning in real time over large and noisy data streams.
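A concrete query helps to fix ideas. The following hypothetical C-SPARQL registration combines a windowed stream with a static graph; the stream URI, prefix, and predicates are invented for illustration, and the window clause follows the RANGE/STEP style of published C-SPARQL examples:

```sparql
REGISTER QUERY TechnologyQuotes AS
PREFIX ex: <http://example.org/ns#>
SELECT ?company ?price
FROM STREAM <http://example.org/market> [RANGE 30 SEC STEP 5 SEC]
FROM <http://example.org/static/companies>
WHERE {
  ?quote   ex:company ?company .
  ?quote   ex:price   ?price .
  ?company ex:sector  ex:Technology .
}
```

The first two triple patterns are matched against the last 30 seconds of the stream, re-evaluated every 5 seconds, while the sector pattern is matched against the static dataset; this is precisely the merging of static and streaming data described above.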
3.2 CQELS
According to Abbass and Newton (2002), Continuous Query Evaluation over Linked Streams (CQELS) constitutes an adaptive and instinctive schema for supporting Linked Stream Data, with a grammar derived from SPARQL 1.1, thus making the two compatible. The congruence of the two query languages (CQELS and SPARQL 1.1) raises the performance of CQELS above other continuous query languages. CQELS has been engineered with the sole objective of enlisting a white-box approach, which works by utilising the prerequisite query operators natively so as to obviate all overhead costs and any other restrictions of closed system regimes (Schreiber 1977). CQELS offers flexibility and updatability in its execution structures, as the inherent query processors continuously readjust to changes in the incoming data. Examples of such continuous queries are contained in papers such as CF02, HFAE03, CDTW00, and ABB+02. These queries, however, are quite simple and only applicable to general-purpose event processing. This paper proposes the assimilation of heuristics into the query execution of CQELS to enable the continuous reordering of its operators and thus improve query applicability in complex situations, not just general-purpose ones. The interspersion of the heuristics engine into the querying of RDF data streams is hence crucial and fundamental to the upscaling of RDF stream processing, as it greatly minimises the lengthiness of the join operations. Besides lessening the inherent time consumption, the heuristics will additionally help spot and rectify any flaws in the queries that users input while searching for useful information in given databases. In general, the heuristics functionality will have a double role in the query optimisation of RDF stream processing: one, to shrink the duration of intermediate-results processing for join operations, and, two, to discard the errors contained in queries, hence curbing flawed query execution and, in turn, escalating time savings during query optimisation.
Section 1: Cost-Based Heuristics Optimisation Approach
3.2.1 Introduction
The move to consolidate heuristics into the query optimisation aspect of RSP engines is ingenious and groundbreaking, to say the least. The implementations of heuristics are geared towards cutting computational costs during the query optimisation and join operations executed within the C-SPARQL and CQELS languages. This section outlines in depth how enlisting the heuristics function helps minimise the costs, estimated in terms of the overall time spent by the optimiser selecting the most effectual query plan/tree that will execute a given query in the least time possible, thus lessening the CPU and input/output costs.
The CQELS and C-SPARQL DBMS optimisers endeavour to boil the given query statements down to a single, most feasible query plan. In the query optimisation world, pinning down a suitable plan is contingent upon which mechanism has the shortest duration as well as the most minimal costs in terms of query execution factors like communication, the processor, and the input/output expenses. These costs are a critical factor and get the utmost consideration during the selection of the ideal query plan tree (Abbass and Newton 2002). When a query is input into an RDF database, the Database Management System (DBMS) initiates a selection process geared towards determining the most potent path to follow to deliver results by the shortest route possible. In this process the optimiser devises several path plans, from which it chooses the most ideal one to utilise. All these hatched path plans, when followed, output equivalent data or information. However, they differ with regard to their cost expenses, specifically in how much time each plan consumes to finalise the data retrieval process and generate the data desired by the computer user or researcher, claim Abbass and Newton (2002). The selection criterion hinges upon a critical question: which path plan will take the least time to reach and deliver the user’s information? The optimisation process revolves around a myriad of circumstances, such as how the query is stated, the access methods, the information layout, and the dataset size (Oracle Help Center 2016). The access frameworks are quite influential at this stage of optimisation, as they dictate whether the data should be accessed by index scans or full table scans. Suppose Path A requires an index scan estimated to take 2 minutes while Path B requires a full table scan estimated to take 2.5 minutes; Path A will be chosen.
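The selection step just described can be reduced to a minimal sketch: each candidate path plan carries an estimated execution time, and the optimiser keeps the minimum. The class and the cost figures in the usage below are illustrative and not part of either engine.

```java
import java.util.List;

/**
 * Illustrative cost-based plan selection: given candidate plans with
 * estimated execution times, return the cheapest one.
 */
public class PlanChooser {
    public static final class Plan {
        public final String name;
        public final double estimatedMinutes;
        public Plan(String name, double estimatedMinutes) {
            this.name = name;
            this.estimatedMinutes = estimatedMinutes;
        }
    }

    // Linear scan for the plan with the smallest estimated time.
    public static Plan cheapest(List<Plan> candidates) {
        Plan best = candidates.get(0);
        for (Plan p : candidates) {
            if (p.estimatedMinutes < best.estimatedMinutes) {
                best = p;
            }
        }
        return best;
    }
}
```

With an index-scan plan estimated at 2.0 minutes and a full-table-scan plan at 2.5 minutes, `cheapest` returns the index-scan plan, matching the Path A/Path B example above. A real optimiser estimates these figures from statistics rather than receiving them directly.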
Much as the conventional optimisers in CQELS and C-SPARQL strive to hatch the most feasible execution plan, there are still gaps in this feature: processor time and communication time, as well as input/output costs, remain considerably high. This section outlines the trends in query optimisation observed before and after the assimilation of heuristics, thus confirming the positive cost-saving impact achieved after its integration. When a query is submitted to the database server, it traverses the DBMS modules in a fixed sequence until the final results are generated (see Figure 2). These constituent DBMS modules consist of a scanner, a parser, a query optimiser, a code generator, and a query processor. As Abbass and Newton (2002) explain, the scanner scrutinises the language tokens, for example the relation names and CQELS/C-SPARQL keywords, in the context of the query statement. The parser follows by certifying the query syntax, its validity, and whether the attribute names are semantically correct. After this, it transforms the query expression into a machine-readable internal representation using a query tree or sometimes a query graph; the tree’s data structure is sketched by means of a calculus expression (Abbass and Newton 2002). The query optimiser then comes into play by reading the machine-readable instruction and forming a multitude of execution plan strategies. The optimiser finally chooses the most amenable path by assessing all pertinent algebraic expressions relating to the input query, favouring the cheapest and shortest one. The code generator then creates a viable code that requests the query processor to execute the plan projected by the optimiser (MacLennan and Tang 2009, p.242).
Figure 2: Query flow through a DBMS (scanner → parser → optimizer → code generator → query processor)
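The traverse through the DBMS modules can be sketched as a pipeline in which each stage consumes the previous stage's output. The Python sketch below is a toy illustration; the stage bodies are invented placeholders for the real scanner, parser, optimizer, code generator, and query processor.

```python
# Toy sketch of the DBMS query flow: each stage consumes the previous
# stage's output. The stage bodies are illustrative placeholders only.

def scan(query):          # tokenise relation names and keywords
    return query.split()

def parse(tokens):        # build an internal representation (a "query tree")
    return {"tree": tokens}

def optimize(tree):       # pick the cheapest of the candidate plans
    return {"plan": tree["tree"], "cost": len(tree["tree"])}

def generate_code(plan):  # emit an executable form of the chosen plan
    return lambda: " ".join(plan["plan"])

def process(code):        # execute the generated code
    return code()

stages_output = process(generate_code(optimize(parse(scan("SELECT name FROM stream")))))
print(stages_output)  # SELECT name FROM stream
```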
As mentioned above, the query optimizer explores relevant algebraic expressions
contained within the various algorithms the DBMS generates for query searches. Traditional
algorithms have always zeroed in on exhaustively enumerating all available alternatives to
empower query searches. However, as explained by Abbass and Newton (2002), this exhaustive
technique is defective when it comes to solving complex queries, as the algorithms cannot
enumerate all possible (potentially millions of) options within a convenient time; instead the
wait becomes long and tiring for the user. This occurrence is evident when an algorithm has to
enumerate join orders for a query whose resulting data is spread over 50 tables: enumerating all
50 tables and joining the data items can take several minutes before results are delivered, failing
on both speed and cost efficiency. To address this drawback, a heuristics solution has been
implemented in both the CQELS and C-SPARQL optimisation processes. This heuristics solution
activates an algorithm that checks the storage file in the DBMS to confirm whether there is a
ready-to-use query plan matching the new input query. If such a plan exists in the storage file,
the algorithm uses it to execute the new query expression, thus eradicating the need to create a
new query plan. This ultimately saves the processing time meant for developing a new query
plan as well as the input/output costs (MacLennan and Tang 2009, p.42). The communication
time spanning between the input of the query and the output of the data results is also
shortened. This improvement in processor time and communication time continues to grow as
time proceeds and even as queries get more intricate.
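The plan-reuse heuristic amounts to a cache keyed by the query: before constructing a new execution plan, the storage file is consulted for a ready-to-use one. The Python sketch below illustrates this idea under that assumption; it is not the engines' actual storage-file mechanism.

```python
# Minimal sketch of the plan-reuse heuristic: before building a new
# execution plan, check a storage file (here a dict) for a ready-to-use one.

plan_storage = {}          # stands in for the DBMS storage file
plans_built = 0            # counts how often the expensive path is taken

def build_plan(query):
    global plans_built
    plans_built += 1       # expensive: enumerate join orders, etc.
    return {"query": query, "steps": ["scan", "join", "project"]}

def get_plan(query):
    if query not in plan_storage:           # no ready-to-use plan yet
        plan_storage[query] = build_plan(query)
    return plan_storage[query]              # reuse eradicates rebuilding

get_plan("SELECT ?s WHERE { ?s ?p ?o }")
get_plan("SELECT ?s WHERE { ?s ?p ?o }")   # second call hits the cache
print(plans_built)  # 1
```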
3.2.2 Proposed heuristics approach
Figure 3: Binary tree
The heuristics solution proposed in this thesis advocates a change in the sequence of
query execution from a normal binary tree to a magic tree that is stored in the given storage file.
Changing the sequence of execution steps allows the DBMS to save computational costs and
time as well (MacLennan and Tang 2009, p.221). In the absence of heuristics, the query
optimiser normally formulates a binary query tree (see Figure 3) from which it derives numerous
path plans before choosing the most optimal alternative. The formulation of the binary tree calls
for redundant operations such as the join, filter, and projection functionalities every time a query
search is initiated within the DBMS. This redundancy contributes majorly to the accumulation
of operational expenses (join, filter, and projection), the time involved in performing these
functionalities, as well as the processor and communication time. Frequent join executions, in
particular, make the RDF data volume being accessed extremely voluminous and bulky, which
in turn makes the manipulation of data depositories more complicated.
However, the addition of heuristics ensures that these binary trees are replaced with a
much more efficient methodology, the magic tree.
Figure 4: Magic tree
The magic tree differs from the conventional binary tree by its innovative way of setting all the
constituent variables (join, filter, and projection) on only one wing of the tree (see Figure 4).
Each of these distinctive variables is then allocated a specific weight by the algorithm, after
which the total weight is used to calculate the cost of the variables in the tree. The criterion for
assigning each individual weight is the amount of time spent by that variable during query
processing; the computational time therefore correlates with the attached weights (MacLennan
and Tang 2009, p.232). The
magic tree reorders marked variables such as the projection stem of the binary query tree and
eliminates the redundancy implemented in binary projection mechanisms. For example, suppose
the applicable cost within the projection stem is x units. If we administer a projection fifteen
times on a nested query, the aggregate cost will be 15x units in the customary binary tree. The
proposed heuristics magic tree, however, shifts the projection facet to a single state, so that if the
projection operation is to be administered on the same nested query, it needs to be administered
only once; the total processing cost would thus be x units only. Table 1 below depicts the
algorithm proposed by the heuristics solution.
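The projection-cost argument can be verified with simple arithmetic; in the Python sketch below, the unit cost x and the repetition count are the hypothetical values used in the example above.

```python
# Worked check of the projection-cost example: a binary tree administers
# the projection on every repetition, the magic tree only once.

x = 1.0            # hypothetical cost of one projection, in units
repetitions = 15   # projections applied to the nested query

binary_tree_cost = repetitions * x   # 15 * x units in the binary tree
magic_tree_cost = x                  # projection shifted to a single state

print(binary_tree_cost, magic_tree_cost)  # 15.0 1.0
```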
Table 1: Algorithm 1
Function: Compose a Magic Tree.
a) Parse the query.
b) Transform the query expression into a machine-readable statement.
c) Form a query tree or graph, depending on the calculus expression used.
d) Shift the selection entity to the head node of the query tree.
e) Eliminate all available candidate selection entities.
f) Form all the dependent groupings; these are shifted to one wing of
the tree.
g) All the leaf nodes are relations; the process therefore halts once it
reaches a leaf.
h) The query processor begins the search query course of action.
i) Once the query processor discovers the data target, it heads over to the
projection stem, where all the other pertinent functionalities are conducted.
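Algorithm 1 may be sketched roughly as follows. The node representation and the weighting rule below are illustrative assumptions (the thesis prototype was written in Java and is not reproduced here).

```python
# Illustrative sketch of Algorithm 1: shift the selection to the head of
# the tree and gather the dependent operators (join, filter, projection)
# on one wing. The weights stand for time spent per operator.

def compose_magic_tree(selection, dependents, weights):
    """Build a 'magic tree': selection at the head, all dependent
    operators chained down a single wing, leaves being relations."""
    wing = [{"op": op, "weight": weights.get(op, 1.0)} for op in dependents]
    return {
        "head": selection,                      # step d)
        "wing": wing,                           # step f)
        "cost": sum(n["weight"] for n in wing), # total weight -> cost
    }

tree = compose_magic_tree(
    selection="?price < 100",
    dependents=["join", "filter", "projection"],
    weights={"join": 3.0, "filter": 1.0, "projection": 0.5},
)
print(tree["cost"])  # 4.5
```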
As MacLennan and Tang (2009, p.144) claim, heuristics has always been a viable
solution for modern computational problems, more so those that deal with voluminous data sets
such as telecommunication and industrial-plant streaming data. The algorithms embedded in
heuristics functions help solve entity optimisation and complex real-world issues by improving
the time, cost, and space required to decipher computational inquiries. In our case, the effect of
heuristics may not be felt or seen immediately, but after a while the cost-saving impacts will
surely become visible. This follows from the working principle assumed by heuristics. As
explained above, during the early implementation stages of the heuristics, the entity operates by
first monitoring how applications work. It performs meticulous appraisals and evaluations of
how program applications, in this case the query optimisation process, are run, and traces all
these moves and formulas into its memory. By this it creates a virtual image of the functioning
of all the steps involved in a query search, from when the query is input to when the data results
are displayed on the screen. The more advanced version of heuristics thoroughly inspects and
then traces the guidelines put in the code of programs prior to passing them over to the
computer's processing unit for execution. This helps the heuristics engine assess and learn the
behaviour and mannerisms of a program while it runs in a virtual setting.
As soon as its memory is packed with the application-performance information, it starts
using this information to revamp activities and even cultivate better channels for enhanced task
execution. In the case of RDF stream processing, a user can input the same query over and over
again during a given period, for example when retrieving information about a certain tweet or
when researching the manufacturing status of a phone from its manufacturer. Every time a
query search is initiated for such a research function, the parser must form a query tree for each
search before handing it over to the query optimiser and code generator to formulate the code
needed in the actual processing of the query statement. In the absence of a heuristics engine,
building a query tree for each and every search of the same research question consumes an
awful lot of communication time and processing expense as well (MacLennan and Tang 2009,
p.39). This time, physical storage space, and processing cost are what we aim to eradicate in our
RDF stream processing. In a heuristics environment, however, the redundant formations of the
same query tree, their optimisations, and the final query processing are noted in the heuristics'
memory. Hence, if the same research question is entered yet again, the parser will simply
proceed to the heuristics' memory and retrieve the query tree that was noted before, instead of
building a new one all over again. The time that would otherwise have been expended in
query-tree formation is therefore saved and, in turn, the communication time is minimised too.
The query search proposed by this heuristic is shown in Table 2.
Table 2: Algorithm 2
Function: The Projected Heuristics Query Search.
a) A query tree is crafted for each query expression submitted to the database
system.
b) The heuristics function reads this binary tree and stores it in a storage folder
dedicated to that particular query tree.
c) The storage folder is then assigned a unique company usage factor (c.u.f.) for easy
identification by the parser, such that the maximum number of storage folders
generated equals the company usage factor.
d) Following this, the heuristics devises a unique magic tree that shifts all the dependent
variables (join, select, and projection) in the binary tree to one side of the tree.
e) When a similar query is submitted by a user, the parser first checks the storage
folder for an equivalent query tree that can be utilised for that input inquiry.
f) If an equivalent stored tree exists, the parser proceeds to the precise branch node
required for processing the inquiry at hand and performs all the relevant courses of action.
g) However, if there is no such tree, it consults the magic tree stored there; if
successful, it halts further searches and performs all the relevant courses of action
necessitated.
h) However, if all these searches fail, such that there is no equivalent branch node even in
the magic tree, the parser resorts to generating a new magic tree as depicted in the first
algorithm, thus incrementing the storage-folder counter.
Lastly, the database server will refresh the folder in the event that the counter is less than the
company usage factor. This is commendable because the number of folders should equal
the company usage factor (MacLennan and Tang 2009, p.19).
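Algorithm 2 can likewise be sketched in miniature. The folder structure, counter, and company-usage-factor value below are illustrative assumptions only.

```python
# Sketch of Algorithm 2: look up a stored tree first, fall back to
# building (and counting) a new magic tree. The company usage factor
# (c.u.f.) caps the number of storage folders.

CUF = 3                      # hypothetical company usage factor
folders = {}                 # query text -> stored magic tree
new_trees_built = 0

def heuristic_search(query):
    global new_trees_built
    if query in folders:                 # steps e)-g): reuse a stored tree
        return folders[query]
    new_trees_built += 1                 # step h): build a new magic tree
    tree = {"query": query, "kind": "magic"}
    if len(folders) < CUF:               # counter stays within the c.u.f.
        folders[query] = tree
    return tree

heuristic_search("Q1")
heuristic_search("Q1")                   # reuse: no new tree built
heuristic_search("Q2")
print(new_trees_built, len(folders))  # 2 2
```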
3.2.3 Results simulation
This section puts this theoretical novel approach of heuristics assimilation into an RDF
stream processing engine into actual practice, through simulation, to confirm whether the
prototype makes good on its promise. The RDF engines tested herewith are the CQELS and
C-SPARQL languages. Simulation here refers to the manner in which the heuristics replication
was conducted over a specified period of time (6 months). A model of the heuristics query
optimization engine was replicated in a Java Runtime Environment (JRE) running on a computer
powered by the Windows operating system. With the help of the JRE, we wrote some core Java
code, which was later compiled and run in the Eclipse environment to execute the given RDF
data streams. The code was written in Java and employed the concept of class handling. The
data structuring integrated in the query tree went hand in hand with dynamic memory allocation
that primarily used linked lists. The outcome of the analysis was as expected: the integration of
heuristics across the RSP-engine board improved cost saving by shrinking the processor
operational costs. A heuristics approach was implemented in the CQELS and C-SPARQL query
languages to form magic trees and also perform the selections earlier. As MacLennan and Tang
(2009, p.66) explain, the heuristics database engine is exploited in the early performance of
selections. This action considerably reduces the size and magnitude of the RDF graph databases,
hence speeding up the query search process overall. For example, if we reflect on the following
CQELS and C-SPARQL query expressions (see Table 3), applying heuristics is beneficial in
terms of how it executes the selection entities very early in the process, hence minimizing the
communication time.
Table 3: Query 1
The customary query processing of these CQELS and C-SPARQL query expressions
would have initiated the formation of a binary query tree as depicted in Figure 3. With heuristics,
however, the database engine will form a magic tree (see Figure 4) that shifts the selection
variable to one side of the tree. As MacLennan and Tang (2009, p.41) inform, the initial query
processing stages of the heuristics approach will absorb some costs in constructing as well as
searching the magic tree. Nonetheless, these costs will be significantly lower than those
expended in the formation and execution of the binary trees. The implementation of the magic
tree likewise reduces all other computational costs involved, since the frequency of the selection
variables also decreases. This cost saving is evident in the comparison of the estimated cost
calculations of both methods, the binary-tree and the magic-tree query processing: the
traditional binary tree's aggregate running costs are 100 units, while the expenses incurred by
the magic tree are 50 units only. Supposing a new query is input for the first time by a user, the
database server will incur seemingly high expenditures in both the formation of the binary tree
and the conversion of this binary tree into a magic tree. However, in the next round there will be
no conversion costs, as the magic tree will be readily available in the heuristics' storage folder.
Additionally, the communication and processor costs will reduce to the same degree as the
conversion costs, as the parser will automatically reach for the magic-tree branch nodes. Figure
5 demonstrates the cost-versus-time chart comparing conventional query processing against our
projected heuristics-based CQELS and C-SPARQL query optimization strategies.
Figure 5: Cost versus time graph
As shown in Figure 5, the preliminary costs are somewhat high, but as the heuristics
functionality continues to track, learn, and store the magic trees in its folders, the overall
computational expenditure decreases with time (Cheung et al. 2006, p.43). To elucidate this
phenomenon: when a new query is fed into an RDF-format database, all the constituent stages
of a tree-match search are carried out, namely parsing, query-tree building, syntax checking,
attribute-name confirmation, optimisation, and code generation. These activities contribute to
the evidently high cost expenditure as well as huge time consumption (MacLennan and Tang
2009, p.71). As time goes by, the heuristics entity monitors the query-search procedure,
identifies the redundant parsing and optimisation sequences, and creates a way out. It achieves
this by tracing a particular binary tree in its storage folder and, from this, deriving an equivalent
magic tree that matches it. Therefore, in subsequent standard query searches there is no need to
create yet another new binary tree for a similar inquiry (MacLennan and Tang 2009, p.83).
Instead, the magic tree is retrieved from the storage file for duplicate tree matching, hence
saving the computational conversion time and costs. The heuristics application becomes even
better with the execution of nested queries, as the data results are delivered much faster and
more efficiently (see Figure 6). Further simulations of the heuristics algorithm can also extend
join properties such as the right and left joins.
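The shape of the cost-versus-time curve can be reproduced in miniature: the 100-unit and 50-unit running costs come from the comparison above, while the one-off conversion cost is a hypothetical value.

```python
# Miniature reproduction of the cost-versus-time behaviour: the first
# round pays for building the binary tree and converting it to a magic
# tree; later rounds only pay the magic-tree cost.

BINARY_COST = 100.0      # aggregate running cost of the binary tree (text)
MAGIC_COST = 50.0        # running cost of the magic tree (text)
CONVERSION_COST = 25.0   # hypothetical one-off binary-to-magic conversion

def round_cost(round_number):
    if round_number == 0:                       # first submission
        return BINARY_COST + CONVERSION_COST
    return MAGIC_COST                           # tree is already stored

costs = [round_cost(r) for r in range(4)]
print(costs)  # [125.0, 50.0, 50.0, 50.0]
```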
Figure 6: Performance versus complexity
3.2.4 The performance comparison graph between the new improved model and the previous
versions of CQELS and C-SPARQL
Most of the considered systems are works in progress and scientific prototypes.
Unsurprisingly, they are not able to support all the query patterns and features. The outputs of
the new improved model and the previous versions of CQELS and C-SPARQL differ
significantly because of their differences in implementation. These differences in performance
mainly result from intrinsic technical issues concerning the handling of streaming data, such as
a potentially fluctuating execution environment and time management.
Table 4: The Performance Comparison by Features

                     Special support for   Input                 Extras
C-SPARQL             TF                    RDF and RDF streams   —
CQELS                NEST, VoS             RDF and RDF streams   Disk spilling
Streaming SPARQL     —                     RDF streams           —
SPARQL stream        NEST                  Relational stream     Ontology-based mapping
EP-SPARQL            EVENT, TF             RDF and RDF streams   Event operators

EVENT: event pattern; VoS: variables on stream; TF: built-in time function; NEST: nested
patterns.
Table 5: Performance Comparison by the Mechanism of Execution

                     Re-execution   Optimisation            Architecture   Scheduling
C-SPARQL             Periodical     Static and algebraic    Black box      Logic plan
CQELS                Eager          Adaptive and physical   White box      Adaptive physical plans
Streaming SPARQL     Periodical     Static and algebraic    White box      Logic plans
SPARQL stream        Periodical     Externalised            Black box      External call
EP-SPARQL            Eager          Externalised            Black box      Logic program
Figure 7: Graphical performance comparison
As the graph shows, the throughput of the scalability and performance tests of C-SPARQL
is considerably lower than that of CQELS and JTALIS. For this reason, it is clear that
recurrent execution is likely to waste significant computing resources. A sliding window
extracts the recurrences, and the outputs can be computed incrementally as a stream. Notably,
the outputs of JTALIS and CQELS are useful in answering recurrent queries.
Query 1 involves counting the number of items over a tumbling window of one second.
Of note, this query uses a physical time window. For statistically robust results, the computation
is done as an average of twenty executions; the main reason for this is the variable execution
time, which also depends on the condition of the machine.
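The behaviour of Query 1, counting items over a one-second tumbling window, can be sketched generically as follows; the arrival timestamps are invented sample data, and the sketch is not CQELS or C-SPARQL syntax.

```python
# Generic sketch of a one-second tumbling-window count: items are
# grouped by the second in which they arrive, and each window emits
# one count. Timestamps (in seconds) are invented sample data.

from collections import Counter

def tumbling_window_counts(timestamps, width=1.0):
    """Count items per tumbling window of the given width (seconds)."""
    return dict(Counter(int(t // width) for t in timestamps))

arrivals = [0.1, 0.4, 0.9, 1.2, 1.3, 2.8]    # hypothetical item times
print(tumbling_window_counts(arrivals))       # {0: 3, 1: 2, 2: 1}
```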
Notice that CQELS performs better than JTALIS because it uses both an adaptive and a
native approach. The performance of JTALIS and C-SPARQL heavily depends on their
underlying systems, a Prolog engine and a relational stream-processing engine respectively. In
similar fashion, CQELS is likely to benefit from a more sophisticated, optimised algorithm
compared to the current one. The only system that indexes and precomputes the intermediate
results over the static data from sub-queries is CQELS. However, neither C-SPARQL nor
CQELS scales well as the number of queries increases, for example queries sharing data
windows and similar patterns. Additionally, the results testify that neither of the systems uses
multiple-query-optimisation techniques to avoid redundant computations among queries that
share computing memory and blocks. In this case, the optimisation occurs only at the static and
algebraic level, since both C-SPARQL and Streaming SPARQL schedule the execution at a
logical level (MacLennan and Tang 2009, p.102). On the contrary, CQELS can choose
alternative execution plans composed from the available operators' physical implementations.
In effect, the optimiser adaptively optimises the execution at the physical level.
Both SPARQL stream and EP-SPARQL schedule the execution through a logic program or a
declarative query; in this case, they fully delegate the optimisation to other systems (Seshadri
and Leung 1998). The technique used to improve the result involves the definition of mappings,
triple patterns, RDF triples, and other operations on mappings, together with the reuse of
notations.
Under the instantaneous RDF dataset and RDF stream, the temporal nature of data is
essential and must be captured in the representation of data for the continuous processing of
dynamic data. This applies to both sources of data, because updates to linked-data collections
are also possible. A static RDF dataset is the special case of an instantaneous RDF dataset in
which G(t + 1) = G(t) for all t ≥ 0, i.e. G(t) = G for all t ∈ ℕ. Pattern matching is the main
primitive operation on both the instantaneous RDF dataset and the RDF stream (MacLennan
and Tang 2009, p.88). Notice that pattern matching extends the triple-pattern semantics of
SPARQL. As a consequence, the notation of denotational semantics becomes helpful for the
formal definition of the query patterns of the processing model. The denotations are the meaning
functions of the semantic compositions of the abstract syntax. These compositions comprise a
total of three operators, namely relational, pattern-matching, and stream operators. Pattern-
matching operators extract triples from a dataset or an RDF stream that are valid and match a
given triple pattern at a certain time t, as shown below.
Pattern matching operator’s abstract syntax
The meaning of the triple-matching pattern operator PG is defined in the same way as in
SPARQL on an RDF dataset at a given timestamp t, as follows
Next is the definition of the window-based triple-matching operator on an RDF stream.
The composability of the denotational semantics results in the definition of the abstract
syntax for the compound query pattern constructed from both the logical operators and the
matching operators. Additionally, the definition of the aggregation operator comes before the
definition of its syntax and semantics (MacLennan and Tang 2009, p.99). Notice that a uniform
mapping set contains only the mappings that have similar domains; in this case, a consistent
mapping is defined on an aggregate operator set Ω. The relational operators' abstract syntax is
therefore defined recursively as shown below.
The mapping of the operators therefore becomes
Under the streaming-operators abstract syntax, the streaming operator produces either an RDF
stream or a relational stream from the above relational operators.
Next is the definition of the declarative query language CQELS-QL, the CQELS query
language, for the execution framework of CQELS. The SPARQL grammar in EBNF notation
helps in the definition of CQELS-QL. The first step is the addition of a query pattern for the
representation of window operators on RDF streams.
Chapter 4: State of the Art in Linked Stream Data Processing (LSDP)
According to Gedik (2006), Linked Stream Data derives its usefulness from bridging the
gap between Linked Data and data streams, and from facilitating the integration of data among
them. Extending query processing to Resource Description Framework data streams enables the
query processor to treat the RDF elements of stream nodes and allows access to RDF streams in
the form of materialized data (Abdulla and Matzke 2006, p.907; Buchanan and Shortliffe 1984,
p.777; Cole and Conley 2009, p.809; Zhang and Kollios 2007, p.733). Notably, the whole
process makes it possible to apply other SPARQL query patterns (Cheung et al. 2006, p.444).
In short, this chapter explores both the techniques and concepts of stream processing and
introduces Linked Stream Data Processing engines (Calhoun and Riemer 2001, p.447).
Additionally, the inclusion of the CQELS engine in this chapter helps clarify the contribution of
this field.
4.1 Query Semantics and Data Models
This section mainly explores the possible ways of formalizing the data model for
Resource Description Framework datasets and the Resource Description Framework streams in a
continuous context (Cole and Conley 2009, p.931). Additionally, it touches on the continuous
query semantics.
4.2 Data Model
It is important to note that the modelling of Linked Stream Data occurs by extending the
meaning of both RDF triples and RDF nodes (Cohen 1985, p.303). An RDF stream is a bag of
elements, each element being an RDF triple carrying a temporal annotation such as a time
interval or a timestamp. An interval-based label is a pair of timestamps; in common cases,
natural numbers represent logical time (Eastwood 2008, p.278). The pair of timestamps, 'start'
and 'end', specifies the valid interval of the Resource Description Framework triple (Dean 2009,
p.264). A point-based label, on the other hand, is just a single natural number representing the
point in time at which the triple was received or recorded (Buchanan and Shortliffe 1984,
p.708). One may consider point-based labels less expressive and more redundant than
interval-based labels. Point-based labels are, however, less expensive than interval-based labels,
because the former is simply the important special case of the latter in which start = end.
According to research (e.g. Abbass and Newton 2002, p.946), Streaming SPARQL and
EP-SPARQL find such labels useful in representing the items of the physical data stream as
triple-based events.
For a streaming data source, a point-based label turns out to be more practical because it
allows for the instantaneous and unexpected generation of a triple. A good example is a tracking
system that detects people in an office (Buchanan and Shortliffe 1984, p.707). Notably, this kind
of activity results in the generation of a timestamped triple whenever a reading is received from
a sensor. To generate the valid interval of a triple, the system must buffer the readings for
further processing (Bolton 1996, p.407). Furthermore, instantaneous point-based labels play a
vital role for applications that require data to be processed immediately it arrives in the system.
Additionally, the concept of the Resource Description Framework dataset must be included in
the data model to enable the integration of stream data with non-stream data. Primarily, the
Resource Description Framework dataset is considered a static data source by the current state
of the art. In light of the findings (e.g. by Abbass and Newton 2002, p.944), it is important to
note that data-stream applications can run for periods ranging from days to years. In addition,
changes in the Resource Description Framework dataset during the lifetime of a query must be
reflected in the continuous query's outputs.
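The two labelling schemes can be modelled directly: an interval-based label is a pair of timestamps, and a point-based label is the special case with start = end. The representation below is an assumption for illustration, with invented triples.

```python
# Sketch of the data model: an RDF stream is a bag of triples with a
# temporal annotation. A point-based label is the special case of an
# interval-based label in which start == end.

def interval_triple(s, p, o, start, end):
    return {"triple": (s, p, o), "start": start, "end": end}

def point_triple(s, p, o, t):
    # point-based label: a single timestamp, i.e. start == end
    return interval_triple(s, p, o, t, t)

reading = point_triple(":alice", ":detectedAt", ":office", 7)
stay = interval_triple(":alice", ":locatedIn", ":office", 7, 42)

print(reading["start"] == reading["end"])  # True
print(stay["end"] - stay["start"])         # 35
```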
4.3 Query Semantics
The current state of the art reuses SPARQL-like query operators such as union, join, and
filter. In practice, these operators consume and output mappings (Abbass and Newton 2002,
p.556). In addition, operators on Resource Description Framework streams are introduced to
produce the output mappings. Worth noting, C-SPARQL defines its stream operator to access a
Resource Description Framework stream identified by its IRI (Cohen 1985, p.301).
Additionally, the window operator is defined to help access a Resource Description Framework
stream through certain windows; essentially, it adopts the window operator of CQL on Resource
Description Framework streams (Cole and Conley 2009, p.954). It is also important to note that
the semantics of a continuous query on Resource Description Framework data is defined as a
composition of query operators. Practically, a query is composed as an operator graph in both
C-SPARQL and Streaming SPARQL (Dean 2009, p.237), which base the definition of the query
graph on the query operators.
4.4 Query Languages
There is a need to introduce query patterns for expressing the primitive operators in
order to fully define a declarative query language for Linked Stream Data (Abdulla and Matzke
2006, p.956; Buchanan and Shortliffe 1984, p.561; Zhang and Kollios 2007, p.654). In practice,
these primitive operators are the window-matching, triple-matching, and sequential operators
(Eastwood 2008, p.509). The composition of these basic query patterns can then be expressed
with the AND, OPT, UNION, and FILTER patterns of SPARQL. Another important thing to note
is that these patterns correspond to the operators in the earlier definitions. In support of the
aggregation operators, several pieces of research (e.g. Abdulla and Matzke 2006, p.966;
Buchanan and Shortliffe 1984, p.906; Zhang and Kollios 2007, p.749) define their semantics
with the AGG query pattern. This kind of pattern is composable with the other types of
SPARQL patterns: the evaluation of the query pattern AGG is [[P AGG A]] = A([[P]]), where A
is the aggregate function consuming the output mappings of a SPARQL query pattern P and
returning a set of mappings. Letting P, P1, and P2 be basic or composite query patterns, a
declarative query is composed recursively by rules of this kind:
[[P1 UNION P2]] = [[P1]] ∪ [[P2]],
[[P1 AND P2]] = [[P1]] ⋈ [[P2]],
[[P1 OPT P2]] = [[P1]] ⟕ [[P2]],
[[P AGG A]] = A([[P]]),
[[P FILTER R]] = {µ ∈ [[P]] | µ ⊨ R}.
In practice, these types of patterns help extend the grammar of SPARQL to continuous
queries.
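The recursive rules above operate on sets of mappings. A minimal interpretation, with invented mappings and a simplified compatibility test, might look as follows.

```python
# Minimal interpretation of the pattern-composition rules over sets of
# mappings (variable -> value dicts). UNION is set union, AND joins
# compatible mappings, FILTER keeps mappings satisfying a predicate.

def union(p1, p2):
    return p1 + [m for m in p2 if m not in p1]

def compatible(m1, m2):
    # two mappings agree on every variable they share
    return all(m1[v] == m2[v] for v in m1.keys() & m2.keys())

def and_join(p1, p2):
    return [{**m1, **m2} for m1 in p1 for m2 in p2 if compatible(m1, m2)]

def filter_pattern(p, predicate):
    return [m for m in p if predicate(m)]

p1 = [{"?s": "alice"}, {"?s": "bob"}]
p2 = [{"?s": "alice", "?age": 30}]

joined = and_join(p1, p2)
print(joined)  # [{'?s': 'alice', '?age': 30}]
```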
It is important to note that C-SPARQL extends SPARQL in a helpful way: the triple
patterns of its CONSTRUCT clause produce a Resource Description Framework stream as
output. In essence, the grammars of C-SPARQL and Streaming SPARQL are the same. In
practice, the uses of databases are manifold (Jeuring 2012, p.417). In fact, they provide a means
of retrieving either parts of records or entire records, and of performing different kinds of
calculations before displaying the outcomes (Abdulla and Matzke 2006, p.504; Buchanan and
Shortliffe 1984, p.703; Cole and Conley 2009, p.968; Zhang and Kollios 2007, p.974).
Practically, the query language is the interface that specifies such manipulations (Lucas 2010,
p.608). Early query languages were very complex, so interaction with electronic databases was
carried out by individuals with special knowledge (MacLennan and Tang 2009, p.673). Modern
interfaces are more user-friendly; they also allow casual users to access the information in the
database.
A good example of the main types of this kind of query modes is the fill in the blank, the
menu, and the structured query (Gedik 2006, p.422). Most importantly, the menu needs an
individual to choose from various alternatives that get displayed on a monitor that are
particularly suitable for novices (Maringer 2005, p.342). On the other hand, the technique of the
fill in the blank refers to one that allows the user get a promotion to enter the key words such as
the statements (Moustakas 1990, p.623). Worth noting, the approach of the structured query is
very effective with the databases that are relational. In simple terms, it has a powerful syntax that
is formal and, in practice, a programming language. Additionally, it can accommodate logical
operators (Mueller 2009, p.506). Furthermore, the Structured Query Language or the SQL has
some various forms during the implementation of this kind of approach. Some of the various
forms include: selecting [[field Fa, Fb, Fc..., Fn]], on the other hand, where [[Fa Field = abc]]
and [[field Fb = def]], and from [[ database Da, Db, Dc… Dn]]. Several studies (e.g. Abdulla and
Matzke 2006, p.678; Buchanan and Shortliffe 1984, p.985; Zhang and Kollios 2007, p.992),
shows that it is important to note that the structured query language is supporting the searching
of the database and also other activities by the use of various commands such as ‘sum’, ‘print’,
'find', 'delete', and so on (Nirmal 1990, p.496). A natural-language query resembles an
ordinary sentence, whereas SQL relies on the formal syntax of its statements. Additionally,
queries can also be represented in the form of tables.
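As a concrete illustration of the structured query mode, the following sketch runs such a query against an in-memory SQLite database. The table and field names (Da, Fa, Fb, Fc) are hypothetical, mirroring the schematic SELECT ... FROM ... WHERE form above.

```python
import sqlite3

# Illustrative in-memory database; table and field names are hypothetical,
# echoing the schematic SELECT ... FROM ... WHERE form in the text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Da (Fa TEXT, Fb TEXT, Fc INTEGER)")
conn.executemany("INSERT INTO Da VALUES (?, ?, ?)",
                 [("abc", "def", 1), ("abc", "xyz", 2), ("zzz", "def", 3)])

# A structured query combining logical operators (AND), as described above.
rows = conn.execute(
    "SELECT Fa, Fb, Fc FROM Da WHERE Fa = 'abc' AND Fb = 'def'"
).fetchall()
print(rows)  # → [('abc', 'def', 1)]

# SQL also supports aggregate commands such as 'sum':
total = conn.execute("SELECT SUM(Fc) FROM Da").fetchone()[0]
print(total)  # → 6
```

Only the row satisfying both conditions is returned, while the aggregate command operates over the whole table.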
The technique known as QBE (query by example) displays an empty form. According to
Mcllroy (1998), the searcher enters the appropriate search specifications into the relevant
columns, and the program then constructs an SQL query from the table and executes it (Zhang
and Kollios 2007, p.997). In practice, natural language is the most flexible query language
(Abdulla and Matzke 2006, p.911; Buchanan and Shortliffe 1984, p.703; Zhang and Kollios
2007, p.707). Some commercial database management software allows natural-language
sentences to be used as constraints for searching the databases (Schreiber 1977, p.781). In
essence, these programs parse the query and recognise synonyms and action words (Abdulla
and Matzke 2006, p.1002; Buchanan and Shortliffe 1984, p.734; Zhang and Kollios 2007,
p.836). In addition, they identify file and field names and perform the required logical
operations (Seshadri and Leung 1998, p.699). There has also been progress in spoken
natural-language queries, following the acceptance of experimental systems (Sims and Yocom
2008, p.1003). However, the ability to employ unrestricted natural language to query
unstructured information requires further advances in machine understanding of natural
language (Wei 2011, p.354), chiefly in representing the pragmatic and semantic context of
ideas.
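The QBE process described above can be sketched as a small helper that builds an SQL statement from the columns the searcher filled in. The function name and the parameterised style are illustrative assumptions, not any particular product's interface.

```python
# A minimal sketch of query-by-example: the "form" is a mapping from column
# names to the values the searcher typed in; empty columns are ignored.
# Both the helper name and the table/column names are hypothetical.
def qbe_to_sql(table, form):
    filled = {col: val for col, val in form.items() if val}
    where = " AND ".join(f"{col} = ?" for col in filled)
    sql = f"SELECT * FROM {table}"
    if where:
        sql += f" WHERE {where}"
    return sql, list(filled.values())

# The searcher fills in only the 'author' column of an otherwise empty form:
sql, params = qbe_to_sql("books", {"title": "", "author": "Codd", "year": ""})
print(sql)     # → SELECT * FROM books WHERE author = ?
print(params)  # → ['Codd']
```

The program thus constructs and parameterises the SQL query on the user's behalf, exactly the step the text attributes to the QBE system.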
Chapter 5: The Optimization Solutions for the CQELS
In essence, the CQELS execution framework supports adaptive, native query execution over
RDF datasets and RDF streams (Bolton 1996, p.404). Worth noting, the framework's white-box
architecture accepts both RDF datasets and RDF streams as inputs and returns outputs as either
relational streams or RDF streams in the SPARQL result format (Abdulla and Matzke 2006,
p.702; Buchanan and Shortliffe 1984, p.497). In practice, the output RDF streams can be fed
into any CQELS engine (Wei 2011, p.4078), while the relational streams can serve other
relational stream processing systems (Cheung et al. 2006, p.497). Processing works as follows:
stream data is pushed to the input manager, and the encoder encodes it into the normalised
input stream representation (Cole and Conley 2009, p.1007). The dynamic executor consumes
this encoded input, and the decoder decodes the dynamic executor's outputs by streaming them
to the receiver (Abdulla and Matzke 2006, p.749). The decoder and the encoder share a
dictionary for the decoding and encoding operations. Additionally, the dynamic executor
accesses the static RDF datasets via the cache fetcher; these datasets may be hosted in remote
or local RDF stores and accessed through SPARQL endpoints (Cole and Conley 2009, p.1011).
The cache fetcher retrieves the required data and encodes it for the cache manager using the
encoder (Wei 2011, p.507). Worth noting, the normalised representation encodes the
intermediate results so that they share the same dictionary as the input stream.
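The shared dictionary at the heart of the encoder and decoder can be sketched as a simple term-to-identifier mapping: RDF terms receive an integer identifier on first sight, so triples become compact tuples of integers. The class and method names below are illustrative, not CQELS's actual API.

```python
# A minimal sketch of a shared encoding dictionary: RDF terms are mapped to
# integer ids on first sight; because the encoder and decoder use the same
# dictionary, encoded data can be decoded back losslessly. Names are
# illustrative, not the engine's actual API.
class Dictionary:
    def __init__(self):
        self.term_to_id = {}
        self.id_to_term = []

    def encode(self, term):
        # Assign a fresh id the first time a term is seen.
        if term not in self.term_to_id:
            self.term_to_id[term] = len(self.id_to_term)
            self.id_to_term.append(term)
        return self.term_to_id[term]

    def decode(self, term_id):
        return self.id_to_term[term_id]

d = Dictionary()
triple = (":alice", ":knows", ":bob")
encoded = tuple(d.encode(t) for t in triple)
print(encoded)  # → (0, 1, 2)
# Sharing the dictionary makes decoding a round trip:
print(tuple(d.decode(i) for i in encoded))  # → (':alice', ':knows', ':bob')
```

Operating on small integer ids rather than full RDF terms is what makes the normalised representation cheap to compare and join during execution.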
 
cnn.pptx Convolutional neural network used for image classication
cnn.pptx Convolutional neural network used for image classicationcnn.pptx Convolutional neural network used for image classication
cnn.pptx Convolutional neural network used for image classication
 
The Python for beginners. This is an advance computer language.
The Python for beginners. This is an advance computer language.The Python for beginners. This is an advance computer language.
The Python for beginners. This is an advance computer language.
 
john krisinger-the science and history of the alcoholic beverage.pptx
john krisinger-the science and history of the alcoholic beverage.pptxjohn krisinger-the science and history of the alcoholic beverage.pptx
john krisinger-the science and history of the alcoholic beverage.pptx
 
一比一原版(CalArts毕业证)加利福尼亚艺术学院毕业证如何办理
一比一原版(CalArts毕业证)加利福尼亚艺术学院毕业证如何办理一比一原版(CalArts毕业证)加利福尼亚艺术学院毕业证如何办理
一比一原版(CalArts毕业证)加利福尼亚艺术学院毕业证如何办理
 
spirit beverages ppt without graphics.pptx
spirit beverages ppt without graphics.pptxspirit beverages ppt without graphics.pptx
spirit beverages ppt without graphics.pptx
 
4. Mosca vol I -Fisica-Tipler-5ta-Edicion-Vol-1.pdf
4. Mosca vol I -Fisica-Tipler-5ta-Edicion-Vol-1.pdf4. Mosca vol I -Fisica-Tipler-5ta-Edicion-Vol-1.pdf
4. Mosca vol I -Fisica-Tipler-5ta-Edicion-Vol-1.pdf
 
学校原版美国波士顿大学毕业证学历学位证书原版一模一样
学校原版美国波士顿大学毕业证学历学位证书原版一模一样学校原版美国波士顿大学毕业证学历学位证书原版一模一样
学校原版美国波士顿大学毕业证学历学位证书原版一模一样
 
Generative AI leverages algorithms to create various forms of content
Generative AI leverages algorithms to create various forms of contentGenerative AI leverages algorithms to create various forms of content
Generative AI leverages algorithms to create various forms of content
 
Mechanical Engineering on AAI Summer Training Report-003.pdf
Mechanical Engineering on AAI Summer Training Report-003.pdfMechanical Engineering on AAI Summer Training Report-003.pdf
Mechanical Engineering on AAI Summer Training Report-003.pdf
 
CEC 352 - SATELLITE COMMUNICATION UNIT 1
CEC 352 - SATELLITE COMMUNICATION UNIT 1CEC 352 - SATELLITE COMMUNICATION UNIT 1
CEC 352 - SATELLITE COMMUNICATION UNIT 1
 
Properties Railway Sleepers and Test.pptx
Properties Railway Sleepers and Test.pptxProperties Railway Sleepers and Test.pptx
Properties Railway Sleepers and Test.pptx
 
Use PyCharm for remote debugging of WSL on a Windo cf5c162d672e4e58b4dde5d797...
Use PyCharm for remote debugging of WSL on a Windo cf5c162d672e4e58b4dde5d797...Use PyCharm for remote debugging of WSL on a Windo cf5c162d672e4e58b4dde5d797...
Use PyCharm for remote debugging of WSL on a Windo cf5c162d672e4e58b4dde5d797...
 
Engineering Drawings Lecture Detail Drawings 2014.pdf
Engineering Drawings Lecture Detail Drawings 2014.pdfEngineering Drawings Lecture Detail Drawings 2014.pdf
Engineering Drawings Lecture Detail Drawings 2014.pdf
 
Material for memory and display system h
Material for memory and display system hMaterial for memory and display system h
Material for memory and display system h
 
LLM Fine Tuning with QLoRA Cassandra Lunch 4, presented by Anant
LLM Fine Tuning with QLoRA Cassandra Lunch 4, presented by AnantLLM Fine Tuning with QLoRA Cassandra Lunch 4, presented by Anant
LLM Fine Tuning with QLoRA Cassandra Lunch 4, presented by Anant
 
Unit-III-ELECTROCHEMICAL STORAGE DEVICES.ppt
Unit-III-ELECTROCHEMICAL STORAGE DEVICES.pptUnit-III-ELECTROCHEMICAL STORAGE DEVICES.ppt
Unit-III-ELECTROCHEMICAL STORAGE DEVICES.ppt
 

My thesis

  • 1. HEURISTICS-BASED QUERY OPTIMISATION SOLUTION IMPLEMENTATION IN RSP ENGINES: THE CQELS AND C-SPARQL. Submitted in fulfilment of the requirements for the degree of Master of Science. Supervisor: Co-supervisor. The Insight Centre for Data Analytics, National University of Ireland, Galway. September 2016
  • 2. Abstract This thesis addresses the importance of basing the query optimisation process executed in RDF stream processing engines on an efficient heuristics engine. The Resource Description Framework (RDF) has become a de facto standard for representing and communicating real-time data streams collected from medical institutions, industrial plants, financial entities, and telecommunication service providers. For instance, DBpedia and YAGO reinforce structured querying of Wikipedia searches by retrieving metadata and encoding it in RDF. Likewise, biological information such as experiments and their results is stored as RDF datasets to enable effective communication between chemists and biologists. The data streaming framework has been brought to prominence by Tim Berners-Lee's Semantic Web, which streams linked data from source documents and applications and thus serves users with precise web pages. However, the query optimisation performed in the two RSP query languages studied here, CQELS and C-SPARQL, is still somewhat deficient with respect to the time expended before search results are delivered. The execution of flawed queries is another worrying factor in the query optimisation function of RSP engines. All of these elements, lengthy run-times, expensive computations such as join operations, and the execution of inaccurate queries, contribute to the degradation of RDF stream processing. Heuristics help identify early error signs in user queries and resolve them using built-in configurations and algorithms. The novel heuristics optimisation model can be used as a benchmark for querying Semantic Web metadata in domains such as military logistics, data warehousing, engineering analysis, and health care. Some of the main
  • 3. contributions of this research work include: (i) deploying a reference implementation on the existing CQELS and C-SPARQL execution frameworks; (ii) extending the two RSP engines (CQELS and C-SPARQL) so that the new engine allows processing and resource-space sharing among multiple concurrent queries; (iii) evaluating the performance of the extended RSP engines and comparing them with the originally released CQELS and C-SPARQL engines. The evaluation results show a remarkable improvement in performance and demonstrate the practicality of the approach used.
  • 4. Table of Contents
    Table of Contents .... 4
    Chapter 1: Introduction .... 9
      1.1 Motivation .... 9
      1.2 Problem Statement and Hypotheses .... 10
      1.3 The Outcome of the Thesis .... 14
        1.3.1 Adaptive execution framework .... 14
        1.3.2 The linked data stream adaptive processing model .... 14
        1.3.3 Algorithms and data structures for triple-based windowing operator incremental evaluation .... 15
        1.3.4 The techniques for optimization for multiway joins .... 16
      1.4 The Outline of This Thesis .... 16
    Chapter 2: The General Background .... 17
      2.1 Introduction .... 17
      2.2 Comparative and Survey Evaluations .... 24
      2.3 Query Optimization .... 27
      2.4 RDF Stream Processing and Semantic Web .... 29
    Chapter 3: Background to RSP Engines .... 32
      3.1 C-SPARQL .... 32
      3.2 CQELS .... 32
        3.2.1 Introduction .... 34
        3.2.2 Proposed heuristics approach .... 37
        3.2.3 Results simulation .... 43
        3.2.4 The performance comparison graph between new improved model and the previous version of CQELS and C-SPARQL .... 46
    Chapter 4: State of the Art in LSDP or the Linked Stream Data Processing .... 53
      4.1 Query Semantics and Data Models .... 53
      4.2 Data Model .... 53
      4.3 Query Semantics .... 55
      4.4 Query Languages .... 55
  • 5. Chapter 5: The Optimization Solutions for the CQELS .... 59
      5.1 The Adaptive Optimizer .... 65
      5.2 The Dynamic Executor .... 67
    Chapter 6: Exploration of the RDF Engine – Continuous C-SPARQL .... 69
    Chapter 7: Adaptive Query Optimiser in RDF Engines .... 74
      7.1 Adaptive Query Optimiser .... 74
      7.2 Multiway Joins Adaptive Cost-based Optimisation .... 74
      7.3 Shared Window Joins Optimisation .... 76
      7.4 Multiple Join Operator .... 76
      7.5 Features of Adaptive Query Optimization .... 78
      7.6 Adaptive Plans Concepts .... 79
    Chapter 8: Conclusion and Future Work .... 81
      8.1 Conclusion .... 81
      8.2 Future Work .... 84
    References .... 87
  • 6. List of Figures
    Figure 1: Semantic Web processing .... 29
    Figure 2: Query flow through a DBMS .... 37
    Figure 3: Binary tree .... 38
    Figure 4: Magic tree .... 39
    Figure 5: Cost versus time graph .... 45
    Figure 6: Performance versus complexity .... 46
    Figure 7: Graphical performance comparison .... 48
    Figure 8: An architecture of the C-SPARQL engine .... 72
  • 7. List of Tables
    Table 1: Algorithm 1 .... 40
    Table 2: Algorithm 2 .... 42
    Table 3: Query 1 .... 44
    Table 4: The Performance Comparison by Features .... 47
    Table 5: Performance Comparison by the Mechanism of Execution .... 47
  • 8. Summary This work explores query optimisation solution implementation in two RSP engines, CQELS and C-SPARQL. The framework presents a continuous query language that is compatible with SPARQL and is defined over both Linked Data and Linked Stream Data. In practice, the framework is very flexible, enabling performance gains of several orders of magnitude over related systems. An efficient hybrid physical data organisation, which uses a novel data structure to support the algorithms, helps the engines handle high-update-throughput RDF streams alongside large RDF datasets. The framework also provides several adaptive optimisation algorithms. This thesis further provides extensive experimental evaluations demonstrating the performance advantages of the CQELS and C-SPARQL processing engines and framework. These assessments cover a comprehensive set of parameters that play a significant role in dictating the performance of continuous queries over both Linked Data and Linked Stream Data.
  • 9. Chapter 1: Introduction As the primary purpose of this research study is to explore the importance of basing the query optimisation process executed in RDF stream processing engines on an efficient heuristics engine, the introduction starts with the motivation. It then discusses the problem statement and hypotheses, touches on the thesis outcome, and finally presents the thesis outline. 1.1 Motivation It is crucial to note that the world is currently witnessing a paradigm shift (Abdulla and Matzke 2006, p.29). Real-time data, and data that depends on such time, continues to become ubiquitous (MacLennan and Tang 2009, p.61). Until a few years ago, little was known about sensor devices such as compasses, cameras, mobile phones, GPS receivers, and accelerometers (Mueller 2009). Weather-observation stations measuring humidity, temperature, and similar quantities produce ever larger amounts of information in the form of data streams (Cheung, Hong, and Fong 2006, p.55). Patient-monitoring systems that track blood pressure, heart rate, and so on, and location-tracking systems such as RFID and GPS, play a vital role in this process. Moreover, building management systems that record environmental conditions and energy consumption, and cars that monitor both driver and engine (Abdulla and Matzke 2006), show an equally tremendous increase in the production of such information (Cole and Conley 2009, p.53). In addition, the web has several services, including Facebook, Twitter, and
  • 10. some blogs, that deliver streams of typically unstructured real-time data on various topics. 1.2 Problem Statement and Hypotheses In practice, the motivation for this thesis leads to larger research problems that arise when building an efficient Linked Stream Data query processing engine. One of the major problems is how to design a new declarative query language. According to research (e.g. Abdulla and Matzke 2006, p.145; Buchanan and Shortliffe 1984, p.99), this problem arises because neither SPARQL nor the state-of-the-art continuous query languages can query Linked Stream Data. A query language requires sound semantics and a formal data model for its continuous query operators (MacLennan and Tang 2009, p.187). In essence, the data model must be able to represent both Linked Data and Linked Stream Data in a unified view. The new data model must therefore be an extension of the Resource Description Framework model, to allow transparent integration with conventional Resource Description Framework databases (Zhang and Kollios 2007, p.85). Continuous query processing also requires a temporal aspect of the data, a property that has not previously been covered by the Resource Description Framework (MacLennan and Tang 2009, p.22). Alongside the data model, there must be a definition of graph-based query operators with continuous semantics that specify the meaning of the declarative query patterns (Buchanan and Shortliffe 1984, p.163). Worth noting, to reduce the learning effort, it is important to have query patterns that resemble SPARQL. This in turn requires some alignment of the query operators with the semantics of SPARQL. Additionally, this alignment must be
  • 11. compatible with window operations as defined in traditional continuous query languages such as CQL. It is important to note that, given the disadvantages of using unmodified triple stores and data stream management systems for Linked Stream Data, RDF-based stream data raises new issues for the physical organisation of both Linked Data and Linked Stream Data (MacLennan and Tang 2009, p.149). Most importantly, a triple table storing identifiers that represent literals and URIs is the standard storage model (Abdulla and Matzke 2006, p.109). This is combined with mapping tables, in the form of dictionaries, that translate the identifiers back into their lexical forms (Cole and Conley 2009, p.203). Linked Stream Data necessitates a high write throughput, whereas triple stores are designed for heavily read-intensive contexts (Zhang and Kollios 2007, p.148). Data stream management systems remedy the write-intensive requirement by using in-memory storage; however, Linked Data can sometimes be impossible to host in main memory (Cole and Conley 2009, p.209). Furthermore, RDF-based data elements such as temporal RDF triples and RDF triples are very small, so they present an enormous number of individual data points in comparison to the quantity of encoded information. The row-based data structures used in relational data stream management systems are not efficient enough here, since they need tuple headers whose sizes can dominate the total storage size (Cole and Conley 2009, p.211). Row-based data structures designed for shorter and wider tables can also significantly increase the cost of stream processing. In effect, there is a need for a new physical
  • 12. approach of organisation for processing both Linked Data and Linked Stream Data (Buchanan and Shortliffe 1984, p.92). RDF-based continuous query operators typically operate on one or a few very large tables (MacLennan and Tang 2009). It is therefore vital to have indexes for random access to data items. Most modern RDF stores provide massive indexing strategies to overcome this handicap (Cole and Conley 2009, p.173). In essence, it is always possible to bypass such tables, since the indexes cover all the access patterns. Notably, a comprehensive indexing scheme has a very high maintenance cost, making it impractical for stream processing. In addition, some stream data indexing solutions might appear helpful, but their designs make them applicable only to relational streams (Abdulla and Matzke 2006). In effect, investigating hybrid solutions that can serve as indexing strategies for both stream data processing and triple stores forms an interesting problem (Cole and Conley 2009, p.239). Another issue associated with the physical representation of RDF-based stream data is how to evaluate the unbounded nature of streams against the window operators efficiently. It is worth noting that there have been several attempts in data stream management systems to support sliding-window queries (Cole and Conley 2009, p.243). One such effort is the independent re-evaluation of each window, separate from all other windows; this process is referred to as re-evaluation computation (Abdulla and Matzke 2006, p.199). This approach is used in both Borealis and Aurora. There is also another method, known as incremental evaluation computation, which only processes the changes that
  • 13. get expired from or inserted into the windows in the query pipeline (MacLennan and Tang 2009, p.272). This approach is used in Nile and STREAM. In the context of these activities, there exist some shortcomings to employing incremental evaluation methods (Cole and Conley 2009, p.287). These methods include both negative tuples and direct timestamps. Notably, the negative-tuple method doubles the number of tuples flowing through the query pipeline, while the direct-timestamp method requires extra timestamps. With the introduction of the new data structures in this thesis, the associated algorithms for computing windowing operators must always address the unusual characteristics of the data. An RDF triple store has exceptionally thin and long tables to which standard optimisations do not apply (Cole and Conley 2009, p.368). It is therefore quite challenging for traditional data stream management systems to provide statistics that are relevant to the query optimizer, and this challenge carries over to processing Linked Data and Linked Stream Data. It becomes even more challenging to maintain statistics for highly dynamic datasets in a stream-processing setting (Cole and Conley 2009, p.394). Most importantly, adaptive query optimisation for this type of continuous query processing becomes harder to achieve due to the unpredictability of RDF data and the dynamic nature of stream data distributions (MacLennan and Tang 2009, p.400). Moreover, SPARQL-like queries often share query patterns, posing the requirement of multi-query optimisation (Cole and Conley 2009, p.386). Although there exist several efforts in multi-query optimisation, some of the approaches proposed for relational streams might fail to work on RDF-based streams (Abdulla and Matzke 2006, p.397). In light of these approaches, such failure
  • 14. to work mostly results from the different nature of RDF streams in comparison to relational ones (Zhang and Kollios 2007, p.391). In effect, it becomes very challenging to enable multi-query optimisation for Linked Data Streams. 1.3 The Outcome of the Thesis In light of the issues stated above, the outcomes of this thesis include: 1.3.1 Adaptive execution framework This framework will enable adaptivity in the RSP engines CQELS and C-SPARQL (Abdulla and Matzke 2006, p.402). Additionally, the framework allows full control of the execution process, with the flexibility to add new algorithms and new data structures to the query engine component (MacLennan and Tang 2009, p.433). Essentially, the framework uses encoding mechanisms to enable an implementation with a small footprint and a lower operator workload, by operating only on fixed, small-size integers (Buchanan and Shortliffe 1984, p.266). It is important to note that the caching solution for the Linked Data parts of subqueries helps improve the performance and scalability of query processing over collections of Linked Data (Zhang and Kollios 2007). In practice, the framework can address the scalability problem of integrating large static datasets with the proposed caching mechanism. 1.3.2 The linked data stream adaptive processing model This thesis proposes an adaptive processing model comprising a formal definition of the query semantics, the data model, and the execution model (Cole and Conley 2009, p.437). The data model covers both the temporal aspect of Linked Data sets and the Linked Stream Data that are yet to be addressed (Zhang and Kollios 2007, p.434). The query semantics are formalised using both operational and mathematical meanings. In the first place, the precise meaning is helpful in showing the way of mapping a
  • 15. declarative query fragment to the corresponding mathematical expressions (Cole and Conley 2009, p.441). Additionally, abstract syntaxes accompany all the query fragments to define a declarative query language that extends SPARQL (Buchanan and Shortliffe 1984, p.280; Zhang and Kollios 2007, p.404). The operational meanings, on the other hand, define how the operators in the expressions are executed in physical execution plans (MacLennan and Tang 2009, p.432). In this case, the operational semantics provides a performance model for the continuous execution of equivalent execution plans for a query expressed in either the CQELS or the C-SPARQL language (Cole and Conley 2009, p.470). This operational feature facilitates the adaptivity of execution engines based on the processing models (Zhang and Kollios 2007, p.355), because it allows the execution engine to switch dynamically from the current execution plan to an equivalent one in order to adapt to run-time variations (MacLennan and Tang 2009, p.446). In short, the CQELS language is both one of the first query languages for Linked Stream Data and the only one accompanied by sound operational and mathematical semantics. 1.3.3 Algorithms and data structures for triple-based windowing operator incremental evaluation Here, the thesis introduces novel operator-aware data structures, together with efficient incremental evaluation algorithms, to deal with the unusual properties of both query patterns and RDF streams (Cole and Conley 2009, p.422). Most importantly, the design of these data structures allows the handling of the intermediate mappings and small data items contained in the processing state. Worth noting, these data structures include low-maintenance-cost indexes that support high throughput in the
  • 16. operations of probing used in various operator implementations (Abdulla and Matzke 2006). In the context of this kind of data, there was a need to propose various algorithms enabling incremental evaluation of basic operators, including duplicate elimination, join, and aggregation (MacLennan and Tang 2009, p.453). In short, these algorithms aim at overcoming the typical issues involved in incremental evaluation of windowing operators. 1.3.4 The techniques for optimization for multiway joins This thesis explores the use of adaptive optimisation techniques to improve the performance of multiway joins (Abdulla and Matzke 2006, p.456), one of the most expensive query operators in the query pipeline (Cole and Conley 2009, p.472). Practically, the adaptive cost model is used to design two adaptive algorithms for the dynamic optimisation of multiway-join queries. 1.4 The Outline of This Thesis The remainder of this thesis is organised as follows: Chapter 2 explores the general background on Linked Data processing and stream processing. Chapter 3 presents the background to the RSP engines (CQELS and C-SPARQL). Chapter 4 touches on the state of the art in LSDP, the Linked Stream Data Processing. Chapter 5 explores the optimisation solutions for CQELS. Chapter 6 mainly explores the RDF engine for continuous queries, C-SPARQL. Chapter 7 evaluates the RSP engines framework, and finally, Chapter 8 presents the conclusion of this thesis and points to future work.
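The incremental windowing idea of Section 1.3.3 can be sketched concretely. The following is a minimal Python illustration, not code from the thesis: on each arriving timestamped triple, the window inserts the new element and evicts only the triples that have expired, so downstream operators receive small insert/expire deltas instead of a full window re-evaluation. The class name and the toy sensor triples are invented for the example.

```python
from collections import deque

class SlidingTripleWindow:
    """Time-based sliding window over a stream of timestamped RDF triples.

    Incremental evaluation: each arrival inserts one triple and evicts
    only the triples that fell out of the window range, instead of
    re-evaluating the whole window (the re-evaluation computation).
    """

    def __init__(self, range_seconds):
        self.range = range_seconds
        self.window = deque()  # (timestamp, (s, p, o)) in arrival order

    def push(self, timestamp, triple):
        """Insert one triple; return the (inserted, expired) deltas."""
        self.window.append((timestamp, triple))
        expired = []
        # Timestamps arrive in order, so expired triples sit at the front.
        while self.window and self.window[0][0] <= timestamp - self.range:
            expired.append(self.window.popleft()[1])
        return triple, expired

# Usage: a 10-second window over a toy sensor stream.
w = SlidingTripleWindow(range_seconds=10)
w.push(1, (":sensor1", ":hasReading", "20.5"))
w.push(5, (":sensor2", ":hasReading", "21.0"))
inserted, expired = w.push(12, (":sensor1", ":hasReading", "19.8"))
print(expired)  # only the reading at t=1 has left the window
```

Because only the deltas are propagated, a downstream join or aggregation can update its state in time proportional to the change, which is the property the thesis's windowing algorithms aim for.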
  • 17. 17 Chapter 2: The General Background This chapter explores the background techniques and concepts for Linked Data processing and stream processing. It also provides the fundamentals of stream processing as applicable to Linked Stream Data. In short, the chapter discusses the representation of continuous semantics, basic techniques and models, the operators and methods of optimisation, and the handling of issues such as memory overflow and time management (MacLennan and Tang 2009). In addition, the chapter defines the semantics of SPARQL and Resource Description Framework data model queries and the relevant notations, and gives an overview of how Resource Description Framework data is stored and queried using SPARQL. 2.1 Introduction The term ‘heuristic’ derives from the Greek for ‘to discover’ or ‘to find’ (Calhoun and Riemer 2001). Heuristics is a common practice applied in many industry fields to observe, learn, and spot errors, malware, and other problems by use of experience. For example, a well-modelled heuristics technique is enlisted in antimalware programs to learn and spot computer threats such as Trojan horses, viruses, and worms. The learning and observation aspect of the heuristics framework operates by scanning computer documents and capturing the signatures that differentiate them (Chen 2009). After reading the unique signatures of the computer files, such as tiny macros, find commands, or even subroutines, the heuristics engine uses its memory and experience to identify previously read threats. According to CIKM 2006 Workshops (2006), heuristics entail a suite of rules geared towards enhancing the probability of identifying and ironing out problems in a given structure.
  • 18. 18 When applied in the computer science field, a heuristic is considered an algorithm engineered to present viable solutions to glitches arising in a given scenario. The heuristics discipline generally examines how information is studied, captured, and discovered. When engaged in artificial intelligence, computer science, or mathematical optimisation, heuristics engines work to decipher problems in a fast and efficient way when the conventional methods misbehave, are not fast enough, or fail to calculate accurate solutions (Cheung et al. 2006, p. 49). When the heuristics path is chosen after the failure of conventional methods, it is seen as a shortcut, as it speeds up the process. As Cohen (1985) says, heuristics can either work in isolation, generating solutions by themselves, or in combination with optimisation algorithms, all geared towards increasing the RSP’s effectiveness (Gedik 2006). The more advanced version of heuristics thoroughly inspects and then traces the guidelines put in the code of programs prior to passing them over to the computer’s processing unit for execution. This helps the heuristics engine assess and learn the behaviour and mannerisms of a program while it runs in a virtual setting. The current querying strategies enlisted in CQELS and C-SPARQL waste a lot of valuable time performing incorrect and inept queries that may be keyed in by an end user who is not quite familiar with the intricate querying descriptions, says Gore (1964). Much as the database servers within the CQELS and C-SPARQL systems may recognise these inefficient queries, the end computer users and internet browsers are not aware of these incorrectly stated queries and, hence, may continue running them. As this happens, the overall performance and speed of the language engines is incrementally impaired, executing fewer and fewer data retrievals per unit time. 
In a bid to remedy this system slowdown, users opt to refer the issue to the DBA to help them code efficient queries. However, this DBA consultation results in time wastage as well. This is where the incorporation of a
  • 19. 19 heuristics engine comes into play. By assimilating a heuristics function into the querying of the CQELS and C-SPARQL languages, a substantial amount of time and querying effort will be saved (Cheung et al. 2006, p. 57). The heuristics function will serve as a query optimiser that skims through the input user query, inspecting it thoroughly to highlight and remove any detected errors. According to Mcllroy (1998), unlike the DBA that recognises the lapses in the queries yet does nothing about them, the heuristics will automatically generate an equivalent but highly optimised query. By spotting and rectifying the inaccuracies inherent in the queries input by end users, the heuristics function discards both the time-consuming execution of inaccurate queries and the time expended consulting the DBA for viable solutions. In this way, the system productivity and throughput will always be on an upward curve. The frequency of accesses will fall as the heuristics reduce, and where possible eliminate, the number of tuples and columns scanned; hence, data stream processing and querying accuracy will improve. An effort to integrate these kinds of information sources would enable a broad range of near-real-time applications in areas such as green information technology, smart cities, and e-health (Cole and Conley 2009, p.19). However, harvesting such data remains a labour-intensive and difficult task due to the heterogeneous nature of the vast streams; such a process needs a lot of hand-crafted methods. The remedy for this scenario involves the application of the Resource Description Framework (RDF) data model (Schreiber 1977, p.38). In practice, this type of data model allows one to express knowledge in a generic way. 
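As an illustration of the kind of rewriting a heuristics-based optimiser might perform (this is a sketch under simplifying assumptions, not the thesis implementation; the class name and selectivity rule are hypothetical), the Java fragment below applies two simple heuristics to a list of triple patterns before execution: it drops duplicate patterns a careless user may have typed twice, and orders the remaining patterns so that the most selective ones, those with the fewest variables, run first:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashSet;
import java.util.List;

/** A sketch of two pre-execution rewrite heuristics: (1) deduplicate
 *  triple patterns, and (2) reorder patterns so the most selective
 *  ones (fewest variables) are evaluated first, shrinking the tuples
 *  scanned by later operators. */
public class HeuristicRewriter {

    /** Count variables (terms starting with '?') in a triple pattern. */
    static int variableCount(String pattern) {
        int n = 0;
        for (String term : pattern.trim().split("\\s+"))
            if (term.startsWith("?")) n++;
        return n;
    }

    /** Deduplicate, then sort by ascending variable count. */
    static List<String> rewrite(List<String> patterns) {
        List<String> unique = new ArrayList<>(new LinkedHashSet<>(patterns));
        unique.sort(Comparator.comparingInt(HeuristicRewriter::variableCount));
        return unique;
    }

    public static void main(String[] args) {
        List<String> query = List.of(
            "?person ?p ?o",               // 3 variables: least selective
            "?person :worksAt :insight",   // 1 variable: most selective
            "?person :worksAt :insight");  // duplicate typed by the user
        // Prints the deduplicated, reordered pattern list.
        System.out.println(rewrite(query));
    }
}
```

Fewer-variables-first is only one possible selectivity heuristic; a production optimiser would combine it with cardinality statistics, but even this naive rule removes redundant work before the engine touches any data.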
It is also worth noting that the model does not require adherence to any particular schema (MacLennan and Tang 2009, p.67). Efforts are underway to lift stream data to the semantic level, driven by the semantic stream/sensor community and the W3C semantic
  • 20. 20 network incubator group (Maringer 2005). Essentially, the primary goal of this process is to make stream data available according to the principles of Linked Data; this concept is referred to as Linked Stream Data (Schreiber 1977, p.103). Ordinarily, Linked Data facilitates data integration among heterogeneous collections (Buchanan and Shortliffe 1984). Linked Stream Data pursues similar goals for data streams (Schreiber 1977, p.89) and, furthermore, assists in bridging the gap between streams and the more numerous static data sources. Besides a unified data representation model, there is also a requirement for a processing engine that can support continuous queries over both Linked Data and Linked Stream Data (Cole and Conley 2009, p.107). In classical Linked Data processing, there is always an assumption that data is stored in a centralised repository and changes infrequently before additional processing (MacLennan and Tang 2009, p.102). According to research (e.g. Zhang and Kollios 2007, p.51), updates are typically limited to just a small fraction of the dataset; they happen infrequently, and in some cases the database is simply replaced by a new version. Queries in traditional relational databases are both ‘one-time’ and ‘pull’ based (Schreiber 1977, p.139): a query is executed after reading the data from disk, and the output is a set of results for a single point in time (Cole and Conley 2009, p.137). Linked Stream Data, on the other hand, produces new items continuously. In fact, a data item is only valid within a time window, and it consistently gets pushed to the processing query (Buchanan and Shortliffe 1984, p.99). 
In practice, queries are registered only once and then continuously evaluated over time against a dataset that
  • 21. 21 changes; in short, queries are continuous (MacLennan and Tang 2009, p.139). In effect, the appearance of new data triggers updates of the continuous query results (Abdulla and Matzke 2006, p.97). This continuity of queries and the temporal aspect of Linked Stream Data are not considered by Linked Data query processing engines (Cole and Conley 2009, p.148). Data Stream Management Systems (DSMSs) appear to be better candidates for processing continuous queries (Zhang and Kollios 2007, p.167). Ordinarily, a DSMS could serve as the sub-component that deals with the stream data. In practice, the only problem is that no traditional DSMS supports the Resource Description Framework, which makes a data-transformation step necessary (Schreiber 1977, p.108). However, in most cases, such data-transformation overhead can be very costly in the low-latency context of stream data processing (Sims and Yocom 2008, p.109). Furthermore, delegating processing to a sub-system such as a DSMS means losing full control over query execution (Cole and Conley 2009, p.145), and optimisation can only be done locally in each of the subsystems (Schreiber 1977, p.143). In this case, each subsystem is used as a black box and is only optimised for its own query patterns, data model, and data distribution. According to research (e.g. Buchanan and Shortliffe 1984, p.152), the difficulty of predicting the structure of Resource Description Framework graphs poses challenges for traditional DSMSs; moreover, they cannot effectively scale to large quantities of Resource Description Framework data (Schreiber 1977, p.154). 
Worth noting, this difficulty of prediction also applies to Resource Description Framework based data streams (Sims and Yocom 2008, p.151). In effect, it makes it
  • 22. 22 tough for the optimisers of Data Stream Management Systems to handle them. These optimisation problems of DSMSs have been solved only in certain ad-hoc and restricted scenarios (Cole and Conley 2009, p.162), and a good number of areas still present open problems and challenges (MacLennan and Tang 2009, p.173). In addition, most of the optimisation algorithms are heuristic, and they only prove to work for certain kinds of data and queries. In essence, these facts motivated me to develop a heuristics-based optimisation solution implementation for two RSP engines (C-SPARQL and CQELS) by the use of Java code for optimisation (Sims and Yocom 2008, p.182). In practice, my approach aims to build engines with high processing performance for Linked Stream Data by combining algorithms, re-engineered efficient data structures, and techniques from both traditional Data Stream Management Systems and Linked Data processing. According to several studies (such as Abbass and Newton 2002, p.135; Sims and Yocom 2008, p.127), it is not good practice to store Resource Description Framework data elements in relational tables. Rather, careful design of the indexing schema and physical storage plays a vital role in the performance of triple storages (Schreiber 1977, p.94). This approach therefore aims to design a native data structure that treats both Resource Description Framework and Resource Description Framework stream data elements as first-class citizens (Cole and Conley 2009, p.142). Most importantly, the continuous changing of the data during the lifetime of a query requires adaptivity in its processing. Such a requirement led to the introduction of the adaptive execution framework known as Continuous Query Evaluation over Linked Streams, or CQELS (Cole and Conley 2009, p.177). 
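The importance of indexing-schema design for triple storages can be sketched in a few lines of Java. The following toy store (an illustrative sketch under stated assumptions, not the CQELS storage layer; the class and method names are hypothetical) keeps one hash index per triple position, so any lookup with at least one bound term becomes a hash probe rather than a full scan:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** A minimal native triple store treating triples as first-class
 *  citizens: one hash index per triple position (subject, predicate,
 *  object), so a lookup with any bound term is a probe, not a scan. */
public class TripleStore {
    record Triple(String s, String p, String o) {}

    private final Map<String, Set<Triple>> bySubject = new HashMap<>();
    private final Map<String, Set<Triple>> byPredicate = new HashMap<>();
    private final Map<String, Set<Triple>> byObject = new HashMap<>();

    public void add(String s, String p, String o) {
        Triple t = new Triple(s, p, o);
        bySubject.computeIfAbsent(s, k -> new HashSet<>()).add(t);
        byPredicate.computeIfAbsent(p, k -> new HashSet<>()).add(t);
        byObject.computeIfAbsent(o, k -> new HashSet<>()).add(t);
    }

    /** Probe the index for a bound position; null means "any value". */
    public Set<Triple> match(String s, String p, String o) {
        Set<Triple> candidates =
            s != null ? bySubject.getOrDefault(s, Set.of())
          : p != null ? byPredicate.getOrDefault(p, Set.of())
          : o != null ? byObject.getOrDefault(o, Set.of())
          : null;
        if (candidates == null) throw new IllegalArgumentException("unbounded scan");
        Set<Triple> out = new HashSet<>();
        for (Triple t : candidates)
            if ((s == null || t.s().equals(s))
             && (p == null || t.p().equals(p))
             && (o == null || t.o().equals(o))) out.add(t);
        return out;
    }

    public static void main(String[] args) {
        TripleStore store = new TripleStore();
        store.add(":alice", ":knows", ":bob");
        store.add(":alice", ":worksAt", ":insight");
        System.out.println(store.match(":alice", null, null).size()); // 2
        System.out.println(store.match(null, ":knows", null).size()); // 1
    }
}
```

Real triple stores index full position permutations (e.g. SPO, POS, OSP) rather than single positions, but the design point is the same: the storage layout is shaped by the triple pattern lookups the query operators will issue.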
This framework is designed to apply adaptive
  • 23. 23 processing techniques to meet the performance requirements of stream processing (Buchanan and Shortliffe 1984, p.103; Zhang and Kollios 2007, p.171). Moreover, this framework allows full control of the continuous execution process, where both optimisation and scheduling can take place at run time (Schreiber 1977, p.67). In the process, I had to create a new continuous query language as one of the first works in Linked Stream Data processing (Cole and Conley 2009, p.191). The evaluation of Linked Stream Data processing engines, and the first survey of its kind conducted during this thesis, provide insight into how to build an efficient Linked Stream Data engine. In this thesis, we advance the integration of a heuristics engine into the query optimisation of CQELS as well as C-SPARQL to augment the query execution and data streaming processes. The query optimisation operations will be performed using Java code that serves as the optimiser in both RSP engines. Implementing the optimiser in Java will speed up operations and the query optimisation function in general. As MacLennan and Tang (2009) claim, the code will allow end users to unambiguously express their queries and reduce the input of imprecise queries. This will also help cut the incremental costs of computation with respect to the projection, selection, and join functionalities, as well as other cost factors such as processor and communication time. As the data and ontology constituents of the Web 3.0 have stabilised through the assimilation of golden standards such as OWL and RDF, the optimisation and solution implementation of heuristics-based querying is next on the to-do list. 
The assimilation and solution implementation of the heuristics utility is outlined in this thesis as follows: Section 1 discusses how heuristics can be employed in query
  • 24. 24 optimisation to minimise the pertinent costs. In the proposed heuristic algorithm, a query is scanned and executed by use of the magic trees in the storage files, which in turn demonstrates a significant advance over previous optimisation approaches. The cost-based algorithm shows that the system’s enhancement continues to improve as the query becomes more interlaced and dense, as the user performs more intricate searches. Section 2 discusses how heuristics can be enlisted in the Java code to significantly reduce erroneous query executions by instinctively recognising and amending inefficiencies in CQELS and C-SPARQL queries. The detection and rectification of flaws within the queries will consequently save the huge amounts of time and effort expended by the RSP engines in retrieving information, thus enhancing the overall throughput and productivity of the engines. Section 3 demonstrates the impact of heuristics in its capacity to execute queries without involving join operations. The exclusion of join operations in query optimisation will help shrink operational costs in addition to making the RDF data volume less bulky. The empirical results confirm that the proposed heuristics model outperforms conventional querying techniques, such as Jena, by 79% in terms of the reduction of pointless intermediate results and a faster query processing time. 2.2 Comparative and Survey Evaluations Essentially, the first experiments and survey help to compare, and give insight into, the techniques of data stream processing and the Linked Stream Data processing engines (Abdulla and Matzke 2006, p.487; Zhang and Kollios 2007, p.378). Additionally, the first cross-system evaluation of Linked Stream Data processing engines is presented. 
A scenario that integrates human-centric streaming data from the digital and physical worlds, similar to Live Social Semantics, serves as an inspiration (MacLennan and Tang 2009, p.474). Data from the physical world are captured and streamed
  • 25. 25 through tracking systems and sensors such as wireless triangulation, RFID, and GPS, and the integration can be done with virtual streams such as city traffic data, Twitter feeds, and airport information to deliver up-to-date views or location-based services for any particular situation (Cole and Conley 2009, p.479). Furthermore, the conference scenario mainly focuses on the problem of data integration between the data streams from a tracking system and a static dataset (Abdulla and Matzke 2006). The tracking system, similar to the various real deployments in Live Social Semantics, is used to gather the relationship between physical spaces and the real-world identifiers of the conference attendees. Moreover, the non-stream datasets, for example the attendees’ online information such as profiles, social networks, and publication records, are used to correlate with the tracking data (Cole and Conley 2009, p.482). In essence, there are several benefits to correlating the two sources of information (MacLennan and Tang 2009, p.453). Most importantly, conference rooms could be assigned to talks automatically based on the number of people whose profiles suggest an interest in attending (Cole and Conley 2009, p.491). In addition, conference attendees could be notified about co-authors found within the location (Abdulla and Matzke 2006, p.423; Buchanan and Shortliffe 1984, p.403; Zhang and Kollios 2007, p.348). A service could also easily suggest which talk to attend based on citation records, profiles, and the distance between the talk locations. 
In practice, social stream data of interest to a user is spread among various social application platforms such as Twitter, Facebook, Foursquare and so on (MacLennan and Tang 2009, p.496). Additionally, social network analysis and aggregation
  • 26. 26 platforms such as Bottlenose require an integration of heterogeneous streams from various feeds and social networks (Abdulla and Matzke 2006, p.437; Buchanan and Shortliffe 1984, p.428). Most importantly, these kinds of platforms can easily use Linked Stream Data processing engines to deal with the issues of data integration (Cole and Conley 2009, p.504). In the same context, this scenario focuses on the aggregation of the different social stream sources that social network users create (MacLennan and Tang 2009, p.511). Social networks provide rich sources of interesting stream data, including photo uploads and sequences of social discussions (Cole and Conley 2009, p.521), and they are considered an excellent test bed for Resource Description Framework engines. Furthermore, the Resource Description Framework can also exhibit its merits in representing graph data (MacLennan and Tang 2009, p.527). Ordinarily, the skewed data distributions and correlations found in real life occur frequently in social network data, and the efficient handling of correlations is recognised as a very difficult problem for database engines (Abdulla and Matzke 2006, p.484; Buchanan and Shortliffe 1984, p.503; Zhang and Kollios 2007, p.509). On the other hand, it also opens up many opportunities for query optimisation (MacLennan and Tang 2009, p.539). In the context of this scenario, it becomes possible to build a data simulator exploiting the different skewed data distributions and the correlations available in a social network (Abdulla and Matzke 2006, p.437). As a consequence, the data simulator is useful in generating realistic test cases to evaluate the Linked Stream Data processing engines. 
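The data simulator idea above can be sketched in Java. The following fragment (an illustrative sketch, not the thesis simulator; the user count, Zipf exponent, and seed are assumptions) generates a synthetic social stream whose "posting user" follows a Zipf-like skew, so that generated test cases reflect the skewed distributions found in real social network data:

```java
import java.util.Random;

/** A sketch of a skewed-stream simulator: draw posting users from a
 *  Zipf-like distribution so that a few low-rank users dominate the
 *  stream, mimicking the skew of real social network data. */
public class SkewedStreamSimulator {
    private final double[] cumulative;   // cumulative Zipf probabilities
    private final Random random;

    public SkewedStreamSimulator(int users, double exponent, long seed) {
        double[] weights = new double[users];
        double total = 0;
        for (int rank = 1; rank <= users; rank++)
            total += weights[rank - 1] = 1.0 / Math.pow(rank, exponent);
        cumulative = new double[users];
        double running = 0;
        for (int i = 0; i < users; i++)
            cumulative[i] = running += weights[i] / total;
        random = new Random(seed);
    }

    /** Draw the next posting user's rank; low ranks dominate. */
    public int nextUser() {
        double u = random.nextDouble();
        for (int i = 0; i < cumulative.length; i++)
            if (u <= cumulative[i]) return i + 1;
        return cumulative.length;
    }

    public static void main(String[] args) {
        SkewedStreamSimulator sim = new SkewedStreamSimulator(1_000, 1.0, 42);
        int topUserPosts = 0;
        for (int i = 0; i < 10_000; i++)
            if (sim.nextUser() == 1) topUserPosts++;
        // With Zipf skew the single top user accounts for a large share.
        System.out.println(topUserPosts);
    }
}
```

Feeding such skewed streams to an engine is precisely what exposes how well its join and aggregation operators cope with correlated, non-uniform data.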
It is important to note that various parts of this thesis have earlier been published as workshop, conference and journal articles (MacLennan and Tang 2009, p.544). Furthermore,
  • 27. 27 there was an introduction of the first attempt at building a heuristics-based query optimisation solution implementation for RSP engines in several studies (such as Abbass and Newton 2002), alongside the Linked Stream Data processing engines and the Data Stream Management Systems (MacLennan and Tang 2009, p.587). On the other hand, the RSP engines such as CQELS and C-SPARQL are readily covered in studies (such as Abdulla and Matzke 2006, p.463; Buchanan and Shortliffe 1984, p.401; Zhang and Kollios 2007, p.409). 2.3 Query Optimisation Maringer (2005) describes query optimisation as a querying function interspersed throughout a multitude of information systems and database frameworks. All query languages, be they structured (SQL) or unstructured (C-SPARQL and CQELS), enlist query optimisation functionalities to establish the shrewdest and most adept channel for executing a query that has been keyed in by a user. Such functionalities encompass query optimisers, such as the PostgreSQL optimiser or Java code (run on the Java Runtime Environment), that analyse and carefully assess the SQL, C-SPARQL, or CQELS queries to determine the most effectual mechanism for query execution. The querying of database systems happens almost every other minute of the day and, thus, query optimisation is just as frequent (Cheung et al. 2006, p. 64). Anyone browsing the internet doing either simple or complex research engages query optimisation in Database Management Systems (DBMS) when requesting a piece of information from the respective databases. For example, if you are searching for a Social Security number, the financial statements of a company, a country’s demographics, or even trying to compute the average pay of all the civil workers in the Department of Agriculture in your regional state, you are querying the distinctive databases.
  • 28. 28 If, for instance, you are interested in investing in Ernst and Young LLP (a multinational audit firm), you will obviously want to find out how it is performing in the market and its overall productivity compared against other industry benchmarks. To locate such information, you will log in to the company’s database system and request its financial statements, ratios, and key market/performance indicators. A query soliciting the financial ratios of Ernst and Young LLP will look like this: “find the consolidated balance sheet of Ernst and Young.” Before the balance sheet appears on your computer screen, a number of procedures occur, featuring a query plan. After you submit this query, the parser within the database parses it and then hands it over to the query optimiser, which hatches several query plans in accordance with their resource costs (Moustakas 1990). The most efficient plan, in terms of cost and time consumption, is chosen, after which the database server accesses the pertinent database data and produces the desired results. The prime focus of the query optimisation function of databases is centred on expeditious and prompt query execution so as to deliver the desired results as quickly as possible (Mueller 2009, p. 34). Time consumption tops the list in the determination of the best query plan for a given query. Any marginal time variance between alternative query plans will prompt the query optimiser to select the option that is fastest and consumes the least amount of time. However, the optimisation function is still lacking in regards to time efficiency and conservation, as most querying processes involve redundant executions of intermediate results within the join operations. 
These join operations, together with other accompanying costs such as the projection and selection functionalities as well as processor time, downscale the communication time of the data results in addition to increasing the computational costs. As the selected query plan works, it makes use of various algorithms with which it collaborates to manipulate and combine tables of
  • 29. 29 data from the database structure so as to produce the requested knowledge material (Nirmal 1990, p.388). These manipulations and combinations of data tables are called join operations and, in the retrieval of real-time streaming data such as financial statistics, they slow down the data streaming process. Additionally, the processing of the intermediate results needed in the join operations contributes to making the RDF data volume bulky, thus impeding operations and the engine’s speed overall (Cheung et al. 2006, p.69). All these issues call for programmers to construct the query optimisation function of the RSP engines around a heuristics solution and to implement this solution to improve RDF stream processing. 2.4 RDF Stream Processing and Semantic Web The recent deployment of the Semantic Web in divergent industry sectors, such as logistic planning in military fields, engineering analysis, health care, and the life sciences, has proved its worth in data search automation and information technology upscaling. According to Zhang and Kollios (2007), the Semantic Web contributes to an instinctual and spontaneous web application that browses the precise information from linked data sources. The application works by collecting, filtering, and sampling data items captured from differential sensor plants and stored as ontologies in RDF formats (see Figure 1: Semantic Web processing).
  • 30. 30 Proposed by Tim Berners-Lee in 2001, the Semantic Web (Web 3.0) has, so far, showcased some data processing differences between its database management and that of earlier frameworks such as the World Wide Web (Web 1.0). While the Web 1.0 operates by dislodging the physical storage and networking layers, the Web 3.0 upgrades this tedious and seemingly slower process by dismissing the document and application layers. Much as the search engines on the World Wide Web index a majority of the content stored on the Web, they still lack the instinctive capacity to select the articles and web pages that an end user really desires. Rather than connecting documents and data structures like the Web 1.0, Web 3.0 capitalises on its metadata base and ever-evolving compilation of knowledge to connect facts and meaning. This is what enables the Semantic Web to build on the intuitiveness and self-description that help context-understanding programs find the exact pages a user is looking for. As Sims and Yocom (2008, p.411) convey, the Web 3.0 has gained its technological leverage over the Web 1.0 by its cutting-edge means of data storage, querying, and information display. The data storage means incorporated in this new technique involves matching data sources to ontologies that are stored in a structured form in the Resource Description Framework (RDF). Unlike the natural text formats that Web 1.0 utilises in data storage and retrieval, the Semantic Web models the data items sourced from diverse sensor plants into a comprehensive descriptive language to make the query processes and information display easy and friendly enough for all Internet browsers. As Abbass and Newton (2002) illustrate in their journal article, the RDF comprises a descriptive structuring of data used for information exchange on the net. 
As the semantic metadata reads information from sensor plants, it filters and stores this information in a format that is easily readable by both the machine and the computer user. Engineered by the World
  • 31. 31 Wide Web Consortium (W3C), the RDF integrates the use of query languages and descriptive statements and conjunctions (e.g. has, is) to provide relevant information about web resources that a user may search for. For example, if you want to find out about the current U.S. president (a web resource), you will type in “The U.S. has a current president in office.” As seen from this statement, there is an entity-relationship data model in the form of a subject-predicate-object expression. This model is the strategy the RDF makes use of when searching for information. Thus, the RDF is the language that exhibits web data by use of minimally constraining, meaningful, and constructive expressions. To incrementally expand RDF’s efficiency, we have to further advance the aspect of heuristics in the querying of RDF data stream processing engines.
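The subject-predicate-object model from the example above can be written out as data. In this Java sketch (illustrative only; the resource names such as :USA and :hasPresident are hypothetical), a fact is a triple, and a query is a triple pattern whose variable terms (prefixed with '?') match any value:

```java
/** The RDF entity-relationship model as data: a statement is a
 *  (subject, predicate, object) triple, and a query pattern matches a
 *  fact when every non-variable term is equal. */
public class TripleModel {
    record Triple(String subject, String predicate, String object) {}

    /** A pattern term matches when it is a variable ("?x") or equal. */
    static boolean termMatches(String patternTerm, String dataTerm) {
        return patternTerm.startsWith("?") || patternTerm.equals(dataTerm);
    }

    static boolean matches(Triple pattern, Triple data) {
        return termMatches(pattern.subject(), data.subject())
            && termMatches(pattern.predicate(), data.predicate())
            && termMatches(pattern.object(), data.object());
    }

    public static void main(String[] args) {
        // "The U.S. has a current president" as a triple, plus a query
        // pattern asking who that president is.
        Triple fact = new Triple(":USA", ":hasPresident", ":somePerson");
        Triple query = new Triple(":USA", ":hasPresident", "?who");
        System.out.println(matches(query, fact)); // true
    }
}
```

This is exactly the structure SPARQL-family languages query over: a graph of such triples, probed with patterns whose variables are bound by matching.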
  • 32. 32 Chapter 3: Background to RSP Engines 3.1 C-SPARQL Barbieri et al. (2010, p. 20) define C-SPARQL as an advanced language, a stretch of the SPARQL query language that observes windows and recent triples of RDF data streams while simultaneously allowing the streams to flow. The continuous streaming of queries by Continuous SPARQL (C-SPARQL) facilitates the interoperability of RDF formats and implements crucial applications that allow researchers to access the ever-evolving information of web resources. Wei (2011, p. 101) refers to C-SPARQL as an orthogonal extension of the conventional SPARQL grammar, making SPARQL a congruent component of C-SPARQL. C-SPARQL builds on SPARQL through its capability of combining static RDF with real-time streaming data for purposes of stream reasoning. Much as SPARQL has cemented its viability in querying RDF repositories, Barbieri et al. observe that it is still lacking in producing continuous, flowing data streams (Abbass and Newton 2002, p. 21). Stream-based data emitters, encompassing stock quotations, click streams, and news feeds, emit real-time continuous information. However, SPARQL is still limited in its efficiency at storing entire streams; therefore, the Data Stream Management Systems (DSMS) register consecutive queries in static forms. The invention of C-SPARQL is thus based on its capacity to merge static data with streaming data, a procedure that mobilises logical reasoning in real time over large and noisy data streams. 3.2 CQELS According to Abbass and Newton (2002), the Continuous Query Evaluation over Linked Streams (CQELS) constitutes an adaptive and instinctive schema for supporting Linked Stream
  • 33. 33 Data, whose grammar is derived from SPARQL 1.1, thus making them compatible. The congruence of the two query languages (CQELS and SPARQL 1.1) raises the performance level of CQELS above other continuous query languages. CQELS has been engineered with the sole objective of enlisting the white-box approach, which functions by implementing the prerequisite query operators natively to obviate all overhead costs plus any other restrictions of closed system regimes (Schreiber 1977). CQELS offers flexibility and updatability in its execution structures, as the inherent query processors continuously readjust to changes in the incoming data. Examples of such continuous queries are contained in papers such as CF02, HFAE03, CDTW00, and ABB+02. These queries, however, are quite simple and only applicable to general-purpose event processing. This thesis proposes the assimilation of heuristics into the query execution of CQELS to enable the continuous reordering of its operators, thus improving query applicability in complex situations, not just general-purpose ones. The interspersion of the heuristics engine in the querying of RDF data streams is, hence, crucial and fundamental to the upscaling of RDF stream processing, as it greatly minimises the lengthiness of the join operations. Besides lessening the inherent time consumption, the heuristics will additionally help spot and rectify any flaws that occur in the queries that users may input while searching for useful information in given databases. In general, the heuristics functionality will have a double role in the query optimisation of RDF stream processing: one, to shrink the duration of intermediate-result processing for join operations, and, two, to discard the errors contained in queries, hence curbing flawed query execution and, in turn, escalating time savings during query optimisation.
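To illustrate the window-based extension the two RSP engines share, the following Java fragment holds the kind of continuous query a host program would register with an engine. The query text is an illustrative example in the C-SPARQL style described above; the stream URI, prefix, and window sizes are assumptions, not taken from the thesis:

```java
/** An illustrative continuous query held as the string a Java host
 *  program would hand to an RSP engine. The FROM STREAM clause with a
 *  [RANGE ... STEP ...] window is what extends plain SPARQL: the
 *  query is registered once and re-evaluated as the window slides. */
public class ContinuousQueryExample {
    static final String QUERY = String.join("\n",
        "REGISTER QUERY RecentQuotes AS",
        "PREFIX ex: <http://example.org/stock#>",      // hypothetical prefix
        "SELECT ?quote ?price",
        "FROM STREAM <http://example.org/quotes> [RANGE 30s STEP 5s]",
        "WHERE { ?quote ex:hasPrice ?price }");

    public static void main(String[] args) {
        // The window clause is what distinguishes this from static SPARQL.
        System.out.println(QUERY.contains("[RANGE 30s STEP 5s]")); // true
    }
}
```

Static SPARQL would answer this query once against stored data; here, the engine keeps only the last 30 seconds of the stream in scope and refreshes the answer every 5 seconds.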
  • 34. 34 Section 1: Cost-Based Heuristics Optimisation Approach 3.2.1 Introduction The move to consolidate heuristics into the query optimisation aspect of RSP engines is ingenious and groundbreaking, to say the least. The implementations of heuristics are geared towards cutting computational costs during the query optimisation and join operations executed within the C-SPARQL and CQELS languages. This section outlines in depth how enlisting the heuristics function helps minimise the costs estimated in terms of the overall time spent by the optimiser to select the most effectual query plan/tree that will execute a given query in the least time possible, thus lessening the CPU and input/output costs. The CQELS and C-SPARQL DBMS optimisers endeavour to boil down to a single, most feasible query plan for the given query statements. In the query optimisation world, pinning down a suitable plan is contingent upon which mechanism has the least time duration as well as the most minimal costs in terms of query execution factors like communication, the processor, and the input/output expenses. These costs are a very critical factor and get utmost consideration during the selection of the most ideal query plan tree (Abbass and Newton 2002). When a query is input into an RDF database, the Database Management System (DBMS) initiates a selection course geared towards determining the most potent path to follow to give results in the shortest route possible. This course entails the optimiser devising several path plans from which it chooses the most ideal one to utilise. All these hatched path plans, when followed, output equivalent data or information. However, they differ in regards to their cost expenses, specifically in terms of how much time each plan consumes to finalise the data retrieval process and generate the data desired by the computer user or researcher, claim Abbass
  • 35. 35 and Newton (2002). The selection criterion hinges upon a critical question: which path plan will take the least time to reach and deliver the user information? The optimisation course revolves around a myriad of circumstances such as how a query is stated, the access methods, the information layout, and the data set size (Oracle Help Center 2016). The access frameworks are quite influential in this stage of optimisation, as they dictate whether the data should be accessed by use of index scans or full table scans. Suppose Path A requires an index scan that will take 2 minutes while Path B requires a full table scan that will take 2.5 minutes; by this estimation, Path A will be chosen. As much as the conventional optimisers in CQELS and C-SPARQL strive to hatch the most feasible execution plan, there are still gaps in this feature. Lots of processor time and communication time as well as input/output costs are still considerably high. This section outlines the trends in query optimisation observed before and after the assimilation of heuristics, thus demonstrating the positive cost-saving impact achieved after its integration. When a query is submitted to the database server, it undergoes a certain traverse within the DBMS modules; it adheres to this sequence until the final results are generated (see Figure 2). These constituent DBMS modules consist of a scanner, parser, query optimiser, code generator, and query processor. As Abbass and Newton (2002) explain, the scanner scrutinises the inherent language tokens, for example, the relation names and CQELS/C-SPARQL keywords in the context of the query statement. The parser then follows by certifying the query syntax, its validity, and whether the attribute names are semantically correct. After this, it transforms the query expression into an internal representation that is machine-readable, using a query tree or sometimes a query graph.
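The plan selection just described (all candidate plans return the same data, so the one with the lowest estimated time wins) can be sketched in a few lines. The plan names and minute figures below mirror the Path A/Path B example and are illustrative only.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the optimiser's final choice: every candidate plan
// produces equivalent results, so the plan with the lowest estimated cost
// (here, minutes of execution time) is selected.
public class PlanChooser {
    public static String cheapest(Map<String, Double> planCosts) {
        return Collections.min(planCosts.entrySet(),
                Map.Entry.comparingByValue()).getKey();
    }

    public static void main(String[] args) {
        Map<String, Double> costs = new HashMap<>();
        costs.put("index scan (Path A)", 2.0);      // estimated minutes
        costs.put("full table scan (Path B)", 2.5); // estimated minutes
        System.out.println(cheapest(costs));        // prints "index scan (Path A)"
    }
}
```

Real optimisers combine CPU, I/O, and communication components into one cost figure before this comparison; the selection step itself is exactly this minimum.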
The tree’s data structure is sketched by means of a calculus expression (Abbass and Newton 2002). The query optimiser comes into play by reading the machine-readable instruction
  • 36. 36 and then forming a multitude of execution plan strategies. The optimiser finally chooses the most amenable path by assessing all pertinent algebraic expressions relating to the input query, favouring the cheapest and shortest one. The code generator then works to create a viable code that requests the query processor to execute the plan projected by the optimiser (MacLennan and Tang 2009, p.242). Scanner → Parser → Optimiser → Code generator → Query processor
  • 37. 37 Figure 2: Query flow through a DBMS As mentioned above, the query optimiser explores relevant algebraic expressions contained within the various algorithms generated by the DBMS for query searches. Traditional algorithms have always zeroed in on exhaustively enumerating all the alternatives available to empower query searches. However, as explained by Abbass and Newton (2002), this exhaustive technique is defective when it comes to solving complex queries, as the algorithms cannot enumerate all possible (millions of) options in a short, convenient time. Rather, the timing is quite long and tiring for the user waiting for the results. This occurrence is evident when an algorithm has to enumerate join orders for a query whose resulting data is contained in 50 tables. The process of enumerating all these 50 tables and joining the data items can take up several minutes before results are delivered, thus failing in speed and cost efficiency. To solve this drawback, a heuristics solution has been implemented in both the CQELS and C-SPARQL optimisation processes. This heuristics solution activates an algorithm that checks the storage file in the DBMS to confirm whether there is a ready-to-use query plan that matches the new input query. If the ready-to-use query plan exists in the storage file, the algorithm uses it to execute the new query expression, thus eradicating the need to create a new query plan. This ultimately saves the processing time meant for developing the new query plan as well as the input/output costs (MacLennan and Tang 2009, p.42). Also, the communication time spanning between the input of the query and the output of the data results is shortened. This improvement in processor time/cost and communication time continues to increase as time proceeds and as queries get more intricate. 3.2.2 Proposed heuristics approach
  • 38. Figure 3: Binary tree 38 The heuristics solution proposed in this thesis advocates for a change in the sequence of query execution from a normal binary tree to a magic tree that is stored in the given storage file. The move to change the sequence of execution steps allows the DBMS to save computational costs and time as well (MacLennan and Tang 2009, p.221). In the absence of heuristics, the query optimiser normally formulates a binary query tree (see Figure 3), which it uses to derive numerous path plans before choosing the most optimal alternative. The formulation of the binary tree calls for redundant operations such as the join, filter, and projection functionalities every time a query search is initiated within the DBMS. This redundancy contributes greatly to the compilation of operational expenses (join, filter, and projection), the time involved in the performance of these functionalities, as well as the processor and communication time. Frequent join executions, particularly, make the volume of RDF data being accessed extremely voluminous and bulky, which in turn makes the manipulation of data repositories more complicated. However, the addition of heuristics ensures that these binary trees are replaced with a much more efficient methodology, the magic tree. The magic tree differs from the conventional
  • 39. Figure 4: Magic tree 39 binary tree by its innovative way of setting all the constituent variables (join, filter, and projection) to only one wing of the tree (see Figure 4). Each of these distinctive variables is then allocated a specific weight by the algorithm, after which the total weight is used to calculate the cost of the variables in the tree. The criterion for assigning the individual weights is dependent on the amount of time spent by each variable during query processing; therefore, the computational time correlates with the attached weights (MacLennan and Tang 2009, p.232). The magic tree reorders marked variables such as the projection stem of the binary query tree and eliminates the redundancy implemented in binary projection mechanisms. For example, suppose the applicable cost within the projection stem is x units. If we administer a projection fifteen times on a nested query, the aggregate cost in the customary binary tree will be 15 * x units. The proposed heuristics magic tree, however, shifts the projection facet to one state, such that if the projection operation is to be administered on the same nested query, it needs to be administered only once; thus the total cost of processing would be x units only. Figure 5 below depicts the algorithm proposed by the heuristics solution.
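The 15 * x versus x arithmetic above can be captured directly; the unit cost value below is a hypothetical stand-in for x.

```java
// Illustrative cost model from the text: a projection costs x units per
// application. The binary tree re-applies projection on every nesting level,
// while the magic tree applies it once on its single projection stem.
public class ProjectionCost {
    public static int binaryTreeCost(int applications, int unitCost) {
        return applications * unitCost;   // e.g. 15 * x for 15 nested projections
    }

    public static int magicTreeCost(int unitCost) {
        return unitCost;                  // projection administered once: x
    }

    public static void main(String[] args) {
        int x = 4;                                 // hypothetical unit cost
        System.out.println(binaryTreeCost(15, x)); // prints 60
        System.out.println(magicTreeCost(x));      // prints 4
    }
}
```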
  • 40. 40 Table 1: Algorithm 1 Function: Compose a Magic Tree. a) Query parsing. b) Transformation of the query expression into a machine-readable statement. c) Forming a query tree or graph, depending on the calculus expression used. d) The selection entity shifts to the head node of the query tree. e) Elimination of all candidate selection entities available. f) Formation of all the dependent groupings; these are shifted to one wing of the tree. g) All the leaf nodes are relations; the process therefore halts once it reaches a leaf. h) The query processor begins the search query course of action. i) Once the query processor discovers the data target, it heads over to the projection stem, where all the other pertinent functionalities are conducted. As MacLennan and Tang (2009, p.144) claim, heuristics has always been a viable solution for modern computational problems, more so those that deal with voluminous data sets such as streaming data from telecommunications and industrial plants. The algorithms embedded in heuristics functions help solve optimisation and complex real-world problems, improving on the time, costs, and space required in deciphering computational inquiries. In our case, the effect of heuristics may not be felt or seen immediately, but after a while the cost-saving impacts will surely become visible. This is because of the working psychology assumed by heuristics. As explained above, during the early implementation stages of the heuristics, the entity operates by first monitoring how applications work. It performs meticulous appraisals and
  • 41. 41 evaluations of how program applications, in this case the query optimisation process, are run and traces all these moves and formulas into its memory. By this, it has created a virtual image of the functioning of all the steps involved during a query search, from when the query is input to when data results are displayed on the screen. The more advanced version of heuristics thoroughly inspects and then traces the guidelines put in the code of programs prior to passing them over to the computer’s processing unit for execution. This helps the heuristics engine to assess and learn the behaviour and mannerisms of a program while it runs in a virtual setting. As soon as its memory is packed with the application performance information, it starts using this information to revamp activities and even cultivate better channels for enhanced task execution. In the case of RDF stream processing, a user can input the same query over and over again over a given period of time, for example when retrieving information about a certain tweet or when researching the manufacture status of a phone from its manufacturer. For every single time that a query search is initiated for such a research function, the parser must form a query tree for each search before handing it over to the query optimiser and code generator to formulate the code needed in the actual processing of the query statement. Building a query tree for each and every query search of the same research question consumes a great deal of communication time and processing expense in the absence of a heuristics engine (MacLennan and Tang 2009, p.39). This time, physical storage space, and processing cost are what we aim to eradicate in our RDF stream processing. In a heuristics environment, however, the redundant formations of the same query tree, their optimisations, and the final query processing are noted in the heuristics’ memory.
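The memoisation just described — building a tree once per distinct query and then serving repeats from memory — can be sketched as follows. The class and method names are illustrative, not the thesis's actual implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the heuristics "memory": a query tree is built only
// on the first submission of a query text; later submissions reuse the
// stored tree, skipping the expensive construction step.
public class QueryTreeCache {
    private final Map<String, String> memory = new HashMap<>();
    private int treesBuilt = 0;

    // Stand-in for parsing + tree building (the expensive step).
    private String buildTree(String query) {
        treesBuilt++;
        return "tree(" + query + ")";
    }

    public String lookupOrBuild(String query) {
        return memory.computeIfAbsent(query, this::buildTree);
    }

    public int treesBuilt() { return treesBuilt; }

    public static void main(String[] args) {
        QueryTreeCache cache = new QueryTreeCache();
        cache.lookupOrBuild("SELECT ?status WHERE { ... }");
        cache.lookupOrBuild("SELECT ?status WHERE { ... }"); // served from memory
        System.out.println(cache.treesBuilt());              // prints 1
    }
}
```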
Hence, if the same research question is entered yet again, the parser will just proceed to the heuristics’ memory and retrieve the query tree that was noted before, instead of building a new one all over again. Therefore, the time that could
  • 42. 42 have otherwise been expended in the query tree formation has been saved, and, in turn, the communication time has been minimised too. The query search proposed by this heuristic is as shown in Figure 6. Table 2: Algorithm 2 Function: The Projected Heuristics Query Search. a) A query tree is crafted for each query expression that is submitted into the database system. b) Then, the heuristics function reads and stores this binary tree in a dedicated storage folder for that particular query tree. c) The storage folder is then assigned a unique company usage factor for easy identification by the parser, such that the maximum quantity of storage folders generated equals the company usage factor (c.u.f.). d) Following this, the heuristics devises a unique magic tree that shifts all the dependent variables (join, select, and projection) in the binary tree to one side of the tree. e) When a similar query is submitted by a user, the parser first confirms from the storage folder whether there is an equivalent query tree that can be utilised for that input inquiry. f) If there is an equivalent stored tree, it will proceed to the precise branch node required for processing the inquiry at hand and perform all the relevant courses of action. g) However, if there is no such tree, it will consult the magic tree stored there, and, if successful, it will halt further searches and perform all the relevant courses of action necessitated. h) However, if all these searches fail such that there is no equivalent branch node even in the magic tree, the parser will resort to generating a new magic tree as depicted in the first
  • 43. 43 algorithm, thus increasing the storage folder counter. Lastly, the database server will refresh the folder in the event that the counter is less than the company usage factor. This is commendable because the number of folders should equal the company usage factor (MacLennan and Tang 2009, p.19). 3.2.3 Results simulation This section puts into actual practice, through simulation, this theoretical novel approach of heuristics assimilation into an RDF stream processing engine, to confirm whether the prototype makes good on its promise. The RDF engines tested herewith consist of the CQELS and C-SPARQL languages. Simulation here refers to the manner in which the heuristics replication was conducted over a specified period of time (6 months). A model of the heuristics query optimisation engine was replicated in a Java Runtime Environment (JRE) running on a computer powered by the Windows operating system. With the help of the JRE, we wrote some core Java code, which was later compiled and run in a Java Eclipse environment to execute the given RDF data streams. The code was written in Java and employed the concept of class handling. The data structuring integrated in the query tree went hand in hand with dynamic memory allocation that primarily used linked lists. The outcome of the analysis was as expected; the integration of heuristics across the RSP engines board improved cost-saving by shrinking the processor operational costs. A heuristics approach was implemented in the CQELS and C-SPARQL query languages to form magic trees and also perform the selections earlier. As MacLennan and Tang (2009, p.66) explain, the heuristics database engine is exploited in the early performance of selections. This action considerably reduces the size and magnitude of the RDF graph databases, hence speeding up the query search process overall. For example, if we reflect on the following CQELS and C-SPARQL query expressions (see Figure 7), applying
  • 44. 44 heuristics is beneficial in terms of how it executes the selection entities very early in the process, hence minimising the communication time. Table 3: Query 1 The customary query processing of these CQELS and C-SPARQL query expressions would have initiated the formation of a binary query tree as depicted in Figure 3. With heuristics, however, the database engine will form a magic tree (see Figure 4) that shifts the selection variable to one side of the tree. As MacLennan and Tang (2009, p.41) note, yes, the initial query processing stages of the heuristics approach will absorb some costs in constructing as well as searching the magic tree. Nonetheless, these costs will be significantly lower compared to those expended in the formation and execution of the binary trees. The implementation of the magic tree likewise reduces all other computational costs involved, since the frequency of the selection variables also decreases. This cost-saving is evident in the comparison of the estimated cost calculations of both methods: the binary tree and magic tree query processing. As for the traditional binary tree, its aggregate running costs are 100 units, while the incurred expenses for the magic tree are 50 units only. Supposing a new query is input for the first time by a user, the database server will incur seemingly high expenditures in both the formation of the binary tree as well as the conversion of this binary tree into a magic tree. However, in the next
  • 45. 45 round, there will be no conversion costs, as the magic tree will be readily available in the heuristics’ storage folder. Additionally, the communication and processor costs will decrease to the same degree as the conversion costs, as the parser will automatically reach for the magic tree branch nodes. Figure 5 demonstrates the cost versus time chart comparing the conventional query processing versus our projected heuristics-based CQELS and C-SPARQL query optimisation strategies. Figure 5: Cost versus time graph As shown in Figure 5, the preliminary costs are somewhat high, but as the heuristics functionality continues to track, learn, and store the magic trees in its folders, the overall computational expenditure decreases with time (Cheung et al. 2006, p. 43). To elucidate this phenomenon: as a new query is fed into an RDF format database, all the constituent stages conducted during a tree match search are carried out: parsing, query tree building, syntax checking, attribute name confirmation, optimisation, and code generation. These activities contribute to the evidently high cost expenditure as well as huge time consumption (MacLennan and Tang 2009, p.71). As time goes by, the heuristics entity monitors the query search procedure, identifying the redundant parsing and optimisation sequences, and creating a way out. It achieves this by tracing a particular binary tree in its storage folder and, from
  • 46. 46 this, derives a magic tree that matches it. Therefore, in the subsequent standard query searches, there will be no need to create yet another new binary tree for a similar inquiry (MacLennan and Tang 2009, p.83). Instead, the magic tree will be retrieved from the storage file for a duplicate tree matching, hence saving the computational conversion time and costs. The heuristics application becomes even better with the execution of nested queries, as the data results are delivered much faster and more efficiently (see Figure 9). Further simulations of the heuristics algorithm can also extend join properties such as the right and left joins. Figure 6: Performance versus complexity 3.2.4 The performance comparison graph between the new improved model and the previous version of CQELS and C-SPARQL Most of the considered systems are works in progress and scientific prototypes. Unsurprisingly, they are not able to support all the query patterns and features. The outputs of the new improved model and the previous version of CQELS and C-SPARQL are significantly different because of their differences in implementation. These differences in performance mainly result from intrinsic technical issues concerning the methods of handling streaming data, such as a potentially fluctuating execution environment and time management.
  • 47. 47 Table 4: Performance Comparison by Features

System | Special support | Input | Extras
C-SPARQL | TF | RDF and RDF streams | –
CQELS | NEST, VoS | RDF and RDF streams | Disk spilling
Streaming SPARQL | – | RDF streams | –
SPARQL stream | NEST | Relational streams | Ontology-based mapping
EP-SPARQL | EVENT, TF | RDF and RDF streams | Event operators

EVENT: event pattern, VoS: variables on stream, TF: built-in time function, NEST: nested patterns.

Table 5: Performance Comparison by the Mechanism of Execution

System | Re-execution | Optimisation | Architecture | Scheduling
C-SPARQL | Periodical | Static and algebraic | Black box | Logic plan
CQELS | Eager | Adaptive and physical | White box | Adaptive physical plans
Streaming SPARQL | Periodical | Static and algebraic | White box | Logic plans
SPARQL stream | Periodical | Externalised | Black box | External call
EP-SPARQL | Eager | Externalised | Black box | Logic program
  • 48. 48 Figure 7: Graphical performance comparison As the graphs show, the throughput in the scalability and performance tests of C-SPARQL is considerably lower than that of CQELS and JTALIS. For this reason, it is clear that the recurrent execution is likely to waste significant computing resources. A sliding window extracts the recurrences, and the outputs can be incrementally computed as a stream. Notably, the outputs of JTALIS and CQELS are useful in answering the recurrent queries. Query 1 involves counting the number of items over a tumbling window of one second. Of note, however, this query uses a physical time window. For statistically robust results, the computation is done as an average of twenty executions. The main reason for doing this is the variable execution time, which also depends on the condition of the machine.
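The behaviour of Query 1 — counting items per tumbling one-second window — can be sketched as below. The timestamps are synthetic; a real engine would evaluate this incrementally as the stream arrives rather than over an array.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch of Query 1: count items per tumbling one-second
// window. Each event carries a millisecond timestamp; because the windows
// tumble (no overlap), the window id is simply timestamp / 1000.
public class TumblingCount {
    public static Map<Long, Integer> countPerWindow(long[] timestampsMs) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (long ts : timestampsMs) {
            counts.merge(ts / 1000, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        long[] ts = {100, 450, 999, 1001, 1500, 2100}; // spans three windows
        System.out.println(countPerWindow(ts));        // prints {0=3, 1=2, 2=1}
    }
}
```

Averaging over twenty executions, as the text describes, would simply repeat this measurement and divide the summed running times by twenty.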
  • 49. 49 Notice that CQELS performs better than JTALIS because it uses both an adaptive and a native approach. The performance of JTALIS and C-SPARQL heavily depends on some of the underlying systems, which include a Prolog engine and a relational stream processing engine respectively. In similar fashion, CQELS is likely to benefit from a more sophisticated, optimised algorithm as compared to the current one. The only system that indexes and precomputes the intermediate results over the static data from sub-queries is CQELS. However, both C-SPARQL and CQELS do not scale well when the number of queries increases, even for queries sharing data windows and similar patterns. Additionally, the tests show that neither of the systems uses multiple-query optimisation techniques to avoid redundant computations among the queries that share computing memory and blocks. In this case, the optimisation only occurs at the static and algebraic level, since both Streaming SPARQL and C-SPARQL schedule the execution at a logical level (MacLennan and Tang 2009, p.102). On the contrary, CQELS can choose alternative execution plans composed from the available physical implementations of the operators. In effect, the optimiser adaptively optimises the execution at the physical level.
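Adaptive optimisation at the physical level, as attributed to CQELS above, amounts to re-choosing a physical operator implementation as input characteristics change. The sketch below illustrates the idea with a made-up cardinality threshold; it is not CQELS's actual rule.

```java
// Illustrative sketch of adaptive physical-plan selection: before each
// re-execution, the optimiser picks a physical join implementation based on
// the current (estimated) input cardinalities. The threshold is a made-up
// illustration of the trade-off, not an engine's real cost model.
public class AdaptiveJoinPicker {
    public static String pick(long leftCardinality, long rightCardinality) {
        long smaller = Math.min(leftCardinality, rightCardinality);
        // A hash join pays a build cost but wins on large inputs; a
        // nested-loop join is cheaper when one side is tiny.
        return smaller < 16 ? "nested-loop join" : "hash join";
    }

    public static void main(String[] args) {
        System.out.println(pick(1_000_000, 8));     // prints "nested-loop join"
        System.out.println(pick(1_000_000, 5_000)); // prints "hash join"
    }
}
```

A static (algebraic-level) optimiser makes this choice once at registration time; an adaptive engine re-evaluates it as window contents evolve.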
  • 50. 50 Both SPARQL stream and EP-SPARQL schedule the execution through a logic program or a declarative query. In this case, they fully delegate the optimisation to other systems (Seshadri and Leung 1998). The technique used in improving the result involves the definition of mappings, triple patterns, RDF triples, and other operations on mappings, and the reuse of notations. Under the instantaneous RDF dataset and RDF stream, the temporal nature of data is essential and requires capturing in the representation of data in the continuous processing of dynamic data. This applies to both kinds of data source, because updates to Linked Data collections are also possible. A dataset satisfying G(t + 1) = G(t) for all t ≥ 0, i.e., G(t) = G for all t ∈ N, is called an instantaneous RDF dataset. Pattern matching is the main primitive operation on both the instantaneous RDF dataset and the RDF stream (MacLennan and Tang 2009, p.88). Notice that the triple pattern of the SPARQL semantics extends the pattern matching. As a consequence, the use of notations of denotational semantics becomes helpful for the formal definition of the query patterns of the processing model. The denotations are the meaning functions of the semantic compositions of the abstract syntax. These compositions comprise a total of three operators, namely relational, pattern matching, and stream operators. Pattern matching operators extract triples from a dataset or an RDF stream that are valid and match a given triple pattern at a certain time t, as shown below. Pattern matching operator’s abstract syntax. The meaning of the triple matching pattern operator PG is defined in the same way as SPARQL on an RDF dataset at a given timestamp t, as follows
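The displayed formula at this point was an image in the source and did not survive conversion. A hedged reconstruction, following the standard SPARQL pattern-matching semantics that the surrounding text appeals to (with µ a mapping, var(P) the variables of pattern P, and G(t) the dataset at timestamp t, as defined above), would read:

```latex
% Hedged reconstruction of the missing display: the triple matching pattern
% operator evaluated against an instantaneous RDF dataset G at timestamp t.
[\![P]\!]_{G}^{t} \;=\; \left\{\, \mu \;\middle|\;
    \operatorname{dom}(\mu) = \operatorname{var}(P)
    \ \wedge\ \mu(P) \in G(t) \,\right\}
```

Here µ(P) denotes the triple obtained by substituting the variables of P according to µ; the window-based operator on an RDF stream, introduced next, restricts the same definition to triples whose timestamps fall inside the window.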
  • 51. 51 Next is the definition of the window-based triple matching operator on an RDF stream. The composability of the denotational semantics results in the definition of the abstract syntax for the compound query pattern constructed from both the logical operators and the matching operators. Additionally, the definition of the aggregation operator comes before the definition of its syntax and semantics (MacLennan and Tang 2009, p.99). Notice that a uniform mapping set contains only the mappings that have similar domains. In this case, a consistent mapping is defined on an aggregate operator set Ω. The relational operators’ abstract syntax is therefore defined recursively as shown below. The mapping of the operators therefore becomes as shown below. Under the streaming operators’ abstract syntax, the streaming operator becomes either an RDF stream or a relational stream derived from the above relational operators.
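An aggregate operator over a set Ω of uniform mappings (mappings sharing the same domain), as described above, can be sketched concretely. The variable names are hypothetical, and the aggregate shown is a simple COUNT grouped by one variable.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of an aggregate operator over a set of uniform
// mappings: every mapping binds the same variables, and the aggregate
// function consumes the set and returns a new set of (group, value) results.
public class AggregateSketch {
    public static Map<String, Integer> countBy(List<Map<String, String>> mappings,
                                               String groupVar) {
        Map<String, Integer> result = new HashMap<>();
        for (Map<String, String> mu : mappings) {
            result.merge(mu.get(groupVar), 1, Integer::sum);
        }
        return result;
    }

    public static void main(String[] args) {
        // A uniform set Ω: every mapping has domain {?sensor, ?reading}.
        List<Map<String, String>> omega = Arrays.asList(
            Map.of("?sensor", "s1", "?reading", "20"),
            Map.of("?sensor", "s1", "?reading", "21"),
            Map.of("?sensor", "s2", "?reading", "19"));
        System.out.println(countBy(omega, "?sensor").get("s1")); // prints 2
    }
}
```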
  • 52. 52 Next is the definition of the declarative query language CQELS-QL, or CQELS query language, for the execution framework of CQELS. Additionally, the SPARQL grammar in EBNF notation helps in the definition of CQELS-QL. The first step is the addition of a query pattern for the representation of window operators on an RDF stream.
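To illustrate what the added window query pattern looks like in the two languages discussed in this chapter, the strings below paraphrase their published syntax. The stream IRI, prefix, and variable names are hypothetical, and exact keyword forms vary between language versions.

```java
// Illustrative query texts only (syntax paraphrased from the literature).
// CQELS-QL embeds the window operator inside the graph pattern, while
// C-SPARQL declares the window in the FROM clause.
public class WindowQueryExamples {
    public static final String CQELS_QL =
        "SELECT ?person WHERE { "
      + "STREAM <http://example.org/tracking> [RANGE 3s] "
      + "{ ?person ex:detectedAt ?loc } }";

    public static final String C_SPARQL =
        "SELECT ?person "
      + "FROM STREAM <http://example.org/tracking> [RANGE 3s STEP 1s] "
      + "WHERE { ?person ex:detectedAt ?loc }";

    public static void main(String[] args) {
        System.out.println(CQELS_QL.contains("[RANGE 3s]"));  // prints true
        System.out.println(C_SPARQL.contains("FROM STREAM")); // prints true
    }
}
```

The placement difference mirrors the architectural contrast drawn earlier: CQELS treats the window as a native pattern-level operator, whereas C-SPARQL scopes the window over an entire stream source.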
  • 53. 53 Chapter 4: State of the Art in LSDP or Linked Stream Data Processing According to Gedik (2006), Linked Stream Data derives its usefulness from bridging the gap between Linked Data and data streams, and from facilitating the integration of data between them. Treating the elements of stream nodes as RDF enables the query processor to access RDF streams in the form of materialised data (Abdulla and Matzke 2006, p.907; Buchanan and Shortliffe 1984, p.777; Cole and Conley 2009, p.809; Zhang and Kollios 2007, p.733). Notably, the whole process makes it possible to apply the other SPARQL query patterns (Cheung et al. 2006, p.444). In short, this chapter explores both the techniques and concepts of processing streams and introduces Linked Stream Data Processing engines (Calhoun and Riemer 2001, p.447). Additionally, the inclusion of the CQELS engine in the chapter helps clarify the contribution of this field. 4.1 Query Semantics and Data Models This section mainly explores the possible ways of formalising the data model for Resource Description Framework datasets and Resource Description Framework streams in a continuous context (Cole and Conley 2009, p.931). Additionally, it touches on the continuous query semantics. 4.2 Data Model It is important to note that the modelling of Linked Stream Data occurs by extending the meaning of both RDF triples and RDF nodes (Cohen 1985, p.303). An RDF stream is a bag of elements, each consisting of an RDF triple together with a temporal annotation such as a time interval or a timestamp. An interval-based label consists of a pair of timestamps. In common cases,
  • 54. 54 natural numbers help in representing logical time (Eastwood 2008, p.278). Labels such as ‘start’ and ‘end’ represent a pair of timestamps, and they are useful in specifying the valid interval in which the Resource Description Framework triple holds (Dean 2009, p.264). On the other hand, a point-based label is just a single natural number that represents the point in time at which the triple was received or recorded (Buchanan and Shortliffe 1984, p.708). One may see point-based labels as less expressive and redundant compared to interval-based labels. However, point-based labels are less expensive than interval-based labels, and the former can be considered an important special case of the latter, where start = end. According to research (e.g. Abbass and Newton 2002, p.946), Streaming SPARQL finds interval-based labels useful for representing the items of its physical data stream, and EP-SPARQL uses them in the representation of triple-based events. For a streaming data source, a point-based label turns out to be more practical because it allows for the instantaneous and unexpected generation of a triple. A good example is the use of a tracking system to detect people at an office (Buchanan and Shortliffe 1984, p.707). Notably, this kind of activity results in the generation of a triple with a timestamp any time the system receives a reading from a sensor. To derive the valid interval of the triple, the system must do further processing and buffer the readings (Bolton 1996, p.407). Furthermore, instantaneous point-based labels play a vital role for the applications that require processing the data immediately it arrives in the system. Additionally, the concept of the Resource Description Framework dataset must be included in the data model to enable the integration of stream data with non-stream data.
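The two temporal annotations just contrasted can be sketched in one small class; the class name is illustrative, with logical time represented by natural numbers as in the text.

```java
// Illustrative sketch of the two temporal annotations: an interval-based
// label carries a pair of timestamps (start, end), and a point-based label
// is the special case start = end.
public class TemporalLabel {
    public final long start, end;   // logical time as natural numbers

    public TemporalLabel(long start, long end) {
        if (end < start) throw new IllegalArgumentException("end < start");
        this.start = start;
        this.end = end;
    }

    // Point-based label: a single timestamp, i.e. start = end.
    public static TemporalLabel point(long t) { return new TemporalLabel(t, t); }

    // The triple annotated with this label is valid at time t iff t falls
    // inside [start, end].
    public boolean validAt(long t) { return start <= t && t <= end; }

    public static void main(String[] args) {
        TemporalLabel interval = new TemporalLabel(3, 7);
        TemporalLabel point = TemporalLabel.point(5);
        System.out.println(interval.validAt(5)); // prints true
        System.out.println(point.validAt(6));    // prints false
    }
}
```

This also makes the buffering remark concrete: a sensor can emit `point(t)` the instant a reading arrives, whereas producing a genuine interval label requires waiting to learn the `end` timestamp.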
Primarily, the Resource Description Framework dataset is always considered a static data source by the current state of the art. In light of the findings (e.g. by Abbass and Newton 2002,
  • 55. 55 p.944), it is important to note that data stream applications can run for periods ranging from days to years. In addition, changes in the Resource Description Framework dataset during the lifetime of a query must be reflected in the continuous query’s outputs. 4.3 Query Semantics The current state of the art extends the semantics of SPARQL-like query operators such as union, join, and filter. In practice, these operators consume and output mappings (Abbass and Newton 2002, p.556). In addition, it also introduces operators on the Resource Description Framework streams that output mappings. Worth noting, C-SPARQL defines its stream operator to access a Resource Description Framework stream that is identified by its IRI (Cohen 1985, p.301). Additionally, the window operator is defined to help in accessing a Resource Description Framework stream based on certain windows. Essentially, the window operator is adopted on Resource Description Framework streams in relation to CQL (Cole and Conley 2009, p.954). It is also important to note that the semantics of a continuous query on Resource Description Framework streams is defined as a composition of query operators. Practically, a query is composed as an operator graph in both Streaming SPARQL and C-SPARQL (Dean 2009, p.237). SPARQL helps to base the definition of the query graph on the query operators. 4.4 Query Languages There is a need for the introduction of query patterns for expressing the primitive operators in order to fully define a declarative Linked Stream Data query language (Abdulla and Matzke 2006, p.956; Buchanan and Shortliffe 1984, p.561; Zhang and Kollios 2007, p.654). In practice, these patterns are window matching, triple matching, and sequential operators
(Eastwood 2008, p.509). The composition of these basic query patterns can then be expressed with the AND, OPT, UNION, and FILTER patterns of SPARQL; these patterns correspond to the operators in the earlier definitions. To support aggregation operators, several studies (e.g. Abdulla and Matzke 2006, p.966; Buchanan and Shortliffe 1984, p.906; Zhang and Kollios 2007, p.749) define their semantics with the AGG query pattern, which is compatible with the other SPARQL patterns. The evaluation of the query pattern AGG is defined as [[P AGG A]] = A([[P]]), where A is an aggregate function that consumes the output of a SPARQL query pattern P and returns a set of mappings. Letting P, P1, and P2 be basic or composite query patterns, a declarative query is then composed recursively using rules of the form [[P1 UNION P2]] = [[P1]] ∪ [[P2]], [[P1 AND P2]] = [[P1]] ⋈ [[P2]], [[P1 OPT P2]] = [[P1]] ⟕ [[P2]], [[P AGG A]] = A([[P]]), and [[P FILTER R]] = {µ ∈ [[P]] | µ ⊨ R}. In practice, these patterns extend the grammar of SPARQL for continuous queries. C-SPARQL extends SPARQL with a CONSTRUCT clause whose triple patterns define the Resource Description Framework stream output; in essence, the grammars of streaming SPARQL and C-SPARQL are the same.

In practice, the uses of databases are manifold (Jeuring 2012, p.417): they provide a means of retrieving either parts of records or entire records and of performing various calculations before displaying the outcomes (Abdulla and Matzke 2006, p.504; Buchanan and Shortliffe 1984, p.703; Cole and Conley 2009, p.968; Zhang and Kollios 2007, p.974). The query language is the interface that specifies such manipulations (Lucas 2010, p.608). Early query languages were very complex, so interaction with electronic databases was restricted to individuals with specialist knowledge (MacLennan and Tang 2009, p.673). Modern interfaces are more user-friendly and allow casual users to access the information in a database. The main query modes of this kind are the menu, fill-in-the-blank, and structured query (Gedik 2006, p.422). The menu requires an individual to choose from alternatives displayed on a monitor and is particularly suitable for novices (Maringer 2005, p.342). The fill-in-the-blank technique prompts the user to enter key words as search statements (Moustakas 1990, p.623). The structured query approach is very effective with relational databases: it has a powerful formal syntax that is, in practice, a programming language, and it can accommodate logical operators (Mueller 2009, p.506). The Structured Query Language (SQL) takes forms such as SELECT fields Fa, Fb, Fc..., Fn FROM databases Da, Db, Dc… Dn WHERE field Fa = ‘abc’ AND field Fb = ‘def’. Several studies (e.g.
Abdulla and Matzke 2006, p.678; Buchanan and Shortliffe 1984, p.985; Zhang and Kollios 2007, p.992) show that the structured query language supports searching the database, among other activities, through commands such as ‘sum’, ‘print’, ‘find’, and ‘delete’ (Nirmal 1990, p.496). A natural-language query resembles an ordinary sentence, whereas SQL requires its formally structured statements. It is also possible to represent queries in the form of tables: the technique known as QBE (query by example) displays an empty form. According to Mcllroy (1998), the searcher then enters the appropriate search specifications into the appropriate columns, and the program constructs the corresponding SQL query from the table as it performs the execution (Zhang and Kollios 2007, p.997). Natural language is the most flexible query language (Abdulla and Matzke 2006, p.911; Buchanan and Shortliffe 1984, p.703; Zhang and Kollios 2007, p.707), and some commercial database management software allows natural-language sentences to be used as constraints for searching databases (Schreiber 1977, p.781). Such programs parse the query, recognize synonyms and action words (Abdulla and Matzke 2006, p.1002; Buchanan and Shortliffe 1984, p.734; Zhang and Kollios 2007, p.836), identify the names of files and fields, and perform the required logical operations (Seshadri and Leung 1998, p.699). Furthermore, natural-language queries in the spoken voice have advanced thanks to the acceptance of experimental systems (Sims and Yocom 2008, p.1003). However, employing unrestricted natural language to query unstructured information requires further advances in machine understanding of natural language (Wei 2011, p.354), chiefly in representing the programmatic and semantic context of ideas.
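The QBE process described above — collecting per-column constraints from an empty form and constructing the SQL statement from the filled-in table — can be sketched as follows. This is a minimal illustration, not an actual QBE implementation; the `books` table, its columns, and the `qbe_to_sql` helper are invented for the example.

```python
import sqlite3

def qbe_to_sql(table, fields, constraints):
    """Build a parameterised SELECT from QBE-style per-column constraints."""
    sql = f"SELECT {', '.join(fields)} FROM {table}"
    if constraints:
        sql += " WHERE " + " AND ".join(f"{col} = ?" for col in constraints)
    return sql, list(constraints.values())

# Hypothetical example data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (title TEXT, author TEXT, year INTEGER)")
conn.executemany("INSERT INTO books VALUES (?, ?, ?)",
                 [("Dune", "Herbert", 1965), ("Emma", "Austen", 1815)])

# The "empty form" filled in by the searcher: one constraint per column.
sql, params = qbe_to_sql("books", ["title", "year"], {"author": "Herbert"})
print(sql)    # SELECT title, year FROM books WHERE author = ?
print(conn.execute(sql, params).fetchall())   # [('Dune', 1965)]
```

Using placeholders (`?`) rather than splicing the entered values into the statement mirrors how such programs keep the generated SQL safe to execute.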
Chapter 5: The Optimization Solutions for the CQELS

In essence, the CQELS execution framework supports native and adaptive query execution over RDF datasets and RDF streams (Bolton 1996, p.404). The framework's white-box architecture accepts both RDF datasets and RDF streams as inputs and returns outputs as either relational streams or RDF streams in the SPARQL result format (Abdulla and Matzke 2006, p.702; Buchanan and Shortliffe 1984, p.497). Output RDF streams can be fed into any CQELS engine (Wei 2011, p.4078), while the relational streams can be consumed by other relational stream processing systems (Cheung et al. 2006, p.497). The processing works as follows: stream data is pushed to the input manager, and the encoder encodes it into the normalised input stream representation (Cole and Conley 2009, p.1007); the dynamic executor consumes this encoded input; and the decoder decodes the outputs of the dynamic executor and streams them to the receiver (Abdulla and Matzke 2006, p.749). The decoder and the encoder share a dictionary for the decoding and encoding operations. The dynamic executor accesses the static RDF datasets via the cache fetcher; these datasets can be hosted in either remote or local RDF stores exposed through SPARQL endpoints (Cole and Conley 2009, p.1011). The cache fetcher retrieves the required data and encodes it for the cache manager using the encoder (Wei 2011, p.507). The normalised representation also encodes the data of intermediate results, which share the same dictionary as the input stream.
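The encoder/decoder arrangement above, in which both ends share one dictionary so that input streams and intermediate results stay in a compact normalised representation, can be sketched as follows. This is a simplified illustration of dictionary encoding in general; the class and the example URIs are assumptions, not CQELS internals.

```python
class Dictionary:
    """Shared term dictionary: maps RDF terms to integer IDs and back."""
    def __init__(self):
        self.term_to_id = {}
        self.id_to_term = []

    def encode(self, term):
        # Assign the next free ID on first sight of a term.
        if term not in self.term_to_id:
            self.term_to_id[term] = len(self.id_to_term)
            self.id_to_term.append(term)
        return self.term_to_id[term]

    def decode(self, term_id):
        return self.id_to_term[term_id]

# Encoder and decoder share the same dictionary instance, as in the pipeline.
d = Dictionary()
triple = ("http://ex.org/sensor1", "http://ex.org/hasReading", '"21.5"')
encoded = tuple(d.encode(t) for t in triple)   # normalised representation
decoded = tuple(d.decode(i) for i in encoded)  # restored by the decoder
print(encoded)   # (0, 1, 2)
print(decoded == triple)   # True
```

Because the dynamic executor works only on small integer IDs, comparisons and joins over the stream avoid repeated string handling; the full terms are materialised again only when results are streamed to the receiver.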