This thesis aims to explore implementing heuristics-based query optimisation solutions in RDF stream processing engines like CQELS and C-SPARQL. It proposes developing an adaptive execution framework and linked data stream processing model with algorithms and data structures for efficient window operator evaluation and multiway join optimisation. Extensive experiments will evaluate the performance of the extended RSP engines and compare them to the original versions of CQELS and C-SPARQL.
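The window-operator evaluation the proposal targets can be illustrated with a minimal sketch. The `TimeWindow` class and the triple layout below are illustrative assumptions, not code from CQELS or C-SPARQL:

```python
from collections import deque

class TimeWindow:
    """Minimal time-based sliding window over a stream of (timestamp, triple) pairs."""
    def __init__(self, width):
        self.width = width          # window width, in the stream's time units
        self.buffer = deque()       # triples currently inside the window

    def insert(self, timestamp, triple):
        self.buffer.append((timestamp, triple))
        # evict triples that have fallen out of (timestamp - width, timestamp]
        while self.buffer and self.buffer[0][0] <= timestamp - self.width:
            self.buffer.popleft()

    def contents(self):
        return [t for _, t in self.buffer]

w = TimeWindow(width=10)
w.insert(1, (":sensor1", ":hasReading", "20"))
w.insert(5, (":sensor2", ":hasReading", "21"))
w.insert(14, (":sensor1", ":hasReading", "22"))
print(w.contents())   # the triple from t=1 has been evicted
```

A real engine would additionally support step/slide parameters and feed the window contents into continuous query evaluation; the eviction-on-insert loop above is the core of the operator.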
Performance evaluation of Map-reduce jar pig hive and spark with machine lear... (IJECEIAES)
Big data is one of the biggest challenges of our time: making decisions over it requires huge processing power and good algorithms. We need a Hadoop environment with Pig, Hive, machine learning, and other Hadoop ecosystem components. The data comes from industries, from the many devices and sensors around us, and from social media sites. According to McKinsey, there will be a shortage of 15,000,000 big data professionals by the end of 2020. Many technologies exist to address the problem of big data storage and processing, among them Apache Hadoop, Apache Spark, and Apache Kafka. Here we analyse the processing speed for 4 GB of data on CloudxLab using Hadoop MapReduce with varying numbers of mappers and reducers, Pig scripts, Hive queries, and a Spark environment combined with machine learning. From the results we can say that machine learning with Hadoop, together with Spark, enhances processing performance; that Spark is better than Hadoop MapReduce, Pig, and Hive; and that Spark with Hive and machine learning gives the best performance compared with Pig, Hive, and the Hadoop MapReduce jar.
A Survey on Data Mapping Strategy for data stored in the storage cloud (NavNeet KuMar)
This document describes a method for processing large amounts of data stored in cloud storage using Hadoop clusters. Users upload data to cloud storage, and MapReduce algorithms then run on Hadoop clusters to analyze the data in parallel. The results are stored back in the cloud for users to download. The proposed architecture involves a controller that directs requests to Hadoop masters, which coordinate nodes to perform the mapping and reducing of data according to the implemented algorithm.
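The map/shuffle/reduce pipeline the method runs on Hadoop clusters can be sketched in plain Python. This is a single-machine illustration of the programming model, not Hadoop's own API:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit (word, 1) for every word in every input split."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts emitted for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

splits = ["big data big cluster", "cloud data"]
counts = reduce_phase(shuffle(map_phase(splits)))
print(counts["big"])   # 2
```

On a real cluster the map and reduce functions run on different nodes and the shuffle moves data over the network; the data flow, however, is exactly this.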
Workflow Scheduling Techniques and Algorithms in IaaS Cloud: A Survey (IJECEIAES)
In the modern era, workflows have been adopted as a powerful and attractive paradigm for expressing and solving a variety of applications, including scientific, data-intensive, and big data applications such as MapReduce and Hadoop. These complex applications are described using high-level representations in workflow methods. With the emerging model of cloud computing technology, scheduling in the cloud has become an important research topic. Consequently, the workflow scheduling problem has been studied extensively over the past few years, from homogeneous clusters and grids to the most recent paradigm, cloud computing. The challenges that need to be addressed lie in task-resource mapping, QoS requirements, resource provisioning, performance fluctuation, failure handling, resource scheduling, and data storage. This work presents a complete study of resource provisioning and scheduling algorithms in the cloud environment, focusing on Infrastructure as a Service (IaaS). We provide a comprehensive understanding of existing scheduling techniques and an insight into the research challenges that suggest possible future directions for researchers.
Design and Implementation of SOA Enhanced Semantic Information Retrieval web ... (iosrjce)
The document describes a proposed system for a semantic web information retrieval service using domain ontology, WCF services, and .NET technologies. It discusses implementing concept relevancy ranking of link and page content as web services. The system architecture includes an admin module to create domain ontology and semantic annotations, a search interface for users, and a testing module. Experimental results show the proposed approach provides more relevant results than traditional search engines for the sample query "company cts chennai taramani".
Exploring Neo4j Graph Database as a Fast Data Access Layer (Sambit Banerjee)
This article describes the findings of an extensive investigative work conducted to explore the feasibility of using a Neo4j Graph Database to build a Fast Data Access Layer with near-real time data ingestion from the underlying source systems.
Map-Reduce Synchronized and Comparative Queue Capacity Scheduler in Hadoop fo... (iosrjce)
IOSR Journal of Computer Engineering (IOSR-JCE) is a double-blind peer-reviewed international journal that provides rapid publication (within a month) of articles in all areas of computer engineering and its applications. The journal welcomes publication of high-quality papers on theoretical developments and practical applications in computer technology. Original research papers, state-of-the-art reviews, and high-quality technical notes are invited for publication.
This summary provides an overview of the SparkR package, an R frontend for the Apache Spark distributed computing framework:
- SparkR enables large-scale data analysis from the R shell by using Spark's distributed computation engine to parallelize and optimize R programs. It allows R users to leverage Spark's libraries, data sources, and optimizations while programming in R.
- The central component of SparkR is the distributed DataFrame, which provides a familiar data frame interface to R users but can handle large datasets using Spark. DataFrame operations are optimized using Spark's query optimizer.
- SparkR's architecture includes an R-to-JVM binding that allows R programs to submit jobs to Spark, and support for running R execut…
This document summarizes a research paper on analyzing and visualizing Twitter data using the R programming language with Hadoop. The goal was to leverage Hadoop's distributed processing capabilities to support analytical functions in R. Twitter data was analyzed and visualized in a distributed manner using R packages that connect to Hadoop. This allowed large-scale Twitter data analysis and visualizations to be built as an R Shiny application on top of results from Hadoop.
European Pharmaceutical Contractor: SAS and R Team in Clinical Research (KCR)
Statistical analysis constitutes an essential part of all serious scientific research. Without data and a formal process of searching for evidence supporting or disproving stated hypotheses, there is nothing but mere opinion. Evidence-based medicine is no exception.
Workshop on Real-time & Stream Analytics IEEE BigData 2016 (Sabri Skhiri)
Introduction presentation of the Workshop on Real-time & Stream Analytics co-located with the IEEE Big Data Conference.
We have seen new business models emerging that require real-time features. However, this real-time nature impacts IT systems in terms of (1) data architecture, (2) stream mining, and (3) stream processor technologies. All three impacts remain very interesting research areas. The papers presented at the workshop cover these three areas and provide interesting viewpoints.
Scientific Application Development and Early results on Summit (Ganesan Narayanasamy)
The document summarizes Oak Ridge National Laboratory's (ORNL) new supercomputer Summit and its capabilities for scientific applications and early results. Summit is the most powerful and smartest supercomputer in the world, with 200 petaflops of performance and capabilities well-suited for machine learning and artificial intelligence applications. ORNL is preparing scientific applications for Summit through its Center for Accelerated Application Readiness program to enable early science results and ensure applications are optimized for Summit's architecture.
The document proposes a Rapid Prototyping Capability (RPC) system to efficiently evaluate integrating Earth observation data from NASA satellites and models. The RPC would:
1) Integrate tools to access, process, and analyze data and model outputs to support experiments.
2) Reduce the time typically required to evaluate new data streams in models through a simulated operational environment.
3) Be accessible to various user groups, including specialists responsible for data/models and domain experts performing analyses.
Data Partitioning in MongoDB with Cloud (IJAAS Team)
Cloud computing offers various useful services, such as IaaS, PaaS, and SaaS, for deploying applications at low cost, making them available anytime and anywhere with the expectation that they be scalable and consistent. One technique to improve scalability is data partitioning. Existing techniques are not capable of tracking the data access pattern. This paper implements a scalable workload-driven technique for improving the scalability of web applications. The experiments are carried out over the cloud using the NoSQL data store MongoDB to scale out. This approach offers low response time, high throughput, and fewer distributed transactions. The partitioning technique is evaluated using the TPC-C benchmark.
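The data-partitioning idea can be illustrated with a hash-based sharding sketch. The `shard_for` helper and customer keys below are hypothetical; a real sharded MongoDB deployment performs this routing internally via a hashed shard key:

```python
import hashlib

def shard_for(key, n_shards):
    """Route a document to a shard by hashing its shard key (hash partitioning)."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % n_shards

# Distribute customer documents across 4 shards.
placement = {}
for customer_id in ["c1001", "c1002", "c1003", "c1004"]:
    placement[customer_id] = shard_for(customer_id, 4)

# The same key always routes to the same shard, so reads go straight
# to one node instead of fanning out across the cluster.
assert shard_for("c1001", 4) == placement["c1001"]
```

Hash partitioning spreads load evenly but sacrifices range queries; workload-driven schemes like the one in the paper instead place data according to observed access patterns.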
Data performance characterization of frequent pattern mining algorithms (IJDKP)
Big data has quickly come under the spotlight in recent years. As big data systems are supposed to handle extremely huge amounts of data, it is quite natural that demand for computational environments that accelerate and scale out big data applications is increasing. The behavior of big data applications, however, is not yet clearly defined. Among big data applications, this paper focuses specifically on stream mining applications, whose behavior varies according to the characteristics of the input data. The parameters for data characterization, however, are not yet clearly defined, and no study has investigated explicit relationships between the input data and stream mining applications either. Therefore, this paper picks up frequent pattern mining as a representative stream mining application and interprets the relationships between the characteristics of the input data and the behavior of signature algorithms for frequent pattern mining.
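A classic one-pass frequent-items summary makes the input-sensitivity of such algorithms concrete. The sketch below uses the Misra-Gries algorithm purely as an illustration; the paper does not name its specific algorithms:

```python
def misra_gries(stream, k):
    """One-pass frequent-item summary using at most k counters (Misra-Gries).
    Any item occurring more than n/(k+1) times in a stream of length n
    is guaranteed to survive in the summary."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k:
            counters[item] = 1
        else:
            # decrement every counter; drop the ones that reach zero
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

stream = list("ababcabdabe")   # 'a' and 'b' dominate this toy stream
summary = misra_gries(stream, k=2)
print(sorted(summary))   # ['a', 'b']
```

Note how the algorithm's memory and accuracy depend directly on the skew of the input, which is exactly the kind of data-dependent behavior the paper sets out to characterize.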
PERFORMANCE EVALUATION OF SOCIAL NETWORK ANALYSIS ALGORITHMS USING DISTRIBUTE... (Journal For Research)
The document discusses performance evaluation of social network analysis algorithms using Apache Spark. It analyzes the performance of algorithms like PageRank, connected components, triangle counting and K-means clustering on different social network datasets. The results show that GraphX PageRank performs faster than the naive implementation in Spark. Connected components execution time grows super linearly initially and then fluctuates. Triangle counting time grows linearly with size. K-means clustering is tested using both naive and MLlib implementations in Spark.
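The PageRank algorithm evaluated above can be sketched as a plain power iteration, mirroring the naive (non-GraphX) formulation on a toy graph:

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank over an adjacency dict {node: [out-neighbours]}.
    Assumes every node has at least one outgoing edge (no dangling-node fixup)."""
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # each node keeps the teleport share, then receives mass from in-links
        new = {v: (1.0 - damping) / n for v in nodes}
        for v, outs in graph.items():
            if outs:
                share = damping * rank[v] / len(outs)
                for u in outs:
                    new[u] += share
        rank = new
    return rank

# Tiny directed graph: everyone links to 'a', so 'a' should rank highest.
graph = {"a": ["b"], "b": ["a"], "c": ["a"], "d": ["a"]}
rank = pagerank(graph)
assert rank["a"] == max(rank.values())
```

GraphX's implementation distributes exactly this message-passing step across partitions, which is why it outperforms a naive single-pass Spark version on large graphs.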
A Query Model for Ad Hoc Queries using a Scanning Architecture (Flurry, Inc.)
Systems like Hadoop, HBase, and Hive allowed the world to take huge strides in managing and analyzing large amounts of data. Products like Flurry Analytics make efficient use of large amounts of hardware, using these tools to build statistics for hundreds of thousands of applications. However, these tools require the end user to first set up the relevant analytics queries and then wait days for the results. If the results prompt new questions, or the original query is not quite right, the user must rerun the query and wait again for the results.
We present the Burst system, developed at Flurry to support low-latency single-pass queries over very large and complex mobile application streams. We have created a data schema and query model that can answer very complex ad hoc queries over the data and is highly parallelizable while maintaining low latency. We implement these scans so that they are time- and space-efficient, using the advanced disk scanning techniques provided by the underlying operating system.
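The single-pass scanning idea can be illustrated with a toy group-by over an event stream. The event fields and function names are hypothetical, not Burst's actual schema or API:

```python
def single_pass_scan(events, predicate, dimensions):
    """Answer an ad hoc filter/group-by/count query in one sequential scan,
    the way a scanning architecture avoids precomputed aggregates."""
    result = {}
    for event in events:
        if predicate(event):
            key = tuple(event[d] for d in dimensions)
            result[key] = result.get(key, 0) + 1
    return result

# Hypothetical mobile-app session events.
events = [
    {"app": "game", "os": "ios", "duration": 30},
    {"app": "game", "os": "android", "duration": 45},
    {"app": "news", "os": "ios", "duration": 10},
    {"app": "game", "os": "ios", "duration": 60},
]
counts = single_pass_scan(events, lambda e: e["duration"] >= 30, ["app", "os"])
print(counts)   # {('game', 'ios'): 2, ('game', 'android'): 1}
```

Because the predicate and dimensions are supplied at query time, no aggregate has to be defined in advance, which is the contrast with the pre-registered-query workflow described above.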
Hybrid Job-Driven Meta Data Scheduling for BigData with MapReduce Clusters an... (dbpublications)
It is cost-efficient for a tenant with a limited budget to establish a virtual MapReduce cluster by renting multiple virtual private servers (VPSs) from a VPS provider. To provide an appropriate scheduling scheme for this type of computing environment, we propose in this paper a hybrid job-driven scheduling scheme (JoSS for short) from a tenant’s perspective. JoSS provides not only job-level scheduling, but also map-task-level and reduce-task-level scheduling. JoSS classifies MapReduce jobs based on job scale and job type and designs an appropriate scheduling policy to schedule each class of jobs. The goal is to improve data locality for both map tasks and reduce tasks, avoid job starvation, and improve job execution performance. Two variations of JoSS are further introduced to separately achieve a better map-data locality and a faster task assignment. We conduct extensive experiments to evaluate and compare the two variations with current scheduling algorithms supported by Hadoop. The results show that the two variations outperform the other tested algorithms in terms of map-data locality, reduce-data locality, and network overhead without incurring significant overhead. In addition, the two variations are separately suitable for different MapReduce workload scenarios and provide the best job performance among all tested algorithms.
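The map-data-locality goal can be made concrete with a greedy assignment sketch. This illustrates the general idea of locality-aware scheduling, not JoSS's actual policy:

```python
def assign_map_tasks(tasks, nodes):
    """Greedy locality-aware assignment: prefer a node that already holds
    the task's input block (node-local); otherwise fall back to a remote read."""
    assignment, local_hits = {}, 0
    for task, block_locations in tasks.items():
        local = [n for n in nodes if n in block_locations]
        if local:
            assignment[task] = local[0]
            local_hits += 1
        else:
            assignment[task] = nodes[0]   # remote read: input crosses the network
    return assignment, local_hits

# Hypothetical tasks mapped to the nodes holding replicas of their input blocks.
tasks = {"t1": {"node1", "node2"}, "t2": {"node3"}, "t3": {"node9"}}
nodes = ["node1", "node2", "node3"]
assignment, local_hits = assign_map_tasks(tasks, nodes)
print(local_hits)   # 2 of 3 tasks run node-local
```

Every local hit avoids shipping an input block across the network, which is why map-data locality translates directly into the reduced network overhead the paper measures.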
A General Purpose Extensible Scanning Query Architecture for Ad Hoc Analytics (Flurry, Inc.)
We present Burst, an analytic query system with a scalable and flexible approach to performing low-latency ad hoc analysis over large complex datasets. The architecture consists of hardware-efficient scan techniques and a language facility to transform an extensible set of ad hoc declarative queries into imperative physical scan plans. These plans are multicast across all nodes/cores of a two-level sharded/distributed ingestion, storage, and execution topology, and executed. The first release of this system is the query engine behind the Flurry Explorer product. Here we explore the design details of that system, as well as the incremental ingestion pipeline enhancement currently being implemented for the next major release.
The document describes TrafficDB, a shared-memory data store designed by HERE to provide high throughput access to traffic data. TrafficDB was created to handle the high volumes of read operations required by HERE's traffic-aware services, with minimal latency. It uses shared memory to allow direct memory access for applications. Evaluation showed TrafficDB can handle millions of read operations per second and provides near-linear scalability by allowing additional processes to increase throughput without impacting latency. TrafficDB is now used in production by HERE to power routing, rendering, and other traffic-aware services.
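The shared-memory access pattern TrafficDB relies on can be sketched with Python's `multiprocessing.shared_memory`. The block name and data layout are invented for the example, and a real deployment would separate the writer and reader into different processes:

```python
from multiprocessing import shared_memory
import struct

# Writer: publish per-road-segment speeds into a named shared-memory block.
speeds = [55.0, 23.5, 80.0]
shm = shared_memory.SharedMemory(create=True, size=8 * len(speeds),
                                 name="trafficdb_demo")
struct.pack_into(f"{len(speeds)}d", shm.buf, 0, *speeds)

# Reader: any process can attach by name and read directly, with no copy
# and no IPC round trip -- the property that gives near-linear read scaling.
reader = shared_memory.SharedMemory(name="trafficdb_demo")
segment_speed = struct.unpack_from("d", reader.buf, 8)[0]   # second segment
print(segment_speed)   # 23.5

reader.close()
shm.close()
shm.unlink()
```

Because readers attach to the same physical pages, adding reader processes increases aggregate throughput without adding per-read latency, matching the evaluation result described above.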
High Performance Processing of Streaming Data (Geoffrey Fox)
Describes two parallel robot planning algorithms implemented with Apache Storm on OpenStack -- SLAM (Simultaneous Localization & Mapping) and collision avoidance. Performance (response time) studied and improved as example of HPC-ABDS (High Performance Computing enhanced Apache Big Data Software Stack) concept.
This document lists several Java/J2EE/J2ME projects related to utility computing environments, schema matching, fuzzy ontology generation, wireless sensor networks, wireless MAC protocols, distributed cache updating, selfish routing, collaborative key agreement, TCP congestion control, global roaming in mobile networks, GPS-based emergency response systems, network intrusion detection, honey pots, voice over IP, vehicle tracking, SIP-based teleconferencing, online security systems, and location-aided routing in ad hoc networks. The projects cover a wide range of topics related to distributed systems, wireless networks, and Internet applications.
This document proposes using the R statistical analysis and visualization environment as an interface for analyzing network flow data from SiLK tools. It details how R provides powerful and flexible analysis capabilities while preserving command line control. A prototype wrapper function called rwcount.analyze() is presented that takes SiLK command line queries as input, runs the rwcount tool to generate time series data, and returns an output object in R containing the data, visualization, and other metadata. This integrated environment allows for rapid prototyping and visualization of network security analyses.
This document presents a framework that migrates data from MySQL to NoSQL databases like MongoDB and HBase, and maps MySQL queries to queries in the NoSQL databases. The framework consists of a front-end GUI and modules for migrating data between the databases and mapping queries. It migrates data from MySQL tables to collections in MongoDB and HBase. When a user enters a MySQL query, a decision maker selects the target database and the query is mapped to that database's format to retrieve the data. The mapping time for various query types is measured to be very small, making query execution on NoSQL databases efficient using this framework.
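The query-mapping step can be sketched for one simple query shape. The regex-based `map_select_to_mongo` helper below is a toy illustration, not the framework's actual mapper:

```python
import re

def map_select_to_mongo(sql):
    """Map a simple "SELECT * FROM <table> WHERE <col> = '<value>'" statement
    to a (collection, filter) pair for a document store."""
    pattern = r"SELECT \* FROM (\w+) WHERE (\w+) = '([^']*)'"
    match = re.match(pattern, sql, re.IGNORECASE)
    if not match:
        raise ValueError("unsupported query shape")
    table, column, value = match.groups()
    # the table becomes the collection; the WHERE clause becomes a filter document
    return table, {column: value}

collection, mongo_filter = map_select_to_mongo(
    "SELECT * FROM customers WHERE city = 'Pune'")
print(collection, mongo_filter)   # customers {'city': 'Pune'}
```

A production mapper would parse a full SQL grammar and handle joins, projections, and type conversions, but the table-to-collection and predicate-to-filter translation shown here is the essence of why the measured mapping time stays small.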
This document discusses several key differences between traditional databases and Hive. Hive uses a schema-on-read model where the schema is not enforced during data loading, making the initial load much faster. However, this impacts query performance since indexing and compression cannot be applied during loading. Pig Latin is a data flow language where each step transforms the input relation, unlike SQL which is declarative. While Hive originally lacked features like updates, transactions and indexing, the developers are working to integrate HBase and improve support for these features.
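Schema-on-read, as contrasted with a traditional database's schema-on-write, can be sketched in a few lines. The CSV layout and `read_with_schema` helper are illustrative assumptions:

```python
def read_with_schema(raw_lines, schema):
    """Schema-on-read: raw text is loaded untouched, and the schema is only
    applied (with type coercion) when a query actually scans the data."""
    rows = []
    for line in raw_lines:
        fields = line.rstrip("\n").split(",")
        row = {}
        for (name, cast), field in zip(schema, fields):
            try:
                row[name] = cast(field)
            except ValueError:
                row[name] = None    # malformed cells surface as NULLs at query time
        rows.append(row)
    return rows

raw = ["alice,34", "bob,not-a-number"]
rows = read_with_schema(raw, [("name", str), ("age", int)])
print(rows[1]["age"])   # None -- the bad cell was only rejected at read time
```

Loading is just a file copy, which is why the initial load is fast; the cost of validation and the loss of load-time indexing are paid on every query instead.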
Scaling Application on High Performance Computing Clusters and Analysis of th... (Rusif Eyvazli)
The document discusses techniques for scaling applications across computing nodes in high performance computing (HPC) clusters. It analyzes the performance of different computing nodes on various applications like BLASTX, HPL, and JAGS. Array job facilities are used to parallelize applications by dividing iterations into independent tasks assigned across nodes. Python programs are created to analyze system performance based on log files and produce plots showing differences in node performance on different applications. The plots help with preventative maintenance and capacity management of the HPC system.
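The array-job parallelization of independent iterations can be sketched as a chunking function. `array_job_slices` is a hypothetical helper illustrating how a scheduler's array facility splits the work:

```python
def array_job_slices(n_iterations, n_tasks):
    """Split a loop of independent iterations into contiguous slices, one per
    array-job task, so each task can run on a separate node."""
    base, extra = divmod(n_iterations, n_tasks)
    slices, start = [], 0
    for task in range(n_tasks):
        # the first `extra` tasks take one additional iteration each
        size = base + (1 if task < extra else 0)
        slices.append(range(start, start + size))
        start += size
    return slices

# 10 iterations fanned out across 3 array tasks
for task_id, chunk in enumerate(array_job_slices(10, 3)):
    print(task_id, list(chunk))
# 0 [0, 1, 2, 3]
# 1 [4, 5, 6]
# 2 [7, 8, 9]
```

In a real batch system each task would read its own index from an environment variable (for example, SLURM's array task ID) and compute its slice independently; the partitioning logic is the same.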
Svm Classifier Algorithm for Data Stream Mining Using Hive and R (IRJET Journal)
This document proposes using Hive and R to perform data stream mining on big data. Hive is used to query and analyze large datasets stored in Hadoop. Test and training datasets are extracted from the data using Hive queries. The Support Vector Machine (SVM) classifier algorithm analyzes the data to produce a statistical report in R, comparing the accuracy of linear and nonlinear models. The proposed method aims to improve data processing speed and the ability to analyze large volumes of data compared to other tools.
Performance Analysis and Parallelization of Cosine Similarity of Documents (IRJET Journal)
This document discusses performance analysis and parallelization of the cosine similarity algorithm for calculating document similarity. It proposes an optimized algorithm that utilizes parallel computing to calculate cosine similarity for large sets of retrieved documents more efficiently. The conventional cosine similarity algorithm becomes inefficient for large document sets. The parallelized approach aims to enhance efficiency and reduce latency by processing more documents in less time. The document reviews related work applying techniques like parallelization, cosine similarity, and dimensionality reduction to problems involving document clustering, text summarization, and information retrieval.
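The parallelized scoring step can be sketched with a thread pool. This is an illustrative Python version, not the paper's implementation; for CPU-bound scoring at scale, a process pool or vectorized library would be used instead of threads:

```python
import math
from concurrent.futures import ThreadPoolExecutor

def cosine(a, b):
    """Cosine similarity of two equal-length term-weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def rank_documents(query_vec, doc_vecs, workers=4):
    """Score every retrieved document against the query concurrently:
    each document's score is independent, so the work parallelizes trivially."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda d: cosine(query_vec, d), doc_vecs))

query = [1.0, 0.0, 1.0]
docs = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
scores = rank_documents(query, docs)
print(round(scores[0], 3))   # 1.0 -- identical vectors
```

Because each document's score depends only on the query vector and that document, the result set can be partitioned across workers with no coordination, which is what makes the latency reduction claimed above attainable.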
IRJET- A Workflow Management System for Scalable Data Mining on Clouds (IRJET Journal)
1. The document discusses a workflow management system for scalable data mining on clouds. It proposes using MapReduce and Hadoop frameworks to parallelize k-means clustering of large datasets on cloud infrastructure.
2. The system aims to improve efficiency, security, and transmission speed over existing cloud systems by generating hash codes for files before classification and storage on cloud. It uses deduplication to avoid redundant uploads.
3. The document outlines the system implementation, including user modules for registration, login, profile editing, training data upload, redundancy-free file upload and download, changing passwords, and logging out. It also discusses testing the system functionality using unit testing libraries.
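The hash-before-store deduplication described in point 2 can be sketched as follows; the `DedupStore` class and file names are hypothetical:

```python
import hashlib

def content_hash(data: bytes) -> str:
    """SHA-256 fingerprint of a file's bytes, used to detect duplicate content."""
    return hashlib.sha256(data).hexdigest()

class DedupStore:
    """Store a file only if its content hash has not been seen before."""
    def __init__(self):
        self.by_hash = {}

    def upload(self, name, data):
        digest = content_hash(data)
        if digest in self.by_hash:
            return False            # duplicate content: skip the redundant upload
        self.by_hash[digest] = (name, data)
        return True

store = DedupStore()
print(store.upload("report_v1.csv", b"a,b\n1,2\n"))   # True  -- new content
print(store.upload("report_copy.csv", b"a,b\n1,2\n")) # False -- same bytes, rejected
```

Hashing the content rather than the file name is what lets the system catch renamed copies, saving both storage and transmission time on the cloud side.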
IRJET- Big Data Processes and Analysis using Hadoop Framework (IRJET Journal)
This document discusses issues with analyzing sub-datasets in a distributed manner using Hadoop, such as imbalanced computational loads and inefficient data scanning. It proposes a new approach called Data-Net that uses metadata about sub-dataset distributions stored in an Elastic-Map structure to optimize storage placement and queries. Experimental results on a 128-node cluster show that Data-Net provides better load balancing and performance for various sub-dataset analysis applications compared to the default Hadoop implementation.
European Pharmaceutical Contractor: SAS and R Team in Clinical ResearchKCR
Statistical analysis constitutes an essential part of every serious scientific research. Without data and a formal process of searching for evidences supporting or disproving stated hypotheses, there is nothing but mere opinion. Evidence-based medicine is no exception
Workshop on Real-time & Stream Analytics IEEE BigData 2016Sabri Skhiri
Introduction presentation of the Workshop on Real-time & Stream Analytics co-located with the IEEE Big Data Conference.
We have seen new business models emerging that require real-time features. However, the real-time nature impacts the IT systems. It impacts the IT in term of (1) Data architecture, (2) Stream Mining and (3) Stream Processor technologies. Those three impacts are still very interesting research areas. The papers presented at the workshop cover those three areas and provide interesting view points.
Scientific Application Development and Early results on SummitGanesan Narayanasamy
The document summarizes Oak Ridge National Laboratory's (ORNL) new supercomputer Summit and its capabilities for scientific applications and early results. Summit is the most powerful and smartest supercomputer in the world, with 200 petaflops of performance and capabilities well-suited for machine learning and artificial intelligence applications. ORNL is preparing scientific applications for Summit through its Center for Accelerated Application Readiness program to enable early science results and ensure applications are optimized for Summit's architecture.
The document proposes a Rapid Prototyping Capability (RPC) system to efficiently evaluate integrating Earth observation data from NASA satellites and models. The RPC would:
1) Integrate tools to access, process, and analyze data and model outputs to support experiments.
2) Reduce the time typically required to evaluate new data streams in models through a simulated operational environment.
3) Be accessible to various user groups, including specialists responsible for data/models and domain experts performing analyses.
Data Partitioning in Mongo DB with CloudIJAAS Team
Cloud computing offers various and useful services like IAAS, PAAS SAAS for deploying the applications at low cost. Making it available anytime anywhere with the expectation to be it scalable and consistent. One of the technique to improve the scalability is Data partitioning. The alive techniques which are used are not that capable to track the data access pattern. This paper implements the scalable workload-driven technique for polishing the scalability of web applications. The experiments are carried out over cloud using NoSQL data store MongoDB to scale out. This approach offers low response time, high throughput and less number of distributed transaction. The result of partitioning technique is conducted and evaluated using TPC-C benchmark.
Data performance characterization of frequent pattern mining algorithmsIJDKP
Big data quickly comes under the spotlight in recent years. As big data is supposed to handle extremely
huge amount of data, it is quite natural that the demand for the computational environment to accelerates,
and scales out big data applications increases. The important thing is, however, the behavior of big data
applications is not clearly defined yet. Among big data applications, this paper specifically focuses on stream mining applications. The behavior of stream mining applications varies according to the characteristics of the input data. The parameters for data characterization are, however, not clearly defined yet, and there is no study investigating explicit relationships between the input data, and streammining applications, either. Therefore, this paper picks up frequent pattern mining as one of the
representative stream mining applications, and interprets the relationships between the characteristics of the input data, and behaviors of signature algorithms for frequent pattern mining.
PERFORMANCE EVALUATION OF SOCIAL NETWORK ANALYSIS ALGORITHMS USING DISTRIBUTE...Journal For Research
The document discusses performance evaluation of social network analysis algorithms using Apache Spark. It analyzes the performance of algorithms like PageRank, connected components, triangle counting and K-means clustering on different social network datasets. The results show that GraphX PageRank performs faster than the naive implementation in Spark. Connected components execution time grows super linearly initially and then fluctuates. Triangle counting time grows linearly with size. K-means clustering is tested using both naive and MLlib implementations in Spark.
A Query Model for Ad Hoc Queries using a Scanning ArchitectureFlurry, Inc.
Systems like Hadoop, HBase and Hive allowed the world to take huge strides in managing and analyzing large amounts of data. Products like Flurry Analytics make efficient use of large amounts of hardware using these tools to build statistics for hundreds of thousands of applications. However, these tools require the end user to first set up relevant analytics queries and then wait days for the results. If the results prompt new questions or the original query is not quite right, the user must rerun the query and wait again for the results.
We present the Burst system developed at Flurry to support low-latency single pass queries over very large and complex mobile application streams. We have created a data schema and query model that can answer very complex ad-hoc queries over data, and is highly parallelizable while maintaining low-latency. We implement these scans so that they are time and space efficient using the advanced disk scanning techniques provided by the underlying operating system.
Hybrid Job-Driven Meta Data Scheduling for BigData with MapReduce Clusters an...dbpublications
It is cost-efficient for a tenant with a limited budget to establish a virtual MapReduce cluster by renting multiple virtual private servers (VPSs) from a VPS provider. To provide an appropriate scheduling scheme for this type of computing environment, we propose in this paper a hybrid job-driven scheduling scheme (JoSS for short) from a tenant’s perspective. JoSS provides not only job-level scheduling, but also map-task level scheduling and reduce-task level scheduling. JoSS classifies MapReduce jobs based on job scale and job type and designs an appropriate scheduling policy to schedule each class of jobs. The goal is to improve data locality for both map tasks and reduce tasks, avoid job starvation, and improve job execution performance. Two variations of JoSS are further introduced to separately achieve a better map-data locality and a faster task assignment. We conduct extensive experiments to evaluate and compare the two variations with current scheduling algorithms supported by Hadoop. The results show that the two variations outperform the other tested algorithms in terms of map-data locality, reduce-data locality, and network overhead without incurring significant overhead. In addition, the two variations are separately suitable for different MapReduce workload scenarios and provide the best job performance among all tested algorithms.
A General Purpose Extensible Scanning Query Architecture for Ad Hoc AnalyticsFlurry, Inc.
We present Burst, an analytic query system with a scalable and flexible approach to performing low-latency ad hoc analysis over large complex datasets. The architecture consists of hardware-efficient scan techniques and a language facility to transform an extensible set of ad hoc declarative queries into imperative physical scan plans. These plans are multicast across all nodes/cores of a two-level sharded/distributed ingestion, storage, and execution topology and executed. The first release of this system is the query engine behind the Flurry Explorer product. Here we explore the design details of that system as well as the incremental ingestion pipeline enhancement currently being implemented for the next major release.
The document describes TrafficDB, a shared-memory data store designed by HERE to provide high throughput access to traffic data. TrafficDB was created to handle the high volumes of read operations required by HERE's traffic-aware services, with minimal latency. It uses shared memory to allow direct memory access for applications. Evaluation showed TrafficDB can handle millions of read operations per second and provides near-linear scalability by allowing additional processes to increase throughput without impacting latency. TrafficDB is now used in production by HERE to power routing, rendering, and other traffic-aware services.
High Performance Processing of Streaming DataGeoffrey Fox
Describes two parallel robot planning algorithms implemented with Apache Storm on OpenStack -- SLAM (Simultaneous Localization & Mapping) and collision avoidance. Performance (response time) is studied and improved as an example of the HPC-ABDS (High Performance Computing enhanced Apache Big Data Software Stack) concept.
This document lists several Java/J2EE/J2ME projects related to utility computing environments, schema matching, fuzzy ontology generation, wireless sensor networks, wireless MAC protocols, distributed cache updating, selfish routing, collaborative key agreement, TCP congestion control, global roaming in mobile networks, GPS-based emergency response systems, network intrusion detection, honey pots, voice over IP, vehicle tracking, SIP-based teleconferencing, online security systems, and location-aided routing in ad hoc networks. The projects cover a wide range of topics related to distributed systems, wireless networks, and Internet applications.
This document proposes using the R statistical analysis and visualization environment as an interface for analyzing network flow data from SiLK tools. It details how R provides powerful and flexible analysis capabilities while preserving command line control. A prototype wrapper function called rwcount.analyze() is presented that takes SiLK command line queries as input, runs the rwcount tool to generate time series data, and returns an output object in R containing the data, visualization, and other metadata. This integrated environment allows for rapid prototyping and visualization of network security analyses.
This document presents a framework that migrates data from MySQL to NoSQL databases like MongoDB and HBase, and maps MySQL queries to queries in the NoSQL databases. The framework consists of a front-end GUI and modules for migrating data between the databases and mapping queries. It migrates data from MySQL tables to collections in MongoDB and HBase. When a user enters a MySQL query, a decision maker selects the target database and the query is mapped to that database's format to retrieve the data. The mapping time for various query types is measured to be very small, making query execution on NoSQL databases efficient using this framework.
This document discusses several key differences between traditional databases and Hive. Hive uses a schema-on-read model where the schema is not enforced during data loading, making the initial load much faster. However, this impacts query performance since indexing and compression cannot be applied during loading. Pig Latin is a data flow language where each step transforms the input relation, unlike SQL which is declarative. While Hive originally lacked features like updates, transactions and indexing, the developers are working to integrate HBase and improve support for these features.
Scaling Application on High Performance Computing Clusters and Analysis of th...Rusif Eyvazli
The document discusses techniques for scaling applications across computing nodes in high performance computing (HPC) clusters. It analyzes the performance of different computing nodes on various applications like BLASTX, HPL, and JAGS. Array job facilities are used to parallelize applications by dividing iterations into independent tasks assigned across nodes. Python programs are created to analyze system performance based on log files and produce plots showing differences in node performance on different applications. The plots help with preventative maintenance and capacity management of the HPC system.
Svm Classifier Algorithm for Data Stream Mining Using Hive and RIRJET Journal
This document proposes using Hive and R to perform data stream mining on big data. Hive is used to query and analyze large datasets stored in Hadoop, and training and test datasets are extracted from the data using Hive queries. The Support Vector Machine (SVM) classifier algorithm analyzes the data to produce a statistical report in R, comparing the accuracy of linear and nonlinear models. The proposed method aims to improve data processing speed and the ability to analyze large volumes of data compared to other tools.
Performance Analysis and Parallelization of CosineSimilarity of DocumentsIRJET Journal
This document discusses performance analysis and parallelization of the cosine similarity algorithm for calculating document similarity. It proposes an optimized algorithm that utilizes parallel computing to calculate cosine similarity for large sets of retrieved documents more efficiently. The conventional cosine similarity algorithm becomes inefficient for large document sets. The parallelized approach aims to enhance efficiency and reduce latency by processing more documents in less time. The document reviews related work applying techniques like parallelization, cosine similarity, and dimensionality reduction to problems involving document clustering, text summarization, and information retrieval.
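A minimal version of the parallelized approach can be sketched with a pool scoring documents against the query concurrently (an illustration of the idea, not the paper's implementation; for CPU-bound work in CPython a process pool would be the realistic choice over threads):

```python
import math
from concurrent.futures import ThreadPoolExecutor

def cosine(a, b):
    """Cosine similarity of two equal-length term-frequency vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def similarities(query_vec, doc_vecs, workers=4):
    """Score every retrieved document against the query in parallel;
    each document's score is independent, so the work partitions cleanly."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda d: cosine(query_vec, d), doc_vecs))

docs = [[1, 0, 1], [0, 1, 1], [1, 1, 0]]
scores = similarities([1, 0, 1], docs)
```

The independence of per-document scores is exactly what makes the algorithm a good parallelization target: there is no shared state between workers, so speedup is limited mainly by vectorization and data movement costs.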
IRJET- A Workflow Management System for Scalable Data Mining on CloudsIRJET Journal
1. The document discusses a workflow management system for scalable data mining on clouds. It proposes using MapReduce and Hadoop frameworks to parallelize k-means clustering of large datasets on cloud infrastructure.
2. The system aims to improve efficiency, security, and transmission speed over existing cloud systems by generating hash codes for files before classification and storage on cloud. It uses deduplication to avoid redundant uploads.
3. The document outlines the system implementation, including user modules for registration, login, profile editing, training data upload, and file upload and download while avoiding redundancy, as well as changing passwords and logging out. It also discusses testing the system functionality using unit testing libraries.
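The k-means-on-MapReduce step the document describes can be sketched as one map/reduce round (a toy single-process illustration of the data flow, not the system's actual Hadoop code):

```python
import math
from collections import defaultdict

def nearest(point, centroids):
    """Index of the centroid closest to a point."""
    return min(range(len(centroids)),
               key=lambda i: math.dist(point, centroids[i]))

def kmeans_iteration(points, centroids):
    """One MapReduce-style k-means step: the 'map' phase emits
    (centroid index, point) pairs, the 'reduce' phase averages each group
    to produce the updated centroids."""
    groups = defaultdict(list)
    for p in points:                        # map phase
        groups[nearest(p, centroids)].append(p)
    new_centroids = list(centroids)
    for idx, members in groups.items():     # reduce phase
        new_centroids[idx] = tuple(sum(c) / len(members)
                                   for c in zip(*members))
    return new_centroids

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centroids = kmeans_iteration(points, [(0, 0), (10, 10)])
```

On Hadoop, the map phase is sharded across the dataset and the reduce phase aggregates partial sums per centroid; the driver repeats rounds until the centroids stop moving.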
IRJET- Big Data Processes and Analysis using Hadoop FrameworkIRJET Journal
This document discusses issues with analyzing sub-datasets in a distributed manner using Hadoop, such as imbalanced computational loads and inefficient data scanning. It proposes a new approach called Data-Net that uses metadata about sub-dataset distributions stored in an Elastic-Map structure to optimize storage placement and queries. Experimental results on a 128-node cluster show that Data-Net provides better load balancing and performance for various sub-dataset analysis applications compared to the default Hadoop implementation.
Towards efficient processing of RDF data streamsAlejandro Llaves
Presentation of short paper submitted to OrdRing workshop, held at ISWC 2014 - http://streamreasoning.org/events/ordring2014.
In the last years, there has been an increase in the amount of real-time data generated. Sensors attached to things are transforming how we interact with our environment. Extracting meaningful information from these streams of data is essential for some application areas and requires processing systems that scale to varying conditions in data sources, complex queries, and system failures. This paper describes ongoing research on the development of a scalable RDF streaming engine.
Towards efficient processing of RDF data streamsAlejandro Llaves
This document discusses efficient processing of RDF data streams. It proposes using the Storm distributed stream processing system and Lambda Architecture to address challenges of scalability, latency, and integrating historical and real-time data. Key components include Storm-based operators to parallelize SPARQL queries over streams, adaptive query processing to adjust to changing conditions, and an ERI compression format to reduce transmission costs for structured RDF streams. Open questions remain around parallelization and handling of out-of-order tuples.
IRJET- Towards Efficient Framework for Semantic Query Search Engine in Large-...IRJET Journal
The document proposes a new framework for efficient semantic search in large datasets. It aims to improve understanding of short texts by enriching them with concepts and related terms from a probabilistic knowledge base. A deep learning model using stacked autoencoders is designed to learn features from the enriched short texts and encode them into binary codes, allowing similarity searches. Experiments show the new approach captures semantics better than existing methods and enables applications like short text retrieval and classification.
Smart E-Logistics for SCM Spend AnalysisIRJET Journal
This document discusses applying predictive analytics and machine learning techniques like LSTM models to supply chain management problems. It focuses on spend analysis and extracting fields from invoices and proofs of delivery using optical character recognition. The key points are:
1. LSTM models are applied to time series spend analysis data and shown to provide more accurate predictions than ARIMA models.
2. A technique is proposed to extract fields from printed and handwritten documents using models trained on Form Recognizer and then cleaning the extracted data.
3. The technique aims to reconcile invoices and proofs of delivery by comparing extracted data fields and calculating a match confidence score.
The document describes CloudTPS, a middleware system that implements support for join queries and transactions in NoSQL cloud data stores. CloudTPS sits between web applications and their underlying data store (e.g. Bigtable, SimpleDB) to provide consistent join queries and strongly consistent multi-item transactions while retaining the scalability of the cloud data store. CloudTPS focuses on supporting foreign-key equi-join queries, which start with records identified by their primary keys and follow references to other records, allowing it to efficiently process queries that access a small number of data items.
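The foreign-key equi-join pattern CloudTPS targets can be illustrated with a primary-key lookup join (a sketch of the access pattern only, not CloudTPS's implementation; the table and field names are invented):

```python
def fk_equijoin(orders, customers_by_pk):
    """Follow each order's foreign key to the customer record it
    references. Because every lookup is by primary key, the join only
    touches the referenced rows rather than scanning a whole table."""
    joined = []
    for order in orders:
        customer = customers_by_pk.get(order["customer_id"])
        if customer is not None:
            joined.append({**order, "customer_name": customer["name"]})
    return joined

customers = {1: {"name": "Ada"}, 2: {"name": "Lin"}}
orders = [{"id": 10, "customer_id": 1}, {"id": 11, "customer_id": 2}]
result = fk_equijoin(orders, customers)
```

This is why the restriction to foreign-key equi-joins matters: starting from known primary keys keeps the set of accessed data items small, which is what lets the middleware stay scalable on top of a key-value cloud data store.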
Previous research has focused on the quick and efficient generation of wrappers; the development of tools for wrapper maintenance has received less attention. This is an important research problem because Web sources often change in ways that prevent wrappers from extracting data correctly. We present an efficient algorithm that extracts unstructured Web data into structured form. The wrapper verification system detects when a wrapper is no longer extracting correct data, usually because the Web source has changed its format. The verification framework automatically recovers from changes in the Web source by identifying data on Web pages using dimension reduction techniques. The wrapped data is then passed to a one-class classifier over numerical features to avoid classification problems. Finally, the resulting data is fed to a top-k query to produce the best ranking based on probability scores. The wrapper verification system relies on one-class classification techniques to overcome the weaknesses of previous approaches, identifying problems by analysing both the signature and the classifier output. If there are sufficient mislabelled slots, a technique to find a pattern could be explored.
LoadAwareDistributor: An Algorithmic Approach for Cloud Resource AllocationIRJET Journal
This document summarizes research on load balancing algorithms for cloud resource allocation. It proposes a new LoadAwareDistributor algorithm that prioritizes virtual machines with lower CPU utilization to improve efficiency. A literature review covers existing load balancing techniques and their goals. The proposed algorithm is evaluated through simulation and shown to improve metrics like VM utilization and task completion time over round-robin methods. The study advocates for future algorithm advances incorporating machine learning to better address dynamic load balancing challenges in cloud computing environments.
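The priority rule the abstract describes — prefer virtual machines with lower CPU utilisation — might be sketched as follows (the tie-breaker, field names, and per-task utilisation bump are assumptions for illustration, not details from the paper):

```python
def pick_vm(vms):
    """Select the VM with the lowest current CPU utilisation;
    ties fall back to the fewest queued tasks."""
    return min(vms, key=lambda vm: (vm["cpu"], vm["queued"]))

def assign(tasks, vms):
    """Greedily place each task on the least-loaded VM, updating that
    VM's bookkeeping so later tasks see the new load."""
    placement = {}
    for task in tasks:
        vm = pick_vm(vms)
        placement[task] = vm["name"]
        vm["queued"] += 1
        vm["cpu"] += 5  # assumed per-task utilisation increase
    return placement

vms = [{"name": "vm1", "cpu": 70, "queued": 3},
       {"name": "vm2", "cpu": 20, "queued": 1}]
placement = assign(["t1", "t2"], vms)
```

Compared with round-robin, this kind of load-aware rule avoids piling work onto an already-busy VM, which is the mechanism behind the improved utilisation and completion-time metrics the study reports.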
An Energy Efficient Data Transmission and Aggregation of WSN using Data Proce...IRJET Journal
The document proposes a system for efficient data transmission and aggregation in wireless sensor networks (WSNs) using MapReduce processing. Sensors are grouped into three clusters, with a cluster head elected in each based on distance, memory, and battery to reduce energy consumption. Sensor data is encrypted and sent to cluster heads, which aggregate the data and append a signature before sending to the base station. The signature is verified and data is stored in Hadoop and processed using MapReduce. The system aims to provide data integrity and privacy during concealed data aggregation to reduce overhead in heterogeneous WSNs.
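The cluster-head election based on distance, memory, and battery could be sketched as a weighted score over candidate nodes (the weights here are invented for illustration; the paper does not specify them):

```python
def elect_cluster_head(nodes):
    """Pick the node best suited to be cluster head: more battery and
    memory score higher, greater distance to the base station scores
    lower. Weights are assumptions, not from the source."""
    def score(n):
        return 0.5 * n["battery"] + 0.3 * n["memory"] - 0.2 * n["distance"]
    return max(nodes, key=score)

nodes = [
    {"id": "s1", "battery": 90, "memory": 60, "distance": 40},
    {"id": "s2", "battery": 50, "memory": 80, "distance": 10},
    {"id": "s3", "battery": 95, "memory": 70, "distance": 80},
]
head = elect_cluster_head(nodes)
```

Electing the head this way concentrates the expensive aggregation and transmission work on the node most able to bear it, which is how the scheme reduces overall energy consumption in the cluster.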
This document discusses developing cyberinfrastructure to support computational chemistry workflows. It describes the OREChem project which aims to develop infrastructure for scholarly materials in chemistry. It outlines IU's objectives to build pipelines to fetch OREChem data, perform computations on resources like TeraGrid, and store results. It also discusses the GridChem science gateway which supports various chemistry applications and the ParamChem project which automates parameterization of molecular mechanics methods through workflows. Finally, it covers the Open Gateway Computing Environments project and efforts to sustain software through the Apache Software Foundation.
THE DEVELOPMENT AND STUDY OF THE METHODS AND ALGORITHMS FOR THE CLASSIFICATIO...IJCNCJournal
This document summarizes a study on developing methods and algorithms for classifying data flows of cloud applications in the network of a virtual data center. The researchers developed a hybrid approach using data mining and machine learning methods to classify traffic flows in real-time. They created an algorithm for classifying and adaptively routing cloud application traffic flows, which was implemented as a module in the software-defined network controller. This solution aims to improve the efficiency of handling user requests to cloud applications and reduce response times.
A Novel Data Extraction and Alignment Method for Web DatabasesIJMER
International Journal of Modern Engineering Research (IJMER) is Peer reviewed, online Journal. It serves as an international archival forum of scholarly research related to engineering and science education.
International Journal of Modern Engineering Research (IJMER) covers all fields of engineering and science: Electrical Engineering, Mechanical Engineering, Civil Engineering, Chemical Engineering, Computer Engineering, Agricultural Engineering, Aerospace Engineering, Thermodynamics, Structural Engineering, Control Engineering, Robotics, Mechatronics, Fluid Mechanics, Nanotechnology, Simulators, Web-based Learning, Remote Laboratories, Engineering Design Methods, Education Research, Students' Satisfaction and Motivation, Global Projects, Assessment, and many more.
M phil-computer-science-data-mining-projectsVijay Karan
This document provides summaries for several M.Phil Computer Science Data Mining Projects written in C#. The projects cover topics such as bridging virtual communities, mood recognition during online tests, surveying the size of the World Wide Web, knowledge sharing in virtual organizations, adaptive provisioning of human expertise in service-oriented systems, cost-aware rank joins with random and sorted access, improving data quality with dynamic forms, targeted data delivery algorithms, and sentiment classification using feature relation networks.
Empirical Analysis of Radix Sort using Curve Fitting Technique in Personal Co...IRJET Journal
The document empirically analyzes the radix sort algorithm using curve fitting techniques on data collected from running radix sort on different data sizes on a personal computer. It implements radix sort in C and runs it 100 times for data sizes ranging from 10,000 to 27,000, recording the average run times. It then uses curve fitting to identify the model that best fits the run time versus data size data points, using R-squared, adjusted R-squared, and root mean square error. The analysis finds that the power model provides the best fit for the data.
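Fitting the power model y = a * x**b reduces to ordinary least squares on log-transformed data, since log y = log a + b log x; a sketch of the fit (with synthetic run times, not the paper's measurements):

```python
import math

def fit_power_model(xs, ys):
    """Fit y = a * x**b by linear least squares in log-log space."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(xs)
    mx, my = sum(lx) / n, sum(ly) / n
    # slope of the log-log regression line is the exponent b
    b = sum((u - mx) * (v - my) for u, v in zip(lx, ly)) \
        / sum((u - mx) ** 2 for u in lx)
    a = math.exp(my - b * mx)
    return a, b

# synthetic run times generated from y = 2e-6 * n**1.1 (illustrative only)
sizes = [10000, 15000, 20000, 27000]
times = [2e-6 * n ** 1.1 for n in sizes]
a, b = fit_power_model(sizes, times)
```

Goodness-of-fit measures such as R-squared and RMSE, as used in the paper, would then be computed on the residuals between the fitted curve and the measured averages.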
IRJET- Optimization of Completion Time through Efficient Resource Allocation ...IRJET Journal
This document discusses optimizing task completion time in cloud computing through efficient resource allocation using genetic and differential evolutionary algorithms. It aims to reduce makespan (completion time) by combining a genetic algorithm with differential evolutionary algorithms. The genetic algorithm uses selection, crossover and mutation to allocate tasks to resources. The outputs are then input to the differential evolutionary algorithm, which has the same operations in reverse order. This double process refines the allocation to provide the best allocation minimizing completion time. The document outlines the related work in genetic algorithms for resource allocation and task scheduling in cloud computing.
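The genetic half of the scheme — selection, crossover, and mutation over task-to-resource assignments — can be sketched as follows (a toy GA minimizing makespan; the population size, rates, and the omitted differential-evolution refinement stage are simplifications, not the paper's configuration):

```python
import random

def makespan(assignment, task_times, n_machines):
    """Completion time = load of the busiest machine."""
    loads = [0.0] * n_machines
    for task, machine in enumerate(assignment):
        loads[machine] += task_times[task]
    return max(loads)

def genetic_allocate(task_times, n_machines, pop=30, gens=60, seed=1):
    """Tiny GA: tournament selection, one-point crossover, and point
    mutation over task-to-machine assignment vectors."""
    rng = random.Random(seed)
    n = len(task_times)
    population = [[rng.randrange(n_machines) for _ in range(n)]
                  for _ in range(pop)]
    fitness = lambda ind: makespan(ind, task_times, n_machines)
    for _ in range(gens):
        nxt = []
        for _ in range(pop):
            p1 = min(rng.sample(population, 3), key=fitness)  # selection
            p2 = min(rng.sample(population, 3), key=fitness)
            cut = rng.randrange(1, n)                          # crossover
            child = p1[:cut] + p2[cut:]
            if rng.random() < 0.2:                             # mutation
                child[rng.randrange(n)] = rng.randrange(n_machines)
            nxt.append(child)
        population = nxt
    return min(population, key=fitness)

times = [4, 7, 2, 5, 3, 6]
best = genetic_allocate(times, n_machines=2)
```

In the hybrid scheme the document describes, the GA's output population would then be passed to the differential-evolution stage for further refinement of the allocation.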
Marketing strategy. This is a paper I wrote for my assignment during the summer classes. The paper was marked and I scored 90 percent, even though the tutor informed me that I had to make some improvements to it.
The document discusses health and safety legislation in the UK. It emphasizes the importance of enforcing health and safety rules to protect workers. The legislation focuses on providing welfare services, training for safe equipment handling, and regular maintenance. Employers must provide a safe work environment and inform staff of hazards. Employees should report any incidents. The Health and Safety Act of 1974 gives regulations for employers, including risk assessments, and failure to comply can result in liability. Stress management is also an important area covered by the legislation. Case studies are presented to illustrate the duties of employers and employees.
Heuristic based query optimisation for rsp(rdf stream processing) enginesWilliam Aruga
This thesis addresses improving query optimization in RDF stream processing engines like CQELS and C-SPARQL. The author proposes implementing a heuristics-based approach to identify errors early and optimize queries. Key contributions include deploying the approach on existing engines, extending the engines to allow sharing processing and resources among concurrent queries, and evaluating performance which shows improvements over original engines. The thesis explores optimization techniques like adaptive execution, dynamic query planning, multi-way join optimization, and shared window joins.
The document discusses consistency between Jonathan's comment and Shelley's ideas about a club fundraiser. Jonathan's comment that the club had invested in a rodeo was consistent with Shelley's idea that the fundraiser should continue getting better each year and give the club a presence in the community. However, Jonathan's comment aimed to increase the club's wealth rather than give back to the community. The document also provides steps to make the rodeo profitable and lists references.
The document discusses marketing strategies for special events, using Glastonbury Festivals as a case study. It defines special events and how they are classified. It then performs a SWOT analysis of Glastonbury Festivals and discusses how the marketing mix or 7P's are applied, focusing on product, price, and place. Finally, it outlines marketing strategies like segmentation, targeting, positioning, and monitoring and control used by Glastonbury Festivals.
This document provides a financial analysis of Saudi Telecom Corporation (STC) and a comparison with its competitor Mobily. It includes a SWOT analysis, industry analysis using Porter's Five Forces, and an analysis of key financial ratios for STC. It also discusses sources of internal and external finance available to STC, budgeting, and concludes with recommendations for performance enhancement. Financial data for STC such as net revenue, net income, cash flow, market capitalization, and dividend yield are presented alongside the same metrics for Mobily to facilitate comparison between the two companies.
The document describes a flow chart for restocking items within a company. The process begins with identifying required items and submitting a list to the manager for approval. If approved, the list goes to the finance office for funding approval before seeking suppliers. Items are collected, accounted for, and distributed. The document analyzes each step and identifies areas for improvement, such as eliminating duplicative roles to streamline the process and reduce time and inefficiency.
Heuristic based query optimisation for rsp(rdf stream processing) enginesWilliam Aruga
This is the original report of the dissertation, written some days back, on heuristic-based query optimisation for RSP (RDF stream processing) engines. The report was prepared by Wilfred Govern on my behalf.
Fraud examination bre x minerals case studyWilliam Aruga
The Bre-X Minerals case involved a mining company that convinced investors it had discovered one of the largest gold deposits ever. However, it was later revealed that gold samples had been tampered with to mislead investors and inflate the company's value. When an investigation found the samples were fraudulent, the company's stock price collapsed. The fraud triangle model explains there was pressure on perpetrators to meet financial targets, an opportunity due to lack of controls, and rationalizations to save the company. Examining management compensation and company relationships could have provided clues about the true value of the gold deposit.
The impact of digital platform on the sharing economyWilliam Aruga
The digital platform of Airbnb has significantly impacted the sharing economy in three key ways:
1) It allows individuals to list, discover, and book unique accommodations from over 34,000 cities and 190 countries.
2) It facilitates peer-to-peer transactions and builds trust between hosts and guests through a user review system.
3) While allowing property owners control over their listings, Airbnb maintains control over its brand by vetting hosts and creating incentives to provide a quality customer experience.
Takotsubo cardiomyopathy potential differential diagnosis in acute coronary s...William Aruga
1. Takotsubo cardiomyopathy (TCM) and acute coronary syndrome (ACS) can present with similar symptoms but have distinct causes. TCM is often triggered by emotional or physical stress and causes temporary left ventricular dysfunction, while ACS is caused by coronary artery blockages.
2. It is important to differentiate between TCM and ACS to determine the appropriate treatment approach. Electrocardiograms may show different abnormalities in TCM compared to ACS. Imaging tests like coronary angiography can also help establish a diagnosis.
3. While diagnostic criteria have been proposed for TCM, it can still be challenging to distinguish from ACS. Careful assessment of symptoms, risk factors, and test results is needed
Evaluation of doctoral study foundation of studyWilliam Aruga
This document provides background information and establishes the foundation for a study on strategies small business owners use to achieve profitability beyond five years. It identifies that small businesses face challenges like lack of resources and management skills that contribute to high failure rates. The purpose of the study is to explore industry strategies successful small retail business owners employ to remain profitable for over five years. The central research question asks what strategies small business owners use to achieve long-term profitability. The conceptual framework draws from theories of disruptive innovation and susceptibility.
This document discusses a research study that analyzed the structural behavior of laminated glass beams in comparison to monolithic and layered glass beams. The study aimed to better understand the mechanical behavior of laminated glass and evaluate its suitability for structural design. The research examined laminated glass under both dynamic and static loading using experimental and numerical data. It also developed mathematical models to understand the salient features of laminated glass and compared current design codes to its actual behavior. Various case studies on laminated glass applications were also conducted to analyze its use in building structures.
The document discusses the impact of digital platforms on the sharing economy, using Airbnb as a case study. It makes three key points:
1) Airbnb has grown rapidly due to technological innovations that allow individuals to share unused resources through an online platform. This platform model reduces transaction costs and builds trust between strangers.
2) As a digital platform, Airbnb utilizes a modular system that facilitates product innovation and achieves economies of scale. It also executes control over hosts and customers through mechanisms like reviews, commissions, and branding.
3) The emergence of sharing platforms like Airbnb is driven by consumers' desire for new economic and experiential options beyond traditional hotels. This competition has led
This document appears to be a plagiarism report for a work submitted by John Duran on March 25th, 2017. The report details that the work is 8,297 words and 48,254 characters long. It also indicates that the similarity index found just 1% similarity to other sources and that quotes and the bibliography were excluded from the analysis. The report is then followed by 34 pages of additional analysis or documentation.
This document provides an analysis of the Marks & Spencer organization. It analyzes the company's internal and external environments through a PESTEL analysis, SWOT analysis, and evaluation of the company's value chain and resources. The company's strengths include its strong brand recognition and large customer base in the UK. Weaknesses include some customer perceptions of high prices and lack of interest in some product lines. Opportunities exist in expanding into new markets and adapting to customer preferences for incentives. Threats include increased competition and potential economic downturns. The document concludes the company has a strong strategic position but must effectively utilize resources to maintain its competitive advantage.
The document summarizes artifacts from different cultures represented in the San Antonio Museum of Art, including ancient Rome, Judaism, Christianity, Islam, and the Middle Ages. For ancient Rome, a chest with writing depicts a ritual offering. Judaism is represented by a mezuzah placed on doors. Christianity uses a crucifix to signify Jesus' suffering. Islam's artifact is a compass used to face Mecca in prayer. A golden drinking cup from the Middle Ages was used in ceremonies and some communities still use cups today.
This document reviews literature on childhood obesity. It discusses how the rate of childhood obesity has significantly increased over the past 30 years. Obesity in children can lead to both short-term and long-term health impacts. Common causes of childhood obesity include increased calorie intake, lack of exercise, and consumption of sugary drinks. Addressing childhood obesity is important as obese children are more likely to be obese adults and develop related health conditions like diabetes if obesity continues into adulthood.
CHINA’S GEO-ECONOMIC OUTREACH IN CENTRAL ASIAN COUNTRIES AND FUTURE PROSPECTjpsjournal1
The rivalry between prominent international actors for dominance over Central Asia's hydrocarbon reserves and the ancient Silk Road trade route, along with China's diplomatic endeavours in the area, has been referred to as the "New Great Game." This research centres on that power struggle, considering geopolitical, geostrategic, and geoeconomic variables. Topics including trade, political hegemony, oil politics, and traditional and non-traditional security are explored and explained. Using Mackinder's Heartland theory, Spykman's Rimland theory, and Hegemonic Stability theory, the study examines China's role in Central Asia. It adheres to an empirical epistemological method and takes care to remain objective, critically analysing primary and secondary research documents to elaborate the role of China's geo-economic outreach in Central Asian countries and its future prospects. The study finds that China is seeing significant success in trade, pipeline politics, and gaining influence over other governments, a success attributable to the effective use of key instruments such as the Shanghai Cooperation Organisation and the Belt and Road Economic Initiative.
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...IJECEIAES
Climate change's impact on the planet forced the United Nations and governments to promote green energies and electric transportation. The deployments of photovoltaic (PV) and electric vehicle (EV) systems gained stronger momentum due to their numerous advantages over fossil fuel types. The advantages go beyond sustainability to reach financial support and stability. The work in this paper introduces the hybrid system between PV and EV to support industrial and commercial plants. This paper covers the theoretical framework of the proposed hybrid system including the required equation to complete the cost analysis when PV and EV are present. In addition, the proposed design diagram which sets the priorities and requirements of the system is presented. The proposed approach allows setup to advance their power stability, especially during power outages. The presented information supports researchers and plant owners to complete the necessary analysis while promoting the deployment of clean energy. The result of a case study that represents a dairy milk farmer supports the theoretical works and highlights its advanced benefits to existing plants. The short return on investment of the proposed approach supports the paper's novelty approach for the sustainable electrical system. In addition, the proposed system allows for an isolated power setup without the need for a transmission line which enhances the safety of the electrical network
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...IJECEIAES
Medical image analysis has witnessed significant advancements with deep learning techniques. In the domain of brain tumor segmentation, the ability to
precisely delineate tumor boundaries from magnetic resonance imaging (MRI)
scans holds profound implications for diagnosis. This study presents an ensemble convolutional neural network (CNN) with transfer learning, integrating
the state-of-the-art Deeplabv3+ architecture with the ResNet18 backbone. The
model is rigorously trained and evaluated, exhibiting remarkable performance
metrics, including an impressive global accuracy of 99.286%, a high-class accuracy of 82.191%, a mean intersection over union (IoU) of 79.900%, a weighted
IoU of 98.620%, and a Boundary F1 (BF) score of 83.303%. Notably, a detailed comparative analysis with existing methods showcases the superiority of
our proposed model. These findings underscore the model’s competence in precise brain tumor localization, underscoring its potential to revolutionize medical
image analysis and enhance healthcare outcomes. This research paves the way
for future exploration and optimization of advanced CNN models in medical
imaging, emphasizing addressing false positives and resource efficiency.
Use PyCharm for remote debugging of WSL on a Windo cf5c162d672e4e58b4dde5d797...shadow0702a
This document serves as a comprehensive step-by-step guide on how to effectively use PyCharm for remote debugging of the Windows Subsystem for Linux (WSL) on a local Windows machine. It meticulously outlines several critical steps in the process, starting with the crucial task of enabling permissions, followed by the installation and configuration of WSL.
The guide then proceeds to explain how to set up the SSH service within the WSL environment, an integral part of the process. Alongside this, it also provides detailed instructions on how to modify the inbound rules of the Windows firewall to facilitate the process, ensuring that there are no connectivity issues that could potentially hinder the debugging process.
The document further emphasizes on the importance of checking the connection between the Windows and WSL environments, providing instructions on how to ensure that the connection is optimal and ready for remote debugging.
It also offers an in-depth guide on how to configure the WSL interpreter and files within the PyCharm environment. This is essential for ensuring that the debugging process is set up correctly and that the program can be run effectively within the WSL terminal.
Additionally, the document provides guidance on how to set up breakpoints for debugging, a fundamental aspect of the debugging process which allows the developer to stop the execution of their code at certain points and inspect their program at those stages.
Finally, the document concludes by providing a link to a reference blog. This blog offers additional information and guidance on configuring the remote Python interpreter in PyCharm, providing the reader with a well-rounded understanding of the process.
Batteries -Introduction – Types of Batteries – discharging and charging of battery - characteristics of battery –battery rating- various tests on battery- – Primary battery: silver button cell- Secondary battery :Ni-Cd battery-modern battery: lithium ion battery-maintenance of batteries-choices of batteries for electric vehicle applications.
Fuel Cells: Introduction- importance and classification of fuel cells - description, principle, components, applications of fuel cells: H2-O2 fuel cell, alkaline fuel cell, molten carbonate fuel cell and direct methanol fuel cells.
HEURISTICS-BASED QUERY OPTIMISATION SOLUTION IMPLEMENTATION IN RSP
ENGINES: THE CQELS AND C-SPARQL
Submitted in fulfilment of the requirements for the degree of Master of Science
Supervisor:
Co-supervisor:
The Insight Centre for Data Analytics, National University of Ireland, Galway
September, 2016
Abstract
This thesis examines the case for building the query optimisation process of RDF stream processing engines around an efficient heuristics engine. The Resource Description Framework (RDF) has taken the world by storm and has become the gold standard for processing and communicating real-time data streams collected from medical institutions, industrial plants, financial entities, and telecommunication service providers. For instance, DBpedia and YAGO help reinforce structured querying in Wikipedia searches by retrieving metadata and encoding it in RDF format. Likewise, biological information such as experiments and their distinctive results is stored as RDF data collections to enable efficient communication between chemists and biological specialists.
The data streaming framework has been brought to prominence by Tim Berners-Lee's invention of the Semantic Web, which streams linked data from source documents and applications and thus serves users with precise web pages. However, the query optimisation performed in both of these query languages is still somewhat deficient with regard to the time expended before search results are delivered. The execution of flawed queries is another worrying factor in the query optimisation function of RSP engines. All of these elements (lengthy run times, extravagant computational costs such as join operations, and the execution of inaccurate queries) contribute to the degradation of RDF stream processing.
Heuristics help identify early error signs in user queries and resolve them using built-in configurations and algorithms. The novel heuristics optimisation model can be used as a benchmark for querying Semantic Web metadata in departments such as military logistics, data warehousing, engineering analysis, and health care. Some of the main
contributions of this research work include: (i) deploying a reference implementation on the existing CQELS and C-SPARQL execution frameworks; (ii) extending the two RSP engines (CQELS and C-SPARQL) so that processing and resource space can be shared among multiple concurrent queries; and (iii) evaluating the performance of the extended RSP engines against the first released CQELS and C-SPARQL engines. The evaluation results show a remarkable improvement in performance and demonstrate the practicality of the approach used.
Table of Contents
Table of Contents........................................................................................................................................4
Chapter 1: Introduction...............................................................................................................................9
1.1 Motivation.........................................................................................................................................9
1.2 Problem Statement and Hypotheses...............................................................................................10
1.3 The Outcome of the Thesis..............................................................................................................14
1.3.1 Adaptive execution framework.................................................................................................14
1.3.2 The linked data stream adaptive processing model..................................................................14
1.3.3 Algorithms and data structures for triple-based windowing operator incremental evaluation15
1.3.4 The techniques for optimization for multiway joins.................................................................16
1.4 The Outline of This Thesis................................................................................................16
Chapter 2: The General Background..........................................................................................................17
2.1 Introduction.....................................................................................................................17
2.2 Comparative and Survey Evaluations...............................................................................................24
2.3 Query Optimization.........................................................................................................27
2.4 RDF Stream Processing and Semantic Web.....................................................................29
Chapter 3: Background to RSP Engines......................................................................................................32
3.1 C-SPARQL.........................................................................................................................................32
3.2 CQELS...............................................................................................................................32
3.2.1 Introduction..............................................................................................................................34
3.2.2 Proposed heuristics approach...................................................................................................37
3.2.3 Results simulation.....................................................................................................................43
3.2.4 The performance comparison graph between new improved model and the previous version
of CQELS and C-SPARQL.....................................................................................................................46
Chapter 4: State of The Art in LSDP or the Linked Stream Data Processing...............................................53
4.1 Query Semantics and Data Models..................................................................................................53
4.2 Data Model......................................................................................................................................53
4.3 Query Semantics..............................................................................................................................55
4.4 Query Languages.............................................................................................................................55
Chapter 5: The Optimization Solutions for the CQELS...............................................................................59
5.1 The Adaptive Optimizer...................................................................................................................65
5.2 The Dynamic Executor.....................................................................................................................67
Chapter 6: Exploration of the RDF Engine – Continuous C-SPARQL...........................................................69
Chapter 7: Adaptive Query Optimiser in RDF Engines...............................................................................74
7.1 Adaptive Query Optimiser...............................................................................................................74
7.2 Multiway Joins Adaptive Cost-based Optimisation..........................................................................74
7.3 Shared Window Joins Optimisation.................................................................................................76
7.4 Multiple Join Operator.....................................................................................................................76
7.5 Features of Adaptive Query Optimization.......................................................................................78
7.6 Adaptive Plans Concepts..................................................................................................................79
Chapter 8: Conclusion and Future Work....................................................................................................81
8.1 Conclusion.......................................................................................................................................81
8.2 Future Work.....................................................................................................................................84
References.................................................................................................................................................87
List of Figures
Figure 1: Semantic Web processing...........................................................................................................29
Figure 2: Query flow through a DBMS.......................................................................................................37
Figure 3: Binary tree..................................................................................................................................38
Figure 4: Magic tree...................................................................................................................................39
Figure 5: Cost versus time graph...............................................................................................................45
Figure 6: Performance versus complexity..................................................................................................46
Figure 7: Graphical performance comparison...........................................................................................48
Figure 8: An architecture of the C-SPARQL engine....................................................................................72
List of Tables
Table 1: Algorithm 1..................................................................................................................................40
Table 2: Algorithm 2..................................................................................................................................42
Table 3: Query 1........................................................................................................................................44
Table 4: The Performance Comparison by Features..................................................................................47
Table 5: Performance Comparison by the Mechanism of Execution.........................................................47
Summary
This work explores query optimisation solution implementation in two RSP engines, namely CQELS and C-SPARQL. The framework presents one of the continuous query languages compatible with SPARQL, defined over both linked data and linked stream data. In practice, the framework is very flexible, enabling performance gains of various orders of magnitude over related systems. An efficient hybrid physical data organisation, using a novel data structure with supporting algorithms, helps to deal with high-update-throughput RDF streams and large RDF datasets. Additionally, the framework provides for various adaptive optimisation algorithms. This thesis also presents extensive experimental evaluations demonstrating the performance advantages of the CQELS and C-SPARQL processing engines and framework. Furthermore, these assessments cover a comprehensive set of parameters that play a significant role in dictating the performance of continuous queries over both linked data and linked stream data.
Chapter 1: Introduction
The primary purpose of this research study is to explore the case for building the query optimisation process of RDF stream processing engines around an efficient heuristics engine. Accordingly, this introduction starts with the motivation. Afterward, it discusses the problem statement and hypotheses. Next, the chapter touches on the thesis outcome, and lastly presents the thesis outline.
1.1 Motivation
It is crucial to note that the world is currently witnessing a paradigm shift (Abdulla and Matzke 2006, p.29). Real-time data, and data that depends on such time, continues to become ubiquitous (MacLennan and Tang 2009, p.61). Until a few years ago, little was known about sensor devices (Mueller 2009) such as compasses, cameras, mobile phones, GPS receivers, and accelerometers. Weather-observation stations measuring humidity, temperature, and so forth produce an ever-growing quantity of information in the form of data streams (Cheung, Hong, and Fong 2006, p.55). Furthermore, patient-monitoring systems that track blood pressure, heart rate, and the like, together with location-tracking systems such as RFID and GPS, play a vital role in this process. Building management systems that record environmental conditions and energy consumption, and cars that monitor both driver and engine (Abdulla and Matzke 2006), show a similarly tremendous increase in the production of such information (Cole and Conley 2009, p.53). In addition, the web offers several services, including Facebook, Twitter, and
some blogs, that deliver streams of typically unstructured real-time data on various topics.
1.2 Problem Statement and Hypotheses
In practice, the motivation for this thesis leads to the larger research problems that arise when building an efficient Linked Stream Data query processing engine. One major problem is how to design a new declarative query language. According to research (e.g. Abdulla and Matzke 2006, p.145; Buchanan and Shortliffe 1984, p.99), this problem arises because neither SPARQL nor the state-of-the-art continuous query languages can be used to query Linked Stream Data. In practice, a query language requires sound semantics and a formal data model of continuous query operators (MacLennan and Tang 2009, p.187). The data model must be able to represent both Linked Data and Linked Stream Data in a unified view. The new data model must therefore extend the Resource Description Framework model to allow a transparent integration of conventional Resource Description Framework databases (Zhang and Kollios 2007, p.85). Continuous query processing also requires a temporal aspect of the data that no Resource Description Framework extension has covered before (MacLennan and Tang 2009, p.22). Alongside the data model, graph-based query operators with continuous semantics must be defined to specify the meaning of the declarative query patterns (Buchanan and Shortliffe 1984, p.163). Notably, to reduce the learning effort, it is important that the query patterns resemble SPARQL, which requires aligning the query operators with the semantics of SPARQL. Additionally, this kind of alignment must be
compatible with window operations as defined in traditional continuous query languages, for example CQL.
Given the disadvantages of using unmodified triple stores and data stream management systems (DSMSs) for Linked Stream Data, Resource Description Framework based stream data raises new issues for the physical organisation of both Linked Data and Linked Stream Data (MacLennan and Tang 2009, p.149). The standard storage model is a triple table holding identifiers that represent literals and URIs (Abdulla and Matzke 2006, p.109), combined with dictionary-like mapping tables that translate those identifiers back into lexical form (Cole and Conley 2009, p.203). Linked Stream Data necessitates a high write throughput; such storage, on the other hand, is designed for read-intensive contexts (Zhang and Kollios 2007, p.148). DSMSs remedy the write-intensive requirement with in-memory storage, but Linked Stream Data entails Linked Data that cannot always be hosted in main memory (Cole and Conley 2009, p.209). Furthermore, Resource Description Framework based data elements such as RDF triples and temporal RDF triples are very small; in effect, they present an enormous number of individual data points in comparison to the quantity of encoded information.
In practice, the row-based data structures used in relational DSMSs are not efficient enough, since their tuple headers can dominate the total storage size (Cole and Conley 2009, p.211). Row-based structures designed for shorter, wider tables can also raise the cost of stream processing significantly. In effect, a new physical organisation is needed for processing both Linked Data and Linked Stream Data (Buchanan and Shortliffe 1984, p.92). Resource Description Framework based continuous query operators typically operate on one or a few very large tables (MacLennan and Tang 2009), so indexes for random access to data items play a vital role. Most modern Resource Description Framework stores provide a massive indexing strategy to overcome this handicap (Cole and Conley 2009, p.173); since the indexes cover all access patterns, the tables themselves can always be bypassed. Notably, however, a comprehensive indexing scheme has a very high maintenance cost, making it impractical for stream processing. In addition, some stream data indexing solutions might appear helpful, but their designs make them applicable only to relational streams (Abdulla and Matzke 2006). In effect, investigating hybrid solutions that combine the indexing strategies of stream data processing and triple stores forms an interesting problem (Cole and Conley 2009, p.239). Additionally, another issue associated with the physical representation of Resource Description Framework based stream data is how to evaluate window operators efficiently over the unbounded nature of streams.
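To make the interplay between unbounded streams and window operators concrete, the sketch below (our own illustration, not code from any engine) implements a time-based sliding window that retains only the triples whose timestamps fall within the last `range_seconds`:

```python
from collections import deque

class SlidingWindow:
    """Time-based sliding window over a stream of timestamped RDF triples."""
    def __init__(self, range_seconds):
        self.range = range_seconds
        self.buffer = deque()  # (timestamp, triple), ordered by arrival

    def insert(self, timestamp, triple):
        self.buffer.append((timestamp, triple))
        self._evict(timestamp)

    def _evict(self, now):
        # Drop triples that have fallen out of the window [now - range, now].
        while self.buffer and self.buffer[0][0] < now - self.range:
            self.buffer.popleft()

    def contents(self):
        return [t for _, t in self.buffer]

w = SlidingWindow(range_seconds=10)
w.insert(0, ("s1", "p", "o1"))
w.insert(5, ("s2", "p", "o2"))
w.insert(12, ("s3", "p", "o3"))   # evicts the triple that arrived at t=0
print(w.contents())  # [('s2', 'p', 'o2'), ('s3', 'p', 'o3')]
```

Because the buffer is ordered by arrival time, eviction only ever inspects the head of the queue, which is what keeps window maintenance cheap even on a fast stream.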
Several attempts have been made in DSMSs to support sliding-window queries (Cole and Conley 2009, p.243). One such effort re-evaluates each window independently of all other windows; this process is referred to as re-evaluation computation (Abdulla and Matzke 2006, p.199) and is used in both Aurora and Borealis. Another method, incremental evaluation computation, processes only the changes, namely the tuples that expire from and are inserted into the windows in the query pipeline (MacLennan and Tang 2009, p.272); this approach is used in Nile and STREAM. Incremental evaluation methods nevertheless have some shortcomings (Cole and Conley 2009, p.287). The two main techniques are negative tuples and direct timestamps: the negative-tuple method doubles the number of tuples flowing through the query pipeline, while the direct-timestamp method requires extra timestamps. In practice, with the new data structures introduced in this thesis, the associated algorithms for computing windowing operators must always address these unusual characteristics of the data.
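The negative-tuple technique mentioned above can be sketched as follows (a simplified illustration with our own naming, not the Nile or STREAM implementation). Each expiration is propagated downstream as an explicit signed tuple, so a downstream operator, here a simple count, can update its state incrementally; the extra negative tuples are also what doubles the traffic through the pipeline:

```python
from collections import deque

def window_with_negative_tuples(events, range_seconds):
    """Yield ('+', triple) on insertion and ('-', triple) on expiration,
    so downstream operators can update their state incrementally."""
    buffer = deque()
    for ts, triple in events:
        while buffer and buffer[0][0] < ts - range_seconds:
            _, expired = buffer.popleft()
            yield ('-', expired)       # negative tuple: retract from state
        buffer.append((ts, triple))
        yield ('+', triple)            # positive tuple: add to state

# A downstream incremental count consumes the signed stream:
count = 0
stream = [(0, "a"), (5, "b"), (12, "c")]
for sign, _ in window_with_negative_tuples(stream, range_seconds=10):
    count += 1 if sign == '+' else -1
print(count)  # 2 triples remain in the 10-second window
```

Note that the count is never recomputed from scratch; each signed tuple adjusts it by one, which is precisely the saving incremental evaluation buys at the cost of the extra retraction traffic.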
A Resource Description Framework triple store has exceptionally long, thin tables for which standard optimisations do not apply (Cole and Conley 2009, p.368). It is therefore quite challenging for traditional DSMSs to provide statistics relevant to a query optimiser, and the same challenge applies to processing Linked Data and Linked Stream Data. Maintaining statistics over highly dynamic datasets is even harder in a stream processing setting (Cole and Conley 2009, p.394). Most importantly, adaptive query optimisation for this type of continuous query processing becomes harder still, owing to the unpredictability of Resource Description Framework data and the dynamic nature of stream data distributions (MacLennan and Tang 2009, p.400). Moreover, SPARQL-like queries often share query patterns, which raises multi-query optimisation requirements (Cole and Conley 2009, p.386). Although there have been several efforts in multi-query optimisation, some of the approaches proposed for relational streams fail to work on Resource Description Framework based streams (Abdulla and Matzke 2006, p.397), largely because this setting differs in nature from the relational one (Zhang and Kollios 2007, p.391). In effect, enabling multi-query optimisation for Linked Data Streams is very challenging.
1.3 The Outcome of the Thesis
In light of the issues stated above, the outcomes of this thesis include:
1.3.1 Adaptive execution framework
This framework enables adaptivity in the RSP engines CQELS and C-SPARQL (Abdulla and Matzke 2006, p.402). It allows full control of the execution process, with the flexibility to add new algorithms and new data structures to the query engine component (MacLennan and Tang 2009, p.433). Essentially, the framework uses encoding mechanisms so that operators have a small footprint and a lighter workload, operating only on fixed, small-size integers (Buchanan and Shortliffe 1984, p.266). A Linked Data caching solution for subqueries improves the performance and scalability of query processing over collections of Linked Data (Zhang and Kollios 2007). In practice, with the proposed caching mechanism, the framework can address the scalability problem of integrating large static datasets.
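The encoding mechanism mentioned above, which lets operators work only on fixed, small-size integers, can be sketched with a simple bidirectional dictionary (a hypothetical illustration; the actual CQELS and C-SPARQL encoders are more elaborate):

```python
class Dictionary:
    """Bidirectional mapping between RDF terms (URIs/literals) and integer ids."""
    def __init__(self):
        self.term_to_id = {}
        self.id_to_term = []

    def encode(self, term):
        # Assign the next id on first sight; reuse it afterwards.
        if term not in self.term_to_id:
            self.term_to_id[term] = len(self.id_to_term)
            self.id_to_term.append(term)
        return self.term_to_id[term]

    def decode(self, ident):
        return self.id_to_term[ident]

d = Dictionary()
triple = ("http://example.org/Berlin", "http://example.org/isCapitalOf",
          "http://example.org/Germany")
encoded = tuple(d.encode(t) for t in triple)   # operators see only small ints
print(encoded)                                  # (0, 1, 2)
print(d.decode(encoded[0]))                     # http://example.org/Berlin
```

Joins and comparisons over integer ids are far cheaper than over long URI strings; the lexical form is only reconstructed when results are delivered to the user.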
1.3.2 The linked data stream adaptive processing model
This thesis proposes an adaptive processing model comprising a formal definition of the query semantics, the data model, and the execution model (Cole and Conley 2009, p.437). The data model covers the temporal aspects of both Linked Data sets and Linked Stream Data, which had not previously been addressed (Zhang and Kollios 2007, p.434). The query semantics is formalised with both mathematical and operational meanings. The mathematical meaning shows how a declarative query fragment maps to mathematical expressions (Cole and Conley 2009, p.441), and abstract syntaxes accompany the query fragments to define a declarative query language extending SPARQL (Buchanan and Shortliffe 1984, p.280; Zhang and Kollios 2007, p.404). The operational meaning, on the other hand, defines how the operators in these expressions are executed in physical execution plans (MacLennan and Tang 2009, p.432). The operational semantics thus provides a performance model for the continuous execution of equivalent execution plans for a query expressed in the CQELS and C-SPARQL languages (Cole and Conley 2009, p.470). This operational feature facilitates the adaptivity of execution engines based on the processing model (Zhang and Kollios 2007, p.355), because the execution engine can dynamically switch from the current execution plan to an equivalent one in order to adapt to run-time variations (MacLennan and Tang 2009, p.446). In short, CQELS is both one of the first query languages for Linked Stream Data and the only one accompanied by sound mathematical and operational semantics.
1.3.3 Algorithms and data structures for triple-based windowing operator incremental
evaluation
This thesis introduces novel operator-aware data structures, together with efficient incremental evaluation algorithms, to deal with the unusual properties of RDF streams and their query patterns (Cole and Conley 2009, p.422). These data structures are designed to handle the intermediate mappings and small data items contained in the processing state. They provide indexes with low maintenance cost that support high-throughput probing operations, which are useful in various operator implementations (Abdulla and Matzke 2006). For this kind of data, several algorithms are proposed to enable incremental evaluation of basic operators, including duplicate elimination, join, and aggregation (MacLennan and Tang 2009, p.453). In short, these algorithms aim to overcome the typical issues involved in the incremental evaluation of windowing operators.
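As a rough illustration of a low-maintenance index that supports high-throughput probing (our own sketch with hypothetical names, not the thesis's data structures), consider hash indexes kept per triple position: insertion costs three hash appends, and a join operator can probe by subject in constant time.

```python
from collections import defaultdict

class TripleIndex:
    """Hash indexes over subject, predicate, and object for O(1) probing."""
    def __init__(self):
        self.by_s = defaultdict(list)
        self.by_p = defaultdict(list)
        self.by_o = defaultdict(list)

    def insert(self, triple):
        # Maintenance cost: three hash-table appends per arriving triple.
        s, p, o = triple
        self.by_s[s].append(triple)
        self.by_p[p].append(triple)
        self.by_o[o].append(triple)

    def probe_subject(self, s):
        # Probe used by a join operator to find partners sharing subject s.
        return self.by_s.get(s, [])

idx = TripleIndex()
idx.insert(("s1", "knows", "s2"))
idx.insert(("s1", "likes", "s3"))
print(len(idx.probe_subject("s1")))  # 2
```

A full engine would also need deletion on window expiry; keeping the per-key lists small (as window semantics naturally does) is what keeps that maintenance cheap.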
1.3.4 The techniques for optimization for multiway joins
In essence, this thesis explores the use of adaptive optimisation techniques to improve the performance of multiway joins (Abdulla and Matzke 2006, p.456). It is important to note that this is one of the most expensive query operators in the query pipeline (Cole and Conley 2009, p.472). Practically, an adaptive cost model is useful in designing two adaptive algorithms for the dynamic optimisation of a multiway join query.
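A minimal sketch, with invented names, of what "adaptive" means here: re-deriving the probe order of a multiway join from the current window cardinalities. The greedy rule below (probe the smallest windows first, so intermediate results shrink early) is one common heuristic, not the thesis's specific algorithm.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a greedy heuristic for ordering the probes of a
// multiway join. The engine can re-run this whenever window cardinalities
// drift, so the join plan adapts to the stream at runtime.
public class AdaptiveJoinOrder {
    // Given current window sizes, probe the smallest windows first: each
    // probe then filters the intermediate result as cheaply as possible.
    public static List<String> probeOrder(Map<String, Integer> windowSizes, String driver) {
        List<String> order = new ArrayList<>(windowSizes.keySet());
        order.remove(driver);                                   // the driver stream triggers the join
        order.sort(Comparator.comparingInt(windowSizes::get));  // smallest cardinality first
        return order;
    }
}
```

For example, if the triggering stream is joined against a 10-triple window and a 200-triple window, probing the 10-triple window first bounds the intermediate result at a fraction of the cost of the reverse order.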
1.4 The Outline of This Thesis
The organisation of the remaining part of this thesis is as follows. Chapter 2 explores the general background on Linked Data processing and stream processing. Chapter 3 presents the background to the RSP engines (CQELS and C-SPARQL). Chapter 4 touches on the state of the art in Linked Stream Data Processing (LSDP). Chapter 5 explores the optimisation solutions for CQELS. Chapter 6 mainly explores the continuous RDF query engine C-SPARQL. Chapter 7 evaluates the RSP engines framework, and finally, Chapter 8 concludes the thesis and points to future work.
Chapter 2: The General Background
This chapter explores the background techniques and concepts for Linked Data processing and stream processing. Additionally, it covers the fundamentals of stream processing that are applicable to Linked Stream Data. In short, the chapter discusses the representation of continuous semantics, basic techniques and models, the operators and methods of optimisation, and the handling of issues such as memory overflow and time management (MacLennan and Tang 2009). In addition, the chapter presents the definition of the semantics of the Resource Description Framework data model and SPARQL queries, together with the relevant notation. This general background also gives an overview of how Resource Description Framework data are stored and queried using SPARQL.
2.1 Introduction
The term ‘heuristic’ derives from the Greek for ‘discover’ or ‘find’ (Calhoun and Riemer 2001). Heuristics is a common practice applied across industry fields for observing, learning, and spotting malware, errors, and other problems on the basis of experience. For example, a well-modelled heuristic technique is used in antimalware programs to learn and spot computer threats such as Trojan horses, viruses, and worms. The learning and observation aspect of a heuristic framework operates by scanning computer documents and capturing the signatures that differentiate them (Chen 2009). After reading the unique signatures of computer files, such as tiny macros, find commands, or even subroutines, the heuristic uses its memory and experience to identify previously seen threats.
According to CIKM 2006 Workshops (2006), heuristics entail a suite of rules geared towards enhancing the probability of identifying and ironing out problems in a given structure.
When applied in computer science, a heuristic is considered an algorithm engineered to present viable solutions to glitches arising in a given scenario. The heuristics discipline generally examines how information is studied, captured, and discovered. When engaged in artificial intelligence, computer science, or mathematical optimisation, heuristic engines work to solve problems quickly and efficiently when the conventional methods break down, are not fast enough, or fail to calculate accurate solutions (Cheung et al. 2006, p. 49). If the heuristic path is chosen after the failure of conventional methods, it acts as a shortcut that speeds up the process. As Cohen (1985) says, heuristics can either work in isolation, generating solutions by themselves, or in combination with optimisation algorithms, all geared towards increasing the RSP engine's effectiveness (Gedik 2006). The more advanced heuristics thoroughly inspect and then trace the guidelines put in a program's code prior to passing it to the computer's processing unit for execution. This helps the heuristics engine assess and learn the behaviour of that program while it runs in a virtual setting.
The current querying strategies in CQELS and C-SPARQL waste a lot of valuable time executing incorrect and inept queries that may be keyed in by end users who are not familiar with the intricate querying syntax, as Gore (1964) observes. Much as the database servers within the CQELS and C-SPARQL systems may recognise these inefficient queries, the end users and internet browsers are not aware of the incorrectly stated queries and hence may continue running them. As this happens, the overall performance and speed of the language engines is incrementally impaired, so fewer data retrievals are executed per unit of time. In a bid to resolve the system slowdown, users opt to refer the issue to the DBA to help them code efficient queries; this DBA consultation results in further time wastage. This is where the incorporation of a heuristics engine comes into play. By assimilating a heuristic function into the querying of the CQELS and C-SPARQL languages, a substantial amount of time and querying effort will be saved (Cheung et al. 2006, p. 57). The heuristic function will serve as a query optimiser that skims through the input user query, inspecting it thoroughly to highlight and remove any detected errors. According to McIlroy (1998), unlike the DBA, which recognises the lapses in the queries yet does nothing about them, the heuristic will automatically produce an equivalent but highly optimised query. By spotting and rectifying the inaccuracies inherent in queries input by end users, the heuristic function discards both the time-consuming execution of inaccurate queries and the time expended consulting the DBA for viable solutions. In this way, system productivity and throughput will always be on an upward curve. Access frequency will drop, as the heuristic will reduce and in some cases eliminate the number of tuples and columns browsed; hence data stream processing and querying accuracy will improve.
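Two representative rewrite rules of this kind can be sketched in Java. This is a hypothetical illustration only: the triple patterns are plain strings, whereas a real optimiser for CQELS or C-SPARQL would operate on the parsed query algebra, and the selectivity proxy used here (fewer variables means fewer matches) is an invented simplification.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical sketch of the heuristic query-rewriting step described above.
public class HeuristicRewriter {
    // Rule 1: drop duplicate triple patterns an inexperienced user may repeat;
    // evaluating the same pattern twice only multiplies intermediate results.
    public static List<String> dedupe(List<String> patterns) {
        return new ArrayList<>(new LinkedHashSet<>(patterns)); // keeps first-seen order
    }

    // Rule 2: evaluate the most selective-looking patterns first. As a crude
    // proxy for selectivity, patterns with fewer variables (tokens starting
    // with '?') are assumed to match fewer triples.
    public static List<String> sortBySelectivity(List<String> patterns) {
        List<String> out = new ArrayList<>(patterns);
        out.sort((a, b) -> Integer.compare(countVars(a), countVars(b)));
        return out;
    }

    private static int countVars(String pattern) {
        int n = 0;
        for (String tok : pattern.split("\\s+")) if (tok.startsWith("?")) n++;
        return n;
    }
}
```

Applied before execution, such rules cost microseconds but can save the engine from evaluating redundant or badly ordered patterns over every window of the stream.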
An effort to integrate these kinds of information sources would enable a broad range of near-real-time applications in areas such as green information technology, smart cities, and e-health (Cole and Conley 2009, p.19). However, harvesting such data remains a labour-intensive and difficult task due to the heterogeneous nature of the vast streams; in essence, the process requires many hand-crafted methods. Worth noting, the remedy for this scenario is the application of the Resource Description Framework (RDF) data model (Schreiber 1977, p.38). In practice, this data model lets one express knowledge in a generic way, and it does not require adherence to a particular schema (MacLennan and Tang 2009, p.67). Efforts are under way, by the semantic sensor/stream community and the W3C semantic web incubator group, to lift stream data to the semantic level (Maringer 2005). Essentially, the primary goal of this process is to make stream data available according to the principles of Linked Data; this concept is referred to as Linked Stream Data (Schreiber 1977, p.103). Ordinarily, Linked Data facilitates data integration among heterogeneous collections (Buchanan and Shortliffe 1984). Linked Stream Data has similar goals with respect to data streams (Schreiber 1977, p.89), and it also assists in bridging the gap between stream sources and the more static data sources.
Besides a unified model of data representation, there is also a requirement for a processing engine that can support continuous queries over both Linked Data and Linked Stream Data (Cole and Conley 2009, p.107). In classical Linked Data processing, there is always an assumption that data are stored in a centralised repository and change infrequently before additional processing (MacLennan and Tang 2009, p.102). According to research (e.g. Zhang and Kollios 2007, p.51), updates on a dataset are usually limited to just a small fraction of that dataset; additionally, updates happen infrequently, and in some cases the database is simply replaced by a new version.
Traditional relational databases follow a ‘one-time’, ‘pull’ model (Schreiber 1977, p.139): a query is executed after reading the data from disk, and the output is a set of results valid at that single point in time (Cole and Conley 2009, p.137). Linked Stream Data, on the other hand, produces new items continuously; the data are only valid within a window and are consistently pushed to the query processor (Buchanan and Shortliffe 1984, p.99). In practice, queries are registered only once and then evaluated continuously over time against the changing dataset; in short, queries are continuous (MacLennan and Tang 2009, p.139). In effect, the appearance of new data triggers updates of the continuous query results (Abdulla and Matzke 2006, p.97). It is important to note that this continuity of continuous queries and the temporal aspect of Linked Stream Data are not both considered in the current processing engines for Linked Data queries (Cole and Conley 2009, p.148). Worth noting, Data Stream Management Systems (DSMSs) seem to be better candidates for processing continuous queries (Zhang and Kollios 2007, p.167). Ordinarily, a DSMS could be used as a sub-component that deals with the stream data; in practice, the only problem is that no traditional DSMS supports the Resource Description Framework, which makes a data transformation step necessary (Schreiber 1977, p.108). However, in most cases, the overhead of such data transformation can be very costly in the low-latency context of stream data processing (Sims and Yocom 2008, p.109). Furthermore, delegating processing to a sub-system such as a DSMS means losing full control over query execution (Cole and Conley 2009, p.145); moreover, optimisation can only be done locally within each subsystem (Schreiber 1977, p.143). In this case, since the subsystem is used as a black box, it cannot be optimised for the query patterns, the data model, or the data distribution.
According to research (e.g. Buchanan and Shortliffe 1984, p.152), the difficulty of predicting the structure of Resource Description Framework graphs poses challenges for traditional DSMSs; moreover, they cannot effectively scale to large quantities of RDF data (Schreiber 1977, p.154). Worth noting, this difficulty of prediction also applies to RDF-based data streams (Sims and Yocom 2008, p.151), which makes them tough for DSMS optimisers to handle. It is also necessary to note that these DSMS optimisation problems have only been solved in some ad-hoc and restricted scenarios (Cole and Conley 2009, p.162); a good number of areas still present open problems and challenges (MacLennan and Tang 2009, p.173). In addition, most of the optimisation algorithms are heuristics, and they prove to work only for certain kinds of data and queries.
In essence, these facts played a significant role in motivating me to develop a heuristic-based optimisation solution for two RSP engines (C-SPARQL and CQELS), implemented in Java and starting from a naïve optimisation idea (Sims and Yocom 2008, p.182). In practice, my approach aims to build high-performance processing engines for Linked Stream Data by combining algorithms, re-engineered efficient data structures, and techniques from both traditional DSMSs and Linked Data processing. According to several studies (such as Abbass and Newton 2002, p.135; Sims and Yocom 2008, p.127), it is not good practice to store RDF data elements in relational tables; rather, careful design of the indexing schema and physical storage plays a vital role in the performance of triple stores (Schreiber 1977, p.94). It is now important to note that this approach aims to design native data structures that treat both RDF and RDF stream data elements as first-class citizens (Cole and Conley 2009, p.142). Most importantly, because the data change continuously during the lifetime of a query, the processing must be adaptive.
Such adaptivity requires the introduction of an adaptive execution framework known as Continuous Query Evaluation over Linked Streams (CQELS) (Cole and Conley 2009, p.177). It is important to note that this framework is designed to apply adaptive processing techniques to meet the performance requirements of stream processing (Buchanan and Shortliffe 1984, p.103; Zhang and Kollios 2007, p.171). Moreover, the framework allows full control over the continuous execution process, where both optimisation and scheduling can take place at runtime (Schreiber 1977, p.67). In the process, I had to create a new continuous query language as one of the first works in Linked Stream Data processing (Cole and Conley 2009, p.191). Worth noting, the evaluation of Linked Stream Data processing engines and the first survey conducted during this thesis help provide insight into how to build an efficient Linked Stream Data engine.
In this thesis, we advance the integration of a heuristics engine into the query optimisation of CQELS as well as C-SPARQL to augment query execution and data stream processing. The query optimisation operations will be performed by Java code that serves as the optimiser in both RSP engines; implementing the optimiser in Java will speed up operations and the query optimisation function in general. As MacLennan and Tang (2009) claim, the code will allow end users to express their queries unambiguously and will reduce imprecise query input. This will also help cut the incremental costs of the computation process with regard to the projection, selection, and join functionalities, as well as other cost factors such as processor and communication time. As the data and ontology constituents of Web 3.0 have stabilised through the assimilation of gold standards such as OWL and RDF, the optimisation and implementation of heuristics-based querying is next on the to-do list.
The assimilation and implementation of the heuristic utility is outlined in this thesis as follows. Section 1 discusses how heuristics can be employed in query optimisation to minimise the pertinent costs. In the proposed heuristic algorithm, a query is scanned and executed using the magic trees in the storage files, which demonstrates significant progress over previous optimisation approaches. The cost-based algorithm shows that the system's enhancement continues to improve as the query becomes more interlaced and dense, as the user performs more intricate searches. Section 2 discusses how heuristics can be enlisted in the Java code to significantly reduce erroneous query executions by automatically recognising and amending inefficiencies in CQELS and C-SPARQL queries. The detection and rectification of flaws within the queries will consequently save the huge amounts of time and effort expended by the RSP engines in retrieving information, thus enhancing the overall throughput and productivity of the engines. Section 3 demonstrates the competency of heuristics to execute queries without involving join operations. The exclusion of join operations in query optimisation will help to shrink operational costs in addition to making the RDF data volume less bulky. The empirical results confirm that the proposed heuristic model outperforms conventional querying techniques, for example Jena, by 79% with regard to the reduction of pointless intermediate results and faster query processing time.
2.2 Comparative and Survey Evaluations
Essentially, the first experiments and survey are helpful in giving comparisons of, and insight into, the techniques of data stream processing and the Linked Stream Data processing engines (Abdulla and Matzke 2006, p.487; Zhang and Kollios 2007, p.378). Additionally, the first cross-system evaluation of Linked Stream Data processing engines is presented.
A scenario that integrates human-centric streaming data from the digital and physical worlds, similar to Live Social Semantics, is a direct inspiration (MacLennan and Tang 2009, p.474). Worth noting, data from the physical world are captured and streamed through tracking systems and sensors such as wireless triangulation, RFID, and GPS, and the integration can be done with virtual streams such as city traffic data, Twitter feeds, and airport information to deliver up-to-date views or location-based services for any particular situation (Cole and Conley 2009, p.479). Furthermore, the conference scenario mainly focuses on the problem of data integration between the streams of data from a tracking system and a static data set (Abdulla and Matzke 2006). Worth noting, the tracking system, similar to the various real deployments in Live Social Semantics, is used to gather the relationship between physical spaces and the real-world identifiers of the conference attendees. Moreover, the non-stream datasets, for example online information about the attendees such as online profiles, social networks, and publication records, are used to correlate with the tracking data (Cole and Conley 2009, p.482). In essence, there exist several benefits of correlating the two sources of information (MacLennan and Tang 2009, p.453). Most importantly, conference rooms could be automatically assigned to talks based on the total number of people who might show interest in attending, inferred from the topic of the talk and the attendees' profiles (Cole and Conley 2009, p.491). In addition, conference attendees could be notified about co-authors found within the same location (Abdulla and Matzke 2006, p.423; Buchanan and Shortliffe 1984, p.403; Zhang and Kollios 2007, p.348). It is also easy to imagine a service that recommends which talk to attend based on citation records, profiles, and the distance between the locations of the talks.
In practice, the social stream data of interest for a user are spread among various social application platforms such as Twitter, Facebook, Foursquare, and so on (MacLennan and Tang 2009, p.496). Additionally, social network analysis and aggregation platforms such as Bottlenose require an integration of heterogeneous streams from various feeds and social networks (Abdulla and Matzke 2006, p.437; Buchanan and Shortliffe 1984, p.428). Most importantly, these platforms could easily use Linked Stream Data processing engines to deal with the issues of data integration (Cole and Conley 2009, p.504). In the same context, this scenario focuses on the different social stream sources that social network users create (MacLennan and Tang 2009, p.511). Another important thing to note is that social networks provide rich resources of interesting stream data, including photo uploads and sequences of social discussions (Cole and Conley 2009, p.521). Additionally, social networks are considered the best test area for Resource Description Framework engines; furthermore, RDF can also exhibit its merits in representing graph data (MacLennan and Tang 2009, p.527). Ordinarily, skewed data distributions, which mostly occur in social network data, correlate with real-life data. Moreover, the efficient handling of such correlations is recognised as a very difficult problem by database engines (Abdulla and Matzke 2006, p.484; Buchanan and Shortliffe 1984, p.503; Zhang and Kollios 2007, p.509); on the other hand, it also opens up many opportunities for query optimisation (MacLennan and Tang 2009, p.539). In the context of this scenario, it becomes possible to build a data simulator that exploits different skewed data distributions and the correlations available in a social network (Abdulla and Matzke 2006, p.437). As a consequence, the data simulator is useful for generating realistic test cases to evaluate the Linked Stream Data processing engines.
It is important to note that various parts of this thesis have earlier been published as workshop, conference, and journal articles (MacLennan and Tang 2009, p.544). Furthermore, the first attempt at building a heuristic-based query optimisation solution for RSP engines was introduced in several studies (such as Abbass and Newton 2002), as was related work on stream processing engines and Data Stream Management Systems (MacLennan and Tang 2009, p.587). On the other hand, the RSP engines such as CQELS and C-SPARQL are readily described in studies (such as Abdulla and Matzke 2006, p.463; Buchanan and Shortliffe 1984, p.401; Zhang and Kollios 2007, p.409).
2.3 Query Optimisation
Maringer (2005) describes query optimisation as a pervasive querying function in a multitude of information systems and database frameworks. All query languages, be they structured (SQL) or unstructured (C-SPARQL and CQELS), enlist query optimisation functionalities to establish the shrewdest and most adept channel for executing a query that has been keyed in by a user. Such functionalities encompass query optimisers, such as PostgreSQL's or the Java code (Java Runtime Environment), that analyse and carefully assess SQL, C-SPARQL, or CQELS queries to establish the most effectual mechanism for query execution. The querying of database systems happens almost every other minute of the day, and thus query optimisation is just as frequent (Cheung et al. 2006, p. 64). Anyone browsing the internet doing either simple or complex research engages query optimisation in the Database Management Systems (DBMS) when requesting a piece of information from the respective databases. For example, if you are searching for a Social Security number, the financial statements of a company, or a country's demographics, or even trying to compute the average pay of all the civil workers in the Department of Agriculture in your regional state, you are querying the distinctive databases.
If, for instance, you are interested in investing in Ernst and Young LLP (a multinational audit firm), you will obviously want to find out how it is performing in the market and its overall productivity compared against other industry benchmarks. To locate such information, you will log in to the company's database system and request its financial statements, ratios, and key market/performance indicators. A query soliciting the financial ratios of Ernst and Young LLP will look like this: “find the consolidated balance sheet of Ernst and Young.” Before the balance sheet appears on your computer screen, a number of procedures occur, featuring a query plan. After you submit this query, the parser within the database parses it and then hands it over to the query optimiser, which hatches several query plans in accordance with their resource costs (Moustakas 1990). The most efficient plan, in terms of cost and time consumption, is chosen, after which the database server accesses the pertinent database data and produces the desired results.
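The plan-selection step just described can be reduced to its core in a short sketch. This is a hypothetical illustration: the plan names and cost figures are invented, whereas a real optimiser derives costs from statistics such as cardinalities, index availability, and I/O estimates.

```java
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: the optimiser prices every candidate plan for a query
// and keeps the cheapest one before handing it to the execution engine.
public class PlanChooser {
    public record Plan(String description, double estimatedCost) {}

    public static Plan cheapest(List<Plan> candidates) {
        return candidates.stream()
                .min(Comparator.comparingDouble(Plan::estimatedCost))
                .orElseThrow(() -> new IllegalArgumentException("no candidate plans"));
    }
}
```

For the balance-sheet query above, the candidates might be a full scan of the financial-statements table followed by a filter, versus a direct index lookup on the company name; the optimiser simply picks whichever it prices lower.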
The prime focus of the query optimisation function of databases is centred on expeditious and prompt query execution, so as to deliver the desired results in a flash (Mueller 2009, p. 34). Time consumption tops the list in determining the best query plan to solve a given query: any marginal time variance between alternative query plans will prompt the query optimiser to select the option that is fastest and consumes the least amount of time. However, the optimisation function is still lacking in time efficiency and conservation, as most querying processes involve redundant executions of intermediate results within the join operations. These join operations, together with other accompanying costs such as the projection and selection functionalities as well as processor time, downscale the communication time of the data results in addition to increasing the computational costs. As the selected query plan works, it makes use of various algorithms with which it collaborates to manipulate and combine tables of
Figure 1: Semantic Web processing
data from the database structure so as to produce the requested knowledge material (Nirmal 1990, p.388). These manipulations and combinations of data tables are called join operations and, in the retrieval of real-time streaming data such as financial statistics, they slow down the data streaming process. Additionally, the processing of the intermediate results needed in the join operations contributes to making the RDF data volume bulky, thus impeding operations and the engine's overall speed (Cheung et al. 2006, p.69). All these issues call for programmers to construct the query optimisation function of the RSP engines around a heuristic solution and to implement this solution to improve RDF stream processing.
2.4 RDF Stream Processing and the Semantic Web
The recent deployment of the semantic web in divergent industry sectors, such as logistic planning in military fields, engineering analysis, health care, and the life sciences, has proved its worth in data search automation and information technology upscaling. According to Zhang and Kollios (2007), the semantic web contributes to an instinctual and spontaneous web application that retrieves precise information from linked data sources. The application works by collecting, filtering, and sampling data items captured from different sensor plants and stored as ontologies in RDF formats (see Figure 1).
Proposed by Tim Berners-Lee in 2001, the Semantic Web (Web 3.0) has so far showcased some data processing differences between its data management and that of the original World Wide Web (Web 1.0). While Web 1.0 operates by abstracting away the physical storage and networking layers, Web 3.0 upgrades this tedious and seemingly slower process by further dismissing the document and application layers. Much as the search engines on the World Wide Web index a majority of the content stored on the Web, they still lack the instinctive capacity to select the articles and web pages that an end user really desires. Rather than connecting documents and data structures like Web 1.0, Web 3.0 capitalises on its metadata base and ever-evolving compilation of knowledge to connect facts and meaning. This is what enables the Semantic Web to build in the intuitiveness and self-description that help context-understanding programs find the exact pages that a user is looking for. As Sims and Yocom (2008, p.411) convey, Web 3.0 has gained its technological leverage over Web 1.0 through its cutting-edge means of data storage, querying, and information display. The data storage means incorporated in this new technique involves matching data sources to ontologies that are stored in a structured form in the Resource Description Framework (RDF). Unlike the natural text formats that Web 1.0 utilises for data storage and retrieval, the Semantic Web models the data items sourced from diverse sensor plants in a comprehensive descriptive language to make the query processes and information display easy and friendly for all Internet users.
As Abbass and Newton (2002) illustrate in their journal article, RDF comprises a descriptive structuring of data used for information exchange on the net. As the semantic metadata layer reads information from sensor plants, it filters and stores this information in a format that is easily readable by both the machine and the computer user. Engineered by the World Wide Web Consortium (W3C), RDF integrates the use of query languages and descriptive statements and predicates (e.g. has, is) to provide relevant information about web resources that a user may search for. For example, if you want to find out about the current U.S. president (a web resource), the underlying statement is “The U.S. has a current president in office.” As seen from this statement, there is an entity-relationship data model in the form of a subject-predicate-object expression. This model is the strategy used by RDF when representing information. Thus, RDF is a language that exhibits web data through minimally constraining, meaningful, and constructive expressions. To incrementally expand RDF's efficiency, we have to further advance the aspect of heuristics in the querying of RDF data stream processing engines.
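The subject-predicate-object model described above can be made concrete with a small example. The snippet below is written in Turtle, one of the W3C serialisations of RDF; the prefix and resource names are invented purely for illustration:

```turtle
@prefix ex: <http://example.org/> .

# Subject              Predicate                Object
ex:UnitedStates        ex:hasCurrentPresident   ex:PresidentInOffice .
ex:PresidentInOffice   ex:title                 "President of the United States" .
```

Each line is one triple; a query engine answers questions by matching graph patterns, with variables in any of the three positions, against such statements.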
Chapter 3: Background to RSP Engines
3.1 C-SPARQL
Barbieri et al. (2010, p. 20) define C-SPARQL as an advanced language, an extension of the SPARQL query language, that observes windows of recent triples of RDF data streams while simultaneously allowing the streams to flow. The continuous querying of streams by Continuous SPARQL (C-SPARQL) facilitates the interoperability of RDF formats and implements crucial applications that allow researchers to access the ever-evolving information of web resources. Wei (2011, p. 101) refers to C-SPARQL as an orthogonal extension of the conventional SPARQL grammar, making SPARQL a congruent component of C-SPARQL. C-SPARQL builds on SPARQL through its capability of combining static RDF with real-time streaming data for purposes of stream reasoning. Much as SPARQL has cemented its viability in querying RDF repositories, Barbieri et al. observe that it is still lacking in handling continuous, flowing data streams (Abbass and Newton 2002, p. 21). Stream-based data emitters encompassing stock quotations, click streams, and feeds emit real-time continuous information. However, SPARQL is limited in that it cannot store entire streams; therefore, Data Stream Management Systems (DSMSs) register consecutive queries in static forms. The invention of C-SPARQL is thus based on its capacity to merge static data with streaming data, a procedure that mobilises logical reasoning in real time over those large and noisy data streams.
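To make the window behaviour concrete, the following is a small C-SPARQL-style registered query. The stream IRI, prefix, and property names are invented for illustration; the overall shape follows the form described by Barbieri et al.:

```sparql
REGISTER QUERY RecentRoomOccupancy AS
PREFIX ex: <http://example.org/>
SELECT ?person ?room
FROM STREAM <http://example.org/streams/tracking> [RANGE 10m STEP 1m]
WHERE {
  ?person ex:detectedAt ?room .
}
```

The RANGE clause keeps only the last ten minutes of triples in scope, and STEP re-evaluates the query every minute, so results are pushed to the consumer continuously rather than pulled once, which is precisely what distinguishes C-SPARQL from one-time SPARQL evaluation.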
3.2 CQELS
According to Abbass and Newton (2002), the Continuous Query Evaluation over Linked
Streams (CQELS) constitutes an adaptive and instinctive schema for supporting Linked Stream
33. 33
Data, whose grammar derives from SPARQL 1.1, making the two compatible. This congruence gives CQELS a performance edge over other continuous query languages. CQELS was engineered around a white-box approach: it implements the required query operators natively, avoiding the overhead and restrictions of closed, black-box systems (Schreiber 1977). CQELS offers flexible, updatable execution structures, as its query processors continuously readjust to changes in the incoming data. Examples of such continuous queries appear in papers such as CF02, HFAE03, CDTW00, and ABB+02; these queries, however, are quite simple and applicable only to general-purpose event processing. This thesis proposes assimilating heuristics into the query execution of CQELS so that its operators can be continuously reordered, improving query applicability in complex situations, not just general-purpose ones.
Integrating a heuristics engine into the querying of RDF data streams is therefore fundamental to scaling up RDF stream processing, as it greatly shortens join operations. Besides reducing time consumption, the heuristics will also help spot and rectify flaws in the queries that users input while searching databases for useful information. In general, the heuristics functionality will play a double role in the query optimisation of RDF stream processing: first, to shrink the time spent processing intermediate results for join operations, and second, to discard errors contained in queries, thereby curbing flawed query execution and saving further time during query optimisation.
Section 1: Cost-Based Heuristics Optimisation Approach
3.2.1 Introduction
Consolidating heuristics into the query optimisation of RSP engines is a novel step. The heuristics implementations are geared towards cutting computational costs during query optimisation and the join operations executed within the C-SPARQL and CQELS languages. This section outlines in depth how the heuristics function helps minimise costs, estimated as the overall time spent by the optimiser selecting the most effectual query plan (tree) that will execute a given query in the least time possible, thus lessening CPU and input/output costs.
The CQELS and C-SPARQL DBMS optimisers endeavour to settle on a single, most feasible query plan for a given query statement. In query optimisation, pinning down a suitable plan depends on which alternative has the shortest duration and the lowest costs in terms of execution factors such as communication, processor, and input/output expenses. These costs are critical and receive utmost consideration during the selection of the ideal query plan tree (Abbass and Newton 2002). When a query is input into an RDF database, the Database Management System (DBMS) initiates a selection process to determine the most potent path that delivers results by the shortest route possible. The optimiser devises several path plans and chooses the most suitable one. All these candidate plans, when followed, output equivalent data; however, they differ in cost, specifically in how much time each plan consumes to finalise data retrieval and generate the data desired by the user or researcher, claims Abbass
and Newton (2002). The selection criterion hinges upon a critical question: Which path plan will
take the least time to reach and deliver the user information? The optimisation course revolves
around a myriad of circumstances such as how a query is stated, the access methods, the
information layout, and the data set size (Oracle Help Center 2016). The access frameworks are
quite influential in this stage of optimisation as they are the ones which dictate whether the data
should be accessed via index scans or full table scans. Suppose Path A requires an index scan estimated to take 2 minutes while Path B requires a full table scan estimated at 2.5 minutes; Path A will be chosen.
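The plan-selection step just described can be sketched as a simple cost comparison. This is a minimal illustration in Java (the language of the later simulation), where the plan names and minute estimates are the hypothetical figures from the text, not engine code:

```java
import java.util.*;

// Minimal sketch of cost-based plan selection: each candidate plan carries
// an estimated cost (here, minutes) and the optimiser keeps the cheapest.
public class PlanChooser {
    record Plan(String name, double estimatedMinutes) {}

    // Return the plan with the lowest estimated cost.
    static Plan cheapest(List<Plan> candidates) {
        return Collections.min(candidates,
                Comparator.comparingDouble(Plan::estimatedMinutes));
    }

    public static void main(String[] args) {
        List<Plan> plans = List.of(
                new Plan("Path A (index scan)", 2.0),
                new Plan("Path B (full table scan)", 2.5));
        System.out.println(cheapest(plans).name()); // Path A (index scan)
    }
}
```

In a real optimiser the estimate would combine processor, communication, and input/output costs rather than a single duration, but the selection principle is the same.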
In as much as the conventional optimisers in CQELS and C-SPARQL strive to produce the most feasible execution plan, gaps remain: processor time, communication time, and input/output costs are still considerably high. This section outlines the trends in query optimisation observed before and after the assimilation of heuristics, confirming the cost-saving impact achieved after integration. When a query is submitted to the database server, it traverses a fixed sequence of DBMS modules until the final results are generated (see Figure 2). These modules are a scanner, parser, query optimiser, code generator, and query processor. As Abbass and Newton (2002) explain, the scanner scrutinises the inherent language
tokens, for example the relation names and CQELS/C-SPARQL keywords, in the context of the query statement. The parser then certifies the query syntax and checks that the attribute names are semantically valid. After this, it transforms the query expression into a machine-readable internal representation using a query tree or, sometimes, a query graph; the tree's data structure is sketched by means of a calculus expression (Abbass and Newton 2002). The query optimiser comes into play by reading the machine-readable instruction
and then forming a multitude of execution plan strategies. The optimizer finally chooses the most
amenable path by assessing all pertinent algebraic expressions relating to the input query,
favoring the cheapest and shortest one. The code generator then works to create a viable code
that requests the query processor to execute that plan projected by the optimizer (MacLennan and
Tang 2009, p.242).
Figure 2: Query flow through a DBMS (scanner → parser → optimizer → code generator → query processor)
As mentioned above, the query optimizer explores relevant algebraic expressions
contained within various algorithms generated by the DBMS in query searches. The traditional
algorithms have always zeroed in on exhaustively enumerating all alternatives available to
empower query searches. However, as explained by Abbass and Newton (2002), this exhaustive
technique is defective when it comes to solving for complex queries as the algorithms cannot
make it to enumerate all possible (millions of) options in a short, convenient timing. Rather, the
timing is quite long and tiring even for the user waiting for the results. This occurrence is evident
when an algorithm has to enumerate join orders for a query whose resulting data is contained in
50 tables. The process of enumerating all these 50 tables and joining the data items can take up
several minutes before results are delivered, thus failing in fastness and cost efficiency. To solve
this drawback, a heuristics solution has been implemented in both the CQELS and C-SPARQL
optimisation processes. The heuristics solution activates an algorithm that checks a storage file in the DBMS for a ready-to-use query plan matching the new input query. If such a plan exists, the algorithm uses it to execute the new query expression, eliminating the need to create a plan from scratch. This saves the processing time otherwise spent developing a new query plan as well as the input/output costs (MacLennan and Tang 2009, p.42), and the communication time between query input and data output is shortened too. These savings in processor and communication time grow as time proceeds and as queries become more intricate.
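The check for a ready-to-use plan can be sketched as a simple lookup structure. This is an illustrative model of the storage file, with made-up class and method names rather than actual CQELS or C-SPARQL internals:

```java
import java.util.*;

// Sketch of the proposed heuristics plan cache: before building a new
// execution plan, the optimiser checks a storage structure for a
// ready-to-use plan matching the query.
public class PlanCache {
    private final Map<String, String> storageFile = new HashMap<>();
    int plansBuilt = 0; // counts how often a plan had to be built from scratch

    // Return a cached plan if one matches the query; otherwise build,
    // store, and return a new one.
    String planFor(String query) {
        String cached = storageFile.get(query);
        if (cached != null) {
            return cached;                        // ready-to-use plan: no rebuild cost
        }
        plansBuilt++;
        String plan = "plan(" + query + ")";      // stand-in for real plan construction
        storageFile.put(query, plan);
        return plan;
    }

    public static void main(String[] args) {
        PlanCache cache = new PlanCache();
        cache.planFor("Q1");
        cache.planFor("Q1");  // second submission reuses the stored plan
        System.out.println(cache.plansBuilt); // prints 1
    }
}
```

The design choice mirrors the text: the cost of the first submission is unchanged, and every repeat submission skips plan construction entirely, which is why the savings compound over time.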
3.2.2 Proposed heuristics approach
Figure 3: Binary tree
The heuristics solution proposed in this thesis advocates for a change in the sequence of
query execution from a normal binary tree to a magic tree that is stored in the given storage file.
The move to change the sequence of execution steps allows for the DBMS to save computational
costs and time as well (MacLennan and Tang 2009, p.221). In the absence of heuristics, the
query optimiser normally formulates a binary query tree (see Figure 3) which it uses to derive
numerous path plans before choosing the optimal alternative. Formulating the binary tree repeats operations such as join, filter, and projection every time a query search is initiated within the DBMS. This redundancy is a major contributor to the operational expenses of these functions, the time spent performing them, and the processor and communication time. Frequent join executions in particular make the volume of RDF data being accessed extremely bulky, which in turn makes manipulating the data repositories more complicated.
However, the addition of heuristics ensures that these binary trees are replaced with a
much more efficient methodology, the magic tree. The magic tree differs from the conventional
Figure 4: Magic tree
binary tree in its innovative way of placing all the constituent variables (join, filter, and projection) on only one wing of the tree (see Figure 4). The algorithm then allocates each of these variables a specific weight, and the total weight is used to calculate the cost of the variables in the tree. The weight assigned to each variable depends on the amount of time that variable consumes during query processing, so computational time correlates with the attached weights (MacLennan and Tang 2009, p.232). The magic tree reorders marked variables such as the projection stem of the binary query tree and eliminates the redundancy of binary projection mechanisms. For example, suppose the cost of the projection stem is x units. If a projection is administered fifteen times on a nested query, the aggregate cost in the customary binary tree will be 15x units. The proposed magic tree, however, shifts the projection facet to a single state, so that the projection on the same nested query need only be administered once and the total processing cost is just x units. Table 1 below depicts the algorithm proposed by the heuristics solution.
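The projection-cost argument can be restated numerically. This is a small sketch under the text's own assumption that each projection costs x units; the concrete value of x and the method names are illustrative:

```java
// Numeric sketch of the projection-cost argument: a binary tree pays the
// projection cost x once per application, while the magic tree pays it
// once in total.
public class ProjectionCost {
    // Cost of n projections, each costing x units, in a binary tree.
    static int binaryTreeCost(int n, int x) {
        return n * x;
    }

    // The magic tree collapses the n projections into a single one,
    // so the count n no longer matters.
    static int magicTreeCost(int n, int x) {
        return x;
    }

    public static void main(String[] args) {
        int x = 4; // assumed per-projection cost in units
        System.out.println(binaryTreeCost(15, x)); // 60
        System.out.println(magicTreeCost(15, x));  // 4
    }
}
```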
Table 1: Algorithm 1
Function: Compose a Magic Tree.
a) Parse the query.
b) Transform the query expression into a machine-readable statement.
c) Form a query tree or graph, depending on the calculus expression used.
d) Shift the selection entity to the head node of the query tree.
e) Eliminate all available candidate selection entities.
f) Form all the dependent groupings and shift them to one wing of the tree.
g) All leaf nodes are relations; the process therefore halts once it reaches a leaf.
h) The query processor begins the search query course of action.
i) Once the query processor discovers the data target, it moves to the projection stem, where all the other pertinent functionalities are conducted.
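The steps above can be sketched as follows. The list-based representation of the tree is a deliberate simplification for illustration (the thesis prototype used linked-list trees), and the operator names are assumed:

```java
import java.util.*;

// Hedged sketch of Algorithm 1: starting from a parsed query, collect the
// dependent operators (join, filter/select, projection) and shift them to a
// single wing of the tree, leaving the relations at the leaves.
public class MagicTreeBuilder {
    record MagicTree(List<String> operatorWing, List<String> leafRelations) {}

    static final Set<String> OPERATORS = Set.of("join", "filter", "project", "select");

    // Steps (d)-(f): dependent groupings are shifted to one wing;
    // step (g): leaves are relations, so they stay put.
    static MagicTree compose(List<String> parsedNodes) {
        List<String> wing = new ArrayList<>();
        List<String> leaves = new ArrayList<>();
        for (String node : parsedNodes) {
            if (OPERATORS.contains(node)) wing.add(node);
            else leaves.add(node);  // relation: becomes a leaf
        }
        return new MagicTree(wing, leaves);
    }

    public static void main(String[] args) {
        MagicTree t = compose(List.of("select", "Orders", "join", "Customers", "project"));
        System.out.println(t.operatorWing());   // [select, join, project]
        System.out.println(t.leafRelations());  // [Orders, Customers]
    }
}
```

Because every operator sits on one wing, a later search can visit all of them in a single pass instead of re-deriving them per branch, which is the source of the cost saving claimed above.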
As MacLennan and Tang (2009, p.144) claim, heuristics have long been a viable solution for modern computational problems, especially those dealing with voluminous data sets such as telecommunication and industrial-plant streaming data. The algorithms embedded in heuristics functions help solve optimisation and complex real-world problems by improving the time, cost, and space required to answer computational inquiries. In our case, the effect of heuristics may not be felt immediately, but after a while the cost-saving impacts will become visible. This follows from the way heuristics work: during the early implementation stages, the heuristics entity first monitors how applications work. It performs meticulous appraisals and
evaluations of how program applications, in this case the query optimisation process, are run and
traces all these moves and formulas onto its memory. By this, it has created a virtual image of the
functioning of all the steps involved during a query search, from when the query is input to when
data results are displayed on the screen. The more advanced version of heuristics thoroughly
inspects then traces the guidelines put in the codes of programs prior to passing them over to the
computer’s processing unit for execution. This will help the heuristics engine to assess and learn
the behaviour and mannerisms of that program while it runs in a virtual setting.
As soon as its memory is packed with the application performance information, it starts
using this information to revamp activities and even cultivate better channels for enhanced task
execution. In the case of the RDF stream processing, a user can input the same query over and
over again over a given period of time, say for example, when retrieving information about a
certain tweet or when researching about the manufacture status of a phone from its manufacturer.
For every single time that a query search is initiated for such a research function, the parser must
form a query tree for each search before handing it over to the query optimiser and code
generator to formulate a code needed in the actual processing of the query statement. Building a
query tree for each and every query search of the same research question consumes an awful lot
of communication time and processing expenses as well, in the absence of a heuristics engine
(MacLennan and Tang 2009, p.39). This time, physical storage space, and processing cost are what we aim to eradicate in our RDF stream processing. In a heuristics environment,
however, the redundant formations of the same query tree, their optimisations, and final query
processing, is noted in the heuristics’ memory. Hence, if the same research question is entered
yet again, the parser will just proceed to the heuristics’ memory and retrieve the query tree that
was noted before, instead of building a new one all over again. Therefore, the time that could
have otherwise been expended on query tree formation is saved, and the communication time is minimised in turn. The query search proposed by this heuristic is shown in Table 2.
Table 2: Algorithm 2
Function: The Projected Heuristics Query Search.
a) A query tree is crafted for each query expression that is submitted into the database
system.
b) Then, the heuristics function reads and stores this binary tree in a dedicated storage folder
for that particular query tree.
c) The storage folder is then assigned a unique company usage factor for easy identification by the parser, such that the maximum number of storage folders generated equals the company usage factor (c.u.f.).
d) Following this, the heuristics devises a unique magic tree that shifts all the dependent
variables (join, select, and projection) in the binary tree to one side of the tree.
e) When a similar query is submitted by a user, the parser first checks the storage folder for an equivalent query tree that can be utilised for that input inquiry.
f) If an equivalent stored tree exists, it proceeds to the precise branch node required for processing the inquiry at hand and performs all the relevant courses of action.
g) If there is no such tree, it consults the magic tree stored there; if successful, it halts further searches and performs all the relevant courses of action.
h) If all these searches fail, such that there is no equivalent branch node even in the magic tree, the parser resorts to generating a new magic tree as depicted in the first algorithm, thus incrementing the storage folder counter.
Lastly, the database server will refresh the folder whenever the counter is less than the company usage factor. This is desirable because the number of folders should equal the company usage factor (MacLennan and Tang 2009, p.19).
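Algorithm 2's lookup-then-build flow can be sketched as follows. The cache keyed by query text, the string stand-in for a tree, and the counter cap are illustrative assumptions, not the engine's actual data structures:

```java
import java.util.*;

// Sketch of Algorithm 2: on each query, first look for an equivalent stored
// tree; failing that, build a new magic tree and increment the folder
// counter, capped by the company usage factor (c.u.f.).
public class HeuristicSearch {
    private final Map<String, String> storedTrees = new HashMap<>();
    private final int companyUsageFactor;
    int folderCounter = 0;

    HeuristicSearch(int companyUsageFactor) {
        this.companyUsageFactor = companyUsageFactor;
    }

    // Returns the tree used to answer the query.
    String search(String query) {
        String tree = storedTrees.get(query);
        if (tree != null) return tree;             // steps (e)-(g): reuse a stored tree
        tree = "magicTree(" + query + ")";         // step (h): build a new magic tree
        if (folderCounter < companyUsageFactor) {  // folders are capped at the c.u.f.
            storedTrees.put(query, tree);
            folderCounter++;
        }
        return tree;
    }

    public static void main(String[] args) {
        HeuristicSearch s = new HeuristicSearch(2);
        s.search("Q1");
        s.search("Q1");          // reuse: counter stays at 1
        s.search("Q2");
        System.out.println(s.folderCounter); // 2
    }
}
```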
3.2.3 Results simulation
This section puts this theoretical novel approach of heuristics assimilation into actual practice, through simulation, on an RDF stream processing engine to confirm whether the prototype makes good on its promise. The RDF engines tested here are the CQELS and C-SPARQL languages. Simulation here refers to the manner in which the heuristics replication was
conducted over a specified period of time (6 months). A model of the heuristics query
optimisation engine was replicated in a Java Runtime Environment (JRE) running on a computer powered by the Windows operating system. Using the JRE, we wrote core Java code, which was later compiled and run in the Eclipse environment to execute the given RDF data streams. The implementation was written in Java and employed class handling; the data structures of the query tree went hand in hand with dynamic memory allocation based primarily on linked lists. The outcome of the analysis was as
expected: integrating heuristics across the board of RSP engines improved cost-saving by shrinking processor operational costs. A heuristics approach was implemented in the CQELS and C-SPARQL query languages to form magic trees and perform selections earlier.
As MacLennan and Tang (2009, p.66) explain, the heuristics database engine is exploited to perform selections early. This considerably reduces the size of the RDF graph databases, speeding up the overall query search process. For example, considering the following CQELS and C-SPARQL query expressions (see Table 3), applying heuristics is beneficial in that it executes the selection entities very early in the process, minimising the communication time.
Table 3: Query 1
The customary query processing of these CQELS and C-SPARQL query expressions would have initiated the formation of a binary query tree as depicted in Figure 3. With heuristics, however, the database engine forms a magic tree (see Figure 4) that shifts the selection variable to one side of the tree. As MacLennan and Tang (2009, p.41) note, the initial query processing stages of the heuristics approach do absorb some costs in constructing and searching the magic tree. Nonetheless, these costs are significantly lower than those expended in the formation and execution of binary trees. The magic tree likewise reduces all other computational costs involved, since the frequency of the selection variables also decreases. This saving is evident when the estimated costs of the two methods are compared: the traditional binary tree's aggregate running costs are 100 units, while the magic tree incurs only 50 units. Supposing a new query is input for the first time by a user, the database server will incur seemingly high expenditures in both forming the binary tree and converting it into a magic tree. However, in the next
round there will be no conversion costs, as the magic tree will be readily available in the heuristics' storage folder. Additionally, communication and processor costs will fall to the same degree as the conversion costs, since the parser will reach automatically for the magic tree branch nodes. Figure 5 shows the cost-versus-time chart comparing conventional query processing with our proposed heuristics-based CQELS and C-SPARQL query optimisation strategies.
Figure 5: Cost versus time graph
As shown in Figure 5, the preliminary costs are somewhat high, but as the heuristics functionality continues to track, learn, and store magic trees in its folders, the overall computational expenditure decreases with time (Cheung et al. 2006, p. 43). To elucidate this phenomenon: when a new query is fed into an RDF database, all the constituent stages of a tree-match search are carried out, namely parsing, query tree building, syntax checking, attribute name confirmation, optimisation, and code generation. These activities account for the evidently high costs and time consumption (MacLennan and Tang 2009, p.71). As time goes by, the heuristics entity monitors the query search procedure, identifies the redundant parsing and optimisation sequences, and creates a way out: it traces a particular binary tree in its storage folder and, from
this, derives an equivalent magic tree that matches it. Therefore, in subsequent standard query searches there is no need to create yet another binary tree for a similar inquiry (MacLennan and Tang 2009, p.83). Instead, the magic tree is retrieved from the storage file for duplicate tree matching, saving the computational conversion time and cost. The heuristics application performs even better on nested queries, where the data results are delivered much faster and more efficiently (see Figure 6). Further simulations of the heuristics algorithm can also extend join properties such as right and left joins.
Figure 6: Performance versus complexity
3.2.4 Performance comparison between the new improved model and the previous versions of CQELS and C-SPARQL
Most of the systems considered are works in progress and scientific prototypes. Unsurprisingly, they are not able to support all query patterns and features. The outputs of the new improved model and the previous versions of CQELS and C-SPARQL differ significantly because of their differences in implementation. These performance differences result mainly from intrinsic technical issues concerning how streaming data is handled, such as a potentially fluctuating execution environment and time management.
Table 4: Performance Comparison by Features

                   Special support for   Input                 Extras
C-SPARQL           TF                    RDF and RDF streams   -
CQELS              NEST, VoS             RDF and RDF streams   Disk spilling
Streaming SPARQL   -                     RDF streams           -
SPARQL stream      NEST                  Relational stream     Ontology-based mapping
EP-SPARQL          EVENT, TF             RDF and RDF streams   Event operators

EVENT: event pattern, VoS: variables on stream, TF: built-in time function, NEST: nested patterns.
Table 5: Performance Comparison by the Mechanism of Execution

                   Re-execution   Optimisation           Architecture   Scheduling
C-SPARQL           Periodical     Static and algebraic   Black box      Logic plan
CQELS              Eager          Adaptive and physical  White box      Adaptive physical plans
Streaming SPARQL   Periodical     Static and algebraic   White box      Logic plans
SPARQL stream      Periodical     Externalised           Black box      External call
EP-SPARQL          Eager          Externalised           Black box      Logic program
Figure 7: Graphical performance comparison
As the graph shows, the throughput of C-SPARQL in the scalability and performance tests is considerably lower than that of CQELS and JTALIS. It is thus clear that recurrent execution is likely to waste significant computing resources. A sliding window extracts the recurrences, and the outputs can be computed incrementally as a stream. Notably, the outputs of JTALIS and CQELS are useful in answering recurrent queries.
Query 1 involves counting the number of items over a tumbling window of one second. Note, however, that this query uses a physical time window. For statistically robust results, the computation is averaged over twenty executions, mainly because the execution time varies with the condition of the machine.
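Tumbling-window counting of the kind used by Query 1 can be sketched as follows. The millisecond timestamps and the window-assignment scheme are illustrative assumptions, not the benchmark's actual data or engine code:

```java
import java.util.*;

// Sketch of Query 1's semantics: counting items over a one-second tumbling
// window. Timestamps are in milliseconds; window k covers
// [k*widthMs, (k+1)*widthMs), with no overlap between windows.
public class TumblingWindowCount {
    // Map each item timestamp to its window index and count items per window.
    static Map<Long, Integer> countPerWindow(List<Long> timestampsMs, long widthMs) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (long ts : timestampsMs) {
            long window = ts / widthMs;           // tumbling: each item falls in one window
            counts.merge(window, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Long> arrivals = List.of(100L, 250L, 990L, 1001L, 1999L, 2500L);
        System.out.println(countPerWindow(arrivals, 1000)); // {0=3, 1=2, 2=1}
    }
}
```

A sliding window would instead let one item contribute to several overlapping windows, which is why incremental computation matters there.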
Notice that CQELS performs better than JTALIS because it uses both an adaptive and a native approach. The performance of JTALIS and C-SPARQL depends heavily on their underlying systems, a Prolog engine and a relational stream processing engine respectively. Similarly, CQELS would likely benefit from a more sophisticated, better-optimised algorithm than the current one. CQELS is the only system that indexes and precomputes intermediate results over static data from sub-queries. However, neither C-SPARQL nor CQELS scales well as the number of queries sharing data windows and similar patterns increases. Additionally, neither system uses multiple-query optimisation techniques to avoid redundant computation among queries that share computing memory and blocks. In this case, optimisation occurs only at the static, algebraic level, since both Streaming SPARQL and C-SPARQL schedule execution at a logical level (MacLennan and Tang 2009, p.102). On the contrary, CQELS can choose alternative execution plans composed from the available operators' physical implementations; in effect, its optimiser adaptively optimises execution at the physical level.
Both SPARQL stream and EP-SPARQL schedule execution through a logic program or a declarative query; in this case, they fully delegate optimisation to other systems (Seshadri and Leung 1998). The technique used to improve the results involves defining mappings, triple patterns, RDF triples, and other operations on mappings, and reusing notations.
Under the instantaneous RDF dataset and RDF stream, the temporal nature of data is essential and must be captured in the data representation for continuous processing of dynamic data. This applies to both kinds of data source, because updates to linked data collections are also possible. An instantaneous RDF dataset G(t) denotes the dataset's state at time t; if G(t + 1) = G(t) for all t ≥ 0, i.e. G(t) = G for all t ∈ N, the dataset is static. Pattern matching is the main primitive operation on both the instantaneous RDF dataset and the RDF stream (MacLennan and Tang 2009, p.88). Notice that the triple pattern of SPARQL semantics extends this pattern matching. Consequently, the notation of denotational semantics becomes helpful for the formal definition of the query patterns of the processing model. The denotations are the meaning functions for the semantic composition of the abstract syntax. These compositions comprise three classes of operators, namely relational, pattern matching, and stream operators. Pattern matching operators extract valid triples from a dataset or an RDF stream that match a given triple pattern at a certain time t, as shown below.
Pattern matching operator’s abstract syntax
The meaning of the triple matching pattern operator PG is defined in the same way as SPARQL on an RDF dataset at a given timestamp t, as follows.
Next is the definition of the window-based triple matching operator on an RDF stream.
The composability of the denotational semantics yields the definition of the abstract syntax for compound query patterns constructed from both logical and matching operators. Additionally, the aggregation operator is defined before its syntax and semantics (MacLennan and Tang 2009, p.99). Notice that a uniform mapping contains only mappings that have similar domains; in this case, a consistent mapping is defined over an aggregate operator set Ω. The relational operators' abstract syntax is therefore defined recursively as shown below.
The mapping of the operators therefore becomes
Under the streaming operators' abstract syntax, a streaming operator produces either an RDF stream or a relational stream from the above relational operators.
Next is the definition of the declarative query language CQELS-QL (the CQELS query language) for the CQELS execution framework. The SPARQL 1.1 grammar in EBNF notation helps define CQELS-QL. The first step is adding a query pattern for representing window operators on RDF streams.
Chapter 4: State of the Art in Linked Stream Data Processing (LSDP)
According to Gedik (2006), Linked Stream Data derives its usefulness from bridging the gap between Linked Data and data streams and from facilitating data integration. Resource Description Framework data streams enable the query processor to treat stream nodes as RDF elements and allow access to RDF streams in the form of materialised data (Abdulla and Matzke 2006, p.907; Buchanan and Shortliffe 1984, p.777; Cole and Conley 2009, p.809; Zhang and Kollios 2007, p.733). Notably, the whole process makes it possible to apply other SPARQL query patterns (Cheung et al. 2006, p.444). In short, this chapter explores the techniques and concepts of stream processing and introduces Linked Stream Data Processing engines (Calhoun and Riemer 2001, p.447). Additionally, including the CQELS engine in this chapter helps clarify the contribution of this field.
4.1 Query Semantics and Data Models
This section mainly explores possible ways of formalising the data model for Resource Description Framework datasets and Resource Description Framework streams in a continuous context (Cole and Conley 2009, p.931). Additionally, it touches on continuous query semantics.
4.2 Data Model
It is important to note that Linked Stream Data is modelled by extending the meaning of RDF triples and RDF nodes (Cohen 1985, p.303). An RDF stream is a bag of elements, each an RDF triple carrying a temporal annotation such as a time interval or a timestamp. An interval-based label consists of a pair of timestamps; commonly, natural numbers represent logical time (Eastwood 2008, p.278). The pair of timestamps, 'start' and 'end', specifies the interval in which the Resource Description Framework triple is valid (Dean 2009, p.264). A point-based label, on the other hand, is a single natural number representing the point in time at which the triple was received or recorded (Buchanan and Shortliffe 1984, p.708). Point-based labels may look redundant and less expressive than interval-based labels; however, they are less expensive, since a point-based label can be considered an important special case of an interval-based label in which start = end. According to research (e.g. Abbass and Newton 2002, p.946), Streaming SPARQL and EP-SPARQL find such labels useful for representing the items of a physical data stream as triple-based events.
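The two labelling schemes can be sketched as a small data structure. The class and field names are illustrative, and the point-based case is modelled, as in the text, as the special case start = end:

```java
// Sketch of the two temporal annotations: an interval-based label is a
// pair (start, end), and a point-based label is the special case
// start = end.
public class TimestampedTriple {
    record Triple(String s, String p, String o) {}

    record Labelled(Triple triple, long start, long end) {
        // Point-based label: a single timestamp, i.e. start = end.
        static Labelled atPoint(Triple t, long timestamp) {
            return new Labelled(t, timestamp, timestamp);
        }

        boolean isPointBased() {
            return start == end;
        }
    }

    public static void main(String[] args) {
        Triple t = new Triple(":alice", ":detectedAt", ":office");
        Labelled point = Labelled.atPoint(t, 42);
        Labelled interval = new Labelled(t, 42, 50);
        System.out.println(point.isPointBased());    // true
        System.out.println(interval.isPointBased()); // false
    }
}
```

Because the point-based label is a degenerate interval, code written against the interval representation handles both schemes, which is the "special case" argument made above.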
For streaming data sources, a point-based label is often more practical because it allows a
triple to be generated instantaneously and unexpectedly. A good example is a tracking system
that detects people in an office (Buchanan and Shortliffe 1984, p.707). Notably, such a system
generates a timestamped triple every time it receives a reading from a sensor. To produce an
interval-based label instead, the system would have to buffer the readings and perform further
processing in order to derive the valid interval of each triple (Bolton 1996, p.407). Furthermore,
instantaneous point-based labels play a vital role for applications that require data to be
processed immediately as it arrives in the system. Additionally, the concept of the Resource
Description Framework dataset must be included in the data model to enable the integration of
stream data with non-stream data.
Primarily, the current state of the art treats the Resource Description Framework dataset as a
static data source. In light of the findings (e.g. by Abbass and Newton 2002, p.944), it is
important to note that data stream applications can run for arbitrary periods, ranging from days
to years. Consequently, changes to the Resource Description Framework dataset during the
lifetime of a query must be reflected in the continuous query's outputs.
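The buffering step mentioned above, by which point-timestamped readings are turned into valid intervals, can be sketched as follows. The triple shape and the coalescing rule (merging consecutive identical readings) are assumptions chosen for illustration.

```python
# Sketch: derive interval-based labels from point-based sensor readings.
# Consecutive readings reporting the same (subject, predicate, object)
# are coalesced into one triple whose interval spans their timestamps.
def coalesce(readings):
    """readings: list of (subject, predicate, object, timestamp) tuples,
    sorted by timestamp. Returns (subject, predicate, object, start, end)."""
    intervals = []
    for s, p, o, t in readings:
        if intervals and intervals[-1][:3] == (s, p, o):
            s0, p0, o0, start, _ = intervals[-1]
            intervals[-1] = (s0, p0, o0, start, t)  # extend the open interval
        else:
            intervals.append((s, p, o, t, t))  # start as a point label
    return intervals

stream = [
    (":alice", ":in", ":office1", 1),
    (":alice", ":in", ":office1", 2),
    (":alice", ":in", ":office1", 3),
    (":alice", ":in", ":lobby", 4),
]
print(coalesce(stream))
# Two intervals: office1 over [1, 3], then lobby at the point [4, 4]
```

Note the trade-off the text describes: the interval for `:office1` only becomes known after the reading at time 4 arrives, which is exactly the buffering delay that point-based labels avoid.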
4.3 Query Semantics
The semantics cover the current state-of-the-art SPARQL-like query operators such as union,
join, and filter. In practice, these operators consume and output mappings (Abbass and Newton
2002, p.556). On top of them, operators on Resource Description Framework streams are
introduced that likewise produce output mappings. Worth noting, C-SPARQL defines its stream
operator to access a Resource Description Framework stream identified by its IRI (Cohen 1985,
p.301). A window operator is additionally defined to access a Resource Description Framework
stream through windows; essentially, it adapts the window operator of CQL to Resource
Description Framework streams (Cole and Conley 2009, p.954). It is also important to note that
the semantics of a continuous query over Resource Description Framework streams are defined
as a composition of query operators. Practically, in both streaming SPARQL and C-SPARQL a
query is composed as an operator graph (Dean 2009, p.237), and the definition of this query
graph is based on the query operators.
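The CQL-style time-based window just described can be sketched as a function that selects from a stream the triples whose timestamps fall inside the window. The tuple representation and parameter names are illustrative assumptions, not the CQELS or C-SPARQL implementation.

```python
# Sketch of a time-based sliding window over a timestamped RDF stream.
# Each stream element is a (subject, predicate, object, timestamp) tuple.
def time_window(stream, now, range_size):
    """Return the bag of triples whose timestamps lie in (now - range_size, now]."""
    return [(s, p, o) for (s, p, o, t) in stream
            if now - range_size < t <= now]

stream = [
    (":s1", ":p", ":o1", 1),
    (":s2", ":p", ":o2", 4),
    (":s3", ":p", ":o3", 6),
]
# A window of range 3 evaluated at time 6 keeps the triples at times 4 and 6;
# downstream SPARQL operators then consume this finite bag of triples.
print(time_window(stream, now=6, range_size=3))
```

This also illustrates why the window operator matters semantically: it converts an unbounded stream into a finite bag on which the ordinary mapping-based operators can be evaluated.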
4.4 Query Languages
To fully define a declarative query language for Linked Stream Data, query patterns must be
introduced to express the primitive operators (Abdulla and Matzke 2006, p.956; Buchanan and
Shortliffe 1984, p.561; Zhang and Kollios 2007, p.654). In practice, these primitives are the
window matching, triple matching, and sequential operators (Eastwood 2008, p.509).
Compositions of these basic query patterns can then be expressed with the AND, OPT, UNION,
and FILTER patterns of SPARQL. Another important thing to note is that these patterns
correspond to the operators in the earlier definitions.
To support the aggregation operators, several types of research (e.g. Abdulla and Matzke 2006,
p.966; Buchanan and Shortliffe 1984, p.906; Zhang and Kollios 2007, p.749) define their
semantics with the AGG query pattern, which is compatible with the other types of SPARQL
patterns. The evaluation of the query pattern AGG is defined as [[P AGG A]] = A([[P]]),
whereby A refers to an aggregate function that consumes the output of a SPARQL query
pattern P and returns a set of mappings. Letting P, P1, and P2 be basic or composite query
patterns, the declarative query is composed recursively by rules of the following kind:
[[P1 UNION P2]] = [[P1]] ∪ [[P2]],
[[P1 AND P2]] = [[P1]] ⋈ [[P2]],
[[P1 OPT P2]] = [[P1]] ⟕ [[P2]],
[[P AGG A]] = A([[P]]),
[[P FILTER R]] = {µ ∈ [[P]] | µ satisfies R}.
In practice, these patterns extend the grammar of SPARQL to continuous queries.
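The recursive composition rules above can be illustrated over sets of mappings (variable bindings). The mapping representation and the compatibility test follow the standard SPARQL algebra; the function names are illustrative.

```python
# Sketch of the recursive evaluation rules over sets of mappings.
# A mapping is a dict from variable names to RDF terms or values.
def compatible(m1, m2):
    """Two mappings are compatible if they agree on all shared variables."""
    return all(m1[v] == m2[v] for v in m1.keys() & m2.keys())

def union(p1, p2):   # [[P1 UNION P2]] = [[P1]] ∪ [[P2]]
    return p1 + [m for m in p2 if m not in p1]

def join(p1, p2):    # [[P1 AND P2]] = [[P1]] ⋈ [[P2]]
    return [{**m1, **m2} for m1 in p1 for m2 in p2 if compatible(m1, m2)]

def filter_(p, r):   # [[P FILTER R]] = {µ ∈ [[P]] | µ satisfies R}
    return [m for m in p if r(m)]

p1 = [{"x": ":alice"}, {"x": ":bob"}]
p2 = [{"x": ":alice", "y": 30}, {"x": ":carol", "y": 25}]
print(join(p1, p2))                        # only the :alice mappings are compatible
print(filter_(p2, lambda m: m["y"] > 26))  # keeps mappings satisfying R
```

Because each operator consumes and produces sets of mappings, arbitrary query patterns compose recursively, which is exactly what makes the operator-graph formulation of the previous section work.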
It is important to note that C-SPARQL extends SPARQL with a CONSTRUCT form whose
triple patterns define the Resource Description Framework stream output. In essence, the
grammars of streaming SPARQL and C-SPARQL are otherwise the same. In practice, the uses
of databases are manifold (Jeuring 2012, p.417). They provide a means of retrieving either parts
of records or entire records and of performing different kinds of calculations before displaying
the outcomes (Abdulla and Matzke 2006, p.504; Buchanan and Shortliffe 1984, p.703; Cole and
Conley 2009, p.968; Zhang and Kollios 2007, p.974). The query language is the interface that
specifies such manipulations (Lucas 2010, p.608). Early query languages, on the other hand,
were very complex, so interaction with electronic databases was carried out only by individuals
with special knowledge (MacLennan and Tang 2009, p.673). Modern interfaces are more
user-friendly and also allow casual users to access the information in the database.
The main types of query modes are the menu, the fill-in-the-blank, and the structured query
(Gedik 2006, p.422). The menu requires an individual to choose from various alternatives
displayed on a monitor, which makes it particularly suitable for novices (Maringer 2005, p.342).
The fill-in-the-blank technique, on the other hand, prompts the user to enter key words as search
statements (Moustakas 1990, p.623). Worth noting, the structured query approach is very
effective with relational databases. It has a powerful, formal syntax and is, in practice, a
programming language, and it can accommodate logical operators (Mueller 2009, p.506). The
Structured Query Language, or SQL, takes various forms when implementing this approach,
along the lines of: SELECT field Fa, Fb, Fc..., Fn FROM database Da, Db, Dc… Dn WHERE
field Fa = abc AND field Fb = def. Several studies (e.g. Abdulla and Matzke 2006, p.678;
Buchanan and Shortliffe 1984, p.985; Zhang and Kollios 2007, p.992) show that the structured
query language supports searching the database, as well as other activities, through commands
such as 'sum', 'print', 'find', 'delete' and so on (Nirmal 1990, p.496). Ordinarily, a natural-language
query resembles the sentence structure of an SQL query, except that ordinary sentences take
the place of Structured Query Language statements. Additionally, it is also possible to represent
queries in the form of tables.
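The structured-query form sketched above can be made concrete with the SQLite engine shipped in Python's standard library. The table and column names are invented for illustration and mirror the generic fields Fa, Fb used in the text.

```python
import sqlite3

# A minimal structured query in the SELECT ... FROM ... WHERE form
# discussed above; the table and column names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (fa TEXT, fb TEXT, fc INTEGER)")
conn.executemany("INSERT INTO records VALUES (?, ?, ?)",
                 [("abc", "def", 10), ("abc", "xyz", 20), ("zzz", "def", 30)])

# SELECT fields FROM database WHERE field fa = 'abc' AND field fb = 'def'
rows = conn.execute(
    "SELECT fa, fb, fc FROM records WHERE fa = ? AND fb = ?",
    ("abc", "def")).fetchall()
print(rows)  # only the row matching both WHERE conditions

# Commands such as 'sum' appear in SQL as aggregate functions.
total = conn.execute("SELECT SUM(fc) FROM records").fetchone()[0]
print(total)
```

The logical operator AND in the WHERE clause corresponds directly to the "field Fa = abc and field Fb = def" form quoted in the text, and SUM illustrates one of the commands listed there.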
The technique known as QBE (query by example) displays an empty form. According to
Mcllroy (1998), the searcher is then expected to enter the appropriate search specification into
the appropriate columns. The program then constructs an SQL query from the table and executes
it (Zhang and Kollios 2007, p.997). In practice, natural language is the most flexible query
language (Abdulla and Matzke 2006, p.911; Buchanan and Shortliffe 1984, p.703; Zhang and
Kollios 2007, p.707). Most importantly, some commercial database management software allows
natural-language sentences to be used as constraints for searching the databases (Schreiber 1977,
p.781). In essence, these programs parse the syntax and then recognize synonyms and action
words (Abdulla and Matzke 2006, p.1002; Buchanan and Shortliffe 1984, p.734; Zhang and
Kollios 2007, p.836). In addition, the programs identify file, record, and field names and perform
the required logical operations (Seshadri and Leung 1998, p.699). Furthermore, there has been
some development of natural-language queries in spoken form, as such experimental systems
have gained acceptance (Sims and Yocom 2008, p.1003). However, the ability to employ
unrestricted natural language to query unstructured information requires further advances in
machine understanding of natural language (Wei 2011, p.354), particularly in representing the
semantic and pragmatic context of ideas.
Chapter 5: The Optimization Solutions for the CQELS
In essence, this execution framework supports adaptive and native query execution over RDF
streams and RDF datasets (Bolton 1996, p.404). Worth noting, the framework's white-box
architecture accepts both RDF streams and RDF datasets as inputs and returns its outputs either
as RDF streams or as relational streams in the SPARQL result format (Abdulla and Matzke
2006, p.702; Buchanan and Shortliffe 1984, p.497). In practice, the output RDF streams can be
fed into any CQELS engine (Wei 2011, p.4078), while the relational streams can be useful to
other relational stream processing systems (Cheung et al. 2006, p.497). Notably, the processing
works as follows: the stream data is pushed to the input manager, and the encoder encodes it
into a normalised input stream representation (Cole and Conley 2009, p.1007). The dynamic
executor then consumes this encoded stream. Another important aspect to note is that the
decoder decodes the outputs of the dynamic executor and streams them to the receiver (Abdulla
and Matzke 2006, p.749). The decoder and the encoder share a dictionary for their decoding
and encoding operations. Additionally, the dynamic executor accesses the static RDF datasets
via the cache fetcher; these datasets can be hosted in either local or remote RDF stores exposed
through SPARQL endpoints (Cole and Conley 2009, p.1011). The cache fetcher plays a vital
role in retrieving the required data and encoding it for the cache manager by use of the encoder
(Wei 2011, p.507). Worth noting, the intermediate results are kept in the same normalised,
encoded representation, sharing the dictionary with the input streams.
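The dictionary shared by the encoder and decoder can be sketched as a bidirectional mapping from RDF terms to compact integer identifiers. Dictionary encoding of this kind is a common design in RDF engines; the class below is an illustrative assumption, not the actual CQELS data structure.

```python
# Sketch of the dictionary shared by the encoder and decoder: RDF terms
# (IRIs, literals) map to compact integer ids, so the dynamic executor
# can operate on integers instead of full strings.
class Dictionary:
    def __init__(self):
        self._to_id = {}
        self._to_term = []

    def encode(self, term: str) -> int:
        """Return the id for term, assigning a fresh id on first sight."""
        if term not in self._to_id:
            self._to_id[term] = len(self._to_term)
            self._to_term.append(term)
        return self._to_id[term]

    def decode(self, ident: int) -> str:
        """Recover the original term for an id."""
        return self._to_term[ident]


d = Dictionary()
# Encoder side: normalise an incoming triple into integer ids.
encoded = tuple(d.encode(t) for t in
                ("http://ex.org/alice", "http://ex.org/in", "http://ex.org/office1"))
# Decoder side: recover the original terms with the same dictionary.
decoded = tuple(d.decode(i) for i in encoded)
print(encoded, decoded)
```

Because intermediate results reuse the same dictionary as the input streams, no re-encoding is needed between operators, which is the sharing property the paragraph above describes.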