In this study, we evaluate the performance of trajectory queries handled by Cassandra, MongoDB, and PostgreSQL. The evaluation is conducted on a multiprocessor and a cluster. Telecommunication companies collect large volumes of data from their mobile users. These data must be analyzed to support business decisions, such as infrastructure planning. The optimal choice of hardware platform and database can differ from one query to another. We use data collected by Telenor Sverige, a telecommunication company operating in Sweden; the data were collected every five minutes for an entire week in a medium-sized city. The execution time results show that Cassandra performs much better than MongoDB and PostgreSQL for queries without spatial features. Stratio's Cassandra Lucene index incorporates a geospatial index into Cassandra, enabling it to handle spatial queries with performance similar to MongoDB's. Across four use cases, namely the distance query, k-nearest neighbour query, range query, and region query, Cassandra performs much better than MongoDB and PostgreSQL for two of them, the range query and the region query. Scalability is also good for these two use cases.
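As an illustration of two of these use cases, the distance query and the k-nearest neighbour query can be sketched over raw (lat, lon) trajectory points; this is a minimal, database-free Python sketch (the coordinates and the brute-force search are illustrative, not the evaluated systems' indexes):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two (lat, lon) points in kilometres.
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def knn(points, query, k):
    # k-nearest neighbour query: the k points closest to the query location.
    return sorted(points, key=lambda p: haversine_km(p[0], p[1], *query))[:k]

# Illustrative positions of four Swedish cities.
points = [(59.33, 18.06), (57.71, 11.97), (55.60, 13.00), (59.86, 17.64)]
nearest = knn(points, (59.33, 18.07), 2)
```

A real deployment would answer this with a geospatial index (e.g. the Lucene index mentioned above) rather than a linear scan.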
PERFORMANCE EVALUATION OF SQL AND NOSQL DATABASE MANAGEMENT SYSTEMS IN A CLUSTER (ijdms)
In this study, we evaluate the performance of SQL and NoSQL database management systems, namely Cassandra, CouchDB, MongoDB, PostgreSQL, and RethinkDB. We use a cluster of four nodes to run the database systems, with external load generators. The evaluation is conducted using data from Telenor Sverige, a telecommunication company that operates in Sweden. The experiments are conducted using three datasets of different sizes. The write throughput and latency as well as the read throughput and latency are evaluated for four queries, namely the distance query, k-nearest neighbour query, range query, and region query. For write operations, Cassandra has the highest throughput when multiple nodes are used, whereas PostgreSQL has the lowest latency and the highest throughput on a single node. For read operations, MongoDB has the lowest latency for all queries; however, Cassandra has the highest read throughput. Throughput decreases as the dataset size grows for both writes and reads, for both sequential and random access, and the decrease is more significant for random reads and writes. We also present our experience with these different database management systems, including setup and configuration complexity.
Optimized Access Strategies for a Distributed Database Design (Waqas Tariq)
Abstract: Distributed database query optimization has been an active area of research for the database community in this decade. The work mostly involves mathematical programming and new algorithm design techniques to minimize the combined cost of storing the database, processing transactions, and communication among the storage sites. The complete problem, and most of its subproblems, is NP-hard. Most solutions proposed to date rely on enumerative techniques or heuristics. In this paper, we show the benefits of using genetic algorithms (GA) to optimize the sequence of sub-query operations over enumerative methods and heuristics. A stochastic simulator has been designed, and experimental results show encouraging improvements in decreasing the total cost of a query. An exhaustive enumerative method is also applied, and its solutions are compared with those of the GA across various parameters of a distributed query, with up to 12 joins and 10 sites. Keywords: Distributed Query Optimization, Database Statistics, Query Execution Plan, Genetic Algorithms, Operation Allocation.
Ranking Preferences to Data by Using R-Trees (IOSR Journals)
This document discusses algorithms for efficiently processing top-k spatial preference queries in databases containing spatial and non-spatial data. It defines top-k spatial preference queries as ranking objects based on quality and features in their nearest neighborhoods. It presents the branch and bound and feature join algorithms for computing the top-k results without having to calculate scores for all objects. It also discusses using R-trees to index spatial data and feature data to accelerate query processing.
The document summarizes research on performing spatio-textual similarity joins. It discusses:
1) Developing a filter-and-refine framework to efficiently find similar object pairs from two datasets using signatures.
2) Generating spatial and textual signatures for objects and building inverted indexes on the signatures to find candidate pairs.
3) Refining the candidate pairs to obtain the final result pairs that satisfy spatial and textual similarity thresholds.
TOWARDS REDUCTION OF DATA FLOW IN A DISTRIBUTED NETWORK USING PRINCIPAL COMPO... (cscpconf)
Two approaches are possible for distributed data mining: first, data from several sources are copied to a data warehouse and mining algorithms are applied to it; second, mining can be performed at the local sites and the results aggregated. When the number of features is high, a lot of bandwidth is consumed in transferring datasets to a centralized location. To address this, dimensionality reduction can be performed at the local sites: an encoding is applied to the data to obtain a compressed form. The reduced features obtained at the local sites are then aggregated, and data mining algorithms are applied to them. There are several methods for dimensionality reduction; two of the most important are Discrete Wavelet Transforms (DWT) and Principal Component Analysis (PCA). Here, a detailed study is done on how PCA can reduce data flow across a distributed network.
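The local-site reduction step described above can be sketched with PCA via the SVD; this is a minimal NumPy illustration (the data and the number of retained components are arbitrary), not the paper's exact procedure:

```python
import numpy as np

def pca_reduce(X, k):
    # Project the rows of X onto the top-k principal components.
    Xc = X - X.mean(axis=0)                     # centre each feature
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                        # (n_samples, k) reduced data

# Each local site reduces its own high-dimensional data before transfer,
# so only k values per record cross the network instead of all features.
rng = np.random.default_rng(0)
site_data = rng.normal(size=(100, 50))          # 100 records, 50 features
reduced = pca_reduce(site_data, 5)              # 5 values per record sent
```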
The document proposes an extension to the M-tree family of index structures called M*-tree. M*-tree improves upon M-tree by maintaining a nearest-neighbor graph within each node. The nearest-neighbor graph stores, for each entry in a node, a reference and distance to its nearest neighbor among the other entries in that node. This additional structure allows for more efficient filtering of non-relevant subtrees during search queries through the use of "sacrifice pivots". The experiments showed that M*-tree can perform searches significantly faster than M-tree while keeping construction costs low.
Time Series Forecasting Using Novel Feature Extraction Algorithm and Multilay... (Editor IJCATR)
Time series forecasting is important because it often provides the foundation for decision making in a wide variety of fields. A tree-ensemble method, referred to as time series forest (TSF), is proposed for time series classification. The approach is based on the concept of data series envelopes and essential attributes generated by a multilayer neural network... These claims are further investigated by applying statistical tests. With the results presented in this article, together with results from related investigations, we want to help practitioners and scholars answer the following question: which measure should be looked at first if accuracy is the most important criterion, if an application is time-critical, or if a compromise is needed? The paper demonstrates that feature extraction with the novel method can improve the time series forecasting process.
The paper presents a nature-inspired algorithm modeled on the big bang theory of evolution. The algorithm is simple with regard to its number of parameters. Embedded systems are powered by batteries, and extending battery operating time by reducing power consumption is vital. Embedded systems consume power while accessing memory during operation. An efficient method for power management is proposed in this work. The proposed method reduces the energy consumption of memories by 76% to 98% compared with other methods reported in the literature.
The document discusses representing relational spatiotemporal data using information granules. It proposes:
1) Describing the relational data using a vocabulary of granular descriptors formed from Cartesian products of spatial, temporal, and signal information granules. This granular representation provides an interpretable perspective on the data.
2) Analyzing the capabilities of different vocabularies to capture the essence of the data through the processes of granulation and degranulation, where the original data is reconstructed from its granular representation. The quality of reconstruction is used to optimize the vocabulary.
3) Extending the approach to analyze evolvability of the granular description as the relational data changes across consecutive
Skyline Query Processing using Filtering in Distributed Environment (IJMER)
This document summarizes a research paper about skyline query processing in distributed databases. Skyline queries return multidimensional data points that are not dominated by other points. In distributed databases, skyline queries must be processed across multiple data sites. The paper proposes using multiple filtering points selected from each local skyline result to reduce the number of false positive results and communication costs between sites. Two heuristics called MaxSum and MaxDist are described for selecting filtering points that maximize their combined dominating potential across sites to improve distributed skyline query processing performance.
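The dominance test at the core of skyline computation can be made concrete with a short sketch; this is a naive single-site skyline in Python (minimizing every dimension), not the paper's distributed MaxSum/MaxDist filtering scheme:

```python
def dominates(p, q):
    # p dominates q if p is no worse in every dimension
    # and strictly better in at least one (smaller is better here).
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def skyline(points):
    # Keep only the points that no other point dominates.
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]

# Example: (4, 4) is dominated by (3, 3) and is filtered out.
pts = [(1, 9), (3, 3), (5, 2), (4, 4), (9, 1)]
result = skyline(pts)
```

In the distributed setting described above, each site would compute such a local skyline and then exchange a few well-chosen filtering points rather than the full result.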
DRSP: Dimension Reduction for Similarity Matching and Pruning of Time Series ... (IJDKP)
The document summarizes a research paper that proposes a framework called DRSP (Dimension Reduction for Similarity Matching and Pruning) for time series data streams. DRSP addresses the challenges of large streaming data size by:
1) Performing dimension reduction using a Multi-level Segment Mean technique to compactly represent the data while retaining crucial information.
2) Incorporating a similarity matching technique to analyze if new data objects match existing streams.
3) Applying a pruning technique to filter out non-relevant data object pairs and join only relevant pairs.
The framework aims to reduce storage and computation costs for similarity matching on large time series data streams.
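Step 1, the segment-mean dimension reduction, can be sketched as a simple per-segment averaging of the series (a piecewise-aggregate-style summary); this illustrative Python snippet is a guess at the flavour of the technique, not the paper's multi-level variant:

```python
def segment_means(series, n_segments):
    # Replace the series with per-segment means: a compact sketch that
    # retains the coarse shape while dropping most of the points.
    n = len(series)
    out = []
    for i in range(n_segments):
        lo = i * n // n_segments
        hi = (i + 1) * n // n_segments
        seg = series[lo:hi]
        out.append(sum(seg) / len(seg))
    return out

stream = [1, 1, 2, 2, 8, 8, 9, 9]
sketch = segment_means(stream, 2)   # two-segment summary of eight points
```

Similarity matching and pruning would then compare these short sketches instead of the full streams.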
A fuzzy clustering algorithm for high dimensional streaming data (Alexander Decker)
This document summarizes a research paper that proposes a new dimension-reduced weighted fuzzy clustering algorithm (sWFCM-HD) for high-dimensional streaming data. The algorithm can cluster datasets that have both high dimensionality and a streaming (continuously arriving) nature. It combines previous work on clustering algorithms for streaming data and high-dimensional data. The paper introduces the algorithm and compares it experimentally to show improvements in memory usage and runtime over other approaches for these types of datasets.
Enhancement of Single Moving Average Time Series Model Using Rough k-Means fo... (IJERA Editor)
This document proposes combining rough k-means clustering with a single moving average time series model to improve network traffic prediction. The document first discusses related work on network traffic prediction using various time series models. It then describes using a single moving average model to initially predict network packet loads, and enhancing this prediction by incorporating clusters identified through rough k-means analysis of the network data. The proposed integrated model is evaluated on real network traffic data and shown to improve prediction accuracy over the conventional single moving average model alone.
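The baseline predictor in this scheme, the single moving average, is straightforward to sketch; the window size and load values below are illustrative, and the rough k-means enhancement is not shown:

```python
def moving_average_forecast(history, window):
    # Single moving average: predict the next value as the mean
    # of the most recent `window` observations.
    recent = list(history)[-window:]
    return sum(recent) / len(recent)

# Hypothetical network packet loads; predict the next interval's load.
loads = [100, 120, 110, 130, 125]
prediction = moving_average_forecast(loads, 3)
```

The paper's contribution is to adjust such a prediction using cluster structure found in the traffic, rather than using the raw average alone.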
Application Of Extreme Value Theory To Bursts Prediction (CSCJournals)
Bursts and extreme events in quantities such as connection durations, file sizes, and throughput may produce undesirable consequences in computer networks, chief among them deterioration in the quality of service. Predicting these extreme events and bursts is important: it helps in reserving the right resources for a better quality of service. We applied Extreme Value Theory (EVT) to predict bursts in network traffic, taking a deeper look at its application through EVT-based exploratory data analysis. We found that traffic naturally divides into two categories, internal and external. The internal traffic follows a generalized extreme value (GEV) model with a negative shape parameter, which corresponds to the Weibull distribution. The external traffic follows a GEV with a positive shape parameter, which is the Frechet distribution. These findings are of great value to quality of service in data networks, especially when included in service level agreements as traffic descriptor parameters.
MULTIDIMENSIONAL ANALYSIS FOR QOS IN WIRELESS SENSOR NETWORKS (ijcses)
Nodes in a mobile ad-hoc network are connected wirelessly and the network is auto-configuring [1]. This paper introduces the usefulness of a data warehouse as an alternative for managing data collected by a WSN. Wireless sensor networks produce huge quantities of data that need to be processed and homogenized so as to help researchers and others interested in the information. Collected data are managed and compared with data coming from other sources, and the resulting systems can contribute to technical reporting and decision making. This paper proposes a model to design, extract, transform, and normalize data collected by wireless sensor networks by implementing a multidimensional warehouse for comparing many aspects of a WSN (such as routing protocol [4], sensors, sensor mobility, and clusters). Hence, a data warehouse defined and applied in this context is presented as a useful approach that gives specialists raw data and information for decision processes and lets them navigate from one aspect to another.
Progressive Mining of Sequential Patterns Based on Single Constraint (TELKOMNIKA JOURNAL)
Data that appear in time order and are stored in a sequence database can be processed to obtain sequential patterns. Sequential pattern mining is the process of obtaining sequential patterns from such a database. However, large amounts of data with a variety of data types and rapid data growth raise scalability issues in the mining process. On the other hand, users need to analyze data based on specific organizational needs, so constraints are used to impose limitations on the mining process. A constraint in sequential pattern mining can filter out short and trivial sequential patterns so that the remaining patterns satisfy user needs. Progressive mining of sequential patterns (PISA) based on a single constraint utilizes a Period of Interest (POI), a time frame predefined by the user, in a progressive sequential tree. Single-constraint checking in PISA uses the concept of anti-monotonic or monotonic constraints. As a result, the number of sequential patterns decreases, the total execution time of the mining process decreases, and system scalability is achieved.
International Journal of Engineering Research and Development (IJERD Editor)
Electrical, Electronics and Computer Engineering,
Information Engineering and Technology,
Mechanical, Industrial and Manufacturing Engineering,
Automation and Mechatronics Engineering,
Material and Chemical Engineering,
Civil and Architecture Engineering,
Biotechnology and Bio Engineering,
Environmental Engineering,
Petroleum and Mining Engineering,
Marine and Agriculture engineering,
Aerospace Engineering.
This document describes three approaches for indexing moving point data over time: the 3D R-tree, the HR-tree, and the 2+3 R-tree. An experiment was conducted to evaluate the storage space requirements and query performance of each approach. The results showed that while the HR-tree required the most storage space, its query processing costs were over 50% lower than the 3D R-tree and 2+3 R-tree. Compared to maintaining separate R-trees for each time state, the HR-tree also offered similar query performance while using around one third less storage space.
An Enhanced Framework for Improving Spatio-Temporal Queries for Global Positi... (IJORCS)
This document proposes a framework to improve the processing of spatio-temporal queries for global positioning systems. The framework employs a new indexing algorithm built on SQL Server 2008 that avoids the overhead of R-Tree indexing. It utilizes dynamic materialized views and an adaptive safe region to reduce communication costs and update loads. Caching is used to enhance performance. The notification engine processes concurrent queries using publish/subscribe to group similar queries. Experiments showed the framework outperformed R-Tree indexing.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
SVD BASED LATENT SEMANTIC INDEXING WITH USE OF THE GPU COMPUTATIONS (ijscmcj)
The purpose of this article is to determine the usefulness of Graphics Processing Unit (GPU) computation for implementing the Latent Semantic Indexing (LSI) reduction of the term-by-document matrix. The reduction is based on the Singular Value Decomposition (SVD). The high computational complexity of the SVD, O(n^3), makes reducing a large indexing structure a difficult task. This article compares the time complexity and accuracy of the algorithms implemented in two different environments: the first uses the CPU and MATLAB R2011a, and the second uses graphics processors and the CULA library. The calculations were carried out on generally available benchmark matrices, which were combined to obtain a resulting matrix of large size. Computations were performed in both environments for double- and single-precision data.
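The core operation, a truncated-SVD (rank-k) reduction of the term-by-document matrix, can be sketched on the CPU with NumPy; this is a generic illustration of the LSI reduction, not the article's MATLAB or CULA implementation:

```python
import numpy as np

def lsi_reduce(term_doc, k):
    # Rank-k approximation of the term-by-document matrix via truncated SVD:
    # keep only the k largest singular values and their vectors.
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]

# A rank-1 matrix is reconstructed exactly by its rank-1 approximation.
A = np.outer([1.0, 2.0, 3.0], [1.0, 0.0, 2.0])
A1 = lsi_reduce(A, 1)
```

On a GPU stack, the O(n^3) SVD step is the piece that would be offloaded.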
This document summarizes a research paper on developing an improved LEACH (Low-Energy Adaptive Clustering Hierarchy) communication protocol for energy efficient data mining in multi-feature sensor networks. It begins with background on wireless sensor networks and issues like energy efficiency. It then discusses the existing LEACH protocol and its drawbacks. The proposed improved LEACH protocol includes cluster heads, sub-cluster heads, and cluster nodes to address LEACH's limitations. This new version aims to minimize energy consumption during cluster formation and data aggregation in multi-feature sensor networks.
International Journal of Engineering and Science Invention (IJESI) (inventionjournals)
International Journal of Engineering and Science Invention (IJESI) is an international journal intended for professionals and researchers in all fields of computer science and electronics. IJESI publishes research articles and reviews within the whole field Engineering Science and Technology, new teaching methods, assessment, validation and the impact of new technologies and it will continue to provide information on the latest trends and developments in this ever-expanding subject. The publications of papers are selected through double peer reviewed to ensure originality, relevance, and readability. The articles published in our journal can be accessed online.
The document proposes a Modified Pure Radix Sort algorithm for large heterogeneous datasets. The algorithm divides the data into numeric and string processes that work simultaneously. The numeric process further divides data into sublists by element length and sorts them simultaneously using an even/odd logic across digits. The string process identifies common patterns to convert strings to numbers that are then sorted. This optimizes problems with traditional radix sort through a distributed computing approach.
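For reference, the numeric side of this scheme builds on the classic LSD radix sort; the sketch below is the plain sequential algorithm for non-negative integers, not the paper's parallel, heterogeneous variant:

```python
def radix_sort(nums, base=10):
    # LSD radix sort for non-negative integers: distribute into buckets
    # by each digit, from least to most significant.
    if not nums:
        return []
    nums = list(nums)
    place = 1
    while place <= max(nums):
        buckets = [[] for _ in range(base)]
        for n in nums:
            buckets[(n // place) % base].append(n)
        nums = [n for b in buckets for n in b]
        place *= base
    return nums

sorted_nums = radix_sort([170, 45, 75, 90, 802, 24, 2, 66])
```

The proposed modification runs such digit passes on numeric sublists (split by element length) concurrently, alongside a separate process for strings.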
Challenging issues of spatio temporal data mining (Alexander Decker)
This document discusses the challenging issues of spatio-temporal data mining. It begins with an introduction to spatio-temporal databases and how they differ from traditional databases by managing moving objects and their locations over time. It then provides an overview of spatial data mining and temporal data mining before focusing on spatio-temporal data mining, which aims to analyze large databases containing both spatial and temporal information. The document outlines some of the key challenges in applying traditional data mining techniques to spatio-temporal data due to its continuous and correlated nature.
This document describes a context-aware automatic traffic notification system for cell phones that can learn a user's common destinations and routes over time using location and context data. It collects GPS and other data from users, identifies important locations through clustering, learns frequent routes between locations, and can predict a user's destination and route to then notify them of any traffic conditions. The system is implemented on a mobile phone to provide automated traffic alerts to users during their daily commutes without needing to manually enter a destination.
Mobile information collectors trajectory data warehouse design (IJMIT JOURNAL)
To analyze complex phenomena that involve moving objects, the Trajectory Data Warehouse (TDW) appears to be an answer to many recent decision problems related to various professions concerned with mobility (physicians, commercial representatives, transporters, ecologists ...). This work aims to make trajectories a first-class concept in the trajectory data conceptual model and to design a TDW in which data resulting from mobile information collectors' trajectories are gathered. These data will be analyzed, according to trajectory characteristics, for decision-making purposes such as commercializing new products, siting new commerce, etc.
The paper presents a nature inspired algorithm that copies the big bang theory of evolution.
This algorithm is simple with regard to number of parameters. Embedded systems are powered by
batteries and enhancing the operating time of the battery by reducing the power consumption is vital.
Embedded systems consume power while accessing the memory during their operation. An efficient
method for power management is proposed in this work. The proposed method, reduce the energy
consumption in memories from 76% up to 98% as compared to other methods reported in the
literature.
The document discusses representing relational spatiotemporal data using information granules. It proposes:
1) Describing the relational data using a vocabulary of granular descriptors formed from Cartesian products of spatial, temporal, and signal information granules. This granular representation provides an interpretable perspective on the data.
2) Analyzing the capabilities of different vocabularies to capture the essence of the data through the processes of granulation and degranulation, where the original data is reconstructed from its granular representation. The quality of reconstruction is used to optimize the vocabulary.
3) Extending the approach to analyze evolvability of the granular description as the relational data changes across consecutive
Skyline Query Processing using Filtering in Distributed EnvironmentIJMER
This document summarizes a research paper about skyline query processing in distributed databases. Skyline queries return multidimensional data points that are not dominated by other points. In distributed databases, skyline queries must be processed across multiple data sites. The paper proposes using multiple filtering points selected from each local skyline result to reduce the number of false positive results and communication costs between sites. Two heuristics called MaxSum and MaxDist are described for selecting filtering points that maximize their combined dominating potential across sites to improve distributed skyline query processing performance.
DRSP: Dimension Reduction for Similarity Matching and Pruning of Time Series ... (IJDKP)
The document summarizes a research paper that proposes a framework called DRSP (Dimension Reduction for Similarity Matching and Pruning) for time series data streams. DRSP addresses the challenges of large streaming data size by:
1) Performing dimension reduction using a Multi-level Segment Mean technique to compactly represent the data while retaining crucial information.
2) Incorporating a similarity matching technique to analyze if new data objects match existing streams.
3) Applying a pruning technique to filter out non-relevant data object pairs and join only relevant pairs.
The framework aims to reduce storage and computation costs for similarity matching on large time series data streams.
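The segment-mean idea behind step 1 can be illustrated with a minimal single-level sketch; the paper's Multi-level Segment Mean technique applies this hierarchically, and the function name and data here are purely illustrative:

```python
def segment_means(series, n_segments):
    """Compact a series to n_segments values by averaging equal-share segments."""
    n = len(series)
    out = []
    for i in range(n_segments):
        lo, hi = i * n // n_segments, (i + 1) * n // n_segments
        out.append(sum(series[lo:hi]) / (hi - lo))
    return out

print(segment_means([1, 2, 3, 4, 5, 6, 7, 8], 4))  # [1.5, 3.5, 5.5, 7.5]
```

Similarity matching and pruning then operate on these short mean vectors instead of the raw stream, which is where the storage and computation savings come from.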
A fuzzy clustering algorithm for high dimensional streaming data (Alexander Decker)
This document summarizes a research paper that proposes a new dimension-reduced weighted fuzzy clustering algorithm (sWFCM-HD) for high-dimensional streaming data. The algorithm can cluster datasets that have both high dimensionality and a streaming (continuously arriving) nature. It combines previous work on clustering algorithms for streaming data and high-dimensional data. The paper introduces the algorithm and compares it experimentally to show improvements in memory usage and runtime over other approaches for these types of datasets.
Enhancement of Single Moving Average Time Series Model Using Rough k-Means fo... (IJERA Editor)
This document proposes combining rough k-means clustering with a single moving average time series model to improve network traffic prediction. The document first discusses related work on network traffic prediction using various time series models. It then describes using a single moving average model to initially predict network packet loads, and enhancing this prediction by incorporating clusters identified through rough k-means analysis of the network data. The proposed integrated model is evaluated on real network traffic data and shown to improve prediction accuracy over the conventional single moving average model alone.
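The baseline single moving average step can be sketched as below; the rough k-means enhancement is not reproduced, and the window size and packet-load values are invented for illustration:

```python
def sma_forecast(loads, window):
    """Forecast the next packet load as the mean of the last `window` observations."""
    recent = loads[-window:]
    return sum(recent) / len(recent)

packet_loads = [120, 130, 125, 140, 135]
print(round(sma_forecast(packet_loads, 3), 2))  # 133.33
```

The proposed model would then adjust such forecasts using cluster membership information obtained from rough k-means over the historical traffic.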
Application of Extreme Value Theory to Bursts Prediction (CSCJournals)
Bursts and extreme events in quantities such as connection durations, file sizes, and throughput may produce undesirable consequences in computer networks; deterioration in the quality of service is a major one. Predicting these extreme events and bursts is important: it helps in reserving the right resources for better quality of service. We applied extreme value theory (EVT) to predict bursts in network traffic, taking a deeper look into its application through EVT-based exploratory data analysis. We found that traffic is naturally divided into two categories, internal and external. The internal traffic follows a generalized extreme value (GEV) model with a negative shape parameter, which is the same as a Weibull distribution. The external traffic follows a GEV with a positive shape parameter, which is a Fréchet distribution. These findings are of great value to the quality of service in data networks, especially when included in service level agreements as traffic descriptor parameters.
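As a sketch of the EVT workflow, GEV models are typically fitted to block maxima extracted from the raw traffic series; the helper below shows only that extraction step (the fit itself would use a statistics package), and the traffic values are invented:

```python
def block_maxima(samples, block_size):
    """Per-block maxima of a traffic series -- the inputs to a GEV fit.
    A negative fitted shape parameter indicates the Weibull family,
    a positive one the Frechet family."""
    return [max(samples[i:i + block_size])
            for i in range(0, len(samples) - block_size + 1, block_size)]

traffic = [3, 9, 1, 4, 7, 2, 8, 5, 6, 10, 0, 2]
print(block_maxima(traffic, 4))  # [9, 8, 10]
```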
MULTIDIMENSIONAL ANALYSIS FOR QOS IN WIRELESS SENSOR NETWORKS (ijcses)
Nodes in a mobile ad-hoc network are connected wirelessly and the network is auto-configuring [1]. This paper introduces the usefulness of a data warehouse as an alternative for managing data collected by WSNs. A wireless sensor network produces huge quantities of data that need to be processed and homogenised so as to help researchers and others interested in the information. The collected data are managed and compared with data coming from other sources, so that systems can contribute to technical reporting and decision making. This paper proposes a model to design, extract, transform, and normalize data collected by wireless sensor networks by implementing a multidimensional warehouse for comparing many aspects of WSNs (such as routing protocol [4], sensor, sensor mobility, and cluster). Hence, a data warehouse defined and applied in this context is presented as a useful approach that gives specialists raw data and information for decision processes and lets them navigate from one aspect to another.
Progressive Mining of Sequential Patterns Based on Single Constraint (TELKOMNIKA JOURNAL)
Data that were appeared in the order of time and stored in a sequence database can be processed to obtain sequential patterns. Sequential pattern mining is the process to obtain sequential patterns from database. However, large amount of data with a variety of data type and rapid data growth raise the scalability issue in data mining process. On the other hand, user needs to analyze data based on specific organizational needs. Therefore, constraint is used to impose limitation in the mining process. Constraint in sequential pattern mining can reduce the short and trivial sequential patterns so that the sequential patterns satisfy user needs. Progressive mining of sequential patterns, PISA, based on single constraint utilizes Period of Interest (POI) as predefined time frame set by user in progressive sequential tree. Single constraint checking in PISA utilizes the concept of anti monotonic or monotonic constraint. Therefore, the number of sequential patterns will decrease, the total execution time of mining process will decrease and as a result, the system scalability will be achieved.
International Journal of Engineering Research and Development (IJERD Editor)
Electrical, Electronics and Computer Engineering,
Information Engineering and Technology,
Mechanical, Industrial and Manufacturing Engineering,
Automation and Mechatronics Engineering,
Material and Chemical Engineering,
Civil and Architecture Engineering,
Biotechnology and Bio Engineering,
Environmental Engineering,
Petroleum and Mining Engineering,
Marine and Agriculture engineering,
Aerospace Engineering.
This document describes three approaches for indexing moving point data over time: the 3D R-tree, the HR-tree, and the 2+3 R-tree. An experiment was conducted to evaluate the storage space requirements and query performance of each approach. The results showed that while the HR-tree required the most storage space, its query processing costs were over 50% lower than the 3D R-tree and 2+3 R-tree. Compared to maintaining separate R-trees for each time state, the HR-tree also offered similar query performance while using around one third less storage space.
An Enhanced Framework for Improving Spatio-Temporal Queries for Global Positi... (IJORCS)
This document proposes a framework to improve the processing of spatio-temporal queries for global positioning systems. The framework employs a new indexing algorithm built on SQL Server 2008 that avoids the overhead of R-Tree indexing. It utilizes dynamic materialized views and an adaptive safe region to reduce communication costs and update loads. Caching is used to enhance performance. The notification engine processes concurrent queries using publish/subscribe to group similar queries. Experiments showed the framework outperformed R-Tree indexing.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
SVD BASED LATENT SEMANTIC INDEXING WITH USE OF THE GPU COMPUTATIONS (ijscmcj)
The purpose of this article is to determine the usefulness of the Graphics Processing Unit (GPU) calculations used to implement the Latent Semantic Indexing (LSI) reduction of the TERM-BY DOCUMENT matrix. Considered reduction of the matrix is based on the use of the SVD (Singular Value Decomposition) decomposition. A high computational complexity of the SVD decomposition - O(n3), causes that a reduction of a large indexing structure is a difficult task. In this article there is a comparison of the time complexity and accuracy of the algorithms implemented for two different environments. The first environment is associated with the CPU and MATLAB R2011a. The second environment is related to graphics processors and the CULA library. The calculations were carried out on generally available benchmark matrices, which were combined to achieve the resulting matrix of high size. For both considered environments computations were performed for double and single precision data.
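The rank-k reduction at the heart of LSI can be sketched with NumPy on the CPU; the GPU/CULA path in the article follows the same algebra, and the term-by-document matrix here is a toy example:

```python
import numpy as np

def lsi_reduce(term_doc, k):
    """Rank-k approximation of a term-by-document matrix via truncated SVD."""
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    # Keep only the k largest singular triplets.
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

A = np.array([[1., 0., 1.],
              [0., 1., 0.],
              [1., 1., 0.]])
A2 = lsi_reduce(A, 2)
print(A2.shape)  # (3, 3)
```

The O(n^3) cost the article cites is that of the SVD itself, which is why offloading it to the GPU pays off for large indexing structures.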
This document summarizes a research paper on developing an improved LEACH (Low-Energy Adaptive Clustering Hierarchy) communication protocol for energy efficient data mining in multi-feature sensor networks. It begins with background on wireless sensor networks and issues like energy efficiency. It then discusses the existing LEACH protocol and its drawbacks. The proposed improved LEACH protocol includes cluster heads, sub-cluster heads, and cluster nodes to address LEACH's limitations. This new version aims to minimize energy consumption during cluster formation and data aggregation in multi-feature sensor networks.
International Journal of Engineering and Science Invention (IJESI) (inventionjournals)
International Journal of Engineering and Science Invention (IJESI) is an international journal intended for professionals and researchers in all fields of computer science and electronics. IJESI publishes research articles and reviews within the whole field Engineering Science and Technology, new teaching methods, assessment, validation and the impact of new technologies and it will continue to provide information on the latest trends and developments in this ever-expanding subject. The publications of papers are selected through double peer reviewed to ensure originality, relevance, and readability. The articles published in our journal can be accessed online.
The document proposes a Modified Pure Radix Sort algorithm for large heterogeneous datasets. The algorithm divides the data into numeric and string processes that work simultaneously. The numeric process further divides data into sublists by element length and sorts them simultaneously using an even/odd logic across digits. The string process identifies common patterns to convert strings to numbers that are then sorted. This optimizes problems with traditional radix sort through a distributed computing approach.
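For reference, the conventional LSD radix sort that the proposal modifies looks like this for non-negative integers; the paper's even/odd logic and parallel numeric/string processes are not reproduced here:

```python
def radix_sort(nums):
    """Least-significant-digit radix sort, base 10, for non-negative integers."""
    if not nums:
        return nums
    digits = len(str(max(nums)))
    for d in range(digits):
        buckets = [[] for _ in range(10)]
        for n in nums:
            buckets[(n // 10 ** d) % 10].append(n)  # stable per-digit distribution
        nums = [n for bucket in buckets for n in bucket]
    return nums

print(radix_sort([170, 45, 75, 90, 802, 24, 2, 66]))  # [2, 24, 45, 66, 75, 90, 170, 802]
```

The modified algorithm would split the input into numeric sublists by element length (sorted concurrently) and a string stream converted to numbers via common patterns, distributing this per-digit work across processes.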
Challenging issues of spatio-temporal data mining (Alexander Decker)
This document discusses the challenging issues of spatio-temporal data mining. It begins with an introduction to spatio-temporal databases and how they differ from traditional databases by managing moving objects and their locations over time. It then provides an overview of spatial data mining and temporal data mining before focusing on spatio-temporal data mining, which aims to analyze large databases containing both spatial and temporal information. The document outlines some of the key challenges in applying traditional data mining techniques to spatio-temporal data due to its continuous and correlated nature.
This document describes a context-aware automatic traffic notification system for cell phones that can learn a user's common destinations and routes over time using location and context data. It collects GPS and other data from users, identifies important locations through clustering, learns frequent routes between locations, and can predict a user's destination and route to then notify them of any traffic conditions. The system is implemented on a mobile phone to provide automated traffic alerts to users during their daily commutes without needing to manually enter a destination.
Mobile information collectors trajectory data warehouse design (IJMIT JOURNAL)
To analyze complex phenomena which involve moving objects, a Trajectory Data Warehouse (TDW) seems to be an answer to many recent decision problems related to various professions (physicians, commercial representatives, transporters, ecologists …) concerned with mobility. This work aims to make trajectories a first-class concept in the trajectory data conceptual model and to design a TDW in which data resulting from mobile information collectors' trajectories are gathered. These data will be analyzed, according to trajectory characteristics, for decision-making purposes, such as new product commercialization, new commerce implementation, etc.
A data mining approach for location prediction in mobile environments (marwaeng)
The document proposes a three-phase algorithm for predicting the next location of mobile users. In the first phase, mobility patterns are mined from historical user trajectory data. In the second phase, mobility rules are extracted from these patterns. In the third phase, predictions are made by matching mobility rules to a user's current trajectory. The algorithm aims to overcome limitations of prior work by discovering regular patterns in user movements and distinguishing between random and regular movements. A simulation evaluation found the proposed method achieved more accurate predictions than other methods.
The International Journal of Engineering & Science is aimed at providing a platform for researchers, engineers, scientists, or educators to publish their original research results, to exchange new ideas, to disseminate information in innovative designs, engineering experiences and technological skills. It is also the Journal's objective to promote engineering and technology education. All papers submitted to the Journal will be blind peer-reviewed. Only original articles will be published.
The papers for publication in The International Journal of Engineering& Science are selected through rigorous peer reviews to ensure originality, timeliness, relevance, and readability.
A Big Data Telco Solution by Dr. Laura Wynter (wkwsci-research)
Presented during the WKWSCI Symposium 2014
21 March 2014
Marina Bay Sands Expo and Convention Centre
Organized by the Wee Kim Wee School of Communication and Information at Nanyang Technological University
Efficient processing of spatial range queries on wireless broadcast streams (ijdms)
With advances in wireless networks and hand-held computing devices equipped with location sensing
capability (e.g., PDAs, laptops, and smart phones), a large number of location based services (LBSs) have
been successfully deployed. In LBSs, wireless broadcast is an efficient method to support the large number
of users. In wireless broadcast environments, existing approaches proposed to support range query search may tune into unnecessary indexes or data objects. This paper addresses the problem of processing range
queries on wireless broadcast streams. In order to support range queries efficiently, we propose a novel
indexing scheme called Distributed Space-Partitioning Index (DSPI). DSPI consists of hierarchical grids
that provide mobile clients with the global view as well as the local view of the broadcast data. The
algorithm for processing range queries based on DSPI is also proposed. Simulation experiments demonstrate that DSPI is superior to existing index schemes.
This document discusses techniques for predicting the next location of a user based on their location history data. It proposes using incremental learning methods like multivariate multiple regression, spherical-spherical regression, and randomized spherical K-NN regression on a damped window model to solve the location prediction problem in a streaming data setup. The techniques allow planning travel by providing routes and nearby facilities to predicted and current locations using APIs like Google Maps.
Ranking spatial data by quality preferences ppt (Saurav Kumar)
A spatial preference query ranks objects based on the qualities of features in their spatial neighborhood. For example, using a real estate agency database of flats for lease, a customer may want to rank the flats with respect to the appropriateness of their location, defined after aggregating the qualities of other features (e.g., restaurants, cafes, hospital, market, etc.) within their spatial neighborhood. Such a neighborhood concept can be specified by the user via different functions. It can be an explicit circular region within a given distance from the flat. Another intuitive definition is to assign higher weights to the features based on their proximity to the flat. In this paper, we formally define spatial preference queries and propose appropriate indexing techniques and search algorithms for them. Extensive evaluation of our methods on both real and synthetic data reveals that an optimized branch-and-bound solution is efficient and robust with respect to different parameters
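The range-based neighbourhood score can be sketched naively as below; the paper's indexing and branch-and-bound search are not reproduced, and the coordinates and quality values are invented:

```python
import math

def neighborhood_score(flat, features, radius):
    """Sum the qualities of all features within `radius` of the flat (range score)."""
    fx, fy = flat
    return sum(q for x, y, q in features if math.hypot(x - fx, y - fy) <= radius)

features = [(1, 1, 0.9), (2, 2, 0.5), (10, 10, 1.0)]  # (x, y, quality)
print(neighborhood_score((0, 0), features, 3.0))  # sums the two nearby qualities
```

Flats would then be ranked by this score; the influence-based variant mentioned in the abstract instead weights each feature's quality by its proximity to the flat.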
Stream Processing Environmental Applications in Jordan Valley (CSCJournals)
This document discusses stream processing applications for environmental monitoring in Jordan Valley. It presents statistical data collected from weather stations in different Jordan Valley locations. Stream processing is important for continuous monitoring systems to detect events in real-time. The document outlines considerations for stream processing engine design like communication, computation, and flexibility. It also describes Jordan's Irrigation Management Information System, which uses real-time meteorological data from weather stations to optimize water usage for agriculture.
This document describes a web application for analyzing building energy management data using predictive modeling and machine learning techniques. The application contains years of sensor data from a CUNY building and allows users to visualize the data, perform statistical analysis, and generate forecasts using Python modules. Key features include interactive data visualization, filtering and selecting subsets of data, defining expressions of sensor variables, and applying machine learning models for prediction. The application provides a customizable platform for exploring time series data while allowing different users to share their work.
Survey on confidentiality of the user and query processing on spatial network (eSAT Journals)
Abstract
The administration of transportation systems has become increasingly important in many applications such as location-based services, supply chain management, and traffic control. These applications usually involve queries over spatial networks with dynamically changing and unpredictable travel conditions. Users' privacy may be violated when they query for location information on third-party servers, where the users' locations can be tracked and malicious attackers may steal that location information. k-nearest-neighbour query verification with location points on a Voronoi diagram increases the verification cost on mobile clients. Reverse nearest-neighbour queries are handled by assigning each object and query a safe region, so that expensive recomputation is not required as long as the query and objects remain in their respective safe regions. The proposed system reduces the communication cost in client-server architectures because an object does not report its location to the server unless it leaves its safe region or the server sends a location update request. A Hilbert curve is used here for its capability of partially retaining the neighbouring adjacency of the original data: the user data is protected by applying the Hilbert transform to the original values and storing the transformed values on the Hilbert curve.
Keywords— Hilbert Curve, Voronoi diagram, Hilbert Transform
Recently, with the increasing development of distributed computer systems (DCSs) in networked industrial and manufacturing applications on the World Wide Web (WWW) platform, including service-oriented architectures and Web of Things QoS-aware systems, it has become important to predict Web performance. In this paper, we present Web performance prediction in time by forecasting the download of a Web resource using the efficient Turning Bands (TB) geostatistical simulation method. Real-life data for the research were obtained from our own website, named "Distributed forecasting system", by generating its log file and monitoring a group of Web clients in a connected LAN. For better Web prediction we used a spatio-temporal prediction method with a time utility for downloading a particular file from the website, calculating the forecast with the Turning Bands method; to further improve forecasting accuracy, the efficient Turning Bands method, which is based on the naive Bayes algorithm, is used and its results are compared with those of the standard Turning Bands method. The results show that the efficient Turning Bands method achieves good forecasting quality for Web performance prediction.
Modeling the Adaption Rule in Context-aware Systems (ijasuc)
Context awareness is increasingly gaining applicability in interactive, ubiquitous mobile computing systems. Each context-aware application has its own set of behaviors with which it reacts to context modifications. This paper is concerned with context modeling and the development methodology for context-aware systems. We propose a rule-based approach and use an adaption tree to model the adaption rule of context-aware systems. We illustrate this idea with an arithmetic game application.
Mobile Location Indexing Based On Synthetic Moving Objects (IJECEIAES)
Today, much research has moved from traditional static indexing to indexing of moving data, known as mobile object indexing. There are several indexing approaches for handling complicated moving positions; one suitable idea is to pre-order the objects before building the index structure. In this paper, a presorted-nearest index tree algorithm is proposed that allows maintaining, updating, and range-querying mobile objects within a desired period. It also gives the advantages of an index structure: easy data access and fast queries, along with retrieval of the nearest locations to a given point in the index structure. A synthetic mobile position dataset is also proposed for performance evaluation, so that it is free from location privacy and confidentiality concerns. Detailed experimental results are discussed together with a performance evaluation of a KD-tree-based index structure. Both approaches are similarly efficient in range searching; however, the proposed approach saves much more time for nearest-neighbour search within a range than the KD-tree-based calculation.
Certain Analysis on Traffic Dataset based on Data Mining Algorithms (IRJET Journal)
The document analyzes a traffic accident dataset using data mining algorithms to identify patterns and relationships that can provide safe driving suggestions. It applies association rule mining, classification using naive Bayes, and k-means clustering. The analysis finds that human factors like being drunk or collision type have a stronger effect on accident fatality than environmental factors. Clustering identifies regions with higher or lower fatality rates. Integrating additional data could enable more testing and safety suggestions.
Use of Hidden Markov Mobility Model for Location Based Services (IJERA Editor)
These days people prefer portable, wireless devices such as laptops and mobile phones, connected through satellites. As a user moves from one point to another, the task of updating stored information becomes difficult. Providing location-based services to users faces challenges such as limited bandwidth and limited client power. To optimize data accessibility and minimize access cost, frequently accessed data items can be stored in a client-side cache, so a small cache is introduced in mobile devices: data fetched from the server is stored in the cache, and requested data is served from the cache rather than the remote server. The question arises of which data should be kept in the cache. Cache performance depends largely on the replacement policy, which selects data for eviction. This paper presents the use of Hidden Markov Models (HMMs) to predict a user's future location; data items irrelevant to this predicted location are then evicted from the cache. The proposed approach clusters location histories according to their location characteristics and also considers each user's previous actions. This results in a high packet delivery ratio and minimal delay.
Similar to PERFORMANCE EVALUATION OF TRAJECTORY QUERIES ON MULTIPROCESSOR AND CLUSTER
ANALYSIS OF LAND SURFACE DEFORMATION GRADIENT BY DINSAR (cscpconf)
The progressive development of Synthetic Aperture Radar (SAR) systems diversifies the exploitation of the images generated by these systems in different geoscience applications. Detection and monitoring of surface deformations produced by various phenomena have benefited from this evolution and have been realized with interferometry (InSAR) and differential interferometry (DInSAR) techniques. Nevertheless, spatial and temporal decorrelation of the interferometric couples used strongly limits the precision of the analysis results of these techniques. In this context, we propose a methodological approach for detecting and analyzing surface deformation with differential interferograms, to show the limits of this technique according to noise quality and level. The detectability model is generated from the deformation signatures by simulating a linear fault merged into the image couples of the ERS1/ERS2 sensors acquired in a region of the Algerian south.
4D AUTOMATIC LIP-READING FOR SPEAKER'S FACE IDENTIFICATION (cscpconf)
A novel trajectory-guided, concatenative approach for synthesizing high-quality video rendered from real image samples is proposed. The automated lip-reading seeks the real image sample sequence, preserved in the library under the video data, that is closest to the HMM-predicted trajectory. The object trajectory is obtained by projecting the face patterns into a KDA feature space. The approach identifies a speaker's face by synthesizing the identity surface of a subject's face from a small sample of patterns that sparsely cover the view sphere. A KDA algorithm is used to discriminate the lip-reading images; the fundamental lip feature vector is then reduced to a low dimension using the 2D-DCT. The dimensionality of the mouth area set is reduced based on PCA to obtain the eigen-lips approach proposed by [33]. The subjective performance results of the cost function under the automatic lip-reading model did not illustrate superior performance of the method.
MOVING FROM WATERFALL TO AGILE PROCESS IN SOFTWARE ENGINEERING CAPSTONE PROJE... (cscpconf)
Universities offer a software engineering capstone course to simulate a real-world working environment in which students can work in a team for a fixed period to deliver a quality product. The objective of this paper is to report on our experience in moving from a Waterfall process to an Agile process in conducting the software engineering capstone project. We present the capstone course designs for both the Waterfall-driven and Agile-driven methodologies, highlighting the structure, deliverables, and assessment plans. To evaluate the improvement, we conducted a survey in two different sections taught by two different instructors to evaluate students' experience in moving from the traditional Waterfall model to an Agile-like process. Twenty-eight students filled in the survey, which consisted of eight multiple-choice questions and an open-ended question to collect feedback from students. The survey results show that students were able to attain hands-on experience simulating a real-world working environment. The results also show that the Agile approach helped students to produce an overall better design and avoid mistakes they had made in the initial design completed in the first phase of the capstone project. In addition, they were able to assess their team capabilities and training needs, and thus learn the required technologies earlier, which is reflected in the final product quality.
PROMOTING STUDENT ENGAGEMENT USING SOCIAL MEDIA TECHNOLOGIES (cscpconf)
This document discusses using social media technologies to promote student engagement in a software project management course. It describes the course and objectives of enhancing communication. It discusses using Facebook for 4 years, then switching to WhatsApp based on student feedback, and finally introducing Slack to enable personalized team communication. Surveys found students engaged and satisfied with all three tools, though less familiar with Slack. The conclusion is that social media promotes engagement but familiarity with the tool also impacts satisfaction.
A SURVEY ON QUESTION ANSWERING SYSTEMS: THE ADVANCES OF FUZZY LOGIC (cscpconf)
Using a computer to answer questions has been a human dream since the beginning of the digital era. Question-answering systems are referred to as intelligent systems that can provide responses to questions asked by the user, based on facts or rules stored in a knowledge base, and can generate answers to questions asked in natural language. One of the first main ideas of fuzzy logic was to work on the problem of computer understanding of natural language. This survey paper therefore provides an overview of what question answering is, its system architecture, and its possible relationship to and differences from fuzzy logic, as well as previous related research with respect to the approaches that were followed. At the end, the survey provides an analytical discussion of the proposed QA models, alone or combined with fuzzy logic, and their main contributions and limitations.
DYNAMIC PHONE WARPING – A METHOD TO MEASURE THE DISTANCE BETWEEN PRONUNCIATIONS (cscpconf)
Human beings generate different speech waveforms when speaking the same word at different times. Different human beings also have different accents and generate significantly varying speech waveforms for the same word. There is a need to measure the distances between various words, which facilitates the preparation of pronunciation dictionaries. A new algorithm called Dynamic Phone Warping (DPW) is presented in this paper. It uses dynamic programming techniques for global alignment and shortest-distance measurement. The DPW algorithm can be used to enhance the pronunciation dictionaries of well-known languages like English or to build pronunciation dictionaries for less-known sparse languages. The precision measurement experiments show 88.9% accuracy.
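A minimal sketch of the global-alignment distance that DPW builds on, using unit costs over phone symbols, is shown below; the paper's exact cost model may differ, and the phone sequences are illustrative:

```python
def phone_distance(a, b):
    """Edit distance between two phone sequences via dynamic programming."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i          # deletions
    for j in range(n + 1):
        dp[0][j] = j          # insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # delete
                           dp[i][j - 1] + 1,        # insert
                           dp[i - 1][j - 1] + sub)  # match / substitute
    return dp[m][n]

# "tomato" in two accents, written as phone sequences
print(phone_distance(["t", "ow", "m", "ey", "t", "ow"],
                     ["t", "ow", "m", "aa", "t", "ow"]))  # 1
```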
INTELLIGENT ELECTRONIC ASSESSMENT FOR SUBJECTIVE EXAMS cscpconf
In education, the use of electronic (E) examination systems is not a novel idea, as Eexamination systems have been used to conduct objective assessments for the last few years. This research deals with randomly designed E-examinations and proposes an E-assessment system that can be used for subjective questions. This system assesses answers to subjective questions by finding a matching ratio for the keywords in instructor and student answers. The matching ratio is achieved based on semantic and document similarity. The assessment system is composed of four modules: preprocessing, keyword expansion, matching, and grading. A survey and case study were used in the research design to validate the proposed system. The examination assessment system will help instructors to save time, costs, and resources, while increasing efficiency and improving the productivity of exam setting and assessments.
TWO DISCRETE BINARY VERSIONS OF AFRICAN BUFFALO OPTIMIZATION METAHEURISTICcscpconf
African Buffalo Optimization (ABO) is one of the most recent swarms intelligence based metaheuristics. ABO algorithm is inspired by the buffalo’s behavior and lifestyle. Unfortunately, the standard ABO algorithm is proposed only for continuous optimization problems. In this paper, the authors propose two discrete binary ABO algorithms to deal with binary optimization problems. In the first version (called SBABO) they use the sigmoid function and probability model to generate binary solutions. In the second version (called LBABO) they use some logical operator to operate the binary solutions. Computational results on two knapsack problems (KP and MKP) instances show the effectiveness of the proposed algorithm and their ability to achieve good and promising solutions.
DETECTION OF ALGORITHMICALLY GENERATED MALICIOUS DOMAINcscpconf
In recent years, many malware writers have relied on Dynamic Domain Name Services (DDNS) to maintain their Command and Control (C&C) network infrastructure to ensure a persistence presence on a compromised host. Amongst the various DDNS techniques, Domain Generation Algorithm (DGA) is often perceived as the most difficult to detect using traditional methods. This paper presents an approach for detecting DGA using frequency analysis of the character distribution and the weighted scores of the domain names. The approach’s feasibility is demonstrated using a range of legitimate domains and a number of malicious algorithmicallygenerated domain names. Findings from this study show that domain names made up of English characters “a-z” achieving a weighted score of < 45 are often associated with DGA. When a weighted score of < 45 is applied to the Alexa one million list of domain names, only 15% of the domain names were treated as non-human generated.
GLOBAL MUSIC ASSET ASSURANCE DIGITAL CURRENCY: A DRM SOLUTION FOR STREAMING C...cscpconf
The document proposes a blockchain-based digital currency and streaming platform called GoMAA to address issues of piracy in the online music streaming industry. Key points:
- GoMAA would use a digital token on the iMediaStreams blockchain to enable secure dissemination and tracking of streamed content. Content owners could control access and track consumption of released content.
- Original media files would be converted to a Secure Portable Streaming (SPS) format, embedding watermarks and smart contract data to indicate ownership and enable validation on the blockchain.
- A browser plugin would provide wallets for fans to collect GoMAA tokens as rewards for consuming content, incentivizing participation and addressing royalty discrepancies by recording
IMPORTANCE OF VERB SUFFIX MAPPING IN DISCOURSE TRANSLATION SYSTEMcscpconf
This document discusses the importance of verb suffix mapping in discourse translation from English to Telugu. It explains that after anaphora resolution, the verbs must be changed to agree with the gender, number, and person features of the subject or anaphoric pronoun. Verbs in Telugu inflect based on these features, while verbs in English only inflect based on number and person. Several examples are provided that demonstrate how the Telugu verb changes based on whether the subject or pronoun is masculine, feminine, neuter, singular or plural. Proper verb suffix mapping is essential for generating natural and coherent translations while preserving the context and meaning of the original discourse.
EXACT SOLUTIONS OF A FAMILY OF HIGHER-DIMENSIONAL SPACE-TIME FRACTIONAL KDV-T...cscpconf
In this paper, based on the definition of conformable fractional derivative, the functional
variable method (FVM) is proposed to seek the exact traveling wave solutions of two higherdimensional
space-time fractional KdV-type equations in mathematical physics, namely the
(3+1)-dimensional space–time fractional Zakharov-Kuznetsov (ZK) equation and the (2+1)-
dimensional space–time fractional Generalized Zakharov-Kuznetsov-Benjamin-Bona-Mahony
(GZK-BBM) equation. Some new solutions are procured and depicted. These solutions, which
contain kink-shaped, singular kink, bell-shaped soliton, singular soliton and periodic wave
solutions, have many potential applications in mathematical physics and engineering. The
simplicity and reliability of the proposed method is verified.
AUTOMATED PENETRATION TESTING: AN OVERVIEWcscpconf
The document discusses automated penetration testing and provides an overview. It compares manual and automated penetration testing, noting that automated testing allows for faster, more standardized and repeatable tests but has limitations in developing new exploits. It also reviews some current automated penetration testing methodologies and tools, including those using HTTP/TCP/IP attacks, linking common scanning tools, a Python-based tool targeting databases, and one using POMDPs for multi-step penetration test planning under uncertainty. The document concludes that automated testing is more efficient than manual for known vulnerabilities but cannot replace manual testing for discovering new exploits.
CLASSIFICATION OF ALZHEIMER USING fMRI DATA AND BRAIN NETWORKcscpconf
Since the mid of 1990s, functional connectivity study using fMRI (fcMRI) has drawn increasing
attention of neuroscientists and computer scientists, since it opens a new window to explore
functional network of human brain with relatively high resolution. BOLD technique provides
almost accurate state of brain. Past researches prove that neuro diseases damage the brain
network interaction, protein- protein interaction and gene-gene interaction. A number of
neurological research paper also analyse the relationship among damaged part. By
computational method especially machine learning technique we can show such classifications.
In this paper we used OASIS fMRI dataset affected with Alzheimer’s disease and normal
patient’s dataset. After proper processing the fMRI data we use the processed data to form
classifier models using SVM (Support Vector Machine), KNN (K- nearest neighbour) & Naïve
Bayes. We also compare the accuracy of our proposed method with existing methods. In future,
we will other combinations of methods for better accuracy.
VALIDATION METHOD OF FUZZY ASSOCIATION RULES BASED ON FUZZY FORMAL CONCEPT AN...cscpconf
The document proposes a new validation method for fuzzy association rules based on three steps: (1) applying the EFAR-PN algorithm to extract a generic base of non-redundant fuzzy association rules using fuzzy formal concept analysis, (2) categorizing the extracted rules into groups, and (3) evaluating the relevance of the rules using structural equation modeling, specifically partial least squares. The method aims to address issues with existing fuzzy association rule extraction algorithms such as large numbers of extracted rules, redundancy, and difficulties with manual validation.
PROBABILITY BASED CLUSTER EXPANSION OVERSAMPLING TECHNIQUE FOR IMBALANCED DATAcscpconf
In many applications of data mining, class imbalance is noticed when examples in one class are
overrepresented. Traditional classifiers result in poor accuracy of the minority class due to the
class imbalance. Further, the presence of within class imbalance where classes are composed of
multiple sub-concepts with different number of examples also affect the performance of
classifier. In this paper, we propose an oversampling technique that handles between class and
within class imbalance simultaneously and also takes into consideration the generalization
ability in data space. The proposed method is based on two steps- performing Model Based
Clustering with respect to classes to identify the sub-concepts; and then computing the
separating hyperplane based on equal posterior probability between the classes. The proposed
method is tested on 10 publicly available data sets and the result shows that the proposed
method is statistically superior to other existing oversampling methods.
CHARACTER AND IMAGE RECOGNITION FOR DATA CATALOGING IN ECOLOGICAL RESEARCHcscpconf
Data collection is an essential, but manpower intensive procedure in ecological research. An
algorithm was developed by the author which incorporated two important computer vision
techniques to automate data cataloging for butterfly measurements. Optical Character
Recognition is used for character recognition and Contour Detection is used for imageprocessing.
Proper pre-processing is first done on the images to improve accuracy. Although
there are limitations to Tesseract’s detection of certain fonts, overall, it can successfully identify
words of basic fonts. Contour detection is an advanced technique that can be utilized to
measure an image. Shapes and mathematical calculations are crucial in determining the precise
location of the points on which to draw the body and forewing lines of the butterfly. Overall,
92% accuracy were achieved by the program for the set of butterflies measured.
SOCIAL MEDIA ANALYTICS FOR SENTIMENT ANALYSIS AND EVENT DETECTION IN SMART CI...cscpconf
Smart cities utilize Internet of Things (IoT) devices and sensors to enhance the quality of the city
services including energy, transportation, health, and much more. They generate massive
volumes of structured and unstructured data on a daily basis. Also, social networks, such as
Twitter, Facebook, and Google+, are becoming a new source of real-time information in smart
cities. Social network users are acting as social sensors. These datasets so large and complex
are difficult to manage with conventional data management tools and methods. To become
valuable, this massive amount of data, known as 'big data,' needs to be processed and
comprehended to hold the promise of supporting a broad range of urban and smart cities
functions, including among others transportation, water, and energy consumption, pollution
surveillance, and smart city governance. In this work, we investigate how social media analytics
help to analyze smart city data collected from various social media sources, such as Twitter and
Facebook, to detect various events taking place in a smart city and identify the importance of
events and concerns of citizens regarding some events. A case scenario analyses the opinions of
users concerning the traffic in three largest cities in the UAE
SOCIAL NETWORK HATE SPEECH DETECTION FOR AMHARIC LANGUAGEcscpconf
The anonymity of social networks makes it attractive for hate speech to mask their criminal
activities online posing a challenge to the world and in particular Ethiopia. With this everincreasing
volume of social media data, hate speech identification becomes a challenge in
aggravating conflict between citizens of nations. The high rate of production, has become
difficult to collect, store and analyze such big data using traditional detection methods. This
paper proposed the application of apache spark in hate speech detection to reduce the
challenges. Authors developed an apache spark based model to classify Amharic Facebook
posts and comments into hate and not hate. Authors employed Random forest and Naïve Bayes
for learning and Word2Vec and TF-IDF for feature selection. Tested by 10-fold crossvalidation,
the model based on word2vec embedding performed best with 79.83%accuracy. The
proposed method achieve a promising result with unique feature of spark for big data.
GENERAL REGRESSION NEURAL NETWORK BASED POS TAGGING FOR NEPALI TEXTcscpconf
This article presents Part of Speech tagging for Nepali text using General Regression Neural
Network (GRNN). The corpus is divided into two parts viz. training and testing. The network is
trained and validated on both training and testing data. It is observed that 96.13% words are
correctly being tagged on training set whereas 74.38% words are tagged correctly on testing
data set using GRNN. The result is compared with the traditional Viterbi algorithm based on
Hidden Markov Model. Viterbi algorithm yields 97.2% and 40% classification accuracies on
training and testing data sets respectively. GRNN based POS Tagger is more consistent than the
traditional Viterbi decoding technique.
146 Computer Science & Information Technology (CS & IT)
The analysis comprises querying some points of interest in big data sets. Querying big data can
be time consuming and expensive without the right software and hardware. In this paper, various
databases are evaluated for analysing such data. However, there is no single database that fits all
queries. The same holds for hardware infrastructure: there is no single hardware platform that fits
all databases. We consider the case of trajectory data of a telecommunication company, where
analysing large volumes of trajectory data of mobile users becomes very important. We
evaluate the performance of trajectory queries in order to contribute to business decision support
systems.
Trajectory data describes the localization of a user in time and space. In the context of this
paper, a telecommunication company wants to optimize the use of cell antennas and localize
different points of interest in order to expand its business. Successfully processing trajectory
data requires a proper choice of databases and hardware that respond efficiently to different
queries.
We use trajectory data collected from Telenor Sverige (a telecommunication company that
operates in Sweden). Mobile users' positions are tracked every five minutes for an entire week
(Monday to Sunday) in a medium-sized city. We are interested in how mobile users move
around the city over the hour, the day, and the week. This gives insights into typical behaviour
in a certain area at a certain time. We expect periodic movement in some areas, e.g., at the
locations of stores, or at restaurants during lunch time.
Without loss of generality, we define queries that return points of interest, such as the nearest
cell location from a certain position of a mobile user. The contribution of this study is to solve a
complex business problem, that is:
• Define queries that optimize the points of interest, e.g., the nearest point of interest, the most
visited place at a certain time, and more.
• Choose the database technology to use for different types of query.
• Choose the hardware infrastructure to use for each of the databases.
This data is modelled as spatio-temporal data where, at a given time t, a mobile user is located at
a position (x, y). The location of a mobile user is thus a triple (x, y, t), such that the user's
position is represented as a spatio-temporal point p = (x, y, t).
To optimize points of interest, different types of queries are proposed. They differ in terms of
their input and output:
• Distance query, which finds the points of interest located within a given distance, e.g., one
kilometre, from a certain position of a mobile user.
• K-nearest neighbour query, which finds the K nearest points of interest from a certain position
of a mobile user.
• Range query, which finds the points of interest within a space range from a certain position of
a mobile user.
• Region query, which finds the region that a mobile user frequently passes through at a certain
time throughout the week.
The performance of the different queries is evaluated on three open-source databases: Cassandra,
MongoDB, and PostgreSQL. We choose open-source databases for the sake of reproducibility of
the study. The hardware configuration is a single node, and multiple (distributed) nodes in a
cluster. The execution time of each query on each database on the different hardware
infrastructures is measured. Once the company knows which locations are the most, or the least,
visited during a certain time, antenna planning can be updated accordingly to avoid overloading
and underloading at such locations. For business expansion, a location that is busy during lunch
time is, for example, convenient for putting up a restaurant. Moreover, the performance
measurements show which database is better for which specific query on which hardware
infrastructure, thus contributing to business support systems.
The rest of the paper is organized as follows: Section 2 defines the concepts, Section 3
summarizes the related work, Section 4 describes the configuration and gives an overview of the
databases, Section 5 presents results and discussions, and finally Section 6 draws conclusions.
2. TRAJECTORY DATA
2.1. Definition of Trajectory
A trajectory is a function from a temporal domain to a range of spatial values, i.e., it has a
beginning time and an ending time during which a space has been travelled (see Equation 1) [1].

[t_begin, t_end] → space (1)

A complete trajectory is characterized by a list of triples p = (x, y, t); thus a trajectory is defined
as a sequence of positions Ʈ_s:

Ʈ_s = {p_1, p_2, ..., p_n} (2)

where p_i = (x_i, y_i, t_i) represents a spatio-temporal point. Figure 1 shows such a trajectory.
Figure 1. Mobile user’s trajectory as a sequence of triples
In this study, the trajectory data has a space extent that is represented by latitude and longitude,
with x representing latitude and y representing longitude, and the time represented by t; i.e., a
mobile user is located at position (x, y) at time t.
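As an illustration, the point and trajectory definitions above can be sketched as a plain data structure. This is a minimal sketch; the type and field names are ours, not the paper's:

```python
from typing import NamedTuple

class Point(NamedTuple):
    # Spatio-temporal point p = (x, y, t) from Equation (2).
    x: float  # latitude, radians
    y: float  # longitude, radians
    t: int    # timestamp, e.g. seconds since the start of the week

# A trajectory is a time-ordered sequence of such points; in the Telenor
# Sverige data set one position is sampled every five minutes (300 s).
trajectory = [Point(1.3963, -0.6981, 0), Point(1.3964, -0.6980, 300)]
```

Ordering by t is what distinguishes a trajectory from a plain set of positions.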
2.2. Definition of Trajectory Queries
Trajectory queries operate on historical spatio-temporal data, which is the foundation of many
applications such as traffic analysis, mobile user behaviour analysis, and many others [2], [3].
Trajectory queries make analytics possible, e.g., retrieving mobile users' positions at a certain
time. In the context of location optimization, the common trajectory queries that we consider in
this study are the following: distance query, nearest neighbour query, range query, and region
query.
Figure 2 describes the query types, where C_i represents the different cell-city names; each C_i
is represented by (x_i, y_i), where x_i is latitude and y_i is longitude. The distance query returns
the cell-cities located within a distance from C_1, e.g., within distance L from the position of
C_1; the query returns C_2, C_3, C_4, ...
At a given fixed time or time interval, retrieving the two cell-cities that are closest to cell-city
C_1 is a K-NN query with k = 2. Given a space range [B, E], the range query returns the cell-
cities that belong to that space range.
The region query returns the cell-city that a given user frequently visits, e.g., user Bob passes
mostly through cell-city C_8 (see Figure 2).
Figure 2. Visualization for Query Types
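The region query amounts to a frequency count over the cell-cities a user's trajectory passes through. A minimal Python sketch of that semantics (an illustration, not the paper's implementation):

```python
from collections import Counter

def region_query(visited_cells):
    # Most frequently visited cell-city along a user's trajectory.
    # `visited_cells` is the sequence of cell-cities observed, e.g. one
    # per five-minute position sample in the Telenor Sverige data.
    return Counter(visited_cells).most_common(1)[0][0]
```

For instance, `region_query(["C8", "C2", "C8", "C8", "C1"])` returns `"C8"`, matching the Bob example above.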
2.2.1. Distance Query
Definition: The distance query returns all points of interest (e.g., gas stations) whose distance
(according to a distance metric) from a given position is less than a threshold [2], [3]. Figure 3
shows the inputs to the distance query.
Figure 3. Distance query
Example: find the cells that are located less than 1 km from a certain mobile user's position.
In terms of latitude and longitude coordinates (in radians), a query that covers the circle of
10 km radius from a user position at (x_u, y_u) = (1.3963, −0.6981) is expressed for the
different databases as follows:
• In Cassandra
SELECT cell_city FROM mobility WHERE expr(info_index, '{filter: {type: "boolean",
must: [{type: "geo_distance", field: "place", latitude: 1.3963,
longitude: -0.6981, max_distance: "10km"}]}}');
• In MongoDB
db.Mobility.find( { location: { $near
: [-0.6981, 1.3963], $maxDistance: 10 } }, { cell_city: 1 })
• In PostgreSQL
SELECT cell_city FROM mobility WHERE
arccos(sin(x_u) * sin(x) + cos(x_u) * cos(x) * cos(y - y_u)) * R <= 10 ; (3)
with R the radius of the earth, R = 6371 km, and (x, y) the stored cell coordinates.
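Equation (3) is the spherical law of cosines for great-circle distance. A minimal Python sketch of the distance predicate, with coordinates in radians as above (an illustration, not code from the paper):

```python
import math

EARTH_RADIUS_KM = 6371.0  # R in Equation (3)

def great_circle_km(x_u, y_u, x, y):
    # Great-circle distance in km between the user position (x_u, y_u)
    # and a cell at (x, y), latitudes/longitudes in radians.
    cos_angle = (math.sin(x_u) * math.sin(x)
                 + math.cos(x_u) * math.cos(x) * math.cos(y - y_u))
    # Clamp to [-1, 1] so acos() never fails on floating-point drift.
    return math.acos(max(-1.0, min(1.0, cos_angle))) * EARTH_RADIUS_KM
```

A cell then satisfies the distance query exactly when `great_circle_km(x_u, y_u, x, y) <= 10`, mirroring the WHERE clause of query (3).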
In order to index latitude and longitude columns so that the database understand the query, the
circle of radius 10 km is bounded by a minimum and a maximum coordinates, let’s say =
(ݔ , ݕ) and ௫ = (ݔ௫ , ݕ௫), then query (3) becomes as follows;
ܵܶܥܧܮܧ ݕݐ݅ܿ_݈݈݁ܥ ܯܱܴܨ ݉ݕݐ݈ܾ݅݅ ܹܧܴܧܪ (ݔ => ݔ ܦܰܣ ݔ ≤ ݔ௫) ܦܰܣ
(ݕ >= ݕ ܦܰܣ ݕ <= ݕ௫) ܦܰܣ
ܽݔ(݊݅ݏ(ݏܿܿݎ) ∗ )ݔ(݊݅ݏ + ܿݔ(ݏ) ∗ ܿ)ݔ(ݏ ∗ ܿݕ(ݏ − (ݕ))) <= ݎ ;
With ݎ is the angular radius of the query circle,
ݎ = ݀݅ݐݎܽ݁/݁ܿ݊ܽݐݏℎ ݏݑ݅݀ܽݎ
ݔ = ݔ − ݎ
ݔ௫ = ݔ + ݎ
ݕ = ݕ − ∆ݕ
ݕ௫ = ݕ + ∆ݕ
∆ݕ = arccos( ( cos()ݎ − sin(ݔ்) ∗ sin()ݔ ) / ( cos(ݔ்) ∗ cos()ݔ ) )
With ݔ் = arcsin(sin(/)ݔcos(.)ݎ More on positions’ angles calculation is found in [4].
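The bounding-box filter followed by the exact great-circle test above can be sketched in Python (a minimal illustration, not the databases' implementation; the cell names and coordinates are hypothetical and all angles are in radians):

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean earth radius, as used in the paper

def bounding_box(lat, lon, distance_km):
    """Bounding box (lat_min, lat_max, lon_min, lon_max), in radians,
    for a query circle of the given radius centred at (lat, lon)."""
    r = distance_km / EARTH_RADIUS_KM            # angular radius
    lat_t = math.asin(math.sin(lat) / math.cos(r))
    dlon = math.acos((math.cos(r) - math.sin(lat_t) * math.sin(lat))
                     / (math.cos(lat_t) * math.cos(lat)))
    return lat - r, lat + r, lon - dlon, lon + dlon

def great_circle_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in radians."""
    c = (math.sin(lat1) * math.sin(lat2)
         + math.cos(lat1) * math.cos(lat2) * math.cos(lon1 - lon2))
    return EARTH_RADIUS_KM * math.acos(max(-1.0, min(1.0, c)))  # clamp floats

def distance_query(cells, lat, lon, distance_km):
    """Names of cells within distance_km of (lat, lon); cells is a list
    of (name, lat, lon) tuples with coordinates in radians."""
    lat_min, lat_max, lon_min, lon_max = bounding_box(lat, lon, distance_km)
    return [name for name, clat, clon in cells
            if lat_min <= clat <= lat_max and lon_min <= clon <= lon_max
            and great_circle_km(lat, lon, clat, clon) <= distance_km]
```

The cheap rectangular test prunes most candidates via the index before the more expensive trigonometric distance is evaluated, which is exactly the purpose of the rewritten query (3).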
2.2.2. k Nearest Neighbor Query
Definition: A k-Nearest Neighbor (kNN) query returns the k points of interest that are closest to a given position (x, y) [5], [6]; the k results are ordered by proximity. A kNN query can be bounded by a distance, and in that case, if k is not indicated, it behaves like a distance query.
Figure 4 shows the inputs to a kNN query, where the kNN search is bounded within a distance.
Figure 4. kNN query
Example: find the five nearest cells from a mobile user's position. A typical query that selects the 5 nearest cells within 10 km is as follows:
• In Cassandra

SELECT cell_city FROM Mobility WHERE expr(inf_index, '{filter: {type: "bool", must: [{type: "geo_distance", field: "place", latitude: 1.3963, longitude: -0.6981, max_distance: "10km"}]}}') LIMIT 5

• In MongoDB

db.Mobility.find({ location: { $near: [-0.6981, 1.3963], $maxDistance: 0.10 } }, { cell_city: 1 }).limit(5)

• In PostgreSQL

SELECT cell_city FROM Mobility WHERE
  (x >= x_min AND x <= x_max) AND (y >= y_min AND y <= y_max) AND
  arccos(sin(x_q) * sin(x) + cos(x_q) * cos(x) * cos(y_q - y)) <= r
ORDER BY x ASC, y ASC LIMIT 5
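The kNN logic can be sketched in Python as "sort by great-circle distance, optionally filter by a distance bound, keep k" (an illustrative sketch under the same radians convention; names and coordinates are hypothetical):

```python
import math

R = 6371.0  # mean earth radius in km, as used in the paper

def gc_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in radians."""
    c = (math.sin(lat1) * math.sin(lat2)
         + math.cos(lat1) * math.cos(lat2) * math.cos(lon1 - lon2))
    return R * math.acos(max(-1.0, min(1.0, c)))  # clamp for float safety

def knn_query(cells, lat, lon, k, max_km=None):
    """Return the k cells nearest to (lat, lon), ordered by proximity.
    If max_km is given, results are also bounded by that distance, so the
    query degenerates to a distance query when k exceeds the matches."""
    ranked = sorted((gc_km(lat, lon, clat, clon), name)
                    for name, clat, clon in cells)
    if max_km is not None:
        ranked = [(d, n) for d, n in ranked if d <= max_km]
    return [name for _, name in ranked[:k]]
```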
2.2.3. Range Query
Definition: The range query returns all points of interest (e.g., gas stations) that are located within a certain space shape (polygon) [2].
Figure 5 shows inputs to range query to find cells that belong to a polygon.
Figure 5. Range query
Example: find cells within a space range that is indicated by a polygon of [longitude, latitude] coordinates around a mobile user's position.
A typical query that selects cells located within a geographical bounding box (polygon shape), e.g., a triangle formed from a mobile user position at [12.300398, 57.569256] together with ([11.300398, 56.569256], [12.300398, 58.569256]), is as follows:
• In Cassandra

SELECT cell_city FROM Mobility WHERE expr(inf_index, '{filter: {type: "bool", must: [{type: "geo_bbox", field: "place", min_latitude: 11.300398, max_latitude: 12.300398, min_longitude: 56.569256, max_longitude: 58.569256}]}}')

• In MongoDB

db.Mobility.find({ location: { $geoWithin: { $polygon: [ [12.300398, 57.569256], [11.300398, 56.569256], [12.300398, 58.569256] ] } } }, { Cell_city: 1 })

• In PostgreSQL

The following is a typical range query between two points P_min = (x_min, y_min) and P_max = (x_max, y_max), with x latitude and y longitude.
Example: P_min = (1.2393, -1.8184) and P_max = (1.5532, 0.4221).

SELECT cell_city FROM Mobility WHERE (x >= 1.2393 AND x <= 1.5532) AND
  (y >= -1.8184 AND y <= 0.4221)
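A rectangular range query is a pair of interval tests, and the polygon variant used by MongoDB's $geoWithin can be approximated with a standard ray-casting point-in-polygon test. A Python sketch (illustrative only; cell names and coordinates are hypothetical):

```python
def range_query(cells, lat_min, lat_max, lon_min, lon_max):
    """Cells whose (lat, lon) falls inside the rectangular space range;
    cells is a list of (name, lat, lon) tuples."""
    return [name for name, lat, lon in cells
            if lat_min <= lat <= lat_max and lon_min <= lon <= lon_max]

def in_polygon(point, polygon):
    """Ray-casting test: True if point = (x, y) lies inside the polygon,
    given as a list of (x, y) vertices."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray from the point cross this edge?
        if (y1 > y) != (y2 > y) and x < (x2 - x1) * (y - y1) / (y2 - y1) + x1:
            inside = not inside
    return inside
```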
2.2.4. Region Query
Generally, the trajectories of mobile users are independent of each other; however, they contain common behavioural traits, such as passing through a region at a certain regular period, e.g., passing through the shopping centre during lunch time.
Definition: identify the region that is most likely to be passed by a given user at a certain time, based on the many other regions relevant to that user [2]. In the context of this study, the knowledge about a region reveals which cell city many users mostly pass by; this cell city might contain points of interest such as stores or a highway junction.
Figure 6 shows the inputs to the region query to find the cell city that is the most visited at a certain time.
Figure 6. Region query
Example: find the cell city that is frequently passed by the same mobile users during a certain time every day for the entire week. A typical query that returns the cell city that is the most visited during the interval [12:10:00, 13:10:00] is as follows:
• In Cassandra

SELECT DISTINCT cell_city FROM Mobility WHERE Time >= '12:10:00' AND Time <= '13:10:00' GROUP BY cell_city ORDER BY "count" DESC

• In MongoDB

db.Mobility.find({ Time: { $gt: '12:10:00', $lt: '13:10:00' } }, { cell_city: 1 }).distinct().count()

• In PostgreSQL

SELECT cell_city, count(*) FROM Mobility WHERE t >= '12:10:00' AND t <= '13:10:00' GROUP BY cell_city ORDER BY "count" DESC
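The region query is essentially a filter on time followed by a group-and-count. The same aggregation can be sketched in Python (the record layout and the fixed-width 'HH:MM:SS' time format are assumptions for illustration):

```python
from collections import Counter

def region_query(records, start, end):
    """Return the cell city most visited in the time window [start, end].
    records is an iterable of (user_id, time, cell_city) tuples, with
    times as fixed-width 'HH:MM:SS' strings so string comparison matches
    chronological order."""
    counts = Counter(cell for _, t, cell in records if start <= t <= end)
    return counts.most_common(1)[0][0] if counts else None
```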
3. RELATED WORK
In [7], the authors propose an approach to, and an implementation of, spatio-temporal database systems. This approach treats time-changing geometries, whether they change in discrete or continuous steps, and can be used to tackle spatio-temporal data in other databases. We instead evaluate trajectory queries on existing general purpose databases, notably Cassandra, PostgreSQL, and MongoDB. In [8], the author describes requirements for databases that support location-based services for spatio-temporal data, and proposes a list of ten representative queries for stationary and moving reference objects. Some of the queries that are related to this study are given in Section 2.
In [9], Pfoser studied trajectories of moving point objects and explained three scenarios, namely constrained movement, unconstrained movement, and movement in networks. The different techniques used to index and query these scenarios define their respective processing performance. The author modelled a trajectory as triples (x, y, t); we use the same model in this study.
In [10], the authors introduced querying moving objects (trajectories) in SECONDO, a DBMS prototyping environment particularly geared towards extension by algebra modules for nonstandard applications. The querying is done using an SQL-like language. In our study, we query moving objects using SQL and Not Only SQL (NoSQL) query languages on top of different databases. Subsequently, the same authors provided a benchmark of range queries and nearest neighbor queries on the SECONDO DBMS for moving object data in Berlin; the moving object data was generated by computer simulation based on the map of Berlin [11]. This benchmark could be extended to other queries such as region queries, distance queries, and so on. In our study, we apply these queries to real world trajectory data, i.e., mobile users' trajectories from Telenor Sverige.
In [5], the authors introduced a new type of query, the Reverse Nearest Neighbor (RNN) query, which is the opposite of the Nearest Neighbor (NN) query. RNN can be useful in applications where moving objects agree to provide some kind of service to each other: whenever a service is needed it is requested from the nearest neighbor, and using RNN an object knows which objects it will serve in the future. RNN and NN are broadly represented by the distance query in our study. In [12], the authors studied an aggregate query language over GIS and non-spatial data stored in a data warehouse. In [13], the authors studied a k-nearest neighbor search algorithm for historical moving object trajectories; the k-nearest neighbor query is one of the queries considered in our study.
In [14], the authors present techniques for indexing and querying moving object trajectories. The data is represented in a three dimensional (3D) space, where two dimensions correspond to space and one dimension corresponds to time. We also represent our data in 3D as (x, y, t), where x, y represent space and t represents time.
Query processing on multiprocessors has been studied in [15], where the authors implemented an emulator of a parallel DBMS that uses a cluster of multiprocessors. This study differs from ours in the sense that we evaluate query processing on real physical hardware with existing general purpose databases. Query processing of spatio-temporal data on FPGAs and GPUs was studied in [16]. The authors present an FPGA and GPU implementation that processes complex queries in parallel; the study investigated neither the performance of various existing databases nor a distributed environment, whereas in our study we investigate query processing on various databases on top of different computational platforms, including a cluster. In [17], the authors conducted a survey on mining massive-scale spatio-temporal trajectory data based on parallel computing platforms such as GPU, MapReduce, and FPGA; again, existing general purpose databases were not evaluated. The authors of [18] presented a hardware implementation for converting geohash codes to and from longitude/latitude pairs for spatio-temporal data; the study shows that longitude and latitude coordinates are key to modelling spatio-temporal data. In our paper, we also use these coordinates for location-based querying.
4. DATABASE OVERVIEW AND CONFIGURATION
The development of technology produces big data that is both structured and unstructured. The presence of unstructured data stimulated the invention of new databases, since Relational Database Management Systems (RDBMS), which use the Structured Query Language (SQL) and deal with structured data only, are unable to handle unstructured data. A new data model, Not Only SQL (NoSQL), was introduced to also deal with unstructured data [19]. The main features of NoSQL follow the CAP theorem (Consistency, Availability, and Partition tolerance); the core idea of CAP is that a distributed system cannot meet these three distinct needs simultaneously. According to their data models, NoSQL databases can be key value based, column based, or document based. In this study we choose three open source databases that have diverse features: SQL (PostgreSQL) and NoSQL (Cassandra and MongoDB).
In the key value data model, a value corresponds to a key. Column based databases use tables as the data model, but the data is stored by column; each column is indexed, and queries are applied column by column. A document based database stores data in JSON or XML format; each document (similar to a row in an RDBMS) is indexed and has a key.
4.1. Cassandra
Apache Cassandra is an open-source NoSQL column based database. It is a top level Apache project born at Facebook and built on ideas from Amazon's Dynamo and Google's BigTable. It is a distributed database for managing large amounts of structured data across many commodity servers, while providing a highly available service with no single point of failure. In CAP terms, Cassandra provides availability and partition tolerance (AP) with eventual consistency. Cassandra offers continuous availability, linear scale performance, operational simplicity, and easy data distribution across multiple data centers and cloud availability zones, and it has a masterless ring architecture [23]. A keyspace is similar to a database in an RDBMS; inside a keyspace there are tables, which are similar to RDBMS tables, and their columns and rows are similar to those of RDBMS tables. The query language, Cassandra Query Language (CQL), is very similar to SQL [24].
4.2. MongoDB
MongoDB is an open-source NoSQL document database written in C++. MongoDB has databases; inside a database there are collections, which are like tables in an RDBMS; inside a collection there are documents, which are like tuples/rows in an RDBMS; and inside a document there are fields, which are like columns in an RDBMS [21], [22]. In CAP terms, MongoDB is consistent and partition tolerant.
4.3. PostgreSQL
PostgreSQL is an open source object RDBMS that, according to the CAP theorem, provides two features: availability, i.e., each user can always read and write, and consistency, i.e., all users have the same view of the data. PostgreSQL organises data in columns and rows [20, p. 3].
4.4. Single Node Installation
Two types of server are used:
1. Hardware type 1: Dell PowerEdge R320
Operating system: Ubuntu 14.04.3 LTS x86_64
RAM memory: 23 GB
Hard disk size: 279.4 GB
The processor (Intel(R) Xeon(R) CPU E5-2420 v2) has 12 cores; each core is hyperthreaded into 2 logical cores, which gives 24 virtual cores. These servers are exclusive, i.e., they only run our databases.
2. Hardware type 2: Fujitsu RX600S5
Operating system: Ubuntu 13.04 LTS x86_64
RAM memory: 1024 GB
The processor (4x Xeon X7550) has 32 cores; each core is hyperthreaded into 2 logical cores, which gives 64 virtual cores. At the time of the experiment this server was running some other work, i.e., it was not exclusive to our databases. This affects the execution time of our databases; however, trends such as the variability between queries across the databases are not affected, and the standard deviation of the execution time keeps the same trends.
4.5. Multiple Nodes Installation
The cluster is made up of 4 nodes; each node is of hardware type 1, and all nodes have identical features.
Cassandra partitions and replicates data across the 4 nodes in the cluster (see Figure 7). Since we are using a small cluster of four nodes, with all nodes belonging to the same rack and the same data center, the replication strategy is set to SimpleStrategy with four replicas across the cluster. SimpleStrategy places the first replica on a node that is determined by the partitioner; a partitioner determines how data, including replicas, is distributed across the nodes in the cluster. We configured the partitioner as Murmur3Partitioner, which provides faster hashing and improved performance, and the snitch, which informs Cassandra about the network topology, is configured as the simple snitch [27]. All the nodes in the cluster are peers, with one of the nodes configured as a seed; the seed bootstraps the gossip process for new nodes joining the cluster. Each node holds the same copy of the data as the others, i.e., the replication factor equals the number of nodes. In this study we use Cassandra 3.0.3. Since we have a single data center and no write activities (we only need to read the given data), we use consistency level one, i.e., the closest replica node for the given row is contacted to answer the query.
Cassandra does not natively support spatial indexing, but this can be added via Stratio's Cassandra Lucene Index, a plugin for Apache Cassandra that extends its index functionality to provide near real time search, in the manner of ElasticSearch or Solr, including full text search capabilities and free multivariable, geospatial, and bitemporal search. We use Stratio's Cassandra Lucene Index 3.0.4 [28].
Figure 7. Cassandra Structure
MongoDB partitions data across nodes, i.e., MongoDB scales horizontally by dividing and distributing data over multiple servers that are called shards. Each shard is an independent database, and collectively the shards make up a single logical database. Sharding reduces the number of operations each shard handles: each shard processes fewer operations as the cluster grows, so a cluster can increase capacity and throughput horizontally by adding nodes. Sharding also reduces the amount of data that each server needs to store: each shard stores less data as the cluster grows [26]. A sharded cluster contains shards, config servers, and mongos instances. We use three shards, each on its own node. In Figure 8 we see the config servers, which hold the metadata about the cluster, such as the shard location of the data; there must be three of them. There is also the mongos server, which serves as the routing service that processes queries throughout the cluster. Mongos is installed on its own node, whereas the config servers and shards are installed on the same nodes (1, 2, 3), as shown in Figure 8. Since we have three shards, each shard contains a third of the total data. MongoDB would achieve high availability if we replicated each shard on different nodes. In this study we install MongoDB 3.0.9. MongoDB has built-in spatial query functions.
Figure 8. MongoDB Structure
Figure 9. PostgreSQL Structure
PostgreSQL is installed on the cluster in master/slave replication mode (see Figure 9). The nodes serve each other in a pool using pgpool-II [25], which provides load balancing and data redundancy. In order to keep the data available, each slave holds a read-only copy of the data; there are three slaves, thus three replicas. In order to keep the data consistent, only the master can both read and write. The master and pgpool-II are installed on the same node. In this study we install PostgreSQL version 9.3.11. PostgreSQL does not have explicit spatial query functions; thus, we have to use mathematical functions in order to query the database using geographical coordinates.
4.6. Data Description
A mobility or location update is generated when a handset generates traffic, either downloading or uploading. Mobility is captured in five minute intervals and includes all cells visited during those five minutes. Mobility is indicated by the number of cells within the timeframe and the distance between those cells.
The data we use in this paper is collected every five minutes for an entire week in a medium sized city. We have a collection of 2,593,360 records for different users, and every record has eighteen attributes (18 x 2,593,360). Those attributes are: UserId, SiteId, Weekday, Time, ProfileId, SegmentId, SourceGSM, SourceUMTS, SourceLTE, Easting, Northing, Latitude decimal, Longitude decimal, Cell municipality, Cell county, Cell city, Cell postcode, and Cell address. This data is loaded into Cassandra and PostgreSQL without any transformation, whereas in MongoDB the coordinate attributes (latitude and longitude) were combined into a location array attribute in order to be able to use MongoDB's built-in spatial functions.
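The MongoDB-specific transformation can be sketched as follows (a minimal illustration; the attribute names follow the list above, and MongoDB's legacy coordinate pairs are ordered [longitude, latitude]):

```python
def to_mongo_doc(record):
    """Combine the separate coordinate attributes of a record (a dict)
    into a single 'location' array so that MongoDB's spatial index and
    operators such as $near can be used; other attributes are kept."""
    doc = {k: v for k, v in record.items()
           if k not in ("Latitude decimal", "Longitude decimal")}
    # MongoDB expects legacy coordinate pairs as [longitude, latitude].
    doc["location"] = [record["Longitude decimal"], record["Latitude decimal"]]
    return doc
```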
5. RESULTS AND DISCUSSION
Figures 10, 11, 12, and 13 show the execution time with respect to different numbers of nodes in the cluster; we present the results using a logarithmic scale. All the results are the average of ten runs of each query. More detailed data are given in Tables 1, 2, 3, 4, and 5 in the appendix; those tables show the execution time of the different queries on the Cassandra, MongoDB, and PostgreSQL databases on four, three, two, and one nodes of hardware type one, and on a single node of hardware type two.
Figure 10. Distance Query Execution Time
Figure 11. K Nearest Neighbor Query Execution Time
Figure 12. Range Query Execution Time
Figure 13. Region Query Execution Time
It is observed that Cassandra has the shortest execution time for range and region queries, particularly for the region query. The region query has one input, time; it does not involve spatial features or geographical shapes (e.g., sphere, near, within). It is clear that Cassandra outperforms MongoDB and PostgreSQL for such general purpose queries. For queries that contain geographical or specifically spatial features, MongoDB performs almost as well as Cassandra when the latter is indexed by the Stratio Lucene Index (see Figures 10, 11, 12). In Figure 13, MongoDB has a longer execution time than Cassandra and PostgreSQL for the region query; this is caused by the aggregation process, which seems to be slower in MongoDB.
In all queries, we observed that for Cassandra the first run of a query takes longer than the subsequent runs. Lucene makes heavy use of caching, so the first query is especially slow due to the cost of initializing the caches [26]. Thus, we disregard the first run of each query when measuring the performance. For MongoDB and PostgreSQL, the same query on the same hardware runs with almost the same execution time on every run. Spatial queries have the longest execution time in PostgreSQL; the reason is that we have to use mathematical functions to represent geographical locations, which involves several steps of calculation (see Figures 10, 11, 12).
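The measurement procedure described above (ten timed runs after one discarded warm-up run, reporting mean and standard deviation) can be sketched as follows, where run_query stands for any function that executes one of the queries:

```python
import statistics
import time

def measure(run_query, runs=10):
    """Average execution time and standard deviation over `runs` timed
    repetitions, after one discarded warm-up run (the first Cassandra
    run is unrepresentative because Lucene initializes its caches)."""
    run_query()                      # warm-up run, not measured
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        run_query()
        times.append(time.perf_counter() - t0)
    return statistics.mean(times), statistics.stdev(times)
```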
The scalability with an increasing number of nodes is significant for Cassandra and MongoDB for the range query. The reason is that the range query touches only a partition of the data, according to the range specification, hence the cache is relatively lightly loaded. The scalability is less noticeable for the other queries, which cover the whole data set and thus consume much more cache, slowing down execution. In terms of processing, PostgreSQL does not exploit the increase in the number of nodes, since the nodes are used for replication purposes in order to keep the database available. MongoDB distributes data across shards; in order to provide high availability, each shard would have to be replicated on its own server. In our case we have three shards, so a second copy of the whole data would require three more servers, six servers in total for two copies. For Cassandra, by contrast, we have a full copy of the data at each node, i.e., for a 4 node cluster we have 4 copies of the data. This feature makes Cassandra more attractive than MongoDB in cases where the number of servers is a constraint. Furthermore, if mongos fails, the whole database fails; the same holds for PostgreSQL, where if the master goes down the whole database can no longer operate, whereas for Cassandra, if any node goes down, the others keep working.
6. CONCLUSIONS
In this study, we evaluated the performance of trajectory queries on Cassandra, MongoDB, and PostgreSQL in multiprocessor and cluster environments. The evaluation was conducted on data collected from a telecommunication company. We observed that Cassandra performs much better than MongoDB and PostgreSQL for queries that do not contain special geographical features such as sphere shapes or near coordinates (for example, the region query, which involves only time as input). MongoDB natively has built-in functions for spatial queries, which speeds up query response time. In order to speed up Cassandra when handling spatial queries, we incorporated Stratio's Cassandra Lucene Index, which holds spatial indexes; this gives the same performance as MongoDB, and even better for some queries. MongoDB seems to handle aggregate queries more slowly than Cassandra and PostgreSQL (e.g., the region query involves two steps of aggregation).
Since we are using open source databases, the choice of which database to use depends mainly on needs and preferences; for instance, MongoDB is well documented compared to Cassandra. MongoDB stores data in a JSON-like format that is widely used on the internet, so if one would like to work with data traffic over the internet, MongoDB is a good choice. From a developer perspective, it is easier to implement and integrate plugins in Cassandra than in MongoDB. Cassandra seems to be updated every couple of weeks, and these tick-tock releases are not immediately compatible with some plugins; as was the case in this paper, we had to use Cassandra 3.0.3 in order to be able to use Stratio's Cassandra Lucene Index 3.0.4, while at the time of writing the current release is 3.4. One would choose PostgreSQL if relational database features are important for handling the data.
In terms of servers, if there is a constraint on the number of servers, Cassandra is preferable, since it uses fewer servers than MongoDB would require to provide the same features.
APPENDIX
In Tables 1, 2, 3, 4, and 5, E. time is the average execution time of ten runs and Stdev is the standard deviation.
Table 1. Query processing time (in seconds) on 4 nodes installation (Dell PowerEdge R320)

Query types        Cassandra           MongoDB              PostgreSQL
                   E. time   Stdev     E. time   Stdev      E. time   Stdev
Distance Q         0.036     0.077     0.024     0.011      0.79      0.0005
K-n Neighbors Q    0.029     0.005     0.024     0.0107     0.881     0.015
Range Q            0.008     0.008     0.021     1.83E-18   0.621     0.001
Region Q           0.045     0.011     1.562     0.030      1.221     0.001
Table 2. Query processing time (in seconds) on 3 nodes installation (Dell PowerEdge R320)

Query types        Cassandra           MongoDB              PostgreSQL
                   E. time   Stdev     E. time   Stdev      E. time   Stdev
Distance Q         0.073     0.013     0.039     0.010      0.666     0.0005
K-n Neighbors Q    0.018     0.006     0.039     7.31E-18   0.886     0.015
Range Q            0.0130    0.024     0.04      0.011      0.666     0.001
Region Q           0.0515    0.021     1.593     0.030      1.222     0.001
Table 3. Query processing time (in seconds) on 2 nodes installation (Dell PowerEdge R320)

Query types        Cassandra           MongoDB              PostgreSQL
                   E. time   Stdev     E. time   Stdev      E. time   Stdev
Distance Q         0.060     0.008     0.059     0.011      0.766     0.0008
K-n Neighbors Q    0.031     0.025     0.059     0.017      0.822     0.0007
Range Q            0.0147    0.002     0.045     7.31E-18   0.611     0.0006
Region Q           0.0518    0.024     1.633     0.030      1.225     0.001
Table 4. Query processing time (in seconds) on a single node installation (Dell PowerEdge R320)

Query types        Cassandra           MongoDB               PostgreSQL
                   E. time   Stdev     E. time   Stdev       E. time   Stdev
Distance Q         0.017     0.076     0.001     2.2857E-19  0.789     0.0008
K-n Neighbors Q    0.012     0.007     0.001     2.29E-19    0.882     0.0007
Range Q            0.028     0.023     0.048     0.000422    0.621     0.0006
Region Q           0.054     0.019     2.526     0.066       1.225     0.0016
Table 5. Query processing time (in seconds) on a single node installation (Fujitsu RX600S5)

Query types        Cassandra           MongoDB              PostgreSQL
                   E. time   Stdev     E. time   Stdev      E. time   Stdev
Distance Q         1.121     0.001     1.579     0.298      2.243     0.181
K-n Neighbors Q    1.012     0.012     1.432     0.089      2.363     0.001
Range Q            1.432     0.001     1.654     0.068      2.154     0.001
Region Q           2.132     0.002     4.260     0.257      3.268     0.0009
ACKNOWLEDGEMENTS
This work is part of the research project "Scalable resource-efficient systems for big data
analytics" funded by the Knowledge Foundation (grant: 20140032) in Sweden. We also thank
HPI-FSOC, and Telenor Sverige.
REFERENCES
[1] S. Spaccapietra, C. Parent, M. L. Damiani, J. A. de Macedo, F. Porto, and C. Vangenot, “A
conceptual view on trajectories,” Data Knowl. Eng., vol. 65, no. 1, pp. 126–146, 2008.
[2] Y. Zheng and X. Zhou, Computing with spatial trajectories. Springer Science & Business Media,
2011.
[3] N. Pelekis and Y. Theodoridis, Mobility data management and exploration. Springer, 2014.
[4] Jan Philip Matuschek, "Finding Points Within a Distance of a Latitude/Longitude Using Bounding
Coordinates." [Online]. Available:
http://janmatuschek.de/LatitudeLongitudeBoundingCoordinates#SQLQueries. [Accessed: 07-Mar-
2016].
[5] R. Benetis, C. S. Jensen, G. Karčiauskas, and S. Šaltenis, “Nearest neighbor and reverse nearest
neighbor queries for moving objects,” in Database Engineering and Applications Symposium, 2002.
Proceedings. International, 2002, pp. 44–53.
[6] E. Frentzos, K. Gratsias, N. Pelekis, and Y. Theodoridis, “Nearest neighbor search on moving object
trajectories,” in Advances in Spatial and Temporal Databases, Springer, 2005, pp. 328–345.
[7] M. Erwig, R. H. Gu, M. Schneider, M. Vazirgiannis, and others, “Spatio-temporal data types: An
approach to modeling and querying moving objects in databases,” GeoInformatica, vol. 3, no. 3, pp.
269–296, 1999.
[8] Y. Theodoridis, “Ten benchmark database queries for location-based services,” Comput. J., vol. 46,
no. 6, pp. 713–725, 2003.
[9] D. Pfoser, “Indexing the trajectories of moving objects,” IEEE Data Eng Bull, vol. 25, no. 2, pp. 3–9,
2002.
[10] V. T. De Almeida, R. H. Güting, and T. Behr, "Querying moving objects in SECONDO," 2006,
p. 47.
[11] C. Düntgen, T. Behr, and R. H. Güting, “BerlinMOD: a benchmark for moving object databases,”
VLDB J., vol. 18, no. 6, pp. 1335–1368, 2009.
[12] L. I. Gómez, B. Kuijpers, and A. A. Vaisman, “Aggregation languages for moving object and places
of interest,” in Proceedings of the 2008 ACM symposium on Applied computing, 2008, pp. 857–862.
[13] Y.-J. Gao, C. Li, G.-C. Chen, L. Chen, X.-T. Jiang, and C. Chen, “Efficient k-nearest-neighbor search
algorithms for historical moving object trajectories,” J. Comput. Sci. Technol., vol. 22, no. 2, pp.
232–244, 2007.
[14] D. Pfoser, C. S. Jensen, Y. Theodoridis, and others, “Novel approaches to the indexing of moving
object trajectories,” in Proceedings of VLDB, 2000, pp. 395–406.
[15] K. Y. Besedin and P. S. Kostenetskiy, “Simulating of query processing on multiprocessor database
systems with modern coprocessors,” in Information and Communication Technology, Electronics and
Microelectronics (MIPRO), 2014 37th International Convention on, 2014, pp. 1614–1616.
[16] R. Moussalli, I. Absalyamov, M. R. Vieira, W. Najjar, and V. J. Tsotras, “High performance FPGA
and GPU complex pattern matching over spatio-temporal streams,” GeoInformatica, vol. 19, no. 2,
pp. 405–434, Aug. 2014.
[17] P. Huang and B. Yuan, “Mining Massive-Scale Spatiotemporal Trajectories in Parallel: A Survey,” in
Trends and Applications in Knowledge Discovery and Data Mining, Springer, 2015, pp. 41–52.
[18] R. Moussalli, M. Srivatsa, and S. Asaad, "Fast and Flexible Conversion of Geohash Codes to and
from Latitude/Longitude Coordinates," in Field-Programmable Custom Computing Machines
(FCCM), 2015 IEEE 23rd Annual International Symposium on, 2015, pp. 179–186.
[19] J. Han, E. Haihong, G. Le, and J. Du, "Survey on NoSQL database," in Pervasive computing and
applications (ICPCA), 2011 6th international conference on, 2011, pp. 363–366.
[20] "What is Apache Cassandra?," Planet Cassandra, 18-Jun-2015. [Online]. Available:
http://www.planetcassandra.org/what-is-apache-cassandra/. [Accessed: 23-Feb-2016].
[21] "CQL." [Online]. Available:
http://docs.datastax.com/en//cassandra/2.0/cassandra/cql.html. [Accessed: 23-Feb-2016].
[22] tutorialspoint.com, "MongoDB Overview," www.tutorialspoint.com. [Online]. Available:
http://www.tutorialspoint.com/mongodb/mongodb_overview.htm. [Accessed: 23-Feb-2016].
[23] "MongoDB for GIANT Ideas," MongoDB. [Online]. Available: https://www.mongodb.com/.
[Accessed: 23-Feb-2016].
[24] J. Worsley and J. D. Drake, Practical PostgreSQL. O'Reilly Media, Inc., 2002.
[25] "Data distribution and replication." [Online]. Available:
https://docs.datastax.com/en/cassandra/2.0/cassandra/architecture/architectureDataDistributeAbout_c.
html. [Accessed: 25-Feb-2016].
[26] "Stratio/cassandra-lucene-index," GitHub. [Online]. Available: https://github.com/Stratio/cassandra-
lucene-index. [Accessed: 23-Mar-2016].
[27] "Sharding Introduction — MongoDB Manual 3.2." [Online]. Available:
https://docs.mongodb.org/manual/core/sharding-introduction/. [Accessed: 24-Feb-2016].
[28] "Distributed PostgreSQL." [Online]. Available:
http://www.postgresql.org/docs/9.1/static/high-availability.html.
AUTHORS
Christine Niyizamwiyitira is currently a PhD student in Computer Science at Blekinge
Institute of Technology (BTH) in Sweden, in the Computer Science and Engineering
Department. She completed her masters in 2010 in computer engineering at Korea
University of Technology (KUT) in South Korea. She works at the University of Rwanda
as an assistant lecturer. Her research interests include real time systems, cloud
computing, high performance computing, database performance, and voice based
applications. Her current research focuses on scheduling of real time systems on
virtual machines (uniprocessor & multiprocessor) and big data processing.
Lars Lundberg is a professor in Computer Systems Engineering at the Department of
Computer Science and Engineering at Blekinge Institute of Technology in Sweden. He
has an M.Sc. in Computer Science from Linköping University (1986) and a Ph.D. in
Computer Engineering from Lund University (1993). His research interests include
parallel and cluster computing, real-time systems, and software engineering. Professor
Lundberg's current work focuses on performance and availability aspects.
Programmable Custom Computing Machines
J. Han, E. Haihong, G. Le, and J. Du, “Survey on NoSQL database,” in Pervasive computing and
2015. [Online]. Available:
2016].
tutorialspoint.com, “MongoDB Overview,” www.tutorialspoint.com. [Online]. Available:
2016].
0/cassandra/architecture/architectureDataDistributeAbout_c.
index,” GitHub. [Online]. Available: https://github.com/Stratio/cassandra-
introduction.txt. [Online].
introduction/. [Accessed: 24-Feb-2016].
availability.html.”