The document summarizes research on performing spatio-textual similarity joins. It discusses:
1) Developing a filter-and-refine framework to efficiently find similar object pairs from two datasets using signatures.
2) Generating spatial and textual signatures for objects and building inverted indexes on the signatures to find candidate pairs.
3) Refining the candidate pairs to obtain the final result pairs that satisfy spatial and textual similarity thresholds.
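To make the framework concrete, here is a minimal Python sketch of the filter-and-refine idea, assuming grid-cell spatial signatures, token-set textual signatures, Euclidean distance, and Jaccard similarity; the summarized paper's exact signature scheme is not specified here, so all names and parameters are illustrative:

```python
import math
from collections import defaultdict

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def st_join(R, S, eps, tau, cell=1.0):
    """Filter-and-refine spatio-textual join (illustrative sketch).
    R, S: lists of (x, y, token_set); eps: distance threshold;
    tau: Jaccard threshold; cell: grid cell size for spatial signatures."""
    # Filter: inverted index over S keyed by (grid cell, token) signature pairs.
    index = defaultdict(set)
    for j, (x, y, toks) in enumerate(S):
        c = (int(x // cell), int(y // cell))
        for t in toks:
            index[(c, t)].add(j)
    result = []
    for i, (x, y, toks) in enumerate(R):
        cx, cy = int(x // cell), int(y // cell)
        cand = set()
        # Probe the object's cell and its neighbors (covers distance <= eps
        # whenever cell >= eps) for objects sharing at least one token.
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for t in toks:
                    cand |= index.get(((cx + dx, cy + dy), t), set())
        # Refine: verify both thresholds exactly on each candidate pair.
        for j in cand:
            sx, sy, stoks = S[j]
            if math.hypot(x - sx, y - sy) <= eps and jaccard(toks, stoks) >= tau:
                result.append((i, j))
    return result

pairs = st_join([(0.2, 0.3, {"cafe", "wifi"})], [(0.4, 0.5, {"cafe", "bar"})],
                eps=1.0, tau=0.3)
```

Candidates must share at least one token, which is safe for any tau > 0, and the 3x3 cell probe is safe whenever the cell size is at least eps; both pruning rules therefore never lose a true result pair.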
Ranking spatial data by quality preferences (ppt), by Saurav Kumar
A spatial preference query ranks objects based on the qualities of features in their spatial neighborhood. For example, using a real estate agency database of flats for lease, a customer may want to rank the flats with respect to the appropriateness of their location, defined after aggregating the qualities of other features (e.g., restaurants, cafes, hospitals, markets) within their spatial neighborhood. Such a neighborhood concept can be specified by the user via different functions: it can be an explicit circular region within a given distance from the flat, or, more intuitively, higher weights can be assigned to features based on their proximity to the flat. In this paper, we formally define spatial preference queries and propose appropriate indexing techniques and search algorithms for them. Extensive evaluation of our methods on both real and synthetic data reveals that an optimized branch-and-bound solution is efficient and robust with respect to different parameters.
Ranking Preferences to Data by Using R-Trees, by IOSR Journals
This document discusses algorithms for efficiently processing top-k spatial preference queries in databases containing spatial and non-spatial data. It defines top-k spatial preference queries as ranking objects based on the qualities of features in their neighborhoods. It presents the branch-and-bound and feature join algorithms for computing the top-k results without calculating scores for all objects. It also discusses using R-trees to index the spatial and feature data to accelerate query processing.
Research Inventy: International Journal of Engineering and Science, by inventy
Research Inventy: International Journal of Engineering and Science is published by a group of young academic and industrial researchers, with 12 issues per year. It is an open access journal, available both online and in print, that provides rapid monthly publication of articles in all areas of the subject, such as civil, mechanical, chemical, electronic and computer engineering, as well as production and information technology. The Journal welcomes the submission of manuscripts that meet the general criteria of significance and scientific excellence. Papers are published within 20 days after acceptance, and the peer review process takes only 7 days. All articles published in Research Inventy are peer-reviewed.
Aggregation of data by using top k spatial query preferences, by Alexander Decker
This document summarizes a research paper on efficient techniques for processing top-k spatial preference queries in a database. It discusses how such queries allow users to rank spatial objects based on the aggregated qualities of nearby features. It proposes two algorithms - branch-and-bound and feature join - to efficiently process these queries by pruning the search space. The paper also studies extensions of the algorithms for different aggregate functions and for queries using an influence score to weight nearby features.
Analysis of different similarity measures: SimRank, by Abhishek Mungoli
SimRank exploits object-to-object relationships to measure the similarity between two objects.
We used it in our project to find similar research papers in the DBLP dataset, which provides a comprehensive list of research papers in the computer science domain.
SimRank is a generic approach, and its basic idea can be applied to other domains of interest as well.
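As a hedged illustration of the basic recursion, the sketch below implements plain SimRank, s(a,b) = C/(|I(a)||I(b)|) Σ s(i,j) over in-neighbor pairs, with the usual decay C = 0.8, on a hypothetical toy citation graph (not the project's DBLP pipeline):

```python
from itertools import product

def simrank(in_nbrs, C=0.8, iters=10):
    """Basic SimRank iteration: two objects are similar if they are
    referenced by similar objects. in_nbrs maps node -> list of in-neighbors."""
    nodes = list(in_nbrs)
    sim = {(a, b): 1.0 if a == b else 0.0 for a, b in product(nodes, nodes)}
    for _ in range(iters):
        new = {}
        for a, b in product(nodes, nodes):
            if a == b:
                new[(a, b)] = 1.0
            elif in_nbrs[a] and in_nbrs[b]:
                total = sum(sim[(i, j)] for i in in_nbrs[a] for j in in_nbrs[b])
                new[(a, b)] = C * total / (len(in_nbrs[a]) * len(in_nbrs[b]))
            else:
                new[(a, b)] = 0.0
        sim = new
    return sim

# Toy citation graph: papers p1 and p2 are both cited by p3 and p4.
g = {"p1": ["p3", "p4"], "p2": ["p3", "p4"], "p3": [], "p4": []}
print(simrank(g)[("p1", "p2")])  # 0.4: similar because their citers coincide
```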
Spatial databases are used to store geographic information. Queries on such databases include range queries, nearest neighbor queries, and spatial joins. Many indexing techniques are used for faster data retrieval, among which R-trees are particularly efficient; others include quad-trees and grid files. Spatial data is widely used in GIS applications.
This document discusses clustering of uncertain data objects. It first provides background on clustering uncertain data and the challenges involved, then reviews existing approaches, including soft classifiers and probabilistic databases. The document proposes combining k-means clustering with Voronoi diagrams and indexing techniques to reduce execution time and improve the quality of clustering for uncertain datasets. It concludes that combining clustering with indexing approaches can better handle the challenges of uncertain data clustering.
Safeguarding Abila through Multiple Data Perspectives, by Parang Saraf
This document describes a visual analytics system developed to analyze multiple datasets related to the disappearance of employees from an organization. The system allows analysis of unstructured news articles, email headers, GPS tracking data, financial transactions, and streaming microblog data through different interactive interfaces. It was created as a solution to the 2014 VAST Grand Challenge to help law enforcement investigate the case by combining insights from all the different data sources.
This document presents a new link-based approach for improving categorical data clustering through cluster ensembles. It transforms categorical data matrices into numerical representations to apply graph partitioning techniques. The approach uses a Weighted Triple-Quality similarity algorithm to construct the representation and measure cluster similarity. An experimental evaluation shows the link-based method outperforms traditional categorical clustering algorithms and benchmark ensemble techniques on several real datasets in terms of accuracy, normalized mutual information, and adjusted Rand index.
International Journal of Engineering Research and Development, by IJERD Editor
Electrical, Electronics and Computer Engineering,
Information Engineering and Technology,
Mechanical, Industrial and Manufacturing Engineering,
Automation and Mechatronics Engineering,
Material and Chemical Engineering,
Civil and Architecture Engineering,
Biotechnology and Bio Engineering,
Environmental Engineering,
Petroleum and Mining Engineering,
Marine and Agriculture Engineering,
Aerospace Engineering.
The document proposes using an A* algorithm along with a relational framework to more efficiently calculate shortest paths in graph data stored in a relational database. The system initializes a source node, then iteratively selects the next frontier node and expands paths until the target node is found. Experimental results on road network data show the proposed approach has faster execution time than bidirectional search, especially on larger datasets containing over 500,000 records. The approach requires more memory than bidirectional search but is more efficient than other shortest path algorithms.
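The summary does not reproduce the paper's code, but its core search loop is standard A*. A compact sketch, assuming node coordinates and a Euclidean heuristic, with the relational storage layer omitted:

```python
import heapq, math

def a_star(adj, coords, src, dst):
    """A* shortest path. adj: node -> list of (neighbor, edge_weight);
    coords: node -> (x, y) used by the admissible Euclidean heuristic."""
    def h(n):
        (x1, y1), (x2, y2) = coords[n], coords[dst]
        return math.hypot(x1 - x2, y1 - y2)
    g = {src: 0.0}                       # best known cost from src
    frontier = [(h(src), src, [src])]    # priority queue ordered by g + h
    visited = set()
    while frontier:
        _, node, path = heapq.heappop(frontier)
        if node == dst:
            return g[node], path
        if node in visited:
            continue
        visited.add(node)
        for nbr, w in adj.get(node, []):
            ng = g[node] + w
            if ng < g.get(nbr, math.inf):
                g[nbr] = ng
                heapq.heappush(frontier, (ng + h(nbr), nbr, path + [nbr]))
    return math.inf, []

adj = {"a": [("b", 1.0)], "b": [("c", 1.0)], "c": []}
coords = {"a": (0, 0), "b": (1, 0), "c": (2, 0)}
print(a_star(adj, coords, "a", "c"))  # (2.0, ['a', 'b', 'c'])
```

The heuristic steers expansion toward the target, which is what gives A* its edge over bidirectional search on large road networks, at the cost of the extra frontier bookkeeping the summary mentions.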
The document proposes an extension to the M-tree family of index structures called M*-tree. M*-tree improves upon M-tree by maintaining a nearest-neighbor graph within each node. The nearest-neighbor graph stores, for each entry in a node, a reference and distance to its nearest neighbor among the other entries in that node. This additional structure allows for more efficient filtering of non-relevant subtrees during search queries through the use of "sacrifice pivots". The experiments showed that M*-tree can perform searches significantly faster than M-tree while keeping construction costs low.
K Means Clustering Algorithm for Partitioning Data Sets Evaluated From Horizo..., by IOSR Journals
This document discusses using k-means clustering to partition datasets that have been generated through horizontal aggregation of data from multiple database tables. It provides background on horizontal aggregation techniques like pivot tables and describes the k-means clustering algorithm. The algorithm is applied as an example to cluster a sample dataset into two groups. The document concludes that k-means clustering can effectively partition large datasets produced by horizontal aggregations to facilitate further data mining analysis.
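For reference, a minimal plain-Python version of the k-means (Lloyd's) loop applied to numeric rows such as those a horizontal aggregation would produce; the initialization and sample data are illustrative, not the document's example:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Lloyd's k-means: assign each point to its nearest centroid,
    then recompute each centroid as the mean of its cluster."""
    rnd = random.Random(seed)
    centroids = rnd.sample(points, k)  # random initial centroids (assumption)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        centroids = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids, clusters

cents, cls = kmeans([(1, 1), (1, 2), (8, 8), (9, 8)], k=2)
```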
Scalable Keyword Cover Search using Keyword NNE and Inverted IndexingIRJET Journal
This document discusses scalable keyword cover search using keyword nearest neighbor expansion (keyword-NNE) and inverted indexing. It proposes a more efficient algorithm called keyword-NNE to address the performance issues of existing baseline algorithms for closest keyword search as the number of query keywords increases. Keyword-NNE significantly reduces the number of candidate keyword covers generated compared to the baseline. The algorithm and inverted indexing techniques are analyzed and shown to outperform alternatives through experiments on real datasets.
The International Journal of Engineering & Science is aimed at providing a platform for researchers, engineers, scientists, or educators to publish their original research results, to exchange new ideas, to disseminate information in innovative designs, engineering experiences and technological skills. It is also the Journal's objective to promote engineering and technology education. All papers submitted to the Journal will be blind peer-reviewed. Only original articles will be published.
The papers for publication in The International Journal of Engineering & Science are selected through rigorous peer review to ensure originality, timeliness, relevance, and readability.
- The document introduces the DatumTron In-Memory Graph Database, which represents data as a directed acyclic graph (DAG) of "datums" connected by "is" links. This allows data to be manipulated in generic ways and supports concepts like inheritance, time, code, and inference natively in the database.
- Data is represented as "katums" which are datums with an attached object, linked together to form the Datum Universe graph. Primitive data types are represented as objects attached to katums.
- The DatumTron API allows adding, removing, and querying data in the graph through functions like get, is, isnot, and find to create, link, unlink, and query datums.
TASK-DECOMPOSITION BASED ANOMALY DETECTION OF MASSIVE AND HIGH-VOLATILITY SES..., by ijdpsjournal
This document summarizes a research paper that presents a task-decomposition based anomaly detection system for analyzing massive and highly volatile session data from the Science Information Network (SINET), Japan's academic backbone network. The system uses a master-worker design with dynamic task scheduling to process over 1 billion sessions per day. It discriminates incoming and outgoing traffic using GPU parallelization and generates histograms of traffic volumes over time. Long short-term memory (LSTM) neural networks detect anomalies like spikes in incoming traffic volumes. The experiment analyzed SINET data from February 27 to March 8, 2021, detecting some anomalies while processing 500-650 gigabytes of daily session data.
Clustering, also known as data segmentation, aims to partition a data set into groups (clusters) according to similarity. Cluster analysis has been extensively studied, and many algorithms exist for different types of clustering. These classical algorithms cannot be applied to big data because of its distinct features; applying traditional techniques to large unstructured data is a challenge. This study proposes a hybrid model to cluster big data using the well-known traditional K-means clustering algorithm. The proposed model consists of three phases: a Mapper phase, a Clustering phase, and a Reduce phase. The first phase uses a map-reduce algorithm to split big data into small datasets; the second phase runs the traditional K-means algorithm on each of the split small datasets; and the last phase produces the overall clusters for the complete data set. Two functions, Mode and Fuzzy Gaussian, were implemented and compared in the last phase to determine the more suitable one. The experimental study used four benchmark big data sets: Covtype, Covtype-2, Poker, and Poker-2. The results demonstrate the efficiency of the proposed model in clustering big data with the traditional K-means algorithm, and the experiments show that the Fuzzy Gaussian function produces more accurate results than the Mode function.
The document describes the PM-tree, a new metric access method that combines the M-tree with pivot-based approaches to improve efficiency of similarity search in multimedia databases. The PM-tree enhances M-tree routing and ground entries by including pivot-based information like hyper-ring regions defined by pivot objects and distances. This reduces the volume of metric regions described by entries, tightly bounding indexed objects and improving retrieval performance. Algorithms for building and querying the PM-tree are presented, showing how pivot distances are used to prune irrelevant regions during search.
Relaxing global-as-view in mediated data integration from linked data, by Alessandro Adamou
- Mediated data integration systems present data from multiple sources in a unified view using mappings between a global schema and local source schemas (GAV or LAV mappings)
- In GAV systems, adding new data sources requires defining new mappings, which can be computationally expensive
- The authors propose using Linked Data principles to allow for a more gradual, "pay-as-you-go" approach where the global schema and mappings emerge iteratively through intermediate schemas and endomappings
- They demonstrate this approach on a real-world urban open data integration project that queries multiple data sources accessible via SPARQL or custom APIs
A Novel Approach for Clustering Big Data based on MapReduce, by IJECEIAES
Clustering is one of the most important applications of data mining and has attracted the attention of researchers in statistics and machine learning. It is used in many applications such as information retrieval, image processing, and social network analytics, and it helps the user understand the similarity and dissimilarity between objects. Cluster analysis lets users understand complex and large data sets more clearly. Various types of clustering algorithms have been analyzed by researchers. K-means is the most popular partitioning-based algorithm, as it provides good results through accurate calculation on numerical data; however, K-means works well for numerical data only, while big data is a combination of numerical and categorical data. The K-prototype algorithm handles numerical as well as categorical data by combining the distances calculated from the numeric and categorical parts. With the growth of data from social networking websites, business transactions, scientific computation, etc., there are vast collections of structured, semi-structured, and unstructured data, so K-prototype needs to be optimized to analyze these varieties of data efficiently. In this work, the K-prototype algorithm is implemented on MapReduce. Experiments show that K-prototype on MapReduce gives better performance on multiple nodes than on a single node, with CPU execution time and speedup used as evaluation metrics. An intelligent splitter is also proposed, which splits mixed big data into numerical and categorical parts. Comparison with traditional algorithms shows that the proposed algorithm works better for large-scale data.
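The heart of K-prototypes is its mixed distance: squared Euclidean on the numeric attributes plus a weighted mismatch count on the categorical ones. A small sketch under that standard formulation; the gamma weight and the sample record layout are assumptions:

```python
def kproto_distance(a, b, num_idx, cat_idx, gamma=1.0):
    """K-prototypes mixed distance between records a and b.
    num_idx / cat_idx: positions of numeric / categorical attributes;
    gamma weighs the categorical mismatch term against the numeric term."""
    numeric = sum((a[i] - b[i]) ** 2 for i in num_idx)       # squared Euclidean
    categorical = sum(1 for i in cat_idx if a[i] != b[i])    # simple mismatch count
    return numeric + gamma * categorical

# Two hypothetical customer records: (age, income, city, segment)
d = kproto_distance((34, 52.0, "Oslo", "retail"),
                    (29, 48.5, "Oslo", "corporate"),
                    num_idx=(0, 1), cat_idx=(2, 3), gamma=2.0)
```

An "intelligent splitter" in this picture corresponds to deriving num_idx and cat_idx automatically from the data, so the two parts of the distance can be computed on the right columns.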
This document proposes an efficient approach for processing subgraph matching queries with set similarity (SMS2 queries) in large graph databases. The approach uses a "filter-and-refine" framework with offline indexing and online query processing. In the filtering phase, it builds an inverted lattice index of frequent element set patterns and encodes vertices as signatures. It then applies set similarity and structure-based pruning techniques. In the refinement phase, it uses a dominating set-based subgraph matching algorithm to find matching subgraphs guided by a dominating set selection method. Experimental results show the proposed approach outperforms state-of-the-art methods by an order of magnitude.
Performance Evaluation of Trajectory Queries on Multiprocessor and Cluster, by csandit
In this study, we evaluate the performance of trajectory queries handled by Cassandra, MongoDB, and PostgreSQL. The evaluation is conducted on a multiprocessor and a cluster. Telecommunication companies collect a lot of data from their mobile users, and these data must be analysed to support business decisions such as infrastructure planning. The optimal choice of hardware platform and database can differ from one query to another. We use data collected by Telenor Sverige, a telecommunication company that operates in Sweden; the data are collected every five minutes for an entire week in a medium-sized city. The execution time results show that Cassandra performs much better than MongoDB and PostgreSQL for queries that do not have spatial features. Stratio's Cassandra Lucene index incorporates a geospatial index into Cassandra, making Cassandra perform similarly to MongoDB on spatial queries. Across four use cases, namely distance query, k-nearest neighbor query, range query, and region query, Cassandra performs much better than MongoDB and PostgreSQL for two of them, range query and region query, and its scalability is also good for these two use cases.
A new clustering approach for anomaly intrusion detection, by IJDKP
Recent advances in technology have made our work easier compared to earlier times. Computer networks are growing day by day, but the security of computers and networks has always been a major concern for organizations, from small to large enterprises. Organizations are aware of the possible threats and attacks and prepare for them, yet attackers still succeed through loopholes.
Intrusion detection is therefore a major field of research, and researchers are trying to find new algorithms for detecting intrusions. Clustering techniques from data mining are a promising avenue for detecting possible intrusions and attacks. This paper presents a new clustering approach for anomaly intrusion detection based on the K-medoids method and certain modifications of it. The proposed algorithm achieves a high detection rate and overcomes the disadvantages of the K-means algorithm.
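For intuition, a bare-bones PAM-style k-medoids loop is sketched below; the paper's specific modifications are not reproduced. Unlike k-means, the cluster centers are always actual data points, which is what makes the method more robust to the outliers typical of intrusion data:

```python
def kmedoids(points, k, dist, iters=20):
    """PAM-style k-medoids: assign points to the nearest medoid, then move
    each medoid to the member of its cluster with minimal total distance."""
    medoids = list(points[:k])  # simple deterministic initialization (assumption)
    for _ in range(iters):
        clusters = {m: [] for m in medoids}
        for p in points:
            clusters[min(medoids, key=lambda m: dist(p, m))].append(p)
        new = [min(cl, key=lambda c: sum(dist(c, q) for q in cl)) if cl else m
               for m, cl in clusters.items()]
        if new == medoids:  # converged: no medoid moved
            break
        medoids = new
    return medoids, clusters

euclid2 = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
meds, cls = kmedoids([(1, 1), (2, 1), (9, 9), (8, 9)], 2, euclid2)
```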
The efficient and effective monitoring of mobile networks is vital given the number of users who rely on such networks and their importance. This paper presents a monitoring scheme for mobile networks based on rules and decision tree data mining classifiers to improve fault detection and handling; the goal is to obtain optimisation rules that improve anomaly detection. In addition, a monitoring scheme that relies on Bayesian classifiers was implemented for fault isolation and localisation. The data mining techniques described in this paper are intended to allow a system to be trained to learn network fault rules. The results of the tests that were conducted support the conclusion that the rules were highly effective in improving network troubleshooting.
This paper addresses the challenges of mining frequent items over streaming data under a variable window size and low memory. To detect points of context change in the streaming transactions, we developed a two-level window structure that adjusts the window size on the fly, controlling heterogeneity and ensuring homogeneity among the transactions added to the window. To minimize memory utilization and computational cost and to improve scalability, the design allows the coverage (support) to be fixed at the window level. We introduce an incremental method for mining frequent item-sets from the window together with a context variation analysis approach; the complete technique is named Mining Frequent Item-sets using Variable Window Size fixed by Context Variation Analysis (MFI-VWSCVA). There are clear boundaries between frequent and infrequent item-sets in specific item-sets. In this design, changes in window size represent conceptual drift in the information stream; in other words, when the window size cannot be set effectively, item-sets tend to become infrequent. The experiments we executed and documented show that the designed algorithm is considerably more efficient than existing ones.
Prediction of quality features in Iberian ham by applying data mining on data..., by IJDKP
This paper aims to predict quality features of Iberian hams by using non-destructive methods of analysis
and data mining. Iberian hams were analyzed by Magnetic Resonance Imaging (MRI) and Computer
Vision Techniques (CVT) throughout their ripening process and physico-chemical parameters from them
were also measured. The obtained data were used to create an initial database. Deductive techniques of
data mining (multiple linear regression) were used to estimate new data, allowing the insertion of new
records in the database. Predictive techniques of data mining were applied (multiple linear regression) on
MRI-CVT data, achieving prediction equations for weight, moisture and lipid content. Finally, data from the prediction equations were compared to data determined by physico-chemical analysis, obtaining high correlation coefficients in most cases. Therefore, data mining, MRI and CVT are suitable tools to estimate
quality traits of Iberian hams. This would improve the control of the ham processing in a non-destructive
way.
The problem considered is that of finding frequent subpaths of a database of paths in a fixed undirected
graph. This problem arises in applications such as predicting congestion in network and vehicular traffic.
An algorithm, called AFS, based on the classic frequent itemset mining algorithm Apriori, is developed, but with efficiency improved over Apriori from exponential to quadratic in transaction size by exploiting the underlying graph structure. This efficiency makes AFS feasible for practical input path sizes. It is also proved that a natural generalization of the frequent subpaths problem admits no solution faster than Apriori.
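To make the Apriori connection concrete, the sketch below specializes the level-wise idea to paths: length-(k+1) candidates are generated only by extending frequent length-k subpaths along real graph edges, which is the structural pruning that avoids the exponential candidate blow-up. This illustrates the idea only and is not the AFS algorithm itself:

```python
def frequent_subpaths(paths, adj, minsup):
    """Level-wise mining of frequent subpaths. paths: list of node tuples;
    adj: node -> set of neighbors; minsup: minimum number of supporting paths."""
    def support(sub):
        # Count each path at most once, however many times the subpath occurs.
        k = len(sub)
        return sum(any(p[i:i + k] == sub for i in range(len(p) - k + 1))
                   for p in paths)
    # Level 1: frequent single edges.
    level = {s for p in paths for s in zip(p, p[1:]) if support(s) >= minsup}
    frequent = set(level)
    while level:
        # Extend each frequent subpath only along real graph edges (the
        # structural pruning), then keep the extensions that stay frequent.
        level = {sub + (n,) for sub in level for n in adj.get(sub[-1], set())
                 if support(sub + (n,)) >= minsup}
        frequent |= level
    return frequent

adj = {"a": {"b"}, "b": {"c"}, "c": set()}
paths = [("a", "b", "c"), ("a", "b"), ("b", "c")]
print(frequent_subpaths(paths, adj, minsup=2))  # {('a','b'), ('b','c')}
```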
TEXT CLUSTERING USING INCREMENTAL FREQUENT PATTERN MINING APPROACH, by IJDKP
Text mining is an emerging research field that evolved from information retrieval. Clustering and classification are two data mining approaches that may also be used for text classification and text clustering; classification is supervised, while clustering is unsupervised. In this paper, our objective is to perform text clustering by defining an improved distance metric to compute the similarity between two text files. We use incremental frequent pattern mining to find frequent items and reduce dimensionality. The improved distance metric may also be used for text classification. The distance metric is validated for the worst, average, and best case situations [15]. The results show the proposed distance metric outperforms the existing measures.
Data mining techniques are used to retrieve knowledge from large databases, helping organizations run their business effectively in a competitive world. Sometimes, however, this violates the privacy of individual customers. In this paper we propose an algorithm that addresses the privacy issues of individual customers, together with a transformation technique based on the Walsh-Hadamard transformation (WHT) and rotation. The WHT generates an orthogonal matrix that transfers the entire data set into a new domain while maintaining the distances between data records; however, those records can be reconstructed with statistical techniques (i.e., the inverse matrix), so a rotation transformation is applied to resolve this problem. In this work, the rotation transformation increases the difficulty for unauthorized parties of recovering the original data of other organizations. The experimental results show that the proposed transformation gives the same classification accuracy as the original data set. We compare the results with existing techniques such as data perturbation with Simple Additive Noise (SAN) and Multiplicative Noise (MN), Discrete Cosine Transformation (DCT), Wavelet, and First and Second order Sum and Inner product Preservation (FISIP) transformation techniques. Based on privacy measures, the paper concludes that the proposed transformation technique better maintains the privacy of individual customers.
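The distance-preservation claim can be illustrated directly: a Sylvester Hadamard matrix H satisfies HHᵀ = nI, so x ↦ Hx/√n is orthogonal and preserves Euclidean distances, and composing it with a rotation keeps that property while blocking a straightforward inversion. A small sketch under those assumptions (not the paper's exact pipeline; the rotation angle is arbitrary):

```python
import math

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of 2)."""
    if n == 1:
        return [[1]]
    h = hadamard(n // 2)
    return ([row + row for row in h] +
            [row + [-v for v in row] for row in h])

def transform(record, H, angle=0.7):
    """Orthogonal WHT (scaled by 1/sqrt(n)) followed by a plane rotation
    of the first two axes; both steps preserve pairwise distances."""
    n = len(H)
    y = [sum(H[i][j] * record[j] for j in range(n)) / math.sqrt(n)
         for i in range(n)]
    c, s = math.cos(angle), math.sin(angle)
    y[0], y[1] = c * y[0] - s * y[1], s * y[0] + c * y[1]
    return y

H = hadamard(4)
a, b = [1.0, 2.0, 3.0, 4.0], [2.0, 2.0, 1.0, 0.0]
dist = lambda u, v: math.dist(u, v)
# Distances before and after the transformation agree (up to float error).
print(round(dist(a, b), 6) == round(dist(transform(a, H), transform(b, H)), 6))
```

Because distances survive the transformation, distance-based classifiers trained on the transformed data behave as on the original, which is the accuracy result the abstract reports.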
Accurate time series classification using shapelets, by IJDKP
Time series data are sequences of values measured over time. One of the most recent approaches to classification of time series data is to find shapelets within a data set; time series shapelets are subsequences which represent a class. To compare two time series sequences, existing work uses the Euclidean distance measure. The problem with Euclidean distance is that it requires data to be standardized if scales differ. In this paper, we perform classification of time series data using time series shapelets with the Mahalanobis distance measure. The Mahalanobis distance is a descriptive statistic that provides a relative measure of a data point's distance (residual) from a common point, and it is used to identify and gauge the similarity of an unknown sample set to a known one. It differs from Euclidean distance in that it takes into account the correlations of the data set and is scale-invariant. We show that the Mahalanobis distance yields higher accuracy than the Euclidean distance measure.
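A small self-contained sketch of the adopted measure, d(x, μ) = √((x−μ)ᵀΣ⁻¹(x−μ)), computed in closed form for the 2-D case on an assumed sample set (the paper applies it to shapelet comparisons):

```python
import math

def mahalanobis_2d(x, data):
    """Mahalanobis distance from point x to the distribution of `data`
    (2-D case, with the covariance matrix inverted in closed form)."""
    n = len(data)
    mx = sum(p[0] for p in data) / n
    my = sum(p[1] for p in data) / n
    sxx = sum((p[0] - mx) ** 2 for p in data) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in data) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in data) / (n - 1)
    det = sxx * syy - sxy * sxy
    # Inverse covariance of [[sxx, sxy], [sxy, syy]] is
    # [[syy, -sxy], [-sxy, sxx]] / det.
    dx, dy = x[0] - mx, x[1] - my
    d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(d2)

sample = [(1.0, 2.0), (2.0, 2.5), (3.0, 3.8), (4.0, 4.1), (5.0, 5.2)]
print(mahalanobis_2d((3.0, 3.5), sample))
```

Dividing by the covariance is what makes the measure scale-invariant: rescaling one coordinate rescales Σ with it, so the distance is unchanged, unlike plain Euclidean distance.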
Emotion detection is one of the most emerging issues in human-computer interaction. A sufficient amount of work has been done by researchers to detect emotions from facial and audio information, whereas recognizing emotions from textual data is still a fresh and active research area. This paper presents a survey on emotion detection from textual data and the methods used for this purpose. The paper then proposes a new architecture for recognizing emotions from text documents. The proposed architecture is composed of two main parts: an emotion ontology and an emotion detector algorithm. The proposed emotion detector system takes a text document and the emotion ontology as inputs and produces one of six emotion classes (love, joy, anger, sadness, fear, and surprise) as the output.
EXPERIMENTS ON HYPOTHESIS "FUZZY K-MEANS IS BETTER THAN K-MEANS FOR CLUSTERING", by IJDKP
Clustering is a data mining technique used to discover business intelligence by grouping objects into clusters using a similarity measure. It is an unsupervised learning process with many real-time applications in marketing, biology, libraries, insurance, city planning, earthquake studies, and document clustering. Latent trends and relationships among data objects can be unearthed using clustering algorithms. Many clustering algorithms exist; however, the quality of the clusters is of paramount importance. The quality objective is to achieve the highest similarity between objects of the same cluster and the lowest similarity between objects of different clusters. In this context, we studied two widely used clustering algorithms, K-Means and Fuzzy K-Means: K-Means is an exclusive clustering algorithm, while Fuzzy K-Means is an overlapping clustering algorithm. In this paper we prove the hypothesis "Fuzzy K-Means is better than K-Means for Clustering" through both literature and empirical study. We built a prototype application to demonstrate the differences between the two clustering algorithms. The experiments were made on a diabetes dataset obtained from the UCI repository. The empirical results reveal that the performance of Fuzzy K-Means is better than that of K-Means in terms of quality, or accuracy, of the clusters; thus, our empirical study proved the hypothesis.
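The operational difference between the two algorithms is the membership update. A sketch of the fuzzy (overlapping) assignment with the common fuzzifier m = 2, where each point receives a degree of membership in every cluster rather than a hard label:

```python
def fuzzy_memberships(point, centroids, m=2.0):
    """Fuzzy c-means membership of `point` in each cluster:
    u_i = 1 / sum_k (d_i / d_k)^(2/(m-1)), where d_i is the distance
    from the point to centroid i and m > 1 is the fuzzifier."""
    d = [sum((a - b) ** 2 for a, b in zip(point, c)) ** 0.5 for c in centroids]
    if any(x == 0 for x in d):  # point coincides with a centroid
        return [1.0 if x == 0 else 0.0 for x in d]
    p = 2.0 / (m - 1.0)
    return [1.0 / sum((d[i] / d[k]) ** p for k in range(len(d)))
            for i in range(len(d))]

print(fuzzy_memberships((2.0, 2.0), [(1.0, 1.0), (5.0, 5.0)]))
# Hard K-Means would assign the point entirely to the first centroid;
# Fuzzy K-Means splits the membership (here 0.9 / 0.1).
```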
One of the most important problems in modern finance is finding efficient ways to summarize and visualize stock market data so as to give individuals and institutions useful information about market behavior for investment decisions. Investment can be considered one of the fundamental pillars of a national economy, so many investors seek criteria for comparing stocks and selecting the best, and choose strategies that maximize the earnings of the investment process. The enormous amount of valuable data generated by the stock market has therefore attracted researchers to explore this problem domain with different methodologies, and research in data mining has gained attention due to the importance of its applications and the growing volume of generated information. Data mining tools such as association rules, rule induction methods, and the Apriori algorithm are used to find associations between different scripts of the stock market, and much research and development has addressed the reasons for fluctuations in the Indian stock exchange. Nowadays, two additional factors, gold prices and US dollar prices, strongly influence the Indian stock market; statistical correlation is used to find the relationship between gold prices, dollar prices, and the BSE index, which helps the activities of stock operators, brokers, investors, and jobbers. These analyses are based on forecasting the fluctuation of index share prices, gold prices, dollar prices, and customer transactions. Hence the researcher has taken these problems as a topic for research.
Enhancement techniques for data warehouse staging area, by IJDKP
This document discusses techniques for enhancing the performance of data warehouse staging areas. It proposes two algorithms: 1) A semantics-based extraction algorithm that reduces extraction time by pruning useless data using semantic information. 2) A semantics-based transformation algorithm that similarly aims to reduce transformation time. It also explores three scheduling techniques (FIFO, minimum cost, round robin) for loading data into the data warehouse and experimentally evaluates their performance. The goal is to enhance each stage of the ETL process to maximize overall performance.
With the development of databases, the volume of stored data increases rapidly, and much important information is hidden in these large amounts of data. If that information can be extracted from the database, it will create a lot of profit for the organization. The question is how to extract this value, and the answer is data mining. There are many technologies available to data mining practitioners, including artificial neural networks, genetic algorithms, fuzzy logic, and decision trees. Many practitioners are wary of neural networks due to their black-box nature, even though they have proven themselves in many situations. This paper is an overview of artificial neural networks and questions their position as a preferred tool of data mining practitioners.
Clustering the results of a search helps the user get an overview of the information returned. In this paper, we treat the clustering task as cataloguing the search results; by catalogue we mean a structured label list that can help the user understand the labels and search results. Cluster labelling is crucial because meaningless or confusing labels may mislead users into checking the wrong clusters for the query and wasting time. Additionally, labels should accurately reflect the contents of the documents within the cluster. To label clusters effectively, a new cluster labelling method is introduced, with emphasis on producing comprehensible and accurate cluster labels in addition to discovering document clusters. We also present a new metric for assessing the success of cluster labelling. We adopt a comparative evaluation strategy to derive the relative performance of the proposed method with respect to two prominent search result clustering methods, Suffix Tree Clustering and Lingo, performing the experiments on the publicly available datasets Ambient and ODP-239.
Comparison between RISS and dCHARM for mining gene expression data, by IJDKP
Since the rapid advance of microarray technology, gene expression data have gained interest for revealing biological information about gene functions and their relation to health. Data mining techniques are effective and efficient at extracting useful patterns, but most current algorithms suffer from high processing time while generating frequent itemsets. The aim of this paper is to provide a comparative study of two Closed Frequent Itemset (CFI) algorithms, dCHARM and RISS, examined on high-dimensional data, specifically gene expression data. Nine experiments with different numbers of genes were conducted to examine the performance of both algorithms. It is found that RISS outperforms dCHARM in terms of processing time.
INTEGRATED ASSOCIATIVE CLASSIFICATION AND NEURAL NETWORK MODEL ENHANCED BY US..., by IJDKP
The document summarizes a proposed methodology that integrates associative classification and neural networks for improved classification accuracy. It begins by introducing association rule mining and associative classification. It then describes using chi-squared analysis and the Gini index for attribute selection and rule pruning to generate a reduced set of rules. These rules are used to train a backpropagation neural network classifier. The methodology is tested on datasets from a public repository, demonstrating improved accuracy over traditional associative classification alone. Future work to integrate optical neural networks is also proposed.
Recommendation system using bloom filter in mapreduceIJDKP
Many clients like to use the Web to discover product details in the form of online reviews, which are provided by other clients and specialists. Recommender systems provide an important response to the information overload problem, as they present users with more practical and personalized information services. Collaborative filtering (CF) methods are a vital component of recommender systems, as they generate high-quality recommendations by leveraging the preferences of communities of similar users. The collaborative filtering method assumes that people with the same tastes choose the same items. The conventional collaborative filtering system has drawbacks such as the sparse data problem and a lack of scalability. A new recommender system is required to deal with the sparse data problem and produce high-quality recommendations in a large-scale mobile environment. MapReduce is a programming model widely used for large-scale data analysis. The described recommendation mechanism for mobile commerce is user-based collaborative filtering using MapReduce, which reduces the scalability problem of the conventional CF system. One of the essential operations for data analysis is the join operation, but MapReduce is not very efficient at executing joins, as it always processes all records in the datasets even when only a small fraction is applicable to the join. This problem can be reduced by applying the bloom-join algorithm: Bloom filters are constructed and used to filter out redundant intermediate records. The proposed algorithm using Bloom filters reduces the number of intermediate results and improves join performance.
EFFECTIVE ARABIC STEMMER BASED HYBRID APPROACH FOR ARABIC TEXT CATEGORIZATIONIJDKP
The document presents a new hybrid stemming algorithm for Arabic text categorization. It combines three existing stemming approaches - root-based (Khoja stemmer), stem-based (Larkey stemmer), and statistical (N-gram). The hybrid approach first normalizes text, removes stop words and prefixes/suffixes. It then uses bigram similarity and the Dice measure to find valid roots by comparing word bigrams to a root dictionary. The algorithm is evaluated using naive Bayesian and SVM classifiers in an Arabic text categorization system, and is found to outperform the individual stemming methods. The proposed stemmer improves the performance of Arabic text categorization.
A key objective of every financial organization is to retain existing customers and attract new prospective customers for the long term. The economic behaviour of a customer and the nature of the organization are governed by a prescribed form called Know Your Customer (KYC) in manual banking. Depositor customers in some sectors (businesses in jewellery/gold, arms, money exchange, etc.) carry high risk; those in other sectors (transport operators, auto dealers, religious organizations) carry medium risk; and those in the remaining sectors (retail, corporate, service, farming, etc.) carry low risk. Presently, credit risk for a counterparty can be broadly categorized under quantitative and qualitative factors. Although there are many existing systems for customer retention as well as customer attrition in banks, these methods lack a clear and defined approach to disbursing loans in the business sector. In this paper, we have used records of business customers of a retail commercial bank covering the rural and urban areas of Tangail city, Bangladesh, to analyse the major transactional determinants of customers and to build a predictive model for prospective sectors in retail banking. To achieve this, a data mining approach is adopted, where a pruned decision tree classification technique has been used to develop the model, and its performance is finally tested against Weka results. Moreover, this paper attempts to build a model to predict prospective business sectors in retail banking.
For a human body to function properly, it is essential to have a certain amount of body fat. Fat serves to regulate body temperature and pads and protects the organs. Fat is the body's fundamental form of energy storage, so it is important to have a healthy amount of it; excess body fat can increase the risk of serious health issues. Anthropometry is a widely accessible and basic strategy for the assessment of body composition. Anthropometric measures include weight, height, Body Mass Index (BMI), waist circumference, biceps, skinfold, etc. The human fat percentage is computed from anthropometric variables. We propose a methodology to determine the body fat percentage using R programming and a regression formula. We analyzed 10 anthropometric variables and 3 demographic variables. Our study shows that certain variables have an edge over others in predicting body fat percentage.
The document proposes an improved clustering algorithm for social network analysis. It combines BSP (Business System Planning) clustering with Principal Component Analysis (PCA) to group social network objects into classes based on their links and attributes. Specifically, it applies PCA before BSP clustering to reduce the dimensionality of the social network data and retain only the most important variables for clustering. This improves the BSP clustering results by focusing on the key information in the social network.
This document discusses clustering of uncertain data objects. It first provides background on clustering uncertain data and challenges in doing so. It then proposes combining k-means clustering with Voronoi diagrams to improve the performance of k-means when clustering uncertain data. Specifically, it suggests using k-means to generate clusters and Voronoi diagrams to answer nearest neighbor queries, in order to minimize computation time. Finally, it concludes that integrating clustering algorithms with indexing methods can effectively cluster uncertain data objects.
This document summarizes a research paper that proposes a new semi-supervised dimensionality reduction algorithm called Semi-supervised Discriminant Analysis based on Data Structure (SGLS). SGLS aims to learn a low-dimensional embedding of high-dimensional data by integrating both global and local data structures, while also utilizing pairwise constraints to maximize discrimination. The algorithm formulates an optimization problem that incorporates a regularization term to preserve local geometry, while maximizing the distance between cannot-link pairs and minimizing the distance between must-link pairs in the embedding space. Experimental results on benchmark datasets demonstrate the effectiveness of SGLS compared to other semi-supervised dimensionality reduction methods.
Semi-Supervised Discriminant Analysis Based On Data Structureiosrjce
IOSR Journal of Computer Engineering (IOSR-JCE) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of computer engineering and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications in computer technology. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
PERFORMANCE EVALUATION OF SQL AND NOSQL DATABASE MANAGEMENT SYSTEMS IN A CLUSTERijdms
In this study, we evaluate the performance of SQL and NoSQL database management systems, namely Cassandra, CouchDB, MongoDB, PostgreSQL, and RethinkDB. We use a cluster of four nodes to run the database systems, with external load generators. The evaluation is conducted using data from Telenor Sverige, a telecommunication company that operates in Sweden. The experiments are conducted using three datasets of different sizes. The write throughput and latency as well as the read throughput and latency are evaluated for four queries, namely the distance query, k-nearest neighbour query, range query, and region query. For write operations, Cassandra has the highest throughput when multiple nodes are used, whereas PostgreSQL has the lowest latency and the highest throughput for a single node. For read operations, MongoDB has the lowest latency for all queries; however, Cassandra has the highest read throughput. The throughput decreases as the dataset size increases for both writes and reads, for both sequential and random order access, and this decrease is more significant for random reads and writes. In this study, we also present the experience we had with these different database management systems, including setup and configuration complexity.
SPATIAL R-TREE INDEX BASED ON GRID DIVISION FOR QUERY PROCESSINGijdms
Tracking moving objects has become essential in our lives and has many uses, such as GPS guidance, traffic-monitoring-based administration, and location-based services. Tracking the changing positions of objects has become an important issue. The moving entities send their positions to a server through a network, and a large amount of data with highly frequent updates is generated from these objects, so we need an index structure to retrieve information as fast as possible. The index structure should be adaptive and dynamic to monitor the locations of objects, and quick to respond to inquiries efficiently. The most well-known kinds of query strategies in moving-object databases are range, point, and k-nearest-neighbour queries. This study uses the R-tree method to get detailed range query results efficiently. But using the R-tree alone will generate much overlap and coverage between MBRs, so the R-tree is combined with a grid-partition index, because the grid index can reduce the overlap and coverage between MBRs. Query performance becomes efficient by using these methods. We perform an extensive experimental study to compare the two approaches on modern hardware.
Query Optimization Techniques in Graph Databasesijdms
Graph databases (GDB) have recently emerged to overcome the limits of traditional databases for storing and managing data with a graph-like structure. Today, they represent a requirement for many applications that manage graph-like data, like social networks. Most of the techniques applied to optimize queries in graph databases have been used in traditional databases or distributed systems, or are inspired by graph theory. However, their reuse in graph databases should take care of the main characteristics of graph databases, such as dynamic structure, highly interconnected data, and the ability to efficiently access data relationships. In this paper, we survey the query optimization techniques in graph databases. In particular, we focus on the features they have in common.
Drsp dimension reduction for similarity matching and pruning of time series ...IJDKP
The document summarizes a research paper that proposes a framework called DRSP (Dimension Reduction for Similarity Matching and Pruning) for time series data streams. DRSP addresses the challenges of large streaming data size by:
1) Performing dimension reduction using a Multi-level Segment Mean technique to compactly represent the data while retaining crucial information.
2) Incorporating a similarity matching technique to analyze if new data objects match existing streams.
3) Applying a pruning technique to filter out non-relevant data object pairs and join only relevant pairs.
The framework aims to reduce storage and computation costs for similarity matching on large time series data streams.
1) The document discusses a review of semantic approaches for nearest neighbor search. It describes using an ontology to add a semantic layer to an information retrieval system to relate concepts using query words.
2) A technique called spatial inverted index is proposed to locate multidimensional information and handle nearest neighbor queries by finding the hospitals closest to a given address.
3) Several semantic approaches are described including using clustering measures, specificity measures, link analysis, and relation-based page ranking to improve search and interpret hidden concepts behind keywords.
SVD BASED LATENT SEMANTIC INDEXING WITH USE OF THE GPU COMPUTATIONSijscmcj
The purpose of this article is to determine the usefulness of Graphics Processing Unit (GPU) calculations for implementing the Latent Semantic Indexing (LSI) reduction of the term-by-document matrix. The considered reduction of the matrix is based on the SVD (Singular Value Decomposition). The high computational complexity of the SVD, O(n³), makes the reduction of a large indexing structure a difficult task. This article compares the time complexity and accuracy of the algorithms implemented in two different environments: the first is associated with the CPU and MATLAB R2011a, the second with graphics processors and the CULA library. The calculations were carried out on generally available benchmark matrices, which were combined to produce a resulting matrix of large size. For both environments, computations were performed for double- and single-precision data.
This document proposes a data model for managing large point cloud data while integrating semantics. It presents a conceptual model composed of three interconnected meta-models to efficiently store and manage point cloud data, and allow the injection of semantics. A prototype is implemented using Python and PostgreSQL to combine semantic and spatial concepts for queries on indoor point cloud data captured with a terrestrial laser scanner.
Improving search time for contentment based image retrieval via, LSH, MTRee, ...IOSR Journals
This document proposes a new index structure called LSH-LUBMTree to improve search time for content-based image retrieval using the Earth Mover's Distance metric. LSH-LUBMTree combines Locality Sensitive Hashing (LSH) and the LUBMTree index. Images hashed to the same bucket via LSH are then stored in the LUBMTree to reduce false positives and accelerate search time. Experimental results show LSH-LUBMTree performs better than standard LSH in terms of search time by leveraging advantages of both LSH and LUBMTree indexing.
An Efficient Annotation of Search Results Based on Feature Ranking Approach f...Computer Science Journals
With the increasing number of web databases, a major part of the deep web consists of database content. In several search engines, the encoded data in the returned result pages often comes from structured databases, which are referred to as Web databases (WDB).
SEARCH OF INFORMATION BASED CONTENT IN SEMI-STRUCTURED DOCUMENTS USING INTERF...ijcsitcejournal
This paper proposes a semi-structured information retrieval model based on a new method for calculating similarity. We have developed the CASISS (Calculation of Similarity of Semi-Structured documents) method to quantify how similar two given texts are. This new method identifies elements of semi-structured documents using element descriptors. Each semi-structured document is pre-processed before the extraction of a set of descriptors for each element, which characterize the contents of the elements. It can be used to increase the accuracy of the information retrieval process by taking into account not only the presence of query terms in the given document but also the topology (position continuity) of these terms.
TYBSC IT PGIS Unit II Chapter I Data Management and Processing SystemsArti Parab Academics
This document discusses geographic information systems (GIS). It defines GIS as hardware and software used to process, store, and transfer geographic data. It describes how GIS has evolved from using analog data and manual processing to increased use of digital data, computers, and software. It also discusses key GIS concepts like spatial data capture and analysis, data storage and management, and data presentation.
ADVANCE DATABASE MANAGEMENT SYSTEM CONCEPTS & ARCHITECTURE by vikas jagtapVikas Jagtap
The data that indicates the earth location (latitude and longitude, or height and depth) of rendered objects is known as spatial data.
When the map is rendered, the objects of this spatial data are used to project the locations of the objects onto a 2-dimensional piece of paper.
Spatial data management systems are designed to make the storage, retrieval, and manipulation of spatial data (i.e., points, lines, and polygons) easier and more natural for users, such as in GIS.
While typical databases can understand various numeric and character data types, additional functionality needs to be added for databases to process spatial data types.
These are typically called geometry or feature types.
SCAF – AN EFFECTIVE APPROACH TO CLASSIFY SUBSPACE CLUSTERING ALGORITHMSijdkp
Subspace clustering discovers the clusters embedded in multiple, overlapping subspaces of high-dimensional data. Many significant subspace clustering algorithms exist, each having different characteristics caused by the use of different techniques, assumptions, heuristics, etc. A comprehensive classification scheme is essential which considers all such characteristics to divide subspace clustering approaches into various families. The algorithms belonging to the same family will satisfy common characteristics. Such a categorization will help future developers to better understand the quality criteria to be used and the similar algorithms against which to compare their proposed clustering algorithms. In this paper, we first propose the concept of SCAF (Subspace Clustering Algorithms' Family). The characteristics of a SCAF are based on classes such as cluster orientation, overlap of dimensions, etc. As an illustration, we further provide a comprehensive, systematic description and comparison of a few significant algorithms belonging to the "axis-parallel, overlapping, density-based" SCAF.
EFFICIENT SCHEMA BASED KEYWORD SEARCH IN RELATIONAL DATABASESIJCSEIT Journal
Keyword search in relational databases allows users to search for information without knowing the database schema or using the Structured Query Language (SQL). In this paper, we address the problem of generating and evaluating candidate networks. In candidate network generation, overhead is caused by the number of joining tuples growing with the size of the minimal candidate network. To reduce this overhead, we propose candidate network generation algorithms that generate a minimum number of joining tuples according to the maximum number of tuple sets. We first generate a set of joining tuples, the candidate networks (CNs). It is difficult to obtain an optimal query processing plan while generating a number of joins, so we also develop a dynamic CN evaluation algorithm (D_CNEval) to generate connected tuple trees (CTTs) by reducing the size of intermediate joining results. The performance evaluation of the proposed algorithms is conducted on the IMDB and DBLP datasets and compared with existing algorithms.
ACEP Magazine edition 4th launched on 05.06.2024Rahul
This document provides information about the third edition of the magazine "Sthapatya" published by the Association of Civil Engineers (Practicing) Aurangabad. It includes messages from current and past presidents of ACEP, memories and photos from past ACEP events, information on life time achievement awards given by ACEP, and a technical article on concrete maintenance, repairs and strengthening. The document highlights activities of ACEP and provides a technical educational article for members.
Understanding Inductive Bias in Machine LearningSUTEJAS
This presentation explores the concept of inductive bias in machine learning. It explains how algorithms come with built-in assumptions and preferences that guide the learning process. You'll learn about the different types of inductive bias and how they can impact the performance and generalizability of machine learning models.
The presentation also covers the positive and negative aspects of inductive bias, along with strategies for mitigating potential drawbacks. We'll explore examples of how bias manifests in algorithms like neural networks and decision trees.
By understanding inductive bias, you can gain valuable insights into how machine learning models work and make informed decisions when building and deploying them.
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressionsVictor Morales
K8sGPT is a tool that analyzes and diagnoses Kubernetes clusters. This presentation was used to share the requirements and dependencies to deploy K8sGPT in a local environment.
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...IJECEIAES
Climate change's impact on the planet has forced the United Nations and governments to promote green energy and electric transportation. The deployment of photovoltaic (PV) and electric vehicle (EV) systems has gained stronger momentum due to their numerous advantages over fossil fuel types. The advantages go beyond sustainability to include financial support and stability. The work in this paper introduces a hybrid system between PV and EV to support industrial and commercial plants. The paper covers the theoretical framework of the proposed hybrid system, including the equations required to complete the cost analysis when PV and EV are present. In addition, the proposed design diagram, which sets the priorities and requirements of the system, is presented. The proposed approach allows setups to improve their power stability, especially during power outages. The presented information supports researchers and plant owners in completing the necessary analysis while promoting the deployment of clean energy. The results of a case study representing a dairy milk farmer support the theoretical work and highlight its benefits to existing plants. The short return on investment supports the paper's novel approach to a sustainable electrical system. In addition, the proposed system allows for an isolated power setup without the need for a transmission line, which enhances the safety of the electrical network.
A review on techniques and modelling methodologies used for checking electrom...nooriasukmaningtyas
The proper function of the integrated circuit (IC) in an inhibiting electromagnetic environment has always been a serious concern throughout the decades of revolution in the world of electronics, from discrete devices to today's integrated circuit technology, where billions of transistors are combined on a single chip. The automotive industry, and smart vehicles in particular, are confronting design issues such as being prone to electromagnetic interference (EMI). Electronic control devices calculate incorrect outputs because of EMI, and sensors give misleading values, which can prove fatal in the case of automobiles. In this paper, the authors have non-exhaustively reviewed research work concerned with the investigation of EMI in ICs and the prediction of this EMI using various modelling methodologies and measurement setups.
Embedded machine learning-based road conditions and driving behavior monitoringIJECEIAES
Car accident rates have increased in recent years, resulting in losses in human lives, properties, and other financial costs. An embedded machine learning-based system is developed to address this critical issue. The system can monitor road conditions, detect driving patterns, and identify aggressive driving behaviors. The system is based on neural networks trained on a comprehensive dataset of driving events, driving styles, and road conditions. The system effectively detects potential risks and helps mitigate the frequency and impact of accidents. The primary goal is to ensure the safety of drivers and vehicles. Collecting data involved gathering information on three key road events: normal street and normal drive, speed bumps, circular yellow speed bumps, and three aggressive driving actions: sudden start, sudden stop, and sudden entry. The gathered data is processed and analyzed using a machine learning system designed for limited power and memory devices. The developed system resulted in 91.9% accuracy, 93.6% precision, and 92% recall. The achieved inference time on an Arduino Nano 33 BLE Sense with a 32-bit CPU running at 64 MHz is 34 ms and requires 2.6 kB peak RAM and 139.9 kB program flash memory, making it suitable for resource-constrained embedded systems.
Low power architecture of logic gates using adiabatic techniquesnooriasukmaningtyas
The growing significance of portable systems and the need to limit power consumption in ultra-large-scale-integration chips of very high density have recently led to rapid and inventive progress in low-power design. The most effective technique for energy-efficient hardware is adiabatic logic circuit design. This paper presents two adiabatic approaches for the design of low-power circuits: modified positive feedback adiabatic logic (modified PFAL) and direct-current diode-based positive feedback adiabatic logic (DC-DB PFAL). Logic gates are the preliminary components in any digital circuit design; by improving the performance of the basic gates, one can improve the performance of the whole system. In this paper, proposed low-power designs of OR/NOR, AND/NAND, and XOR/XNOR gates are presented using the said approaches, and their results are analyzed for power dissipation, delay, power-delay product, and rise time, and compared with other adiabatic techniques along with conventional complementary metal oxide semiconductor (CMOS) designs reported in the literature. It has been found that the designs with the DC-DB PFAL technique outperform the modified PFAL technique at 10 MHz, with percentage improvements of 65% for the NOR gate, 7% for the NAND gate, and 34% for the XNOR gate.
International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.5, No.4, July 2015
DOI: 10.5121/ijdkp.2015.5402
SPATIO-TEXTUAL SIMILARITY JOIN
Ch Shylaja and Supreethi K.P
Department of Computer Engineering,
Jawaharlal Nehru Technological University, Hyderabad, India
ABSTRACT
Data mining is the process of discovering interesting patterns and knowledge from large amounts of data.
Spatial databases store large amounts of space-related data, such as maps and preprocessed remote sensing or medical imaging data.
Modern mobile phones and mobile devices are equipped with GPS receivers, which is why location-based services have gained significant attention. These location-based services generate large amounts of spatio-textual data containing both a spatial location and a textual description. The same spatio-textual object may have different representations because of deviations in GPS readings or differing user descriptions. This calls for efficient methods to integrate spatio-textual data, and the spatio-textual similarity join meets this need: given two sets of spatio-textual objects, it finds all the similar pairs. A filter-and-refine framework will be developed to devise the algorithms. The prefix filter technique will be extended to generate spatial and textual signatures, and inverted indexes will be built on top of these signatures. Candidate pairs will be found using these indexes, and finally the candidate pairs will be refined to get the result. An MBR-prefix based signature will be used to prune dissimilar objects, and a hybrid signature will be used to support spatial and textual pruning simultaneously.
KEYWORDS
Spatio-textual similarity join, inverted index, candidate pairs, hybrid signature.
1. INTRODUCTION
Location-based services (LBS) have gained significant attention because of the omnipresence of GPS in smart phones and mobile devices. LBS are a general class of computer services that use location data to provide features. LBS are used in a variety of contexts such as health, indoor object search, entertainment, work, and personal life.

LBS generate large amounts of spatio-textual data which contain both a geographical location and a textual description. The same spatio-textual object may have different representations; the differences in textual representation may be caused by deviations in GPS readings or by differing user descriptions. The spatio-textual similarity join correlates spatio-textual data from different sources: given two input collections of objects, the similarity join identifies all pairs of objects, one from each collection, that have high similarity.

The similarity join is an important operation for reconciling different representations of an entity. Two objects are said to be similar if their spatial and textual similarities are greater than the given thresholds. There are various ways to quantify spatial similarity and textual similarity. As an
example of a spatial join, consider one data set describing parking lots and another describing the movie theaters of a city. Using the predicate "next to", a spatial join between these data sets will answer the query: "find all movie theaters that are adjacent to a parking lot".
Consider two collections of objects R = {r1, r2, . . . , rn} and S = {s1, s2, . . . , sm}. Each object r (or s) includes a spatial region Mr and a textual description Tr. In this paper we use the Minimum Bounding Rectangle (MBR) to capture the spatial information, denoted by Mr = [rbl, rtr], where rbl = (rbl.x, rbl.y) is the bottom-left point and rtr = (rtr.x, rtr.y) is the top-right point. We use a set of tokens to capture the textual description, denoted by Tr = {t1, t2, . . . , tv}, which describes an object (e.g., {Hotel, Pizza, Subway}) or users' interests (e.g., {Seaside, Sandwich, Delivery}). As tokens may have different importance, we assign each token ti a weight w(ti) (e.g., its inverse document frequency, idf). To quantify the similarity between two objects, we use the well-known Jaccard similarity as an example to evaluate the spatial similarity (SJac) and the textual similarity (TJac). Our techniques can be easily extended to support other similarity functions.
Definition 1 (Spatial Jaccard). Given two objects r and s, their spatial Jaccard similarity (SJac) is defined as:

SJac(r, s) = |Mr ∩ Ms| / (|Mr| + |Ms| − |Mr ∩ Ms|) …… (1)

where | · | is the size (area) of an MBR.
Definition 2 (Textual Jaccard). Given two objects r and s, their textual Jaccard similarity (TJac) is defined as:

TJac(r, s) = ∑_{t∈Tr∩Ts} w(t) / ∑_{t∈Tr∪Ts} w(t) …… (2)

where w(t) is the weight of token t.
Two objects r and s are similar if they satisfy (1) the spatial constraint: their spatial Jaccard similarity is larger than a spatial similarity threshold τs, i.e., SJac(r, s) > τs; and (2) the textual constraint: their textual Jaccard similarity is larger than a textual similarity threshold τt, i.e., TJac(r, s) > τt.
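To make the two constraints concrete, here is a minimal Python sketch of Definitions 1 and 2 (illustrative code, not from the paper; the object layout and helper names are assumptions):

from collections import namedtuple

# mbr is a pair of corner points ((bl.x, bl.y), (tr.x, tr.y)); tokens is a set
Obj = namedtuple("Obj", ["mbr", "tokens"])

def area(m):
    (blx, bly), (trx, tr_y) = m
    return (trx - blx) * (tr_y - bly)

def intersection_area(a, b):
    (ablx, ably), (atrx, atry) = a
    (bblx, bbly), (btrx, btry) = b
    w = min(atrx, btrx) - max(ablx, bblx)
    h = min(atry, btry) - max(ably, bbly)
    return w * h if w > 0 and h > 0 else 0.0

def s_jac(r, s):
    # Equation (1): intersection area over union area of the two MBRs
    inter = intersection_area(r.mbr, s.mbr)
    return inter / (area(r.mbr) + area(s.mbr) - inter)

def t_jac(r, s, w):
    # Equation (2): weighted overlap of the two token sets
    return (sum(w[t] for t in r.tokens & s.tokens) /
            sum(w[t] for t in r.tokens | s.tokens))

def is_similar(r, s, w, tau_s, tau_t):
    return s_jac(r, s) > tau_s and t_jac(r, s, w) > tau_t

# example:
# w = {"Hotel": 1.2, "Pizza": 0.8, "Subway": 1.5}
# r = Obj(mbr=((0, 0), (2, 2)), tokens={"Hotel", "Pizza"})
# s = Obj(mbr=((1, 1), (3, 3)), tokens={"Pizza", "Subway"})
# is_similar(r, s, w, tau_s=0.1, tau_t=0.2)  -> True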
2. RELATED WORK
Methods exist for the spatial similarity join and for the textual similarity join separately; however, previous methods for the spatio-textual similarity join are not available. The following are some of the previous methods used for spatial joins and textual joins.
2.1. Set-similarity joins
Given two input collections of sets, a set-similarity join (SSJoin) [2] identifies all pairs of sets,
one from each collection, that have high similarity. Recent work has identified SSJoin as a useful
primitive operator in data cleaning. SSJoin algorithms have two important features: They are
exact, i.e., they always produce the correct answer, and they carry precise performance
guarantees. A general-purpose data cleaning system is faced with the daunting task of supporting
a large number of similarity joins with different similarity functions. Recent work [3] has
identified set-similarity join (SSJoin) as a powerful primitive for supporting (string-) similarity
joins involving many common similarity functions. This algorithm can be implemented on top of
a regular DBMS with very little coding effort.
Figure 1: Jaccard SSJoin implementation
2.2. Spatial joins using R-trees
Spatial joins are one of the most important operations for combining spatial objects from several relations. R-trees are very suitable for supporting spatial queries, and the R*-tree is one of the most efficient members of the R-tree family [4]. In order to support spatial queries such as "find all objects which intersect a given window", spatial access methods cluster objects on disk according to their location in space. Consequently, if objects are close together in space, they are stored close together on disk with high probability. An R-tree [5] is a B+-tree-like access method that stores multidimensional rectangles as complete objects without clipping them or transforming them to higher-dimensional points. An example of an R-tree is given in Figure 2. The tree consists of three data pages and a directory page. Note that the rectangles of the directory page are the minimum bounding rectangles of the rectangles stored in the corresponding child nodes. The query window is depicted by the gray colored rectangle. First, the query is performed against the root of the R-tree, where rectangles r and t intersect the window. Thus, the corresponding two data pages are read into memory and their entries are checked for intersection with the window. Eventually, rectangle a1 is found to be an answer to the window query.
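The traversal just described can be sketched as follows (a minimal static R-tree walk in Python; the node layout and names are illustrative, not Guttman's original code):

class RTreeNode:
    def __init__(self, mbr, children=(), entries=()):
        self.mbr = mbr                   # (xmin, ymin, xmax, ymax)
        self.children = list(children)   # child nodes, for directory pages
        self.entries = list(entries)     # data rectangles, for data pages

def overlaps(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def window_query(node, window, result):
    # descend only into subtrees whose MBR intersects the query window
    if not overlaps(node.mbr, window):
        return
    for rect in node.entries:            # data page: test entries directly
        if overlaps(rect, window):
            result.append(rect)
    for child in node.children:          # directory page: recurse
        window_query(child, window, result)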
Figure 2: Example of an R-tree
2.3. Spatial hash joins
The spatial hash-join [6] shows how to apply the hash-join paradigm to spatial joins and defines a new framework for spatial hash joins. Relational hash joins yield excellent performance, particularly for relations that are large compared to buffer sizes. Like relational hash joins, the method partitions its input into buckets during the partition phase and then joins the buckets to produce join results in the join phase. However, unlike relational joins, a partition function in this framework [13] comprises two components: a set of bucket extents and an assignment function, which may map a data item into multiple buckets. Furthermore, the partition functions for the two input datasets may differ.
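A simplified sketch of this two-phase scheme, assuming uniform grid cells as bucket extents and an assignment function that replicates a rectangle into every cell it overlaps (one possible instantiation, not the exact partition functions of [6]):

from collections import defaultdict

def overlaps(a, b):  # same rectangle test as in the R-tree sketch above
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def buckets_for(rect, cell):
    # assignment function: yield every grid cell the rectangle overlaps
    xmin, ymin, xmax, ymax = rect
    for gx in range(int(xmin // cell), int(xmax // cell) + 1):
        for gy in range(int(ymin // cell), int(ymax // cell) + 1):
            yield (gx, gy)

def spatial_hash_join(R, S, cell=10.0):
    table = defaultdict(list)
    for r in R:                          # partition phase: bucket R
        for b in buckets_for(r, cell):
            table[b].append(r)
    result = set()
    for s in S:                          # join phase: probe with S
        for b in buckets_for(s, cell):
            for r in table[b]:
                if overlaps(r, s):
                    result.add((r, s))   # the set removes duplicate pairs
    return result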
2.4. Size separation spatial join
S3J [7] introduces a new algorithm to compute the spatial join of two or more spatial data sets when indexes are not available on them. Size Separation Spatial Join (S3J) is a generalization of the relational sort-merge join algorithm. S3J is designed so that no replication of the spatial entities is necessary, whereas previous approaches have required replication. The algorithm does not rely on statistical information about the data sets to perform the join efficiently, and for a range of distributions it offers guaranteed worst-case performance independent of the spatial statistics of the data sets.
2.5. SEAL: Spatio-textual similarity search:
Many modern LBS applications generate a new kind of spatio-textual data, regions-of-interest
(ROIs), containing region-based spatial information and textual description. To satisfy search
requirements on ROIs, SEAL [8] introduces a new research problem, called spatio-textual
similarity search: Given a set of ROIs and a query ROI, SEAL finds the ROIs which are similar
to the query by considering spatial overlap and textual similarity. Spatio-textual similarity search
can satisfy users' information needs in various real applications. A filter-and-verification framework is proposed to prune dissimilar objects and to visit only the small number of objects that may be similar to the given query.
2.6. Partition based spatial merge join
PBSM (Partition Based Spatial-Merge) [10] is an algorithm for performing the spatial join operation. This algorithm is especially effective when neither of the inputs to the join has an index on the joining attribute. Such a situation could arise if both inputs to the join are intermediate results in a complex query, or in a parallel environment where the inputs must be dynamically redistributed. The PBSM algorithm partitions the inputs into manageable chunks and joins them using a computational-geometry-based plane-sweeping technique.
There are various partitioning strategies for the spatio-textual similarity join [14]. One approach is to start with a spatial data structure (either grids or quadtrees), traverse the regions, and apply the All-Pairs algorithm [11] for identifying similar pairs of textual documents within each region. This is called the local approach. An alternative approach is to construct a global index but partition the postings spatially (either by grid or by quadtree, linearized by z-ordering) and modify the All-Pairs algorithm to prune candidates based on distance. This is called the global approach. Together, this yields four combinations: local grid, local quadtree, global grid, and global quadtree.
Figure 3: QuadTree Example
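For the global approaches, grid or quadtree cells are linearized by z-ordering; a standard Morton-code sketch (illustrative, not code from [14]):

def z_order(gx, gy, bits=16):
    # interleave the bits of the x and y cell coordinates (Morton code);
    # cells that are close in space tend to be close in z-order, so
    # postings can be range-partitioned on this code
    code = 0
    for i in range(bits):
        code |= ((gx >> i) & 1) << (2 * i)
        code |= ((gy >> i) & 1) << (2 * i + 1)
    return code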
The following are the different similarity functions that have traditionally been used in similarity joins: Jaccard similarity, cosine similarity, edit distance, and generalized edit distance. It is well known that no single similarity function is universally applicable across all domains and scenarios.
2.7. Jaccard similarity
Jaccard coefficient [9] is a commonly used measure of overlap of two sets A and B.
Jaccard(A, B) = |A ∩ B| / |A ∪ B|; Jaccard(A, A) = 1; Jaccard(A, B) = 0 if A ∩ B = ∅. A and B do not have to be of the same size. Jaccard similarity always assigns a number between 0 and 1. The issues with Jaccard similarity are: (1) it does not consider term frequency; and (2) rare terms in a collection are more informative than frequent terms, yet Jaccard weights all terms equally.
2.8. Cosine similarity
Cosine similarity measures the angle between two vectors and is calculated by dividing the inner product of the two vectors by the product of the vectors' lengths. The cosine similarity between documents di and dj is formulated as follows:
simcos(di, dj) = (di · dj) / (‖di‖ · ‖dj‖) = ∑_{k=1}^{n} wik × wjk / (√(∑_{k=1}^{n} wik²) × √(∑_{k=1}^{n} wjk²)) …… (3)
The cosine similarity values range between 0 and 1, where a cosine similarity of 0 means that the documents are unrelated and a cosine similarity close to 1 means that the documents are closely related. Obviously, in order to have a cosine similarity greater than 0, the documents should have some words in common. When all of the words and their associated weights are the same in two different documents, the cosine similarity between these two documents is equal to 1.
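In code, with documents represented as term-weight dictionaries, Equation (3) can be sketched as follows (illustrative Python):

import math

def cosine_similarity(wi, wj):
    # wi, wj: dicts mapping each term to its weight in documents di and dj
    dot = sum(w * wj.get(t, 0.0) for t, w in wi.items())
    norm_i = math.sqrt(sum(w * w for w in wi.values()))
    norm_j = math.sqrt(sum(w * w for w in wj.values()))
    return dot / (norm_i * norm_j) if norm_i and norm_j else 0.0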
2.9. Edit distance
Edit distance measures the minimum number of edit operations (insertion, deletion, and substitution) needed to transform one string into another. Edit distance [3] has two distinctive advantages over alternative distance or similarity measures: (a) it reflects the ordering of tokens in the string; and (b) it allows non-trivial alignment. These properties make edit distance a good measure in many application domains, e.g., to capture typographical errors in text documents, and to capture similarities between homologous proteins or genes. Similarity joins with edit distance thresholds pose serious algorithmic challenges, as computing the edit distance is more expensive (O(n²)) than alternative distance or similarity measures (usually O(n)).
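The O(n²) cost comes from the classic dynamic program, sketched here for reference (standard textbook algorithm, not specific to this paper):

def edit_distance(a, b):
    # d[i][j] = minimum number of insertions, deletions, and substitutions
    # needed to transform a[:i] into b[:j]
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # delete a[i-1]
                          d[i][j - 1] + 1,         # insert b[j-1]
                          d[i - 1][j - 1] + cost)  # substitute
    return d[m][n]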
3. PROPOSED SYSTEM
A signature-based filter-and-refine framework will be built. First, spatial and textual signatures for the objects will be generated, and inverted indexes will be built to avoid redundant computations. Then, these signatures will be used to find candidate pairs whose signatures are similar enough. Several algorithms will be proposed by combining the spatial and textual signatures in different ways. In addition, two effective techniques will be proposed to generate high-quality signatures. The first selects a subregion as the signature for each object; the subregion selected by this technique is minimized. The second technique is a hybrid signature which integrates the spatial information and the textual descriptions. Effective pruning techniques will also be developed to improve performance.

1) A filter-and-refine framework will be investigated, and efficient algorithms will be proposed which can prune large numbers of dissimilar objects.

2) An MBR-prefix based signature, which uses subregions of objects as signatures to support spatial pruning, will be added, and the subregions will be minimized.

3) A hybrid signature will be proposed by integrating spatial and textual pruning. To optimize the hybrid signature, we model the problem as a token partition problem and show that it is NP-complete.
Given two sets of spatio-textual objects, a signature-based similarity join method is used to find the similar objects from the given sets. For this, a signature-based filter-and-refine framework will be developed. First, spatial and textual signatures will be generated, and inverted indexes will be built on top of these signatures. Then candidate pairs will be generated using these inverted indexes. Finally, the candidate pairs will be refined to get the final result.
3.1 Generating spatial signature:
The prefix filtering technique can be utilized to generate spatial signatures as follows. Given an object r, the grids in Gr are sorted first. Then the signature for r, denoted by SIGS(r), is obtained by deleting the last k grids, where k satisfies

∑_{i=|Gr|−k+1}^{|Gr|} |Mgi ∩ Mr| ≤ τs · |Mr|  and  ∑_{i=|Gr|−k}^{|Gr|} |Mgi ∩ Mr| > τs · |Mr|.

If r and s are similar, then they share at least one common grid in their spatial signatures.
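A sketch of this deletion rule in Python, assuming the grid list of r is already sorted by the chosen global ordering and that the per-grid overlap areas are available (names are illustrative):

def spatial_signature(grids, overlap, area_r, tau_s):
    # grids: grid ids covering r, sorted by the global ordering
    # overlap[g]: |Mg ∩ Mr|; area_r: |Mr|
    budget = tau_s * area_r
    dropped, k = 0.0, 0
    # delete trailing grids while their accumulated overlap stays <= τs·|Mr|
    for g in reversed(grids):
        if dropped + overlap[g] > budget:
            break
        dropped += overlap[g]
        k += 1
    return grids[:len(grids) - k]   # SIGS(r)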
3.2 Generating textual signature:
The basic idea is that if the similarity of two token sets is larger than a given threshold, they should share enough common tokens; the index can then be reduced by cutting off some common tokens. For simplicity, suppose w(ti) = 1. Consider two token sets Tr and Ts. Since |Tr ∩ Ts| is an integer, according to [6], if TJac(Tr, Ts) > τt, then |Tr ∩ Ts| > τt × |Tr|, and hence |Tr ∩ Ts| ≥ ⌊τt × |Tr|⌋ + 1. Based on this property, we first sort all the tokens according to a global ordering, e.g., by idf. Then for each token set T, we generate its prefix by deleting the last ⌊τt × |T|⌋ tokens. If any two objects have common tokens in their prefixes, we add them into the candidate set. Next we extend the prefix filter technique to the case w(ti) ≠ 1 (the weight may not be an integer). We first sort the tokens in descending order of their weights. Since ∑_{t∈Tr∩Ts} w(t) > τt × ∑_{i=1}^{|Tr|} w(ti), for each token set T we generate its prefix, denoted SIGT(T), by deleting the last k tokens, where k satisfies

∑_{i=|T|−k+1}^{|T|} w(ti) ≤ τt × ∑_{i=1}^{|T|} w(ti)  and  ∑_{i=|T|−k}^{|T|} w(ti) > τt × ∑_{i=1}^{|T|} w(ti).
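The weighted prefix can be generated with the same tail-deletion pattern as the spatial signature (a minimal sketch under the stated sort order; names are illustrative):

def textual_signature(tokens, w, tau_t):
    # tokens: token list of r, sorted in descending order of weight w[t]
    budget = tau_t * sum(w[t] for t in tokens)
    dropped, k = 0.0, 0
    # delete trailing tokens while their accumulated weight stays <= τt·Σw(ti)
    for t in reversed(tokens):
        if dropped + w[t] > budget:
            break
        dropped += w[t]
        k += 1
    return tokens[:len(tokens) - k]  # SIGT(r)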
3.3 Generating inverted indexes and candidate pairs
Inverted indexes will be developed on top of these signatures, and the generated signatures will be used to get the candidate pairs. The objects are scanned sequentially, and an inverted index over the signatures of the visited objects is maintained. For each grid g ∈ SIGS(r), the corresponding spatial inverted list is probed and all the objects in these lists are added to the spatial candidate set CS. Meanwhile, for each token t ∈ SIGT(r), the corresponding textual inverted list is scanned and the objects are added to the textual candidate set CT. The intersection of CS and CT is then taken as the candidate set.
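Putting the pieces together, a self-join over one object list can be sketched as follows (illustrative Python reusing the two signature sketches above; the object fields are assumptions):

from collections import defaultdict

def candidate_pairs(objects, w, tau_s, tau_t):
    inv_s = defaultdict(set)   # grid  -> ids of previously visited objects
    inv_t = defaultdict(set)   # token -> ids of previously visited objects
    cand = set()
    for i, o in enumerate(objects):
        sig_s = spatial_signature(o.grids, o.overlap, o.area, tau_s)
        sig_t = textual_signature(o.tokens, w, tau_t)
        cs = set().union(*[inv_s[g] for g in sig_s])   # spatial candidates CS
        ct = set().union(*[inv_t[t] for t in sig_t])   # textual candidates CT
        cand |= {(j, i) for j in cs & ct}              # must collide on both
        for g in sig_s:                                # index update
            inv_s[g].add(i)
        for t in sig_t:
            inv_t[t].add(i)
    return cand   # each pair is then refined with equations (1) and (2)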
3.4 Algorithms used
Algorithm 1:
Input: R: an object set; τs: spatial threshold; τt: textual threshold
Output: P: a set of similar pairs
1. Begin
2. I ← empty index;
3. C ← empty candidate set;
4. For each object r ∈ R do
5.     SIG(r) ← SigGeneration(r, τs, τt);
6.     C ← C ∪ CandidateFiltering(SIG(r), I);
7.     Insert SIG(r) into I;
8. P ← Refinement(C);
9. End

Algorithm 1 includes three steps, namely filter, index update, and refine. This algorithm can be used to avoid enumerating every object pair.
Algorithm 2:
Input: set collections R and S and threshold γ
1. Begin
2. For each r ∈ R, generate the signature set Sign(r)
3. For each s ∈ S, generate the signature set Sign(s)
4. Generate all candidate pairs (r, s), r ∈ R and s ∈ S, satisfying Sign(r) ∩ Sign(s) ≠ ∅
5. Output every candidate pair (r, s) satisfying Sim(r, s) ≥ γ
6. End

Algorithm 2 is used to generate candidate pairs from the signatures built on the objects.
4. IMPLEMENTATION
Observe that the form of predicate considered here involves multi-set intersection when any R.A (or S.A) group contains multiple values on the R.B attribute. In order to implement it using standard relational operators, these multi-sets are converted into sets: each value in R.B and S.B is converted into an ordered pair containing an ordinal number that distinguishes it from its duplicates. Thus, for example, the multi-set {1, 1, 2} would be converted to {<1, 1>, <1, 2>, <2, 1>}. Since set intersections can be implemented using joins, the conversion enables us to perform multi-set intersections using joins.
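A small sketch of this conversion (illustrative Python):

from collections import Counter

def multiset_to_set(values):
    # pair each value with an ordinal so duplicates become distinct:
    # [1, 1, 2] -> {(1, 1), (1, 2), (2, 1)}
    seen = Counter()
    out = set()
    for v in values:
        seen[v] += 1
        out.add((v, seen[v]))
    return out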
The two sets of spatio-textual objects that are considered are (1) the Twitter users, modeled as spatio-textual objects, and (2) a message set consisting of advertisements. The spatio-textual similarity join is performed between these two sets.

Twitter allows its users to upload their location along with their tweets; hence these users can be modeled as spatio-textual objects. Meanwhile, Twitter messages such as advertisements are also modeled as spatio-textual objects. To deliver messages to relevant users, the spatio-textual similarity join is used.
The Twitter users' tweets are collected by creating a widget for the respective user. These tweets are parsed and unique terms are identified. The user uploads his location along with his tweets. The message set consists of advertisements with their spatial locations and the textual descriptions of the advertisements. The spatio-textual similarity join is applied to the spatial and textual data in both sets, and similar object pairs are found. Based on the similar object pairs, the messages are delivered to the relevant users.
To implement this, the following system specifications are required: the hardware requirements are 512 MB DDR RAM, a 40 GB hard disk, and a Pentium IV 2.4 GHz processor; the software requirements are the Windows 7 operating system, the NetBeans 7.4 IDE, Java, a MySQL database, the Apache Tomcat server, the Twitter API, and the Google API.
5. RESULTS
The objects in the advertisement set that are spatially close to a user and textually similar to that user's tweets are recommended to the user as advertisements. These advertisements will then be delivered to the user.
6. CONCLUSION
A new research problem called the spatio-textual similarity join is studied in this paper. A prefix-filter based framework will be devised and several possible solutions will be proposed. To achieve higher performance, an MBR-prefix based filtering technique will be developed which can prune a large number of dissimilar objects. A hybrid signature integrating spatial information and textual descriptions will be proposed, and a token partition problem will be modeled. Jaccard similarity will be used to prune spatial data and cosine similarity will be used to prune textual data.
REFERENCES
[1] J.Han, M.Kamber, and J.Pei, Data Mining. Waltham, MA: USA, 2012.
[2] A. Arasu, V. Ganti, and R. Kaushik, “Efficient Exact Set-Similarity Joins,” Proc. 32nd Int. Conf.
Very Large Data Bases, pp. 918–929, 2006.
[3] S. Chaudhuri, V. Ganti, and R. Kaushik, “A primitive operator for similarity joins in data cleaning,”
Proc. - Int. Conf. Data Eng., vol. 2006, p. 5, 2006.
[4] T. Brinkhoff, H.-P. Kriegel, and B. Seeger, "Efficient Processing of Spatial Joins Using R-trees," Proc. ACM SIGMOD Int. Conf. on Management of Data, pp. 237–246, 1993.
[5] A. Guttman, “R-trees: A Dynamic Index Structure for Spatial Searching,” Proc. 1984 ACM SIGMOD
Int. Conf. Manag. Data - SIGMOD ’84, pp. 47–57, 1984.
[6] M.-L. Lo and C. V. Ravishankar, “Spatial hash-joins,” ACM SIGMOD Rec., vol. 25, pp. 247–258,
1996.
[7] N. Koudas and K. C. Sevcik, “Size separation spatial join,” ACM SIGMOD Rec., vol. 26, pp. 324–
335, 1997.
[8] J. Fan, G. Li, L. Zhou, S. Chen, and J. Hu, “SEAL : Spatio-Textual Similarity Search,” Proc. VLDB
Endow., vol. 5, pp. 824–835, 2012.
[9] S. Niwattanakul, J. Singthongchai, E. Naenudorn, and S. Wanapu, “Using of Jaccard Coefficient for
Keywords Similarity,” Int. MultiConference Eng. Comput. Sci., vol. I, pp. 380–384, 2013.
[10] J. M. Patel and D. J. DeWitt, "Partition Based Spatial-Merge Join," ACM SIGMOD Rec., vol. 25, pp. 259–270, 1996.
[11] R. J. Bayardo, Y. Ma, and R. Srikant, "Scaling Up All Pairs Similarity Search," Proc. 16th Int. Conf. on World Wide Web (WWW), pp. 131–140, 2007.
[12] S. Liu, G. Li, and J. Feng, "A Prefix Filter Based Method for Spatio-Textual Similarity Join," IEEE Trans. Knowledge and Data Engineering, vol. 26, no. 10, pp. 2354–2367, 2013.
[13] M. Kitsuregawa, H. Tanaka, and T. Moto-Oka, "Application of Hash to Data Base Machine and Its Architecture," New Generation Computing, vol. 1, no. 1, pp. 66–74, 1983.
[14] J. Rao, J. Lin, and H. Samet, "Partition Strategies for Spatio-Textual Similarity Join," Proc. ACM SIGSPATIAL International Workshop on Analytics for Big Geospatial Data (BigSpatial), 2014, ISBN 978-1-4503-3132-6.