The document describes the PM-tree, a new metric access method that combines the M-tree with pivot-based approaches to improve efficiency of similarity search in multimedia databases. The PM-tree enhances M-tree routing and ground entries by including pivot-based information like hyper-ring regions defined by pivot objects and distances. This reduces the volume of metric regions described by entries, tightly bounding indexed objects and improving retrieval performance. Algorithms for building and querying the PM-tree are presented, showing how pivot distances are used to prune irrelevant regions during search.
Data mining is the process of discovering interesting patterns and knowledge from large amounts of data. Spatial databases store large space related data, such as maps, preprocessed remote sensing or medical imaging data.
Modern mobile phones and mobile devices are equipped with GPS devices; this is the reason for the Location based services to gain significant attention. These Location based services generate large amounts of spatio- textual data which contain both spatial location and textual description. The spatiotextual objects have different representations because of deviations in GPS or due to different user descriptions. This calls for the need of efficient methods to integrate spatio-textual data. Spatio-textual similarity join meets this need. Spatio-textual similarity join: Given two sets of spatio-textual data, it finds all the similar pairs. Filter and refine framework will be developed to device the algorithms. The prefix
filter technique will be extended to generate spatial and textual signatures and inverted indexes will be
built on top of these signatures. Candidate pairs will be found using these indexes. Finally the candidate pairs will be refined to get the result. MBR-prefix based signature will be used to prune dissimilar objects. Hybrid signature will be used to support spatial and textual pruning simultaneously.
SVD BASED LATENT SEMANTIC INDEXING WITH USE OF THE GPU COMPUTATIONSijscmcj
The purpose of this article is to determine the usefulness of the Graphics Processing Unit (GPU) calculations used to implement the Latent Semantic Indexing (LSI) reduction of the TERM-BY DOCUMENT matrix. Considered reduction of the matrix is based on the use of the SVD (Singular Value Decomposition) decomposition. A high computational complexity of the SVD decomposition - O(n3), causes that a reduction of a large indexing structure is a difficult task. In this article there is a comparison of the time complexity and accuracy of the algorithms implemented for two different environments. The first environment is associated with the CPU and MATLAB R2011a. The second environment is related to graphics processors and the CULA library. The calculations were carried out on generally available benchmark matrices, which were combined to achieve the resulting matrix of high size. For both considered environments computations were performed for double and single precision data.
Data mining is the process of discovering interesting patterns and knowledge from large amounts of data. Spatial databases store large space related data, such as maps, preprocessed remote sensing or medical imaging data.
Modern mobile phones and mobile devices are equipped with GPS devices; this is the reason for the Location based services to gain significant attention. These Location based services generate large amounts of spatio- textual data which contain both spatial location and textual description. The spatiotextual objects have different representations because of deviations in GPS or due to different user descriptions. This calls for the need of efficient methods to integrate spatio-textual data. Spatio-textual similarity join meets this need. Spatio-textual similarity join: Given two sets of spatio-textual data, it finds all the similar pairs. Filter and refine framework will be developed to device the algorithms. The prefix
filter technique will be extended to generate spatial and textual signatures and inverted indexes will be
built on top of these signatures. Candidate pairs will be found using these indexes. Finally the candidate pairs will be refined to get the result. MBR-prefix based signature will be used to prune dissimilar objects. Hybrid signature will be used to support spatial and textual pruning simultaneously.
SVD BASED LATENT SEMANTIC INDEXING WITH USE OF THE GPU COMPUTATIONSijscmcj
The purpose of this article is to determine the usefulness of the Graphics Processing Unit (GPU) calculations used to implement the Latent Semantic Indexing (LSI) reduction of the TERM-BY DOCUMENT matrix. Considered reduction of the matrix is based on the use of the SVD (Singular Value Decomposition) decomposition. A high computational complexity of the SVD decomposition - O(n3), causes that a reduction of a large indexing structure is a difficult task. In this article there is a comparison of the time complexity and accuracy of the algorithms implemented for two different environments. The first environment is associated with the CPU and MATLAB R2011a. The second environment is related to graphics processors and the CULA library. The calculations were carried out on generally available benchmark matrices, which were combined to achieve the resulting matrix of high size. For both considered environments computations were performed for double and single precision data.
Hex-Cell is an interconnection network that has attractive features like the embedding capability of topological structures; such as; bus, ring, tree and mesh topologies. In this paper, we present two algorithms for embedding bus and ring topologies onto Hex-Cell interconnection network. We use three metrics to evaluate our proposed algorithms: dilation, congestion, and expansion. Our evaluation results
show that the congestion of our two proposed algorithms is equal to one; and the dilation is equal to 2d-1 for the first algorithm and 1 for the second.
AN EFFECT OF USING A STORAGE MEDIUM IN DIJKSTRA ALGORITHM PERFORMANCE FOR IDE...ijcsit
The graph model is used widely for representing connected objects within a specific area. These objects are defined as nodes; where the connection is represented as arc called edges. The shortest path between two nodes is one of the most focus researchers’ attentions. Many algorithms are developed with different structured approach for reducing the shortest path cost. The most widely used algorithm is Dijkstra algorithm. This algorithm has been represented with various structural developments in order to reduce the shortest path cost. This paper highlights the idea of using a storage medium to store the solution path from Dijkstra algorithm, then, uses it to find the implicit path in an ideal time cost. The performance of Dijkstra algorithm using an appropriate data structure is improved. Our results emphasize that the searching time through the given data structure is reduced within different graphs models.
Parallel Key Value Pattern Matching Modelijsrd.com
Mining frequent itemsets from the huge transactional database is an important task in data mining. To find frequent itemsets in databases involves big decision in data mining for the purpose of extracting association rules. Association rule mining is used to find relationships among large datasets. Many algorithms were developed to find those frequent itemsets. This work presents a summarization and new model of parallel key value pattern matching model which shards a large-scale mining task into independent, parallel tasks. It produces a frequent pattern showing their capabilities and efficiency in terms of time consumption. It also avoids the high computational cost. It discovers the frequent item set from the database.
Spatial database are becoming more and more popular in recent years. There is more and more
commercial and research interest in location-based search from spatial database. Spatial keyword search
has been well studied for years due to its importance to commercial search engines. Specially, a spatial
keyword query takes a user location and user-supplied keywords as arguments and returns objects that are
spatially and textually relevant to these arguments. Geo-textual index play an important role in spatial
keyword querying. A number of geo-textual indices have been proposed in recent years which mainly
combine the R-tree and its variants and the inverted file. This paper propose new index structure that
combine K-d tree and inverted file for spatial range keyword query which are based on the most spatial
and textual relevance to query point within given range.
A Novel Multi- Viewpoint based Similarity Measure for Document ClusteringIJMER
International Journal of Modern Engineering Research (IJMER) is Peer reviewed, online Journal. It serves as an international archival forum of scholarly research related to engineering and science education.
International Journal of Modern Engineering Research (IJMER) covers all the fields of engineering and science: Electrical Engineering, Mechanical Engineering, Civil Engineering, Chemical Engineering, Computer Engineering, Agricultural Engineering, Aerospace Engineering, Thermodynamics, Structural Engineering, Control Engineering, Robotics, Mechatronics, Fluid Mechanics, Nanotechnology, Simulators, Web-based Learning, Remote Laboratories, Engineering Design Methods, Education Research, Students' Satisfaction and Motivation, Global Projects, and Assessment…. And many more.
Query Optimization Techniques in Graph Databasesijdms
Graph databases (GDB) have recently been arisen to overcome the limits of traditional databases for
storing and managing data with graph-like structure. Today, they represent a requirementfor many
applications that manage graph-like data,like social networks.Most of the techniques, applied to optimize
queries in graph databases, have been used in traditional databases, distribution systems,… or they are
inspired from graph theory. However, their reuse in graph databases should take care of the main
characteristics of graph databases, such as dynamic structure, highly interconnected data, and ability to
efficiently access data relationships. In this paper, we survey the query optimization techniques in graph
databases. In particular,we focus on the features they have in
International Journal of Engineering Research and DevelopmentIJERD Editor
Electrical, Electronics and Computer Engineering,
Information Engineering and Technology,
Mechanical, Industrial and Manufacturing Engineering,
Automation and Mechatronics Engineering,
Material and Chemical Engineering,
Civil and Architecture Engineering,
Biotechnology and Bio Engineering,
Environmental Engineering,
Petroleum and Mining Engineering,
Marine and Agriculture engineering,
Aerospace Engineering.
SPATIAL R-TREE INDEX BASED ON GRID DIVISION FOR QUERY PROCESSINGijdms
Tracing moving objects have turned out to be essential in our life and have a lot of uses like: GPS guide,
traffic monitor based administrations and location-based services. Tracking the changing places of objects
has turned into important issues. The moving entities send their positions to the server through a system
and large amount of data is generated from these objects with high frequent updates so we need an index
structure to retrieve information as fast as possible. The index structure should be adaptive, dynamic to
monitor the locations of objects and quick to give responses to the inquiries efficiently. The most wellknown
kinds of queries strategies in moving objects databases are Rang, Point and K-Nearest Neighbour
and inquiries. This study uses R-tree method to get detailed range query results efficiently. But using R-tree
only will generate much overlapping and coverage between MBR. So R-tree by combining with Gridpartition
index is used because grid-index can reduce the overlap and coverage between MBR. The query
performance will be efficient by using these methods. We perform an extensive experimental study to
compare the two approaches on modern hardware.
This article got published in the Software Developer's Journal's February Edition.
It describes the use of MapReduce paradigm to design Clustering algorithms and explain three algorithms using MapReduce.
- K-Means Clustering
- Canopy Clustering
- MinHash Clustering
Electrical, Electronics and Computer Engineering,
Information Engineering and Technology,
Mechanical, Industrial and Manufacturing Engineering,
Automation and Mechatronics Engineering,
Material and Chemical Engineering,
Civil and Architecture Engineering,
Biotechnology and Bio Engineering,
Environmental Engineering,
Petroleum and Mining Engineering,
Marine and Agriculture engineering,
Aerospace Engineering.
An artificial immune system algorithm for solving the uncapacitated single all...IJECEIAES
The present paper deals with a variant of hub location problems (HLP): the uncapac- itated single allocation p-Hub median problem (USApHMP). This problem consists to jointly locate hub facilities and to allocate demand nodes to these selected facilities. The objective function is to minimize the routing of demands between any origin and destination pair of nodes. This problem is known to be NP-hard. Based on the artificial immune systems (AIS) framework, this paper develops a new approach to efficiently solve the USApHMP. The proposed approach is in the form of a clonal selection algorithm (CSA) that uses appropriate encoding schemes of solutions and maintains their feasibility. Comprehensive experiments and comparison of the proposed approach with other existing heuristics are conducted on benchmark from civil aeronautics board, Australian post, PlanetLab and Urand data sets. The results obtained allow to demonstrate the validity and the effectiveness of our approach. In terms of solution quality, the results obtained outperform the best-known solutions in the literature.
TASK-DECOMPOSITION BASED ANOMALY DETECTION OF MASSIVE AND HIGH-VOLATILITY SES...ijdpsjournal
The Science Information Network (SINET) is a Japanese academic backbone network for more than 800 universities and research institutions. The characteristic of SINET traffic is that it is enormous and highly variable. In this paper, we present a task-decomposition based anomaly detection of massive and highvolatility session data of SINET. Three main features are discussed: Tash scheduling, Traffic discrimination, and Histogramming. We adopt a task-decomposition based dynamic scheduling method to handle the massive session data stream of SINET. In the experiment, we have analysed SINET traffic from 2/27 to 3/8 and detect some anomalies by LSTM based time-series data processing.
A plethora of infinite data is generated from the Internet and other information sources. Analyzing this massive data in real-time and extracting valuable knowledge using different mining applications platforms have been an area for research and industry as well. However, data stream mining has different challenges making it different from traditional data mining. Recently, many studies have addressed the concerns on massive data mining problems and proposed several techniques that produce impressive results. In this paper, we review real time clustering and classification mining techniques for data stream. We analyze the characteristics of data stream mining and discuss the challenges and research issues of data steam mining. Finally, we present some of the platforms for data stream mining.
A talk about Virtual Reality. The terminology, how it works what is available and what is coming in the future. When I do this talk, paired with showing hands on Oculus Rift and Google Cardboard.
Hex-Cell is an interconnection network that has attractive features like the embedding capability of topological structures; such as; bus, ring, tree and mesh topologies. In this paper, we present two algorithms for embedding bus and ring topologies onto Hex-Cell interconnection network. We use three metrics to evaluate our proposed algorithms: dilation, congestion, and expansion. Our evaluation results
show that the congestion of our two proposed algorithms is equal to one; and the dilation is equal to 2d-1 for the first algorithm and 1 for the second.
AN EFFECT OF USING A STORAGE MEDIUM IN DIJKSTRA ALGORITHM PERFORMANCE FOR IDE...ijcsit
The graph model is used widely for representing connected objects within a specific area. These objects are defined as nodes; where the connection is represented as arc called edges. The shortest path between two nodes is one of the most focus researchers’ attentions. Many algorithms are developed with different structured approach for reducing the shortest path cost. The most widely used algorithm is Dijkstra algorithm. This algorithm has been represented with various structural developments in order to reduce the shortest path cost. This paper highlights the idea of using a storage medium to store the solution path from Dijkstra algorithm, then, uses it to find the implicit path in an ideal time cost. The performance of Dijkstra algorithm using an appropriate data structure is improved. Our results emphasize that the searching time through the given data structure is reduced within different graphs models.
Parallel Key Value Pattern Matching Modelijsrd.com
Mining frequent itemsets from the huge transactional database is an important task in data mining. To find frequent itemsets in databases involves big decision in data mining for the purpose of extracting association rules. Association rule mining is used to find relationships among large datasets. Many algorithms were developed to find those frequent itemsets. This work presents a summarization and new model of parallel key value pattern matching model which shards a large-scale mining task into independent, parallel tasks. It produces a frequent pattern showing their capabilities and efficiency in terms of time consumption. It also avoids the high computational cost. It discovers the frequent item set from the database.
Spatial database are becoming more and more popular in recent years. There is more and more
commercial and research interest in location-based search from spatial database. Spatial keyword search
has been well studied for years due to its importance to commercial search engines. Specially, a spatial
keyword query takes a user location and user-supplied keywords as arguments and returns objects that are
spatially and textually relevant to these arguments. Geo-textual index play an important role in spatial
keyword querying. A number of geo-textual indices have been proposed in recent years which mainly
combine the R-tree and its variants and the inverted file. This paper propose new index structure that
combine K-d tree and inverted file for spatial range keyword query which are based on the most spatial
and textual relevance to query point within given range.
A Novel Multi- Viewpoint based Similarity Measure for Document ClusteringIJMER
International Journal of Modern Engineering Research (IJMER) is Peer reviewed, online Journal. It serves as an international archival forum of scholarly research related to engineering and science education.
International Journal of Modern Engineering Research (IJMER) covers all the fields of engineering and science: Electrical Engineering, Mechanical Engineering, Civil Engineering, Chemical Engineering, Computer Engineering, Agricultural Engineering, Aerospace Engineering, Thermodynamics, Structural Engineering, Control Engineering, Robotics, Mechatronics, Fluid Mechanics, Nanotechnology, Simulators, Web-based Learning, Remote Laboratories, Engineering Design Methods, Education Research, Students' Satisfaction and Motivation, Global Projects, and Assessment…. And many more.
Query Optimization Techniques in Graph Databasesijdms
Graph databases (GDB) have recently been arisen to overcome the limits of traditional databases for
storing and managing data with graph-like structure. Today, they represent a requirementfor many
applications that manage graph-like data,like social networks.Most of the techniques, applied to optimize
queries in graph databases, have been used in traditional databases, distribution systems,… or they are
inspired from graph theory. However, their reuse in graph databases should take care of the main
characteristics of graph databases, such as dynamic structure, highly interconnected data, and ability to
efficiently access data relationships. In this paper, we survey the query optimization techniques in graph
databases. In particular,we focus on the features they have in
International Journal of Engineering Research and DevelopmentIJERD Editor
Electrical, Electronics and Computer Engineering,
Information Engineering and Technology,
Mechanical, Industrial and Manufacturing Engineering,
Automation and Mechatronics Engineering,
Material and Chemical Engineering,
Civil and Architecture Engineering,
Biotechnology and Bio Engineering,
Environmental Engineering,
Petroleum and Mining Engineering,
Marine and Agriculture engineering,
Aerospace Engineering.
SPATIAL R-TREE INDEX BASED ON GRID DIVISION FOR QUERY PROCESSINGijdms
Tracing moving objects have turned out to be essential in our life and have a lot of uses like: GPS guide,
traffic monitor based administrations and location-based services. Tracking the changing places of objects
has turned into important issues. The moving entities send their positions to the server through a system
and large amount of data is generated from these objects with high frequent updates so we need an index
structure to retrieve information as fast as possible. The index structure should be adaptive, dynamic to
monitor the locations of objects and quick to give responses to the inquiries efficiently. The most wellknown
kinds of queries strategies in moving objects databases are Rang, Point and K-Nearest Neighbour
and inquiries. This study uses R-tree method to get detailed range query results efficiently. But using R-tree
only will generate much overlapping and coverage between MBR. So R-tree by combining with Gridpartition
index is used because grid-index can reduce the overlap and coverage between MBR. The query
performance will be efficient by using these methods. We perform an extensive experimental study to
compare the two approaches on modern hardware.
This article got published in the Software Developer's Journal's February Edition.
It describes the use of MapReduce paradigm to design Clustering algorithms and explain three algorithms using MapReduce.
- K-Means Clustering
- Canopy Clustering
- MinHash Clustering
Electrical, Electronics and Computer Engineering,
Information Engineering and Technology,
Mechanical, Industrial and Manufacturing Engineering,
Automation and Mechatronics Engineering,
Material and Chemical Engineering,
Civil and Architecture Engineering,
Biotechnology and Bio Engineering,
Environmental Engineering,
Petroleum and Mining Engineering,
Marine and Agriculture engineering,
Aerospace Engineering.
An artificial immune system algorithm for solving the uncapacitated single all...IJECEIAES
The present paper deals with a variant of hub location problems (HLP): the uncapac- itated single allocation p-Hub median problem (USApHMP). This problem consists to jointly locate hub facilities and to allocate demand nodes to these selected facilities. The objective function is to minimize the routing of demands between any origin and destination pair of nodes. This problem is known to be NP-hard. Based on the artificial immune systems (AIS) framework, this paper develops a new approach to efficiently solve the USApHMP. The proposed approach is in the form of a clonal selection algorithm (CSA) that uses appropriate encoding schemes of solutions and maintains their feasibility. Comprehensive experiments and comparison of the proposed approach with other existing heuristics are conducted on benchmark from civil aeronautics board, Australian post, PlanetLab and Urand data sets. The results obtained allow to demonstrate the validity and the effectiveness of our approach. In terms of solution quality, the results obtained outperform the best-known solutions in the literature.
TASK-DECOMPOSITION BASED ANOMALY DETECTION OF MASSIVE AND HIGH-VOLATILITY SES...ijdpsjournal
The Science Information Network (SINET) is a Japanese academic backbone network for more than 800 universities and research institutions. The characteristic of SINET traffic is that it is enormous and highly variable. In this paper, we present a task-decomposition based anomaly detection of massive and highvolatility session data of SINET. Three main features are discussed: Tash scheduling, Traffic discrimination, and Histogramming. We adopt a task-decomposition based dynamic scheduling method to handle the massive session data stream of SINET. In the experiment, we have analysed SINET traffic from 2/27 to 3/8 and detect some anomalies by LSTM based time-series data processing.
A plethora of infinite data is generated from the Internet and other information sources. Analyzing this massive data in real-time and extracting valuable knowledge using different mining applications platforms have been an area for research and industry as well. However, data stream mining has different challenges making it different from traditional data mining. Recently, many studies have addressed the concerns on massive data mining problems and proposed several techniques that produce impressive results. In this paper, we review real time clustering and classification mining techniques for data stream. We analyze the characteristics of data stream mining and discuss the challenges and research issues of data steam mining. Finally, we present some of the platforms for data stream mining.
A talk about Virtual Reality. The terminology, how it works what is available and what is coming in the future. When I do this talk, paired with showing hands on Oculus Rift and Google Cardboard.
This ppt is html for beginners and html made easy for them to get the basic idea of html.
Html for beginners. A basic information of html for beginners. A more depth coverage of html and css will be covered in the future presentations. visit my sites http://technoexplore.blogspot.com and http://hotjobstuff.blogspot.com for some other important presentations.
This is a presentation slides for "Mosquito Attack" iPhone application.
You can get it via the following link:
http://itunes.apple.com/WebObjects/MZStore.woa/wa/viewSoftware?id=310403489&mt=8
This is an expanded version of the Makers all around you presentation. It includes new material and information with a longer time frame for presentation.
понятие как форма мышления. определение понятияklushnikovaea
Данный урок проводится в 6 классе при изучении раздела темы «Понятие – как форма мышления. Определение понятия» по образовательной программе Информатика 6 класс Босова Л.Л. в рамках раздела «Человек и информация».
На уроке учащиеся закрепляют полученные знания о содержании и объеме понятия, отношениями между понятиями, об общих подходах к сравнению понятий, получают знания об одном из приемов определения понятия через род и видовое отличие, выполняя задания, предложенные учителем в режиме интерактивной доски.
Учащиеся выполняют практическую работу в графическом редакторе Paint. В ходе ее выполнения учащиеся углубляют представления о возможностях редактирования графических изображений.
Trajectory Segmentation and Sampling of Moving Objects Based On Representativ...ijsrd.com
Moving Object Databases (MOD), although ubiquitous, still call for methods that will be able to understand, search, analyze, and browse their spatiotemporal content. In this paper, we propose a method for trajectory segmentation and sampling based on the representativeness of the (sub) trajectories in the MOD. In order to find the most representative sub trajectories, the following methodology is proposed. First, a novel global voting algorithm is performed, based on local density and trajectory similarity information. This method is applied for each segment of the trajectory, forming a local trajectory descriptor that represents line segment representativeness. The sequence of this descriptor over a trajectory gives the voting signal of the trajectory, where high values correspond to the most representative parts. Then, a novel segmentation algorithm is applied on this signal that automatically estimates the number of partitions and the partition borders, identifying homogenous partitions concerning their representativeness. Finally, a sampling method over the resulting segments yields the most representative sub trajectories in the MOD. Our experimental results in synthetic and real MOD verify the effectiveness of the proposed scheme, also in comparison with other sampling techniques.
Convincing a customer is always considered as a challenging task in every business. But when it comes to online business, this task becomes even more difficult. Online retailers try everything possible to gain the trust of the customer. One of the solutions is to provide an area for existing users to leave their comments. This service can effectively develop the trust of the customer however normally the customer comments about the product in their native language using Roman script. If there are hundreds of comments this makes difficulty even for the native customers to make a buying decision. This research proposes a system which extracts the comments posted in Roman Urdu, translate them, find their polarity and then gives us the rating of the product. This rating will help the native and non-native customers to make buying decision efficiently from the comments posted in Roman Urdu.
Semi-Supervised Discriminant Analysis Based On Data Structureiosrjce
IOSR Journal of Computer Engineering (IOSR-JCE) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of computer engineering and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications in computer technology. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
Discovering Novel Information with sentence Level clustering From Multi-docu...irjes
Specific objective to discover some novel information from a set of documents initially retrieved in response to some query. Clustering sentences level text, effective use and update is still an open research issue, especially in domain of text mining. Since most existing system uses pattern belong to a single cluster. But here we can use patterns belongs to all cluster with different degree of membership. Since sentences of those documents we would expect at least one of the clusters to be closely related to the concepts described by the query term. This paper presents a Novel Fuzzy Clustering Algorithm that operates on relational input data (i.e. data in the form of square matrix of pair wise similarities between data objects).
Robust Block-Matching Motion Estimation of Flotation Froth Using Mutual Infor...CSCJournals
In this paper, we propose a new method for the motion estimation of flotation froth using mutual information with a bin size of two as the block matching similarity metric. We also use three-step search and new-three-step-search as a search strategy. Mean sum of absolute difference (MAD) is widely considered in blocked based motion estimation. The minimum bin size selection of the proposed similarity metric also makes the computational cost of mutual information similar to MAD. Experimental results show that the proposed motion estimation technique improves the motion estimation accuracy in terms of peak signal-to-noise ratio of the reconstructed frame. The computational cost of the proposed method is almost the same as the standard machine vision methods used for the motion estimation of flotation froth.
Web users and content are increasingly being geo-positioned, and increased focus is being given
to serving local content in response to web queries. This development calls for spatial keyword queries that take
into account both the locations and textual descriptions of content. We study the efficient, joint processing of
multiple top-k spatial keyword queries. Such joint processing is attractive during high query loads and also
occurs when multiple queries are used to obfuscate a user’s true query. To propose a novel algorithm and index
structure for the joint processing of top-k spatial keyword queries. Empirical studies show that the proposed
solution is efficient on real datasets. They also offer analytical studies on synthetic datasets to demonstrate the
efficiency of the proposed solution.
c language faq's contains the frequently asked questions in c language and useful for all c programmers. This is an pdf document of Sterve_summit who is one of the best writer for c language
sctp protocol in unix and linux explained in simple and useful way in this presentation of sctp protocol. for related presentations visit www.technoexplore.blogspot.com
Here are some interview tips for cracking the interview. During this recession period it is very important.
visit my sites http://technoexplore.blogspot.com and http://hotjobstuff.blogspot.com for some other important presentations.
queue datastructure made easy and datastructure explained in java. visit http://technoexplore.blogspot.com and http://hotjobstuff.blogspot.com for some other important presentations.
Html for beginners. A basic information of html for beginners. A more depth coverage of html and css will be covered in the future presentations. visit my sites http://technoexplore.blogspot.com and http://hotjobstuff.blogspot.com for some other important presentations.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Elevating Tactical DDD Patterns Through Object CalisthenicsDorra BARTAGUIZ
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
Neuro-symbolic is not enough, we need neuro-*semantic*Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
As AI technology is pushing into IT I was wondering myself, as an “infrastructure container kubernetes guy”, how get this fancy AI technology get managed from an infrastructure operational view? Is it possible to apply our lovely cloud native principals as well? What benefit’s both technologies could bring to each other?
Let me take this questions and provide you a short journey through existing deployment models and use cases for AI software. On practical examples, we discuss what cloud/on-premise strategy we may need for applying it to our own infrastructure to get it to work from an enterprise perspective. I want to give an overview about infrastructure requirements and technologies, what could be beneficial or limiting your AI use cases in an enterprise environment. An interactive Demo will give you some insides, what approaches I got already working for real.
State of ICS and IoT Cyber Threat Landscape Report 2024 previewPrayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio, cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors, and newer malware including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
Connector Corner: Automate dynamic content and events by pushing a buttonDianaGray10
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
And...
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
The Art of the Pitch: WordPress Relationships and SalesLaura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if sometime changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
Securing your Kubernetes cluster_ a step-by-step guide to success !KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
UiPath Test Automation using UiPath Test Suite series, part 3DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation Introduction,
UI automation Sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Key Trends Shaping the Future of Infrastructure.pdfCheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud and open-source; exploring how these areas are likely to mature and develop over the short and long-term, and then considering how organisations can position themselves to adapt and thrive.
Key Trends Shaping the Future of Infrastructure.pdf
Trees Information
1. PM-tree: Pivoting Metric Tree for Similarity
Search in Multimedia Databases
Tom´ˇ Skopal1 , Jaroslav Pokorn´2 , and V´clav Sn´ˇel1
as y a as
ˇ
1
Department of Computer Science, VSB–Technical University of Ostrava,
Czech Republic {tomas.skopal, vaclav.snasel}@vsb.cz
2
Department of Software Engineering, Charles University in Prague,
Czech Republic pokorny@ksi.ms.mff.cuni.cz
Abstract. In this paper we introduce the Pivoting M-tree (PM-tree),
a metric access method combining M-tree with the pivot-based approach.
While in M-tree a metric region is represented by a hyper-sphere, in
PM-tree the shape of a metric region is determined by intersection of
the hyper-sphere and a set of hyper-rings. The set of hyper-rings for each
metric region is related to a fixed set of pivot objects. As a consequence,
the shape of a metric region bounds the indexed objects more tightly
which, in turn, significantly improves the overall efficiency of similarity
search. We present basic algorithms on PM-tree and two cost models for
range query processing. Finally, the PM-tree efficiency is experimentally
evaluated on large synthetic as well as real-world datasets.
Keywords: PM-tree, M-tree, pivot-based methods, efficient similarity search
1 Introduction
The volume of various multimedia collections worldwide rapidly increases and
the need for an efficient content-based similarity search in large multimedia
databases becomes stronger. Since a multimedia document is modelled by an
object (usually a vector) in a feature space U, the whole collection of documents
(the multimedia database) can be represented as a dataset S ⊂ U. A similarity
function is often modelled using a metric, i.e. a distance function d satisfying
reflexivity, positivity, symmetry, and triangular inequality.
Given a metric space M = (U, d), the metric access methods (MAMs) [4]
organize (or index) objects in dataset S just using the metric d. The MAMs try
to recognize a metric structure hidden in S and exploit it for an efficient search.
Common to all MAMs is that during search process the triangular inequality of
d allows to discard some irrelevant subparts of the metric structure.
2 M-tree
Among many of metric access methods developed so far, the M-tree [5, 8] (and
its modifications Slim-tree [11], M+ -tree [14]) is still the only indexing technique
suitable for an efficient similarity search in large multimedia databases.
2. The M-tree is based on a hierarchical organization of data objects Oi ∈ S
according to a given metric d. Like other dynamic and paged trees, the M-tree
structure consists of a balanced hierarchy of nodes. The nodes have a fixed capac-
ity and a utilization threshold. Within M-tree hierarchy the objects are clustered
into metric regions. The leaf nodes contain ground entries of the indexed data
objects while routing entries (stored in the inner nodes) describe the metric
regions. A ground entry looks like:
grnd(Oi ) = [Oi , oid(Oi ), d(Oi , Par(Oi ))]
where Oi ∈ S is an indexed data object, oid(Oi ) is an identifier of the original
DB object (stored externally), and d(Oi , Par(Oi )) is a precomputed distance
between Oi and the data object of its parent routing entry. A routing entry
looks like:
rout(Oi ) = [Oi , ptr(T (Oi )), rOi , d(Oi , Par(Oi ))]
where Oi ∈ S is a data object, ptr(T (Oi )) is pointer to the covering subtree,
and rOi is the covering radius. The routing entry determines a hyper-spherical
metric region in M where the object Oi is a center of that region and rOi is
a radius bounding the region. The precomputed value d(Oi , Par(Oi )) is used
for optimizing most of the M-tree algorithms. In Figure 1 a metric region and
Fig. 1. A routing entry and its metric region in the M-tree structure
its appropriate routing entry rout(Oi ) in an inner node are presented. For a
hierarchy of metric regions (routing entries rout(Oi ) respectively) the following
condition must be satisfied:
All data objects stored in leaves of covering subtree T (Oi ) of rout(Oi ) must be
spatially located inside the region defined by rout(Oi ).
Formally, having a rout(Oi ) then ∀Oj ∈ T (Oi ), d(Oi , Oj ) ≤ rOi . If we realize,
such a condition is very weak since there can be constructed many M-trees of the
same object content but of different hierarchy. The most important consequence
is that many regions on the same M-tree level may overlap. An example in
Figure 2 shows several data objects partitioned among (possibly overlapping)
metric regions and the appropriate M-tree.
3. Fig. 2. Hierarchy of metric regions and the appropriate M-tree
2.1 Similarity Queries
The structure of M-tree was designed to natively support similarity queries
(proximity queries actually). Given a query object Q, a similarity/proximity
query returns objects Oi ∈ S close to Q.
In the context of similarity search we distinguish two main types of queries. A
range query rq(Q, rQ , S) is specified as a hyper-spherical query region defined by
a query object Q and a query radius rQ . The purpose of a range query is to return
all the objects Oi ∈ S satisfying d(Q, Oi ) ≤ rQ . A k-nearest neighbours query
(k-NN query) knn(Q, k, S) is specified by a query object Q and a number k.
A k-NN query returns the first k nearest objects to Q. Technically, a k-NN
query can be implemented using a range query with dynamic query radius [8].
During a similarity query processing the M-tree hierarchy is being traversed
down. Only if a routing entry rout(Oi ) (its metric region respectively) overlaps
the query region, the covering subtree T (Oi ) of rout(Oi ) is relevant to the query
and thus further processed.
2.2 Retrieval Efficiency
The retrieval efficiency of an M-tree (i.e. the performance of a query evalua-
tion) is highly dependent on the overall volume3 of the metric regions described
by routing entries. The larger metric region volumes the higher probability of
overlap with a query region.
Recently, we have introduced two algorithms [10] leading to reduction of the
overall volume of metric regions. The first method, the multi-way dynamic in-
sertion, finds the most suitable leaf for each object to be inserted. The second
post-processing method, the generalized slim-down algorithm, tries to ”horizon-
tally” (i.e. separately for each M-tree level) redistribute all entries among more
suitable nodes.
3
We consider only an imaginary volume since there exists no universal notion of
volume in general metric spaces. However, without loss of generality, we can say
that a hyper-sphere volume grows if its covering radius increases.
4. 3 Pivoting M-tree
Each metric region of M-tree is described by a bounding hyper-sphere (defined
by a center object and a covering radius). However, the shape of hyper-spherical
region is far from optimal since it does not bound the data objects tightly to-
gether thus the region volume is too large. In other words, relatively to the
hyper-sphere volume, there is only ”few” objects spread inside the hyper-sphere
and a huge proportion of an empty space4 is covered. Consequently, for hyper-
spherical regions of large volumes the query processing becomes less efficient.
In this section we introduce an extension of M-tree, called Pivoting M-tree
(PM-tree), exploiting pivot-based ideas for metric region volume reduction.
3.1 Pivot-based Methods
Similarity search realized by pivot-based methods (e.g. AESA, LAESA) [4, 7]
follows a single general idea. A set of p objects {P1 , ..., Pt , ..., Pp } ⊂ S is selected,
called pivots (or vantage points). The dataset S (of size n) is preprocessed so as
to build a table of n · p entries, where all the distances d(Oi , Pt ) are stored for
every Oi ∈ S and every pivot Pt . When a range query rq(Q, rQ , S) is processing,
we compute d(Q, Pt ) for every pivot Pt and then try to discard such Oi that
|d(Oi , Pt ) − d(Q, Pt )| > rQ . The objects Oi which cannot be eliminated by this
rule have to be directly compared against Q.
The simple sequential pivot-based approach is suitable especially for appli-
cations where the distance d is considered expensive to compute. However, it is
obvious that the whole table of n · p entries must be sequentially loaded during
a query processing which significantly increases the disk access costs. Moreover,
each non-discarded object (i.e. an object required to be directly compared) must
be processed which means further disk access as well as computation costs.
There were developed also hierarchical pivot-based structures, e.g. the vp-tree
[13] (vantage point tree) or the mvp-tree [2] (multi vp-tree). Unfortunately, these
structures are not suitable for similarity search in large multimedia databases
since they are static (i.e. they are built in top-down manner while the whole
dataset must be available at construction time) and they are not paged (i.e. a
secondary memory management is rather complicated for them).
3.2 Structure of PM-tree
Since PM-tree is an extension of M-tree we just describe the new facts instead
of a comprehensive definition. To exploit advantages of both, the M-tree and
the pivot-based approach, we have enhanced the routing and ground entries by
a pivot-based information.
First of all, a set of p pivots Pt ∈ S must be selected. This set is fixed for
all the lifetime of a particular PM-tree index. Furthermore, a routing entry in a
4
The uselessly indexed empty space is sometimes referred as the ”dead space” [1].
5. PM-tree inner node is defined as:
routP M (Oi ) = [Oi , ptr(T (Oi )), rOi , d(Oi , Par(Oi )), HR]
The HR attribute is an array of phr hyper-rings (phr ≤ p) where the t-th hyper-
ring HR[t] is the smallest interval covering distances between the pivot Pt and
each of the objects stored in leaves of T (Oi ), i.e. HR[t] = HR[t].min, HR[t].max
where HR[t].min = min({d(Oj , Pt )}) and HR[t].max = max({d(Oj , Pt )}), for
∀Oj ∈ T (Oi ). Similarly, for a PM-tree leaf we define a ground entry as:
grndP M (Oi ) = [Oi , oid(Oi ), d(Oi , Par(Oi )), PD]
The PD attribute stands for an array of ppd pivot distances (ppd ≤ p) where the
t-th distance PD[t] = d(Oi , Pt ).
Since each hyper-ring region (Pt , HR[t]) defines a metric region containing
all the objects stored in T (Oi ), an intersection of all the hyper-rings and the
hyper-sphere forms a metric region bounding all the objects in T (Oi ) as well.
Due to the intersection with hyper-sphere the PM-tree metric region is always
smaller than the original M-tree region defined just by a hyper-sphere. For a
comparison of an M-tree region and an equivalent PM-tree region see Figure 3.
The numbers phr and ppd (both fixed for a PM-tree index lifetime) allow us to
specify the ”amount of pivoting”. Obviously, using a suitable phr > 0 and ppd > 0
the PM-tree can be tuned to achieve an optimal performance (see Section 5).
Fig. 3. (a) Region of M-tree (b) Reduced region of PM-tree (using three pivots)
3.3 Building the PM-tree
In order to keep HR and PD arrays up-to-date, the original M-tree construction
algorithms [8, 10] must be adjusted. We have to mention the adjusted algorithms
still preserve the logarithmic time complexity.
6. Object Insertion. After a data object Oi is inserted into a leaf the HR arrays
of all routing entries in the insertion path must be updated by values d(Oi , Pt ),
∀t ≤ phr . For the leaf node in insertion path the PD array of the new ground
entry must be updated by values d(Oi , Pt ), ∀t ≤ ppd .
Node Splitting. After a node is split a new HR array of the left new routing
entry is created by merging all appropriate intervals HR[t] (or computing HR in
case of a leaf split) stored in routing entries (ground entries respectively) of the
left new node. A new HR array of the right new routing entry is created similarly.
3.4 Query Processing
Before processing a similarity query the distances d(Q, Pt ), ∀t ≤ max(phr , ppd )
have to be computed. During a query processing the PM-tree hierarchy is being
traversed down. Only if the metric region of a routing entry rout(Oi ) is over-
lapping the query region (Q, rQ ), the covering subtree T (Oi ) is relevant to the
query and thus further processed. A routing entry is relevant to the query in
case that the query region overlaps all the hyper-rings stored in HR. Hence,
prior to the standard hyper-sphere overlap check (used by M-tree), the overlap
of hyper-rings HR[t] against the query region is checked as follows (note that no
additional distance computation is needed):
phr
d(Q, Pt ) − rQ ≤ HR[t].max ∧ d(Q, Pt ) + rQ ≥ HR[t].min
t=1
If the above condition is false, the subtree T (Oi ) is not relevant to the query
and thus can be discarded from further processing. On the leaf level an irrelevant
ground entry is determined such that the following condition is not satisfied:
ppd
|d(Q, Pt ) − PD[t]| ≤ rQ
t=1
In Figure 3 a range query situation is illustrated. Although the M-tree metric
region cannot be discarded (see Figure 3a), the PM-tree region can be safely
ignored since the hyper-ring HR[2] is not overlapped (see Figure 3b).
The hyper-ring overlap condition can be integrated into the original M-tree
range query as well as k-NN query algorithms. In case of range query the ad-
justment is straightforward – the hyper-ring overlap condition is combined with
the original hyper-sphere overlap condition. However, the optimal M-tree k-NN
query algorithm (based on priority queue heuristics) must be redesigned which
is a subject of our future research.
3.5 Object-to-pivot Distance Representation
In order to minimize storage volume of HR and PD arrays in PM-tree nodes,
a short representation of object-to-pivot distance is required. We can represent
7. interval HR[t] by two 4-byte reals and a pivot distance PD[t] by one 4-byte real.
However, when (a part of) the dataset is known in advance we can approximate
the 4-byte representation by a 1-byte code. For this reason a distance distribu-
tion histogram for each pivot is created by random sampling of objects from
the dataset and comparing them against the pivot. Then a distance interval
dmin , dmax is computed so that most of the histogram distances fall into the
interval, see an example in Figure 4 (the d+ value is an (estimated) maximum
distance of a bounded metric space M).
Fig. 4. Distance distribution histogram, 90% distances in interval dmin , dmax
Distance values in HR and PD are scaled into interval dmin , dmax as 1-byte
codes. Using 1-byte codes the storage savings are considerable. As an example,
for phr = 50 together with using 4-byte distances, the hyper-rings stored in an
inner node having capacity 30 entries will consume 30 · 50 · 2 · 4 = 12000 bytes
while by using 1-byte codes the hyper-rings will take 30 · 50 · 2 · 1 = 3000 bytes.
3.6 Selecting the Pivots
The methods of selecting an optimal set of pivots have been intensively studied
[9, 3] while, in general, we can say that a set of pivots is optimal such that dis-
tances among pivots are maximal (close pivots give almost the same information)
and the pivots are located outside the data clusters.
In the context of PM-tree the optimal set of pivots causes that the M-tree
hyper-spherical region is effectively ”chopped off” by hyper-rings so that the
smallest overall volume of PM-tree regions (considering the volume of intersec-
tion of hyper-rings and the hyper-sphere) is obtained.
In experiments presented in Section 5 we have used a cheap but effective
method which samples N groups of p pivots from the dataset S at random. The
group is selected for which the sum of distances among pivots is maximal.
8. 4 Range Query Cost Models
In this section we present a node-based and a level-based cost models for range
query processing in PM-tree, allowing to predict the PM-tree retrieval perfor-
mance. Since PM-tree is an extension of M-tree, we have extended the original
cost models developed for M-tree [6]. Like the M-tree cost models, the PM-tree
cost models are conditioned by the following assumptions:
– The only information used is (an estimate of) the distance distribution of ob-
jects in a given dataset since no information about data distribution is known.
– A biased query model is considered, i.e. the distribution of query objects is
equal to that of data objects.
– The dataset is supposed to have high ”homogeneity of viewpoints” (for de-
tails we refer to [6, 8]).
The basic tool used in the cost models is a probability estimation that two
hyper-spheres overlap, i.e. (using triangular inequality of d)
Pr{spheres (O1 , rO1 ) and (O2 , rO2 ) overlapped} = Pr{d(O1 , O2 ) ≤ rO1 + rO2 }
where O1 , O2 are center objects and rO1 , rO2 are radii of the hyper-spheres.
For this purpose the overall distance distribution function is used, defined as:
F (x) = Pr{d(Oi , Oj ) ≤ x}, ∀Oi , Oj ∈ U
and also the relative distance distribution function is used, defined as:
FOk (x) = Pr{d(Ok , Oi ) ≤ x}, Ok ∈ U, ∀Oi ∈ U
For an approximate F(Ok ) evaluation a set O of s objects Oi ∈ S is sampled.
The F is computed using the s × s matrix of pairwise distances between objects
in O. For the FOk evaluation only the vector of s distances d(Oi , Ok ) is needed.
4.1 Node-based Cost Model
In the node-based cost model (NB-CM) a probability of access to each PM-tree
node is predicted. Basically, a node N is accessed if its metric region (described
by the parent routing entry of N) is overlapping the query hyper-sphere (Q, rQ ):
Pr{node N is accessed} = Pr{metric region of N is overlapped by (Q, rQ )}
Specifically, a PM-tree node N is accessed if its metric region (defined by a
hyper-sphere and phr hyper-rings) overlaps the query hyper-sphere:
phr
Pr{N is accessed} = Pr{hyper-sphere is inter.}· Pr{t-th hyper-ring is inter.}
t=1
9. and finally (for query radius rQ and the parent routing entry of N )
phr
Pr{N accessed} ≈ F (rN +rQ )· FPt (HRN [t].max+rQ )·(1−FPt (rQ −HRN [t].min))
t=1
To determine the estimated disk access costs (DAC) for a range query, it is
sufficient to sum the above probabilities over all the m nodes in the PM-tree:
phr
m
FPt (HRNi [t].max+rQ )·(1−FPt (rQ −HRNi [t].min))
DAC = F (rNi +rQ )·
t=1
i=1
The computation costs (CC) are estimated considering the probability that
a node is accessed multiplied by the number of its entries, e(Ni ), thus obtaining
phr
m
FPt (HRNi [t].max+rQ )·(1−FPt (rQ −HRNi [t].min))
CC = e(Ni )·F (rNi +rQ )·
t=1
i=1
4.2 Level-based Cost Model
The problem with NB-CM is that maintaining statistics for every node is very
time consuming when the PM-tree index is large. To overcome this, we consider a
simplified level-based cost model (LB-CM) which uses only average information
collected for each level of the PM-tree. For each level l of the tree (l = 1 for root
level, l = L for leaf level), LB-CM uses this information: ml (the number of nodes
at level l), rl (the average value of covering radius considering all the nodes at
level l), HRl [i].min and HRl [i].max (the average information about hyper-rings
considering all the nodes at level l). Given these statistics, the number of nodes
accessed by a range query can be estimated as
phr
L
DAC ≈ ml · F (rl + rQ ) · FPt (HRl [t].max + rQ ) · (1 − FPt (rQ − HRl [t].min))
t=1
l=1
Similarly, we can estimate computation costs as
phr
L
CC ≈ ml+1 ·F (rl +rQ )· FPt (HRl [t].max+rQ )·(1−FPt (rQ −HRl [t].min))
t=1
l=1
def
where mL+1 = n is the number of indexed objects.
4.3 Experimental Evaluation
In order to evaluate accuracy of the presented cost models we have made several
experiments on a synthetic dataset. The dataset consisted of 10,000 10-dimensional
tuples (embedded inside unitary hyper-cube) uniformly distributed among 100
√
+
L2 -spherical clusters of diameter d (where d+ = 10). The labels ”PM-tree(x,y)”
10
in the graphs below are described in Section 5.
10. Fig. 5. Number of pivots, query sel. 200 objs. (a) Disk access costs (b)
Computation costs
The first set of experiments investigated the accuracy of estimates according to
the increasing number of pivots used by the PM-tree. The range query selectivity
(the average number of objects in the query result) was set to 200. In Figure
5a the estimated DAC as well as the real DAC are presented. The relative error
of NB-CM estimates is below 0.2. Surprisingly, the relative error of LB-CM
estimates is smaller than for NB-CM, below 0.15. The estimates of computation
costs, presented in Figure 5b, are even more accurate than for the DAC estimates,
below 0.05 (for NB-CM) and 0.04 (for LB-CM).
The second set of experiments was focused on the accuracy of estimates according
to the increasing query selectivity. The relative error of NB-CM DAC estimates
(see Figure 6a) is below 0.1. Again, the relative error of LB-CM estimates is very
small, below 0.02. The error of computation costs (see Figure 6b) is below 0.07
(for NB-CM) and 0.05 (for LB-CM).
5 Experimental Results
In order to evaluate the overall PM-tree performance we present some results
of experiments made on large synthetic as well as real-world vector datasets. In
most of the experiments the retrieval efficiency of range query processing was
examined. The query objects were randomly selected from the respective dataset
while each particular query test consisted of 1000 range queries of the same
query selectivity. The results were averaged. Euclidean (L2 ) metric was used.
The experiments were aimed to compare PM-tree with M-tree – a comparison
with other MAMs was out of scope of this paper.
11. Fig. 6. Query selectivity: (a) Disk access costs (b) Computation costs
Abbreviations in Figures. Each label of form ”PM-tree(x,y)” stands for
a PM-tree index where phr = x and ppd = y. A label ”<index> + SlimDown”
denotes an index subsequently post-processed using the slim-down algorithm
(for details about the slim-down algorithm we refer to [10]).
5.1 Synthetic Datasets
For the first set of experiments a collection of 8 synthetic vector datasets of
increasing dimensionality (from D = 4 to D = 60) was generated. Each dataset
(embedded inside unitary hyper-cube) consisted of 100,000 D-dimensional tuples
uniformly distributed within 1000 L2 -spherical uniformly distributed clusters.
√
+
The diameter of each cluster was d where d+ = D. These datasets were
10
indexed by PM-tree (for various phr and ppd ) as well as by M-tree. Some statistics
about the created indices are described in Table 1 (for explanation see [10]).
Table 1. PM-tree index statistics (synthetic datasets)
Construction methods: SingleWay + MinMax (+ SlimDown)
Dimensionalities: 4,8,16,20,30,40,50,60 Inner node capacities: 10 – 28
Index file sizes: 4.5 MB – 55 MB Leaf node capacities: 16 – 36
Pivot file sizes5 : 2 KB – 17 KB Avg. node utilization: 66%
Node (disk page) sizes: 1 KB (D = 4, 8), 2 KB (D = 16, 20), 4 KB (D ≥ 30)
5
Access costs to the pivot files, storing pivots Pt and the scaling intervals for all pivots
(see Section 3.5), were not considered because of their negligible sizes.
12. Fig. 7. Construction costs (30D indices): (a) Disk access costs (b) Computation costs
Index construction costs (for 30-dimensional indices) according to the increasing
number of pivots are presented in Figure 7. The disk access costs for PM-tree
indices with up to 8 pivots are similar to those of M-tree index (see Figure 7a).
For PM-tree(128, 0) and PM-tree(128, 28) indices the DAC are about 1.4 times
higher than for the M-tree index. The increasing trend of computation costs
(see Figure 7b) depends mainly on the p object-to-pivot distance computations
made during each object insertion – additional computations are needed after
leaf splitting in order to create HR arrays of the new routing entries.
Fig. 8. Number of pivots (30-dim. indices, query selectivity 50 objs.): (a) DAC (b) CC
13. In Figure 8 the range query costs (for 30-dimensional indices and query selec-
tivity 50 objects) according to the number of pivots are presented. The DAC
rapidly decrease with the increasing number of pivots. The PM-tree(128, 0) and
PM-tree(128, 28) indices need only 27% of DAC spent by the M-tree index.
Moreover, the PM-tree is superior even after the slim-down algorithm post-
processing, e.g. the ”slimmed” PM-tree(128, 0) index needs only 23% of DAC
spent by the ”slimmed” M-tree index (and only 6.7% of DAC spent by the or-
dinary M-tree). The decreasing trend of computation costs is even more steep
than for DAC, the PM-tree(128, 28) index needs only 5.5% of the M-tree CC.
Fig. 9. Dimensionality (query selectivity 50 objects): (a) Disk access costs (b) Com-
putation costs
The influence of increasing dimensionality D is depicted in Figure 9. Since
the disk page sizes for different indices vary, the DAC as well as the CC are
related (in percent) to the DAC (CC resp.) of M-tree indices. For 8 ≤ D ≤ 40
the DAC stay approximately fixed, for D > 40 the DAC slightly increase.
5.2 Image Database
For the second set of experiments a collection of about 10,000 web-crawled images
[12] was used. Each image was converted into a 256-level gray scale and a fre-
quency histogram was extracted. For indexing the histograms (256-dimensional
vectors actually) were used together with the euclidean metric. The statistics
about image indices are described in Table 2.
14. Table 2. PM-tree index statistics (image database)
Construction methods: SingleWay + MinMax (+ SlimDown)
Dimensionality: 256 Inner node capacities: 10 – 31
Index file sizes: 16 MB – 20 MB Leaf node capacities: 29 – 31
Pivot file sizes: 4 KB – 1 MB Avg. node utilization: 67%
Node (disk page) size: 32 KB
In Figure 10a the DAC for increasing number of pivots are presented. We can
see that e.g. the ”slimmed” PM-tree(1024,50) index consumes only 42% of DAC
spent by the ”slimmed” M-tree index. The computation costs (see Figure 10b)
for p ≤ 64 decrease (down to 36% of M-tree CC). However, for p > 64 the overall
computation costs grow since the number of necessarily computed query-to-pivot
distances (i.e. p distance computations for each query) is proportionally too large.
Nevertheless, this fact is dependent on the database size – obviously, for 100,000
objects (images) the proportion of p query-to-pivot distance computations would
be smaller when compared with the overall computation costs.
Fig. 10. Number of pivots (query selectivity 50 objects): (a) DAC (b) CC
Finally, the costs according to the increasing range query selectivity are pre-
sented in Figure 11. The disk access costs stay below 73% of M-tree DAC (below
58% in case of ”slimmed” indices) while the computation costs stay below 43%
(49% respectively).
15. Fig. 11. Query selectivity: (a) Disk access costs (b) Computation costs
5.3 Summary
The experiments on synthetic datasets and mainly on the real dataset have
demonstrated the general benefits of PM-tree. The index construction (object
insertion respectively) is dynamic and still preserves the logarithmic time com-
plexity. For suitably high phr and ppd the index size growth is minor. This is
true especially for high-dimensional datasets (e.g. the 256-dimensional image
dataset) where the size of pivoting information stored in ground/routing entries
is negligible when compared with the size of the data object (i.e. vector) it-
self. A particular (but not serious) limitation of the PM-tree is that a part of
the dataset must be known in advance (for the choice of pivots and when used
object-to-pivot distance distribution histograms).
Furthermore, the PM-tree could serve as a constructionally much cheaper
alternative to the slim-down algorithm on M-tree – the above presented experi-
mental results have shown that the retrieval performance of PM-tree (with suffi-
ciently high phr , ppd ) is comparable or even better than an equivalent ”slimmed”
M-tree. Finally, a combination of PM-tree and the slim-down algorithm makes
the PM-tree a very efficient metric access method.
6 Conclusions and Outlook
In this paper the Pivoting M-tree (PM-tree) was introduced. The PM-tree com-
bines M-tree hierarchy of metric regions with the idea of pivot-based methods.
The result is a flexible metric access method providing even more efficient simi-
larity search than the M-tree. Two cost models for range query processing were
proposed and evaluated. Experimental results on synthetic as well as real-world
datasets have shown that PM-tree is more efficient when compared with the
M-tree.
16. In the future we plan to develop new PM-tree construction algorithms ex-
ploiting the pivot-based information. Second, an optimal PM-tree k-NN query
algorithm has to be designed and a cost model formulated. Finally, we would like
to modify several spatial access methods by utilizing the pivoting information,
in particular the R-tree family.
ˇ
This research has been partially supported by grant Nr. GACR 201/00/1031 of
the Grant Agency of the Czech Republic.
References
1. C. B¨hm, S. Berchtold, and D. Keim. Searching in High-Dimensional Spaces –
o
Index Structures for Improving the Performance of Multimedia Databases. ACM
Computing Surveys, 33(3):322–373, 2001.
¨
2. T. Bozkaya and M. Ozgsoyoglu. Distance-based indexing for high-dimensional
metric spaces. In Proceedings of the 1997 ACM SIGMOD International Conference
on Management of Data, Tuscon, AZ, pages 357–368, 1997.
3. B. Bustos, G. Navarro, and E. Ch´vez. Pivot selection techniques for proximity
a
searching in metric spaces. Pattern Recognition Letters, 24(14):2357–2366, 2003.
4. E. Ch´vez, G. Navarro, R. Baeza-Yates, and J. Marroqu´ Searching in Metric
a ın.
Spaces. ACM Computing Surveys, 33(3):273–321, 2001.
5. P. Ciaccia, M. Patella, and P. Zezula. M-tree: An Efficient Access Method for
Similarity Search in Metric Spaces. In Proceedings of the 23rd Athens Intern.
Conf. on VLDB, pages 426–435. Morgan Kaufmann, 1997.
6. P. Ciaccia, M. Patella, and P. Zezula. A cost model for similarity queries in metric
spaces. In Proceedings of the Seventeenth ACM SIGACT-SIGMOD-SIGART Sym-
posium on Principles of Database Systems, June 1-3, 1998, Seattle, Washington,
pages 59–68. ACM Press, 1998.
7. L. Mico, J. Oncina, and E. Vidal. A new version of the nearest-neighbor approxi-
mating and eliminating search (aesa) with linear preprocessing-time and memory
requirements. Pattern Recognition Letters, 15:9–17, 1994.
8. M. Patella. Similarity Search in Multimedia Databases. PhD thesis, Dipartmento
di Elettronica Informatica e Sistemistica, Bologna, 1999.
9. M. Shapiro. The choice of reference points in best-match file searching. Commu-
nications of the ACM, 20(5):339–343, 1977.
10. T. Skopal, J. Pokorn´, M. Kr´tk´, and V. Sn´ˇel. Revisiting M-tree Building
y ay as
Principles. In ADBIS 2003, LNCS 2798, Springer-Verlag, Dresden, Germany,
2003.
11. C. Traina Jr., A. Traina, B. Seeger, and C. Faloutsos. Slim-Trees: High perfor-
mance metric trees minimizing overlap between nodes. Lecture Notes in Computer
Science, 1777, 2000.
12. WBIIS project: Wavelet-based Image Indexing and Searching, Stanford University,
http://wang.ist.psu.edu/.
13. P. N. Yanilos. Data Structures and Algorithms for Nearest Neighbor Search in
General Metric Spaces. In Proceedings of Fourth Annual ACM/SIGACT-SIAM
Symposium on Discrete Algorithms - SODA, pages 311–321, 1993.
14. X. Zhou, G. Wang, J. Y. Xu, and G. Yu. M+-tree: A New Dynamical Multidi-
mensional Index for Metric Spaces. In Proceedings of the Fourteenth Australasian
Database Conference - ADC’03, Adelaide, Australia, 2003.