Scalable and efficient cluster based framework for multidimensional indexing (eSAT Journals)
Abstract: Indexing high-dimensional data is useful in many real-world applications; in particular, it can dramatically improve information retrieval. Existing techniques address the "curse of dimensionality" of high-dimensional data sets with the Vector Approximation File (VA-File), but this yields sub-optimal performance. Compared with the VA-File, clustering produces a more compact representation of the data set because it exploits inter-dimensional correlations. However, unpromising clusters must be pruned, and existing pruning techniques based on bounding rectangles or bounding hyperspheres perform poorly in nearest-neighbor (NN) search. To overcome this problem, Ramaswamy and Rose proposed adaptive cluster distance bounding for high-dimensional indexing, which also includes efficient spatial filtering. In this paper we implement this high-dimensional indexing approach and build a prototype application as a proof of concept. Experimental results are encouraging, and the prototype can be used in real-time applications. Index Terms: clustering, high-dimensional indexing, similarity measures, multimedia databases.
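As a concrete illustration of cluster-based pruning, the sketch below implements an exact nearest-neighbour search that visits clusters in order of a simple lower bound (distance to the centroid minus the cluster radius, from the triangle inequality). This is deliberately simpler than Ramaswamy and Rose's adaptive hyperplane-based bound; all names and parameters are illustrative, not taken from the paper.

```python
import numpy as np

def cluster_nn_search(X, labels, centroids, query):
    """Exact NN search that visits clusters in order of a distance
    lower bound and prunes clusters that cannot beat the best match.
    Lower bound used: ||q - centroid|| - radius (triangle inequality);
    the paper's adaptive bound is tighter, this is only a sketch."""
    k = centroids.shape[0]
    # per-cluster radius = max distance of a member to its centroid
    radii = np.array([
        np.linalg.norm(X[labels == c] - centroids[c], axis=1).max()
        for c in range(k)
    ])
    d_cent = np.linalg.norm(centroids - query, axis=1)
    lower = np.maximum(d_cent - radii, 0.0)
    best_d, best_i = np.inf, -1
    for c in np.argsort(lower):          # most promising cluster first
        if lower[c] >= best_d:           # no member of c can be closer
            break
        idx = np.where(labels == c)[0]
        d = np.linalg.norm(X[idx] - query, axis=1)
        j = d.argmin()
        if d[j] < best_d:
            best_d, best_i = d[j], idx[j]
    return best_i, best_d
```

Because the bound is a true lower bound, the search remains exact while typically scanning only a fraction of the clusters.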
IJRET: International Journal of Research in Engineering and Technology is an international, peer-reviewed online journal published by eSAT Publishing House for the enhancement of research in various disciplines of engineering and technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of engineering and technology. We bring together scientists, academicians, field engineers, scholars and students of related fields of engineering and technology.
Outlier detection is an interesting, useful and challenging problem in data mining. Because spatial data are sparse, distance-based clustering algorithms fail to find outliers in them, and the problem of finding irregular features in spatial data still needs to be explored. Many approaches have been proposed for outlier detection in spatial geographic data. In this paper an efficient clustering- and density-based outlier detection framework is proposed. The process is divided into two steps: first, the data are clustered with the density-based DBSCAN algorithm; second, outlier detection is performed using the Local Outlier Factor (LOF). The purpose is to perform clustering and outlier mining together to improve the feasibility of the framework. To verify the efficiency and robustness of the proposed method, a detailed comparative study of the proposed approach and several existing approaches is presented, and various simulation results demonstrate the effectiveness of the proposed approach.
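The two-stage pipeline described above (density-based clustering, then LOF scoring) can be sketched with scikit-learn, assuming that library is available; `eps`, `min_samples`, `n_neighbors` and the LOF threshold below are illustrative values, not parameters from the paper.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import LocalOutlierFactor

def detect_outliers(X, eps=0.5, min_samples=5, n_neighbors=10, lof_threshold=1.5):
    """Two-stage spatial outlier detection: DBSCAN groups dense regions
    (flagging noise points with label -1), then LOF scores every point
    so that points in sparse neighbourhoods get a high outlier factor."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    lof = LocalOutlierFactor(n_neighbors=n_neighbors)
    lof.fit(X)
    scores = -lof.negative_outlier_factor_   # >1 means sparser than neighbours
    outliers = (labels == -1) & (scores > lof_threshold)
    return labels, scores, outliers
```

A point is reported only when both stages agree: DBSCAN marks it as noise and its LOF score is clearly above 1.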
COLOCATION MINING IN UNCERTAIN DATA SETS: A PROBABILISTIC APPROACH (IJCI JOURNAL)
In this paper we investigate the colocation mining problem in the context of uncertain data. Uncertain data are only partially complete, and much real-world data is uncertain, for example demographic data, sensor network data and GIS data. Handling such data is a challenge for knowledge discovery, particularly in colocation mining. One straightforward method is to find the probabilistic prevalent colocations (PPCs), i.e., all colocations likely to be generated by a random world. To do this we first apply an approximation error when finding the PPCs, which reduces the computations. Next, we enumerate the possible worlds, split them into two groups, and compute the prevalence probability of each candidate. This probability is compared with a minimum probability threshold to decide whether the candidate is a probabilistic prevalent colocation. Experimental results on the selected data set show a significant improvement in computation time over some of the existing colocation mining methods.
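A minimal possible-worlds sketch of the prevalence-probability test described above: each uncertain instance of a candidate colocation {A, B} exists with its own probability, all 2^n worlds are enumerated, and the probabilities of the worlds whose participation index reaches `min_prev` are summed. The paper's approximation tricks are omitted, and the data layout (a list of uncertain A-B neighbour pairs) is an assumption for illustration only.

```python
from itertools import product

def prevalence_probability(instances, n_a, n_b, min_prev):
    """Probability, over all possible worlds, that colocation {A, B} is
    prevalent.  `instances` is a list of (a_id, b_id, p) rows: an
    uncertain neighbouring pair of an A-object and a B-object that
    exists with probability p.  The participation index of a world is
    min(|distinct a ids| / n_a, |distinct b ids| / n_b); the pattern is
    prevalent in a world when that index reaches min_prev."""
    prob = 0.0
    n = len(instances)
    for world in product([0, 1], repeat=n):   # enumerate all 2^n worlds
        w_prob = 1.0
        a_ids, b_ids = set(), set()
        for bit, (a, b, p) in zip(world, instances):
            w_prob *= p if bit else (1.0 - p)
            if bit:
                a_ids.add(a)
                b_ids.add(b)
        pi = min(len(a_ids) / n_a, len(b_ids) / n_b)
        if pi >= min_prev:
            prob += w_prob
    return prob
```

The candidate is then declared a PPC when this probability meets the chosen minimum probability threshold; the exponential enumeration is exactly the cost the paper's approximation error is meant to cut.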
Robust Block-Matching Motion Estimation of Flotation Froth Using Mutual Information (CSCJournals)
In this paper, we propose a new method for the motion estimation of flotation froth that uses mutual information, computed with a bin size of two, as the block-matching similarity metric, together with three-step search and new three-step search as the search strategies. The mean of absolute differences (MAD) is widely used in block-based motion estimation; selecting the minimum bin size makes the computational cost of mutual information comparable to MAD. Experimental results show that the proposed technique improves motion estimation accuracy in terms of the peak signal-to-noise ratio of the reconstructed frame, while its computational cost remains almost the same as that of the standard machine vision methods used for the motion estimation of flotation froth.
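The 2-bin mutual-information similarity can be sketched as follows; quantising a greyscale block into two bins at threshold 128 is an assumed choice for illustration, not a detail taken from the paper.

```python
import numpy as np

def mutual_information_2bin(block_a, block_b):
    """Mutual information between two greyscale blocks, each quantised
    into 2 intensity bins (below/above 128), as a block-matching
    similarity metric.  With 2 bins the joint histogram is only 2x2,
    which keeps the cost close to that of a MAD computation."""
    a = (block_a >= 128).astype(int).ravel()
    b = (block_b >= 128).astype(int).ravel()
    joint = np.zeros((2, 2))
    np.add.at(joint, (a, b), 1)       # 2x2 joint histogram
    joint /= a.size
    pa = joint.sum(axis=1)            # marginals
    pb = joint.sum(axis=0)
    mi = 0.0
    for i in range(2):
        for j in range(2):
            if joint[i, j] > 0:
                mi += joint[i, j] * np.log2(joint[i, j] / (pa[i] * pb[j]))
    return mi
```

A block-matching search (e.g., three-step search) would then maximise this score over candidate displacements instead of minimising MAD.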
TERRAIN IDENTIFICATION USING CO-CLUSTERED MODEL OF THE SWARM INTELLIGENCE & S... (cscpconf)
A digital image is nothing more than data: numbers indicating variations of red, green, and blue at particular locations on a grid of pixels. Clustering is the process of assigning data objects to a set of disjoint groups, called clusters, so that objects in each cluster are more similar to each other than to objects in different clusters. Clustering techniques are applied in many areas, such as pattern recognition, data mining, and machine learning. Clustering algorithms can be broadly classified as hard, fuzzy, possibilistic, and probabilistic. K-means is one of the most popular hard clustering algorithms; it partitions data objects into k clusters, where the number of clusters k is decided in advance according to the purpose of the application. This model is inappropriate for real data sets in which there are no definite boundaries between the clusters. After Lotfi Zadeh introduced fuzzy theory, researchers applied it to clustering: fuzzy algorithms can assign a data object partially to multiple clusters, with the degree of membership depending on the closeness of the object to the cluster centers. The most popular fuzzy clustering algorithm is fuzzy c-means (FCM), introduced by Bezdek in 1974 and now widely used. FCM is an effective algorithm, but the random selection of initial center points makes the iterative process fall easily into local optima. To address this problem, evolutionary algorithms such as genetic algorithms (GA), simulated annealing (SA), ant colony optimization (ACO), and particle swarm optimization (PSO) have recently been applied with success.
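The fuzzy c-means iteration described above can be sketched in a few lines of NumPy. The optional `init` parameter makes the centre initialisation explicit, since the random choice of initial centres is exactly the weakness the text notes (and what the evolutionary algorithms are brought in to fix); `m`, `n_iter` and the seed are illustrative defaults.

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, n_iter=100, seed=0, init=None):
    """Plain fuzzy c-means: alternate the membership update
    u_ik = 1 / sum_j (d_ik / d_jk)^(2/(m-1)) with the weighted centre
    update v_i = sum_k u_ik^m x_k / sum_k u_ik^m."""
    rng = np.random.default_rng(seed)
    if init is None:
        # the random initial-centre choice the text blames for local optima
        init = X[rng.choice(len(X), size=c, replace=False)]
    centers = init.astype(float).copy()
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        d = np.fmax(d, 1e-12)                      # avoid division by zero
        inv = d ** (-2.0 / (m - 1.0))
        u = inv / inv.sum(axis=1, keepdims=True)   # memberships, rows sum to 1
        um = u ** m
        centers = (um.T @ X) / um.sum(axis=0)[:, None]
    return centers, u
```

Each row of `u` gives the fractional membership of one data object across all clusters, which is the behaviour that distinguishes FCM from hard k-means.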
Data Hiding Method With High Embedding Capacity Character (CSCJournals)
Recently, Kuo et al. [6] proposed a data hiding method with high embedding capacity based on an improved EMD method. They claimed that their scheme can hide a great deal of secret data while keeping high security and good image quality. However, in their scheme the sender and the receiver must share a synchronized random secret seed before transmitting the stego-image to each other; otherwise, the receiver cannot recover the correct secret information from the stego-image. In this paper we propose an improved scheme based on EMD and the LSB matching method that overcomes this problem: the sender does not have to share a synchronized random secret seed with the receiver before the stego-image is transmitted. The experimental results show that our proposed scheme achieves high embedding capacity and acceptable stego-image quality.
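For illustration, plain LSB matching (one ingredient of the proposed scheme; this sketch is not the paper's EMD-based method) can be written as below. The key property is that extraction reads only the least significant bits, so no random seed needs to be shared.

```python
import numpy as np

def lsb_match_embed(pixels, bits, seed=42):
    """LSB matching: when a pixel's LSB already equals the secret bit it
    is left alone; otherwise the pixel is randomly incremented or
    decremented by 1, which also fixes the LSB but avoids the detectable
    value-pairing artefacts of plain LSB replacement."""
    stego = pixels.astype(int).copy()
    rng = np.random.default_rng(seed)
    for i, bit in enumerate(bits):
        if stego[i] % 2 != bit:
            step = rng.choice([-1, 1])
            if stego[i] == 0:       # stay inside the valid 8-bit range
                step = 1
            elif stego[i] == 255:
                step = -1
            stego[i] += step
    return stego

def lsb_extract(stego, n_bits):
    """The receiver only needs the LSBs, no shared random seed."""
    return [int(p % 2) for p in stego[:n_bits]]
```

Each carrier pixel changes by at most 1, so the stego-image quality stays high while one secret bit per pixel is embedded.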
EFFICIENT INDEX FOR VERY LARGE DATASETS WITH HIGHER DIMENSION (AM Publications, India)
The main aim of this paper is to develop a new dynamic indexing structure, the NewTree, that supports very large datasets and high dimensionality. The structure is tree-based, facilitates efficient access, and is highly adaptable to many types of applications. It supports nearest-neighbor queries without linearly scanning the very large dataset, thereby minimizing the adverse effects of the curse of dimensionality, under which most existing indexing techniques degrade rapidly as dimensionality grows; the major bottleneck is retrieving subsets from a huge storage system. The NewTree handles insertions efficiently and effectively: when new data are added, the shape of the structure does not change. The performance of the new structure is evaluated against the SR-tree, an existing indexing structure, and the results clearly show that the new structure is superior to the SR-tree in both time and memory complexity.
Principle Component Analysis Based on Optimal Centroid Selection Model for Su... (ijtsrd)
Clustering large, sparse, large-scale data is an open research problem in data mining. Discovering significant information through clustering algorithms is often inadequate because most of the data turns out to be non-actionable, and existing clustering techniques are not feasible for time-varying data in high-dimensional spaces. Subspace clustering answers these problems by incorporating domain knowledge and parameter-sensitive prediction; the sensitivity of the data is also predicted through a thresholding mechanism. Usability and usefulness in 3D subspace clustering are important open issues, and determining the correct dimensions is an inconsistent and challenging problem. The resulting solutions can help police departments and law enforcement organisations better understand stock issues and provide insights that enable them to track activities and predict their likelihood. In this paper we propose a centroid-based subspace forecasting framework with constraints, i.e., must-link and must-not-link constraints derived from domain knowledge. In the unsupervised subspace clustering algorithm, inconsistent constraints correlated with dimensions are resolved through singular value decomposition. Principal component analysis is used to estimate how actionable particular attributes are, and domain knowledge is used to refine and validate the optimal centroids dynamically. Experimental results show that the proposed framework outperforms competing subspace clustering techniques in terms of efficiency, F-measure, parameter insensitivity, and accuracy.
G. Raj Kamal, A. Deepika, D. Pavithra, J. Mohammed Nadeem, V. Prasath Kumar, "Principle Component Analysis Based on Optimal Centroid Selection Model for SubSpace Clustering Model", International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume 4, Issue 4, June 2020. URL: https://www.ijtsrd.com/papers/ijtsrd31374.pdf
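A rough sketch of two of the pipeline's ingredients, PCA-based dimension reduction followed by a deliberate (non-random) centroid selection feeding k-means. The farthest-first heuristic here is an assumed stand-in for the paper's optimal centroid selection model, used only to make the idea concrete.

```python
import numpy as np

def pca_reduce(X, k):
    """Project data onto its top-k principal components
    (via SVD of the centred data)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

def farthest_first_centroids(Z, c, seed=0):
    """Deterministic-ish centroid selection: first centre at random,
    each further centre the point farthest from all chosen ones
    (an assumed stand-in for the paper's centroid selection model)."""
    rng = np.random.default_rng(seed)
    idx = [rng.integers(len(Z))]
    for _ in range(c - 1):
        d = np.min([np.linalg.norm(Z - Z[i], axis=1) for i in idx], axis=0)
        idx.append(int(d.argmax()))
    return Z[idx]

def kmeans(Z, centers, n_iter=50):
    """Lloyd iterations from the chosen centroids."""
    for _ in range(n_iter):
        labels = np.linalg.norm(Z[:, None] - centers[None], axis=2).argmin(axis=1)
        centers = np.array([Z[labels == j].mean(axis=0)
                            if (labels == j).any() else centers[j]
                            for j in range(len(centers))])
    return labels, centers
```

Reducing to the informative subspace first means the centroid selection and the distance computations both ignore the noise dimensions.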
A PSO-Based Subtractive Data Clustering Algorithm (IJORCS)
There is a tremendous proliferation in the amount of information available on the largest shared information source, the World Wide Web. Fast, high-quality clustering algorithms play an important role in helping users effectively navigate, summarize, and organize this information. Recent studies have shown that partitional clustering algorithms such as k-means are the most popular algorithms for clustering large datasets. Their major problem is that they are sensitive to the selection of the initial partitions and prone to premature convergence to local optima. Subtractive clustering is a fast, one-pass algorithm for estimating the number of clusters and the cluster centers for a given set of data; these estimates can be used to initialize iterative optimization-based clustering methods and model identification methods. In this paper, we present a hybrid Subtractive + PSO (particle swarm optimization) clustering algorithm that performs fast clustering. For comparison, we applied the Subtractive + PSO clustering algorithm, PSO, and subtractive clustering to three different datasets. The results illustrate that the Subtractive + PSO algorithm generates the most compact clustering results of the three.
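The subtractive clustering stage itself can be sketched as the one-pass potential-reduction loop below, following Chiu's classic formulation with neighbourhood radii `ra` and `rb = 1.5 ra`. The stopping rule used here (stop when the remaining peak falls below `eps` times the first peak) is a simplified variant of the usual accept/reject thresholds, chosen for brevity.

```python
import numpy as np

def subtractive_clustering(X, ra=1.0, eps=0.15):
    """One-pass subtractive clustering: each point's potential is a sum
    of Gaussian contributions from all points; the highest-potential
    point becomes a centre, that centre's influence is subtracted from
    every potential, and the loop stops when the remaining peak drops
    below eps times the first peak.  The centres can then seed k-means
    or a PSO-based refinement."""
    rb = 1.5 * ra
    alpha = 4.0 / ra ** 2
    beta = 4.0 / rb ** 2
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    potential = np.exp(-alpha * sq).sum(axis=1)
    p_first = potential.max()
    centers = []
    while True:
        i = potential.argmax()
        if potential[i] < eps * p_first:
            break
        centers.append(X[i])
        # revise potentials so points near this centre cannot
        # become centres themselves
        potential = potential - potential[i] * np.exp(-beta * sq[i])
    return np.array(centers)
```

Because the number of centres falls out of the loop rather than being fixed in advance, this is a natural initializer for the PSO stage.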
SCAF – AN EFFECTIVE APPROACH TO CLASSIFY SUBSPACE CLUSTERING ALGORITHMS (ijdkp)
Subspace clustering discovers the clusters embedded in multiple, overlapping subspaces of high-dimensional data. Many significant subspace clustering algorithms exist, each with different characteristics arising from the techniques, assumptions, and heuristics used. A comprehensive classification scheme is needed that considers all such characteristics to divide subspace clustering approaches into families, where the algorithms belonging to the same family satisfy common characteristics. Such a categorization will help future developers better understand the quality criteria to use and the similar algorithms against which to compare their proposed clustering algorithms. In this paper, we first propose the concept of SCAF (Subspace Clustering Algorithms' Family), whose characteristics are based on classes such as cluster orientation and overlap of dimensions. As an illustration, we then provide a comprehensive, systematic description and comparison of a few significant algorithms belonging to the "axis-parallel, overlapping, density-based" SCAF.
Exploring temporal graph data with Python: a study on tensor decomposition o... (André Panisson)
Tensor decompositions have gained a steadily increasing popularity in data mining applications. Data sources from sensor networks and Internet-of-Things applications promise a wealth of interaction data that can be naturally represented as multidimensional structures such as tensors. For example, time-varying social networks collected from wearable proximity sensors can be represented as 3-way tensors. By representing this data as tensors, we can use tensor decomposition to extract community structures with their structural and temporal signatures.
The current standard framework for working with tensors, however, is Matlab. We will show how tensor decompositions can be carried out using Python, how to obtain latent components, how they can be interpreted, and what some applications of this technique are in academia and industry. We will see a use case where a Python implementation of tensor decomposition is applied to a dataset that describes social interactions of people, collected using the SocioPatterns platform. This platform was deployed in different settings such as conferences, schools and hospitals, in order to support mathematical modelling and simulation of airborne infectious diseases. Tensor decomposition has been used in these scenarios to solve different types of problems: it can be used for data cleaning, where time-varying graph anomalies can be identified and removed from the data; it can also be used to assess the impact of latent components on the spreading of a disease, and to devise intervention strategies that reduce the number of infection cases in a school or hospital. These are just a few examples that show the potential of this technique in data mining and machine learning applications.
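A pure-NumPy sketch of the 3-way CP (CANDECOMP/PARAFAC) decomposition by alternating least squares, the workhorse behind the community and temporal-signature extraction described above. The talk itself relies on dedicated Python tensor libraries; this sketch only shows the mechanics, with illustrative iteration counts.

```python
import numpy as np

def unfold(T, mode):
    """Mode-n matricisation: rows indexed by the chosen mode."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def khatri_rao(A, B):
    """Column-wise Kronecker product of two factor matrices."""
    r = A.shape[1]
    return np.einsum('ir,jr->ijr', A, B).reshape(-1, r)

def cp_als(T, rank, n_iter=200, seed=0):
    """Rank-`rank` CP decomposition of a 3-way tensor by alternating
    least squares: fix two factor matrices, solve for the third in
    closed form, and cycle through the modes."""
    rng = np.random.default_rng(seed)
    factors = [rng.standard_normal((s, rank)) for s in T.shape]
    for _ in range(n_iter):
        for mode in range(3):
            others = [factors[m] for m in range(3) if m != mode]
            kr = khatri_rao(others[0], others[1])
            # solve X_(mode) = A_mode @ kr.T in the least-squares sense
            factors[mode] = unfold(T, mode) @ np.linalg.pinv(kr).T
    return factors

def cp_reconstruct(factors):
    A, B, C = factors
    return np.einsum('ir,jr,kr->ijk', A, B, C)
```

For a time-varying social network stored as a (node, node, time) tensor, the columns of the three factor matrices are exactly the community structures and their temporal activity signatures mentioned in the abstract.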
FINGERPRINT CLASSIFICATION BASED ON ORIENTATION FIELD (ijesajournal)
ABSTRACT
This paper introduces an effective method of fingerprint classification based on discriminative features gathered from the orientation field. A nonlinear support vector machine (SVM) is adopted for the classification. The orientation field is estimated through a pixel-wise gradient descent method, and the percentages of the directional block classes are computed. These percentages form a four-dimensional feature vector which, combined with an accurately located singular point, classifies the fingerprint into one of five classes. The method shows high classification accuracy relative to other spatial-domain classifiers.
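The block-wise orientation field that such classifiers start from can be sketched with the standard doubled-angle gradient averaging; this is the common gradient-based estimate, not necessarily the paper's exact pixel-wise method, and the block size is an illustrative choice.

```python
import numpy as np

def orientation_field(img, block=8):
    """Gradient-based orientation field: per-pixel gradients are combined
    block-wise via the doubled-angle average
    theta = 0.5 * atan2(sum 2*Gx*Gy, sum (Gx^2 - Gy^2)),
    giving one dominant gradient orientation per block (the doubling
    makes angles theta and theta+pi average consistently)."""
    gy, gx = np.gradient(img.astype(float))   # derivatives along y then x
    gxx = gx * gx - gy * gy
    gxy = 2.0 * gx * gy
    h, w = img.shape
    H, W = h // block, w // block
    theta = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            sl = (slice(i * block, (i + 1) * block),
                  slice(j * block, (j + 1) * block))
            theta[i, j] = 0.5 * np.arctan2(gxy[sl].sum(), gxx[sl].sum())
    return theta   # radians, in (-pi/2, pi/2]
```

Quantising these block orientations into a few directional classes and taking their percentages yields exactly the kind of compact feature vector the abstract feeds to the SVM.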
K-means Clustering Method for the Analysis of Log Dataidescitation
Clustering analysis is one of the main analytical methods in data mining, and the choice of clustering algorithm directly influences the clustering results. This paper discusses the standard k-means clustering algorithm and analyzes its shortcomings. It also focuses on web usage mining, analyzing the data for pattern recognition; patterns are identified with the help of the k-means algorithm.
Textural Feature Extraction of Natural Objects for Image ClassificationCSCJournals
The field of digital image processing has been growing in scope in recent years. A digital image is represented as a two-dimensional array of pixels, where each pixel has intensity and location information. Analysis of digital images involves extraction of meaningful information from them, based on certain requirements. Digital image analysis requires the extraction of features, which transforms the data from a high-dimensional space to a space of fewer dimensions. Feature vectors are n-dimensional vectors of numerical features used to represent an object. We have used Haralick features to classify various images using different classification algorithms such as Support Vector Machines (SVM), Logistic Classifier, Random Forests, Multi-Layer Perceptron, and Naïve Bayes Classifier. Then we used cross-validation to assess how well a classifier works for a generalized data set, as compared to the classifications obtained during training.
Ensemble based Distributed K-Modes ClusteringIJERD Editor
Clustering has been recognized as the unsupervised classification of data items into groups. Due to the explosion in the number of autonomous data sources, there is an emergent need for effective approaches in distributed clustering. A distributed clustering algorithm is used to cluster distributed datasets without gathering all the data at a single site. K-Means is a popular clustering method owing to its simplicity and speed in clustering large datasets, but it fails to directly handle datasets with categorical attributes, which commonly occur in real-life datasets. Huang proposed the K-Modes clustering algorithm by introducing a new dissimilarity measure to cluster categorical data. This algorithm replaces the means of clusters with a frequency-based method that updates modes during the clustering process to minimize the cost function. Most of the distributed clustering algorithms found in the literature seek to cluster numerical data. In this paper, a novel Ensemble-based Distributed K-Modes clustering algorithm is proposed, which is well suited to handle categorical data sets as well as to perform the distributed clustering process in an asynchronous manner. The performance of the proposed algorithm is compared with existing distributed K-Means clustering algorithms and a K-Modes-based centralized clustering algorithm. The experiments are carried out on various datasets from the UCI machine learning data repository.
On the High Dimensional Information Processing in Quaternionic Domain and its...IJAAS Team
There are various high-dimensional engineering and scientific applications in communication, control, robotics, computer vision, biometrics, etc., where researchers face the problem of designing an intelligent and robust neural system that can process higher-dimensional information efficiently. Conventional real-valued neural networks have been tried on problems with high-dimensional parameters, but the required network structure is highly complex, very time-consuming, and sensitive to noise. These networks are also unable to learn magnitude and phase values simultaneously in space. A quaternion is a number that possesses magnitude in all four directions, with phase information embedded within it. This paper presents a well-generalized learning machine with a quaternionic-domain neural network that can finely process the magnitude and phase information of high-dimensional data without any hassle. The learning and generalization capability of the proposed learning machine is presented through a wide spectrum of simulations, which demonstrate the significance of the work.
Elevating Tactical DDD Patterns Through Object CalisthenicsDorra BARTAGUIZ
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
Key Trends Shaping the Future of Infrastructure.pdfCheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud and open-source; exploring how these areas are likely to mature and develop over the short and long-term, and then considering how organisations can position themselves to adapt and thrive.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview, including the concepts of Customer Key and Double Key Encryption.
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
UiPath Test Automation using UiPath Test Suite series, part 3DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation Introduction,
UI automation Sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
Generating a custom Ruby SDK for your web service or Rails API using Smithyg2nightmarescribd
Have you ever wanted a Ruby client API to communicate with your web service? Smithy is a protocol-agnostic language for defining services and SDKs. Smithy Ruby is an implementation of Smithy that generates a Ruby SDK using a Smithy model. In this talk, we will explore Smithy and Smithy Ruby to learn how to generate custom feature-rich SDKs that can communicate with any web service, such as a Rails JSON API.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
JMeter webinar - integration with InfluxDB and GrafanaRTTS
Watch this recorded webinar about real-time monitoring of application performance. See how to integrate Apache JMeter, the open-source leader in performance testing, with InfluxDB, the open-source time-series database, and Grafana, the open-source analytics and visualization application.
In this webinar, we will review the benefits of leveraging InfluxDB and Grafana when executing load tests and demonstrate how these tools are used to visualize performance metrics.
Length: 30 minutes
Session Overview
-------------------------------------------
During this webinar, we will cover the following topics while demonstrating the integrations of JMeter, InfluxDB and Grafana:
- What out-of-the-box solutions are available for real-time monitoring JMeter tests?
- What are the benefits of integrating InfluxDB and Grafana into the load testing stack?
- Which features are provided by Grafana?
- Demonstration of InfluxDB and Grafana using a practice web application
To view the webinar recording, go to:
https://www.rttsweb.com/jmeter-integration-webinar
36. [1] Tan, P.-N., Steinbach, M., & Kumar, V. (2006). Introduction to Data Mining. Pearson Addison-Wesley.
[2] Jain, A. K. (2010). Data clustering: 50 years beyond K-means. Pattern Recognition Letters, 31(8), 651-666.
37. [3] Schaeffer, S. E. (2007). Graph clustering. Computer Science Review, 1(1), 27-64.

• Minimum-cut objective:

  min Σ_{i=1}^{k} Σ_{j∈C_i, l∉C_i} w_{jl}

  where k is the number of clusters.

[4] Boutin, F., & Hascoet, M. (2004, July). Cluster validity indices for graph partitioning. In Information Visualisation, 2004. IV 2004. Proceedings. Eighth International Conference on (pp. 376-381). IEEE.
[5] Tibshirani, R., Walther, G., & Hastie, T. (2001). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(2), 411-423.
[6] Patkar, S. B., & Narayanan, H. (2003, January). An efficient practical heuristic for good ratio-cut partitioning. In VLSI Design, 2003. Proceedings. 16th International Conference on (pp. 64-69). IEEE.
38. [7] Sibson, R. (1973). SLINK: an optimally efficient algorithm for the single-link cluster method. The Computer Journal, 16(1), 30-34.
[8] Defays, D. (1977). An efficient algorithm for a complete link method. The Computer Journal, 20(4), 364-366.

• Graph Laplacian:

  L = D − A

  where A = [a_{ij}], i, j = 1, 2, ..., n is the adjacency matrix and D = diag(d_1, d_2, ..., d_n) is the degree matrix with d_i = Σ_{j=1}^{n} a_{ij}.

[9] Von Luxburg, U. (2007). A tutorial on spectral clustering. Statistics and Computing, 17(4), 395-416.
[10] Ng, A. Y., Jordan, M. I., & Weiss, Y. (2002). On spectral clustering: Analysis and an algorithm. Advances in Neural Information Processing Systems, 2, 849-856.
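The Laplacian above can be assembled in a few lines; a minimal sketch on a toy graph (the adjacency matrix is an illustrative example, not from the slides):

```python
import numpy as np

# Adjacency matrix of a small undirected graph (illustrative example).
A = np.array([[0., 1., 1., 0.],
              [1., 0., 1., 0.],
              [1., 1., 0., 1.],
              [0., 0., 1., 0.]])

d = A.sum(axis=1)   # degrees d_i = sum_j a_ij
D = np.diag(d)      # degree matrix
L = D - A           # unnormalized graph Laplacian

# Eigenvectors of L with the smallest eigenvalues give the spectral embedding;
# for a connected graph the smallest eigenvalue is 0.
eigvals = np.linalg.eigvalsh(L)
```

Every row of L sums to zero by construction, which is a quick sanity check after building it.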
39. • Modularity:

  Q = (1 / 2m) Σ_{i=1}^{k} Σ_{j,l∈C_i} ( A_{jl} − d_j d_l / 2m )

[11] Newman, M. E., & Girvan, M. (2004). Finding and evaluating community structure in networks. Physical Review E, 69(2), 026113.
[12] Clauset, A., Newman, M. E., & Moore, C. (2004). Finding community structure in very large networks. Physical Review E, 70(6), 066111.
[13] Kehagias, A. (2012). Bad communities with high modularity. arXiv preprint arXiv:1209.2678.
[14] Daszykowski, M., Walczak, B., & Massart, D. L. (2001). Looking for natural patterns in data: Part 1. Density-based approach. Chemometrics and Intelligent Laboratory Systems, 56(2), 83-92.
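The modularity Q above can be computed directly from its definition; a small sketch on a toy graph of two triangles joined by one edge (the graph is illustrative):

```python
import numpy as np

def modularity(A, labels):
    """Newman-Girvan modularity Q for a partition of an undirected graph."""
    m = A.sum() / 2.0                      # number of edges
    d = A.sum(axis=1)                      # degrees
    Q = 0.0
    for c in set(labels):
        idx = [i for i, lab in enumerate(labels) if lab == c]
        for j in idx:
            for l in idx:
                Q += A[j, l] - d[j] * d[l] / (2 * m)
    return Q / (2 * m)

# Two triangles joined by a single edge: a natural two-community split.
A = np.zeros((6, 6))
for a, b in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[a, b] = A[b, a] = 1.0
q = modularity(A, [0, 0, 0, 1, 1, 1])
```

For this graph the natural split scores Q = 5/14, well above 0, as expected for a clear community structure.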
41. • Self-tuning affinity with mutual k-nearest neighbors:

  w_{ij} = exp( − d(x_i, x_j)² / (d_i^k d_j^k) )   if x_j ∈ x_i^k and x_i ∈ x_j^k
  w_{ij} = 0                                        otherwise

  where x_i^k is the k-nearest-neighbor set of point i and d_i^k is the distance between point i and the k-th neighbor of point i.

[15] Zelnik-Manor, L., & Perona, P. (2004). Self-tuning spectral clustering. In Advances in Neural Information Processing Systems (pp. 1601-1608).
[16] Ertoz, L., Steinbach, M., & Kumar, V. (2002, April). A new shared nearest neighbor clustering algorithm and its applications. In Workshop on Clustering High Dimensional Data and its Applications at 2nd SIAM International Conference on Data Mining (pp. 105-115).
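A sketch of the affinity above, assuming Euclidean distance and the mutual k-NN condition as written (the sample points are illustrative):

```python
import numpy as np

def self_tuning_affinity(X, k):
    """w_ij = exp(-d(x_i,x_j)^2 / (d_i^k d_j^k)) for mutual k-NN pairs, else 0."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    order = np.argsort(D, axis=1)                # order[i, 0] is the point itself
    knn = [set(order[i, 1:k + 1]) for i in range(n)]  # k nearest neighbors of each point
    sigma = D[np.arange(n), order[:, k]]         # d_i^k: distance to the k-th neighbor
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and j in knn[i] and i in knn[j]:   # mutual k-NN condition
                W[i, j] = np.exp(-D[i, j] ** 2 / (sigma[i] * sigma[j]))
    return W

# Three nearby points plus one outlier: the outlier gets no mutual neighbors.
X = np.array([[0., 0.], [1., 0.], [0., 1.], [5., 5.]])
W = self_tuning_affinity(X, k=2)
```

The local scales d_i^k make the affinity adapt to each point's neighborhood density, which is the point of the self-tuning construction.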
42. • Define, for each point i,

  d_i = Σ_{j∈x_i^k} w_{ij} + Σ_{j,k∈x_i^k} w_{jk}

  where x_i^k is the k-nearest-neighbor set of point i.
60. Matrix Factorization for Collaborative Prediction

             Item 1   Item 2   Item 3   Item 4     User factor
  User 1       6        9        3        ?          (3, 0)
  User 2       4        ?        2        0          (2, 0)
  User 3       0        0        2        3          (0, 1)
  User 4       0        ?        4        ?          (0, 2)

  Item factor matrix: Item 1 = (2, 0), Item 2 = (3, 0), Item 3 = (1, 2), Item 4 = (0, 3)

• Collaborative prediction: filling missing entries of the user-item rating matrix.
• Matrix factorization: predicting an unknown rating by the product of the user factor vector and the item factor vector.
Regularized Matrix Factorization
• Minimize the regularized squared error loss.

  Alternating Least Squares (ALS):
    Time complexity:   O(2|Ω|K² + (I+J)K³)
    Parallelization:   Easy
    Tuning parameter:  λ (regularization)
61. Regularized Matrix Factorization
• Minimize the regularized squared error loss.

  Stochastic Gradient Descent (SGD):
    Time complexity:    O(2|Ω|K)
    Parallelization:    Possible, but not easy
    Tuning parameters:  λ (regularization), learning rate

Problem of parameter tuning
• Too small λ: overfitting
• Too large λ: underfitting
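One SGD epoch touches each observed rating once, which is where the O(2|Ω|K) cost per epoch comes from. A minimal sketch (the toy data, learning rate, and regularization value are illustrative):

```python
import numpy as np

def sgd_epoch(obs, U, V, lam=0.02, lr=0.02):
    """One SGD pass over observed ratings (i, j, x): O(2|Omega|K) per epoch."""
    for i, j, x in obs:
        err = x - U[:, i] @ V[:, j]              # residual on observation (i, j)
        ui = U[:, i].copy()                      # use pre-update u_i for the v_j step
        U[:, i] += lr * (err * V[:, j] - lam * U[:, i])
        V[:, j] += lr * (err * ui - lam * V[:, j])

# Toy data: ratings generated from hidden rank-2 factors, half observed.
rng = np.random.default_rng(0)
Ut = rng.standard_normal((2, 20))
Vt = rng.standard_normal((2, 15))
obs = [(i, j, Ut[:, i] @ Vt[:, j])
       for i in range(20) for j in range(15) if rng.random() < 0.5]

U = rng.standard_normal((2, 20)) * 0.1
V = rng.standard_normal((2, 15)) * 0.1
for _ in range(200):
    sgd_epoch(obs, U, V)
rmse = np.sqrt(np.mean([(x - U[:, i] @ V[:, j]) ** 2 for i, j, x in obs]))
```

The same loop also illustrates the tuning burden the slide complains about: convergence speed and final error depend jointly on `lr`, `lam`, and the number of epochs.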
62. Problem of parameter tuning
• The optimal regularization parameter differs depending on the dataset and rank K.
  Regularization parameter chosen by cross-validation on various datasets and rank K (Kim & Choi, IEEE SPL 2013).

Problem of parameter tuning
• SGD requires tuning of the regularization parameter, the learning rate, and even the number of epochs.

            0.005        0.007        0.010        0.015        0.020
  0.005   0.9061/13    0.9079/15    0.9117/19    0.9168/28    0.9168/44
  0.007   0.9056/10    0.9074/11    0.9112/13    0.9168/19    0.9169/31
  0.010   0.9064/ 7    0.9077/ 8    0.9113/10    0.9174/13    0.9186/21
  0.015   0.9099/ 5    0.9011/ 6    0.9152/ 6    0.9257/ 7    0.9390/ 7
  0.020   0.9166/ 4    0.9175/ 4    0.9217/ 4    0.9314/ 4    0.9431/ 3

  Netflix probe10 RMSE / optimal number of epochs of BRISMF for various regularization
  and learning-rate values (K = 40). (Takács et al., JMLR 2009)
63. Bayesian Matrix Factorization
  Prior:       P(U), P(V)
  Likelihood:  P(X | U, V)
  Posterior:   P(U, V | X)

Approximate the posterior by
• MCMC (Salakhutdinov & Mnih, ICML 2008)
• Variational method (Lim & Teh, KDD Cup 2007)

MCMC on Netflix:
• No parameter tuning, no overfitting, high accuracy
• Huge computational cost: O(2|Ω|K² + (I+J)K³)

Scalable Variational Bayesian Matrix Factorization
• No parameter tuning
• Linear space complexity: O(2(I+J)K)
• Linear time complexity: O(6|Ω|K)
• Easily parallelized on multi-core systems
• Optimizes an element-wisely factorized variational distribution with a coordinate descent method.
64. Variational Bayesian Matrix Factorization
• The likelihood is Gaussian on the observed entries.
• Gaussian priors are placed on the factor matrices U and V.
• The posterior is approximated by a variational distribution, obtained by maximizing the variational lower bound, or equivalently minimizing the KL-divergence.

VBMF-BCD (Lim & Teh, KDD Cup 2007)
• Matrix-wisely factorized variational distribution

    Space complexity:  O((I+J)(K+K²))
    Time complexity:   O(2|Ω|K² + (I+J)K³)
    Parallelization:   Easy
65. Scalable VBMF: linear space complexity
• Element-wisely factorized variational distribution.

  Memory for variational parameters (K = 100):

                  I           J          O((I+J)(K+K²))   O(2(I+J)K)
  Netflix         480,189     17,770     4.4 GB           0.8 GB
  Yahoo-music     1,000,990   624,961    131 GB           2.6 GB

Scalable VBMF: quadratic time complexity
• Updating rules for q(u_ki)
• Updating all variational parameters
66. Scalable VBMF: linear time complexity
• Let R_ij denote the residual on the (i, j)-th observation.
• With R_ij, the updating rule can be rewritten in terms of the residuals.

Scalable VBMF: linear time complexity
• When a variational parameter is changed, R_ij can be easily updated incrementally.
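The incremental-residual idea can be sketched as follows; the exact VBMF updating rules are not reproduced here, this only shows how a residual table R_ij = X_ij − Σ_k U[k,i] V[k,j] stays consistent when one factor entry changes:

```python
import numpy as np

def set_user_factor(R, U, V, obs_by_user, k, i, new_value):
    """Absorb a change of U[k, i] into residuals R_ij = X_ij - sum_k U[k,i] V[k,j]."""
    delta = new_value - U[k, i]
    for j in obs_by_user[i]:          # only observations involving user i are touched
        R[i, j] -= delta * V[k, j]
    U[k, i] = new_value

# Dense toy check that the incremental update matches a full recomputation.
rng = np.random.default_rng(1)
U = rng.standard_normal((2, 3))
V = rng.standard_normal((2, 4))
X = rng.standard_normal((3, 4))
R = X - U.T @ V
obs_by_user = {i: range(4) for i in range(3)}
set_user_factor(R, U, V, obs_by_user, k=0, i=1, new_value=0.7)
```

Adjusting only the affected residuals costs O(number of observations of that user) per changed entry, which is what turns a quadratic sweep into the linear-time update claimed above.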
67. Scalable VBMF: parallelization
• Each column of variational parameters can be updated independently from the updates of the other columns.
• Parallelization can be easily done in a column-by-column manner.
• Easy implementation with the OpenMP library on multi-core systems.

Related work (Pilászy et al., RecSys 2010)
• A similar idea is used to reduce the cubic time complexity of ALS to a linear one.
• RMF → Scalable VBMF: with small extra effort, a more accurate model is obtainable without tuning of the regularization parameter.
68. Related Work (Raiko et al., ECML 2007)
• Considers an element-wisely factorized variational distribution.
• Updates U and V by a scaled gradient descent method.
• Requires tuning of the learning rate.
• Learning speed is slower than our algorithm.

Numerical Experiments
• Compared methods: VBMF-CD, VBMF-BCD (Lim & Teh, KDD Cup 2007), VBMF-GD (Raiko et al., ECML 2007)
• Experimental environment
  – Quad-core Intel® Core™ i7-3820 @ 3.6 GHz, 64 GB memory
  – Implemented in Matlab 2011a, where the main computational modules are implemented in C++ as mex files
  – Parallelized with the OpenMP library
• Datasets

                 MovieLens10M   Netflix        Yahoo-music
  # of users     69,878         480,189        1,000,990
  # of items     10,677         17,770         624,961
  # of ratings   10,000,054     100,480,507    262,810,275
69. Numerical Experiments: K = 20
  RMSE versus computation time on a quad-core system for each dataset:
  (a) MovieLens10M, (b) Netflix, (c) Yahoo-music

              MovieLens10M   Netflix   Yahoo-music
  VBMF-CD     0.8589         0.9065    22.3425
  VBMF-BCD    0.8671         0.9070    22.3671
  VBMF-GD     0.8591         0.9167    22.5883

Numerical Experiments: Netflix, K = 50

  Time per iteration:  VBMF-BCD 66 min.   VBMF-CD 77 sec.   VBMF-GD 29 sec.

            VBMF-BCD          VBMF-CD
  RMSE      Iter.   Time      Iter.   Time
  0.9005    19      21 h      63      74 m
  0.9004    21      23 h      70      82 m
  0.9003    22      24 h      84      98 m
  0.9002    25      28 h      108     2 h
  0.9001    27      31 h      680     13 h
  0.9000    30      33 h
70. Conclusion
• We presented a scalable learning algorithm for VBMF, VBMF-CD.
• VBMF-CD optimizes element-wisely factorized variational distributions with a coordinate descent method.
• The space and time complexity of VBMF-CD are linear.
• VBMF-CD can be easily parallelized.
• Experimental results confirmed the favorable behavior of VBMF-CD, such as scalability, fast learning, and prediction accuracy.
71. A hybrid genetic algorithm for accelerating feature selection and parameter optimization of support vector machine
2013. 11. 29.

Introduction
• Support Vector Machine (SVM)
  – One of the most popular state-of-the-art classification algorithms.
  – Efficiently finds non-linear solutions by exploiting kernel functions.
  – Takes training time complexity O(N³).
• "Very important" issues in training SVM
  – Feature selection
    • SVM is a distance-based algorithm (kernel matrix computation) and doesn't include any feature selection mechanism.
    • Irrelevant features degrade the model performance.
  – Parameter optimization
    • Model tradeoff parameter C, kernel parameter σ (for the RBF kernel).
    • SVM is very sensitive to the parameter settings.
  – For SVM, feature selection and parameter optimization should be performed simultaneously.
72. Introduction
• Genetic algorithm (GA)
  – A stochastic algorithm that mimics natural evolution.
  – Easy, but very effective!
  – Cycle: Population → Selection → Parents → Genetic operations (Crossover, Mutation) → Offspring → Replacement → Population
• GA-based feature selection and parameter selection of SVM [1-4]
  – GA effectively finds near-optimal feature subsets and parameters.
  – But slow. (Still MUCH better than the grid-search mechanism.)

Introduction
• If the SVM has to be re-trained periodically, fast feature selection and parameter optimization is required.
• This study aims to avoid producing bad offspring in the "genetic operation" step of GA.
• This study proposes a chromosome filtering method, based on a Decision Tree (DT), for faster convergence of GA in feature selection and parameter optimization of SVM.
73. The proposed method
• Flowchart:
  1. Initialization → Population
  2. Evaluate fitness
  3. Termination condition? If yes → output optimized parameters and feature subset; if no → continue
  4. Do genetic operations
  5. Chromosome filtering (surviving offspring go to fitness evaluation)
  6. Population replacement → back to step 2
The proposed method
• Chromosome design
  – Parameters: binary representation
    • C is encoded by bits over the place values 10⁻², 10⁻¹, ..., 10³; e.g. C = 1 × 10⁻² + 1 × 10¹.
    • σ is encoded similarly over the place values 2⁻⁵, ..., 2⁵.
  – Feature subset: binary representation
    • A bit string over the features f₁ f₂ ... f_p (genotype); e.g. 1 0 0 1 0 ... 1 0 maps to the subset {f₁, f₄, ..., f_{p-1}} (phenotype).
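A sketch of the decoding step from genotype to phenotype. The place values below are illustrative assumptions, since the slide's exact bit layout is only partially preserved:

```python
# Illustrative place values (assumed; the slide's exact encoding is not fully preserved).
C_PLACES = [1e-2, 1e-1, 1e0, 1e1, 1e2, 1e3]
SIGMA_PLACES = [2.0 ** e for e in range(-5, 6)]   # 2^-5 ... 2^5

def decode(chromosome, n_features):
    """Genotype (flat bit list) -> phenotype (C, sigma, selected feature indices)."""
    c_bits = chromosome[:len(C_PLACES)]
    s_bits = chromosome[len(C_PLACES):len(C_PLACES) + len(SIGMA_PLACES)]
    f_bits = chromosome[-n_features:]
    C = sum(b * v for b, v in zip(c_bits, C_PLACES))
    sigma = sum(b * v for b, v in zip(s_bits, SIGMA_PLACES))
    features = [i for i, b in enumerate(f_bits) if b]
    return C, sigma, features
```

For example, a chromosome whose C-bits select the 10⁻² and 10¹ places decodes to C = 10.01, matching the slide's C = 1 × 10⁻² + 1 × 10¹ example.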
74. The proposed method
• Fitness evaluation
  – Decode the chromosome and obtain C, σ, and a feature subset (genotype → phenotype).
  – Train SVM on a dataset given the selected C, σ, and feature subset.
  – Fitness value: cross-validation accuracy.

The proposed method
• Genetic operation
  – Parent selection
    • Roulette-wheel scheme: fitness-proportional selection (FPS).
    • Probability of the i-th chromosome c_i in the population being selected = f(i) / Σ_j f(j), where f(i) is the fitness of c_i.
  – Crossover: N-point crossover
    • Choose N random crossover points and split along those points.
  – Mutation: bit-flipping mutation
    • Bitwise bit-flipping with a fixed probability.
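The three genetic operations above can be sketched as follows (roulette-wheel selection assumes positive fitness values):

```python
import random

def roulette_select(population, fitness):
    """Fitness-proportional selection: P(c_i) = f(i) / sum_j f(j) (fitness > 0 assumed)."""
    r = random.uniform(0, sum(fitness))
    acc = 0.0
    for chrom, f in zip(population, fitness):
        acc += f
        if r <= acc:
            return chrom
    return population[-1]

def n_point_crossover(p1, p2, n=2):
    """Choose n random crossover points and alternate segments of the two parents."""
    points = sorted(random.sample(range(1, len(p1)), n))
    child, start = [], 0
    for seg, end in enumerate(points + [len(p1)]):
        child.extend((p1 if seg % 2 == 0 else p2)[start:end])
        start = end
    return child

def bit_flip_mutation(chrom, pm=0.05):
    """Flip each bit independently with fixed probability pm."""
    return [1 - b if random.random() < pm else b for b in chrom]
```

Offspring produced this way are exactly what the chromosome filter in the next slides screens before the expensive SVM fitness evaluation.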
75. The proposed method
• Chromosome filtering
  – For each generation, chromosomes and their fitness are stored in the knowledgebase. A DT is trained periodically on the knowledgebase. Using the DT, the offspring chromosomes that are likely to have bad fitness are removed before the fitness evaluation step.
  – Assumption
    • Some features and parameter settings improve (or degrade) the model performance.
    • A DT can find these rules.

The proposed method
• Chromosome filtering (continued)
  – Why a DT?
    • Effectively deals with categorical features.
    • Finds non-linear relationships.
    • Uses a few, relevant features in the classification procedure.
  – DT training
    • Each c_i (i-th chromosome) in the knowledgebase (sorted by fitness) is labeled:
      – highest M fitness values (c_1, ..., c_M) → GOOD (probable to yield a good fitness value)
      – next highest M fitness values (c_{M+1}, ..., c_{2M}) → NORMAL
      – remaining (c_{2M+1}, ...) → BAD (probable to yield a bad fitness value)
    • Input features: the chromosome (in phenotype)
    • Output feature: the label {GOOD, NORMAL, BAD}
76. The proposed method
• Chromosome filtering (continued)
  – Filtering
    • The DT gives rules that assess a chromosome before fitness evaluation: is a chromosome GOOD, NORMAL, or BAD?
    • Each class has a different survival probability, e.g. GOOD: 1.0, NORMAL: 0.5, BAD: 0.2.
    • The DT is periodically updated, so the criteria for a good chromosome change through the generations.

The proposed method
• Chromosome filtering (continued)
  – DT example: [Figure: a small decision tree with internal splits such as C > 100, σ > 1, σ > 0.25, "contains F1?", and "contains F3?", and leaves labeled GOOD, NORMAL, or BAD.]
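A sketch of the filtering step, using the survival probabilities from the example above; `dt_predict` stands in for the trained decision tree:

```python
import random

# Survival probabilities from the slide's example (GOOD 1.0, NORMAL 0.5, BAD 0.2).
SURVIVAL = {"GOOD": 1.0, "NORMAL": 0.5, "BAD": 0.2}

def filter_offspring(offspring, dt_predict, rnd=random.random):
    """Drop offspring the DT labels as likely-bad, before the costly fitness evaluation."""
    return [c for c in offspring if rnd() < SURVIVAL[dt_predict(c)]]
```

Keeping the decision probabilistic (rather than hard-rejecting BAD chromosomes) preserves some exploration, since the DT's criteria are themselves revised every few generations.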
77. The proposed method
• Population replacement: steady-state model
  (chosen to verify the effectiveness of the proposed method in the initial period of GA)
  – Only one chromosome in the population is updated per generation.
  – Replacement scheme [5, 6]: the offspring replaces one of its parents or the lowest-fitness chromosome in the population.
    • If the offspring is superior to both parents, it replaces the similar parent.
    • If it is in between the two parents, it replaces the inferior parent.
    • Otherwise, the most inferior chromosome in the population is replaced.

Experiments
• Experimental design
  – 10 datasets from the UCI repository; all datasets were normalized to be in [-1, 1].
  – 5 independent runs; a fixed random seed set was used for fairness.
  – In SVM training, 10-fold cross-validation was used.
  – Parameter settings
    • GA parameters
      – population size Npop = 30
      – crossover probability pc = 0.9
      – mutation probability pm = 0.05
      – max iteration = 300
      – pgood = 1, pnormal = 0.5, pbad = 0.2
    • DT parameters
      – CART
      – Labeling: good = 10, normal = 10, bad = remaining
      – Training starting point: 30th generation / period = 10
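The replacement scheme can be sketched as follows (Hamming matches are an assumed similarity measure for "the similar parent"; the slide does not specify one):

```python
def hamming_similarity(a, b):
    """Number of matching positions between two equal-length bit lists."""
    return sum(x == y for x, y in zip(a, b))

def steady_state_replace(population, fitness, child, child_fit, p1, p2):
    """Replacement scheme [5, 6] as described on the slide."""
    f1, f2 = fitness[p1], fitness[p2]
    if child_fit >= max(f1, f2):          # superior to both: replace the similar parent
        target = (p1 if hamming_similarity(child, population[p1])
                  >= hamming_similarity(child, population[p2]) else p2)
    elif child_fit >= min(f1, f2):        # in between: replace the inferior parent
        target = p1 if f1 <= f2 else p2
    else:                                 # inferior to both: replace the worst overall
        target = min(range(len(population)), key=lambda t: fitness[t])
    population[target] = child
    fitness[target] = child_fit
```

Because only one slot changes per generation, the steady-state model makes the early-generation effect of the chromosome filter easy to observe, which is why the slide adopts it.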
79. Concluding Remarks
• We presented a chromosome filtering method for GA-based feature selection and parameter optimization of SVM.
• The proposed method employed a DT as a chromosome filter to remove the offspring chromosomes that are likely to have bad fitness, before the fitness evaluation step of GA.
• On most datasets, the proposed method showed faster improvement of fitness than the standard GA.

Acknowledgements
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (No. 2011-0030814), and the Brain Korea 21 Program for Leading Universities & Students. This work was also supported by the Engineering Research Institute of SNU.
80. References
1. Frohlich, H., Chapelle, O., & Scholkopf, B. (2003, November). Feature selection for support vector machines by means of genetic algorithm. In Tools with Artificial Intelligence, 2003. Proceedings. 15th IEEE International Conference on (pp. 142-148). IEEE.
2. Huang, C. L., & Wang, C. J. (2006). A GA-based feature selection and parameters optimization for support vector machines. Expert Systems with Applications, 31(2), 231-240.
3. Min, S. H., Lee, J., & Han, I. (2006). Hybrid genetic algorithms and support vector machines for bankruptcy prediction. Expert Systems with Applications, 31(3), 652-660.
4. Zhao, M., Fu, C., Ji, L., Tang, K., & Zhou, M. (2011). Feature selection and parameter optimization for support vector machines: A new approach based on genetic algorithm with feature chromosomes. Expert Systems with Applications, 38(5), 5197-5204.
5. Bui, T. N., & Moon, B. R. (1996). Genetic algorithm and graph partitioning. Computers, IEEE Transactions on, 45(7), 841-855.
6. Oh, I. S., Lee, J. S., & Moon, B. R. (2004). Hybrid genetic algorithms for feature selection. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 26(11), 1424-1437.

End of Document
135. Document Indexing by Ensemble Model
Yanshan Wang and In-Chan Choi
Korea University, System Optimization Lab
yansh.wang@gmail.com
November 25, 2013

Yanshan Wang and In-Chan Choi (KU) — Indexing by EnM — November 25, 2013
Overview
1. The Basics
   Information Retrieval and Document Indexing
   Topic Modelling
   Indexing by Latent Dirichlet Allocation
2. Indexing by Ensemble Model
   Introduction to Ensemble Model
   Algorithms
   Experimental Results
3. Conclusions and Discussion
136. The problem in Information Retrieval
As more information (Big Data) becomes available, it is more difficult to access what users are looking for. We need new tools to help us understand and search among vast amounts of information.
Source: www.betaversion.org/stefano/linotype/news/26/

Document Indexing is Important
Users can get desired information by indexing (or ranking) documents (or items). The higher the position a document has, the more valuable it is to users.
137. Problems in Conventional Methods: Word Representation
The majority of rule-based and statistical Natural Language Processing (NLP) models regard words as atomic symbols. In Vector Space Models (VSM), a word is represented by one 1 and a lot of zeros. For example,

  [0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0]

Its problem:

  motel [0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0] AND
  hotel [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0] = 0

The conceptual meaning of words is ignored.
Topic Modeling
Latent Dirichlet Allocation (LDA) [Blei et al. (2003)].
Uncover the hidden topics that generate the collection.
Words and documents can be represented according to those topics.
Use the representation to organize, index and search the text.

Example of a word represented in the topic space:

  apple = [0.325, 0.792, 0.214, 0.107, 0.109, 0.612, 0.314, 0.245]ᵀ
138. LDA [Blei et al. (2003)]
Generative process for a document:
  Choose the number of words N ∼ Poisson(ξ).
  Choose θ ∼ Dirichlet(α).
  For n = 1, 2, ..., N:
    Choose a topic z_n ∼ Multinomial(θ);
    Choose a word w_n ∼ Multinomial(w_n | z_n, β), a multinomial distribution conditioned on the topic z_n.

Joint distribution: p(θ, z, d | α, β) = p(θ | α) Π_{n=1}^{N} p(z_n | θ) p(w_n | z_n, β)
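The generative process above can be simulated directly; a toy sketch (vocabulary size, topic count, and hyperparameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
K, V = 3, 8                                 # number of topics, vocabulary size
alpha = np.full(K, 0.5)                     # Dirichlet hyperparameter
beta = rng.dirichlet(np.ones(V), size=K)    # per-topic word distributions

N = rng.poisson(20)          # N ~ Poisson(xi), here xi = 20
theta = rng.dirichlet(alpha) # theta ~ Dirichlet(alpha): this document's topic mixture
doc = []
for _ in range(N):
    z = rng.choice(K, p=theta)       # z_n ~ Multinomial(theta)
    w = rng.choice(V, p=beta[z])     # w_n ~ Multinomial(beta_{z_n})
    doc.append(int(w))
```

Inference in LDA runs this process in reverse: given only the words `doc`, it recovers plausible θ and β.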
Indexing by LDA (LDI) [Choi and Lee (2010)]
With adequate assumptions, the probability of a word w_j embodying the concept z^k is

W_j^k = p(z^k = 1 | w_j = 1) = β_{jk} / Σ_{h=1}^{K} β_{jh}

The document (or query) representation can be defined within the topic space:

D_i^k (Q_i^k) = Σ_{j=1}^{V} W_j^k n_{ij} / N_{d_i},

where n_{ij} denotes the number of occurrences of word w_j in document d_i and N_{d_i} denotes the number of words in document d_i, i.e. N_{d_i} = Σ_{j=1}^{V} n_{ij}.
Similarity between document and query:

ρ(D, Q) = ⟨D / ‖D‖, Q / ‖Q‖⟩,

i.e. the cosine similarity of the normalized topic-space vectors.
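A minimal sketch of the topic-space mapping and cosine similarity defined above; the β matrix and word counts are toy assumptions.

```python
import numpy as np

def topic_representation(beta, counts):
    """Map a document's word counts to the topic space (LDI-style).

    beta:   (V, K) topic-word weights;  counts: (V,) word counts.
    W[j, k] = beta[j, k] / sum_h beta[j, h]   (word-to-topic probability)
    D[k]    = sum_j W[j, k] * counts[j] / counts.sum()
    """
    W = beta / beta.sum(axis=1, keepdims=True)
    return W.T @ counts / counts.sum()

def cosine_similarity(d, q):
    return (d / np.linalg.norm(d)) @ (q / np.linalg.norm(q))

# Toy numbers: 3 words, 2 topics (illustrative assumptions).
beta = np.array([[0.7, 0.3], [0.2, 0.8], [0.5, 0.5]])
doc = topic_representation(beta, np.array([3, 1, 2]))
query = topic_representation(beta, np.array([1, 0, 1]))
print(cosine_similarity(doc, query))
```

Note that the topic-space vector sums to 1, since each row of W is a probability distribution.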
Indexing by Ensemble Model (EnM) [Wang et al. (2013)]
Motivation: there exist optimal weights over the constituent models.
Table: A toy example. The values in the table represent similarities of documents with respect to a given query. The scores of Ensemble 1 and 2 are defined by 0.5*Model 1 + 0.5*Model 2 and 0.7*Model 1 + 0.3*Model 2, respectively. The relevant document list is assumed to be {2, 3}.

            Model 1   Model 2   Ensemble 1   Ensemble 2
Document 1  0.35      0.2       0.55         0.305
Document 2  0.4       0.1       0.5          0.31
Document 3  0.25      0.7       0.95         0.385
(M)AP       0.72      0.72      0.72         0.89
AP and MAP
Average Precision (AP) and Mean Average Precision (MAP)
Notation:
• |Q|: the number of queries in the query set;
• |D_i|: the number of documents in the relevant document set w.r.t. the ith query;
• d_{ij} ∈ D_i: the jth document in D_i;
• φ_i^k: the relevance score returned by the kth model w.r.t. the ith query;
• R(d_{ij}, φ_i^k): the indexing position of the jth document for the ith query returned by the kth model;
• H = Σ_k α_k φ^k: the ensemble model, a linear combination of the constituent models, where α_k ≥ 0.
Definition:

E(H, Q) = (1 / |Q|) Σ_{i=1}^{|Q|} AP(H, D_i),  AP(H, D_i) = (1 / |D_i|) Σ_{j=1}^{|D_i|} j / R(d_{ij}, H).
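The AP and MAP definitions above translate directly into code. The doc ids and scores below are assumed toy values, not the paper's data.

```python
def average_precision(scores, relevant):
    """AP as defined above: (1/|D_i|) * sum_j  j / R(d_ij, H).

    scores:   {doc_id: score} returned by a model for one query.
    relevant: list of relevant doc ids for that query.
    """
    ranking = sorted(scores, key=scores.get, reverse=True)
    # 1-based positions of the relevant documents, in rank order
    positions = sorted(ranking.index(d) + 1 for d in relevant)
    return sum(j / r for j, r in enumerate(positions, start=1)) / len(relevant)

def mean_average_precision(per_query):
    """MAP: mean of AP over queries. per_query: list of (scores, relevant)."""
    return sum(average_precision(s, rel) for s, rel in per_query) / len(per_query)

# Illustrative query with relevant docs {1, 3}.
scores = {1: 0.9, 2: 0.8, 3: 0.1}
print(average_precision(scores, [1, 3]))  # (1/2)*(1/1 + 2/3) ≈ 0.833
```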
Formulation of the Optimization Problem
Since 0 ≤ AP ≤ 1, we can define the empirical loss as follows:

min Σ_{i=1}^{|Q|} (1 − AP(H, D_i)),  or
min Σ_{i=1}^{|Q|} (1 − (1 / |D_i|) Σ_{j=1}^{|D_i|} j / R(d_{ij}, H)).

Our goal is to uncover the optimal weights α that minimize the empirical loss.
Difficulty
The position function R(d_{ij}, H) is nonconvex, nondifferentiable, and discontinuous w.r.t. the α's.
Boosting Scheme
1. Select model:
   φ̂_j = arg max_j Σ_{i=1}^{|Q|} D_i AP(φ_i^j);
2. Update the weight:
   α̂_j^t = (1/2) log δ̂_j^t,  where  δ̂_j = [Σ_{i=1}^{|Q|} D_i (1 + AP(φ_i^j))] / [Σ_{i=1}^{|Q|} D_i (1 − AP(φ_i^j))];
3. Update the distribution on queries:
   D_i = exp(−AP(H_i)) / Z,
   where Z is a normalizer.
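A rough sketch of one boosting round under the scheme above, assuming the per-query AP values of each constituent model are precomputed; the numbers are illustrative, and using the selected model's AP as a stand-in for AP(H_i) in step 3 is a simplifying assumption.

```python
import math

def boosting_round(models_ap, D):
    """One round of the boosting scheme.

    models_ap[k][i]: AP of constituent model k on query i (precomputed).
    D[i]: current weight of query i.
    Returns (selected model index, alpha, new query distribution).
    """
    n = len(D)
    # 1. Select the model with the largest query-weighted AP.
    k = max(range(len(models_ap)),
            key=lambda m: sum(D[i] * models_ap[m][i] for i in range(n)))
    # 2. Weight update: alpha = 1/2 * log of the weighted (1+AP)/(1-AP) ratio.
    num = sum(D[i] * (1 + models_ap[k][i]) for i in range(n))
    den = sum(D[i] * (1 - models_ap[k][i]) for i in range(n))
    alpha = 0.5 * math.log(num / den)
    # 3. Re-weight queries: emphasize the queries still ranked poorly
    # (the selected model's AP stands in for AP(H_i) here).
    unnorm = [math.exp(-models_ap[k][i]) for i in range(n)]
    Z = sum(unnorm)
    return k, alpha, [u / Z for u in unnorm]

# Two toy models over three queries (illustrative AP values).
models_ap = [[0.9, 0.4, 0.6], [0.5, 0.8, 0.7]]
k, alpha, D = boosting_round(models_ap, [1 / 3, 1 / 3, 1 / 3])
print(k, round(alpha, 3), D)
```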
Coordinate Descent
Since the objective is nonconvex, not every coordinate update will reduce the loss.
1. Select model:
   φ̂_j = arg max_j E(Q, φ_j);
2. Update the weight:
   α_j = (1/2) log [(1 + AP(φ_i^j)) / (1 − AP(φ_i^j))];
3. If E^t ≤ E^{t−1}, delete this coordinate.
Parallel Coordinate Descent
The coordinate descent algorithm can be parallelized across cores.
1: parfor p = 1, 2, ..., K_φ do
2:   Update the weights using α_p = (1/2) log [(1 + AP(φ_i^p)) / (1 − AP(φ_i^p))];
3: end parfor
4: return Ensemble model H.
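Since each coordinate's weight update depends only on that model's AP, the parfor loop can be sketched with a thread pool; the AP values below are illustrative, and in practice the costly part is evaluating AP, not this closed-form update.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def weight(ap):
    """alpha_p = 1/2 * log((1 + AP) / (1 - AP)), as in the parfor body."""
    return 0.5 * math.log((1 + ap) / (1 - ap))

# Mean AP of each constituent model (assumed numbers for illustration).
model_aps = [0.4605, 0.5026, 0.5334, 0.5738]

# The updates are independent across coordinates, so they can run
# concurrently without any coordination.
with ThreadPoolExecutor() as pool:
    alphas = list(pool.map(weight, model_aps))
print([round(a, 3) for a in alphas])
```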
Experimental Results on EnM
Data: MED corpus¹.
• 1033 documents from the National Library of Medicine.
• 30 queries.
Results.
Table: MAP of various methods for the MED corpus.

Method    MAP      improvement (%)
TFIDF     0.4605   (baseline)
LSI       0.5026   9.1
pLSI      0.5334   15.8
LDI       0.5738   24.6
EnM.B     0.6420   39.4
EnM.CD    0.6461   40.3
EnM.PCD   0.6414   39.3

Figure: Precision-recall curves for TFIDF, LSA, pLSI, LDI, and EnM (precision against recall over [0.1, 1]).
1: ftp://ftp.cs.cornell.edu/pub/smart.
Conclusions and Discussion
Conclusion
• An ensemble model (EnM) is proposed, and three algorithms are introduced for solving the optimization problem.
• The EnM outperformed all constituent models over the entire recall range.
Discussion
• The algorithms are not guaranteed to converge to the global optimum due to the nonconvexity of the objective.
• The parallel coordinate descent algorithm cannot guarantee an optimum, even a local optimum, due to the coupling between variables.
Future Work
• Approximate the objective with convex functions.
• Use stochastic gradient descent for stochastic sequences and large-scale data sets.
References
Yanshan Wang and In-Chan Choi (2013). Indexing by ensemble model. Working paper, arXiv preprint arXiv:1309.3421.
David M. Blei, Andrew Y. Ng, and Michael I. Jordan (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993-1022.
In-Chan Choi and Jae-Sung Lee (2010). Document indexing by latent Dirichlet allocation. DMIN, 409-414.
Y. Freund and R. E. Schapire (1995). A decision-theoretic generalization of on-line learning and an application to boosting. Computational Learning Theory, Springer, 23-37.
Homepage: http://optlab.korea.ac.kr/~sam/

The End
• Suarez, Estrella, et al. Matrix-assisted laser desorption/ionization mass spectrometry of cuticular lipid profiles can differentiate sex, age, and mating status of Anopheles gambiae mosquitoes. Analytica Chimica Acta 706.1 (2011): 157-163.
• Li, Lihua, et al. Data mining techniques for cancer detection using serum proteomic profiling. Artificial Intelligence in Medicine 32.2 (2004): 71-83.
• Tibshirani, Robert, et al. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67.1 (2005): 91-108.
• Liu, Jun, Lei Yuan, and Jieping Ye. An efficient algorithm for a class of fused lasso problems. Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2010.

Results
Performance Comparison
Average misclassification rate and average number of selected features.
Figure: (A) Fused-lasso coefficient magnitudes and selected-feature intensity values over m/z (0-12000); (B) samples plotted on the first two principal components (legend: Others, MFemale7).
Data fields: Country, City, Latitude, Longitude, Year, DataType, DataType2, DataType3, Institution, Purpose, Scope, Time Lag, Count, Ratio, Collection, Application.
• Valid Voting Ratio_i:
  N_all = total number of conservative/progressive parties,
  N_all ≥ N_cy + N_cn + N_py + N_pn,
  Valid Voting Ratio_i = (N_cy + N_cn + N_py + N_pn) / N_all.
• Yes/No Diversity_i = −Σ_{k∈{y,n}} P_k log_2 P_k, where
  P_y = (N_cy + N_py) / (N_cy + N_cn + N_py + N_pn),
  P_n = (N_cn + N_pn) / (N_cy + N_cn + N_py + N_pn).
• Political Orientation Diversity_i = −Σ_{i∈{c,p}, j∈{y,n}} P_ij log_4 P_ij, where
  P_cy = N_cy / (N_cy + N_cn + N_py + N_pn),
  P_cn = N_cn / (N_cy + N_cn + N_py + N_pn), ...
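The two diversity measures above are entropies in different log bases (base 2 for two outcomes, base 4 for four), so both lie in [0, 1]. A minimal sketch, with assumed vote counts:

```python
import math

def diversity(counts, base):
    """Entropy-style diversity: -sum p * log_base(p), skipping zero counts."""
    total = sum(counts)
    ps = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p, base) for p in ps)

# Illustrative counts N_cy, N_cn, N_py, N_pn (assumed numbers).
n_cy, n_cn, n_py, n_pn = 30, 10, 20, 40

yes_no = diversity([n_cy + n_py, n_cn + n_pn], base=2)
orientation = diversity([n_cy, n_cn, n_py, n_pn], base=4)
print(round(yes_no, 3), round(orientation, 3))
```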
Evaluation measures (confusion counts yy, yn, ny, nn, where the first index is the actual class and the second the predicted class):
  Recall_y = yy / (yy + yn),  Precision_y = yy / (yy + ny),
  F1_y = 2 × Recall_y × Precision_y / (Recall_y + Precision_y);
  Recall_n = nn / (nn + ny),  Precision_n = nn / (nn + yn),
  F1_n = 2 × Recall_n × Precision_n / (Recall_n + Precision_n);
  F1_yn = (F1_y + F1_n) / 2.
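The per-class F1 measures above can be computed directly from the four confusion counts; the counts below are assumed toy values, and averaging F1_y and F1_n (macro-F1) follows the reconstruction above.

```python
def f1_scores(yy, yn, ny, nn):
    """Per-class recall/precision/F1 from confusion counts.

    First index: actual class, second: predicted class
    (yy = yes/yes, yn = yes/no, ny = no/yes, nn = no/no).
    """
    recall_y, precision_y = yy / (yy + yn), yy / (yy + ny)
    recall_n, precision_n = nn / (nn + ny), nn / (nn + yn)
    f1_y = 2 * recall_y * precision_y / (recall_y + precision_y)
    f1_n = 2 * recall_n * precision_n / (recall_n + precision_n)
    return f1_y, f1_n, (f1_y + f1_n) / 2  # macro-averaged F1_yn

print(f1_scores(40, 10, 5, 45))
```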
Modified LDA with Bibliography Information
Korea BI Data Mining Society, 2013 Fall Conference
System Optimization Lab., Korea University
Young Min Jun

Contents
1. LDA
  1.1 Topic Model
  1.2 LDA
2. Modified LDA with Bibliography Information
  2.1 Limitation of LDA
  2.2 Introduction
  2.3 Preliminary
  2.4 Model
  2.5 Expected Impacts
1.1 Topic Model
"Topic modeling provides a suite of algorithms to discover hidden thematic structure in large collections of texts. The results of topic modeling algorithms can be used to summarize, visualize, explore, and theorize about a corpus." (DM Blei, 2012)
Example
• What are the "topics" in the New York Times?
• How do the "topics" on Twitter change?
• How similar are these articles?
Research on topic models
• LSA
  • Based on dimensionality reduction (SVD decomposition)
• pLSA
  • Mixture decomposition
• LDA
  • The most frequently studied model
1.2 LDA
"LDA is a generative probabilistic model for collections of discrete data such as text corpora. It is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics." (DM Blei, 2003)
Generative Process
Graphical Model
Geometric Interpretation
Example
• Three topics for three words.
• LDA makes a smooth distribution on the topics.
2.1 Limitation of LDA
LDA is an effective tool for discovering topic structure, but further research is needed to improve it. This research focuses on three aspects: individual, reference, and explanation.
Individual
• LDA is a generative model for a corpus, so it provides information about the whole set of documents.
• In this study, the modified LDA gives more information about an individual document: a vector carrying the document's information and its distribution.
Reference
• LDA does not consider references to prior literature in its generative process.
• The modified LDA provides the bibliography of a document.
Explanation
• LDA often gives results that are hard to understand.
• The modified LDA is expected to provide more explainable results.
2.2 Introduction
LDA is motivated by the process of writing a document. Similarly, the modified LDA is motivated by writing a document in a library.
Generic Generative Process
More detail
• The place in the library encodes the probabilities of which reference is selected.
• References in the same category have similar topics and words.
2.3 Preliminary
In this research, we use the language of text collections and introduce terms such as "parent corpus", "category", and "document distribution".
Parent Corpus
• A set of documents used as references.
• The parent corpus consists of parent documents.
• Each parent document has its own place.
Category
• A category is a cluster of the parent corpus.
• Parent documents in the same category share the same topic and word priors.
Document Distribution
• The probability distribution over the selection of parent documents.
• Each parent document in a category has a probability of being selected.
• Selected parent documents influence the topics and words of the new document.
2.3 Preliminary
The document distribution represents the information of the new document.
Document Distribution
• A probabilistic as well as a deterministic representation of a document.
• Probabilistic: a distribution over the parent documents giving the probability of each being used in generating the document.
• Deterministic: the list of parent documents with high selection probability.
Mixture of Gaussian Distributions
• The number of mixture components gives the number of categories.
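The Gaussian-mixture view above (parent-document places drawn from a mixture whose component count equals the number of categories) can be sketched as a sampler; all parameters here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two categories = two mixture components (toy parameters).
means = np.array([[0.0, 0.0], [5.0, 5.0]])
covs = np.array([np.eye(2) * 0.5, np.eye(2) * 0.5])
weights = np.array([0.4, 0.6])

def sample_parent_documents(n):
    """Place n parent documents by sampling the Gaussian mixture:
    first pick a category, then draw a location from its Gaussian."""
    comps = rng.choice(len(weights), size=n, p=weights)
    points = np.array([rng.multivariate_normal(means[c], covs[c])
                       for c in comps])
    return comps, points

categories, places = sample_parent_documents(100)
print(places.shape, np.bincount(categories))
```

Fitting such a mixture back from the places (e.g. with EM, as in the Bilmes tutorial cited below) recovers the categories when the components are well separated.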
2.3 Preliminary
This slide contains the assumptions of the modified LDA with bibliography information.
Parent Corpus
• The parent corpus is assumed to have its own α and β.
• Each parent document is placed at a point in the document-distribution space.
Document Distribution
• The probability of selecting a parent document follows a Gaussian mixture distribution.
• The number of mixture components is known (this assumption can be relaxed).
LDA
• Bag-of-words assumption.
2.4 Model
This slide contains the notation and terminology and the generative process of the modified LDA with bibliography information.
Notation and Terminology
Generative Process

2.4 Model
This slide contains the graphical model and the probability of a document.
Graphical Model
Probability of Document

2.4 Model
Estimation
2.5 Expected Impacts
This research focuses on three aspects: individual, reference, and explanation.
Individual
• A probabilistic bibliography representation of a document.
• A representation carrying the information of the important references.
Reference
• Verifying that a document is well classified.
• Verifying plagiarism by comparing document distributions.
Explanation
• Providing a variety of views for analyzing text data.
2.5 Expected Impacts
Drawbacks
Dependency on LDA
• This model depends on LDA, e.g., in its perplexity and complexity.
Computational Complexity
• This research yields a total number of operations roughly on the order of O(N⁴k²).
Assumption
• It is assumed that the number of mixture components in the document distribution is known (this can be relaxed).
References
[1] Jeff A. Bilmes et al. A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models. International Computer Science Institute, 4(510):126, 1998.
[2] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022, 2003.
[3] DM Blei. Topic modeling and digital humanities. Journal of Digital Humanities, 2(1):8-11, 2012.
[4] Nikos Vlassis and Aristidis Likas. A greedy EM algorithm for Gaussian mixture learning. Neural Processing Letters, 15(1):77-87, 2002.