The document proposes a framework for optimum document clustering based on the cluster hypothesis. It defines a cluster metric called pairwise precision that evaluates how well a clustering groups together documents that are relevant to the same queries. The metric considers the number of document pairs that are both relevant or both irrelevant to a query within each cluster. The framework aims to find the clustering that maximizes this metric to optimally satisfy the cluster hypothesis. The document outlines experiments to test the framework and examine whether it leads to improved clustering over traditional methods.
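To make the metric concrete, here is a minimal Python sketch of a pairwise cluster-quality score of the kind described: over all within-cluster document pairs and all queries, it counts how often a pair is judged homogeneously (both relevant or both irrelevant). The function name and the exact aggregation are illustrative assumptions, not the paper's precise definition.

```python
from itertools import combinations

def pairwise_precision(clusters, relevant):
    """Fraction of (within-cluster pair, query) combinations where the pair is
    homogeneous: both documents relevant, or both irrelevant, to the query.
    clusters: list of sets of doc ids; relevant: dict query -> set of doc ids.
    """
    good = total = 0
    for docs in clusters:
        for a, b in combinations(docs, 2):
            for rel in relevant.values():
                total += 1
                if (a in rel) == (b in rel):  # both relevant or both irrelevant
                    good += 1
    return good / total if total else 0.0

# Example: the pair (1, 2) is homogeneous for q1, so the score is 1.0
print(pairwise_precision([{1, 2}, {3}], {"q1": {1, 2}}))
```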
This paper introduces nominal schemas as a way to integrate rules and description logics. Nominal schemas allow variables to be treated like nominals in description logics, avoiding a hybrid logic. The paper shows that reasoning in SROIQ extended with nominal schemas (SROIQV) remains N2ExpTime-complete. It also identifies a tractable fragment, SROELVn, by limiting the occurrences of "problematic" nominal schemas. The paper defines what makes a nominal schema occurrence "safe" and uses this to prove tractability.
Processing Reachability Queries with Realistic Constraints on Massive Network... (BigMine)
Massive graphs are ubiquitous in various application domains, such as social networks, road networks, communication networks, biological networks, RDF graphs, and so on. Such graphs are massive (for example, with hundreds of millions of nodes and edges or even more) and contain rich information (for example, node/edge weights, labels and textual contents). In such massive graphs, an important class of problems is to process various graph structure related queries. Graph reachability, as an example, asks whether a node can reach another in a graph. However, the large graph scale presents new challenges for efficient query processing.
In this talk, I will introduce two new yet important types of graph reachability queries: weight-constraint reachability, which imposes an edge-weight constraint on the answer path, and k-hop reachability, which imposes a length constraint on the answer path. With such realistic constraints, we can find more meaningful and practically feasible answers. These two reachability queries have wide applications in many real-world problems, such as QoS routing and trip planning.
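As a concrete illustration of the query semantics (not of the index structures such a talk would propose for massive graphs), a k-hop reachability check is just a depth-limited BFS; for weight-constraint reachability one would analogously search only over edges satisfying the weight constraint. A minimal sketch, assuming an adjacency-list dict:

```python
from collections import deque

def k_hop_reachable(adj, src, dst, k):
    """Can src reach dst in at most k hops? adj maps a node to its
    out-neighbors. BFS visits nodes in nondecreasing hop order, so marking
    a node as seen on first visit is safe."""
    frontier, seen = deque([(src, 0)]), {src}
    while frontier:
        node, hops = frontier.popleft()
        if node == dst:
            return True
        if hops < k:
            for nxt in adj.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, hops + 1))
    return False

adj = {"a": ["b"], "b": ["c"], "c": []}
print(k_hop_reachable(adj, "a", "c", 2))  # True
print(k_hop_reachable(adj, "a", "c", 1))  # False
```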
This document provides a reference card for data mining functions and packages in R. It lists popular R packages and functions for tasks such as association rule mining, classification/prediction, clustering, outlier detection, time series analysis, text mining, and social network analysis. Recommended packages and functions are shown in bold.
The document presents research on access strategies for network caching. It introduces the data store selection problem of determining which data stores to access based on indicators to minimize miss costs and access costs. The paper proposes modeling this as a knapsack problem and provides three approximation algorithms - DSKnap, DSPot, and DSPP. An evaluation on a real Wikipedia trace and CDN topology shows the DSKnap algorithm outperforms existing heuristics in total access costs across different miss rates and number of accessed locations.
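As a rough illustration of the knapsack framing only (not the paper's DSKnap, DSPot, or DSPP algorithms), one can greedily pick data stores by estimated value per unit of access cost under a cost budget; the field names and cost model below are assumptions for the sketch:

```python
def select_stores(stores, budget):
    """Greedy knapsack-style selection: each candidate data store has an
    access cost and an indicator-based estimated hit value; pick stores
    maximizing value within an access-cost budget.
    stores: list of (name, access_cost, expected_hit_value)."""
    chosen, spent = [], 0.0
    # classic density heuristic: best value per unit of access cost first
    for name, cost, value in sorted(stores, key=lambda s: s[2] / s[1], reverse=True):
        if spent + cost <= budget:
            chosen.append(name)
            spent += cost
    return chosen

print(select_stores([("near", 1.0, 0.6), ("far", 3.0, 0.9)], budget=2.0))
```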
Efficient steganography techniques are needed to secure digital information on the Internet and to communicate secret data, and many techniques have therefore been proposed for steganography. One of the intelligent techniques is the Particle Swarm Optimization (PSO) algorithm. Recently, many modifications of Standard PSO (SPSO) have been proposed, such as Human-Based Particle Swarm Optimization (HPSO). This paper therefore presents image steganography using HPSO to find the best locations in the cover image for hiding a secret text message, and then compares image steganography using SPSO with using HPSO. Experimental results on six 256×256 cover images and secret messages of different sizes show that the proposed image steganography using HPSO performs better than using SPSO.
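For context, the particle update at the heart of SPSO, which variants such as HPSO modify, looks roughly as follows; in the steganography setting a particle's position would encode candidate embedding locations in the cover image. The `fitness` argument is a stand-in for whatever imperceptibility measure the paper optimizes, and the constants are conventional defaults, not the paper's settings:

```python
import random

def pso_step(particles, fitness, w=0.7, c1=1.5, c2=1.5):
    """One standard PSO iteration. Each particle is a dict with position x,
    velocity v, and personal best pbest; gbest is the swarm-wide best."""
    gbest = min((p["pbest"] for p in particles), key=fitness)
    for p in particles:
        r1, r2 = random.random(), random.random()
        # velocity: inertia + pull toward personal best + pull toward global best
        p["v"] = [w * v + c1 * r1 * (pb - x) + c2 * r2 * (gb - x)
                  for v, x, pb, gb in zip(p["v"], p["x"], p["pbest"], gbest)]
        p["x"] = [x + v for x, v in zip(p["x"], p["v"])]
        if fitness(p["x"]) < fitness(p["pbest"]):
            p["pbest"] = list(p["x"])
```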
This document provides a summary of data mining and text mining packages and functions available in R. It lists popular packages and functions for tasks such as association rule mining, classification/prediction using decision trees and random forests, clustering, outlier detection, time series analysis, text cleaning/preparation, topic modeling, and social network analysis. It also includes packages and functions for evaluating model performance and visualizing results.
A Signature Scheme as Secure as the Diffie-Hellman Problem (vsubhashini)
This document summarizes a theory seminar on cryptography that covered digital signature schemes. It began with an introduction to hard assumptions like the discrete log problem and computational Diffie-Hellman problem. It then described the ElGamal digital signature scheme, including its key generation, signing, and verification algorithms. It discussed the security of signature schemes in the chosen message attack model and how the ElGamal scheme's unforgeability relies on the hardness of computing discrete logs. It analyzed the probability of an adversary using oracle queries to forge a signature or solve the computational Diffie-Hellman problem. References for the original ElGamal and related signature scheme papers were also provided.
The document outlines the PAC-Bayesian bound for deep learning. It discusses how the PAC-Bayesian bound provides a generalization guarantee that depends on the KL divergence between the prior and posterior distributions over hypotheses. This allows the bound to account for factors like model complexity and noise in the training data, avoiding some limitations of other generalization bounds. The document also explains how the PAC-Bayesian bound can be applied to stochastic neural networks by placing distributions over the network weights.
1) The document outlines PAC-Bayesian bounds, which provide probabilistic guarantees on the generalization error of a learning algorithm.
2) PAC-Bayesian bounds relate the expected generalization error of the output distribution Q to the training error, the number of samples, and the KL divergence between the prior P and posterior Q distributions over hypotheses (one standard form is written out after this list).
3) The bounds show that better generalization requires a smaller divergence between P and Q, meaning the training process should not alter the distribution of hypotheses too much. This provides insights into reducing overfitting in deep learning models.
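For concreteness, one standard form of such a bound (a McAllester-style statement; the slides may use a different variant) is:

```latex
% With probability at least 1 - \delta over an i.i.d. sample of size m,
% simultaneously for all posteriors Q over hypotheses:
R(Q) \;\le\; \widehat{R}(Q) + \sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \ln\frac{2\sqrt{m}}{\delta}}{2m}}
% where R(Q) is the expected true risk and \widehat{R}(Q) the empirical
% (training) risk under the posterior Q.
```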
"Scalable Link Discovery for Modern Data-Driven Applications" as presented in the 15th International Semantic Web Conference ISWC, Doctoral Consortium, October 18th, 2016, held in Kobe, Japan
This work was supported by grants from the EU H2020 Framework Programme provided for the project HOBBIT (GA no. 688227).
This document summarizes locality sensitive hashing (LSH) for approximate near neighbor search in high dimensional spaces. LSH works by using hash functions that map similar points to the same buckets with high probability, allowing efficient retrieval of approximate near neighbors. The document outlines how LSH can solve the (c,R)-approximate near neighbor problem in sublinear time, discusses analysis of success probability and query time, and gives an example with preprocessing in O(N√N log N) time and queries in O(√N log N) time.
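A minimal sketch of one classic LSH family (random hyperplanes for cosine similarity, chosen here for brevity; the slides' sublinear-time analysis applies to LSH families in general): similar points agree on most sign bits, and points whose signatures match on some band of bits become candidate neighbors to be checked exactly.

```python
import numpy as np

def lsh_buckets(points, n_bits=32, n_bands=4, seed=0):
    """Build per-band hash tables mapping a band of signature bits to the
    ids of points sharing that band. Parameters are illustrative; the
    (c,R)-guarantees come from tuning bits/bands per the analysis."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((n_bits, points.shape[1]))
    sigs = points @ planes.T > 0          # one sign bit per hyperplane
    width = n_bits // n_bands
    tables = [dict() for _ in range(n_bands)]
    for i, s in enumerate(sigs):
        for b, table in enumerate(tables):
            key = tuple(s[b * width:(b + 1) * width])
            table.setdefault(key, []).append(i)   # bucket of candidates
    return tables
```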
Mining Frequent Closed Graphs on Evolving Data Streams (Albert Bifet)
Graph mining is a challenging task by itself, and even more so when processing data streams which evolve in real-time. Data stream mining faces hard constraints regarding time and space for processing, and also needs to provide for concept drift detection. In this talk we present a framework for studying graph pattern mining on time-varying streams and large datasets.
EuroPython 2017 - PyData - Deep Learning your Broadband Network @ HOME (HONGJOO LEE)
A 45-minute talk about collecting home network performance measurements, analyzing and forecasting time series data, and building an anomaly detection system.
In this talk, we will go through the whole process of data mining and knowledge discovery. First we write a script to run a speed test periodically and log the metric. Then we parse the log data, convert it into a time series, and visualize the data for a certain period.
Next we conduct some data analysis: finding trends, forecasting, and detecting anomalous data. Several statistical and deep learning techniques are used for the analysis, including ARIMA (Autoregressive Integrated Moving Average) and LSTM (Long Short-Term Memory).
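A minimal sketch of the ARIMA part of such a pipeline using statsmodels; the file name, the (2, 1, 2) order, and the 3-sigma residual rule are illustrative assumptions, not choices from the talk:

```python
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Assume `speedtest.log.csv` (hypothetical) holds periodic speed-test
# measurements indexed by timestamp.
series = pd.read_csv("speedtest.log.csv", index_col=0, parse_dates=True).squeeze()

model = ARIMA(series, order=(2, 1, 2))   # (p, d, q) chosen for illustration
fitted = model.fit()
forecast = fitted.forecast(steps=24)     # next 24 measurement intervals

# Flag anomalies as points far outside the model's in-sample predictions.
resid = fitted.resid
anomalies = series[abs(resid) > 3 * resid.std()]
print(forecast.head(), anomalies)
```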
This document discusses computing commonalities between SPARQL conjunctive queries. It defines the concept of a least general generalization (lgg) of queries: a most general query that entails each of the input queries. The document presents definitions for the lgg of basic graph pattern queries in SPARQL with respect to a set of RDF entailment rules and RDFS constraints. It focuses on computing the lgg of a set of queries by iteratively taking the lgg of query pairs. The goal is to study computing lggs in the conjunctive fragment of SPARQL, with applications such as query optimization and recommendation.
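The core of the lgg construction is anti-unification of terms: keep what the two patterns share and introduce a shared variable where they differ. A minimal sketch on single triple patterns, ignoring the entailment rules and RDFS constraints the paper additionally handles:

```python
def lgg_triple(t1, t2, fresh):
    """Least general generalization of two triple patterns: keep terms where
    the patterns agree; replace each disagreeing pair of terms with a
    variable (the same variable for the same pair, so joins are preserved)."""
    out = []
    for a, b in zip(t1, t2):
        if a == b:
            out.append(a)
        else:
            if (a, b) not in fresh:
                fresh[(a, b)] = f"?v{len(fresh)}"
            out.append(fresh[(a, b)])
    return tuple(out)

fresh = {}
print(lgg_triple((":alice", ":knows", ":bob"), (":carol", ":knows", ":bob"), fresh))
# -> ('?v0', ':knows', ':bob')
```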
Workflow Allocations and Scheduling on IaaS Platforms, from Theory to Practice (Frederic Desprez)
This document summarizes a joint workshop on workflow allocations and scheduling on Infrastructure as a Service (IaaS) platforms. It discusses using on-demand resources to more efficiently allocate workflows compared to static allocations. It proposes two algorithms, Eager and Deferred, to determine allocations within a given budget limit. Simulations using synthetic workflows showed Deferred guarantees budget constraints while Eager is faster. For small applications and budgets, Deferred is preferred. For larger applications and budgets approaching task parallelism saturation, Eager performs better. The document also discusses a prototype system using Nimbus, Phantom and DIET to deploy workflows on IaaS resources.
The document discusses information theory concepts like entropy, joint entropy, conditional entropy, and mutual information. It then discusses how these concepts relate to generalization in deep learning models. Specifically, it explains that the PAC-Bayesian bound is data-dependent, so models with high VC dimension can still generalize if the data is clean, resulting in low KL divergence between the prior and posterior distributions.
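For reference, the quantities mentioned are the standard ones for discrete random variables X and Y:

```latex
H(X)       = -\sum_{x} p(x)\,\log p(x)                      % entropy
H(X,Y)     = -\sum_{x,y} p(x,y)\,\log p(x,y)                % joint entropy
H(X \mid Y) = H(X,Y) - H(Y)                                 % conditional entropy
I(X;Y)     = H(X) + H(Y) - H(X,Y) = H(X) - H(X \mid Y)      % mutual information
```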
This document discusses speaker diarization, which is the process of segmenting an audio stream into homogeneous segments according to speaker identity. It covers feature extraction methods like MFCCs, segmentation using Bayesian Information Criteria to compare Gaussian mixture models, and clustering algorithms like k-means and hierarchical agglomerative clustering. Dendrogram visualizations are used to identify natural speaker clusters. The overall goal is to partition audio recordings of discussions or debates into homogeneous segments to attribute speech segments to individual speakers.
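For the segmentation step, one common BIC-based change-detection criterion (a standard formulation; the slides may parameterize it differently) compares modeling a window Z = X ∪ Y of N frames with a single Gaussian against modeling X and Y separately, hypothesizing a speaker change where ΔBIC > 0:

```latex
\Delta\mathrm{BIC} = \frac{N}{2}\log|\Sigma_Z|
  - \frac{N_X}{2}\log|\Sigma_X| - \frac{N_Y}{2}\log|\Sigma_Y|
  - \frac{\lambda}{2}\left(d + \frac{d(d+1)}{2}\right)\log N
```

where d is the feature dimension (e.g., the number of MFCCs), the Σ are sample covariances, and λ is a penalty weight.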
The document describes a data structure called a Compact Dynamic Rewritable Array (CDRW) that compactly stores arrays where each entry can be dynamically rewritten. It supports creating an array of size N where each entry is initially empty (0 bits), setting an entry to a value of at most k bits, and getting an entry's value. The goal is to use space close to the minimum possible, the sum of the entries' lengths, while supporting these operations in O(1) time. The document presents solutions using compact hashing that achieve O(1) time for get and set using (1+ε) times the minimum space plus O(N) bits, for any constant ε > 0. Experimental results show these perform well in terms of both time and space.
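To fix the interface (only the interface; a plain dict makes no attempt at the compact-hashing space bound that is the point of the structure), a sketch of the three operations:

```python
class RewritableArraySketch:
    """Interface sketch of the CDRW operations described above, backed by an
    ordinary dict. The actual data structure meets the same semantics in O(1)
    time within roughly the sum of the entries' bit lengths."""
    def __init__(self, n):
        self.n = n
        self.entries = {}            # index -> value; absent means empty (0 bits)

    def set(self, i, value, k):
        assert 0 <= i < self.n and value.bit_length() <= k
        self.entries[i] = value      # the real structure stores only ~k bits here

    def get(self, i):
        return self.entries.get(i, 0)
```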
This document provides information about options for a 401(k) account when leaving a job or retiring. The main options are leaving the money in the current 401(k), rolling it over to an IRA, transferring to a new employer's 401(k), or withdrawing the funds. Rolling over to a Homestead Funds IRA is presented as one choice that provides investment options and control over access to funds. Key details are provided about rolling over to a Roth IRA and the tax implications. Overall the document aims to help readers understand their choices for managing 401(k) savings after leaving a job.
Julia Stoyanovich - Making interval-based clustering rank-aware (yaevents)
This document discusses rank-aware clustering of interval-based data. It introduces the problem of finding clusters in datasets where attributes are correlated in complex ways and where the goal is to discover clusters that correlate with a specified ranking function. It presents the BARAC algorithm, a bottom-up approach for discovering such rank-aware clusters. BARAC builds ranked intervals, merges neighboring intervals based on a rank-aware locality measure, and joins intervals to form maximal clusters that meet a rank-aware clustering quality threshold. The document evaluates BARAC on a real-world dating preferences dataset, finding that it effectively discovers meaningful clusters and scales to large datasets.
Tips for Communicating With a Worldwide Association (StarChapter)
This document provides tips for communicating with a worldwide association membership using email, forums, and web-based conferences. It recommends using email for newsletters and events, forums to allow two-way communication and feedback, and web conferences for interactive presentations and discussions. The document stresses making use of available technology to keep geographically dispersed members connected and utilizing their diverse perspectives.
Association Events: Don’t Make These Top 5 Tech Mistakes (StarChapter)
This document outlines 5 common technology mistakes to avoid when planning association events. The mistakes include: 1) using mail/fax for registration instead of online registration software, 2) having poor quality Wi-Fi that frustrates attendees, 3) equipment failures or presenters who are not familiar with the equipment, 4) creating hashtags for social media promotion at the last minute, and 5) not providing presentation slides to attendees after the event. The document provides tips and reasons to avoid each mistake to ensure successful use of technology and a positive experience for attendees.
Document ranking using QPRP with concept of multi-dimensional subspace (Prakash Dubey)
This presentation discusses a project titled "Document Ranking Using QPRP with Concept of Multi-Dimensional Subspace". It was presented by Prakash Kumar Dubey and guided by Mr. Sourish Dhar and Mr. Bhagaban Swain of the Department of IT. The presentation provides an overview of the project, including an introduction to information retrieval, classical IR models such as Boolean, vector space, and probabilistic models. It then discusses quantum probability and how it can be applied to document ranking. The presentation outlines the proposed solution, data collection and implementation, and concludes with future work.
Text clustering involves grouping text documents into clusters such that documents within a cluster are similar to each other and dissimilar to documents in other clusters. Common text clustering methods include bisecting k-means clustering, which recursively partitions clusters, and agglomerative hierarchical clustering, which iteratively merges clusters. Text clustering is used to automatically organize large document collections and improve search by returning related groups of documents.
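A minimal sketch of bisecting k-means on TF-IDF vectors, using scikit-learn's KMeans for the 2-way splits; splitting the largest cluster is a simplifying assumption (implementations often split by criteria such as highest SSE instead):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def bisecting_kmeans(X, k, seed=0):
    """Start with one cluster; repeatedly split the largest cluster with
    2-means until k clusters remain. Returns lists of row indices."""
    clusters = [np.arange(X.shape[0])]
    while len(clusters) < k:
        idx = max(range(len(clusters)), key=lambda c: len(clusters[c]))
        members = clusters.pop(idx)
        labels = KMeans(n_clusters=2, random_state=seed, n_init=10).fit_predict(X[members])
        clusters += [members[labels == 0], members[labels == 1]]
    return clusters

docs = ["grouping similar documents", "documents in clusters", "unrelated text"]
X = TfidfVectorizer().fit_transform(docs).toarray()
print(bisecting_kmeans(X, 2))
```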
The document discusses various techniques for information retrieval and language modeling approaches to IR, including:
- Clustering documents into similar groups to aid in retrieval
- Using term frequency-inverse document frequency (TF-IDF) to measure word importance in documents
- Language models that represent documents and queries as probability distributions over words
- Smoothing language models to address data sparsity issues (see the sketch after this list)
- Cluster-based scoring methods that incorporate information from query-relevant document clusters
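A minimal sketch of the language-modeling score with smoothing, here Jelinek-Mercer interpolation with the collection model (one of several smoothing schemes such slides typically cover); without the collection term, any unseen query word would zero out the document's score:

```python
from collections import Counter

def query_likelihood(query, doc, collection, lam=0.5):
    """Score a document by the smoothed probability its language model
    assigns to the query: P(w|d) mixed with the collection model P(w|C)."""
    d, c = Counter(doc), Counter(collection)
    score = 1.0
    for w in query:
        p_d = d[w] / len(doc)
        p_c = c[w] / len(collection)
        score *= lam * p_d + (1 - lam) * p_c
    return score

doc = "cluster based retrieval methods".split()
collection = "cluster based retrieval methods language models for retrieval".split()
# "models" never occurs in doc, yet the smoothed score stays nonzero:
print(query_likelihood("retrieval models".split(), doc, collection))
```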
This document provides an introduction and overview of document clustering techniques in information retrieval. It discusses motivations for clustering documents, such as improving search recall and organizing search results. It covers common clustering algorithms like K-means and hierarchical clustering, how they work, and considerations like choosing the number of clusters. The document uses examples and diagrams to illustrate clustering concepts and algorithms.
This document provides an introduction and overview of document clustering techniques in information retrieval. It discusses motivations for clustering documents, different document representations, evaluation criteria, and clustering algorithms including partitional algorithms like K-means and hierarchical algorithms. It provides examples and discusses issues like determining the optimal number of clusters to generate. The overall summary is that document clustering groups similar documents together to help with tasks like document navigation, improving search recall, and organizing search results.
The document proposes a novel approach for document and feature reduction in text categorization using prototypes and rough sets. It introduces a prototype-based algorithm to reduce documents while preserving classification accuracy. A rough set-based method is also presented to select a subset of relevant features. The methods are evaluated on benchmark datasets and are shown to improve both classification performance and computational efficiency compared to baseline methods.
Very useful for cluster analysis; supportive for engineering students as well as IT students. It also provides examples for every topic, which help with numerical problems. Good reading material.
This document provides an introduction to document clustering and clustering algorithms. It discusses how clustering can be used in information retrieval applications like organizing search results and improving search recall. It also covers different types of clustering algorithms like partitioning algorithms (such as K-means) and hierarchical algorithms. Key steps of the K-means and hierarchical agglomerative clustering algorithms are described.
The document discusses probabilistic retrieval models in information retrieval. It provides an overview of older models like Boolean retrieval and vector space models. The main focus is on probabilistic models like BM25 and language models. It explains key concepts in probabilistic IR like the probability ranking principle, using Bayes' rule to estimate the probability that a document is relevant given features of the document, and estimating probabilities based on the frequencies of terms in relevant documents. The goal is to rank documents based on the probability of relevance to the query.
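As a concrete instance of this family, the BM25 scoring function (a standard formulation; k_1 and b are tuning constants, |D| the document length, avgdl the collection's average document length):

```latex
\mathrm{score}(D, Q) = \sum_{q \in Q} \mathrm{IDF}(q) \cdot
  \frac{f(q, D)\,(k_1 + 1)}{f(q, D) + k_1\left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)}
```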
IWSM2014 - An analogy-based approach to estimation of software development ef... (Nesma)
The document discusses fuzzy analogy, a technique for software effort estimation that can handle categorical data. It introduces fuzzy analogy and fuzzy k-modes clustering. Fuzzy k-modes is used to cluster similar software projects from a repository based on categorical attributes into homogeneous groups. Fuzzy analogy then assesses the similarity between projects based on their membership to clusters and estimates the effort of a new project as a weighted average of similar past projects' efforts. The document evaluates fuzzy analogy on 194 projects from the ISBSG repository selected based on data quality and attributes criteria.
Clustering is the process of grouping similar objects together. Hierarchical agglomerative clustering builds a hierarchy by iteratively merging the closest pairs of clusters. It starts with each document in its own cluster and successively merges the closest pairs of clusters until all documents are in one cluster, forming a dendrogram. Different linkage methods, such as single, complete, and average linkage, define how the distance between clusters is calculated during merging. Hierarchical clustering provides a multilevel clustering structure but has computational complexity of O(n³) in general.
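A minimal sketch of the procedure using SciPy, with toy 2-D points standing in for document vectors; the method argument selects single, complete, or average linkage as described:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [9, 9]], dtype=float)

Z = linkage(X, method="average")                  # also: "single", "complete"
labels = fcluster(Z, t=3, criterion="maxclust")   # cut the tree into 3 clusters
print(labels)

# dendrogram(Z) would draw the merge tree described above (needs matplotlib).
```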
(1) The document describes probabilistic methods for structured document classification that were submitted to the INEX'07 Document Mining track.
(2) It evaluates five runs using naive Bayes and OR gate Bayesian network classifiers with different document representations and feature selection techniques.
(3) The best run used an OR gate classifier with a better weight approximation on only text, achieving a microaverage of 0.78998 and macroaverage of 0.76054.
This document outlines clustering algorithms for large datasets. It discusses k-means clustering and extensions like k-means++ that improve initialization. It also covers spectral relaxation methods that reformulate k-means as a trace maximization problem to address local minima. Additionally, it proposes landmark-based clustering algorithms for biological sequences that select landmarks in one pass and assign sequences to the nearest landmark using hashing to search for neighbors. The document provides analysis of the time and space complexity of these algorithms as well as assumptions about separability and cluster size.
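The k-means++ seeding step mentioned above is short enough to sketch directly: the first center is chosen uniformly, and each later center is drawn with probability proportional to its squared distance from the nearest center chosen so far.

```python
import numpy as np

def kmeans_pp_init(X, k, rng=np.random.default_rng(0)):
    """k-means++ seeding; Lloyd's iterations then run as usual from these."""
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # squared distance of every point to its nearest chosen center
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)
```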
This document provides an overview of text classification and the Naive Bayes algorithm for text classification. It begins by defining text classification and giving examples like spam filtering and document classification. It then explains supervised classification and the goal of learning a classifier from labeled training data. The document spends several slides explaining the Naive Bayes algorithm for text classification, including the Naive Bayes assumption of conditional independence between features. It discusses parameter estimation and smoothing techniques to avoid overfitting. Finally, it compares the multivariate Bernoulli and multinomial Naive Bayes models for text classification.
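A minimal multinomial Naive Bayes sketch with scikit-learn; the tiny corpus is invented for illustration, and alpha=1.0 is the Laplace (add-one) smoothing that corresponds to the smoothing discussion above:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train = ["cheap pills buy now", "meeting agenda attached",
         "win money now", "project status report"]
labels = ["spam", "ham", "spam", "ham"]

vec = CountVectorizer()
X = vec.fit_transform(train)                      # word-count features
clf = MultinomialNB(alpha=1.0).fit(X, labels)     # add-one smoothing

print(clf.predict(vec.transform(["buy cheap now"])))  # -> ['spam']
```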
The document discusses K-means clustering, an unsupervised machine learning algorithm that partitions observations into k clusters where each observation belongs to the cluster with the nearest mean. It describes how K-means aims to minimize intra-cluster similarity while maximizing inter-cluster similarity. The algorithm works by first selecting k random cluster centroids, then iteratively reassigning observations to the closest centroid and recalculating the centroids until convergence is reached. It also addresses computational complexity, extensions, tools for implementing K-means, and examples of applications like image compression, recommendation systems, and yield management.
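A minimal NumPy sketch of the iteration just described (random seeding for brevity; see the k-means++ sketch earlier for a better initialization):

```python
import numpy as np

def kmeans(X, k, iters=100, rng=np.random.default_rng(0)):
    """Lloyd's algorithm: assign each point to the nearest centroid,
    recompute centroids as cluster means, repeat until stable."""
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        new = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids
```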
The document discusses various model-based clustering techniques for handling high-dimensional data, including expectation-maximization, conceptual clustering using COBWEB, self-organizing maps, subspace clustering with CLIQUE and PROCLUS, and frequent pattern-based clustering. It provides details on the methodology and assumptions of each technique.
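As one concrete example of the model-based family, expectation-maximization for a Gaussian mixture, sketched with scikit-learn on invented 2-D data standing in for the high-dimensional case:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)
labels = gmm.predict(X)              # hard assignments from soft posteriors
posteriors = gmm.predict_proba(X)    # the E-step responsibilities
```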
This document provides an overview of probabilistic approaches to information retrieval. It discusses why probabilities are useful for IR given the inherent uncertainty. It covers the Probability Ranking Principle, which aims to rank documents by estimated probability of relevance. Other probabilistic techniques discussed include probabilistic indexing, probabilistic inference using logic representations, and using Bayesian networks for IR. The document notes open issues with some of these approaches and concludes by surveying existing survey papers on probabilistic IR.
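For reference, the Probability Ranking Principle says to return documents in decreasing order of the estimated probability of relevance, with Bayes' rule turning that into quantities one can estimate from document features:

```latex
P(R=1 \mid d, q) = \frac{P(d \mid R=1, q)\,P(R=1 \mid q)}{P(d \mid q)}
% In practice one ranks by the rank-equivalent odds
% P(d \mid R=1, q) \,/\, P(d \mid R=0, q), estimating the class-conditional
% terms from term frequencies in known relevant and non-relevant documents.
```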
Similar to The Optimum Clustering Framework: Implementing the Cluster Hypothesis (20)
How to Teach Robots to Test Web Interfaces. Artem Eroshenko, Ilya Katsev, Ya... (yaevents)
Artem Eroshenko, Yandex
Graduated from the Faculty of Mathematics and Mechanics of Saint Petersburg State University; currently in the third year of a PhD program in control theory. Since 2008 he has worked at Yandex on test automation for search results and search-related services. Since 2011 he has coordinated the test tools development group.
Ilya Katsev, Yandex
Graduated from the Faculty of Mathematics and Mechanics of Saint Petersburg State University and defended a PhD thesis in game theory at VU University Amsterdam (Netherlands). At Yandex he works on test automation (simulating user actions and analyzing the results).
Presentation topic
How to Teach Robots to Test Web Interfaces.
Key points
The talk presents a tool that checks web interfaces for errors on its own. Its key quality is the ability to automatically discover related elements on a page and build models that can then be tested automatically. We will not only propose ideas for using and extending this system, but also show its prototype.
Building Compound Blocks in the bemhtml Template Engine. Sergey Berezhnoy, Ya... (yaevents)
Sergey Berezhnoy, Yandex
Has worked as a web developer at Yandex since 2005. In that time he has taken part in developing a whole range of services, such as Blog Search, Ya.ru, Yandex.Mail, Search, Images, and Video. Besides public projects, he actively works on various internal tools for the full site development cycle. More than anything in the world he loves his wife and programming.
Presentation topic
Building Compound Blocks in the bemhtml Template Engine.
Key points
The domain-specific template engine bemhtml lets you create block templates following the BEM methodology. Compilation produces fast plain JavaScript templates that can run both on the server and on the client. This technology is used in the bem-bl block library and on several Yandex services. The workshop demonstrates one of bemhtml's advantages: building compound blocks. You will learn about the idea and syntax of the template engine, get ready-made recipes for typical tasks, and see an analysis of bemhtml's capabilities.
i-bem.js: JavaScript in BEM Terms. Elena Glukhova, Varvara Stepanova, Yandex (yaevents)
Elena Glukhova, Yandex
Front-end developer of web interfaces. Has worked at Yandex since 2008.
Varvara Stepanova, Yandex
Graduated from Petrozavodsk State University. Has worked at Yandex as an interface developer since 2008, building Yandex.Answers and Yandex.Fotki. For the last year and a half Elena Glukhova and Varvara Stepanova have been working together on an internal interface framework that helps build Yandex services uniformly, and recently also on a similar open source interface framework.
Presentation topic
i-bem.js: JavaScript in BEM Terms.
Key points
When building sites with the BEM methodology, we use a single domain model across all technologies: CSS, templates, and JavaScript. To make this possible, the bem-bl block library implements the core of a client-side JS framework that lets you work with a page in BEM terms, at a level of abstraction above the DOM representation. This workshop shows the key points of using this approach to write client-side JS. We build a compound block that uses the JS functionality of the small blocks it contains. As a result, everything works, with no copy-paste.
A House from Ready-Made Bricks. A Block Library, Tuning, Tools. Elena Glukhov... (yaevents)
Elena Glukhova, Yandex
Front-end developer of web interfaces. Has worked at Yandex since 2008.
Varvara Stepanova, Yandex
Graduated from Petrozavodsk State University. Has worked at Yandex as an interface developer since 2008, building Yandex.Answers and Yandex.Fotki. For the last year and a half Elena Glukhova and Varvara Stepanova have been working together on an internal interface framework that helps build Yandex services uniformly, and recently also on a similar open source interface framework.
Presentation topic
A House from Ready-Made Bricks. A Block Library, Tuning, Tools.
Key points
All sites resemble one another a little. If you do web development for many years, you accumulate practices and standard solutions to common problems. The result of our accumulation is the open source block library bem-bl, which we develop on GitHub. The library follows the BEM methodology and lets you build web pages from blocks that already have template, CSS, and JS implementations. The workshop will demonstrate how to use ready-made blocks from this library and how to modify them for the needs of your own site. The bem-tools console utilities are used to work with the library's files.
Models in Professional Software Engineering and Testing. Alexander Petren... (yaevents)
Alexander Petrenko, ISP RAS
Professor, Doctor of Physical and Mathematical Sciences, head of the software engineering technology department at the Institute for System Programming of the Russian Academy of Sciences (ISP RAS), and professor at the CMC faculty of Moscow State University. His main work is in requirements formalization and test generation from formalized requirements and formal models (model-based testing, MBT), with applications to testing operating systems and distributed systems, compiler testing, microprocessor design verification, and formalizing standards for operating system APIs and telecommunication protocols. Co-chair of the organizing committees of the International MBT workshop (http://www.mbrworkshop.org/), the Spring Young Researcher Colloquium on Software Engineering, SYRCoSE (http://syrocose.ispras.ru), and the city seminar on program development and analysis technologies TRAP/SDAT (http://sdat.ispras.ru/).
Presentation topic
Models in Professional Software Engineering and Testing.
Key points
Model Based Software Engineering (MBSE) is an extension of the model-driven approach to software development. Unlike, for example, MDA (Model Driven Architecture), MBSE pays substantial attention not only to design and coding proper, but also to the other phases of the life cycle: requirements analysis, verification and validation, and requirements management across all phases. Model Based Testing (MBT) arose chronologically much earlier than MBSE and MDA, but its place in software development was fully revealed only as MBSE developed, so MBT and MBSE should be considered in close connection. The talk will cover the MBSE-MDA-MBT concepts, the main sources and kinds of models used in these approaches, methods for generating tests from models, and well-known tools for them.
Administering Small Services, or One for All and 100 Against One. Roman ... (yaevents)
Roman Andriadi, Yandex
Has worked in Yandex's operations department since 2005. Since 2010 he has led the administration group for communication, content, and internal services.
Presentation topic
Administering Small Services, or One for All and 100 Against One.
Key points
Administration of the communication services began in 2004 with maintaining a dozen servers and the dozen services hosted on them. Over time there were more and more services, the number of tasks around them grew, and the dozen servers turned into a fleet of hundreds of machines split into many heterogeneous clusters. The talk will describe how administration practices evolved as the cluster grew, which tools were used along the way, how we wrote our own management tool, and how it has learned to help us over the years.
Stories About Site Development. Sergey Berezhnoy, Yandex (yaevents)
Sergey Berezhnoy, Yandex
Has worked as a web developer at Yandex since 2005. In that time he has taken part in developing a whole range of services, such as Blog Search, Ya.ru, Yandex.Mail, Search, Images, and Video. Besides public projects, he actively works on various internal tools for the full site development cycle. More than anything in the world he loves his wife and programming.
Presentation topic
Stories About Site Development.
Key points
We will talk about the site development problems that have come up at Yandex at different times and how we solved them. The talk is intended as a dialogue with developers who face similar problems. The result will be a small collection of technological stories to reflect on.
Developing Android Applications in C++. Yuri Bereza, Shturmann (yaevents)
Yuri Bereza, Shturmann
Graduated from the instrument engineering faculty of the Moscow State Academy of Instrument Engineering and Informatics. In 2004 he joined the mobile development department at MacCentre and has developed for a huge number of mobile platforms: Windows Mobile, Symbian, Android, embedded Linux, and iOS. He currently works as a group lead at Content Master, where he develops the Shturmann car navigation system.
Presentation topic
Developing Android Applications in C++.
Key points
The Android platform grows more popular every year. Although the main language for Android application development is Java, programmers often have to use C or C++ to write cross-platform applications or to use third-party libraries. Unfortunately, C++ development for the Android platform is rather sparsely documented, and you often have to spend a lot of time searching for the information you need. The talk will answer the main questions across the whole development cycle: how to write C++ code that runs on Android, how to debug it and find errors when applications crash, whether it is possible to profile the code, and where to look for further information on these questions.
Cross-Platform Development for Mobile Devices. Dmitry Zhestilevsky... (yaevents)
Dmitry Zhestilevsky, Yandex
Graduated from the Faculty of Experimental and Theoretical Physics of the Moscow Engineering Physics Institute in 2011. Since 2006 he has developed applications (games, business applications) for mobile devices on the J2ME, BREW, Windows Mobile, Android, and iOS platforms. At Yandex since 2010, he works on the architecture of mobile mapping services. His interests include cross-platform development for mobile devices and 3D visualization.
Presentation topic
Cross-Platform Development for Mobile Devices.
Key points
Application development for embedded devices is heavily fragmented by the abundance of OSes (Android, iOS, WM, WP7, Symbian, Bada). Developing for each platform independently leads to proportional growth in the number of participants in the development process and in the size of the maintained code base. Introducing shared code that runs on all platforms, via a Platform Abstraction Layer with a unified interface, can cut these costs, while platform-specific parts, such as the UI, can still be used to give the application a native look and feel. The talk examines the process of introducing shared components into Yandex's mobile applications, using Street Panoramas as an example, along with the difficulties we ran into during development and the ways we resolved them.
The Most Sophisticated Techniques Used by Bootkits and Polymorphic Viruses. Vyacheslav Z... (yaevents)
Vyacheslav Zakorzhevsky, Kaspersky Lab
Joined Kaspersky Lab in mid-2007 as a virus analyst. In late 2008 he became a senior virus analyst in the heuristic detection group. His interests include the study of polymorphic viruses and heavily mutating malware. He also follows current trends in obfuscation, anti-emulation, and other methods used by malicious software.
Presentation topic
The Most Sophisticated Techniques Used by Bootkits and Polymorphic Viruses.
Key points
There is a common belief that modern malware is fairly simple and written by unskilled people. This talk aims to dispel that myth. The presentation will describe three malware samples that use non-trivial, sophisticated methods in the course of their operation. In particular, it will examine how modern bootkits, which are gaining more and more momentum, work. With the other two examples we will illustrate the ingenuity of virus writers who try to make life as hard as possible for researchers and antivirus companies: in one case they used their own virtual machine combined with an EPO infection technique, and in the other they mapped in null virtual addresses to store their data there.
Vulnerability Scanning with a Yandex Flavor. Taras Ivashchenko, Yandex (yaevents)
Taras Ivashchenko, Yandex
Information security administrator at Yandex. Information security specialist, free software advocate, author of Termite and xCobra, and a contributor to the W3AF project.
Presentation topic
Vulnerability Scanning with a Yandex Flavor.
Key points
The talk will describe how Yandex introduced vulnerability scanning of its services as one of the security controls within the SDLC (Secure Development Life Cycle). It covers scanning for vulnerabilities at the service testing stage as well as scanning services already in production. We will look at the problems we ran into and explain why we chose open source software (the w3af vulnerability scanner), adapted to our needs, as the core mechanism.
Hadoop Scalability at Facebook. Dmitriy Molkov, Facebook (yaevents)
Dmitriy Molkov, Facebook
Bachelor of Applied Mathematics from Taras Shevchenko National University of Kyiv (2007). Master of Computer Science from Stony Brook University (2009). Hadoop HDFS committer since 2011. Member of the Hadoop team at Facebook since 2009.
Presentation topic
Hadoop Scalability at Facebook.
Key points
Hadoop and Hive are an excellent toolkit for storing and analyzing petabytes of information at Facebook. Working with data at that scale, the Hadoop team at Facebook runs into Hadoop scalability and efficiency problems every day. The talk will go into some of the details of optimizations in different parts of Facebook's Hadoop infrastructure that make it possible to provide a high-quality service: for example, optimizing storage cost in multi-petabyte HDFS clusters, increasing system throughput, and reducing system downtime through High Availability work on HDFS.
Controlling the Beasts: Tools for Managing and Monitoring Distributed Syst... (yaevents)
Alexander Kozlov, Cloudera Inc.
Alexander Kozlov, a senior architect at Cloudera Inc., works with large companies, many of them in the Fortune 500, on projects building systems for analyzing large amounts of data. He completed graduate studies at the physics faculty of Moscow State University and then earned a Ph.D. at Stanford. After finishing his studies and before Cloudera, he worked on statistical data analysis and the related computer technologies at SGI, Hewlett-Packard, and the startup Turn.
Presentation topic
Controlling the Beasts: Cloudera's Tools for Managing and Monitoring Distributed Systems.
Key points
Maintaining distributed systems consisting of thousands of computers is a hard problem. Cloudera, which specializes in building distributed technologies, has developed a set of tools for centralized management of distributed Hadoop/HBase clusters. Hadoop and HBase are Apache Software Foundation projects, and their use for analyzing semi-structured data is accelerating worldwide. This talk will cover SCM, a system for configuring, tuning, and managing Hadoop/HBase, and Activity Monitor, a system for monitoring a range of OS and Hadoop/HBase metrics, as well as how Cloudera's approach differs from existing monitoring solutions (Tivoli, xCat, Ganglia, Nagios, etc.).
Unit Testing and Google Mock. Vlad Losev, Google (yaevents)
Vladimir Losev, Google
Graduated from the Faculty of Mathematics and Mechanics of Saint Petersburg State University in 1995. He has worked at Motorola, Fair Isaac, and Yahoo. Since 2008 he has worked at Google, in a group dealing with engineering productivity.
Presentation topic
Unit Testing and Google Mock.
Key points
In unit tests, each element of a program is tested separately, in isolation from the others. Such tests run very fast, so they can be launched at any time, which makes it possible to catch defects at the earliest stages of development. However, testing an object in isolation from the others requires imitating the behavior of the objects connected to it, which in C++ is a rather tedious exercise. Google Mock, a library developed at Google for creating and using mock objects, makes it possible to greatly simplify this process and speed up test writing. The talk will cover the library's principles and capabilities, examples of its use, and its internal design.
C++11 (formerly known as C++0x) is the new C++ language standard. Dave Abrahams, BoostPro Computing
Dave Abrahams, BoostPro Computing
He is a founding member of Boost.org and an active participant in the ISO C++ standards committee. His broad range of experience in the computer industry includes shrink-wrap software development, embedded systems design and natural language processing. He has authored eight Boost libraries and has made contributions to numerous others. Dave made his mark on C++ standardization by developing a conceptual framework for understanding exception-safety and applying it to the C++ standard library. He created the first exception-safe standard library implementation and, with Greg Colvin, drafted the proposals that eventually became the standard library’s exception safety guarantees.
Presentation topic:
C++11 (formerly known as C++0x) is the new C++ language standard. Dave Abrahams, BoostPro Computing.
Key points:
The ISO C++ standardization committee has just unanimously approved its final draft international standard, and it's chock full of new features. Though a few of the features have been available for years, some are brand new, and nobody really knows what it's like to program in this new C++ language. As with C++03, Boost.org is expected to take a leading role in exploiting C++11. In this talk, I'll give an overview of the most important new developments.
Why an Ordinary Programmer Should Know Languages Almost Nobody Writes In. Alexey Voinov, Yandex
Alexey Voinov, Yandex
Graduated from Bauman Moscow State Technical University in 1998. Has devoted part of his life to free software. Known for his love of languages, algorithmic and human alike, natural as well as constructed. Has worked at Yandex since 2009, developing Yandex.Mail.
Presentation topic:
Why an ordinary programmer should know languages almost nobody writes in.
Key points:
There is a category of programming languages that most programmers consider strange at best: languages such as Haskell, *ML, Lisp, and Q. These 'strange' languages do not take root in industrial software development because they do not allow writing standard 'industrial' code. They can, however, be very good for inventing techniques that improve industrial code, and many of those techniques later become industry standards. Knowing 'strange' languages is particularly useful when external circumstances make it impossible to improve industrial code radically, but it can still be improved in small steps.
In Search of Mathematics. Mikhail Denisenko, Nigma
Mikhail Denisenko, Nigma
Graduated from the Faculty of Computational Mathematics and Cybernetics of Moscow State University. Is finishing a dissertation on the mathematical aspects of information security. Did research on video sequence processing and computer security at Intel. Since 2009 he has been a senior developer of the mathematics service at Nigma.ru; since 2011, a system architect of the ITim.vn search engine.
Presentation topic:
In Search of Mathematics.
Key points:
Nigma-Mathematics is a service that lets users solve various mathematical problems (simplify expressions, solve equations and systems of equations, etc.) by typing them directly into the search box as plain text. The system recognizes more than a thousand physical and mathematical constants and units of measurement, which allows users to operate on quantities (including solving equations) and receive the answer in the requested units. Besides equations, the system handles all the tasks typical of search-engine calculators and currency converters. The talk describes the overall architecture of the service, the basic and new algorithms of the symbolic computation engine (algorithms for solving equations and inequalities, for tracking the domain of admissible values, for analyzing functions, and so on), as well as speeding up the service, distributing the load on the system, recognizing whether a query is mathematical, currency conversion, and metric quantities.
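To give a flavor of the symbolic layer such a service needs, here is a toy sketch (my own illustration, not Nigma's engine) that parses a plain-text equation with SymPy and solves it:

```python
import sympy as sp

# Toy illustration: turn the text a user might type into a symbolic
# equation and solve it, the kind of task the service performs.
x = sp.symbols("x")
lhs, rhs = "2*x + 3", "7"
equation = sp.Eq(sp.sympify(lhs), sp.sympify(rhs))
print(sp.solve(equation, x))  # -> [2]
```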
Using classifiers to compute similarities between face images. Prof. Lior Wolf, Tel-Aviv University
Prof. Lior Wolf, Tel-Aviv University
He is a faculty member at the School of Computer Science at Tel-Aviv University. Previously, he was a post-doctoral associate in Prof. Poggio's lab at MIT. He graduated from the Hebrew University, Jerusalem, where he worked under the supervision of Prof. Shashua. He was awarded the 2008 Sackler Career Development Chair, the Colton Excellence Fellowship for new faculty (2006-2008), the Max Shlumiuk award for 2004, and the Rothschild fellowship for 2004. His joint work with Prof. Shashua in ECCV 2000 received the best paper award, and their work in ICCV 2001 received the Marr Prize honorable mention. He was also awarded the best paper award at the post-ICCV workshop on eHeritage in 2009. In addition, Lior has held several development, consulting and advisory positions in computer vision companies, including face.com and Superfish, and is a co-founder of FDNA.
Presentation topic:
Using classifiers to compute similarities between images of faces.
Key points:
The One-Shot-Similarity (OSS) is a framework for classifier-based similarity functions. It is based on the use of background samples and was shown to excel in tasks ranging from face recognition to document analysis. In this talk we will present the framework as well as the following results: (1) when using a version of LDA as the underlying classifier, this score is a Conditionally Positive Definite kernel and may be used within kernel-methods (e.g., SVM), (2) OSS can be efficiently computed, and (3) a metric learning technique that is geared toward improved OSS performance.
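To make the classifier-based similarity idea concrete, here is a minimal sketch (my own, not Prof. Wolf's exact formulation): with an LDA-like linear classifier, the score of a one-positive-sample-versus-background classifier has a closed form in terms of the background mean and covariance, and the OSS is the symmetrized average of the two such scores. The function names and the ridge term are illustrative assumptions.

```python
import numpy as np

def one_shot_similarity(x1, x2, background):
    """Symmetric One-Shot-Similarity score (illustrative sketch).

    Trains an LDA-like linear classifier separating {x1} from the
    background set, scores x2 with it, then swaps the roles and
    averages the two scores.
    """
    mu = background.mean(axis=0)
    # Within-class covariance estimated from the background set only
    # (a single positive sample contributes no scatter); a small ridge
    # term keeps the matrix invertible.
    cov = np.cov(background, rowvar=False) + 1e-6 * np.eye(background.shape[1])
    cov_inv = np.linalg.inv(cov)

    def score(pos, probe):
        w = cov_inv @ (pos - mu)     # LDA direction for {pos} vs background
        b = w @ (pos + mu) / 2.0     # threshold at the midpoint
        return w @ probe - b         # signed score of the probe

    return 0.5 * (score(x1, x2) + score(x2, x1))

# Toy usage: two face descriptors and a background sample matrix.
rng = np.random.default_rng(0)
A = rng.normal(size=(200, 16))       # background samples
x1, x2 = rng.normal(size=16), rng.normal(size=16)
print(one_shot_similarity(x1, x2, A))
```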
The Optimum Clustering Framework: Implementing the Cluster Hypothesis
1. A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis
Norbert Fuhr
University of Duisburg-Essen
March 30, 2011
2. Outline
1 Introduction
2 Cluster Metric
3 Optimum clustering
4 Towards Optimum Clustering
5 Experiments
6 Conclusion and Outlook
3. Introduction
4. Motivation
Ad-hoc retrieval:
heuristic models: define a retrieval function, then evaluate to test whether it yields good quality
Probability Ranking Principle (PRP): a theoretic foundation for optimum retrieval; numerous probabilistic models are based on the PRP
Document clustering:
classic approach: define a similarity function and a fusion principle, then evaluate to test whether they yield good quality
Optimum Clustering Principle?
7. Cluster Hypothesis
Original formulation: "closely associated documents tend to be relevant to the same requests" (van Rijsbergen 1979)
Idea of optimum clustering: cluster documents in such a way that, for any request, the relevant documents occur together in one cluster
Redefined document similarity: documents are similar if they are relevant to the same queries
10. The Optimum Clustering Framework
[Figure: schematic overview of the Optimum Clustering Framework]
14. Cluster Metric
15. Defining a Metric Based on the Cluster Hypothesis
General idea: evaluate a clustering with respect to a set of queries.
For each query and each cluster, regard the pairs of documents co-occurring in the cluster:
relevant-relevant: good
relevant-irrelevant: bad
irrelevant-irrelevant: don't care
16. Pairwise Precision
Q: set of queries
D: document collection
R: relevance judgments, R ⊂ Q × D
C: clustering, C = \{C_1, \ldots, C_n\} such that \bigcup_{i=1}^{n} C_i = D and \forall i, j: i \neq j \rightarrow C_i \cap C_j = \emptyset
c_i = |C_i| (size of cluster C_i)
r_{ik} = |\{d_m \in C_i \mid (q_k, d_m) \in R\}| (number of relevant documents in C_i w.r.t. q_k)
Pairwise precision (a weighted average over all clusters):
P_p(D, Q, R, C) = \frac{1}{|D|} \sum_{C_i \in C,\, c_i > 1} c_i \sum_{q_k \in Q} \frac{r_{ik}(r_{ik} - 1)}{c_i(c_i - 1)}
21. Pairwise Precision: Example
P_p(D, Q, R, C) = \frac{1}{|D|} \sum_{C_i \in C,\, c_i > 1} c_i \sum_{q_k \in Q} \frac{r_{ik}(r_{ik} - 1)}{c_i(c_i - 1)}
Query set: a disjoint classification with two classes a and b; three clusters: (aab|bb|aa)
P_p = \frac{1}{7}\bigl(3(\tfrac{1}{3} + 0) + 2(0 + 1) + 2(1 + 0)\bigr) = \frac{5}{7}
A perfect clustering for a disjoint classification would yield P_p = 1; for arbitrary query sets, values > 1 are possible.
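To make the computation concrete, here is a minimal sketch (mine, not from the talk) that evaluates pairwise precision for the (aab|bb|aa) example; encoding each query as the set of indices of its relevant documents is my own choice:

```python
def pairwise_precision(clusters, queries):
    """Pairwise precision P_p: for each cluster with c > 1, sum over queries
    the fraction of ordered document pairs that are both relevant, weight by
    the cluster size, and normalize by the collection size."""
    n_docs = sum(len(c) for c in clusters)
    total = 0.0
    for cluster in clusters:
        c = len(cluster)
        if c <= 1:
            continue  # singleton clusters contribute nothing
        for q in queries:
            r = sum(1 for d in cluster if d in q)  # relevant docs in cluster
            total += c * (r * (r - 1)) / (c * (c - 1))
    return total / n_docs

# (aab|bb|aa): documents 0..6; query "a" is relevant to {0,1,5,6},
# query "b" to {2,3,4}.
clusters = [[0, 1, 2], [3, 4], [5, 6]]
queries = [{0, 1, 5, 6}, {2, 3, 4}]
print(pairwise_precision(clusters, queries))  # 0.714... = 5/7
```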
23. Pairwise Recall
r_{ik} = |\{d_m \in C_i \mid (q_k, d_m) \in R\}| (number of relevant documents in C_i w.r.t. q_k)
g_k = |\{d \in D \mid (q_k, d) \in R\}| (number of relevant documents for q_k)
Pairwise recall (micro recall):
R_p(D, Q, R, C) = \frac{\sum_{q_k \in Q} \sum_{C_i \in C} r_{ik}(r_{ik} - 1)}{\sum_{q_k \in Q,\, g_k > 1} g_k(g_k - 1)}
Example (aab|bb|aa): 2 a pairs (out of 6) and 1 b pair (out of 3) are clustered together, so R_p = (2 + 1)/(6 + 3) = 1/3.
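A matching sketch for pairwise recall, reusing the same encoding (again my own illustrative code):

```python
def pairwise_recall(clusters, queries):
    """Pairwise recall R_p: relevant document pairs kept together in some
    cluster, divided by all relevant pairs in the collection (micro recall)."""
    docs = [d for cluster in clusters for d in cluster]
    numerator = sum(r * (r - 1)
                    for q in queries
                    for cluster in clusters
                    for r in [sum(1 for d in cluster if d in q)])
    denominator = sum(g * (g - 1)
                      for q in queries
                      for g in [sum(1 for d in docs if d in q)]
                      if g > 1)
    return numerator / denominator

clusters = [[0, 1, 2], [3, 4], [5, 6]]
queries = [{0, 1, 5, 6}, {2, 3, 4}]
print(pairwise_recall(clusters, queries))  # 6/18 = 1/3, matching the slide
```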
25. Perfect Clustering
C is a perfect clustering iff there exists no clustering C' such that
P_p(D, Q, R, C) < P_p(D, Q, R, C') \wedge R_p(D, Q, R, C) < R_p(D, Q, R, C')
This is a strong Pareto optimum; more than one perfect clustering is possible.
Example:
P_p(\{d_1, d_2, d_3\}, \{d_4, d_5\}) = P_p(\{d_1, d_2\}, \{d_3, d_4, d_5\}) = 1, with R_p = 2/3
P_p(\{d_1, d_2, d_3, d_4, d_5\}) = 0.6, with R_p = 1
31. Do perfect clusterings form a hierarchy?
[Figure: the clusterings below plotted as points in the (R_p, P_p) plane]
C = \{\{d_1, d_2, d_3, d_4\}\}
C' = \{\{d_1, d_2\}, \{d_3, d_4\}\}
C'' = \{\{d_1, d_2, d_3\}, \{d_4\}\}
C'' is neither a refinement nor a coarsening of C', so these perfect clusterings do not nest into a hierarchy.
32. Optimum clustering
33. Optimum Clustering
Usually the clustering process has no knowledge of relevance judgments, so we switch from external to internal cluster measures:
replace relevance judgments by estimates of the probability of relevance;
this requires a probabilistic retrieval method yielding P(rel|q, d);
then compute the expected cluster quality.
38. Expected Cluster Quality
Pairwise precision:
P_p(D, Q, R, C) = \frac{1}{|D|} \sum_{C_i \in C,\, c_i > 1} c_i \sum_{q_k \in Q} \frac{r_{ik}(r_{ik} - 1)}{c_i(c_i - 1)}
Expected precision:
\pi(D, Q, C) = \frac{1}{|D|} \sum_{C_i \in C,\, |C_i| > 1} \frac{c_i}{c_i(c_i - 1)} \sum_{q_k \in Q} \sum_{(d_l, d_m) \in C_i \times C_i,\, d_l \neq d_m} P(rel|q_k, d_l)\, P(rel|q_k, d_m)
40. Expected Precision
\pi(D, Q, C) = \frac{1}{|D|} \sum_{C_i \in C,\, |C_i| > 1} \frac{c_i}{c_i(c_i - 1)} \sum_{(d_l, d_m) \in C_i \times C_i,\, d_l \neq d_m} \sum_{q_k \in Q} P(rel|q_k, d_l)\, P(rel|q_k, d_m)
Here \sum_{q_k \in Q} P(rel|q_k, d_l)\, P(rel|q_k, d_m) gives the expected number of queries for which both d_l and d_m are relevant.
Transform a document into a vector of relevance probabilities:
\tau^T(d_m) = (P(rel|q_1, d_m), P(rel|q_2, d_m), \ldots, P(rel|q_{|Q|}, d_m))
\pi(D, Q, C) = \frac{1}{|D|} \sum_{C_i \in C,\, |C_i| > 1} \frac{1}{c_i - 1} \sum_{(d_l, d_m) \in C_i \times C_i,\, d_l \neq d_m} \tau^T(d_l) \cdot \tau(d_m)
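In matrix form this is a sum of within-cluster dot products; a small sketch (my own encoding, with T holding one τ vector per row):

```python
import numpy as np

def expected_precision(T, clusters):
    """Expected precision π.

    T        : (n_docs, n_queries) array; row m is τ(d_m), the relevance
               probabilities P(rel|q_k, d_m) for document d_m.
    clusters : list of lists of document indices (a disjoint clustering).
    """
    n_docs = T.shape[0]
    total = 0.0
    for idx in clusters:
        c = len(idx)
        if c <= 1:
            continue
        S = T[idx] @ T[idx].T                        # τ(d_l)·τ(d_m) for all pairs
        total += (S.sum() - np.trace(S)) / (c - 1)   # drop the d_l = d_m terms
    return total / n_docs
```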
42. Expected Recall
R_p(D, Q, R, C) = \frac{\sum_{q_k \in Q} \sum_{C_i \in C} r_{ik}(r_{ik} - 1)}{\sum_{q_k \in Q,\, g_k > 1} g_k(g_k - 1)}
Direct estimation would require estimating the denominator, yielding biased estimates. But the denominator is constant for a given query set, so it can be ignored; compute an estimate for the numerator only:
\rho(D, Q, C) = \sum_{C_i \in C} \sum_{(d_l, d_m) \in C_i \times C_i,\, d_l \neq d_m} \tau^T(d_l) \cdot \tau(d_m)
(The scalar product \tau^T(d_l) \cdot \tau(d_m) gives the expected number of queries for which both d_l and d_m are relevant.)
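The recall surrogate ρ simply drops the normalization; continuing the sketch above:

```python
import numpy as np

def expected_recall_numerator(T, clusters):
    """Recall surrogate ρ: the unnormalized sum, over all clusters, of the
    within-cluster dot products τ(d_l)·τ(d_m) with d_l ≠ d_m."""
    total = 0.0
    for idx in clusters:
        S = T[idx] @ T[idx].T
        total += S.sum() - np.trace(S)   # exclude the d_l = d_m diagonal
    return total
```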
46. Optimum Clustering
C is an optimum clustering iff there exists no clustering C' such that
\pi(D, Q, C) < \pi(D, Q, C') \wedge \rho(D, Q, C) < \rho(D, Q, C')
These are Pareto optima. The set of perfect (and optimum) clusterings does not even form a cluster hierarchy, so no hierarchic clustering method will find all optima!
50. Towards Optimum Clustering
51. Towards Optimum Clustering
Developing an (optimum) clustering method requires choosing:
1 a set of queries,
2 a probabilistic retrieval method,
3 a document similarity metric, and
4 a fusion principle.
52. A Simple Application
1 Set of queries: all possible one-term queries
2 Probabilistic retrieval method: tf*idf
3 Document similarity metric: \tau^T(d_l) \cdot \tau(d_m)
4 Fusion principle: group-average clustering, i.e. merge by the criterion
\pi(D, Q, C) = \frac{1}{c(c - 1)} \sum_{(d_l, d_m) \in C \times C,\, d_l \neq d_m} \tau^T(d_l) \cdot \tau(d_m)
This instantiation yields a standard clustering method.
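A minimal end-to-end sketch of this instantiation (my own code: scikit-learn's TfidfVectorizer stands in for the tf*idf retrieval scores, and SciPy's average-linkage routine for the group-average fusion principle):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["cluster hypothesis retrieval", "probabilistic retrieval model",
        "document clustering evaluation", "graph reachability queries"]

# τ(d): one-term queries as dimensions, tf*idf scores standing in for the
# relevance probabilities P(rel|q, d).
T = TfidfVectorizer().fit_transform(docs).toarray()

# Pairwise dot-product similarities, turned into dissimilarities for linkage.
S = T @ T.T
D = S.max() - S
np.fill_diagonal(D, 0.0)

Z = linkage(squareform(D, checks=False), method="average")  # group average
print(fcluster(Z, t=2, criterion="maxclust"))               # e.g. two clusters
```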
59. Query Set
Real collections have too few queries, so use an artificial query set.
For collection clustering: the set of all possible one-term queries.
Probability distribution over the query set: uniform, or proportional to document frequency.
Document representation: the original terms, or transformations of the term space.
Semantic dimensions: focus on certain aspects only (e.g., for images: color, contour, texture).
For result clustering: the set of all query expansions.
60. Probabilistic Retrieval Method
Model: in principle, any retrieval model is suitable.
Transformation to probabilities: direct estimation, or transforming the retrieval score into a probability of relevance.
61. Document Similarity Metric
Fixed as \tau^T(d_l) \cdot \tau(d_m).
62. Fusion Principles
The OCF only gives guidelines for good fusion principles: consider the metrics \pi and/or \rho during fusion.
63. Group Average Clustering
\sigma(C) = \frac{1}{c(c - 1)} \sum_{(d_l, d_m) \in C \times C,\, d_l \neq d_m} \tau^T(d_l) \cdot \tau(d_m)
This uses expected precision as the fusion criterion!
The method starts with singleton clusters (minimum recall) and builds larger clusters for increasing recall; each step forms the cluster with the highest precision (which may be lower than that of the current clusters).
67. Fusion Principles: Min Cut
Starts with a single cluster (maximum recall) and searches for the cut with the minimum loss in recall:
\rho(D, Q, C) = \sum_{C_i \in C} \sum_{(d_l, d_m) \in C_i \times C_i,\, d_l \neq d_m} \tau^T(d_l) \cdot \tau(d_m)
Consider expected precision for breaking ties!
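One way to realize a single min-cut split (a sketch under my own assumptions, using NetworkX's Stoer-Wagner global minimum cut on the τ-dot-product similarity graph):

```python
import networkx as nx
import numpy as np

def min_cut_split(T, doc_ids):
    """Split one cluster in two, minimizing the loss in ρ.

    Edge weights are the pairwise similarities τ(d_l)·τ(d_m); removing the
    global minimum cut removes the least total similarity, i.e. loses the
    least expected recall.
    """
    S = T[doc_ids] @ T[doc_ids].T
    G = nx.Graph()
    n = len(doc_ids)
    for i in range(n):
        for j in range(i + 1, n):
            G.add_edge(doc_ids[i], doc_ids[j], weight=float(S[i, j]))
    cut_value, (part_a, part_b) = nx.stoer_wagner(G)
    return list(part_a), list(part_b)
```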
70. Finding Optimum Clusterings
Min cut (assuming a cohesive similarity graph):
starts with the optimum clustering for maximum recall;
the min cut finds the split with the minimum loss in recall, with precision for tie-breaking;
this yields the optimum clustering for two clusters in O(n^3) time (vs. O(2^n) for the general case);
subsequent splits will not necessarily reach optima.
Group average:
in general, multiple fusion steps are needed to reach the first optimum;
a greedy strategy does not necessarily find this optimum!
79. Experiments
80. Experiments with a Query Set
ADI collection: 35 queries, 70 documents (each relevant to 2.4 queries on average)
Experiments:
Q35opt: using the actual relevance judgments in τ(d)
Q35: BM25 estimates for the 35 queries
1Tuni: one-term queries, uniform distribution
1Tdf: one-term queries, weighted according to document frequency
82. Using Keyphrases as Query Set
Compare clustering results based on different query sets:
1 'bag of words': single words as queries
2 keyphrases automatically extracted as head-noun phrases; a single query = all keyphrases of a document
Test collections: four collections assembled from the RCV1 (Reuters) news corpus:
# documents: 600 vs. 6000
# categories: 6 vs. 12
frequency distribution of classes: [U]niform vs. [R]andom
84. Using Keyphrases as Query Set: Results
[Figure: average precision and (external) F-measure for the different query sets]
85. Evaluation of the Expected F-Measure
Correlation between the expected F-measure (an internal measure) and the standard F-measure (comparison with a reference classification).
Test collections as before; for each setting, regard the quality of 40 different clustering methods (and find the optimum clustering among these 40).
86. Correlation Results
[Figure: Pearson correlation between the internal measures and the external F-measure]
87. Conclusion and Outlook
88. Summary
The Optimum Clustering Framework:
makes the Cluster Hypothesis a requirement,
forms a theoretical basis for developing better clustering methods, and
yields positive experimental evidence.
89. Further Research
Theoretical:
compatibility of existing clustering methods with the OCF
extension of the OCF to soft clustering
extension of the OCF to hierarchical clustering
Experimental:
variation of query sets
user experiments