This document provides an introduction and overview of document clustering techniques in information retrieval. It discusses motivations for clustering documents, such as improving search recall and organizing search results. It covers common clustering algorithms like K-means and hierarchical clustering, how they work, and considerations like choosing the number of clusters. The document uses examples and diagrams to illustrate clustering concepts and algorithms.
This document provides an introduction to document clustering and clustering algorithms. It discusses how clustering can be used in information retrieval applications like organizing search results and improving search recall. It also covers different types of clustering algorithms like partitioning algorithms (such as K-means) and hierarchical algorithms. Key steps of the K-means and hierarchical agglomerative clustering algorithms are described.
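The key steps of K-means mentioned above can be sketched in a few lines. This is an illustrative pure-Python version (Lloyd's algorithm with random initialization), not the implementation used by any of the summarized documents:

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=100, seed=0):
    """Lloyd's algorithm: assign each point to its nearest centroid,
    recompute centroids as cluster means, repeat until stable."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # random initial centroids
    labels = []
    for _ in range(iters):
        # assignment step: index of the nearest centroid for each point
        labels = [min(range(k), key=lambda j: dist2(p, centroids[j]))
                  for p in points]
        # update step: mean of the points assigned to each cluster
        new = []
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            new.append(tuple(sum(c) / len(members) for c in zip(*members))
                       if members else centroids[j])
        if new == centroids:                   # assignments have stabilized
            break
        centroids = new
    return centroids, labels
```

On two well-separated pairs of points this converges to one centroid per pair regardless of which initial points are sampled.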
Clustering is the process of grouping similar objects together. Hierarchical agglomerative clustering builds a hierarchy by iteratively merging the closest pairs of clusters. It starts with each document in its own cluster and successively merges the closest pairs of clusters until all documents are in one cluster, forming a dendrogram. Different linkage methods, such as single, complete, and average linkage, define how the distance between clusters is calculated during merging. Hierarchical clustering provides a multilevel clustering structure but has computational complexity of O(n³) in general.
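The merge loop and the three linkage methods described above can be sketched as follows. This naive version recomputes all pairwise cluster distances at each step, which is what gives the O(n³) behaviour the summary mentions:

```python
def agglomerative(points, linkage="single"):
    """Naive hierarchical agglomerative clustering: start with singleton
    clusters and repeatedly merge the closest pair under the chosen
    linkage. Returns the merge history (the dendrogram as a list of
    (cluster_a, cluster_b) tuples). O(n^3) overall."""
    d = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    link = {
        # single: distance between the two closest members
        "single":   lambda A, B: min(d(a, b) for a in A for b in B),
        # complete: distance between the two farthest members
        "complete": lambda A, B: max(d(a, b) for a in A for b in B),
        # average: mean distance over all cross-cluster pairs
        "average":  lambda A, B: sum(d(a, b) for a in A for b in B) / (len(A) * len(B)),
    }[linkage]
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        # find the closest pair of clusters under the chosen linkage
        i, j = min(((i, j) for i in range(len(clusters))
                           for j in range(i + 1, len(clusters))),
                   key=lambda ij: link(clusters[ij[0]], clusters[ij[1]]))
        merges.append((tuple(clusters[i]), tuple(clusters[j])))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges
```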
New proximity estimate for incremental update of non-uniformly distributed cl... (IJDKP)
Conventional clustering algorithms mine static databases and generate a set of patterns in the form of clusters. Many real-life databases keep growing incrementally, and for such dynamic databases the patterns extracted from the original database become obsolete. Conventional clustering algorithms are therefore unsuitable for incremental databases, since they lack the capability to modify the clustering results in accordance with recent updates. This paper proposes a new incremental clustering algorithm called CFICA (Cluster Feature-based Incremental Clustering Approach for numerical data) to handle numerical data and suggests a new proximity metric called Inverse Proximity Estimate (IPE), which considers the proximity of a data point to a cluster representative as well as its proximity to the farthest point in its vicinity. CFICA uses the proposed proximity metric to determine the membership of a data point in a cluster.
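The abstract does not give the IPE formula, only the two quantities it combines. The following hypothetical sketch illustrates that idea alone (summing the distance to the representative and the distance to the farthest cluster member); the actual metric in the CFICA paper may differ:

```python
def inverse_proximity_estimate(point, representative, cluster_points, dist):
    """Illustrative only: combines (a) the distance from a point to a
    cluster representative with (b) the distance to the farthest point
    in the cluster, as the abstract describes. The summation here is a
    placeholder, not the paper's actual formula."""
    d_rep = dist(point, representative)                   # term (a)
    d_far = max(dist(point, q) for q in cluster_points)   # term (b)
    return d_rep + d_far
```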
The document discusses various techniques for document clustering and choosing the number of clusters. It describes how clustering can help address issues like ambiguous queries by grouping query results by topic. It also covers partitioning and hierarchical clustering approaches, as well as methods for choosing the number of clusters like hypothesis testing, Bayesian estimation, and penalizing model complexity. Visualization techniques like multidimensional scaling and self-organizing maps are also summarized.
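One simple instance of the "penalizing model complexity" idea mentioned above is to score each candidate number of clusters by its within-cluster error plus a penalty that grows with k. The BIC-style penalty term below is a common hypothetical choice for illustration; the summarized document does not commit to a specific criterion:

```python
import math

def penalized_score(points, labels, k):
    """Score a clustering by within-cluster sum of squared error plus a
    BIC-style complexity penalty; lower is better. Choosing k then means
    picking the candidate clustering with the smallest score."""
    n, dim = len(points), len(points[0])
    sse = 0.0
    for j in range(k):
        members = [p for p, l in zip(points, labels) if l == j]
        if not members:
            continue
        centroid = [sum(c) / len(members) for c in zip(*members)]
        sse += sum(sum((x, m) == (x, m) and (x - m) ** 2 for x, m in zip(p, centroid))
                   for p in members)
    return sse + k * dim * math.log(n)   # complexity penalty grows with k
```

On well-separated data, the correct k yields a lower score than an under-clustered alternative, since the drop in error outweighs the extra penalty.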
The document provides an overview of different clustering methods including partitioning methods like k-means and k-medoids, hierarchical methods like agglomerative and divisive, and density-based methods like DBSCAN and OPTICS. It discusses the basic concepts of clustering, requirements for effective clustering like scalability and ability to handle different data types and shapes. It also summarizes clustering algorithms like BIRCH that aim to improve scalability for large datasets.
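The density-based idea behind DBSCAN, mentioned above, can be sketched briefly: points with enough neighbours within a radius are core points, and clusters grow outward from them. This is a minimal illustrative version without a spatial index (so O(n²)), not the full algorithm from the summarized material:

```python
def dbscan(points, eps, min_pts):
    """Minimal density-based clustering in the spirit of DBSCAN.
    Returns one label per point; -1 marks noise."""
    d = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    n = len(points)
    neighbours = [[j for j in range(n) if j != i and d(points[i], points[j]) <= eps]
                  for i in range(n)]
    labels = [None] * n
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) + 1 < min_pts:    # not dense enough
            labels[i] = -1                      # provisionally noise
            continue
        labels[i] = cluster                     # new cluster from a core point
        frontier = list(neighbours[i])
        while frontier:
            j = frontier.pop()
            if labels[j] == -1:
                labels[j] = cluster             # noise upgraded to border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) + 1 >= min_pts:
                frontier.extend(neighbours[j])  # only core points expand further
        cluster += 1
    return labels
```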
The document discusses various model-based clustering techniques for handling high-dimensional data, including expectation-maximization, conceptual clustering using COBWEB, self-organizing maps, subspace clustering with CLIQUE and PROCLUS, and frequent pattern-based clustering. It provides details on the methodology and assumptions of each technique.
Chapter 10. Cluster Analysis: Basic Concepts and Methods.ppt (Subrata Kumer Paul)
Jiawei Han, Micheline Kamber and Jian Pei
Data Mining: Concepts and Techniques, 3rd ed.
The Morgan Kaufmann Series in Data Management Systems
Morgan Kaufmann Publishers, July 2011. ISBN 978-0123814791
The document discusses different clustering methods including partitioning, hierarchical, density-based, and grid-based approaches. It provides details on popular partitioning algorithms like k-means and k-medoids, describing how they work, their strengths and weaknesses. Hierarchical clustering methods like AGNES and DIANA are also covered, including how distances between clusters are calculated during the merging or splitting process.
This document summarizes clustering analysis techniques described in Chapter 10 of the book "Data Mining: Concepts and Techniques". It introduces the basic concepts of cluster analysis including partitioning, hierarchical, density-based, and grid-based methods. It then describes the k-means and k-medoids partitioning algorithms in more detail, noting that k-means can be sensitive to outliers while k-medoids uses actual data points as cluster representatives.
This document summarizes clustering analysis techniques described in Chapter 10 of the book "Data Mining: Concepts and Techniques". It introduces the basic concepts of cluster analysis including partitioning, hierarchical, density-based, and grid-based methods. It then describes the k-means and k-medoids partitioning algorithms in more detail, noting that k-means can be sensitive to outliers while k-medoids uses actual data points as cluster representatives.
This document outlines clustering algorithms for large datasets. It discusses k-means clustering and extensions like k-means++ that improve initialization. It also covers spectral relaxation methods that reformulate k-means as a trace maximization problem to address local minima. Additionally, it proposes landmark-based clustering algorithms for biological sequences that select landmarks in one pass and assign sequences to the nearest landmark using hashing to search for neighbors. The document provides analysis of the time and space complexity of these algorithms as well as assumptions about separability and cluster size.
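The k-means++ initialization mentioned above can be sketched directly from its definition: the first centre is chosen uniformly at random, and each further centre is drawn with probability proportional to the squared distance from the nearest centre chosen so far. An illustrative version:

```python
import random

def kmeans_pp_init(points, k, seed=0):
    """k-means++ seeding: spread the initial centres out by sampling
    points proportionally to their squared distance from the nearest
    centre already chosen."""
    rng = random.Random(seed)
    d2 = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    centres = [rng.choice(points)]              # first centre: uniform
    while len(centres) < k:
        # squared distance of every point to its nearest existing centre
        weights = [min(d2(p, c) for c in centres) for p in points]
        centres.append(rng.choices(points, weights=weights, k=1)[0])
    return centres
```

Points already chosen have weight zero, so distinct centres are selected as long as there are enough distinct points.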
This chapter discusses different clustering methods including partitioning, hierarchical, density-based, and grid-based approaches. Partitioning methods like k-means and k-medoids aim to partition observations into k clusters by optimizing some objective function. Hierarchical clustering builds a hierarchy of clusters based on distance between observations. Density-based methods identify clusters based on density rather than distance. Grid-based methods quantize the space into a finite number of cells that form clusters.
This document summarizes chapter 10 of the book "Data Mining: Concepts and Techniques" which discusses cluster analysis. The chapter covers basic concepts of cluster analysis including partitioning, hierarchical, density-based and grid-based methods. It describes popular partitioning algorithms like k-means and k-medoids, and notes that k-means can be sensitive to outliers while k-medoids uses medoids, which are less sensitive to outliers. The chapter also discusses evaluating clustering quality and major considerations for cluster analysis.
This document provides an overview of supervised and unsupervised learning, with a focus on clustering as an unsupervised learning technique. It describes the basic concepts of clustering, including how clustering groups similar data points together without labeled categories. It then covers two main clustering algorithms - k-means, a partitional clustering method, and hierarchical clustering. It discusses aspects like cluster representation, distance functions, strengths and weaknesses of different approaches. The document aims to introduce clustering and compare it with supervised learning.
This document provides an overview of clustering techniques and presents a sample MapReduce implementation of K-Means clustering and Canopy clustering on a large dataset. It discusses how clustering can be used to group large, high-dimensional datasets and describes hierarchical and partitional clustering algorithms. It also outlines the steps taken in the MapReduce implementation to distribute the clustering computation across multiple nodes.
K-Means clustering uses an iterative procedure that is highly sensitive to and dependent upon the initial centroids. The initial centroids in k-means clustering are chosen randomly, and hence the clustering also changes with respect to the initial centroids. This paper tries to overcome this problem of random selection of centroids, and the resulting change of clusters, with a premeditated selection of initial centroids. The iris, abalone, and wine data sets are used to demonstrate that the proposed method of finding the initial centroids and using them in the k-means algorithm improves clustering performance. The clustering also remains the same in every run, since the initial centroids are selected through a premeditated method rather than randomly.
This document summarizes a lecture on clustering and provides a sample MapReduce implementation of K-Means clustering. It introduces clustering, discusses different clustering algorithms like hierarchical and partitional clustering, and focuses on K-Means clustering. It also describes Canopy clustering, which can be used as a preliminary step to partition large datasets and parallelize computation for K-Means clustering. The document then outlines the steps to implement K-Means clustering on large datasets using MapReduce, including selecting canopy centers, assigning points to canopies, and performing the iterative K-Means algorithm in parallel.
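The canopy step described above uses two distance thresholds: a loose one (t1) that defines canopy membership and a tight one (t2 < t1) that removes points from consideration as future canopy centres. A sketch of that one-pass procedure, with an illustrative distance function supplied by the caller:

```python
def canopy_cluster(points, t1, t2, dist):
    """Canopy clustering: pick an arbitrary remaining point as a canopy
    centre, put every point within the loose threshold t1 into that
    canopy, and drop points within the tight threshold t2 from the pool
    of candidate centres. Canopies may overlap, which is why they work
    as a cheap pre-partitioning step before K-Means."""
    assert t2 < t1
    remaining = list(points)
    canopies = []
    while remaining:
        centre = remaining[0]
        canopy = [p for p in points if dist(p, centre) <= t1]
        canopies.append((centre, canopy))
        remaining = [p for p in remaining if dist(p, centre) > t2]
    return canopies
```

In the MapReduce setting the lecture outlines, each canopy can then be handed to the iterative K-Means step independently.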
This document provides an overview and summary of a project report on text clustering. The report describes a system that takes in a collection of documents as input, clusters the documents into groups based on similarity, and allows the user to iteratively explore and refine the clusters to find relevant documents. The system represents documents as vectors, uses cosine similarity to initially cluster documents, and applies Bayesian machine learning to further refine the clusters. It aims to allow users to efficiently browse and retrieve relevant documents without viewing the entire collection.
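The document-vector and cosine-similarity representation described above can be sketched with simple bag-of-words counts. This illustrative version omits the tf-idf weighting and Bayesian refinement the report may use:

```python
import math
from collections import Counter

def to_vector(text):
    """Bag-of-words term vector: term -> raw count."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term vectors: the dot product
    divided by the product of the vector lengths. 1.0 means identical
    direction, 0.0 means no shared terms."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    norm = lambda v: math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm(a) * norm(b)) if a and b else 0.0
```

Documents sharing vocabulary score close to 1, while documents with no terms in common score exactly 0, which is what makes this usable as an initial clustering criterion.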
Distributed Computing Seminar Lecture 4 provided an overview of clustering algorithms and techniques. It discussed how clustering is used to group related data in applications like Google News and Amazon. It described hierarchical and partitional clustering algorithms and the k-means clustering algorithm. The lecture also introduced canopy clustering as a preliminary step that can help parallelize computation for clustering large datasets using MapReduce. It provided an example of how to efficiently partition a large movie ratings dataset into clusters using canopy clustering, k-means clustering, and MapReduce.
New Approach for K-mean and K-medoids Algorithm (Editor IJCATR)
K-means and K-medoids clustering algorithms are widely used for many practical applications. The original k-means and k-medoids algorithms select initial centroids and medoids randomly, which affects the quality of the resulting clusters and sometimes generates unstable and empty clusters that are meaningless. The k-medoids algorithm is also computationally expensive, requiring time proportional to the product of the number of data items, the number of clusters, and the number of iterations. The new approach for the k-means algorithm eliminates this deficiency of the existing k-means: it first calculates the initial centroids according to the requirements of users and then produces better, effective, and stable clusters. It also takes less execution time because it eliminates unnecessary distance computation by reusing the previous iteration. The new approach for k-medoids selects initial medoids systematically based on initial centroids, generating stable clusters to improve accuracy.
This document presents an overview of clustering algorithms including K-means clustering and canopy clustering. It discusses how K-means aims to minimize error across clusters and describes the canopy clustering approach of using an initial rough clustering followed by more rigorous clustering within canopies. It also provides examples of applying K-means clustering in Mathematica, R, and Excel and canopy clustering in Excel. Finally, it discusses complexity analysis and outlines future work involving Hadoop and parallelizing K-means.
Published a paper entitled 'Visualization of Crisp and Rough Clustering using MATLAB' in CIIT International Journal of Data Mining and Knowledge Engineering, 12 December 2012, Vol. 4, ISSN 0974-9683.
Cluster analysis is an unsupervised learning technique used to group similar data objects into clusters. It aims to partition data into groups called clusters such that objects within a cluster are as similar as possible while objects in different clusters are as dissimilar as possible. The k-means algorithm is commonly used for partitioning-based clustering. It works by randomly selecting k initial cluster centroids and then iteratively assigning data points to their nearest centroid and recalculating the centroids until cluster membership stabilizes. However, k-means is sensitive to outliers and noise since outliers can distort cluster centroids.
This thesis focuses on classification and clustering techniques using kernel density estimates that can be efficiently implemented using P-trees. Chapter 1 introduces the topics of data mining, classification, clustering, and P-trees. Chapter 2 analyzes bit-column-based data organization and P-trees. Chapter 3 describes P-trees and a new sorting scheme. Chapters 4-7 present various classification and clustering algorithms developed using the P-tree framework, including a kernel-based classifier, a decision tree approach, a semi-naive Bayes classifier, and a hierarchical clustering method. Chapter 8 concludes the thesis.
As lockdown restrictions eased, the survey found that:
1) More people were visiting outdoor spaces at least weekly compared to the initial lockdown period and the previous year.
2) Nearly 40% reported spending more time outdoors than the same period in 2019, with certain groups like younger people and families with children spending more.
3) Most people said they would continue spending meaningful time in nature, with around half expecting to visit outdoor spaces more after the pandemic ends.
This document presents results from surveys conducted in Scotland regarding participation in outdoor activities during COVID-19 lockdowns and easing of restrictions. Some key findings include:
- Walking, cycling, and outdoor exercise increased significantly compared to previous years. More people are traveling farther for outdoor activities as restrictions have eased.
- Participation in activities like walking, wildlife watching, and running increased the most, while activities like sightseeing and picnics remained lower.
- People report increased engagement with nature through activities like gardening and enjoying local wildlife. Many note increased benefits to mental health and well-being from outdoor time.
- Common problems encountered included issues with social distancing and inconsiderate behavior from others.
MongoDB is a non-relational database that supports document-based queries, indexing of all fields, master-slave replication for high availability, automatic sharding of data across multiple servers, and MapReduce for flexible aggregation. It uses dynamic schemas and embedded documents, which can store binary data. Queries in MongoDB support ad-hoc queries on documents using standard operators, and indexes can be applied on any field.
MongoDB is a cross-platform, document-oriented NoSQL database that provides high performance and scalability. It stores data in BSON documents which are similar to JSON documents. MongoDB does not enforce a schema and documents can have dynamic schemas. It supports queries, indexing, replication and sharding. Some key features include schema-less design, document-oriented storage, queries on indexed fields and MapReduce for flexible aggregation.
This document provides an overview of MongoDB including:
- MongoDB is a cross-platform, document-oriented NoSQL database that stores data as JSON-like documents.
- It discusses MongoDB architecture, important features like queries, indexing, replication, and auto-sharding.
- The document compares MongoDB to relational databases and covers installation, CRUD operations, and aggregation.
- Examples of queries, updates, projections and other MongoDB operations are provided.
The document discusses adaptations in animals. It explains that the Viceroy butterfly uses mimicry, a physical adaptation, to resemble the Monarch butterfly for protection. Behavioral adaptations allow animals to respond to life needs through actions and can be instinctive like finding shelter, or learned through environment interactions. Physical adaptations are body structures like the elephant's trunk, while behavioral adaptations are animals' actions that can be innate or acquired.
This document discusses physical and behavioral adaptations in animals. Physical adaptations are body structures that help animals find food, defend themselves, and reproduce, including camouflage, mimicry, body coverings, and chemical defenses. Behavioral adaptations are animals' actions, which can be instinctive behaviors that are innate or learned behaviors acquired through experience. Together, physical and behavioral adaptations allow animals to survive in their environments.
This document discusses animal adaptations, separating them into two categories: physical and behavioral. Physical adaptations are body structures that help animals survive, such as camouflage, mimicry, chemical defenses, and body coverings. Behavioral adaptations are animals' actions, either instinctive behaviors that are innate or learned behaviors acquired through experience. Examples of instinctive behaviors include finding shelter, gathering food, and raising young. Learned behaviors must be taught and cannot be passed genetically to offspring.
The document discusses different types of animal adaptations including physical and behavioral adaptations. Physical adaptations are body structures that help animals survive, such as camouflage, mimicry, body coverings, and chemical defenses. Behavioral adaptations are animals' actions, which can be instinctive behaviors that are innate or learned behaviors acquired through experience. Examples of instinctive behaviors include finding shelter and raising young, while learned adaptations must be taught.
The document describes a 7 minute video field trip to Pearson Landfills and Recycling. It provides contact information for the Maine Department of Environmental Protection website for more information. The video gives a brief overview of operations at a landfill and recycling facility.
Landfills are constructed and operated to strict environmental standards to protect groundwater, with liners at the bottom. While landfilling waste is the lowest priority option, modern landfills are very different from old open dump areas as they carefully manage garbage in an engineered facility.
The document outlines Maine's waste hierarchy which prioritizes waste reduction and reuse above recycling, composting, processing and beneficial use, waste-to-energy, and landfilling as a last resort. It encourages reducing waste by avoiding unnecessary packaging and single-use items, reusing items to extend their lifespan, recycling recyclable materials, and composting organic waste to reduce landfilling. The hierarchy is designed to minimize environmental impacts and costs at each step.
The document discusses different ways that nature has been socially constructed and conceived. It outlines three fundamental meanings of nature: dualistic, monistic, and adverbial. It then describes four major ways nature has been conceived: as a collection, as a web of relationships, as a process, and as Gaia. Different constructions of nature lead to different views on environmental policy and human relationships with the natural world.
How to Build a Module in Odoo 17 Using the Scaffold MethodCeline George
Odoo provides an option for creating a module by using a single line command. By using this command the user can make a whole structure of a module. It is very easy for a beginner to make a module. There is no need to make each file manually. This slide will show how to create a module using the scaffold method.
हिंदी वर्णमाला पीपीटी, hindi alphabet PPT presentation, hindi varnamala PPT, Hindi Varnamala pdf, हिंदी स्वर, हिंदी व्यंजन, sikhiye hindi varnmala, dr. mulla adam ali, hindi language and literature, hindi alphabet with drawing, hindi alphabet pdf, hindi varnamala for childrens, hindi language, hindi varnamala practice for kids, https://www.drmullaadamali.com
A review of the growth of the Israel Genealogy Research Association Database Collection for the last 12 months. Our collection is now passed the 3 million mark and still growing. See which archives have contributed the most. See the different types of records we have, and which years have had records added. You can also see what we have for the future.
This presentation includes basic of PCOS their pathology and treatment and also Ayurveda correlation of PCOS and Ayurvedic line of treatment mentioned in classics.
This slide is special for master students (MIBS & MIFB) in UUM. Also useful for readers who are interested in the topic of contemporary Islamic banking.
How to Setup Warehouse & Location in Odoo 17 InventoryCeline George
In this slide, we'll explore how to set up warehouses and locations in Odoo 17 Inventory. This will help us manage our stock effectively, track inventory levels, and streamline warehouse operations.
Exploiting Artificial Intelligence for Empowering Researchers and Faculty, In...Dr. Vinod Kumar Kanvaria
Exploiting Artificial Intelligence for Empowering Researchers and Faculty,
International FDP on Fundamentals of Research in Social Sciences
at Integral University, Lucknow, 06.06.2024
By Dr. Vinod Kumar Kanvaria
LAND USE LAND COVER AND NDVI OF MIRZAPUR DISTRICT, UPRAHUL
This Dissertation explores the particular circumstances of Mirzapur, a region located in the
core of India. Mirzapur, with its varied terrains and abundant biodiversity, offers an optimal
environment for investigating the changes in vegetation cover dynamics. Our study utilizes
advanced technologies such as GIS (Geographic Information Systems) and Remote sensing to
analyze the transformations that have taken place over the course of a decade.
The complex relationship between human activities and the environment has been the focus
of extensive research and worry. As the global community grapples with swift urbanization,
population expansion, and economic progress, the effects on natural ecosystems are becoming
more evident. A crucial element of this impact is the alteration of vegetation cover, which plays a
significant role in maintaining the ecological equilibrium of our planet.Land serves as the foundation for all human activities and provides the necessary materials for
these activities. As the most crucial natural resource, its utilization by humans results in different
'Land uses,' which are determined by both human activities and the physical characteristics of the
land.
The utilization of land is impacted by human needs and environmental factors. In countries
like India, rapid population growth and the emphasis on extensive resource exploitation can lead
to significant land degradation, adversely affecting the region's land cover.
Therefore, human intervention has significantly influenced land use patterns over many
centuries, evolving its structure over time and space. In the present era, these changes have
accelerated due to factors such as agriculture and urbanization. Information regarding land use and
cover is essential for various planning and management tasks related to the Earth's surface,
providing crucial environmental data for scientific, resource management, policy purposes, and
diverse human activities.
Accurate understanding of land use and cover is imperative for the development planning
of any area. Consequently, a wide range of professionals, including earth system scientists, land
and water managers, and urban planners, are interested in obtaining data on land use and cover
changes, conversion trends, and other related patterns. The spatial dimensions of land use and
cover support policymakers and scientists in making well-informed decisions, as alterations in
these patterns indicate shifts in economic and social conditions. Monitoring such changes with the
help of Advanced technologies like Remote Sensing and Geographic Information Systems is
crucial for coordinated efforts across different administrative levels. Advanced technologies like
Remote Sensing and Geographic Information Systems
9
Changes in vegetation cover refer to variations in the distribution, composition, and overall
structure of plant communities across different temporal and spatial scales. These changes can
occur natural.
Executive Directors Chat Leveraging AI for Diversity, Equity, and InclusionTechSoup
Let’s explore the intersection of technology and equity in the final session of our DEI series. Discover how AI tools, like ChatGPT, can be used to support and enhance your nonprofit's DEI initiatives. Participants will gain insights into practical AI applications and get tips for leveraging technology to advance their DEI goals.
How to Add Chatter in the odoo 17 ERP ModuleCeline George
In Odoo, the chatter is like a chat tool that helps you work together on records. You can leave notes and track things, making it easier to talk with your team and partners. Inside chatter, all communication history, activity, and changes will be displayed.
Introduction to Information Retrieval
CS276: Information Retrieval and Web Search
Pandu Nayak and Prabhakar Raghavan
Lecture 12: Clustering
What is clustering?
Clustering: the process of grouping a set of objects
into classes of similar objects
Documents within a cluster should be similar.
Documents from different clusters should be
dissimilar.
The commonest form of unsupervised learning
Unsupervised learning = learning from raw data, as
opposed to supervised learning, where a classification of
examples is given
A common and important task that finds many
applications in IR and other places
Ch. 16
A data set with clear cluster structure
How would you design an algorithm for finding the three clusters in this case?
[Figure: a 2-D scatter plot of points forming three well-separated clusters.]
Ch. 16
Applications of clustering in IR
Whole corpus analysis/navigation
Better user interface: search without typing
For improving recall in search applications
Better search results (like pseudo RF)
For better navigation of search results
Effective “user recall” will be higher
For speeding up vector space retrieval
Cluster-based retrieval gives faster search
Sec. 16.1
Yahoo! Hierarchy isn’t clustering but is the kind
of output you want from clustering
[Tree diagram: www.yahoo.com/Science, with top-level categories agriculture, biology, physics, CS, and space; leaves include dairy, crops, agronomy, forestry, botany, evolution, cell, magnetism, relativity, courses, AI, HCI, craft, and missions; one node has 30 children.]
Google News: automatic clustering gives an
effective news presentation metaphor
For visualizing a document collection and its
themes
Wise et al, “Visualizing the non-visual” PNNL
ThemeScapes, Cartia
[Mountain height = cluster size]
For improving search recall
Cluster hypothesis - Documents in the same cluster behave similarly
with respect to relevance to information needs
Therefore, to improve search recall:
Cluster docs in corpus a priori
When a query matches a doc D, also return other docs in the
cluster containing D
Hope if we do this: The query “car” will also return docs containing
automobile
Because clustering grouped together docs containing car with
those containing automobile.
Why might this happen?
Sec. 16.1
Issues for clustering
Representation for clustering
Document representation
Vector space? Normalization?
Centroids aren’t length normalized
Need a notion of similarity/distance
How many clusters?
Fixed a priori?
Completely data driven?
Avoid “trivial” clusters - too large or small
If a cluster's too large, then for navigation purposes you've
wasted an extra user click without whittling down the set of
documents much.
Sec. 16.2
Notion of similarity/distance
Ideal: semantic similarity.
Practical: term-statistical similarity
We will use cosine similarity.
Docs as vectors.
For many algorithms, easier to think in
terms of a distance (rather than similarity)
between docs.
We will mostly speak of Euclidean distance
But real implementations use cosine similarity
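The two views are closely linked for length-normalized vectors; a minimal sketch in plain Python (our own illustration, not the lecture's code):

```python
import math

def cosine(u, v):
    # Cosine similarity: dot product divided by the product of vector lengths.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def euclidean(u, v):
    # Euclidean distance between two vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# For unit-length vectors, ||u - v||^2 = 2 - 2*cos(u, v), so ranking by
# Euclidean distance and ranking by cosine similarity agree (reversed).
u = [1.0, 0.0]
v = [math.sqrt(0.5), math.sqrt(0.5)]
assert abs(euclidean(u, v) ** 2 - (2 - 2 * cosine(u, v))) < 1e-9
```

This identity is why the slides can reason in terms of Euclidean distance while implementations use cosine similarity.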
Clustering Algorithms
Flat algorithms
Usually start with a random (partial) partitioning
Refine it iteratively
K means clustering
(Model based clustering)
Hierarchical algorithms
Bottom-up, agglomerative
(Top-down, divisive)
Hard vs. soft clustering
Hard clustering: Each document belongs to exactly one cluster
More common and easier to do
Soft clustering: A document can belong to more than one
cluster.
Makes more sense for applications like creating browsable
hierarchies
You may want to put a pair of sneakers in two clusters: (i) sports
apparel and (ii) shoes
You can only do that with a soft clustering approach.
We won’t do soft clustering today. See IIR 16.5, 18
Partitioning Algorithms
Partitioning method: Construct a partition of n
documents into a set of K clusters
Given: a set of documents and the number K
Find: a partition of K clusters that optimizes the
chosen partitioning criterion
Globally optimal solutions are intractable for many objective
functions: they would require exhaustively enumerating all partitions
Effective heuristic methods: K-means and K-
medoids algorithms
See also Kleinberg NIPS 2002 – impossibility for natural clustering
K-Means
Assumes documents are real-valued vectors.
Clusters based on centroids (aka the center of gravity
or mean) of points in a cluster, c:
Reassignment of instances to clusters is based on
distance to the current cluster centroids.
(Or one can equivalently phrase it in terms of similarities)
μ(c) = (1/|c|) · Σ over x ∈ c of x
Sec. 16.4
K-Means Algorithm
Select K random docs {s1, s2,… sK} as seeds.
Until clustering converges (or other stopping criterion):
For each doc di:
Assign di to the cluster cj such that dist(di, sj) is minimal.
(Next, update the seeds to the centroid of each cluster)
For each cluster cj:
sj = μ(cj)
Sec. 16.4
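The loop above can be sketched in a few lines of plain Python; this is an illustrative version (Euclidean distance, random docs as seeds), not the lecture's reference implementation:

```python
import random

def kmeans(docs, k, iters=100, seed=0):
    # docs: a list of equal-length float vectors.
    rng = random.Random(seed)
    centroids = [list(d) for d in rng.sample(docs, k)]  # K random docs as seeds
    assign = None
    for _ in range(iters):
        # Assignment step: each doc joins the cluster of its nearest centroid.
        new_assign = []
        for d in docs:
            dists = [sum((x - c) ** 2 for x, c in zip(d, cen))
                     for cen in centroids]
            new_assign.append(dists.index(min(dists)))
        if new_assign == assign:
            break  # doc partition unchanged: converged
        assign = new_assign
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [d for d, a in zip(docs, assign) if a == j]
            if members:
                centroids[j] = [sum(xs) / len(members) for xs in zip(*members)]
    return assign, centroids

labels, cents = kmeans([[1, 1], [1.5, 2], [8, 8], [9, 9]], k=2)
```

On this toy input the two left points end up in one cluster and the two right points in the other, regardless of which docs are drawn as seeds.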
K Means Example
(K=2)
Pick seeds → Reassign clusters → Compute centroids → Reassign clusters → Compute centroids → Reassign clusters → Converged!
[Figure: 2-D points; the two centroids (marked ×) move at each round until the assignment stops changing.]
Sec. 16.4
Termination conditions
Several possibilities, e.g.,
A fixed number of iterations.
Doc partition unchanged.
Centroid positions don’t change.
Does this mean that the docs in a
cluster are unchanged?
Sec. 16.4
Convergence
Why should the K-means algorithm ever reach a
fixed point?
A state in which clusters don’t change.
K-means is a special case of a general procedure
known as the Expectation Maximization (EM)
algorithm.
EM is known to converge.
Number of iterations could be large.
But in practice usually isn’t
Sec. 16.4
Convergence of K-Means
Define goodness measure of cluster k as sum of
squared distances from cluster centroid:
Gk = Σi (di − ck)²   (sum over all di in cluster k)
G = Σk Gk
Reassignment monotonically decreases G since
each vector is assigned to the closest centroid.
Sec. 16.4
Convergence of K-Means
Recomputation monotonically decreases each Gk
since (mk is number of members in cluster k):
Σ (di − a)² reaches its minimum for:
Σ −2(di − a) = 0
Σ di = Σ a
mk a = Σ di
a = (1/mk) Σ di = ck
K-means typically converges quickly
Sec. 16.4
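The last step can be sanity-checked numerically; a toy one-dimensional example (our own, not from the slides):

```python
# The cluster mean minimizes the sum of squared distances, so nudging
# the candidate point a in any direction can only increase the sum.
docs = [2.0, 4.0, 9.0]
mean = sum(docs) / len(docs)  # 5.0

def sse(a):
    # Sum of squared distances from every doc to candidate point a.
    return sum((d - a) ** 2 for d in docs)

assert all(sse(mean) <= sse(mean + delta)
           for delta in (-1.0, -0.1, 0.1, 1.0))
```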
Time Complexity
Computing distance between two docs is O(M)
where M is the dimensionality of the vectors.
Reassigning clusters: O(KN) distance computations,
or O(KNM).
Computing centroids: Each doc gets added once to
some centroid: O(NM).
Assume these two steps are each done once for I
iterations: O(IKNM).
Sec. 16.4
Seed Choice
Results can vary based on
random seed selection.
Some seeds can result in poor
convergence rate, or
convergence to sub-optimal
clusterings.
Select good seeds using a heuristic
(e.g., doc least similar to any
existing mean)
Try out multiple starting points
Initialize with the results of another
method.
Example showing sensitivity to seeds:
[Figure: six points A–F in the plane.]
If you start with B and E as centroids, you converge to {A,B,C} and {D,E,F}.
If you start with D and F, you converge to {A,B,D,E} and {C,F}.
Sec. 16.4
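One widely used instance of the "good seeds" heuristic is k-means++-style sampling (our addition; the slide only names the idea): later seeds are drawn with probability proportional to their squared distance from the nearest seed already chosen, which tends to spread seeds across the data.

```python
import random

def spread_seeds(docs, k, seed=0):
    # k-means++-style seeding (an assumption beyond the slide): first seed
    # uniform at random; each later seed drawn with probability proportional
    # to its squared distance from the nearest seed chosen so far.
    rng = random.Random(seed)
    seeds = [list(rng.choice(docs))]
    while len(seeds) < k:
        d2 = [min(sum((x - y) ** 2 for x, y in zip(d, s)) for s in seeds)
              for d in docs]
        r = rng.uniform(0, sum(d2))
        acc = 0.0
        for doc, w in zip(docs, d2):
            acc += w
            if acc >= r:
                seeds.append(list(doc))
                break
    return seeds

# Two tight blobs: the second seed almost surely lands in the other blob.
seeds = spread_seeds([[0.0, 0.0], [0.0, 0.1], [10.0, 10.0], [10.0, 10.1]], k=2)
```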
K-means issues, variations, etc.
Recomputing the centroid after every assignment
(rather than after all points are re-assigned) can
improve speed of convergence of K-means
Assumes clusters are spherical in vector space
Sensitive to coordinate changes, weighting etc.
Disjoint and exhaustive
Doesn’t have a notion of “outliers” by default
But can add outlier filtering
Sec. 16.4
Dhillon et al. ICDM 2002 – variation to fix some issues with small
document clusters
How Many Clusters?
Number of clusters K is given
Partition n docs into predetermined number of clusters
Finding the “right” number of clusters is part of the
problem
Given docs, partition into an “appropriate” number of
subsets.
E.g., for query results - ideal value of K not known up front
- though UI may impose limits.
Can usually take an algorithm for one flavor and
convert to the other.
K not specified in advance
Say, the results of a query.
Solve an optimization problem: penalize having
lots of clusters
application dependent, e.g., compressed summary
of search results list.
Tradeoff between having more clusters (better
focus within each cluster) and having too many
clusters
K not specified in advance
Given a clustering, define the Benefit for a
doc to be the cosine similarity to its
centroid
Define the Total Benefit to be the sum of
the individual doc Benefits.
Why is there always a clustering of Total Benefit n?
Penalize lots of clusters
For each cluster, we have a Cost C.
Thus for a clustering with K clusters, the Total Cost is
KC.
Define the Value of a clustering to be =
Total Benefit - Total Cost.
Find the clustering of highest value, over all choices
of K.
Total benefit increases with increasing K. But can stop
when it doesn’t increase by “much”. The Cost term
enforces this.
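The Benefit/Cost trade-off above can be made concrete; a toy sketch in plain Python (the per-cluster cost C = 0.4 is an arbitrary choice of ours):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def value(clusters, cost_per_cluster):
    # Value = Total Benefit - K * C, where a doc's Benefit is its
    # cosine similarity to its cluster centroid.
    benefit = 0.0
    for cluster in clusters:
        centroid = [sum(xs) / len(cluster) for xs in zip(*cluster)]
        benefit += sum(cosine(d, centroid) for d in cluster)
    return benefit - cost_per_cluster * len(clusters)

docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
candidates = [
    [docs],                # K=1: one coarse cluster
    [docs[:2], docs[2:]],  # K=2: the two natural groups
    [[d] for d in docs],   # K=4: Total Benefit = n, but highest Cost
]
best = max(candidates, key=lambda cl: value(cl, 0.4))
```

With this cost the K=2 clustering wins: going from K=2 to K=4 barely increases Total Benefit, so the extra Cost dominates.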
Hierarchical Clustering
Build a tree-based hierarchical taxonomy
(dendrogram) from a set of documents.
One approach: recursive application of a
partitional clustering algorithm.
animal
├─ vertebrate: fish, reptile, amphib., mammal
└─ invertebrate: worm, insect, crustacean
Ch. 17
Dendrogram: Hierarchical Clustering
Clustering obtained
by cutting the
dendrogram at a
desired level: each
connected
component forms a
cluster.
Hierarchical Agglomerative Clustering
(HAC)
Starts with each doc in a separate cluster
then repeatedly joins the closest pair of
clusters, until there is only one cluster.
The history of merging forms a binary tree
or hierarchy.
Sec. 17.1
Note: the resulting clusters are still “hard” and induce a partition
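The merge loop can be sketched directly; a naive O(N³) plain-Python version using single-link similarity (our own illustration, not the lecture's code):

```python
def hac(docs, sim):
    # Start with each doc in its own cluster; repeatedly merge the
    # closest pair; record the merge history (the binary tree).
    clusters = [[i] for i in range(len(docs))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single-link: similarity of the most similar member pair.
                s = max(sim(docs[a], docs[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or s > best[0]:
                    best = (s, i, j)
        _, i, j = best
        merges.append((tuple(clusters[i]), tuple(clusters[j])))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges

# Toy 1-D data; similarity = negative distance.
history = hac([[0.0], [0.1], [5.0], [5.2]],
              lambda u, v: -abs(u[0] - v[0]))
```

On this input the two near pairs merge first and the two resulting groups merge last, which is exactly the dendrogram shape described above.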
Closest pair of clusters
Many variants to defining closest pair of clusters
Single-link
Similarity of the most cosine-similar (single-link)
Complete-link
Similarity of the “furthest” points, the least cosine-similar
Centroid
Clusters whose centroids (centers of gravity) are the most
cosine-similar
Average-link
Average cosine between pairs of elements
Sec. 17.2
Single Link Agglomerative Clustering
Use maximum similarity of pairs:

sim(ci, cj) = max over x ∈ ci, y ∈ cj of sim(x, y)

Can result in “straggly” (long and thin) clusters
due to chaining effect.
After merging ci and cj, the similarity of the
resulting cluster to another cluster, ck, is:

sim((ci ∪ cj), ck) = max( sim(ci, ck), sim(cj, ck) )
Sec. 17.2
Complete Link
Use minimum similarity of pairs:

sim(ci, cj) = min over x ∈ ci, y ∈ cj of sim(x, y)

Makes “tighter,” spherical clusters that are typically
preferable.
After merging ci and cj, the similarity of the resulting
cluster to another cluster, ck, is:

sim((ci ∪ cj), ck) = min( sim(ci, ck), sim(cj, ck) )
Sec. 17.2
Computational Complexity
In the first iteration, all HAC methods need to
compute the similarity of all pairs of N initial instances,
which is O(N²).
In each of the subsequent N − 2 merging iterations,
compute the distance between the most recently
created cluster and all other existing clusters.
In order to maintain an overall O(N²) performance,
computing similarity to each other cluster must be
done in constant time.
Often O(N³) if done naively or O(N² log N) if done more
cleverly
Sec. 17.2.1
Group Average
Similarity of two clusters = average similarity of all pairs
within merged cluster.
Compromise between single and complete link.
Two options:
Averaged across all ordered pairs in the merged cluster
Averaged over all pairs between the two original clusters
No clear difference in efficacy
sim(ci, cj) = 1 / ( |ci ∪ cj| · (|ci ∪ cj| − 1) ) · Σ over x ∈ (ci ∪ cj) Σ over y ∈ (ci ∪ cj), y ≠ x of sim(x, y)
Sec. 17.3
Computing Group Average Similarity
Always maintain sum of vectors in each cluster.
Compute similarity of clusters in constant time:
s(cj) = Σ over x ∈ cj of x

sim(ci, cj) = [ (s(ci) + s(cj)) · (s(ci) + s(cj)) − (|ci| + |cj|) ] / [ (|ci| + |cj|) · (|ci| + |cj| − 1) ]
Sec. 17.3
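The sum-of-vectors trick can be checked against the direct definition; a small sketch (our own, assuming unit-length member vectors as on the slide):

```python
import math

def group_average_sim(sum_i, n_i, sum_j, n_j):
    # Constant-time group-average similarity from the maintained
    # per-cluster vector sums s(ci), s(cj); members must be unit length.
    s = [a + b for a, b in zip(sum_i, sum_j)]
    n = n_i + n_j
    return (sum(x * x for x in s) - n) / (n * (n - 1))

def unit(v):
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

ci = [unit([1.0, 0.0]), unit([1.0, 0.2])]
cj = [unit([0.9, 0.3])]
members = ci + cj
# Direct definition: average cosine over all ordered pairs x != y.
direct = sum(sum(a * b for a, b in zip(x, y))
             for x in members for y in members if x is not y) / (3 * 2)
fast = group_average_sim([sum(xs) for xs in zip(*ci)], 2, cj[0], 1)
assert abs(direct - fast) < 1e-9
```

The trick works because for unit vectors, s·s expands to n self-similarities of 1 plus the sum of all cross-pair cosines; subtracting n and dividing by n(n − 1) leaves exactly the pairwise average.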
What Is A Good Clustering?
Internal criterion: A good clustering will produce
high quality clusters in which:
the intra-class (that is, intra-cluster) similarity is
high
the inter-class similarity is low
The measured quality of a clustering depends on
both the document representation and the
similarity measure used
Sec. 16.3
External criteria for clustering quality
Quality measured by its ability to discover some
or all of the hidden patterns or latent classes in
gold standard data
Assesses a clustering with respect to ground truth
… requires labeled data
Assume documents with C gold standard classes,
while our clustering algorithms produce K clusters,
ω1, ω2, …, ωK with ni members.
Sec. 16.3
External Evaluation of Cluster Quality
Simple measure: purity, the ratio between the
dominant class in the cluster πi and the size of
cluster ωi
Biased because having n clusters maximizes
purity
Others are entropy of classes in clusters (or
mutual information between classes and
clusters)
Purity(ωi) = (1/ni) · max over j ∈ C of nij

where nij is the number of members of gold-standard class j in cluster ωi.
Sec. 16.3
Purity example
[Figure: three clusters of points labeled with classes x, o, and ⋄.]
Cluster I: Purity = (1/6) · max(5, 1, 0) = 5/6
Cluster II: Purity = (1/6) · max(1, 4, 1) = 4/6
Cluster III: Purity = (1/5) · max(2, 0, 3) = 3/5
Sec. 16.3
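The example numbers are easy to reproduce; a small plain-Python sketch (our own illustration):

```python
def purity(cluster_labels):
    # Fraction of the cluster's members in its dominant class.
    counts = {}
    for label in cluster_labels:
        counts[label] = counts.get(label, 0) + 1
    return max(counts.values()) / len(cluster_labels)

# Gold-standard class labels of the members of the three example clusters.
c1 = ['x'] * 5 + ['o']           # Cluster I:   purity 5/6
c2 = ['x'] + ['o'] * 4 + ['d']   # Cluster II:  purity 4/6
c3 = ['x'] * 2 + ['d'] * 3       # Cluster III: purity 3/5
assert [purity(c) for c in (c1, c2, c3)] == [5/6, 4/6, 3/5]
```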
Rand Index measures agreement between pair decisions. Here RI = (20 + 72) / 136 ≈ 0.68.

Number of pairs of points              Same cluster in clustering    Different clusters in clustering
Same class in ground truth                        20                              24
Different classes in ground truth                 20                              72
Sec. 16.3
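Pair counting can be sketched directly (plain Python, our own illustration; the toy labels below are not the 17-point example behind the table above):

```python
def rand_index(labels_true, labels_pred):
    # Count pairs where clustering and ground truth agree: both in the
    # same group (A) or both in different groups (D); RI = (A+D)/pairs.
    n = len(labels_true)
    agree = pairs = 0
    for i in range(n):
        for j in range(i + 1, n):
            pairs += 1
            same_class = labels_true[i] == labels_true[j]
            same_cluster = labels_pred[i] == labels_pred[j]
            agree += (same_class == same_cluster)
    return agree / pairs

assert rand_index([0, 0, 1, 1], [0, 0, 1, 1]) == 1.0
assert rand_index(['a', 'a', 'b', 'b'], [1, 2, 1, 2]) == 2 / 6
```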
Rand index and Cluster F-measure

RI = (A + D) / (A + B + C + D)

where A counts pairs in the same class and same cluster, B pairs in the same cluster but different classes, C pairs in the same class but different clusters, and D pairs different in both.

Compare with standard Precision and Recall:

P = A / (A + B)    R = A / (A + C)

People also define and use a cluster F-measure,
which is probably a better measure.
Sec. 16.3
Final word and resources
In clustering, clusters are inferred from the data without
human input (unsupervised learning)
However, in practice, it’s a bit less clear: there are many
ways of influencing the outcome of clustering: number of
clusters, similarity measure, representation of documents, ...
Resources
IIR 16 except 16.5
IIR 17.1–17.3