This document discusses clustering and the different definitions and types of clusters. It shows that there is no single, universal definition of a cluster, and that different clustering algorithms work better for certain types of clusters depending on factors such as shape, separation, density, and dimensionality of the data. A variety of distance measures that can be used to quantify similarity when performing clustering are also presented.
2. What is it about?
Clustering refers to the process of finding groups of points that are in some way "lumped together"
A modality of unsupervised learning, as we do not know ahead of time where and what the clusters are – no training!
Exploratory: it tries to characterize the structure of a dataset
3. But, what is a cluster?
groups of points that are similar
groups of points that are close to each other
groups of points well-separated from one another
contiguous regions of high data point density separated by regions of lower point density
5. But, what is a cluster?
Any clusters here? There should not be, as these are uniformly generated points (no two points overlap, yet). Even so, most algorithms would point out some clusters.
It is not that there are clusters there; it is only that we do not have enough points yet.
The point here is: although one would find clusters, they definitely would not explain the phenomenon accurately.
6. But, what is a cluster?
Yes! Three clusters, I can see them. Distance-based algorithms can do well here.
Easy, huh?! No wonder: here we have convex, disjoint, and well-separated groups of points.
Try the next ones!
8. But, what is a cluster?
Non-convex clusters – simple distance-based algorithms would have trouble here.
A cluster is convex if the line connecting any two of its points lies entirely within the cluster itself.
There are also star-convex clusters: in such a case, the line connecting the spatial center of the cluster to any other point lies entirely within the cluster.
11. But, what is a cluster?
No general clustering algorithm can solve this. The clustering is given by the global properties observed in the points – distance- or neighbor-based algorithms would yield a single cluster.
In this case, for any algorithm that considers a single point (or a single pair of points) at a time, this leads to a problem: to determine cluster membership, we need the properties of the whole cluster; but to determine the properties (vertical, horizontal, and pairwise orthogonal) of the cluster, we must first assign points to clusters.
12. But, what is a cluster?
To handle such situations, we would need to perform some kind of global structure analysis – a task our minds are incredibly good at (which is why we tend to think of clusters this way) but that we have a hard time teaching computers to do
For problems in two dimensions, digital image processing has developed methods to recognize and extract certain features (such as edge detection)
But general clustering methods deal only with local properties and therefore can't handle problems such as these
14. But, what is a cluster?
If we return to our candidate definitions of a cluster, we can verify that none of them survives the possibilities just presented – try it!
groups of points that are similar
groups of points that are close to each other
groups of points well-separated from one another
contiguous regions of high data point density separated by regions of lower point density
So this is it:
• There is no mathematical, nor universal, definition of a cluster
• Rather, we have our intuition, and it can be quite useful provided we have a good comprehension of the data properties – structural, statistical, and domain-related
• Having, as much as possible, well-defined goals is also a requirement
• Just as with any other data analysis approach, do not try to use it as a magic black box – doing so will fail with high probability!
15. Distances
Clustering does not actually require data points to be embedded into a geometric space: all that is required is a distance or (equivalently) a similarity measure for any pair of points
This makes it possible to perform clustering on a set of strings, for example
However, if the data points have the properties of a vector space, then we can develop more efficient algorithms that exploit these properties
16. Distances – what are they?
A distance is any function d(x, y) that takes two points and returns a scalar value that measures how different these points are: the more different, the larger the distance
From a distance function d we can derive a similarity function s, for example:
s(x, y) = 1 − d(x, y), for 0 ≤ d(x, y) ≤ 1
s(x, y) = 1/d(x, y)
s(x, y) = e^(−d(x, y))
For some problems, a particular distance measure will present itself naturally – if the data points are points in space, then we will most likely employ the Euclidean distance or a measure similar to it; but for other problems, we have more freedom to define our own metric
18. Distances – metric distances
There are certain properties that a distance (or similarity) function should have. Mathematicians have developed a set of properties that a function must possess to be considered a metric (or distance) in a mathematical sense:
d(x, y) ≥ 0
d(x, y) = 0 if and only if x = y
d(x, y) = d(y, x)
d(x, y) + d(y, z) ≥ d(x, z)
These conditions are not necessarily fulfilled in practice. A funny example of an asymmetric distance occurs if you ask everyone in a group of people how much they like every other member of the group and then use the responses to construct a distance measure: it is not at all guaranteed that the feelings of person A for person B are requited by B
For technical reasons, the symmetry property is usually highly desirable. You can always construct a symmetric distance function from an asymmetric one:
d_S(x, y) = (d(x, y) + d(y, x)) / 2
20. Distances – common distances
The Manhattan, Euclidean, Maximum, and Minkowski distances all have similar properties; which one to apply may depend on empirical testing, or on subtle details of the data domain:
Manhattan (L1 metric): d(x, y) = ∑_i |x_i − y_i|
Euclidean (L2 metric): d(x, y) = (∑_i (x_i − y_i)^2)^(1/2)
Maximum (L∞ metric): d(x, y) = max_i |x_i − y_i|
Minkowski (Lp metric): d(x, y) = (∑_i |x_i − y_i|^p)^(1/p)
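As an illustration, here is a minimal sketch of this family of distances in plain Python (the function names are mine, not from the slides):

```python
from math import inf

def minkowski(x, y, p):
    """Minkowski (Lp) distance between two equal-length vectors."""
    if p == inf:
        # Maximum (L-infinity): the largest per-coordinate difference
        return max(abs(a - b) for a, b in zip(x, y))
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def manhattan(x, y):
    return minkowski(x, y, 1)   # L1

def euclidean(x, y):
    return minkowski(x, y, 2)   # L2

x, y = (0.0, 3.0), (4.0, 0.0)
print(manhattan(x, y))          # 7.0
print(euclidean(x, y))          # 5.0
print(minkowski(x, y, inf))     # 4.0
```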
21. Distances – correlation-based
Correlation-based measures: used if the data is numeric but not mixable (so that it does not make sense to add a random fraction of one data set to a random fraction of a different data set), as, for example, in time series
The normalized dot product of two points is the cosine of the angle that the two vectors make with each other – if they are perfectly aligned, then the angle is 0 and the cosine (and the correlation) is 1; if they are at right angles to each other, the cosine is 0
The only difference between the dot product and the correlation coefficient is that for the second, we first center both data points by subtracting their respective means
By construction, the normalized dot product of non-negative data falls in the interval [0, 1], while the correlation coefficient always falls in the interval [−1, 1]
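A small sketch of the relationship just described (the helper names are illustrative): cosine similarity is the normalized dot product, and the correlation coefficient is the same computation after centering:

```python
from math import sqrt

def cosine_similarity(x, y):
    """Normalized dot product of two vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = sqrt(sum(a * a for a in x))
    ny = sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

def correlation(x, y):
    """Pearson correlation: cosine similarity of the mean-centered vectors."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    return cosine_similarity([a - mx for a in x], [b - my for b in y])

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]
print(cosine_similarity(x, y))  # 1.0 (perfectly aligned)
print(correlation(x, y))        # 1.0 (perfectly correlated)
```

A correlation-based distance then follows as, for example, d(x, y) = 1 − correlation(x, y).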
23. Distances – binary and sparse
If the data is categorical, then we can count the number of features that do not agree in both data points (i.e., the number of mismatched features); this is the Hamming distance
As an example, imagine a patient's health record: each possible medical condition constitutes a feature, and we want to know whether the patient has ever suffered from it
In situations where the features are categorical, binary, and sparse (just a few are On), we may be more interested in matches between features that are On than in those that are Off; this leads us to the Jaccard coefficient s_J: the number of matches between features that are On for both points, divided by the number of features that are On in at least one of the data points
The Jaccard coefficient is a similarity measure; the corresponding distance function is the Jaccard distance d_J = 1 − s_J
The Jaccard distance:
As an example, imagine graph data. The similarity of two vertices is given by how many neighbors they have in common (On) – which is usually sparse, as just a few vertices are neighbors of a given vertex
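A minimal sketch of both measures over binary feature vectors (the function names are mine):

```python
def hamming(x, y):
    """Number of positions where the two binary vectors disagree."""
    return sum(1 for a, b in zip(x, y) if a != b)

def jaccard_distance(x, y):
    """1 - Jaccard coefficient: On-matches over features On in either vector."""
    both_on = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    any_on = sum(1 for a, b in zip(x, y) if a == 1 or b == 1)
    return (1.0 - both_on / any_on) if any_on else 0.0

x = [1, 0, 0, 1, 0, 0, 0, 1]
y = [1, 0, 0, 0, 0, 0, 0, 1]
print(hamming(x, y))           # 1
print(jaccard_distance(x, y))  # 0.333... (1 - 2/3)
```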
25. Distances – strings
If we are dealing with many strings that are rather similar to each other (distorted through typos, for instance), then we can use a more detailed measure of the difference between them – namely the edit or Levenshtein distance. The Levenshtein distance is the minimum number of single-character operations (insertions, deletions, and substitutions) required to transform one string into the other
Another approach is to find the length of the longest common subsequence; this metric is often used for gene sequence analysis in computational biology
The best distance measure to use does not follow automatically from the data type; rather, it depends on the semantics of the data – or, more precisely, on the semantics that you care about for your current analysis!
In some cases, a simple metric that only calculates the difference in string length may be perfectly sufficient. In another case, you might want to use the Hamming distance.
If you really care about the details of otherwise similar strings, the Levenshtein distance is most appropriate. You might even want to calculate how often each letter appears in a string and then base your comparison on that.
It all depends on what the data means and on what aspect of it you are interested in at the moment (which may also change as the analysis progresses).
Similar considerations apply everywhere – there are no "cookbook" rules.
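As an illustration, a compact dynamic-programming sketch of the Levenshtein distance (a standard formulation, not taken from the slides):

```python
def levenshtein(s, t):
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn s into t."""
    # previous[j] holds the distance between the current prefix of s and t[:j]
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

print(levenshtein("kitten", "sitting"))  # 3
```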
26. Clustering methods
Different algorithms are suitable for different kinds of problems – depending, for example, on the shape and structure of the clusters
Some require vector-like data, whereas others require only a distance function
Different algorithms tend to be misled by different kinds of pitfalls, and they all have different performance (i.e., computational complexity) characteristics
There are three main categories of clustering algorithms: center seekers, tree builders, and neighborhood growers – I said three main categories, not the only three (see the "Survey of Clustering Data Mining Techniques" by Pavel Berkhin)
27. Clustering methods – k-means
One of the most popular clustering methods is the k-means algorithm; the k-means algorithm requires the number of expected clusters k as input and works in an iterative scheme to search for the correct center of each cluster
The main idea is to calculate the position of each cluster's center (or centroid) from the positions of the points belonging to the cluster and then to assign points to their nearest centroid – this process is repeated until sufficient convergence is achieved
The algorithm is as follows:
    choose initial positions for the cluster centroids
    repeat until the assignments no longer change:
        for each point:
            calculate its distance from each cluster centroid
            assign the point to the nearest cluster
        recalculate the positions of the cluster centroids
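A minimal runnable sketch of this loop in Python with NumPy (all names are illustrative; random data points serve as initial centroids, as the deck suggests later):

```python
import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    """Lloyd's k-means: assign each point to its nearest centroid, then
    move every centroid to the center of mass of its assigned points."""
    rng = np.random.default_rng(seed)
    # choose random data points as initial centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    labels = np.zeros(len(points), dtype=int)
    for iteration in range(max_iter):
        # distance of every point to every centroid, shape (n_points, k)
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        if iteration > 0 and np.array_equal(new_labels, labels):
            break  # assignments stable: converged
        labels = new_labels
        for j in range(k):
            members = points[labels == j]
            if len(members):  # keep the old centroid if a cluster went empty
                centroids[j] = members.mean(axis=0)
    return labels, centroids

rng = np.random.default_rng(42)
pts = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels, centers = kmeans(pts, k=2)
print(centers)  # roughly (0, 0) and (5, 5)
```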
29. Clustering methods – k-means
The k-means algorithm is nondeterministic: a different choice of starting values may result in a different assignment of points to clusters; for this reason, it is customary to run the k-means algorithm several times and then compare the results
If you have previous knowledge of likely positions for the cluster centers, you can use it to precondition the algorithm; otherwise, choose random data points as initial values
What makes this algorithm efficient is that you don't have to search the existing data points to find one that would make a good centroid – instead you are free to construct a new centroid position; this is usually done by calculating the cluster's center of mass: c = (1/|C|) ∑_{x in C} x
If we are dealing with categorical data, then the k-means algorithm cannot be used (one cannot calculate the center of mass); in this case we must use the k-medoids algorithm
The only difference is that instead of calculating a new centroid, it is necessary to search all the points in the cluster to find the data point that has the smallest average distance to all other points in its cluster
For this reason, the k-medoids algorithm is O(n^2), whereas the k-means algorithm is O(k·n), where k is the number of clusters
For performance, it is possible to run k-medoids on a sample of the dataset to get an idea of the cluster centers, and then run it on the entire dataset
30. Clustering methods – k-means
Despite its cheap-and-cheerful appearance, the k-means algorithm works surprisingly well. It is pretty fast and relatively robust. Convergence is usually quick. Because the algorithm is simple and highly intuitive, it is easy to augment or extend it – for example, to incorporate points with different weights. You might also want to experiment with different ways to calculate the centroid, possibly using the median position rather than the mean, and so on.
In summary:
The k-means algorithm and its variants work best for globular (at least star-convex) clusters; the results will be meaningless for clusters with complicated shapes and for nested clusters
The expected number of clusters is required as an input; if this number is not known, it will be necessary to repeat the algorithm with different values and compare the results
The algorithm is iterative and nondeterministic; the specific outcome may depend on the choice of starting values
The k-means algorithm requires vector data; use the k-medoids algorithm for categorical data
The algorithm can be misled if there are clusters of highly different size or density
The k-means algorithm is linear in the number of data points; the k-medoids algorithm is quadratic in the number of points
31. Clustering methods – DBSCAN
Neighborhood growers work by connecting points that are "sufficiently close" to each other to form a cluster, and they keep doing so until all points have been classified
Based on the idea (definition) of a cluster as a region of high density; it makes no assumptions about the overall shape of the cluster
More robust than the k-means variations with respect to the structure of the clusters
32. Clustering methods – DBSCAN
The DBSCAN algorithm is an example of a neighborhood grower
It is based on two quantities:
The minimum density accepted for the points that define the cluster
The size of the region over which we expect the minimum density to hold
In practice, the algorithm asks for:
The neighborhood radius r
The minimum number of points n that we expect to find within the neighborhood of each point
34. Clustering methods – DBSCAN
DBSCAN distinguishes between three types of points: noise, core, and edge points:
A noise point is a point that has fewer than n points in its neighborhood of radius r; such a point does not belong to any cluster – it is background data
A core point has more than n neighbors
An edge point is a point that has fewer neighbors than required for a core point but that is itself the neighbor of a core point – the algorithm discards noise points and concentrates on core points
Whenever the algorithm finds a core point, it assigns a cluster label to that point and then continues to add all its neighbors, and their neighbors recursively, to the cluster, until all points have been classified
Finally, the basic algorithm lends itself to elegant recursive implementations, but keep in mind that the recursion will not unwind until the current cluster is complete. This means that, in the worst case (of a single connected cluster), you will end up putting the entire data set onto the stack!
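A compact sketch of DBSCAN using an explicit work list instead of recursion, sidestepping the stack problem just mentioned (the names and the brute-force neighbor search are illustrative):

```python
import numpy as np

def dbscan(points, r, n):
    """Label points with a cluster id, or -1 for noise.
    r: neighborhood radius; n: minimum neighbors for a core point."""
    NOISE, UNSEEN = -1, -2
    labels = np.full(len(points), UNSEEN)
    # brute-force neighborhoods; an index (e.g., a k-d tree) would speed this up
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    neighbors = [np.flatnonzero(row <= r) for row in dists]
    cluster = 0
    for i in range(len(points)):
        if labels[i] != UNSEEN:
            continue
        if len(neighbors[i]) < n:
            labels[i] = NOISE          # may later be relabeled as an edge point
            continue
        labels[i] = cluster            # new core point starts a cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # edge point: absorbed, not expanded
            if labels[j] != UNSEEN:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= n:  # core point: grow the cluster further
                queue.extend(neighbors[j])
        cluster += 1
    return labels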
35. Clustering methods – DBSCAN
DBSCAN is sensitive to the choice of parameters
For example, if a data set contains several clusters with widely varying densities, then a single set of parameters may not be sufficient to classify all of the clusters
A possible workaround is to use k-means first to identify cluster candidates, and then to extract statistics that will help parametrize DBSCAN
The computational complexity of DBSCAN is O(n^2), which can be ameliorated by indexing structures able to quickly find the neighbors of each point
36. Clustering methods – tree builders
Another way to find clusters is by successively combining clusters that are "close" to each other into a larger cluster until only a single cluster remains; this approach is known as agglomerative hierarchical clustering, and it leads to a treelike hierarchy of clusters
The distance between clusters is defined with respect to representative points within each cluster; the possibilities are:
Minimum or single link: the two points, one from each cluster, that are closest to each other; handles thinly connected clusters with complicated shapes, but it is sensitive to noise
Maximum or complete link: considers the points farthest away from each other; favors compact, globular clusters
Average: considers the average over all pairs of points
Centroid: considers the centroids of each cluster
Ward's method: combines the pair of clusters that yields the most coherent merged cluster; coherence can be measured by the average distance of all pairs, for example
37. Clustering methods – tree builders
The result of hierarchical clustering is not actually a set of clusters; instead, we obtain a treelike structure that contains the individual data points at the leaf nodes – this structure can be represented graphically in a dendrogram
Tree builder algorithms are expensive, on the order of O(n^3)
One outstanding feature of hierarchical clustering is that it does more than produce a flat list of clusters; it also shows their relationships in an explicit way
Tree builders can benefit from algorithms that are center seekers or neighborhood growers
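For instance, a short sketch using SciPy's agglomerative clustering; the linkage methods named below correspond to the options above (the data is made up):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

rng = np.random.default_rng(0)
# two made-up globular groups
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])

# 'single', 'complete', 'average', 'centroid', and 'ward' match the
# linkage criteria listed above
Z = linkage(X, method='ward')

# cut the tree into two flat clusters
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)

# dendrogram(Z) would draw the tree with matplotlib
```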
38. Pre-processing
The core algorithm for grouping data points into clusters is usually only part (though the most important one) of the whole strategy
Some data sets may require some cleanup or normalization before they are suitable for clustering: that's the first topic in this section
For example, look at the two plots below and answer: which one has well-defined clusters?
39. Pre-processing
For example, look at the two plots below and answer: which one has well-defined clusters?
Well, as a matter of fact, both plots show the same dataset, but with different aspect ratios
The same applies to datasets whose attributes span very different ranges – in such cases, it is necessary to normalize the data
Problems like these are not observed with correlation-based distances
41. Pre-processing
The simplest normalization can be achieved by:
x' = (x − x_min)/(x_max − x_min)
Or, otherwise, if the data is reasonably Gaussian, it is possible to use the z-score normalization:
x' = (x − x_mean)/x_stddev
But first, use an interquartile range analysis to get rid of outliers
Actually, normalization is very sensitive to outliers and to distributions that are too skewed – for these cases, there are many other normalization techniques; check, for instance:
http://stn.spotfire.com/spotfire_client_help/norm/norm_normalizing_columns.htm
Normalization by Mean
Normalization by Trimmed Mean
Normalization by Percentile
Scale between 0 and 1
Subtract the Mean
Subtract the Median
Normalization by Signed Ratio
Normalization by Log Ratio
Normalization by Log Ratio in Standard Deviation Units
Z-score Calculation
Normalization by Standard Deviation
Also, the Mahalanobis distance is less susceptible to normalization issues
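The two formulas above as a quick sketch (assuming NumPy arrays; the interquartile-range clipping step is my own illustrative addition):

```python
import numpy as np

def min_max(x):
    """Scale to [0, 1]: x' = (x - x_min) / (x_max - x_min)."""
    return (x - x.min()) / (x.max() - x.min())

def z_score(x):
    """Center and scale: x' = (x - mean) / stddev."""
    return (x - x.mean()) / x.std()

def clip_outliers(x, k=1.5):
    """Drop values outside [Q1 - k*IQR, Q3 + k*IQR] before normalizing."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return x[(x >= q1 - k * iqr) & (x <= q3 + k * iqr)]

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # 100.0 is an outlier
print(min_max(clip_outliers(x)))  # normalized without the outlier
```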
42. Post-processing (cluster evaluation)
It is also necessary to inspect the results of every clustering algorithm in order to validate and characterize the clusters that have been found
Given a set of clusters whose centroids are known, we can think of two metrics:
Mass: the number of points in the cluster
Radius: the standard deviation of the distances of all points in relation to the center of a given cluster; for two dimensions, with (x_c, y_c) the center of a cluster, we would have:
r^2 = ∑_i [(x_c − x_i)^2 + (y_c − y_i)^2]
We can also have the density of a cluster, given by:
density = mass/radius
43. Post-processing (cluster evaluation)
Besides density, there are:
Cohesion: the average distance between all points in a cluster; the smaller, the more compact
Separation: the average distance between all points in one cluster and all the points in another cluster – if we know the centroids, we could use them to simplify the calculation
For a set of clusters, we can calculate the average cohesion and separation over all clusters, and get an idea of the overall quality
If a data set can be clearly grouped into clusters, then we expect the distance between the clusters to be large compared to the radii of the clusters; therefore, we can think of an interesting metric based on cohesion and separation:
cluster_quality = separation/cohesion
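A sketch of these quantities for two clusters of points (a hedged illustration; the pairwise averages are computed by brute force):

```python
import numpy as np

def cohesion(cluster):
    """Average pairwise distance within a cluster (smaller = more compact)."""
    d = np.linalg.norm(cluster[:, None, :] - cluster[None, :, :], axis=2)
    n = len(cluster)
    return d.sum() / (n * (n - 1))  # exclude self-distances (the zero diagonal)

def separation(a, b):
    """Average distance between all points of cluster a and all points of cluster b."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    return d.mean()

rng = np.random.default_rng(0)
a = rng.normal(0, 0.3, (30, 2))
b = rng.normal(4, 0.3, (30, 2))
print(separation(a, b) / cohesion(a))  # cluster_quality >> 1 for well-separated data
```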
45. Post-processing (cluster evaluation)
One of the most used metrics for clustering is the silhouette coefficient, which for a single point i is given by:
S_i = (b_i − a_i) / max(a_i, b_i)
where a_i is the average distance from point i to all other points in its cluster (this is point i's cohesion), and b_i is the smallest average distance from point i to all the points in each of the other clusters (this is point i's separation from the closest other cluster)
The numerator is a measure of the "empty space" between clusters; the denominator is the larger of the cluster radius and the distance between clusters
Next, average the silhouette over all points in each cluster – this is the cluster's silhouette; average it over all clusters – this is the clustering's silhouette
The silhouette coefficient ranges from −1 to 1; negative values indicate that the cluster radius is greater than the distance between clusters, so that clusters overlap; this suggests poor clustering. Large values of S suggest good clustering
The silhouette can be used to toss background points out of the clustering process, that is, points that markedly exceed the average cohesion within a given cluster. This process can be used iteratively – once some points are tossed out, the clustering can be repeated and will hopefully produce better results; and so on.
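In practice the coefficient is usually computed with a library call; a brief sketch with scikit-learn (assuming it is available) on made-up data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.4, (50, 2)), rng.normal(4, 0.4, (50, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(silhouette_score(X, labels))  # close to 1 for well-separated blobs
```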
47. Post-processing (cluster evaluation)
The clustering silhouette is very important: it not only tells us the quality of a clustering, it can also tell us what the correct clustering is; for example, consider the following dataset:
Clearly we have clusters, but how many? Visually, we can make out from 6 to 8 clusters, depending on the observation.
What to do?
49. Post-processing (cluster evaluation)
One way to solve this problem is to run the k-means algorithm and calculate the silhouette for different numbers of clusters
In our example, we would get a silhouette curve peaking at 6 and 7 clusters
The plot indicates that 6 or 7 clusters are acceptable answers; the next stage is to consider the data characteristics in order to decide what the best answer is.
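A sketch of that procedure (scikit-learn names as before; the dataset is made up), scanning k and keeping the silhouette for each:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def silhouette_by_k(X, k_values):
    """Run k-means for each candidate k and record the clustering silhouette."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return scores

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, (30, 2)) for c in (0, 3, 6)])  # three blobs
scores = silhouette_by_k(X, range(2, 8))
print(max(scores, key=scores.get))  # 3: the silhouette peaks at the true count
```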
50. Warning
Just like any other analytical technique, clustering can lead you to unproductive circumstances (a waste of time) if not used with caution; some points must be of concern:
Most algorithms depend on heuristic parameters, and it may take hours to find the most appropriate values
Also, the algorithms lend themselves to modifications that, although they may sound intuitively right, take you nowhere
It is quite possible that, although you are looking for them, the data has no clusters at all; this is not such an improbable circumstance, because clustering algorithms are usually treated as black boxes – be circumspect, and pay attention to the evidence!
Despite the fact that there are evaluation methods and visualization tools, the clustering result may still be flawed; remember, there is no formal theory behind the concept of a cluster
Finally, this review is mostly addressed to practitioners, not to academics; for the latter, there are many other aspects that must be considered – for more details, please check the paper "Survey of Clustering Data Mining Techniques" by Pavel Berkhin, among other sources