Machine learning is an area of artificial intelligence concerned with the development of techniques which allow computers to "learn".
More specifically, machine learning is a method for creating computer programs by the analysis of data sets.
Machine learning overlaps heavily with statistics, since both fields study the analysis of data, but unlike statistics, machine learning is concerned with the algorithmic complexity of computational implementations.
Data clustering is a common technique for statistical data analysis, which is used in many fields, including machine learning, data mining, pattern recognition, image analysis and bioinformatics. Clustering is the classification of similar objects into different groups, or more precisely, the partitioning of a data set into subsets (clusters), so that the data in each subset (ideally) share some common trait, often proximity according to some defined distance measure.
Machine learning typically regards data clustering as a form of unsupervised learning.
K-means seeks to minimize the sum of squared distances between points and their assigned cluster centroids. This optimization problem is hard to solve exactly.
After each iteration of K-means the MSE (mean squared error) decreases, but K-means may converge only to a local optimum, so it is sensitive to the initial guesses.
The main advantages of this algorithm are its simplicity and speed which allows it to run on large datasets. Its disadvantage is that it does not yield the same result with each run, since the resulting clusters depend on the initial random assignments.
It maximizes inter-cluster (or minimizes intra-cluster) variance, but does not ensure that the result has a global minimum of variance.
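The iterative procedure described above can be sketched as follows. This is a minimal, plain-Python version of Lloyd's algorithm; the point set and parameter names are illustrative, not from the original slides.

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid update.
    Converges to a local optimum that depends on the initial centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initial assignment
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:  # no change: local optimum reached
            break
        centroids = new_centroids
    return centroids, clusters

# Two well-separated blobs of three points each.
pts = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
       (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
cents, cls = kmeans(pts, 2)
```

Running with a different `seed` changes the initial centroids and can change the result, which is exactly the sensitivity to initialization noted above.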
The distance between two points is the length of a straight line segment between them.
A more formal definition:
A distance between two points P and Q in a metric space is d(P, Q), where d is the distance function that defines the given metric space.
We can also define the distance between two sets A and B in a metric space as being the minimum (or infimum) of the distances between any two points P in A and Q in B.
In a very straightforward way we can define the similarity function sim: S×S → [0,1] as sim(o1, o2) = 1 - d(o1, o2), where o1 and o2 are elements of the space S, provided the distance d is normalized to take values in [0,1].
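The definition above can be sketched directly in code. The normalized Hamming distance used here is just one convenient example of a distance bounded in [0,1]; any normalized metric would do.

```python
def similarity(d, o1, o2):
    """Turn a distance bounded in [0, 1] into a similarity in [0, 1]:
    sim(o1, o2) = 1 - d(o1, o2)."""
    return 1.0 - d(o1, o2)

def norm_hamming(x, y):
    """Fraction of positions in which two equal-length vectors differ."""
    return sum(a != b for a, b in zip(x, y)) / len(x)

# The two vectors differ in 3 of 6 positions, so similarity is 0.5.
s = similarity(norm_hamming, [0, 1, 0, 0, 1, 1], [1, 1, 0, 1, 0, 1])
```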
Learning (either supervised or unsupervised) is impossible without ASSUMPTIONS
Watanabe’s Ugly Duckling theorem
Wolpert’s No Free Lunch theorem
Learning is impossible without some sort of bias.
The Ugly Duckling theorem gets its fanciful name from the following counter-intuitive statement: assuming similarity is based on the number of shared predicates, an ugly duckling is as similar to a beautiful swan A as a beautiful swan B is to A, given that A and B differ at all. It was proposed and proved by Satosi Watanabe in 1969.
Suppose we must classify a set S of n objects without prior knowledge of the essence of the categories.
The number of different classes, i.e. the number of different ways to group the objects into clusters, is given by the cardinality of the power set of S:
|Pow(S)| = 2^n
Without any prior information, the most natural way to measure the similarity between two distinct objects is to count the number of classes they share.
Oooops… any two distinct objects share exactly the same number of classes, namely 2^(n-2).
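The counting argument can be checked by brute force: a "class" is any subset of S, and a subset containing two fixed objects is determined by its choice over the remaining n-2 objects, giving 2^(n-2) shared classes for every pair. The object names below are illustrative.

```python
from itertools import combinations

def shared_classes(objects, a, b):
    """Count the subsets of `objects` (the possible classes)
    that contain both a and b."""
    n = len(objects)
    count = 0
    for r in range(n + 1):
        for subset in combinations(objects, r):
            if a in subset and b in subset:
                count += 1
    return count

S = ["duckling", "swan_A", "swan_B", "goose"]
# Every pair of distinct objects is in exactly 2**(n-2) classes,
# so by this similarity measure all pairs are equally similar.
n = len(S)
assert shared_classes(S, "duckling", "swan_A") == 2 ** (n - 2)
assert shared_classes(S, "swan_A", "swan_B") == 2 ** (n - 2)
```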
The Levenshtein distance or edit distance between two strings is the minimum number of operations needed to transform one string into the other, where an operation is an insertion, deletion, or substitution of a single character.
The binary edit distance, d(x,y), from a binary vector x to a binary vector y is the minimum number of single-bit flips required to transform one vector into the other.
Example: x = (0,1,0,0,1,1), y = (1,1,0,1,0,1), d(x,y) = 3. The binary edit distance is equivalent to the Manhattan distance (Minkowski p = 1) for binary feature vectors.
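Both distances are short to implement. The Levenshtein distance below uses the standard dynamic-programming recurrence; the binary edit distance reduces to counting differing positions.

```python
def levenshtein(s, t):
    """Edit distance via dynamic programming: minimum number of
    insertions, deletions, and substitutions to turn s into t."""
    m, n = len(s), len(t)
    prev = list(range(n + 1))  # distances from the empty prefix of s
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution (or match)
        prev = curr
    return prev[n]

def binary_edit(x, y):
    """Binary edit distance: number of single-bit flips, i.e. the
    Manhattan (Minkowski p=1) distance for binary vectors."""
    return sum(a != b for a, b in zip(x, y))
```

For the slide's example, `binary_edit((0,1,0,0,1,1), (1,1,0,1,0,1))` gives 3.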
In the bag-of-words model the document is represented as d = {Apple, Banana, Coffee, Peach}.
Each term is represented, but no information on frequency is kept.
Binary encoding: a t-dimensional bit vector.
For the document "Apple Peach Apple Banana Apple Banana Coffee Apple Coffee", the encoding is d = [1,0,1,1,0,0,0,1].
In the vector-space model the document is represented as d = {<Apple,4>, <Banana,2>, <Coffee,2>, <Peach,1>}.
Information about frequency is recorded, as t-dimensional vectors.
For the same document "Apple Peach Apple Banana Apple Banana Coffee Apple Coffee", the encoding is d = [4,0,2,2,0,0,0,1].
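Both encodings can be built from the same term counts. The placeholder terms X1–X4 below are an assumption: the slide's 8-dimensional vectors imply a vocabulary with four additional terms that do not occur in this document.

```python
from collections import Counter

# Hypothetical 8-term vocabulary; the slide's vectors place Apple,
# Banana, Coffee, and Peach at positions 1, 3, 4, and 8.
vocab = ["Apple", "X1", "Banana", "Coffee", "X2", "X3", "X4", "Peach"]

doc = "Apple Peach Apple Banana Apple Banana Coffee Apple Coffee".split()
counts = Counter(doc)

# Bag-of-words (binary) encoding: 1 if the term occurs at all.
binary_vec = [1 if t in counts else 0 for t in vocab]

# Vector-space (frequency) encoding: the term's count in the document.
freq_vec = [counts[t] for t in vocab]
```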
Dimensionality reduction approaches can be divided into two categories:
Feature selection approaches try to find a subset of the original features. Optimal feature selection for supervised learning problems requires an exhaustive search of all possible subsets of features of the chosen cardinality.
Feature extraction applies a mapping of the multidimensional space into a space of fewer dimensions. This means that the original feature space is transformed, e.g. by applying a linear transformation via principal components analysis.
In statistics, principal components analysis (PCA) is a technique that can be used to simplify a dataset.
More formally it is a linear transformation that chooses a new coordinate system for the data set such that the greatest variance by any projection of the data set comes to lie on the first axis (then called the first principal component), the second greatest variance on the second axis, and so on.
PCA can be used for reducing dimensionality in a dataset while retaining those characteristics of the dataset that contribute most to its variance by eliminating the later principal components (by a more or less heuristic decision).
The components of C_x, denoted by c_ij, represent the covariances between the random variable components x_i and x_j. The component c_ii is the variance of the component x_i. The variance of a component indicates the spread of the component values around its mean value. If two components x_i and x_j of the data are uncorrelated, their covariance is zero (c_ij = c_ji = 0).
The covariance matrix is, by definition, always symmetric.
Given a sample of vectors x_1, x_2, ..., x_M, we can calculate the sample mean and the sample covariance matrix as estimates of the mean and the covariance matrix.
From a symmetric matrix such as the covariance matrix, we can calculate an orthogonal basis by finding its eigenvalues and eigenvectors. The eigenvectors e_i and the corresponding eigenvalues λ_i are the solutions of the equation C_x e_i = λ_i e_i, obtained by solving the characteristic equation |C_x - λI| = 0.
By ordering the eigenvectors in the order of descending eigenvalues (largest first), one can create an ordered orthogonal basis with the first eigenvector having the direction of largest variance of the data. In this way, we can find directions in which the data set has the most significant amounts of energy.
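The whole PCA pipeline just described — center the data, estimate the covariance matrix, eigendecompose it, and keep the directions of largest variance — fits in a few lines of NumPy. The generated data below is illustrative.

```python
import numpy as np

def pca(X, k):
    """Project the rows of X onto the k eigenvectors of the sample
    covariance matrix with the largest eigenvalues (the first k
    principal components)."""
    Xc = X - X.mean(axis=0)          # center the data on the sample mean
    C = np.cov(Xc, rowvar=False)     # sample covariance matrix
    vals, vecs = np.linalg.eigh(C)   # eigh: for symmetric matrices,
                                     # eigenvalues in ascending order
    order = np.argsort(vals)[::-1]   # reorder by descending eigenvalue
    return Xc @ vecs[:, order[:k]]

rng = np.random.default_rng(0)
# Correlated 2-D data: most of the variance lies along one direction.
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 1.0], [1.0, 0.5]])
Y = pca(X, 1)  # reduce to the single direction of largest variance
```

The variance of the 1-D projection equals the largest eigenvalue, which is at least as large as the variance of any single original coordinate.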
The key idea of this approach is to create a small signature for each document, ensuring that similar documents have similar signatures.
There exists a family H of hash functions such that for each pair of pages u, v we have Pr[mh(u) = mh(v)] = sim(u, v), where the hash function mh is chosen at random from the family H.
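A min-hash signature can be sketched as follows. The property above holds exactly for random permutations of the universe; this sketch approximates permutations with salted hash functions, a common practical shortcut, and the sets and parameters are illustrative.

```python
import random

def minhash_signature(s, hashes):
    """For each hash function, record the minimum hash value over the
    set. Pr[min-hash of u equals min-hash of v] approximates the
    Jaccard similarity sim(u, v) = |u ∩ v| / |u ∪ v|."""
    return [min(h(x) for x in s) for h in hashes]

def estimate_similarity(sig_u, sig_v):
    """Fraction of signature positions where the min-hashes agree."""
    return sum(a == b for a, b in zip(sig_u, sig_v)) / len(sig_u)

rng = random.Random(42)
salts = [rng.getrandbits(32) for _ in range(512)]
hashes = [lambda x, salt=salt: hash((salt, x)) for salt in salts]

u = set(range(0, 80))     # |u ∩ v| = 40, |u ∪ v| = 120, Jaccard = 1/3
v = set(range(40, 120))
est = estimate_similarity(minhash_signature(u, hashes),
                          minhash_signature(v, hashes))
```

With 512 hash functions the estimate concentrates near the true Jaccard similarity of 1/3; more hashes shrink the variance at the cost of a longer signature.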