Machine Learning
Algorithms
Girish Khanzode
Clusters Representation
• Use the centroid of each cluster to represent the cluster
• Compute the radius and standard devi...
Cluster Classification
• All the points in a cluster have the same class label - the
cluster ID
• Run a supervised learnin...
Cluster Classification
Distance BetweenTwo Clusters
• Single link
• Complete link
• Average link
• Centroids
• …
Single Link Method
• The distance between two clusters is the distance between two closest
data points in the two clusters...
Complete Link Method
• Distance between two clusters is the distance of two furthest data points
in the two clusters
• Sen...
Average Link Method
• Distance between two clusters is the average distance of all pair-wise
distances between the data po...
Centroid Method
• Distance between two clusters is the distance
between their centroids
Algorithmic Complexity
• All the algorithms are at least O(n2)
– n is the number of data points
• Single link can be done ...
Distance Functions
• Key to clustering
• similarity and dissimilarity are also commonly used terms
• Numerous distance fun...
Distance Functions - Numeric Attributes
• Euclidean distance
• Manhattan (city block) distance
• Denote distance with dist...
Distance Formulae
• If h = 2, it is the Euclidean distance
• If h = 1, it is the Manhattan distance
• Weighted Euclidean d...
Distance Formulae
• Squared Euclidean distance - to place progressively greater weight on
data points that are further apa...
Curse of Dimensionality
• Various problems that arise analyzing and organizing data in high
dimensional spaces do not occu...
Dimensionality Reduction - Applications
• Information Retrieval – web documents where dimensionality is
vocabulary of word...
Dimensionality Reduction
• Defying the curse of dimensionality - simpler models result in improved generalization
• Classi...
Dimensionality Reduction
Techniques
• Linear Discriminant Analysis – LDA
– Tries to identify attributes that account for the most variance between ...
Feature Construction
• Linear methods
– Principal component analysis (PCA)
– Independent component analysis (ICA)
– Fisher...
Principal component analysis - PCA
• A tool in exploratory data analysis and to create predictive models
• Involves calcul...
PCA
PCA
Fisher Linear Discriminant
• A classification method that projects high-dimensional data
onto a line and performs classifi...
Fisher Linear Discriminant
Linear Discriminant Analysis - LDA
• A generalization of Fisher's linear
discriminant
• A method used to find a linear
com...
Difference
Kernel PCA
• Classic PCA approach is a linear projection technique that works well if
the data is linearly separable
• In ...
Kernel PCA
• The basic idea to deal with linearly inseparable data is to project it onto a higher
dimensional space where ...
Kernel PCA
SingularValue Decomposition - SVD
• A mechanism to break a matrix into simpler meaningful pieces
• Used to detect grouping...
SVD
Hidden Markov Models - HMM
• A statistical Markov model in which the system being modelled is
assumed to be a Markov proce...
Hidden Markov Models - HMM
• Each state has a probability distribution over the possible output tokens
• Therefore the seq...
HMM
MODEL EVALUATION
Model Evaluation
• How accurate is the classifier?
• When the classifier is wrong, how is it wrong?
• Decide on which clas...
Testing Set
• Split the available data into a training set and a test set
• Train the classifier in the training set and e...
Classifier Accuracy
• The accuracy of a classifier on a given test set is the
percentage of test set tuples that are corre...
False PositiveVs. Negative
• When is the model wrong?
– False positives vs. false negatives
– Related to type I and type I...
Confusion Matrix
• Mechanism to illustrate how a model is performing in terms
of false positives and false negatives
• Pro...
Confusion Matrix
More Accuracy Measures
Area Under ROC Curve - AUC
• ROC curves can be used to
compare models
• Bigger the AUC, the more
accurate the model
• ROC ...
Gini-Statistic
• Calculated from the ROC Curve
– Gini = 2 *AUC – 1
• Where the AUC is the area under the ROC curve
K-Fold CrossValidation
• Divide the entire data set into k folds
• For each of k experiments, use kth fold for testing and...
K-Fold CrossValidation
• The accuracy of the system is calculated as the average error across the k
folds
• The main advan...
References
1. W. L. Chao, J. J. Ding, “Integrated Machine Learning Algorithms for Human Age Estimation”, NTU, 2011
2. Phil...
ThankYou
Check Out My LinkedIn Profile at
https://in.linkedin.com/in/girishkhanzode
Published in: Technology

  1. 1. Machine Learning Algorithms Girish Khanzode
  2. 2. Contents • Supervised Learning Model • Linear Regression • KNN • Decision Tree Learning • Optimized Tree Induction • Random Forest • Logistic Regression • SVM • Naive Bayes Classifier • Clustering • K-means Clustering • Cluster Classification • Algorithmic Complexity • Dimensionality Reduction • PCA • Fisher Linear Discriminant • LDA • Kernel PCA • SVD • HMM • Model Evaluation • Confusion Matrix • K-Fold Cross Validation • References
  3. 3. Machine Learning and Pattern Classification • Predictive modelling is building a model capable of making predictions • Such a model includes a machine learning algorithm that learns certain properties from a training dataset in order to make those predictions • Predictive modelling types - Regression and pattern classification • Regression models analyze relationships between variables and trends in order to make predictions about continuous variables – Prediction of the maximum temperature for the upcoming days in weather forecasting • Pattern classification assigns discrete class labels to particular observations as outcomes of a prediction – Prediction of a sunny, rainy or snowy day
  4. 4. Machine Learning Methodologies • Supervised learning – Learning from labelled data – Classification, Regression, Prediction, Function Approximation • Unsupervised learning – Learning from unlabelled data – Clustering, Visualization, Dimensionality Reduction
  5. 5. Machine Learning Methodologies • Semi-supervised learning – mix of Supervised and Unsupervised learning – usually small part of data is labelled • Reinforcement learning – Model learns from a series of actions by maximizing a reward function – The reward function can either be maximized by penalizing bad actions and/or rewarding good actions – Example - training of self-driving car using feedback from the environment
  6. 6. Applications • Speech recognition • Effective web search • Recommendation systems • Computer vision • Information retrieval • Spam filtering • Computational finance • Fraud detection • Medical diagnosis • Stock market analysis • Structural health monitoring
  7. 7. Learning Types
  8. 8. Machine Learning Algorithms
  9. 9. Learning Process • Supervised learning algorithms are used in classification and prediction • Training set - each record contains a set of attributes, one of which is the class • The classification or prediction algorithm learns from the training data the relationship between the predictor variables and the outcome variable • This process results in – Classification model – Predictive model
  10. 10. Learning Process
  11. 11. Typical Steps in ML
  12. 12. Supervised Learning Model • The class labels in the dataset used to build the classification model are known • Example - a dataset for spam filtering would contain spam messages as well as "ham" (= not-spam) messages • In a supervised learning problem, it is known which message in the training set is spam or ham and this information is used to train our model in order to classify new unseen messages
  13. 13. Supervised Learning Model
  14. 14. Classification and Regression
  15. 15. Linear Regression • A standard and simple mathematical technique for predicting numeric outcome • Oldest and most widely used predictive model • Goal - minimize the sum of the squared errors to fit a straight line to a set of data points • Fits a linear function to a set of data points • Form of the function – Y = β0 + β1*X1 + β2*X2 + … + βn*Xn – Y is the target variable and X1, X2, ... Xn are the predictor variables – β1, β2, … βn are the coefficients that multiply the predictor variables – β0 is constant • Linear regression with multiple variables – Scale the data, and implement the gradient descent and the cost function
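The least-squares goal on the slide above can be illustrated with a minimal sketch for the single-predictor case, Y = β0 + β1*X. It uses the closed-form solution rather than gradient descent, and the function name `fit_line` and the sample data are assumptions made for illustration only.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x with one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    b0 = mean_y - b1 * mean_x
    return b0, b1


# Hypothetical data lying exactly on y = 1 + 2x.
b0, b1 = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(b0, b1)
```

With several predictors the same objective is usually minimized with matrix algebra or with gradient descent on scaled data, as the slide suggests.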
  16. 16. Linear Regression
  17. 17. K Nearest Neighbors - KNN • A simple algorithm that stores all available cases and classifies new cases based on a similarity measure • Extremely simple to implement • Lazy learning - the function is only approximated locally and all computation is deferred until classification • Has a weighted version and can also be used for regression • Usually works very well when a meaningful distance between examples is defined (Euclidean, Manhattan) • Slow when the training set is large (say 10^6 examples) and the distance calculation is non-trivial • Only a single hyper-parameter – K (usually optimized using cross-validation)
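A minimal sketch of the KNN idea described above: store the training cases, then classify a query by majority vote among its K nearest neighbours under Euclidean distance. The function name `knn_classify` and the toy data are hypothetical and not taken from the slides.

```python
import math
from collections import Counter


def knn_classify(train, query, k=3):
    """train is a list of (feature_tuple, label) pairs; returns the majority label."""
    # Lazy learning: all the work happens here, at classification time.
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical two-class training set.
train = [
    ((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
    ((5.0, 5.0), "B"), ((5.2, 4.8), "B"), ((4.9, 5.1), "B"),
]
print(knn_classify(train, (1.1, 1.0)))
print(knn_classify(train, (5.0, 4.9)))
```

Features on very different scales (for example age and loan amount) would normally be rescaled first, since the raw distance is otherwise dominated by the larger-valued feature.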
  18. 18. KNN
  19. 19. KNN Classification Non-Default Default Age Loan$
  20. 20. Decision Tree Learning • Decision trees classify instances or examples by starting at the root of the tree and moving through it until a leaf node is reached • A method for approximating discrete-valued functions • A decision tree is a classifier in the form of a tree structure – Decision node - specifies a test on a single attribute – Leaf node - indicates the value of the target attribute – Branch - one outcome of the split on an attribute – Path - a conjunction of tests that leads to the final decision
  21. 21. When to Consider Decision Trees • Attribute-value description - the object or case must be expressible in terms of a fixed collection of properties or attributes – hot, mild, cold • Predefined classes (target values) - the target function has discrete output values – Boolean or multiclass • Sufficient data - enough training cases should be provided to learn the model • Possibly noisy training data • Missing attribute values
  22. 22. Decision Tree Applications • Credit risk analysis • Manufacturing – chemical material evaluation • Production – Process optimization • Biomedical Engineering – identify features to use in implantable devices • Astronomy – filter noise from Hubble telescope images • Molecular biology – analyze amino acid sequences in Human Genome project • Pharmacology – drug efficacy analysis • Planning – scheduling of PCB assembly lines • Medicine – analysis of syndromes
  23. 23. Strengths • Trees are inexpensive to construct • Extremely fast at classifying unknown records • Easy to interpret for small-sized trees • Accuracy is comparable to other classification techniques for many simple data sets • Generates understandable rules • Handles continuous and categorical variables • Provides a clear indication of which fields are most important for prediction or classification
  24. 24. Weaknesses • Not suitable for prediction of continuous attribute • Perform poorly with many classes and small data • Computationally expensive to train – At each node each candidate splitting field must be sorted before its best split can be found – In some algorithms combinations of fields are used and a search must be made for optimal combining weights – Pruning algorithms can also be expensive since many candidate sub-trees must be formed and compared • Not suitable for non-rectangular regions
  25. 25. Tree Representation • Each node in the tree specifies a test for some attribute of the instance • Each branch corresponds to an attribute value • Each leaf node assigns a classification
  26. 26. Tree Representation
  27. 27. Tree Induction
  28. 28. Problems of Random Split • The tree can grow huge • These trees are hard to understand • Larger trees are typically less accurate than smaller trees • So most tree construction methods use a greedy strategy – at each step, find the feature that best divides positive examples from negative examples, as measured by information gain
  29. 29. Optimized Tree Induction • Greedy strategy - Split the records based on an attribute test that optimizes a certain criterion • Issues – Determine root node – Determine how to split the records • How to specify the attribute test condition? • How to determine the best split? – Determine when to stop splitting
  30. 30. Optimized Tree Induction • Selection of an attribute at each node – Choose the most useful attribute for classifying training examples • Information gain – Measures how well a given attribute separates the training examples according to their target classification – This measure is used to select among the candidate attributes at each step while growing the tree
  31. 31. Entropy • A measure of the impurity (lack of homogeneity) of a set of examples • Given a set S of positive and negative examples of some target concept (a 2-class problem), the entropy of set S relative to this binary classification is – $$ E(S) = -p(P)\log_2 p(P) - p(N)\log_2 p(N) $$ • Example – Suppose S has 25 examples, 15 positive and 10 negative [15+, 10-] – Then the entropy of S relative to this classification is • $$ E(S) = -\frac{15}{25}\log_2\frac{15}{25} - \frac{10}{25}\log_2\frac{10}{25} \approx 0.971 $$
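The entropy formula above is easy to check numerically. The sketch below is illustrative: the helper name `entropy` is an assumption, and it takes class counts rather than probabilities, using the usual convention that 0 · log 0 = 0.

```python
import math


def entropy(counts):
    """Entropy in bits of a set described by its per-class counts."""
    total = sum(counts)
    # Classes with zero examples contribute nothing (0 * log 0 is taken as 0).
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


# The [15+, 10-] example from the slide.
print(round(entropy([15, 10]), 3))
# A perfectly mixed two-class set has the maximum entropy of 1 bit.
print(entropy([10, 10]))
```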
  32. 32. Entropy
  33. 33. Information Gain • Information gain measures the expected reduction in entropy or uncertainty – $$ Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} Entropy(S_v) $$ • Values(A) is the set of all possible values for attribute A and S_v is the subset of S for which attribute A has value v, S_v = {s in S | A(s) = v} • The first term in the equation is the entropy of the original collection S • The second term is the expected value of the entropy after S is partitioned using attribute A • It is the expected reduction in entropy caused by partitioning the examples according to this attribute • It is the number of bits saved when encoding the target value of an arbitrary member of S by knowing the value of attribute A
  34. 34. A simple example • Guess the outcome of next week's game between the MallRats and the Chinooks • Available knowledge / Attribute – was the game at Home or Away – was the starting time 5pm, 7pm or 9pm – Did Joe play center, or forward – whether that opponent's center was tall or not – …..
  35. 35. Basketball data
  36. 36. Problem Data • The next game will be away at 9pm and Joe will play center on offense… • A classification problem • Generalizing the learned rule to new examples
  37. 37. Examples • Before partitioning, the entropy is – H(10/20, 10/20) = - 10/20 log2(10/20) - 10/20 log2(10/20) = 1 • Using the where attribute, divide into 2 subsets – Entropy of the first set H(home) = - 6/12 log2(6/12) - 6/12 log2(6/12) = 1 – Entropy of the second set H(away) = - 4/8 log2(4/8) - 4/8 log2(4/8) = 1 • Expected entropy after partitioning – 12/20 * H(home) + 8/20 * H(away) = 1 • Information gain 1 - 1 = 0
  38. 38. Examples • Using the when attribute, divide into 3 subsets – Entropy of the first set H(5pm) = - 1/4 log2(1/4) - 3/4 log2(3/4) ≈ 0.81 – Entropy of the second set H(7pm) = - 9/12 log2(9/12) - 3/12 log2(3/12) ≈ 0.81 – Entropy of the third set H(9pm) = - 0/4 log2(0/4) - 4/4 log2(4/4) = 0, taking 0 log2(0) as 0 • Expected entropy after partitioning – 4/20 * H(1/4, 3/4) + 12/20 * H(9/12, 3/12) + 4/20 * H(0/4, 4/4) ≈ 0.65 • Information gain 1 - 0.65 = 0.35
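The two worked examples above can be reproduced with a short sketch. The class counts are the ones shown on the slides (20 games, split 10/10); the function names `entropy` and `information_gain` are assumptions for illustration.

```python
import math


def entropy(counts):
    """Entropy in bits of a set described by its per-class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(parent_counts, subsets):
    """Entropy of the parent minus the size-weighted entropy of the subsets."""
    total = sum(parent_counts)
    expected = sum(sum(s) / total * entropy(s) for s in subsets)
    return entropy(parent_counts) - expected


parent = [10, 10]
# "where" attribute: home has 6/6, away has 4/4.
gain_where = information_gain(parent, [[6, 6], [4, 4]])
# "when" attribute: 5pm has 1/3, 7pm has 9/3, 9pm has 0/4.
gain_when = information_gain(parent, [[1, 3], [9, 3], [0, 4]])
print(round(gain_where, 2), round(gain_when, 2))
```

The when attribute yields a gain of about 0.35 while where yields 0, which is why when is the better attribute to test first.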
  39. 39. Decision • Knowing the when attribute values provides larger information gain than where • Therefore the when attribute should be chosen for testing prior to the where attribute • Similarly we can compute the information gain for other attributes • At each node choose the attribute with the largest information gain • Stopping rule – Every attribute has already been included along this path through the tree or – The training examples associated with this leaf node all have the same target attribute value - entropy is zero
  40. 40. Continuous Attribute • Each non-leaf node is a test • Its edges partition the attribute into subsets (easy for a discrete attribute) • For a continuous attribute – Partition the continuous values of attribute A into a discrete set of intervals – Create a new Boolean attribute A_c by looking for a threshold c – How to choose c? $$ A_c = \begin{cases} \text{true} & \text{if } A < c \\ \text{false} & \text{otherwise} \end{cases} $$
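One common answer to "how to choose c", sketched here under stated assumptions: sort the values, take the midpoint between each pair of adjacent distinct values as a candidate threshold, and keep the candidate with the highest information gain. The test is assumed to be A < c, and the function names and the temperature data are hypothetical, not taken from the slides.

```python
import math


def entropy(labels):
    """Entropy in bits of a non-empty list of class labels."""
    n = len(labels)
    return -sum(
        (labels.count(v) / n) * math.log2(labels.count(v) / n) for v in set(labels)
    )


def best_threshold(values, labels):
    """Return (c, gain) for the midpoint threshold with the largest information gain."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best_c, best_gain = None, -1.0
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        if v1 == v2:
            continue  # no threshold fits between equal values
        c = (v1 + v2) / 2
        left = [lab for v, lab in pairs if v < c]    # A_c is true
        right = [lab for v, lab in pairs if v >= c]  # A_c is false
        expected = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if base - expected > best_gain:
            best_c, best_gain = c, base - expected
    return best_c, best_gain


# Hypothetical temperatures and play/no-play labels.
temps = [40, 48, 60, 72, 80, 90]
play = ["No", "No", "Yes", "Yes", "Yes", "No"]
print(best_threshold(temps, play))
```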
  41. 41. Evaluation • Training accuracy – How many training instances can be correctly classified based on the available data? – It is high when the tree is deep/large or when there are few conflicts among the training instances – A higher value does not mean good generalization • Testing accuracy – Given a number of new instances, how many of them can be correctly classified? – Cross validation
  42. 42. Decision Tree Creation Algorithms • ID3 • C4.5 • Hunt’s Algorithm • CART • SLIQ, SPRINT
  43. 43. Random Forest • An ensemble classifier that consists of many decision trees • Outputs the class that is the mode of the classes output by the individual trees • The method combines Breiman's bagging idea and the random selection of features • Used for classification and regression
  44. 44. Random Forest
  45. 45. Algorithm • Let the number of training cases be N and the number of variables in the classifier be M • Let m be the number of input variables used to determine the decision at a node of the tree - m should be much less than M • Choose a training set for this tree by sampling N times with replacement from all N available training cases (a bootstrap sample) • Use the rest of the cases to estimate the error of the tree by predicting their classes • For each node of the tree, randomly choose m variables on which to base the decision at that node • Calculate the best split based on these m variables in the training set • Each tree is fully grown and not pruned
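The two sources of randomness in the algorithm above can be sketched as follows. This is only the sampling part, not a full random forest: it assumes the bootstrap sample has the same size N as the training set, and the function names are hypothetical.

```python
import random


def bootstrap_sample(n_cases, rng):
    """Draw n_cases indices with replacement; return (in_bag, out_of_bag)."""
    in_bag = [rng.randrange(n_cases) for _ in range(n_cases)]
    # Cases never drawn are "the rest", used to estimate the tree's error.
    out_of_bag = sorted(set(range(n_cases)) - set(in_bag))
    return in_bag, out_of_bag


def random_feature_subset(n_features, m, rng):
    """Pick m distinct variable indices out of n_features for one node."""
    return rng.sample(range(n_features), m)


rng = random.Random(0)  # fixed seed only to make the sketch repeatable
in_bag, out_of_bag = bootstrap_sample(10, rng)
print(in_bag, out_of_bag)
print(random_feature_subset(n_features=9, m=3, rng=rng))
```

Each tree would be grown on its own `in_bag` cases, calling `random_feature_subset` again at every node before searching for the best split.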
  46. 46. Gini Index • Random forest uses the Gini index, taken from the CART learning system, to construct decision trees • The Gini index of node impurity is the measure most commonly chosen for classification-type problems • How to select the number of trees? - Build trees until the error no longer decreases • How to select m? - Try the recommended default, half of it and twice it, and pick the best
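The slide does not spell out the formula, so the sketch below assumes the standard CART definition of Gini impurity, 1 minus the sum of squared class proportions at a node. The function name `gini` is an assumption for illustration.

```python
def gini(counts):
    """Gini impurity of a node described by its per-class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


# A perfectly mixed two-class node is maximally impure; a pure node scores 0.
print(gini([10, 10]))
print(gini([4, 0]))
```

As with entropy, a split is scored by the size-weighted impurity of the child nodes, and the split with the lowest value is preferred.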
  47. 47. Random Forest - Flow Chart
  48. 48. Working of Random Forest • For prediction, a new sample is pushed down each tree • It is assigned the label of the training samples in the terminal node it ends up in • This procedure is iterated over all trees in the ensemble • The majority vote of all trees (or the average, for regression) is reported as the random forest prediction
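The final aggregation step can be sketched in a few lines, assuming each tree has already produced its own prediction for the sample. The function names and the example votes are hypothetical.

```python
from collections import Counter
from statistics import mean


def forest_classify(tree_predictions):
    """Classification: report the most common label among the trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


def forest_regress(tree_predictions):
    """Regression: report the average of the trees' numeric outputs."""
    return mean(tree_predictions)


print(forest_classify(["spam", "ham", "spam", "spam", "ham"]))
print(forest_regress([2.0, 3.0, 4.0]))
```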
  49. 49. Random Forest - Advantages • One of the most accurate learning algorithms • Produces a highly accurate classifier • Runs efficiently on large databases • Handles thousands of input variables without variable deletion • Gives estimates of what variables are important in classification • Generates an internal unbiased estimate of the generalization error as the forest building progresses • Effective method for estimating missing data and maintains accuracy when a large proportion of the data are missing
  50. 50. Random Forest - Advantages • Provides methods for balancing error in data sets with unbalanced class populations • Prototypes are computed that give information about the relation between the variables and the classification • Computes proximities between pairs of cases that can be used in clustering, in locating outliers or, by scaling, to give interesting views of the data • The above capabilities can be extended to unlabeled data, leading to unsupervised clustering, data views and outlier detection • Offers an experimental method for detecting variable interactions
  51. 51. Random Forest - Disadvantages • Random forests have been observed to overfit for some datasets with noisy classification/regression tasks • For data including categorical variables with different number of levels, random forests are biased in favor of those attributes with more levels • Therefore the variable importance scores from random forest are not reliable for this type of data
  52. 52. Logistic Regression • Models the relationship between a dependent variable and one or more independent variables • Allows one to examine the fit of the model as well as the significance of the relationships (between dependent and independent variables) being modelled • Estimates the probability of an event occurring - for example, the probability of a pupil continuing in education post 16 • Predicts, from knowledge of the relevant independent variables, the probability (p) that the dependent variable is 1 (event occurring) rather than 0 • While in linear regression the relationship between the dependent and the independent variables is linear, this assumption is not made in logistic regression
  53. 53. Logistic Regression • Logistic regression function $$ P = \frac{e^{\alpha+\beta x}}{1+e^{\alpha+\beta x}} $$ • P is the probability of a 1 and e is the base of the natural logarithm (about 2.718) • $$\alpha$$ and $$\beta$$ are the parameters of the model • The value of $$\alpha$$ yields P when x is zero and $$\beta$$ indicates how the probability of a 1 changes when x changes by a single unit • Because the relation between x and P is nonlinear, $$\beta$$ does not have as straightforward an interpretation in this model as it does in ordinary linear regression
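A small sketch of the logistic function above, useful for seeing how the same change in x moves P by different amounts at different points on the curve. The function name `logistic` and the parameter values are illustrative assumptions; the code uses the algebraically equivalent form 1 / (1 + e^-(α+βx)).

```python
import math


def logistic(x, alpha, beta):
    """P = e^(alpha + beta*x) / (1 + e^(alpha + beta*x)), written as 1 / (1 + e^-z)."""
    z = alpha + beta * x
    return 1.0 / (1.0 + math.exp(-z))


# With alpha = 0 and beta = 1 (hypothetical values), P is 0.5 at x = 0.
print(logistic(0, alpha=0.0, beta=1.0))
# A one-unit step in x changes P more near the middle than in the tail.
print(logistic(1, 0.0, 1.0) - logistic(0, 0.0, 1.0))
print(logistic(5, 0.0, 1.0) - logistic(4, 0.0, 1.0))
```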
  54. 54. Logistic Regression
  55. 55. Support Vector Machine - SVM • A supervised learning model with associated learning algorithms that analyze data and recognize patterns • Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples into one category or the other, making it a non-probabilistic binary linear classifier • An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible • New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall on
  56. 56. SVM
  57. 57. Naive Bayes Classifier • A family of simple probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence assumptions between the features • A popular method for text categorization, the problem of judging documents as belonging to one category or the other such as spam or legitimate, sports or politics etc with word frequencies as the features • Highly scalable, requires a number of parameters linear in the number of variables (features/predictors) in a learning problem
  58. 58. Conditional Probability Model
  59. 59. Naive Bayes - Example
  60. 60. UNSUPERVISED LEARNING
  61. 61. Clustering • A technique for finding groups of similar instances, called clusters, in data • Groups data instances that are similar to (near) each other into one cluster and data instances that are very different (far away) from each other into different clusters • Called an unsupervised learning task, since no class values denoting an a priori grouping of the data instances are given, as they are in supervised learning • One of the most utilized data mining techniques • Has a long history and is used in almost every field, such as medicine, psychology, botany, sociology, biology, archeology, marketing, insurance, libraries and text clustering
  62. 62. Clustering
  63. 63. Applications • Group people of similar sizes together to make small, medium and large T-Shirts – Tailor-made for each person - too expensive – One-size-fits-all - does not fit all • In marketing, segment customers according to their similarities – Targeted marketing • Given a collection of text documents, organize them according to their content similarities – To produce a topic hierarchy
  64. 64. Aspects of clustering • Clustering algorithms – Partitional clustering – Hierarchical clustering • A distance function - similarity or dissimilarity • Clustering quality – Inter-cluster distance ⇒ maximized – Intra-cluster distance ⇒ minimized • Quality of a clustering process depends on algorithm, distance function and application
  65. 65. K-means Clustering • A partitional clustering algorithm • Classifies a given data set into a certain number of clusters, k (k is fixed) • Let the set of data points D be {x1, x2, …, xn} – xi = (xi1, xi2, …, xir) is a vector in a real-valued space $$X \subseteq \mathbb{R}^r$$ – r = number of attributes (dimensions) in the data • The algorithm partitions the given data into k clusters – Each cluster has a cluster center (centroid) – k is user defined
  66. 66. K-means Clustering
  67. 67. K-Means Algorithm 1. Choose k 2. Randomly choose k data points (seeds) as initial centroids 3. Assign each data point to the closest centroid 4. Re-compute the centroids using the current cluster memberships 5. If a convergence criterion is not met, go to 3
  68. 68. K-Means Algorithm 1. k initial means (in this case k=3) are randomly generated within the data domain 2. k clusters are created by associating every observation with the nearest mean 3. The centroid of each of the k clusters becomes the new mean 4. Steps 2 and 3 are repeated until convergence has been reached
  69. 69. Stopping / Convergence Criterion 1. No (or minimum) re-assignments of data points to different clusters 2. No (or minimum) change of centroids or minimum decrease in the sum of squared error (SSE): $SSE = \sum_{j=1}^{k} \sum_{\mathbf{x} \in C_j} dist(\mathbf{x}, \mathbf{m}_j)^2$ – Cj is the jth cluster, mj is the centroid of cluster Cj (the mean vector of all the data points in Cj), and dist(x, mj) is the distance between data point x and centroid mj
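As a small illustration of the SSE criterion, the sketch below sums the squared Euclidean distance from every point to the centroid of its assigned cluster. The function name and the argument layout (a list of points, a parallel list of cluster indices, and a list of centroids) are assumptions chosen for the example.

```python
def sum_of_squared_error(data, labels, centroids):
    """SSE: sum over clusters j and points x in C_j of dist(x, m_j)^2."""
    total = 0.0
    for point, label in zip(data, labels):
        centroid = centroids[label]
        # Squared Euclidean distance between the point and its centroid.
        total += sum((p - c) ** 2 for p, c in zip(point, centroid))
    return total
```

Tracking this value between iterations gives one way to stop when the decrease falls below a chosen threshold.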
  70. 70. Example
  71. 71. Example
  72. 72. K Means - Strengths • Simple to understand and implement • Efficient: time complexity is O(tkn), where – n is number of data points – k is number of clusters – t is number of iterations • Since both k and t are usually small compared with n, k-means is considered a linear algorithm • Most popular clustering algorithm • Terminates at a local optimum if SSE is used • The global optimum is hard to find due to complexity
  73. 73. K Means - Weaknesses • Only applicable if the mean is defined – For categorical data, use k-modes, where the centroid is represented by the most frequent values • User must specify k • Sensitive to outliers – Outliers are data points that are very far away from other data points – Outliers could be errors in the data recording or some special data points with very different values
  74. 74. K Means - Weaknesses
  75. 75. Handling Outliers • Remove data points in the clustering process that are much further away from the centroids than other data points • Perform random sampling – Since in sampling we only choose a small subset of the data points, the chance of selecting an outlier is very small – Assign the rest of the data points to the clusters by distance or similarity comparison or classification
  76. 76. Sensitivity to Initial Seeds
  77. 77. Use of Different Seeds for Good Results There are some methods to help choose good seeds
  78. 78. Weaknesses • Unsuitable for discovering clusters that are not hyper-ellipsoids (or hyper-spheres)
  79. 79. K-Means • Still the most popular algorithm - simplicity, efficiency • Other clustering algorithms have their own weaknesses • No clear evidence that any other clustering algorithm performs better in general – although other algorithms could be more suitable for some specific types of data or applications • Comparing different clustering algorithms is a difficult task • No one knows the correct clusters
  80. 80. Clusters Representation • Use the centroid of each cluster to represent the cluster • Compute the radius and standard deviation of the cluster to determine its spread in each dimension • Centroid representation alone works well if the clusters are hyper-spherical in shape • If clusters are elongated or are of other shapes, centroids are not sufficient
  81. 81. Cluster Classification • All the points in a cluster have the same class label - the cluster ID • Run a supervised learning algorithm on the data to find a classification model
  82. 82. Cluster Classification
  83. 83. Distance Between Two Clusters • Single link • Complete link • Average link • Centroids • …
  84. 84. Single Link Method • The distance between two clusters is the distance between the two closest data points in the two clusters, one data point from each cluster • It can find arbitrarily shaped clusters, but it may cause an undesirable chain effect due to noisy points
  85. 85. Complete Link Method • Distance between two clusters is the distance between the two furthest data points in the two clusters, one data point from each cluster • Sensitive to outliers because they are far away
  86. 86. Average Link Method • Distance between two clusters is the average distance of all pair-wise distances between the data points in two clusters • A compromise between – the sensitivity of complete-link clustering to outliers – the tendency of single-link clustering to form long chains that do not correspond to the intuitive notion of clusters as compact, spherical objects
  87. 87. Centroid Method • Distance between two clusters is the distance between their centroids
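The four cluster-to-cluster distances described on the preceding slides can be compared side by side with a short sketch. It assumes Euclidean distance between points (via `math.dist`, available from Python 3.8) and represents each cluster as a list of coordinate tuples; the function names are illustrative only.

```python
import math


def single_link(a, b):
    """Distance between the two closest points, one from each cluster."""
    return min(math.dist(p, q) for p in a for q in b)


def complete_link(a, b):
    """Distance between the two furthest points, one from each cluster."""
    return max(math.dist(p, q) for p in a for q in b)


def average_link(a, b):
    """Average of all pair-wise distances between the two clusters."""
    return sum(math.dist(p, q) for p in a for q in b) / (len(a) * len(b))


def centroid_link(a, b):
    """Distance between the centroids (mean points) of the two clusters."""
    centroid_a = [sum(c) / len(a) for c in zip(*a)]
    centroid_b = [sum(c) / len(b) for c in zip(*b)]
    return math.dist(centroid_a, centroid_b)
```

For the same pair of clusters, single link never exceeds average link, which never exceeds complete link.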
  88. 88. Algorithmic Complexity • All the algorithms are at least O(n^2) – n is the number of data points • Single link can be done in O(n^2) • Complete and average links can be done in O(n^2 log n) • Due to this complexity, they are hard to use for large data sets – Sampling – Scale-up methods (BIRCH)
  89. 89. Distance Functions • Key to clustering • Similarity and dissimilarity are also commonly used terms • Numerous distance functions – Different types of data • Numeric data • Nominal data – Different specific applications
  90. 90. Distance Functions - Numeric Attributes • Euclidean distance • Manhattan (city block) distance • Denote distance with dist(xi, xj) where xi and xj are data points (vectors) • Both are special cases of Minkowski distance, where h is a positive integer: $dist(\mathbf{x}_i, \mathbf{x}_j) = \left( |x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \dots + |x_{ir} - x_{jr}|^h \right)^{1/h}$
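  91. 91. Distance Formulae • If h = 2, it is the Euclidean distance: $dist(\mathbf{x}_i, \mathbf{x}_j) = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \dots + (x_{ir} - x_{jr})^2}$ • If h = 1, it is the Manhattan distance: $dist(\mathbf{x}_i, \mathbf{x}_j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \dots + |x_{ir} - x_{jr}|$ • Weighted Euclidean distance: $dist(\mathbf{x}_i, \mathbf{x}_j) = \sqrt{w_1 (x_{i1} - x_{j1})^2 + w_2 (x_{i2} - x_{j2})^2 + \dots + w_r (x_{ir} - x_{jr})^2}$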
  92. 92. Distance Formulae • Squared Euclidean distance - places progressively greater weight on data points that are further apart: $dist(\mathbf{x}_i, \mathbf{x}_j) = (x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \dots + (x_{ir} - x_{jr})^2$ • Chebychev distance - appropriate when two data points should count as different if they differ on any one of the attributes: $dist(\mathbf{x}_i, \mathbf{x}_j) = \max(|x_{i1} - x_{j1}|, |x_{i2} - x_{j2}|, \dots, |x_{ir} - x_{jr}|)$
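The distance functions from the last three slides translate directly into a few lines of Python. The sketch below is illustrative: the function names are assumptions, points are plain tuples of numbers, and the Minkowski form uses absolute differences so that it stays valid for odd h.

```python
def minkowski(a, b, h):
    """Minkowski distance; h=1 gives Manhattan, h=2 gives Euclidean."""
    return sum(abs(x - y) ** h for x, y in zip(a, b)) ** (1.0 / h)


def weighted_euclidean(a, b, weights):
    """Euclidean distance with one non-negative weight per attribute."""
    return sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)) ** 0.5


def squared_euclidean(a, b):
    """Euclidean distance without the square root."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def chebychev(a, b):
    """Largest absolute difference over all attributes."""
    return max(abs(x - y) for x, y in zip(a, b))
```

For the points (0, 0) and (3, 4), the differences are 3 and 4, so the Manhattan distance is their sum, the Euclidean distance is the hypotenuse, and the Chebychev distance is the larger of the two.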
  93. 93. Curse of Dimensionality • Various problems arise when analyzing and organizing data in high dimensional spaces that do not occur in low dimensional spaces like 2D or 3D • In the context of classification/function approximation, the performance of a classification algorithm can improve when irrelevant features are removed
  94. 94. Dimensionality Reduction - Applications • Information Retrieval – web documents where dimensionality is vocabulary of words • Recommender systems – large scale of ratings matrix • Social networks – social graph with large number of users • Biology – gene expressions • Image processing – facial recognition
  95. 95. Dimensionality Reduction • Defying the curse of dimensionality - simpler models result in improved generalization • Classification algorithm may not scale up to the size of the full feature set either in space or time • Improves understanding of domain • Cheaper to collect and store data based on reduced feature set • Two Techniques – Feature Construction – Feature Selection
  96. 96. Dimensionality Reduction
  97. 97. Techniques • Linear Discriminant Analysis – LDA – Tries to identify attributes that account for the most variance between classes – Unlike PCA, LDA is a supervised method that uses known class labels • Principal component analysis – PCA – Identifies combinations of linearly correlated attributes (principal components or directions in feature space) that account for the most variance in the data – Plot the different samples on the first 2 principal components • Singular Value Decomposition – SVD – Factorization of a real or complex matrix – Closely related to PCA, which can be computed from the SVD of the mean-centered data matrix
  98. 98. Feature Construction • Linear methods – Principal component analysis (PCA) – Independent component analysis (ICA) – Fisher Linear Discriminant (LDA) – …. • Non-linear methods – Kernel PCA – Non linear component analysis (NLCA) – Local linear embedding (LLE) – ….
  99. 99. Principal component analysis - PCA • A tool for exploratory data analysis and for creating predictive models • Involves calculating the eigenvalue decomposition of a data covariance matrix, usually after mean centering the data for each attribute • Mathematically defined as an orthogonal linear transformation that maps data to a new coordinate system such that the greatest variance by any projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on • Theoretically the optimal linear scheme, in terms of least mean square error, for compressing a set of high dimensional vectors into a set of lower dimensional vectors and then reconstructing the original set
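The procedure described above (mean centering, covariance matrix, eigenvalue decomposition, projection) can be sketched with NumPy as follows. The function name `pca` and its return values are assumptions for illustration; `np.linalg.eigh` is used because the covariance matrix is symmetric.

```python
import numpy as np


def pca(X, n_components):
    """Project rows of X onto the top principal components.

    Returns (scores, components, variances): the projected data, the
    principal directions as columns, and the variance along each one.
    """
    X = np.asarray(X, dtype=float)
    # Mean-center each attribute (column).
    X_centered = X - X.mean(axis=0)
    # Covariance matrix of the attributes.
    cov = np.cov(X_centered, rowvar=False)
    # eigh returns eigenvalues in ascending order for symmetric matrices.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Keep the directions with the largest variance first.
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]
    return X_centered @ components, components, eigvals[order]
```

The first column of `components` is the first principal component, and the matching entry of `variances` is the variance of the data projected onto it.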
  100. 100. PCA
  101. 101. PCA
  102. 102. Fisher Linear Discriminant • A classification method that projects high-dimensional data onto a line and performs classification in this one-dimensional space • The projection maximizes the distance between the means of the two classes while minimizing the variance within each class
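For two classes, the projection direction that satisfies this criterion has the well-known closed form w ∝ S_W^{-1}(m1 − m0), where S_W is the within-class scatter matrix and m0, m1 are the class means. The NumPy sketch below computes it; the function name and the unit-length normalization are assumptions made for the example, and S_W is assumed to be invertible.

```python
import numpy as np


def fisher_direction(X0, X1):
    """Unit vector w maximizing between-class separation relative to
    within-class scatter for two classes given as row-wise samples."""
    X0 = np.asarray(X0, dtype=float)
    X1 = np.asarray(X1, dtype=float)
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    # Within-class scatter: sum of the two per-class scatter matrices.
    Sw = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)
    # Solve Sw w = (m1 - m0) rather than forming the inverse explicitly.
    w = np.linalg.solve(Sw, m1 - m0)
    return w / np.linalg.norm(w)
```

Projecting samples with `X @ w` gives the one-dimensional values on which a threshold can then separate the two classes.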
  103. 103. Fisher Linear Discriminant
  104. 104. Linear Discriminant Analysis - LDA • A generalization of Fisher's linear discriminant • A method used to find a linear combination of features that characterizes or separates two or more classes of objects or events • The resulting combination may be used as a linear classifier or more commonly for dimensionality reduction before later classification
  105. 105. Difference
  106. 106. Kernel PCA • Classic PCA approach is a linear projection technique that works well if the data is linearly separable • In the case of linearly inseparable data, a nonlinear technique is required if the task is to reduce the dimensionality of a dataset
  107. 107. Kernel PCA • The basic idea for dealing with linearly inseparable data is to project it onto a higher dimensional space where it becomes linearly separable • Consider a nonlinear mapping function ϕ, so that the mapping of a sample x can be written as x → ϕ(x) • The term kernel describes a function κ that calculates the dot product of the images of two samples under ϕ • κ(xi, xj) = ϕ(xi)^T ϕ(xj) • Function ϕ maps the original d-dimensional features into a larger k-dimensional feature space by creating nonlinear combinations of the original features
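To make the relation κ(xi, xj) = ϕ(xi)^T ϕ(xj) concrete, the sketch below uses one standard example that is not taken from the slides: the homogeneous degree-2 polynomial kernel on 2-dimensional inputs, whose explicit feature map is ϕ(x) = (x1², √2·x1·x2, x2²). The function names are illustrative.

```python
import math


def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel (d=2, k=3)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


def dot(a, b):
    """Ordinary dot product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))


def poly2_kernel(x, y):
    """Kernel value computed in the original space: (x . y)^2."""
    return dot(x, y) ** 2
```

Both routes give the same number, but the kernel never builds the higher-dimensional vectors, which is what makes kernel methods practical when ϕ maps into very large spaces.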
  108. 108. Kernel PCA
  109. 109. Singular Value Decomposition - SVD • A mechanism to break a matrix into simpler meaningful pieces • Used to detect groupings in data • A factorization of a real or complex matrix • A general rectangular M-by-N matrix A (with M ≥ N) has an SVD into the product of an M-by-N matrix U with orthonormal columns, an N-by-N diagonal matrix of singular values S and the transpose of an N-by-N orthogonal matrix V – A = U S V^T
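The factorization A = U S V^T can be checked numerically with NumPy's `np.linalg.svd`. In the sketch below, `full_matrices=False` requests the reduced form matching the shapes on the slide, and the helper name `rank_k_approximation` is an assumption introduced to show how the "simpler pieces" can be recombined.

```python
import numpy as np


def thin_svd(A):
    """Return U (M-by-N), singular values s (length N), and V^T (N-by-N)
    for an M-by-N matrix with M >= N."""
    return np.linalg.svd(np.asarray(A, dtype=float), full_matrices=False)


def rank_k_approximation(A, k):
    """Rebuild A from only its k largest singular values and vectors."""
    U, s, Vt = thin_svd(A)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
```

NumPy returns the singular values as a 1-D array in descending order, so `np.diag(s)` is needed to form the diagonal matrix S.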
  110. 110. SVD
  111. 111. Hidden Markov Models - HMM • A statistical Markov model in which the system being modelled is assumed to be a Markov process with unobserved (hidden) states • Used in pattern recognition, such as handwriting and speech analysis • In simpler Markov models like a Markov chain, the state is directly visible to the observer, and therefore the state transition probabilities are the only parameters • In HMM, the state is not directly visible, but output, dependent on the state, is visible
  112. 112. Hidden Markov Models - HMM • Each state has a probability distribution over the possible output tokens • Therefore the sequence of tokens generated by an HMM gives some information about the sequence of states • Adjective hidden refers to the state sequence through which the model passes, not to the parameters of the model • The model is still referred to as a 'hidden' Markov model even if these parameters are known exactly
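A short generative sketch may help separate the hidden state sequence from the visible token sequence. Everything below is an illustrative assumption: the function name, the representation of the model as plain lists of probabilities (initial distribution, transition rows, emission rows), and the fixed random seed.

```python
import random


def sample_hmm(start_probs, trans_probs, emit_probs, length, seed=0):
    """Generate (states, tokens) of the given length from an HMM.

    start_probs[i]    : probability of starting in state i
    trans_probs[i][j] : probability of moving from state i to state j
    emit_probs[i][t]  : probability that state i outputs token t
    """
    rng = random.Random(seed)
    states, tokens = [], []
    state = rng.choices(range(len(start_probs)), weights=start_probs)[0]
    for _ in range(length):
        states.append(state)
        # The observer sees only this token, never the state itself.
        row = emit_probs[state]
        tokens.append(rng.choices(range(len(row)), weights=row)[0])
        # Markov property: the next state depends only on the current one.
        row = trans_probs[state]
        state = rng.choices(range(len(row)), weights=row)[0]
    return states, tokens
```

In a real application only `tokens` would be available, and the task would be to infer something about `states` from it.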
  113. 113. HMM
  114. 114. MODEL EVALUATION
  115. 115. Model Evaluation • How accurate is the classifier? • When the classifier is wrong, how is it wrong? • Decide on which classifier (which parameters) to use and to estimate what the performance of the system will be
  116. 116. Testing Set • Split the available data into a training set and a test set • Train the classifier on the training set and evaluate it on the test set
  117. 117. Classifier Accuracy • The accuracy of a classifier on a given test set is the percentage of test set tuples that are correctly classified by the classifier • Often also referred to as recognition rate • Error rate (or misclassification rate) is the complement of accuracy: error rate = 1 - accuracy
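Accuracy and error rate as defined above take only a few lines to compute. The sketch below returns fractions between 0 and 1 rather than percentages; the function names and the list-based inputs are assumptions for the example.

```python
def accuracy(y_true, y_pred):
    """Fraction of test tuples whose predicted label equals the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def error_rate(y_true, y_pred):
    """Misclassification rate, the complement of accuracy."""
    return 1.0 - accuracy(y_true, y_pred)
```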
  118. 118. False Positive vs. Negative • When is the model wrong? – False positives vs. false negatives – Related to type I and type II errors in statistics • Often there is a different cost associated with false positives and false negatives – Diagnosing diseases
  119. 119. Confusion Matrix • Mechanism to illustrate how a model is performing in terms of false positives and false negatives • Provides more information than a single accuracy figure • Allows thinking about the cost of mistakes • Extendable to any number of classes
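A confusion matrix for any number of classes can be built by counting (true label, predicted label) pairs. In this sketch the layout is an assumption: rows are true classes and columns are predicted classes, in the order given by `labels`.

```python
def confusion_matrix(y_true, y_pred, labels):
    """matrix[i][j] = number of samples with true class labels[i]
    that were predicted as class labels[j]."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix
```

With `labels = ["pos", "neg"]`, the off-diagonal cells are the false negatives (`matrix[0][1]`) and the false positives (`matrix[1][0]`), which is where different misclassification costs can be applied.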
  120. 120. Confusion Matrix
  121. 121. More Accuracy Measures
  122. 122. Area Under ROC Curve - AUC • ROC curves can be used to compare models • The bigger the AUC, the more accurate the model • ROC index is the area under the ROC curve
  123. 123. Gini-Statistic • Calculated from the ROC Curve – Gini = 2 * AUC - 1 • Where the AUC is the area under the ROC curve
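The Gini formula can be tried out with a small AUC routine. The sketch below relies on a standard equivalence rather than on anything stated in the slides: the AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. Labels are assumed to be 1 for positive and 0 for negative.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def gini(auc_value):
    """Gini statistic derived from the AUC: Gini = 2 * AUC - 1."""
    return 2.0 * auc_value - 1.0
```

A model that ranks at random has an AUC of 0.5 and therefore a Gini of 0, while a perfect ranking gives 1 for both.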
  124. 124. K-Fold Cross Validation • Divide the entire data set into k folds • For each of the k experiments, hold out a different fold for testing and use everything else for training
  125. 125. K-Fold Cross Validation • The accuracy of the system is calculated as the average accuracy across the k folds • The main advantages of k-fold cross validation are that every example is used in testing at some stage and the problem of an unfortunate split is avoided • Any value can be used for k – 10 is most common – Depends on the data set
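The splitting and averaging described on the last two slides can be sketched as below. The names are illustrative, folds are taken as contiguous index ranges without shuffling, and `evaluate` is a hypothetical callback that trains on the given training indices and returns the accuracy on the test indices.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [
        n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)
    ]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def cross_validate(n_samples, k, evaluate):
    """Average the accuracy returned by evaluate(train, test) over k folds."""
    results = [evaluate(train, test) for train, test in k_fold_indices(n_samples, k)]
    return sum(results) / len(results)
```

In practice the data is usually shuffled before splitting so that each fold is representative of the whole set.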
  127. 127. Thank You • Check out my LinkedIn profile at https://in.linkedin.com/in/girishkhanzode
