MACHINE LEARNING: CLUSTERING
WHAT’S ON THE MENU - RECOMMENDATIONS
1. Why so popular
2. Supervised vs Unsupervised Learning
3. Topic2
4. Topic3
5. Topic4
6. Wrap-up
MACHINE LEARNING
http://videolectures.net/Top/Computer_Science/Machine_Learning/
WHY IS MACHINE LEARNING (CS 229) THE MOST
POPULAR COURSE AT STANFORD? - ANDREW NG
WHAT CAN YOU TELL ME ABOUT X?
Supervised vs unsupervised learning
Supervised: given an object with an observed set of features X1, …, Xn
and a response Y, the goal is to predict Y from X1, …, Xn.
Typical methods: regression and classification.
Unsupervised: given an object with an observed set of features X1, …, Xn
and no response, the goal is to discover relationships or groups among
variables or observations. Clustering algorithms try to find natural
groupings in the data, i.e. sets of similar observations.
Typical methods: principal component analysis (PCA), expectation
maximization (EM), and clustering (k-means and its variations).
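The contrast can be sketched on the same toy data with scikit-learn (assumed installed; the data points are made up for illustration): the supervised model needs the labels y, while the clustering model never sees them.

```python
# Supervised vs unsupervised on the same toy data (scikit-learn assumed available).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.1], [0.9, 1.0], [4.0, 4.2], [4.1, 3.9]])
y = np.array([0, 0, 1, 1])  # labels exist only in the supervised setting

# Supervised: learn to predict y from X
clf = LogisticRegression().fit(X, y)
print(clf.predict([[1.0, 1.0]]))  # label for a new observation

# Unsupervised: discover groups in X without any y
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)  # cluster assignment for each observation
```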
APPLICATIONS
Market segmentation: given market research results, how can you find the best
customer segments?
Anomaly detection: find fraud, detect network attacks, or discover problems in
servers or other sensor-equipped machinery. It is important to be able to find
new types of anomalies that have never been seen before.
Healthcare: assigning accident-prone areas to hospitals, gene clustering
GROUPING UNLABELED ITEMS USING K-MEANS
CLUSTERING
SWOT
Strengths:
Easy to implement
Will always converge
Scales well
Weaknesses:
Can converge at a local minimum
Slow on very large datasets
Sensitive to choosing the wrong k
SIMILARITY
There are several ways of measuring similarity between observations:
Manhattan distance
Euclidean distance
Cosine distance
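The three measures can be sketched in a few lines of NumPy (illustrative only; the example vectors are made up):

```python
# Minimal sketch of the three distance measures, using NumPy only.
import numpy as np

def manhattan(a, b):
    # Sum of absolute coordinate differences (L1 norm)
    return np.sum(np.abs(a - b))

def euclidean(a, b):
    # Straight-line distance (L2 norm)
    return np.sqrt(np.sum((a - b) ** 2))

def cosine_distance(a, b):
    # 1 - cosine similarity; depends on the angle, not the magnitude
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
print(manhattan(a, b))        # 2.0
print(euclidean(a, b))        # ~1.414
print(cosine_distance(a, b))  # 1.0 (orthogonal vectors)
```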
K-MEANS PSEUDO CODE
Randomly create k points as starting centroids
While any point has changed cluster assignment:
    Cluster assignment step:
        For every point
            Calculate the distance from the point to each centroid
            Assign the point to the cluster with the lowest distance
    Move centroid step:
        For every cluster
            Calculate the mean of the points in that cluster
            Move the centroid to that mean
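The pseudo code above can be sketched directly in NumPy; this is an illustrative, unoptimized version (the toy dataset is made up):

```python
# Direct NumPy sketch of the k-means pseudo code (illustrative, not optimized).
import numpy as np

def kmeans(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # Randomly pick k distinct points from X as starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    assignments = np.full(len(X), -1)  # -1 means "not yet assigned"
    while True:
        # Cluster assignment step: distance from every point to every centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_assignments = dists.argmin(axis=1)
        if np.array_equal(new_assignments, assignments):
            break  # no point changed cluster assignment: converged
        assignments = new_assignments
        # Move centroid step: move each centroid to the mean of its points
        for j in range(k):
            if np.any(assignments == j):
                centroids[j] = X[assignments == j].mean(axis=0)
    return centroids, assignments

X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
centroids, labels = kmeans(X, k=2)
print(labels)  # points 0,1 share one cluster; points 2,3 the other
```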
COST FUNCTION & RANDOM INITIALIZATION
for i = 1 to 100 {
    randomly initialize k-means
    run k-means to get assignments c(1 to m) and centroids µ(1 to K)
    compute the cost function J(c(1 to m), µ(1 to K))
}
Pick the clustering that gave the lowest J(c(1 to m), µ(1 to K))
Cluster assignment step: minimize J with respect to c(1 to m) while holding µ(1 to K) fixed
Move centroid step: minimize J with respect to µ(1 to K) while holding c(1 to m) fixed
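The restart loop above can be sketched with scikit-learn, where `inertia_` plays the role of the cost J (the toy data is made up; in practice `KMeans(n_init=100)` performs this same loop internally):

```python
# Sketch of the random-restart loop; KMeans.inertia_ is the cost J.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[0.0, 0.0], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9],
              [0.1, 5.0], [0.0, 5.1]])

best = None
for i in range(100):
    # each run starts from one fresh random initialization (n_init=1)
    km = KMeans(n_clusters=3, n_init=1, random_state=i).fit(X)
    if best is None or km.inertia_ < best.inertia_:
        best = km  # keep the clustering with the lowest cost J

print(best.inertia_)
```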
PERFORMANCE CONSIDERATIONS
K-means has a computational complexity of O(iKnm), where
i is the number of iterations,
K is the number of clusters,
n is the number of observations,
m is the number of features.
Improvements:
• Reducing the average number of iterations.
• Parallel implementations of K-means, e.g. on Hadoop or Spark.
• Reducing the number of outliers and noisy features by filtering with a smoothing algorithm.
• Decreasing the dimensionality of the model.
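The last improvement can be sketched with a scikit-learn pipeline: running PCA before K-means shrinks the m factor in O(iKnm). The data here is random noise, purely to show the shapes involved:

```python
# Hedged sketch: reduce dimensionality with PCA before clustering.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))  # 200 observations, 50 features

# PCA maps 50 features down to 5 before K-means sees the data
pipe = make_pipeline(PCA(n_components=5),
                     KMeans(n_clusters=4, n_init=10, random_state=0))
labels = pipe.fit_predict(X)
print(labels.shape)  # one cluster label per observation
```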
FRAMEWORKS
Java: Weka, Mahout, Spark
Python: scikit-learn, PySpark, Pylearn2 (Theano)
C++: Shogun
.NET: Encog
https://github.com/josephmisiti/awesome-machine-learning
PLATFORMS - IBM BLUEMIX
PLATFORMS – MICROSOFT AZURE ML
REFERENCES
http://www.dataschool.io/15-hours-of-expert-machine-learning-videos/
http://www-bcf.usc.edu/~gareth/ISL/
BOOKS
