MACHINE LEARNING: CLUSTERING
WHAT’S ON THE MENU - RECOMMENDATIONS
1. Why so popular
2. Supervised vs Unsupervised Learning
3. Topic2
4. Topic3
5. Topic4
6. Wrap-up
MACHINE LEARNING
http://videolectures.net/Top/Computer_Science/Machine_Learning/
WHY IS MACHINE LEARNING (CS 229) THE MOST
POPULAR COURSE AT STANFORD? - ANDREW NG
WHAT CAN YOU TELL ME ABOUT X?
Supervised vs unsupervised learning
Supervised: given an object with an observed set of features X1, …, Xn
and a response Y, the goal is to predict Y from X1, …, Xn.
Typical methods: regression and classification.
Unsupervised: given an object with an observed set of features X1, …, Xn
and no response, the goal is to discover relationships or groups among
variables or observations. Clustering algorithms try to find natural
groupings in the data, i.e. sets of similar observations.
Typical methods: principal component analysis (PCA), expectation
maximization (EM), and clustering (k-means and its variations).
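The contrast can be sketched on the same toy data with scikit-learn (assumed installed; the data points are made up for illustration): the supervised model needs the labels y, while the clustering model never sees them.

```python
# Supervised vs unsupervised on the same toy data (scikit-learn assumed available).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.1], [0.9, 1.0], [4.0, 4.2], [4.1, 3.9]])
y = np.array([0, 0, 1, 1])  # labels exist only in the supervised setting

# Supervised: learn to predict y from X
clf = LogisticRegression().fit(X, y)
print(clf.predict([[1.0, 1.0]]))  # label for a new observation

# Unsupervised: discover groups in X without any y
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)  # cluster assignment for each observation
```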
APPLICATIONS
Market segmentation: given market research results, how can you find the best
customer segments?
Anomaly detection: find fraud, detect network attacks, or discover problems in
servers or other sensor-equipped machinery. It is important to be able to find
new types of anomalies that have never been seen before.
Healthcare: assigning accident-prone areas to hospitals, gene clustering
GROUPING UNLABELED ITEMS USING K-MEANS
CLUSTERING
SWOT
Strengths:
Easy to implement
Will always converge
Scales well
Weaknesses:
Can converge at a local minimum
Slow on very large datasets
Sensitive to choosing the wrong k
SIMILARITY
There are several ways of measuring similarity between observations:
Manhattan distance
Euclidean distance
Cosine distance
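The three measures can be sketched in a few lines of NumPy (illustrative only; the example vectors are made up):

```python
# Minimal sketch of the three distance measures, using NumPy only.
import numpy as np

def manhattan(a, b):
    # Sum of absolute coordinate differences (L1 norm)
    return np.sum(np.abs(a - b))

def euclidean(a, b):
    # Straight-line distance (L2 norm)
    return np.sqrt(np.sum((a - b) ** 2))

def cosine_distance(a, b):
    # 1 - cosine similarity; depends on the angle, not the magnitude
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
print(manhattan(a, b))        # 2.0
print(euclidean(a, b))        # ~1.414
print(cosine_distance(a, b))  # 1.0 (orthogonal vectors)
```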
K-MEANS PSEUDO CODE
Randomly create k points as starting centroids
While any point has changed cluster assignment:
    Cluster assignment step:
        For every point
            Calculate the distance from the point to each centroid
            Assign the point to the cluster with the lowest distance
    Move centroid step:
        For every cluster
            Calculate the mean of the points in that cluster
            Move the centroid to that mean
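The pseudo code above can be sketched directly in NumPy; this is an illustrative, unoptimized version (the toy dataset is made up):

```python
# Direct NumPy sketch of the k-means pseudo code (illustrative, not optimized).
import numpy as np

def kmeans(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # Randomly pick k distinct points from X as starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    assignments = np.full(len(X), -1)  # -1 means "not yet assigned"
    while True:
        # Cluster assignment step: distance from every point to every centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_assignments = dists.argmin(axis=1)
        if np.array_equal(new_assignments, assignments):
            break  # no point changed cluster assignment: converged
        assignments = new_assignments
        # Move centroid step: move each centroid to the mean of its points
        for j in range(k):
            if np.any(assignments == j):
                centroids[j] = X[assignments == j].mean(axis=0)
    return centroids, assignments

X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
centroids, labels = kmeans(X, k=2)
print(labels)  # points 0,1 share one cluster; points 2,3 the other
```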
COST FUNCTION & RANDOM INITIALIZATION
for i = 1 to 100 {
    randomly initialize k-means
    run k-means to get assignments c(1 to m) and centroids µ(1 to K)
    compute the cost function J(c(1 to m), µ(1 to K))
}
Pick the clustering that gave the lowest J(c(1 to m), µ(1 to K))
Cluster assignment step: minimize J with respect to c(1 to m) while holding µ(1 to K) fixed
Move centroid step: minimize J with respect to µ(1 to K) while holding c(1 to m) fixed
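The restart loop above can be sketched with scikit-learn, where `inertia_` plays the role of the cost J (the toy data is made up; in practice `KMeans(n_init=100)` performs this same loop internally):

```python
# Sketch of the random-restart loop; KMeans.inertia_ is the cost J.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[0.0, 0.0], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9],
              [0.1, 5.0], [0.0, 5.1]])

best = None
for i in range(100):
    # each run starts from one fresh random initialization (n_init=1)
    km = KMeans(n_clusters=3, n_init=1, random_state=i).fit(X)
    if best is None or km.inertia_ < best.inertia_:
        best = km  # keep the clustering with the lowest cost J

print(best.inertia_)
```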
PERFORMANCE CONSIDERATIONS
K-means has a computational complexity of O(iKnm), where
i is the number of iterations,
K is the number of clusters,
n is the number of observations,
m is the number of features.
Improvements:
• Reducing the average number of iterations.
• Parallel implementations of K-means, e.g. on Hadoop or Spark.
• Reducing the number of outliers and noisy features by filtering with a smoothing algorithm.
• Decreasing the dimensionality of the model.
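The last improvement can be sketched with a scikit-learn pipeline: running PCA before K-means shrinks the m factor in O(iKnm). The data here is random noise, purely to show the shapes involved:

```python
# Hedged sketch: reduce dimensionality with PCA before clustering.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))  # 200 observations, 50 features

# PCA maps 50 features down to 5 before K-means sees the data
pipe = make_pipeline(PCA(n_components=5),
                     KMeans(n_clusters=4, n_init=10, random_state=0))
labels = pipe.fit_predict(X)
print(labels.shape)  # one cluster label per observation
```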
FRAMEWORKS
Java: Weka, Mahout, Spark
Python: scikit-learn, PySpark, Pylearn2 (Theano)
C++: Shogun
.NET: Encog
https://github.com/josephmisiti/awesome-machine-learning
PLATFORMS - IBM BLUEMIX
PLATFORMS – MICROSOFT AZURE ML
REFERENCES
http://www.dataschool.io/15-hours-of-expert-machine-learning-videos/
http://www-bcf.usc.edu/~gareth/ISL/
BOOKS
