TunUp final presentation

The amount of digital data has grown exponentially in recent years and, with the development of new technologies, is growing more rapidly than ever before.
Nevertheless, while it is easy to see that all these data are out there, utilizing them to turn a profit is not trivial.
Data mining techniques able to extract profitable insights are the next frontier of innovation, competition and profit.

In order to scale well and grow its profit, a provider of data analytics services has to deal with scalability, multi-tenancy and self-adaptability.
In big data applications, machine learning is a very powerful instrument, but a bad choice of algorithm or of its configuration parameters can easily lead to poor results. The key problem is automating the tuning process without a priori knowledge of the data and without human intervention.

In this research project we implemented and analysed TunUp: A Distributed Cloud-based Genetic Evolutionary Tuning for Data Clustering.
The proposed solution automatically evaluates and tunes data clustering algorithms, so that big data services can self-adapt and scale in a cost-efficient manner.

For our experiments, we considered k-means, a simple but popular clustering algorithm widely used in many data mining applications.
Clustering outputs are evaluated using four internal techniques (AIC, Dunn, Davies-Bouldin and Silhouette) and one external technique (AdjustedRand).
We then performed a correlation t-test in order to validate and benchmark the internal techniques against AdjustedRand.
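
To make this validation step concrete, below is a minimal, self-contained sketch of a Pearson correlation check together with its t-statistic. The class name and the score arrays are illustrative placeholders, not code or values from TunUp.

    import java.util.Arrays;

    /** Illustrative check of how well an internal criterion tracks AdjustedRand. */
    public class CriterionValidation {

        /** Pearson correlation coefficient between two equally sized samples. */
        static double pearson(double[] x, double[] y) {
            int n = x.length;
            double mx = Arrays.stream(x).average().orElse(0);
            double my = Arrays.stream(y).average().orElse(0);
            double cov = 0, vx = 0, vy = 0;
            for (int i = 0; i < n; i++) {
                cov += (x[i] - mx) * (y[i] - my);
                vx  += (x[i] - mx) * (x[i] - mx);
                vy  += (y[i] - my) * (y[i] - my);
            }
            return cov / Math.sqrt(vx * vy);
        }

        /** t-statistic for H0 "true correlation is zero", with n-2 degrees of freedom. */
        static double tStatistic(double r, int n) {
            return r * Math.sqrt((n - 2) / (1 - r * r));
        }

        public static void main(String[] args) {
            // Hypothetical scores of one internal criterion and of AdjustedRand
            // over the same random k-means configurations.
            double[] internal = {0.30, 0.55, 0.42, 0.75, 0.63};
            double[] aRand    = {0.28, 0.51, 0.40, 0.80, 0.60};
            double r = pearson(internal, aRand);
            System.out.printf("r = %.3f, t = %.3f%n", r, tStatistic(r, internal.length));
        }
    }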

Having defined the best evaluation criterion, the main challenge of k-means is setting the right value of k, the number of clusters, and choosing the distance measure used to compute the distance between each pair of points in the data space.
To address this problem we propose an implementation of a genetic evolutionary algorithm that heuristically finds an optimal configuration of the clustering algorithm.
To improve performance, we implemented a parallel version of the genetic algorithm, developing a REST API and deploying several instances on the Amazon Elastic Compute Cloud (EC2) infrastructure.
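
As an illustration of the tuning loop (not TunUp's actual implementation, which is built on Watchmaker), here is a minimal generational sketch over the (k, distance measure) configuration space. It uses truncation selection and a toy fitness landscape for brevity, whereas the real system uses elitism with roulette-wheel selection and clustering-quality scores such as AIC.

    import java.util.*;

    /** Minimal sketch of GA-style tuning of (k, distanceMeasure); the fitness
     *  function is a stand-in for a real clustering-quality score such as AIC. */
    public class GaTuningSketch {
        record Config(int k, int distance) {}

        // Toy landscape: in TunUp this would run k-means and score the result.
        static double fitness(Config c) {
            return -Math.abs(c.k() - 36) - (c.distance() == 4 ? 0 : 1);
        }

        public static void main(String[] args) {
            Random rnd = new Random(42);
            int popSize = 6, generations = 20;
            List<Config> pop = new ArrayList<>();
            for (int i = 0; i < popSize; i++)
                pop.add(new Config(2 + rnd.nextInt(39), rnd.nextInt(10)));

            for (int g = 0; g < generations; g++) {
                pop.sort(Comparator.comparingDouble(GaTuningSketch::fitness).reversed());
                List<Config> next = new ArrayList<>(List.of(pop.get(0)));  // elitism = 1
                while (next.size() < popSize) {
                    Config a = pop.get(rnd.nextInt(popSize / 2));  // truncation pick
                    Config b = pop.get(rnd.nextInt(popSize / 2));
                    Config child = new Config(a.k(), b.distance()); // one-point crossover
                    if (rnd.nextDouble() < 0.5)                     // mutate the k gene
                        child = new Config(2 + rnd.nextInt(39), child.distance());
                    next.add(child);
                }
                pop = next;
            }
            pop.sort(Comparator.comparingDouble(GaTuningSketch::fitness).reversed());
            System.out.println("Best config: " + pop.get(0));
        }
    }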

In conclusion, with this research we contributed by building and analysing TunUp, an open solution for the evaluation, validation and tuning of data clustering algorithms, with a particular focus on cloud services.
Our experiments show the quality and efficiency of tuning k-means on a set of public datasets.

The research also provides a roadmap indicating how the current system should be extended and utilized for future clustering applications, such as: tuning of existing clustering algorithms, supporting the design of new algorithms, and evaluating and comparing different algorithms.


TunUp final presentation

  1. TunUp: A Distributed Cloud-based Genetic Evolutionary Tuning for Data Clustering. Gianmario Spacagna, gm.spacagna@gmail.com, March 2013. AgilOne, Inc., 1091 N Shoreline Blvd. #250, Mountain View, CA 94043
  2. Agenda: 1. Introduction; 2. Problem description; 3. TunUp; 4. K-means; 5. Clustering evaluation; 6. Full space tuning; 7. Genetic algorithm tuning; 8. Conclusions
  3. Big Data
  4. Business Intelligence: Why? Where? What? How? Insights on customers, products and companies. Can someone else know your customer better than you? Do you have the domain knowledge and proper computation infrastructure?
  5. Big Data as a Service (BDaaS)
  6. Problem Description (chart: income, cost, customers)
  7. Tuning of Clustering Algorithms. We need tuning when: ➢ a new algorithm or version is released ➢ we want to improve accuracy and/or performance ➢ a new customer comes and the system must be adapted to the new dataset and requirements
  8. TunUp: a Java framework integrating JavaML and Watchmaker. Main features: ➢ data manipulation (loading, labelling and normalization) ➢ clustering algorithms (k-means) ➢ clustering evaluation (AIC, Dunn, Davies-Bouldin, Silhouette, aRand) ➢ evaluation technique validation (Pearson correlation t-test) ➢ full search space tuning ➢ genetic algorithm tuning (local and parallel implementations) ➢ RESTful API for web service deployment (Tomcat on Amazon EC2). Open source: http://github.com/gm-spacagna/tunup
  9. k-means: a geometric hard-assignment clustering algorithm. It partitions n data points into k clusters in which each point belongs to the cluster with the nearest mean centroid. If we have k clusters S = {S1,...,Sk}, where xj is the jth point in a given cluster and μi is the centroid of cluster Si, the goal of k-means is minimizing the Within-Cluster Sum of Squares: WCSS = Σ_{i=1..k} Σ_{xj ∈ Si} ‖xj − μi‖². Algorithm: 1. Initialization: a set of k random centroids is generated. 2. Assignment: each point is assigned to the closest centroid. 3. Update: the new centroids are calculated as the means of the new clusters. 4. Go to 2 until convergence (the centroids are stable and do not change).
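
As a concrete reference, here is a compact, illustrative Java implementation of this initialization/assignment/update loop for 2-D points with the Euclidean distance; the class and method names are ours, not TunUp's.

    import java.util.*;

    /** Compact k-means (Lloyd's algorithm) for 2-D points, mirroring the
     *  initialization/assignment/update loop described above. */
    public class KMeansSketch {
        static double dist2(double[] a, double[] b) {
            double dx = a[0] - b[0], dy = a[1] - b[1];
            return dx * dx + dy * dy;
        }

        static int[] kmeans(double[][] points, int k, Random rnd) {
            // 1. Initialization: k random data points become the initial centroids.
            double[][] centroids = new double[k][];
            for (int i = 0; i < k; i++)
                centroids[i] = points[rnd.nextInt(points.length)].clone();
            int[] assign = new int[points.length];
            boolean changed = true;
            while (changed) {
                changed = false;
                // 2. Assignment: each point goes to its nearest centroid.
                for (int p = 0; p < points.length; p++) {
                    int best = 0;
                    for (int c = 1; c < k; c++)
                        if (dist2(points[p], centroids[c]) < dist2(points[p], centroids[best]))
                            best = c;
                    if (assign[p] != best) { assign[p] = best; changed = true; }
                }
                // 3. Update: centroids move to the mean of their clusters.
                double[][] sum = new double[k][2];
                int[] count = new int[k];
                for (int p = 0; p < points.length; p++) {
                    sum[assign[p]][0] += points[p][0];
                    sum[assign[p]][1] += points[p][1];
                    count[assign[p]]++;
                }
                for (int c = 0; c < k; c++)
                    if (count[c] > 0)
                        centroids[c] = new double[] { sum[c][0] / count[c], sum[c][1] / count[c] };
            }   // 4. Repeat until the assignments are stable.
            return assign;
        }

        public static void main(String[] args) {
            double[][] pts = { {0, 0}, {0, 1}, {5, 5}, {5, 6} };
            System.out.println(Arrays.toString(kmeans(pts, 2, new Random(7))));
        }
    }
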
  10. k-means tuning. Input parameters required: 1. k = (2,...,40); 2. distance measure, one of: Angular, Chebyshev, Cosine, Euclidean, Jaccard Index, Manhattan, Pearson Correlation Coefficient, Radial Basis Function Kernel, Spearman Footrule; 3. max iterations = 20 (fixed). Different input parameters, very different outcomes!
  11. Clustering Evaluation. Definition of cluster: "a group of the same or similar elements gathered or occurring closely together". How do we evaluate whether a set of clusters is good or not? "Clustering is in the eye of the beholder" [E. Castro, 2002]. Two main categories: ➢ internal criteria, based only on the clustered data itself ➢ external criteria, based on benchmarks of pre-classified items
  12. Internal Evaluation. The common goal is assigning better scores when there is high intra-cluster similarity and low inter-cluster similarity. The choice of evaluation technique depends on the nature of the data and on the cluster model of the algorithm. Cluster models: ➢ distance-based (k-means) ➢ density-based (DBSCAN) ➢ distribution-based (EM-clustering) ➢ connectivity-based (linkage clustering)
  13. Proposed techniques. AIC: a measure of the relative amount of information lost by a statistical model; the clustering is modelled as a Gaussian mixture process (inverted fn.). Dunn: the ratio between the minimum inter-cluster distance and the maximum cluster diameter (natural fn.). Davies-Bouldin: the average similarity between each cluster and its most similar one (inverted fn.). Silhouette: a measure of how well each point lies within its cluster; it indicates whether the object is correctly clustered or whether it would fit better in the neighbouring cluster (natural fn.).
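
As one concrete example of these criteria, here is a brute-force, illustrative sketch of the Dunn index (minimum inter-cluster distance over maximum cluster diameter). The names and the toy clusters are ours; a production version would avoid the quadratic pair scan.

    import java.util.*;

    /** Sketch of the Dunn index: minimum inter-cluster distance divided by
     *  the maximum cluster diameter (higher is better, a "natural" fn.). */
    public class DunnIndexSketch {
        static double dist(double[] a, double[] b) {
            double s = 0;
            for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
            return Math.sqrt(s);
        }

        static double dunn(List<List<double[]>> clusters) {
            double minBetween = Double.POSITIVE_INFINITY, maxDiameter = 0;
            for (int i = 0; i < clusters.size(); i++) {
                for (double[] p : clusters.get(i))           // intra-cluster diameter
                    for (double[] q : clusters.get(i))
                        maxDiameter = Math.max(maxDiameter, dist(p, q));
                for (int j = i + 1; j < clusters.size(); j++) // inter-cluster distance
                    for (double[] p : clusters.get(i))
                        for (double[] q : clusters.get(j))
                            minBetween = Math.min(minBetween, dist(p, q));
            }
            return minBetween / maxDiameter;
        }

        public static void main(String[] args) {
            List<List<double[]>> clusters = List.of(
                List.of(new double[]{0, 0}, new double[]{0, 1}),
                List.of(new double[]{5, 5}, new double[]{5, 6}));
            System.out.println("Dunn index: " + dunn(clusters));
        }
    }
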
  14. External criterion: AdjustedRand. Given a set of n elements S = {o1,...,on} and two partitions to compare, X = {X1,...,Xr} and Y = {Y1,...,Ys}: RandIndex = (number of agreements between X and Y) / (total number of possible pair combinations); AdjustedRandIndex = (RandIndex − ExpectedIndex) / (MaxIndex − ExpectedIndex). We can use AdjustedRand as the reference for the best clustering evaluation and use it to validate the internal criteria.
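
For reference, an illustrative pair-counting implementation of the Adjusted Rand Index via the standard contingency-table form of the formula above. It assumes cluster labels are dense integers starting at 0; all names are ours.

    /** Sketch of the Adjusted Rand Index computed from two labelings of the
     *  same n elements, via the usual pair-counting contingency table. */
    public class AdjustedRandSketch {
        static double comb2(long n) { return n * (n - 1) / 2.0; }

        static double adjustedRand(int[] x, int[] y, int rClusters, int sClusters) {
            long[][] table = new long[rClusters][sClusters];
            long[] a = new long[rClusters], b = new long[sClusters];
            for (int i = 0; i < x.length; i++) {
                table[x[i]][y[i]]++;   // joint counts n_ij
                a[x[i]]++;             // row sums
                b[y[i]]++;             // column sums
            }
            double index = 0, sumA = 0, sumB = 0;
            for (long[] row : table) for (long nij : row) index += comb2(nij);
            for (long ai : a) sumA += comb2(ai);
            for (long bj : b) sumB += comb2(bj);
            double expected = sumA * sumB / comb2(x.length);
            double max = (sumA + sumB) / 2.0;
            return (index - expected) / (max - expected);
        }

        public static void main(String[] args) {
            int[] predicted = {0, 0, 1, 1, 2, 2};
            int[] truth     = {0, 0, 1, 1, 1, 2};
            System.out.println(adjustedRand(predicted, truth, 3, 3));
        }
    }
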
  15. Correlation t-test. Pearson correlation over a set of 120 random k-means configuration evaluations. Average correlations: AIC: 0.77; Dunn: 0.49; Davies-Bouldin: 0.51; Silhouette: 0.49
  16. Datasets. D31: 3100 vectors, 2 dimensions, 31 clusters. S1: 5000 vectors, 2 dimensions, 15 clusters. Source: http://cs.joensuu.fi/sipu/datasets/
  17. Initial centroids issue. N. observations = 200. Input configuration: k = 31, distance measure = Euclidean. (Plots: AdjustedRand, AIC.) We can consider the median value!
  18. Full space evaluation. N executions averaged = 20. The global optimum is at k = 36, distance measure = Euclidean.
  19. Genetic Algorithm Tuning. Crossover: [x1,x2,x3,x4,...,xm] and [y1,y2,y3,y4,...,ym] produce [x1,x2,x3,y4,...,ym] and [y1,y2,y3,x4,...,xm]. Selection: elitism + roulette wheel. Mutation: Pr(mutate ki → kj) ∝ 1 / distance(ki, kj); Pr(mutate di → dj) = 1 / (Ndist − 1).
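
A sketch of these two mutation rules is below, assuming k ranges over [2, 40] and Ndist = 10 distance measures as on slide 10; the helper names are illustrative, not TunUp's.

    import java.util.Random;

    /** Sketch of the mutation rules above: the probability of mutating k
     *  towards k' is proportional to 1 / |k - k'|, so nearby values of k are
     *  favoured; the distance-measure gene mutates uniformly. */
    public class MutationSketch {
        static int mutateK(int k, int kMin, int kMax, Random rnd) {
            double[] weight = new double[kMax - kMin + 1];
            double total = 0;
            for (int kj = kMin; kj <= kMax; kj++) {
                weight[kj - kMin] = (kj == k) ? 0 : 1.0 / Math.abs(k - kj);
                total += weight[kj - kMin];
            }
            double roll = rnd.nextDouble() * total;  // roulette over the weights
            for (int kj = kMin; kj <= kMax; kj++) {
                roll -= weight[kj - kMin];
                if (roll <= 0) return kj;
            }
            return kMax;
        }

        static int mutateDistance(int current, int nDist, Random rnd) {
            // Uniform over the other nDist - 1 measures.
            int next = rnd.nextInt(nDist - 1);
            return next >= current ? next + 1 : next;
        }

        public static void main(String[] args) {
            Random rnd = new Random(1);
            System.out.println("k: 20 -> " + mutateK(20, 2, 40, rnd));
            System.out.println("distance: 4 -> " + mutateDistance(4, 10, rnd));
        }
    }
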
  20. Tuning parameters: fitness evaluation: AIC; prob. mutation: 0.5; prob. crossover: 0.9; population size: 6; stagnation limit: 5; elitism: 1; N executions averaged: 10. Relevant results: ➢ the best fitness value is always decreasing ➢ the mean fitness value trend is decreasing ➢ a high standard deviation in the previous population often generates a better mean population in the next one
  21. Results. Test 1: k = 39, distance measure = Manhattan. Test 2: k = 33, distance measure = RBF Kernel. Test 3: k = 36, distance measure = Euclidean. Different results due to: 1. early convergence; 2. random initial centroids.
  22. Parallel GA. Simulation: Amazon Elastic Compute Cloud (EC2), 10 evolutions, POP_SIZE = 5, no elitism, 10 micro instances. Optimal n. of servers = POP_SIZE − ELITISM. E[T single evolution] ≤ …
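
To illustrate the master/slave scheme, here is a sketch in which a local thread pool stands in for the remote EC2 REST workers and a toy function stands in for a remote k-means evaluation; none of this is TunUp's actual API.

    import java.util.*;
    import java.util.concurrent.*;

    /** Sketch of master/slave fitness evaluation: every candidate in the
     *  population is scored concurrently, one worker per candidate. */
    public class ParallelEvaluationSketch {
        public static void main(String[] args) throws Exception {
            List<int[]> population = List.of(          // (k, distance index) pairs
                new int[]{31, 4}, new int[]{33, 8}, new int[]{36, 4},
                new int[]{39, 6}, new int[]{12, 2});

            ExecutorService workers = Executors.newFixedThreadPool(population.size());
            List<Future<Double>> scores = new ArrayList<>();
            for (int[] candidate : population)
                scores.add(workers.submit(() -> evaluate(candidate)));

            for (int i = 0; i < population.size(); i++)
                System.out.printf("candidate %d -> %.3f%n", i, scores.get(i).get());
            workers.shutdown();
        }

        // Stand-in for "run k-means remotely and return its quality score".
        static double evaluate(int[] candidate) {
            return -Math.abs(candidate[0] - 36);
        }
    }
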
  23. Conclusions. We developed, tested and analysed TunUp, an open solution for the evaluation, validation and tuning of data clustering algorithms. Future applications: ➢ tuning of existing algorithms ➢ supporting the design of new algorithms ➢ evaluation and comparison of different algorithms. Limitations: ➢ single distance measure ➢ equal normalization ➢ master/slave parallel execution ➢ random initial centroids
  24. Questions?
  25. Thank you! Tack! Grazie!
