
The amount of digital data has grown exponentially in recent years and, with the development of new technologies, is growing more rapidly than ever before.
However, while it is easy to appreciate that all these data are out there, using them to turn a profit is not trivial.
Data mining techniques able to extract profitable insights are the next frontier of innovation, competition and profit.
To scale well and grow its profit exponentially, a data analytics service provider has to deal with scalability, multi-tenancy and self-adaptability.
In big data applications, machine learning is a very powerful instrument, but a bad choice of algorithm or of its configuration parameters can easily lead to poor results. The key problem is automating the tuning process without a priori knowledge of the data and without human intervention.
In this research project we implemented and analysed TunUp: A Distributed Cloud-based Genetic Evolutionary Tuning for Data Clustering.
The proposed solution automatically evaluates and tunes data clustering algorithms, so that big data services can self-adapt and scale in a cost-efficient manner.
For our experiments, we used k-means as the clustering algorithm: a simple but popular algorithm, widely used in many data mining applications.
Clustering outputs are evaluated using four internal techniques (AIC, Dunn, Davies-Bouldin and Silhouette) and one external technique (Adjusted Rand).
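As an illustration of the external evaluation named above, the following is a minimal sketch of the Adjusted Rand index computed from pair counts. It is a standard textbook formulation, not necessarily the implementation used in TunUp; the function name `adjusted_rand_index` is an assumption made for this example.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same points."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("labelings must have the same length")
    n = len(labels_true)

    # Contingency counts: how many points share each (true, predicted) pair.
    pair_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)

    # Number of point pairs placed together in both, in true, and in predicted.
    index = sum(comb(c, 2) for c in pair_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())

    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two points: the labelings trivially agree

    expected = sum_true * sum_pred / total_pairs  # agreement expected by chance
    maximum = (sum_true + sum_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both labelings
    return (index - expected) / (maximum - expected)
```

The score is 1 when the two partitions are identical up to a renaming of the clusters, and close to 0 (possibly negative) when the agreement is no better than chance.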
We then performed a correlation t-test to validate and benchmark our internal techniques against Adjusted Rand.
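A correlation t-test of this kind can be sketched as follows, assuming a Pearson correlation between the scores of one internal index and the Adjusted Rand scores over the same set of clusterings. The abstract does not state which correlation coefficient was used, so Pearson is an assumption here, and the function names are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / (norm_x * norm_y)


def correlation_t_statistic(r, n):
    """t statistic for H0: no correlation, with n - 2 degrees of freedom."""
    if n < 3:
        raise ValueError("need at least 3 paired observations")
    if abs(r) >= 1:
        return math.copysign(math.inf, r)  # perfect correlation
    return r * math.sqrt((n - 2) / (1 - r * r))
```

The resulting statistic is compared with the critical value of Student's t distribution with n - 2 degrees of freedom; an internal index whose scores correlate significantly with Adjusted Rand is a reasonable proxy when no ground-truth labels are available.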
Once the best evaluation criterion has been defined, the main challenge of k-means is setting the right value of k, which represents the number of clusters, and choosing the distance measure used to compute the distance between each pair of points in the data space.
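To see why the distance measure matters as much as k, consider the assignment step of k-means with a pluggable distance. This is a minimal sketch; the two measures shown (Euclidean and Manhattan) are illustrative assumptions, since the abstract does not list the measures that TunUp supports, and `nearest_centroid` is a hypothetical helper.

```python
import math


def euclidean(p, q):
    """Straight-line (L2) distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p, q):
    """City-block (L1) distance between two points."""
    return sum(abs(a - b) for a, b in zip(p, q))


# Illustrative set of tunable distance measures (an assumption).
DISTANCES = {"euclidean": euclidean, "manhattan": manhattan}


def nearest_centroid(point, centroids, distance_name):
    """Index of the centroid closest to `point` under the chosen measure."""
    distance = DISTANCES[distance_name]
    return min(range(len(centroids)), key=lambda i: distance(point, centroids[i]))
```

For the point (0, 0) and the centroids (3, 3) and (5, 0), the Euclidean distance assigns the point to the first centroid while the Manhattan distance assigns it to the second, so the same k can yield different clusters.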
To address this problem, we propose an implementation of a genetic evolutionary algorithm that heuristically searches for an optimal configuration of our clustering algorithm.
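The idea can be sketched as a small genetic algorithm over configurations made of a value of k and a distance name. Everything specific here is an assumption for illustration: the gene values, the operators (elitist selection, uniform crossover, small mutations), the default parameters and the `toy_fitness` placeholder. In a real system the fitness would be the score that an internal evaluation index gives to the clustering produced by the configuration.

```python
import random

DISTANCES = ["euclidean", "manhattan", "chebyshev"]  # hypothetical gene values


def random_config(rng, k_min, k_max):
    """A random chromosome: (number of clusters, distance measure)."""
    return (rng.randint(k_min, k_max), rng.choice(DISTANCES))


def crossover(a, b, rng):
    """Uniform crossover: each gene is taken from one of the two parents."""
    return (rng.choice([a[0], b[0]]), rng.choice([a[1], b[1]]))


def mutate(config, rng, k_min, k_max, rate):
    """With probability `rate` per gene, nudge k by one or redraw the distance."""
    k, dist = config
    if rng.random() < rate:
        k = min(k_max, max(k_min, k + rng.choice([-1, 1])))
    if rng.random() < rate:
        dist = rng.choice(DISTANCES)
    return (k, dist)


def evolve(fitness, k_min=2, k_max=20, pop_size=10, generations=20,
           rate=0.2, seed=0):
    """Return the best configuration found and the best fitness per generation."""
    rng = random.Random(seed)
    population = [random_config(rng, k_min, k_max) for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        history.append(fitness(ranked[0]))
        elite = ranked[: pop_size // 2]  # the fitter half survives unchanged
        children = [
            mutate(crossover(rng.choice(elite), rng.choice(elite), rng),
                   rng, k_min, k_max, rate)
            for _ in range(pop_size - len(elite))
        ]
        population = elite + children
    return max(population, key=fitness), history


def toy_fitness(config):
    """Placeholder fitness peaking at k = 5 with the Euclidean distance."""
    k, dist = config
    return -abs(k - 5) + (1 if dist == "euclidean" else 0)
```

Calling `evolve(toy_fitness)` returns a `(k, distance)` pair together with the best fitness seen in each generation. Because the fitter half of each generation is carried over unchanged, that best fitness never decreases, although, as with any heuristic search, reaching the global optimum is not guaranteed.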
To improve performance, we implemented a parallel version of the genetic algorithm by developing a REST API and deploying several instances on the Amazon Elastic Compute Cloud (EC2) infrastructure.
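The fitness evaluations of one generation are independent of each other, which is what makes the genetic algorithm easy to parallelise. The sketch below shows the pattern with a local thread pool standing in for the remote instances; it is not the REST/EC2 deployment described above, and `evaluate_population` is a hypothetical name. In a distributed setting each call to `fitness` would instead be a request to a worker that clusters the data with the given configuration and returns its score.

```python
from concurrent.futures import ThreadPoolExecutor


def evaluate_population(population, fitness, max_workers=4):
    """Score every configuration concurrently, preserving the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in the order of the inputs,
        # regardless of which worker finishes first.
        return list(pool.map(fitness, population))
```

Since the scores come back in the same order as the configurations, they can be zipped with the population and ranked exactly as in a sequential run.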
In conclusion, this research contributes TunUp, an open solution for the evaluation, validation and tuning of data clustering algorithms, with a particular focus on cloud services.
Our experiments show the quality and efficiency of tuning k-means on a set of public datasets.
The research also provides a roadmap that indicates how the current system should be extended and used for future clustering applications, such as tuning existing clustering algorithms, supporting the design of new algorithms, and evaluating and comparing different algorithms.