The amount of digital data has grown exponentially in recent years and, with the development of new technologies, is growing more rapidly than ever before.
While it is easy to understand that all these data are out there, using them to turn a profit is not trivial.
Data mining techniques able to extract profitable insight are the next frontier of innovation, competition and profit.
To scale well and grow its profit exponentially, a data analytics services provider has to deal with scalability, multi-tenancy and self-adaptability.
In big data applications, machine learning is a very powerful instrument, but a bad choice of algorithm or of its configuration parameters can easily lead to poor results. The key problem is automating the tuning process without a priori knowledge of the data and without human intervention.
In this research project we implemented and analysed TunUp: A Distributed Cloud-based Genetic Evolutionary Tuning for Data Clustering.
The proposed solution automatically evaluates and tunes data clustering algorithms, so that big data services can self-adapt and scale in a cost-efficient manner.
For our experiments we considered k-means as the clustering algorithm: it is simple but popular, and widely used in many data mining applications.
Clustering outputs are evaluated using four internal techniques (AIC, Dunn, Davies-Bouldin and Silhouette) and one external technique (AdjustedRand).
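As an illustration of what an internal technique measures, the following is a minimal sketch of the classic Dunn index in plain Python: the smallest distance between points of different clusters divided by the largest distance between points of the same cluster. The function name `dunn_index`, the Euclidean distance and the toy points are assumptions made for this example; the abstract does not state which Dunn variant TunUp uses.

```python
import math
from itertools import combinations


def dunn_index(points, labels):
    """Classic Dunn index: min between-cluster distance / max cluster diameter.

    Assumes at least two clusters and at least one cluster containing
    two distinct points (otherwise the diameter is zero).
    """
    # Group the points by their cluster label
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    # Largest distance between two points that share a cluster
    max_diameter = max(
        (math.dist(a, b)
         for members in clusters.values()
         for a, b in combinations(members, 2)),
        default=0.0,
    )

    # Smallest distance between two points from different clusters
    min_separation = min(
        math.dist(a, b)
        for c1, c2 in combinations(clusters.values(), 2)
        for a in c1
        for b in c2
    )
    return min_separation / max_diameter


# Toy data (hypothetical): two tight, well-separated groups
points = [(0, 0), (0, 1), (5, 5), (5, 6)]
good_score = dunn_index(points, [0, 0, 1, 1])  # groups match the geometry
bad_score = dunn_index(points, [0, 1, 0, 1])   # groups cut across the geometry
print(good_score > bad_score)
```

A higher value indicates compact, well-separated clusters, so the partition that follows the geometry of the points scores higher than the one that cuts across it.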
We then performed a correlation t-test to validate and benchmark the internal techniques against AdjustedRand.
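One common way to run such a test is to compute the Pearson correlation r between the scores of an internal technique and the AdjustedRand scores over the same clustering runs, then derive the statistic t = r * sqrt((n - 2) / (1 - r^2)) with n - 2 degrees of freedom. The sketch below assumes this formulation; the function names and the sample scores are invented for illustration and are not results from the study.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_t_statistic(xs, ys):
    """Return (r, t, degrees_of_freedom) for H0: no linear correlation.

    Undefined when |r| == 1, because the denominator becomes zero.
    """
    n = len(xs)
    r = pearson_r(xs, ys)
    t = r * math.sqrt((n - 2) / (1 - r * r))
    return r, t, n - 2


# Made-up scores for five clustering runs (illustrative only)
internal_scores = [0.10, 0.35, 0.42, 0.60, 0.71]
adjusted_rand_scores = [0.05, 0.30, 0.50, 0.55, 0.80]
r, t, dof = correlation_t_statistic(internal_scores, adjusted_rand_scores)
print(f"r={r:.3f}, t={t:.3f}, dof={dof}")
```

The t value is then compared against the critical value of Student's t distribution for the chosen significance level; an internal technique that correlates strongly with AdjustedRand is a reasonable proxy when ground-truth labels are unavailable.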
Once the best evaluation criterion is defined, the main challenge of k-means is setting the right value of k, which represents the number of clusters, and choosing the distance measure used to compute the distance between each pair of points in the data space.
To address this problem we propose an implementation of a Genetic Evolutionary Algorithm that heuristically searches for an optimal configuration of our clustering algorithm.
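The idea can be sketched as a small genetic algorithm whose individuals are (k, distance measure) pairs. Everything below is an assumption made for illustration: the bounds on k, the list of distance measures, the operators, and in particular `toy_fitness`, which stands in for "run k-means with this configuration and return an internal evaluation score". It is not TunUp's actual implementation.

```python
import random

DISTANCES = ["euclidean", "manhattan", "cosine"]  # assumed candidate measures
K_RANGE = (2, 20)                                 # assumed bounds for k


def random_config(rng):
    """Draw a random (k, distance) configuration."""
    return (rng.randint(*K_RANGE), rng.choice(DISTANCES))


def mutate(config, rng, rate=0.3):
    """With probability `rate`, nudge k by one and/or redraw the distance."""
    k, dist = config
    if rng.random() < rate:
        k = min(K_RANGE[1], max(K_RANGE[0], k + rng.choice([-1, 1])))
    if rng.random() < rate:
        dist = rng.choice(DISTANCES)
    return (k, dist)


def crossover(a, b, rng):
    """Take each gene (k, distance) from one of the two parents."""
    return (rng.choice([a[0], b[0]]), rng.choice([a[1], b[1]]))


def evolve(fitness, generations=30, pop_size=12, elite=2, seed=0):
    """Return the best configuration found; higher fitness is better."""
    rng = random.Random(seed)
    population = [random_config(rng) for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]  # truncation selection
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(pop_size - elite)
        ]
        population = ranked[:elite] + children  # elitism keeps the best so far
    return max(population, key=fitness)


def toy_fitness(config):
    """Hypothetical stand-in for a clustering quality score."""
    k, dist = config
    return -abs(k - 4) + (0.5 if dist == "euclidean" else 0.0)


best = evolve(toy_fitness)
print(best)
```

Because the elite individuals are copied unchanged into the next generation, the best fitness never decreases from one generation to the next. In a real setting the fitness call dominates the cost, since each evaluation means a full clustering run.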
To improve performance, we implemented a parallel version of the genetic algorithm by developing a REST API and deploying several instances on the Amazon Elastic Compute Cloud (EC2) infrastructure.
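The fitness evaluations within one generation are independent of each other, which is what makes the genetic algorithm easy to parallelise. The sketch below shows the pattern with a local thread pool; this is a stand-in chosen for the example, since the abstract does not describe the REST endpoints. In a distributed setup, `fitness` would be a function that sends the configuration to a remote worker and waits for the score. The names `evaluate_population` and `toy_fitness` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def evaluate_population(population, fitness, max_workers=4):
    """Score all configurations concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        scores = list(pool.map(fitness, population))
    return list(zip(population, scores))


def toy_fitness(config):
    """Hypothetical stand-in for a remote clustering evaluation."""
    k, dist = config
    return -abs(k - 4) + (0.5 if dist == "euclidean" else 0.0)


population = [(2, "euclidean"), (4, "manhattan"), (4, "euclidean")]
for config, score in evaluate_population(population, toy_fitness):
    print(config, score)
```

Threads fit this pattern when each evaluation mostly waits on a network response; the selection, crossover and mutation steps stay sequential on the coordinating node.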
In conclusion, with this research we contributed by building and analysing TunUp, an open solution for the evaluation, validation and tuning of data clustering algorithms, with a particular focus on cloud services.
Our experiments show the quality and efficiency of tuning k-means on a set of public datasets.
The research also provides a roadmap indicating how the current system should be extended and used for future clustering applications, such as tuning existing clustering algorithms, supporting the design of new algorithms, and evaluating and comparing different algorithms.